Guest Post: Under the Hood in Apache Lucene 4.0

by Ostatic Staff - Aug. 22, 2011

The ApacheCon NA 2011 conference is rapidly approaching. The event takes place November 7 to 11 at the Westin Bayshore in Vancouver, Canada. Registration for the event is now open, with discounts available. You can read much more about the conference, and register, here. In conjunction with ApacheCon NA 2011, OStatic is running a series of guest posts from movers and shakers in the Apache community, and the first post in this series appeared here. In today's second post in the series, Simon Willnauer, a core committer to the Apache Lucene search project and PMC Chair of Apache Lucene, discusses what's under the hood in the upcoming release of Apache Lucene 4.0.

Lucene 4.0: The Revolution

By Simon Willnauer

The near-limitless innovative potential of a thriving open source community often has to be tempered by the need for a steady roadmap and version compatibility. As a result, once the decision had been made to break backward compatibility in Lucene 4.0, the floodgates opened on a host of step changes which, together, will deliver a product whose performance is unrecognisable next to the previous 3.x releases.

One of the most significant changes in Lucene 4.0 is the full switch from text strings to UTF-8 encoded bytes for indexing within the search engine library. This change has improved the efficiency of a number of core processes: the ‘term dictionary’, a core part of the index, can now be loaded up to 30 times faster; it uses roughly 10% of the memory it previously needed; and search speeds increase because string conversion is no longer required.
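
To give a flavour of what this looks like in practice, here is a minimal sketch of walking the term dictionary through the 4.0 API, where each term comes back as raw UTF-8 bytes in a BytesRef rather than as a String. The index path and the "body" field name are placeholders, not anything prescribed by Lucene itself.

import java.io.File;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

public class ByteTermsSketch {
    public static void main(String[] args) throws Exception {
        // Open an existing index; the path and field name below are placeholders.
        IndexReader reader = DirectoryReader.open(FSDirectory.open(new File("/path/to/index")));
        Terms terms = MultiFields.getTerms(reader, "body");
        if (terms != null) {
            TermsEnum termsEnum = terms.iterator(null);
            BytesRef term;
            // Terms are surfaced as raw UTF-8 bytes; converting back to a String
            // only happens if the caller explicitly asks for it.
            while ((term = termsEnum.next()) != null) {
                System.out.println(term.utf8ToString());
            }
        }
        reader.close();
    }
}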

This switch to using bytes for indexing has also facilitated one of the main goals for Lucene 4.0, which is ‘flexible indexing’. The data structure for the index format can now be chosen and loaded into Lucene as a pluggable codec. As such, optimised codecs can be loaded to suit the indexing of individual datasets or even individual fields.
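
As a rough illustration of the idea, a codec can be plugged in per index writer and even varied per field. The sketch below assumes a hypothetical "id" field and that the alternative "Memory" postings format is available on the classpath; the defaults are used everywhere else.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.codecs.lucene40.Lucene40Codec;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.util.Version;

public class PerFieldCodecSketch {
    static IndexWriterConfig newConfig() {
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40,
                new StandardAnalyzer(Version.LUCENE_40));
        // Keep the default Lucene40 postings for most fields, but pick a
        // different postings format for a primary-key style "id" field.
        config.setCodec(new Lucene40Codec() {
            @Override
            public PostingsFormat getPostingsFormatForField(String field) {
                return "id".equals(field)
                        ? PostingsFormat.forName("Memory")   // assumption: this format is on the classpath
                        : PostingsFormat.forName("Lucene40");
            }
        });
        return config;
    }
}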

The performance enhancements through flexible indexing are highly case-specific. However, flexible indexing introduces an entirely new dimension to the Lucene project. New indexing codecs can be developed and existing ones updated without the need for hard-coding within Lucene. There is no longer any need for project-level compromise on the best general-purpose index formats and data structures. A new field of specialised codec development can take place independently of the development of the Lucene kernel.

Another significant change comes where multiple threads are used for indexing. In earlier versions of Lucene, once the amount of data stored in memory reached its threshold, data from all of the threads would be merged together and transferred, or flushed, to persistent storage. The result was a highly inefficient stop-start process which did not take full advantage of multi-core processors and I/O resources.

Lucene 4.0 now employs ‘concurrent flushing’. Once an individual thread buffer reaches its memory threshold, it can flush to persistent storage independently, whilst the other threads continue working. This apparently simple architectural change does create the potential for duplicates, since the threads index in isolation, but that problem has been solved using innovative non-blocking algorithms. Initial results of implementing concurrent flushing indicate a 270% increase in Lucene’s indexing performance. (More detailed explanation and analyses are available here and here.)
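
In application code the change is essentially invisible: several threads simply share one IndexWriter, and flushing happens per thread buffer behind the scenes. The sketch below is only illustrative; the index path, field name, RAM budget and thread count are arbitrary placeholders.

import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class ConcurrentIndexingSketch {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40,
                new StandardAnalyzer(Version.LUCENE_40));
        config.setRAMBufferSizeMB(64.0); // overall RAM budget that triggers flushing
        final IndexWriter writer = new IndexWriter(
                FSDirectory.open(new File("/path/to/index")), config);

        // Four threads feed the same IndexWriter; when a buffer needs flushing,
        // one thread writes it out while the others keep indexing.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int t = 0; t < 4; t++) {
            final int id = t;
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        for (int i = 0; i < 10000; i++) {
                            Document doc = new Document();
                            doc.add(new TextField("body", "document " + id + "-" + i, Field.Store.NO));
                            writer.addDocument(doc);
                        }
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        writer.close();
    }
}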

Lucene 4.0 also incorporates dramatic improvements on the search side, most notably in ‘fuzzy matching’, where the engine looks for near matches to a search term. Prior to 4.0, Lucene’s FuzzyQuery used a processor-intensive brute-force approach: it visited every single unique term in the index and computed its edit distance from the search term (the number of insertions, deletions or substitutions needed to turn one into the other), then accepted documents whose terms fell within the edit distance set by the search.
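
For illustration, a fuzzy query of this kind looks roughly as follows in the 4.0 API, where the maximum edit distance is passed directly as an integer; the index path, field name and query term are placeholders.

import java.io.File;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class FuzzyQuerySketch {
    public static void main(String[] args) throws Exception {
        DirectoryReader reader = DirectoryReader.open(FSDirectory.open(new File("/path/to/index")));
        IndexSearcher searcher = new IndexSearcher(reader);

        // Match terms within an edit distance of 2 of "lucene",
        // e.g. "lucine" (one substitution) or "lucen" (one deletion).
        FuzzyQuery query = new FuzzyQuery(new Term("body", "lucene"), 2);
        TopDocs hits = searcher.search(query, 10);
        System.out.println("matching documents: " + hits.totalHits);

        reader.close();
    }
}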

The incorporation of more sophisticated Levenshtein automata now enables Lucene 4.0 to compute the set of terms within a given edit distance of the search term and then match only those against the index. The result of this apparently ‘simple’ change: a mere 20,000% increase in fuzzy matching performance. Levenshtein automata are, however, far from simple, as Mike McCandless outlines in this more detailed writeup.

Building the necessary algorithm from a complex 67-page academic paper into Lucene was an “insanely hard challenge”, eventually accomplished by translating an early open source Python implementation into Java.

“This whole exciting experience is a great example of why open-source development works so well… No single person involved in this really understands all of the parts; it's truly a team effort.”

That quote rings true for the Lucene project as a whole, with specialist expertise from many contributors being pooled into a project of almost unimaginable scope. Anyone reading this can add up the compound performance improvements from the few changes outlined in this short article; yet these changes merely scratch the surface of the full efforts of the Lucene 4.0 team.