Lucene: The Open Source Search Engine

Written by Reuven Lerner - Apr. 11, 2008

If you want to search for a piece of text on the Web, you probably turn to Google or a similar search engine. But how can you integrate search into your Web site? You could buy a search program, or even an appliance that handles search for you. And indeed, there's nothing wrong with that. But if you just want to include search on a small, personal site, or one that doesn't get any revenue, then you probably won't be able to afford anything commercial.

This leaves you with two basic options: You can build your own search engine, or you can look for an open source alternative. If you have ever tried to build your own search engine, then you know that this is a difficult path to take, and much harder than you might expect.

For example, if someone searches for "eat," they will probably want to find documents that contain "eats," "eating," and even "ate." Getting all of these little details right can mean the difference between a useful search engine and a disappointing one.
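
To make the problem concrete, here is a minimal sketch in Python of the naive approach: strip a few common suffixes, and fall back on a lookup table for irregular forms. The function name, the suffix list, and the table are all made up for illustration; this is not how Lucene does it.

```python
# Toy normalizer: NOT Lucene's analyzer, just an illustration of the problem.

# Irregular forms cannot be derived by rule, so they need a lookup table.
IRREGULAR = {"ate": "eat", "eaten": "eat"}

# Suffixes are tried in this order, longest first.
SUFFIXES = ("ing", "es", "s")


def naive_stem(word):
    """Reduce a word to a rough stem using a lookup table and suffix stripping."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix in SUFFIXES:
        # Keep at least three letters so short words such as "is" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ("eat", "eats", "eating", "ate"):
    print(w, "->", naive_stem(w))
```

All four words come out as "eat," which looks promising. But the same rules turn "making" into "mak" while leaving "make" alone, so those two would never match. Every fix of this kind tends to need another special case.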

One increasingly popular alternative is Lucene, a library written in Java that allows you to index documents and then search through them. The indexer takes documents as input, and produces a database indicating which words appear in each document (and where). The search system then takes words as input, and uses the database to identify which documents are the most appropriate match.
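
The index-then-search idea described above can be sketched in a few lines of Python. This is only a toy under stated assumptions: the function names, the sample documents, and the scoring rule (count how often the query words occur) are invented for illustration, and Lucene's real index format and ranking are more sophisticated than this.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each word to {doc_id: [positions where the word occurs]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, word in enumerate(tokenize(text)):
            index[word].setdefault(doc_id, []).append(position)
    return index


def search(index, query):
    """Return doc ids ranked by total occurrences of the query words."""
    scores = defaultdict(int)
    for word in tokenize(query):
        for doc_id, positions in index.get(word, {}).items():
            scores[doc_id] += len(positions)
    # Highest score first; ties broken by doc id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical sample documents, keyed by file name.
docs = {
    "a.txt": "Lucene is a search library",
    "b.txt": "Search engines index documents, then search the index",
    "c.txt": "Perl, Python, and Ruby are scripting languages",
}
index = build_index(docs)
print(index["search"])
print(search(index, "search index"))
```

The indexing work happens once, up front. A query then only has to look up its words in the index, rather than rereading every document.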

Lucene's indexer is considered to be particularly good, in part because it handles the sort of detail that I mentioned above. It can handle many different types of document, once they have gone through a parsing process; the Lucene project itself lists a number of pre-parsers that can be used in conjunction with Lucene to do this. Furthermore, Lucene is smart enough to handle word stems and conjugations, so "eat" and "eating" could match.

There are also a number of add-ons and extensions to Lucene. On its own, Lucene lets you index documents that are already on your filesystem. Nutch, a Lucene sub-project, combines Lucene's indexer with a Web crawler, making it possible to create your own Web search engine. Another Lucene sub-project, Solr, adds a number of features such as highlighted hits, caching, and a Web interface.

And while Java is certainly a popular language, many people -- including myself -- prefer to work with other, higher-level "scripting" languages such as Perl, Python, and Ruby. Sure enough, Lucene has been ported to these languages, as well as to other languages such as C and C++. This means that no matter what language you want to use, you can probably find a version of Lucene for it.

Lucene is fairly easy to install, and is designed to be easy to customize and use as well. You can get updates at the Lucene Web site, or keep up with the Lucene blog.

Have you used Lucene, or a similar solution? 


Comments

  1. By asanka on Apr. 11, 2008

    The OStatic search is built on the Lucene indexer with Solr. We are using the facet capabilities here to make it easier for our members to find the essential information they seek.

    Solr is very powerful and quite easy to work with. The documentation is very good as well. Many thanks to the people who build this essential OSS technology.

  2. By lnxrks on Apr. 11, 2008

    For a definitive guide on Lucene, check out Otis Gospodnetic and Erik Hatcher's "Lucene in Action". http://www.amazon.com/Lucene-Action-Otis-Gospodnetic/dp/1932394281

    It provides a few case studies in there too.

    No, I do not have my id included in the link, and will not make any commission on any sale!
