If you want to search for a piece of text on the Web, you probably turn to Google or a similar search engine. But how can you integrate search into your Web site? You could buy a search program, or even an appliance that handles search for you. And indeed, there's nothing wrong with that. But if you just want to include search on a small, personal site, or one that doesn't get any revenue, then you probably won't be able to afford anything commercial.
This leaves you with two basic options: You can build your own search engine, or hope to find an open source alternative. If you have ever tried to build your own search engine, then you know that this is a difficult path to take, much more difficult than you might expect.
For example, if someone searches for "eat," they will probably want to find documents that contain "eats," "eating," and even "ate." Getting all of these little details right can mean the difference between a useful search engine and a disappointing one.
One increasingly popular alternative is Lucene, a library written in Java that allows you to index documents and then search through them. The indexer takes documents as input, and produces a database indicating which words appear in each document (and where). The search system then takes words as input, and uses the database to identify which documents are the most appropriate match.
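The indexer-and-searcher split described above can be sketched in a few lines of Python. This is a minimal illustration of the general idea, not Lucene's API or its ranking method: the function names tokenize, build_index, and search are hypothetical, and scoring by raw word counts is an assumption made to keep the example short.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to {doc_id: [positions where the word appears]}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, word in enumerate(tokenize(text)):
            index[word].setdefault(doc_id, []).append(position)
    return index


def search(index, query):
    """Return doc ids ranked by how often the query words occur in them."""
    scores = defaultdict(int)
    for word in tokenize(query):
        for doc_id, positions in index.get(word, {}).items():
            scores[doc_id] += len(positions)
    # Highest score first; ties are broken by doc id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

For example, indexing {"a.txt": "the cat eats fish", "b.txt": "fish eat fish"} and searching for "fish" returns b.txt ahead of a.txt, because the word appears twice in b.txt. Note that a search for "eat" would find only b.txt here, since this sketch matches exact words and does nothing about "eats."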
Lucene's indexer is considered to be particularly good, in part because it handles the sort of detail that I mentioned above. It can handle many different types of document, once they have gone through a parsing process; the Lucene project itself lists a number of pre-parsers that can be used in conjunction with Lucene to do this. Furthermore, Lucene is smart enough to handle word stems and conjugations, so "eat" and "eating" could match.
There are also a number of add-ons and extensions to Lucene. For example, Lucene by itself lets you index documents that are already on your filesystem, but Nutch, a Lucene sub-project, combines Lucene's indexer with a Web crawler, making it possible to create your own Web search engine. Another Lucene sub-project, called Solr, includes a number of features such as highlighted hits, caching, and a Web interface.
And while Java is certainly a popular language, many people -- including myself -- prefer to work with other, higher-level "scripting" languages such as Perl, Python, and Ruby. Sure enough, Lucene has been ported to these languages, as well as to other languages such as C and C++. This means that no matter what language you want to use, you can probably find a version of Lucene for it.
Lucene is fairly easy to install, and is designed to be easy to customize and use, as well. You can get updates at the Lucene Web site, or keep up with the Lucene blog.
Have you used Lucene, or a similar solution?