Guest Post: Lucid Imagination's Co-Founder On the State of Search

by Ostatic Staff - Aug. 18, 2011

The ApacheCon NA 2011 conference is rapidly approaching. The event takes place November 7 to 11 at the Westin Bayshore in Vancouver, Canada. Registration for the event is now open with 25 percent discounts available. You can read much more about the conference, and register, here.

In conjunction with ApacheCon NA 2011, OStatic is running a series of guest posts from movers and shakers in the Apache community. In this initial guest post, Grant Ingersoll, Chief Scientist and co-founder of Lucid Imagination, weighs in. Lucid Imagination provides open source enterprise search tools built with Apache Solr and Apache Lucene, along with website search tools. Here, Ingersoll tackles the question of who is smarter--you or a search engine.

Are You Smarter Than a Search Engine?

By Grant Ingersoll, Chief Scientist, Co-Founder, Lucid Imagination

For years, search and Natural Language Processing (NLP) researchers have been promising tools that can understand a user’s intent and sift through large volumes of data to return very targeted results. Sadly, the hype has generally outpaced the hope for intelligent search systems, leaving most people simply trying to stay afloat in a sea of data. The times, however, as Mr. Dylan says, are a-changin’. Search engines are significantly smarter today, and real, practical systems are in production that go well beyond simple keyword search and link authority calculations (i.e., Google’s PageRank algorithm) to offer much deeper insights into what users are looking for.

Most web search engines incorporate massive amounts of information beyond the crawled page and extracted links to produce results. User clicks, personal preferences, past queries and much more all factor into rankings. In more focused applications, such as IBM’s Watson Question Answering system (which uses a search engine), multiple relevance strategies and deep analysis of both the user’s question (or the Jeopardy answer, as it were) and the content sources play an important role in finding correct answers. Watson’s victory is partly attributable to a more sophisticated betting strategy than most human competitors employ, but there is little doubt that the underlying search and NLP technology brings us closer to fulfilling that early promise.

Perhaps even more interesting than the improvements themselves is how we obtained them. The innovation is due, in part, to at least two key components: a massive spike in the volume of data available for analysis by machine learning algorithms, and our ability to quickly harness shared knowledge and technology, both software and hardware, in order to process that data.

In the past, large volumes of data made many problems intractable (and they still do in many cases). Today, however, we often want more data so we can run it through machine learning tools, in hopes of accounting for as many user interactions as possible in order to accurately predict future ones. Previously, search engines seemed to care only about the text ingested. Now, we care not only about the content, but also about how users interact with the results, both individually and in the aggregate, and across time. This social metadata helps bring the focus of search back to the user and their information need, instead of solely to the raw data and the parsing of the language.
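To make the idea concrete, here is a toy sketch in Java of folding aggregate click feedback into a text-relevance score. The class, the in-memory click log, and the log-damped boost are all assumptions for illustration; no particular engine's formula is implied.

    import java.util.HashMap;
    import java.util.Map;

    // Toy illustration only: blend a text-relevance score with an
    // aggregate click-through signal. The damping constant is arbitrary.
    public class ClickBoostSketch {
        // docId -> how often users clicked this result in the past
        private final Map<String, Integer> clicks = new HashMap<String, Integer>();

        public void recordClick(String docId) {
            Integer c = clicks.get(docId);
            clicks.put(docId, c == null ? 1 : c + 1);
        }

        // Final score: the text score gently boosted by click history, so
        // aggregate user behavior nudges ranking without dominating it.
        public double score(String docId, double textScore) {
            Integer c = clicks.get(docId);
            int clickCount = (c == null) ? 0 : c;
            return textScore * (1.0 + 0.1 * Math.log1p(clickCount));
        }
    }

The damped logarithm keeps a heavily clicked document from drowning out textual relevance entirely; real systems weigh many more signals, and weigh them per user as well as in the aggregate.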

On the development side, the democratization of knowledge and technology, via both open source implementations of key building blocks and easily accessible cluster computing capabilities like Amazon’s EC2, has made it much easier to build applications that process massive amounts of data at large scale and for fairly low cost, freeing us up to focus on higher-order items like deep analytic capabilities that can produce meaningful results. For example, in the search field, Apache Lucene has long implemented the key ranking algorithms used to quickly and efficiently find information in unstructured content, such that most people can fairly easily incorporate high quality search into their applications.
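As a minimal sketch of how little code that incorporation takes, the following indexes one document and queries it with the Lucene 3.x API current as of this writing; exact class names and signatures vary across releases, and the field name and sample text are invented for the example.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class LuceneSketch {
        public static void main(String[] args) throws Exception {
            Directory dir = new RAMDirectory();
            StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_33);

            // Index a single document with one analyzed text field.
            IndexWriter writer = new IndexWriter(dir,
                    new IndexWriterConfig(Version.LUCENE_33, analyzer));
            Document doc = new Document();
            doc.add(new Field("body",
                    "Apache Lucene brings fast full-text search to Java applications.",
                    Field.Store.YES, Field.Index.ANALYZED));
            writer.addDocument(doc);
            writer.close();

            // Parse a user query and print matching documents with scores.
            IndexSearcher searcher = new IndexSearcher(IndexReader.open(dir));
            Query q = new QueryParser(Version.LUCENE_33, "body", analyzer)
                    .parse("full-text search");
            for (ScoreDoc hit : searcher.search(q, 10).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("body")
                        + "  (score=" + hit.score + ")");
            }
            searcher.close();
        }
    }

The point is that the hard parts, tokenization, inverted indexing, and relevance ranking, are already implemented; the application supplies only documents and queries.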

Likewise, Apache projects such as Hadoop, Pig, Hive, and Mahout are bringing large-scale data analysis capabilities to the masses, such that one can quickly slice and dice data, learn from it, and then feed the results back into the application to enhance the user experience; a small example of such a job follows below. More important than the code itself, these projects also have robust communities willing to share knowledge, so one can get up to speed quickly without a steep learning curve or large upfront costs.
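As one hedged illustration of that kind of slice-and-dice job, the sketch below counts term frequencies in a query log with Hadoop's MapReduce API, the classic word-count pattern; the class names (QueryTermCount and so on) and the input/output paths are invented for the example.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class QueryTermCount {
        // Map: emit (term, 1) for every token in every logged query line.
        public static class TokenMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text term = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    term.set(itr.nextToken().toLowerCase());
                    context.write(term, ONE);
                }
            }
        }

        // Reduce: sum the counts for each term across the whole log.
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values,
                    Context context) throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "query term count");
            job.setJarByClass(QueryTermCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setCombinerClass(SumReducer.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The resulting term counts could then feed back into the search application, for example as the click or query-popularity signals described earlier, which is exactly the learn-and-feed-back loop these projects make cheap to run.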

As for the title question, will search engines ever be smarter than us? Perhaps. More importantly, though, in the short term they will be more focused on producing highly relevant results in real time and in ways that complement our strengths.

Apache search software --including Lucene, Hadoop, Pig, Mahout, and more-- is released under the Apache License v2.0 and is overseen by self-selected teams of active contributors to each project. A Project Management Committee (PMC) guides each project’s day-to-day operations, including community development and product releases. Apache source code, documentation, and related resources are available at http://www.apache.org/.

Apache technologies power half the Internet, terabytes of data, teraflops of operations, and billions of objects, and enhance the lives of countless users and developers. Established in 1999 to shepherd, develop, and incubate Open Source innovations "The Apache Way," The Apache Software Foundation oversees 150+ projects led by a volunteer community of over 350 individual Members and 3,000 Committers across six continents. Learn more about Apache search innovations at ApacheCon, the official conference, trainings, and expo of The Apache Software Foundation. For more information, visit http://apachecon.com/.

****************************************
About the Author:

Grant Ingersoll is the Chief Scientist and a co-founder of Lucid Imagination, where he helps set strategy for next-generation search and machine learning applications. Grant has also been an active member of the Apache Lucene community: a Lucene and Solr committer, co-founder of the Apache Mahout machine learning project, former chairman of the Lucene Project Management Committee (PMC), and a former Vice President at the Apache Software Foundation. Grant’s prior experience includes work in natural language processing and information retrieval at the Center for Natural Language Processing at Syracuse University. Grant earned his B.S. in Math and Computer Science from Amherst College and his M.S. in Computer Science from Syracuse University, NY.