By Raj Bala
Many people, even technology folks, really don't know what to make of the Semantic Web, and several camps disagree on the meaning of various Semantic Web terms. Now that the general concept is finally getting some traction, there is even a push to move away from the current moniker. Some people want to call it the "Linked Data Web," because that's supposedly a better description of the technologies and components surrounding the Semantic Web. We'll continue to refer to it by the name most people know. Meanwhile, the open source community is pitching in through the Jena project.
Here's a quick summation of the Semantic Web goal: Data that is better described than other data will serve its users better. When hyperlinks are marked up with more descriptive HTML, machine agents and web crawlers can understand these descriptions and do a better job of giving users what they need.
The idea is that these descriptions are either broadly understood already, or extensions of what is broadly understood. In either case, when descriptions are published appropriately, machine agents are better informed, and thus so are their users. That's the sort of low-tech but useful definition of the Semantic Web that XHTML and microformats folks favor.
Then there's the camp unofficially led by Sir Tim Berners-Lee. This camp takes a more formal approach to the Semantic Web, and introduces a correspondingly elevated level of complication. They think there's a better way to manage lots of marked-up data near the storage level. Laying data elements out based on their relationships to each other allows applications to infer complex relationships that could be valuable to their users.
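To make "inferring relationships" a little more concrete, here is a minimal sketch in plain Java. It is not Jena's inference machinery; the class name, the single "located in" relationship, and the place names are all assumptions made up for illustration. Two facts are stated explicitly, and a third (Bristol is in the United Kingdom) falls out of following the links.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only: one relationship, stored as stated facts.
class LocationInference {
    // Each entry reads: key "is located in" value.
    private final Map<String, String> locatedIn = new HashMap<>();

    void state(String place, String container) {
        locatedIn.put(place, container);
    }

    // Follow the stated links to derive every container of a place,
    // including ones that were never stated directly.
    List<String> inferContainers(String place) {
        List<String> result = new ArrayList<>();
        Set<String> seen = new HashSet<>(); // guards against cyclic data
        String current = locatedIn.get(place);
        while (current != null && seen.add(current)) {
            result.add(current);
            current = locatedIn.get(current);
        }
        return result;
    }

    public static void main(String[] args) {
        LocationInference facts = new LocationInference();
        facts.state("Bristol", "England");
        facts.state("England", "United Kingdom");
        // "Bristol is in the United Kingdom" was never stated, but it is derived.
        System.out.println(facts.inferContainers("Bristol"));
    }
}
```

A real Semantic Web stack would express rules like this declaratively rather than hard-coding them, but the payoff is the same: the application learns things nobody typed in.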
There's not much to the low-tech approach described first. Mark up a web application's rendered HTML view with additional data that's broadly known, and you're finished.
As far as the Tim Berners-Lee camp goes, there are some great open source components that developers can use to build Semantic Web applications under the formal approach. Leading the charge there is Jena, an open source Semantic Web framework written in Java. It's a project born out of HP Labs in the U.K., with strong support from both the lab and the community of developers at large.
Jena is essentially an API that provides developers a mechanism to insert and query RDF-encoded triples. The resultant data store is, creatively, called a "triple store." The word triple comes from the three parts of each stored statement: a subject, a predicate, and an object.
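For readers who haven't met a triple store before, the following is a toy, in-memory version in plain Java. It deliberately does not use Jena's API; `TinyTripleStore`, its methods, and the `ex:` names are hypothetical and exist only to show the shape of the data (subject, predicate, object) and the simplest kind of lookup, where `null` stands in for "anything."

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy store, not Jena: just enough to show what a triple is.
class TinyTripleStore {
    // One statement: subject --predicate--> object.
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }

        @Override
        public String toString() {
            return subject + " " + predicate + " " + object;
        }
    }

    private final List<Triple> triples = new ArrayList<>();

    void add(String subject, String predicate, String object) {
        triples.add(new Triple(subject, predicate, object));
    }

    // Return every triple matching the pattern; a null argument matches anything.
    List<Triple> match(String subject, String predicate, String object) {
        List<Triple> hits = new ArrayList<>();
        for (Triple t : triples) {
            boolean ok = (subject == null || subject.equals(t.subject))
                    && (predicate == null || predicate.equals(t.predicate))
                    && (object == null || object.equals(t.object));
            if (ok) {
                hits.add(t);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TinyTripleStore store = new TinyTripleStore();
        store.add("ex:Jena", "ex:writtenIn", "ex:Java");
        store.add("ex:Jena", "ex:bornAt", "ex:HPLabs");
        // Everything the store knows about ex:Jena.
        System.out.println(store.match("ex:Jena", null, null));
    }
}
```

A linear scan like this obviously won't scale; real stores index the three positions so that patterns can be answered without touching every triple.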
SPARQL, the W3C query language for RDF, is implemented by Jena's ARQ module, which provides query access to RDF-encoded triples. ARQ provides functionality equivalent to the SPARQL W3C recommendation, with a few extensions for added functionality.
Like most open source SPARQL implementations that persist to a database, Jena's has limitations. When triples are stored in a relational database, SPARQL is essentially a query interface built on top of SQL, so it's only ever as fast as the underlying database subsystem. With disk I/O still one of the largest application constraints, SPARQL queries can be very slow, especially over massive amounts of data.
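A rough way to see where the cost comes from is to translate a couple of triple patterns into SQL by hand. The sketch below assumes a single hypothetical table, `triples(subject, predicate, object)`, which is not Jena's actual schema, and the `PatternToSql` class is invented for this article. Terms beginning with `?` are treated as variables; a variable shared between two patterns becomes a join condition.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical translator over an assumed table triples(subject, predicate, object).
class PatternToSql {
    static String toSql(String[][] patterns) {
        String[] columns = {"subject", "predicate", "object"};
        List<String> from = new ArrayList<>();
        List<String> where = new ArrayList<>();
        // Maps each variable to the column where it first appeared.
        Map<String, String> firstSeen = new HashMap<>();

        for (int i = 0; i < patterns.length; i++) {
            String alias = "t" + i; // every pattern gets its own copy of the table
            from.add("triples " + alias);
            for (int c = 0; c < 3; c++) {
                String term = patterns[i][c];
                String column = alias + "." + columns[c];
                if (!term.startsWith("?")) {
                    // Constant: inlined for readability only; real code
                    // should bind values through a PreparedStatement.
                    where.add(column + " = '" + term.replace("'", "''") + "'");
                } else if (firstSeen.containsKey(term)) {
                    // Variable seen before: this is the self-join condition.
                    where.add(column + " = " + firstSeen.get(term));
                } else {
                    firstSeen.put(term, column);
                }
            }
        }
        String sql = "SELECT * FROM " + String.join(", ", from);
        if (!where.isEmpty()) {
            sql += " WHERE " + String.join(" AND ", where);
        }
        return sql;
    }

    public static void main(String[] args) {
        String[][] patterns = {
            {"?person", "worksFor", "?org"},
            {"?org", "locatedIn", "London"}
        };
        System.out.println(toSql(patterns));
    }
}
```

Under these assumptions, every additional pattern in a query adds another self-join against the same large table, which is one plausible reason such queries lean so heavily on the database and its disk.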
Jena also provides a web service implementation of SPARQL called Joseki, so that remote applications can query as if they were running locally. The idea is that application developers can open up their RDF triple stores to third parties by providing them tight integration over HTTP.
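To give a feel for what "querying over HTTP" means in practice, here is a small sketch of how a client might build a request URL. Under the W3C SPARQL Protocol, a query can be sent as a URL-encoded `query` parameter on a GET request. The endpoint address and the `SparqlRequestUrl` class are assumptions for illustration; the actual address depends on how a given server is deployed.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: builds the GET URL for a SPARQL query.
class SparqlRequestUrl {
    static String build(String endpoint, String sparql) {
        // The query text travels in the "query" parameter, URL-encoded.
        return endpoint + "?query=" + URLEncoder.encode(sparql, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // example.org is a placeholder, not a real endpoint.
        String url = build("http://example.org/sparql",
                "SELECT ?s WHERE { ?s ?p ?o } LIMIT 10");
        System.out.println(url);

        // Decoding the parameter recovers the original query text.
        String encoded = url.substring(url.indexOf("?query=") + "?query=".length());
        System.out.println(URLDecoder.decode(encoded, StandardCharsets.UTF_8));
    }
}
```

Because the request is plain HTTP, a third party needs no Java, and no Jena, on their side to ask questions of the store.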
Very few companies have been able to successfully scale a Semantic Web triple store for millions of users. While the Semantic Web has some mathematically sound theoretical foundations, there are still very few (if any) real-world implementations of this stuff scaling well.
If anyone solves the scaling issues, it will likely be the open source community surrounding Jena. That community has emerged as one of the most vibrant sections of the Semantic Web world, thanks to an enthusiastic team of contributing developers. If you're interested in the Semantic Web, definitely investigate Jena.