The semantic web didn't arrive in 2010, but it may have inched closer to reality. We look at the year's successes and setbacks.
In 2001, Tim Berners-Lee introduced the idea of the semantic web: a web of data built on top of the web of documents we know today, one that can be parsed, queried, and operated on by machines as well as by people. Also known as the “global graph” or “Web 3.0,” the semantic web feels tantalizingly close to reality, but it hasn’t quite materialized. Yet 2010 brought acquisitions, new data, and new standards that moved us closer to a working semantic web.
On the web as we know it, you enter a keyword into a search engine, which returns many, many documents matching your keyword. Then the web of documents hands off the work to you and goes to take a nap. It’s up to you to open those documents and read through them, one at a time, until you find the information you need. The web of documents is a total slacker.
Not the web of data; it’s hard-working. And it’s smart, too, or at least smarter than the web of documents. Say you want to go to the movies. A search engine based on the semantic web understands that a movie has showings with start and end times, that tickets cost a certain amount, that some theaters are closer to your house than others, and that movies can sell out. It uses this information to return a list of all the movies near your house that are starting in the next hour, have better than average reviews, and still have tickets available. Of course, all of this work happens behind the scenes, powered by ontologies and semantic web languages of which you are blissfully unaware. You can spend your time doing better things – like taking a nap yourself.
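To make the movie example concrete, here is a minimal sketch of the kind of filtering that becomes possible once showings are available as structured data rather than as text on a page. The Showing record, its field names, and the sample values are all hypothetical; a real semantic search engine would draw these facts from published ontologies and data rather than from a hard-coded list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Showing:
    # Hypothetical record; the field names are illustrative only.
    title: str
    theater_km: float      # distance from the user's home
    starts_at: datetime
    review_score: float    # assumed 0-10 scale
    tickets_left: int


def find_showings(showings, now, max_km=10.0, within=timedelta(hours=1)):
    """Return nearby, better-than-average showings that start soon and are not sold out."""
    if not showings:
        return []
    average = sum(s.review_score for s in showings) / len(showings)
    return [
        s for s in showings
        if s.theater_km <= max_km
        and now <= s.starts_at <= now + within
        and s.review_score > average
        and s.tickets_left > 0
    ]


now = datetime(2010, 12, 31, 19, 0)
showings = [
    Showing("Movie A", 3.2, datetime(2010, 12, 31, 19, 30), 8.1, 12),
    Showing("Movie B", 4.0, datetime(2010, 12, 31, 19, 45), 5.0, 40),
    Showing("Movie C", 2.5, datetime(2010, 12, 31, 19, 15), 9.0, 0),
    Showing("Movie D", 25.0, datetime(2010, 12, 31, 19, 20), 8.7, 5),
]
result = [s.title for s in find_showings(showings, now)]
print(result)
```

Only the first showing passes every test: the second is reviewed below average, the third is sold out, and the fourth is too far away.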
Smart integration of information behind the scenes is one promise of the semantic web, and 2010 saw real movement in that direction. Several well-known, high-buzz semantic applications were bought up and integrated into more established enterprises. This continued a trend that began in 2008, when Microsoft bought semantic search startup Powerset and integrated it into Microsoft’s Bing search engine. In 2010, Google acquired Metaweb, the company behind Freebase, and relaunched Metaweb’s Gridworks data-wrangling tool as Google Refine. Flipboard bought Ellerdale with the intention of using it to better discover and aggregate content based on user profiles. The semantic startup Evri acquired the semantic startup Twine, incorporating Radar Networks’ technology into its infrastructure. (Interestingly, Evri repositioned itself as a mobile app. Under the covers, it’s still a semantic content discovery solution, but for whatever reason Evri decided that that’s not its main selling point anymore.)
Most notably, Apple bought Siri for an estimated $200 million. Siri is a semantically powered personal assistant. It uses natural language processing and artificial intelligence (AI) techniques to access Web services that answer your questions or accomplish your tasks. In deciding what information or services to provide, Siri takes into account things like your location, personal preferences, and the quality of information received from various providers. It can understand concepts like “a fancy restaurant” or “near my office.” Siri is still a stand-alone application available in Apple’s app store; it’ll be interesting to see if it remains that way or is integrated more tightly into iOS.
Critics of the semantic web have long pointed to its bootstrapping problem: you can’t have a web of data without, well, data. Structured data, and lots of it. But you can’t generate the masses of data needed without tools to mark it up with. One of the bright spots in 2010 was the growth of linked data, and especially linked open data. Linked data is a semantic web paradigm that aims to publish data as RDF triples: simple subject-predicate-object statements in which things are identified by URIs, so that one dataset can link to another. It arguably trades off some of the loftier and more cohesive goals of the overall semantic web, such as reasoning over datasets, for the more immediate and lightweight goal of putting accessible, structured data into use. There’s a big push from the semantic web community to make linked data open data as well, meaning that the data comes licensed for reuse.
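For readers who haven’t met triples before, the sketch below models a handful of them as plain Python tuples and answers a question by following a link from one resource to another. It is only an illustration of the data model: the example.org URIs, the predicate names, and the match helper are all made up, and real linked data would be published in an RDF serialization and queried with a proper RDF toolkit.

```python
# Each fact is a (subject, predicate, object) triple.
# The example.org URIs and predicate names are hypothetical.
EX = "http://example.org/"

triples = {
    (EX + "theater/1", EX + "name", "Main Street Cinema"),
    (EX + "theater/1", EX + "shows", EX + "movie/42"),
    (EX + "movie/42", EX + "title", "An Example Film"),
    (EX + "movie/42", EX + "rating", "8.1"),
}


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Follow the "shows" link from the theater to a movie, then read the movie's title.
movie_uris = [o for _, _, o in match(triples, s=EX + "theater/1", p=EX + "shows")]
titles = [o for m in movie_uris for _, _, o in match(triples, s=m, p=EX + "title")]
print(titles)
```

Because the movie is named by a URI rather than by a string, a second publisher could add its own triples about the same movie, and the same lookup would pick them up.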
Both the United States and United Kingdom governments made a big push this year to publish structured data, with many other national, state and local governments following suit. As of September 2010, the W3C estimated the size of the collective linked open dataset to be over 25 billion RDF triples, interlinked by 395 million RDF links.
2010 also saw a sharp increase in the number of enterprise websites publishing data on the semantic web. Time, The New York Times, and Overstock.com are among those marking up their HTML semantically. Best Buy added RDFa tags to its store pages and reported a 30% increase in traffic to them. In light of this, look for SEO strategies to evolve in 2011 to encompass structured data. In the content management system (CMS) world, Drupal 7 was all but officially released in 2010 (the official release came in early 2011), making Drupal the first CMS with native support for semantic web markup, and (hopefully) putting pressure on other CMS vendors to do the same.
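RDFa works by adding attributes such as typeof and property to ordinary HTML elements, so the same page serves both people and machines. The sketch below shows roughly what a consumer sees, using only Python’s standard library. The markup, the ex: prefix, and the PropertyCollector class are invented for illustration; this is not Best Buy’s actual markup, and the code is far from a conforming RDFa processor (it ignores prefixes, nesting, and most of the specification).

```python
from html.parser import HTMLParser

# Hypothetical product markup in the style of RDFa; the ex: vocabulary is made up.
PAGE = """
<div xmlns:ex="http://example.org/vocab#" typeof="ex:Product">
  <span property="ex:name">Example Camera</span>
  <span property="ex:price">199.00</span>
</div>
"""


class PropertyCollector(HTMLParser):
    """Collect the text content of elements that carry a `property` attribute."""

    def __init__(self):
        super().__init__()
        self.found = {}
        self._current = None

    def handle_starttag(self, tag, attrs):
        prop = dict(attrs).get("property")
        if prop:
            self._current = prop
            self.found[prop] = ""

    def handle_data(self, data):
        # Only keep text that appears inside a property-bearing element.
        if self._current:
            self.found[self._current] += data

    def handle_endtag(self, tag):
        self._current = None


collector = PropertyCollector()
collector.feed(PAGE)
print(collector.found)
```

A browser renders the two spans as plain text; a crawler that understands the attributes gets a product with a name and a price.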
In the face of all of this exciting change and growth, it feels almost too nerdy to cheerlead some recent RFCs and specifications. But we all know that the Web runs on standards, and we need the infrastructure that makes the cool stuff possible. If the semantic web is becoming less visible to end users, the changes I describe below are deep in the plumbing. None of these will ever be seen by Web users, but they enable developers to create a better semantic web experience for them.
On the standards side, the W3C released the Rule Interchange Format (RIF). RIF may not be the splashiest member of the semantic web platform, but it’s the key to aggregating data that’s been marked up differently at different sites. It also could become the basis for more sophisticated operations, such as allowing automated agents to negotiate across organizational boundaries or reason across websites. The W3C initially evaluated over 50 use cases while developing the standard; the next year or two should tell us whether or not it becomes widely adopted.
Jena, a Java-based semantic web framework originally developed at HP, has been adopted by the Apache incubator. Ideally, this will expose semantic web technologies to a wider audience of Java developers while allowing Jena to grow and thrive as an open-source tool.
Last but not least, the Internet Engineering Task Force (IETF) is proposing a standard to create typed relationships between resources on the web. While this isn’t strictly a semantic web initiative – it’s outside the scope of the W3C and it uses a registry rather than publishing the types in an ontology – it accomplishes the goal of using markup to add structure to web pages.
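As a rough illustration of typed links, the sketch below assumes the proposal in question is the IETF’s Web Linking work (published as RFC 5988), which lets a server describe relationships in an HTTP Link header. The header value, the example.org URLs, and the parse_link_header helper are made up for this example, and the regular expressions handle only simple cases, not the full header grammar.

```python
import re

# Matches one `<target>; param; param` link value. Simplified: it assumes
# targets contain no ">" and parameter values contain no commas or semicolons.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[^,;]+)*)')
PARAM_RE = re.compile(r';\s*([\w-]+)\s*=\s*"?([^";]*)"?')


def parse_link_header(value):
    """Return a list of (target, relation_types) pairs from a Link header value."""
    links = []
    for target, params in LINK_RE.findall(value):
        rels = []
        for name, val in PARAM_RE.findall(params):
            if name.lower() == "rel":
                # A single rel parameter may list several space-separated types.
                rels.extend(val.split())
        links.append((target, rels))
    return links


header = (
    '<http://example.org/page/2>; rel="next", '
    '<http://example.org/license>; rel="license copyright"'
)
links = parse_link_header(header)
print(links)
```

Each relation type is a short registered name rather than a term from an ontology, which is the registry-based design choice the paragraph above describes.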
Setbacks for 2010? I only counted a few. In October, Yahoo! closed down its SearchMonkey project. SearchMonkey was a framework for developers to add structured metadata to search results, similar to Google’s rich snippets. This probably says more about Yahoo!’s internal approach to search than anything larger about semantic search, but it’s sad to see it go. Facebook’s OpenGraph initiative is a mixed bag in my book. It’s a big step forward for structured data on the web, but it also has the real possibility of locking up large chunks of the social graph it creates.
This tension between private and public models and data will continue to play out this year, in debate over OpenGraph and whether we’re truly seeing the rise of open linked data or linkable, privately owned data.
2011 is off to a roaring start, with the official release of Drupal 7. What else will the year bring? I think we’ll continue to see the same steady movement towards a working semantic web: snowballing RDFa adoption, ever more linked data, and, if we’re lucky, another smart, semantically powered app or two along the lines of Siri to make use of it in new and exciting ways. As the semantic web solves its data problem, its builders and users will move on to the next level of problems. Questions like “Where did my data come from?” “How accurate is it?” and “How much do I trust its provider?” will be taken up in the next few years.