Google instead prefers a “surfacing” approach which, put simply, is making a local copy of the deep web on Google’s cluster.
Not only does this give Google the performance and scalability necessary to use the data in its web search, it also lets them easily compare the data with other data sources and transform it (e.g. eliminating inconsistencies and duplicates, determining the reliability of a data source, simplifying the schema or remapping the data to an alternative schema, reindexing the data to support faster queries for their application, etc.).
Google’s move away from federated search is particularly intriguing given that Udi Manber, former CEO of A9, is now at Google and leading Google’s search team. A9, started and built by Udi with substantial funding from Amazon.com, was a federated web search engine. It supported queries out to multiple search engines using the OpenSearch API format they invented and promoted. A9 had not yet solved the hard problems with federated search — they made no effort to route queries to the most relevant data sources or do any sophisticated merging of results — but A9 was a real attempt to do large-scale federated web search.
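To make the pattern concrete, here is a minimal sketch of what A9-style federated search over OpenSearch endpoints looks like. The URL template and the pre-fetched result lists are hypothetical; a real client would substitute the query into each engine's OpenSearch URL template and parse the returned RSS/Atom feed. The merging shown is the naive round-robin interleaving described above — no routing to relevant sources, no score-based ranking.

```python
from itertools import zip_longest

def build_query_url(template: str, query: str) -> str:
    """Fill an OpenSearch URL template with the user's search terms."""
    return template.replace("{searchTerms}", query)

def naive_merge(result_lists):
    """Round-robin interleave of per-engine result lists (no real ranking)."""
    merged = []
    for group in zip_longest(*result_lists):
        merged.extend(item for item in group if item is not None)
    return merged

# Hypothetical OpenSearch template and pre-fetched results from two engines.
template = "https://example.com/search?q={searchTerms}"
engine_a = ["a1", "a2", "a3"]
engine_b = ["b1", "b2"]

print(build_query_url(template, "heater"))       # → https://example.com/search?q=heater
print(naive_merge([engine_a, engine_b]))          # → ['a1', 'b1', 'a2', 'b2', 'a3']
```

The interleaving step is exactly where the hard problems live: deciding which sources to query at all, and how to rank results whose relevance scores come from incomparable systems.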
If Google is abandoning federated search, it may also have implications for APIs and mashups in general. After all, many of the reasons the Google authors give for preferring copying the data over accessing it in real time apply to all APIs, not just OpenSearch APIs and search forms. The lack of uptime and performance guarantees, in particular, is a serious problem for any large-scale effort to build a real application on top of APIs.
Google has put its energies into Google Co-op, which allows users to create their own sub-Google search engines using the Google database as the data source. This has the effect of encouraging traditionally deep web databases, like museum collection databases, to become spiderable, indexed and cached by Google. For individual end users this makes sense – they probably already go to Google first – but does it make sense for content providers?
Try this example.
Here is a search for ‘heater’ using the Powerhouse’s own collection search.
Top five –
B1431 Solar heater, plus base, wood/metal, Lawrence Hargrave, Australia, [1870-1915]
K693 Immersion water heater, electric, made in Australia, late 1930s (OF).
93/176/15 Light globe, heater lamp, glass/metal, British Thompson Houston, England, 1920
93/176/16 Light globe, heater lamp, glass/metal, Osram, England, 1950
85/69 Brochure, Instruction and Operating Chart for Emmco Fryside heater
Here is the same search for ‘heater’ using a Google Co-op search I created over the same data within the same collection.
Top five –
86/676 Gas heater – Malley’s No. 1, copper, Metters, Australia …
97/331/1 Convection heater, domestic, portable gas, metal/paint …
H7061 Water heater, “The Schwer”, constructed of copper & can be …
B1538 Water heater model, steam, “Friar”, [Australia or UK]; A A …
95/117/1 Kerosene water heater and instruction sheet, Challenger …
So which is more accurate?
Google’s Co-op bases its results on a number of different factors, all of which are unknown to the searcher, and most of which are unknown to the content provider. At least with our internal search we can tweak the ordering and relevance of results using our own known variables.
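That kind of tweaking can be sketched simply. The field names, weights, and records below are hypothetical (this is not the Powerhouse's actual ranking code), but they illustrate the point: because we know our own schema, we can decide, say, that a match in an object's title should outrank a match buried in its description.

```python
# Hypothetical field weights: a title match counts three times as much
# as a description match. With an in-house search these are ours to tune;
# with Google Co-op the equivalent factors are opaque.
FIELD_WEIGHTS = {"title": 3.0, "description": 1.0}

def score(record: dict, query: str) -> float:
    """Sum the weights of every field that contains the query term."""
    q = query.lower()
    return sum(weight for field, weight in FIELD_WEIGHTS.items()
               if q in record.get(field, "").lower())

# Two toy collection records, loosely modelled on the examples above.
records = [
    {"title": "Gas heater", "description": "copper"},
    {"title": "Light globe", "description": "heater lamp"},
]

ranked = sorted(records, key=lambda r: score(r, "heater"), reverse=True)
print([r["title"] for r in ranked])  # → ['Gas heater', 'Light globe']
```

Swapping the weights, or adding fields like maker or date, reorders the results immediately – exactly the control a content provider gives up when the ranking happens inside Google.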