December 16, 2007

Millions and Millions of Search Results... Ignored

When you put a common term into a search engine, you're likely to get hundreds of thousands, if not millions, of search results, whether you use Google, Yahoo!, or any other leading engine.

In fact, some people use the sheer number of search results to judge which search engine's crawler has done the most thorough job of indexing the Web, and assume that the engine with the most results has the superior algorithm.

But did you know that regardless of how many results there might actually be for a query, both Google and Yahoo! will only let you see the first 1,000?

Sorry, can't go past page 100. You've reached the end.

This artificial limit is excused on the grounds that it reduces the load on software and hardware, and that 1,000 results is good enough for most people. So you'll never, ever, get past page 100 on Google or Yahoo!, even if a search for "Google" on either engine reports more than 1 billion results.
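To put the cap in perspective, here is a minimal Python sketch of the arithmetic. It assumes the default of 10 results per page (which is what makes 1,000 results end at page 100) and uses the 1 billion figure from the "Google" search above; the function names are made up for illustration and are not part of any search engine's API.

```python
# Figures taken from the article: a 1,000-result cap and a reported
# total of 1 billion results for the query "Google".
RESULT_CAP = 1_000
RESULTS_PER_PAGE = 10  # assumption: the default page size
REPORTED_TOTAL = 1_000_000_000


def last_reachable_page(cap: int, per_page: int) -> int:
    """Return the last page number reachable under a result cap."""
    # Ceiling division, so a partial final page still counts as a page.
    return -(-cap // per_page)


def visible_fraction(cap: int, reported_total: int) -> float:
    """Return the share of reported results a user can actually view."""
    return min(cap, reported_total) / reported_total


pages = last_reachable_page(RESULT_CAP, RESULTS_PER_PAGE)
share = visible_fraction(RESULT_CAP, REPORTED_TOTAL)
print(f"Last reachable page: {pages}")
print(f"Visible share of reported results: {share:.6%}")
```

With these numbers, the viewable slice works out to one result in every million reported.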

But that data is coming from somewhere, right? If Google has a mountain of results available for a term, and only delivers the top 1,000, then some database somewhere knows which results sit at positions 1,001 to 9,999 and beyond, into the tens of millions. Yet users have no recourse if they want to peer into that index. There's no option to "Show all results" or "Display the top 10,000 results". Google and Yahoo! have arbitrarily decided that 1,000 is good enough for you, and that's that.

Do you feel lucky? Some have said Google overwhelmingly optimizes for the first results, and as the company writes, "We try to make your search experience so efficient that it's not necessary to scroll past the first ten listings."

But isn't it likely that there are projects out there where it would be helpful to analyze the top 5,000 results? Or 20,000? If you're an SEO firm, the benefits are obvious, and if you're doing any kind of artificial intelligence research, Google would be one of the best data pools out there.

So why are they doing this? It looks like even Google, which is assumed to have one of the most redundant, robust systems known to man, is trying to save money and resources. They write, in an explanation, "It would heavily tax our system to provide these results for everyone."

That's understandable, but then what data is propping up Yahoo!'s or Google's claims that they have the most thorough results? Could the last step of Google's algorithm be "multiply results x 2", solely to report the biggest number? After all, if you can only see the first 1,000, why not report you got eleventy trillion? There's got to be a way to get to the rest of the data.

