
SearchPath Blog

SearchPath Internet Marketing Blog - Thoughts, ideas, humour, information and more ...

Friday, May 09, 2008

Google Tries to Tap Into the Hidden Web

The web is vast. Incredibly vast. Some estimates put the searchable web at around 11 billion pages; it would take lifetimes to view all that content. But that is just the tip of the iceberg. Behind the searchable, Google-oriented web lies a massive amount of content that is not available to search engines. This is called the hidden, or invisible, web. Some estimates put the amount of data hidden from search engines at more than 15 billion pages, which is larger than what most people would normally call the web.

Pages can be hidden from the search engines for a number of reasons: the content could be dynamic in a way that spiders cannot follow; it could be unlinked; access to it could be restricted; it could sit inside an image or video; or it could be reachable only by submitting a form. Since the early days of the web, the search engines have wanted access to this uncharted realm of information, to enhance their reputation for offering surfers the biggest index available.

Last month Google announced on its Webmaster blog a technological breakthrough that gives it access to hitherto inaccessible web content. Over the past few months Google has been experimenting with having its spider (Googlebot) fill out HTML forms, in order to discover hidden content and URLs that it can index for Google users. Google's blog comments:

"Specifically, when we encounter a form element on a high-quality site, we might choose to do a small number of queries using the form. For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML."

If Googlebot deems the content to be valid and interesting, it includes it in its index.
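To make the quoted description more concrete, here is a rough Python sketch of how a crawler might turn one form into a handful of query URLs. It is only an illustration of the idea, not Google's implementation: the function name `candidate_form_urls`, its parameters, the example.com URLs and the sample words are all made up for this post, and it assumes a simple form submitted with GET.

```python
from itertools import islice, product
from urllib.parse import urlencode, urljoin


def candidate_form_urls(page_url, action, text_fields, choice_fields, site_words, limit=5):
    """Build a small number of GET URLs a crawler could request for one form.

    text_fields:   names of free-text inputs, filled with words found on the site
    choice_fields: mapping of select/radio/checkbox names to the values in the HTML
    limit:         cap on the number of queries, to keep the crawl small
    (All names here are hypothetical, for illustration only.)
    """
    names = list(text_fields) + list(choice_fields)
    # Text boxes draw from the site's own words; other controls from their HTML values.
    value_lists = [site_words for _ in text_fields]
    value_lists += [choice_fields[name] for name in choice_fields]
    base = urljoin(page_url, action)
    urls = []
    # Take only the first few combinations rather than every possible one.
    for combo in islice(product(*value_lists), limit):
        urls.append(base + "?" + urlencode(list(zip(names, combo))))
    return urls


urls = candidate_form_urls(
    page_url="http://www.example.com/catalogue/",
    action="search",
    text_fields=["q"],
    choice_fields={"category": ["books", "music"]},
    site_words=["guitar", "piano", "drums"],
    limit=4,
)
for url in urls:
    print(url)
```

Each URL produced this way is an ordinary link that can be fetched and, if the result looks worthwhile, indexed like any other page.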

Danny Sullivan lauded Google's new technology. "The move is potentially good for searchers, in that it will open up material often referred to [as] being part of the 'deep web' or 'invisible web' as it was hidden behind forms... It should be noted that Google's not the first to do something like this. Companies like Quigo, BrightPlanet and WhizBang Labs were doing this type of work years ago." Google is, however, the first major search engine to do this kind of exploration.

SEOs and webmasters need to remember that content previously hidden behind a form is now potentially accessible. If that content should stay out of the index, steps need to be taken to block Googlebot using robots.txt or the noindex meta tag. Google says these directives will still be respected.
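As a quick way to sanity-check a robots.txt rule before relying on it, the sketch below uses Python's standard `urllib.robotparser` module. The robots.txt content, the `/search` path and the example.com URLs are assumptions chosen for illustration, and the standard-library parser only approximates how a well-behaved crawler reads the file; Googlebot's own matching may differ in detail.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep Googlebot away from the form's results pages.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /search

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A URL generated by submitting the form should be off limits to Googlebot...
search_ok = parser.can_fetch("Googlebot", "http://www.example.com/search?q=guitar")
# ...while ordinary pages remain crawlable.
about_ok = parser.can_fetch("Googlebot", "http://www.example.com/about.html")

print("form results crawlable:", search_ok)
print("about page crawlable:", about_ok)
```

Because the rule is a path prefix, it covers every query string the form can produce, which is usually what you want when the aim is to keep form results out of the index.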
