Search Engines 101 - Indexing (Part 1)
When you go to Yahoo! or Google and do a web search for something, you’re not actually searching the internet. What you’re doing is searching a massive database that each company has built and filled with information about the billions of pages on the internet.
The terms you search for are compared with the information in this database, and the web pages that Yahoo! or Google thinks most closely match what you’re searching for are listed as results.
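To make the idea of comparing search terms against a database concrete, here is a minimal sketch in Python. It is only an illustration, not how Yahoo! or Google actually rank pages: the page entries, the `INDEX` list, and the `search` function are all made up for this example, and the "ranking" is just a count of matching keywords.

```python
# Toy "database": what a search engine might have stored about three pages.
# All URLs, titles, and keywords here are hypothetical.
INDEX = [
    {"url": "http://example.com/cats", "title": "All About Cats",
     "keywords": {"cats", "pets", "kittens"}},
    {"url": "http://example.com/dogs", "title": "Dog Training Tips",
     "keywords": {"dogs", "pets", "training"}},
    {"url": "http://example.com/cars", "title": "Used Cars",
     "keywords": {"cars", "sales"}},
]


def search(query, index=INDEX):
    """Return URLs ordered by how many query terms match each page's keywords."""
    terms = set(query.lower().split())
    scored = []
    for page in index:
        matches = len(terms & page["keywords"])
        if matches:
            scored.append((matches, page["url"]))
    # Most matches first; ties are broken alphabetically by URL.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [url for _, url in scored]


print(search("pets training"))
# ['http://example.com/dogs', 'http://example.com/cats']
```

Notice that the search never touches the pages themselves. It only looks at what was stored about them earlier.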
In addition to the actual data in their databases, what makes the Yahoo!, Google, MSN/Live, Ask.com, etc. search engines different is the way each one of them ‘indexes’ web pages.
Indexing refers to the proprietary methods a search engine uses to find web pages on the internet and to add information about each page to its database when it visits. This database of information about web pages is called a search engine’s ‘index’.
Search engines fill their databases with information by sequentially visiting web pages, collecting information about the page’s content, and following links out of the site to discover new web pages. This procedure is called ‘spidering’ and will be covered in another post.
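The visit, collect, follow-links loop described above can be sketched in a few lines of Python. This is a simplified illustration under some loud assumptions: `FAKE_WEB` is a made-up dictionary standing in for the internet (so nothing is actually downloaded), `PageParser` and `spider` are names invented for this example, and a real spider would also have to deal with politeness rules, errors, and scale.

```python
from collections import deque
from html.parser import HTMLParser

# Stand-in for the web: URL -> HTML. Hypothetical pages, no network access.
FAKE_WEB = {
    "http://a.example/": '<title>Page A</title>'
                         '<a href="http://b.example/">B</a>'
                         '<a href="http://c.example/">C</a>',
    "http://b.example/": '<title>Page B</title><a href="http://a.example/">A</a>',
    "http://c.example/": '<title>Page C</title><a href="http://d.example/">D</a>',
}


class PageParser(HTMLParser):
    """Collects the page title and every link (href) found on the page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def spider(start_url, fetch=FAKE_WEB.get):
    """Visit pages one at a time, record each title, and queue newly found links."""
    index = {}
    queue = deque([start_url])
    seen = {start_url}
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # link points to a page we cannot retrieve
        parser = PageParser()
        parser.feed(html)
        parser.close()
        index[url] = parser.title
        for link in parser.links:
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return index


print(spider("http://a.example/"))
```

Starting from a single page, the spider discovers B and C by following links. The link to D leads nowhere in this toy web, so that page is skipped.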
Spidering takes resources - both in the bandwidth necessary to traverse the internet and return data about web pages and in the storage capacity necessary to store whatever information is collected. Because a spider visit is just like any other visit to a web page, bandwidth fees are incurred by the page owner as well.
In the early days of the internet when connections were slow, bandwidth expensive, and data storage at a premium, search engines typically only saved the title and location (URL) of a web page and a list of keywords that described the content of the page. It was simply too expensive for a search engine to pay for the bandwidth and storage necessary to save more information about a web page, not to mention the additional time that indexing more content on a page would add to the already lengthy spidering process. These expense issues were also important to site owners, as more intensive or more frequent spider visits meant increased bandwidth costs for them as well.
So, a compromise was reached. Web designers would supply a list of keywords relevant to the content of the site (in the ‘keywords’ META tag hidden in the code of the page) on the main page of the site and search engines wouldn’t need to visit every web page on the site trying to find out what each was about.
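For readers who have never looked at a page’s code, here is roughly what reading the ‘keywords’ META tag might look like in Python. The sample page, the `MetaKeywordsParser` class, and the `extract_keywords` function are all invented for this illustration. It shows the general idea only, not any particular search engine’s code.

```python
from html.parser import HTMLParser


class MetaKeywordsParser(HTMLParser):
    """Pulls the comma-separated list out of <meta name="keywords" content="...">."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "keywords":
            content = attr.get("content") or ""
            # Split on commas, trim whitespace, and normalize to lowercase.
            self.keywords.extend(
                k.strip().lower() for k in content.split(",") if k.strip()
            )


def extract_keywords(html):
    parser = MetaKeywordsParser()
    parser.feed(html)
    parser.close()
    return parser.keywords


# A hypothetical page, for illustration only.
SAMPLE_PAGE = """<html><head>
<title>Bob's Bike Shop</title>
<meta name="keywords" content="bicycles, bike repair, Cycling Gear">
</head><body>Welcome!</body></html>"""

print(extract_keywords(SAMPLE_PAGE))
# ['bicycles', 'bike repair', 'cycling gear']
```

The search engine takes the list entirely on trust: whatever the designer typed into the tag is what gets stored.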
This gentlemen’s agreement worked well until the web started to commercialize and web designers realized they could put anything they wanted in the ‘keywords’ tag, regardless of what their web site was actually about. This resulted in sites being found in search results for keywords that had little or nothing to do with their actual content, which made search engines less useful as a way to find relevant content online.
In Part 2, how search engines responded.