Archive for Understanding Search Engines

Search Engines 101 - Indexing (Part 2)

(Continued from Part 1)

As connection speeds increased and bandwidth and storage became more affordable, search engines were able to visit more pages on a site and record more information about each page. In addition, search engines began to move away from considering only on-page text and put more weight on off-page factors like inbound links that are not as easily manipulated by page owners.

The important thing to remember about search engines today is their continuing reliance on off-page factors to determine what a page is about. In the past, indexing the content of the page itself was enough to provide accurate data, but today there are simply too many ways for site owners to manipulate the text on their pages to artificially boost their rankings.

This continues to be one of the biggest misconceptions our clients have about search engines - that simply changing ‘META’ tags like description and keywords will make much difference to how a search engine ranks a page. In 1997 that may have been enough, but search algorithms are much more advanced today. In fact, many of my colleagues believe that including the keywords tag on a page can actually harm a page’s rankings (more on this in other posts).

So, what you should remember about indexing is that it’s how a search engine collects and stores information about web sites. Also remember that influencing a search engine by changing on-page factors like META tags, keyword density, or any other easy-to-manipulate metric is much more difficult than it was in the past and certainly not a strategy to base your search engine marketing upon…

Search Engines 101 - Indexing (Part 1)

When you go to Yahoo! or Google and do a web search for something, you’re not actually searching the internet. What you’re doing is searching a massive database that each company has built and filled with information about the billions of pages on the internet.

The terms you search for are compared with the information in this database, and the web pages that Yahoo! or Google thinks most closely match what you’re searching for are listed as results.
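To make the idea of "searching a database, not the internet" concrete, here is a minimal sketch in Python. It is not how Yahoo! or Google actually work; the page addresses, the stored text, and the scoring rule (count how many search terms a page contains) are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical mini "database": page address -> text stored about that page.
PAGES = {
    "example.com/coffee": "fresh roasted coffee beans shipped daily",
    "example.com/tea": "loose leaf tea and tea pots",
    "example.com/cafe": "coffee and tea served daily",
}

def build_index(pages):
    """Map each word to the set of pages that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Rank pages by how many of the query's distinct words they contain."""
    scores = defaultdict(int)
    for word in set(query.lower().split()):
        for url in index.get(word, ()):
            scores[url] += 1
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda url: (-scores[url], url))

index = build_index(PAGES)
print(search(index, "coffee daily"))
```

Notice that the search never touches the pages themselves. It only looks things up in the index that was built ahead of time, which is why results come back so quickly.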

In addition to the actual data in their databases, what makes the Yahoo!, Google, MSN/Live, Ask.com, etc. search engines different is the way each one of them ‘indexes’ web pages.

Indexing refers to the proprietary methods a search engine uses to find web pages on the internet and to add information about those pages to its database when it visits them. This database of information about web pages is called a search engine’s ‘index’.

Search engines fill their databases with information by sequentially visiting web pages, collecting information about the page’s content, and following links out of the site to discover new web pages. This procedure is called ‘spidering’ and will be covered in another post.
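The visit-collect-follow loop described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the "web" here is a small made-up dictionary rather than real pages fetched over the network, and the `spider` function and its names are hypothetical, not any search engine's actual code.

```python
from collections import deque

# Hypothetical in-memory "web": address -> (page text, links found on the page).
WEB = {
    "a.example/": ("home page", ["a.example/about", "b.example/"]),
    "a.example/about": ("about us", ["a.example/"]),
    "b.example/": ("another site", ["c.example/"]),
    "c.example/": ("a third site", []),
}

def spider(start, web):
    """Visit pages one at a time, store their text, and queue newly found links."""
    queue = deque([start])
    seen = {start}      # addresses already queued, so no page is visited twice
    collected = {}      # what ends up in the "index"
    while queue:
        url = queue.popleft()
        if url not in web:
            continue    # a dead link: nothing to collect
        text, links = web[url]
        collected[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return collected

print(list(spider("a.example/", WEB)))
```

Starting from a single page, the spider discovers the other three just by following links, which is how new sites get found without anyone submitting them.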

Spidering takes resources - both in the bandwidth necessary to traverse the internet and return data about web pages and in the storage capacity necessary to store whatever information is collected. Because a spider visit is just like any other visit to a web page, bandwidth fees are incurred by the page owner as well.

In the early days of the internet when connections were slow, bandwidth expensive, and data storage at a premium, search engines typically only saved the title and location (URL) of a web page and a list of keywords that described the content of the page. It was simply too expensive for a search engine to pay for the bandwidth and storage necessary to save more information about a web page, not to mention the additional time indexing more content on a page would add to the already lengthy spidering process. These expense issues were also important to site owners as more intensive or more frequent spider visits meant increased bandwidth costs for them as well.

So, a compromise was reached. Web designers would supply a list of keywords relevant to the content of the site (in the ‘keywords’ META tag hidden in the code of the page) on the main page of the site and search engines wouldn’t need to visit every web page on the site trying to find out what each was about.
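For readers who have never looked at one, here is roughly what reading a ‘keywords’ META tag involves, sketched with Python's standard `html.parser` module. The sample page and the `KeywordsParser` and `extract_keywords` names are invented for this example; real early search engines certainly differed in the details.

```python
from html.parser import HTMLParser

class KeywordsParser(HTMLParser):
    """Collects the comma-separated terms from a page's 'keywords' META tag."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        if (attrs.get("name") or "").lower() == "keywords":
            content = attrs.get("content") or ""
            self.keywords = [k.strip() for k in content.split(",") if k.strip()]

def extract_keywords(html):
    parser = KeywordsParser()
    parser.feed(html)
    return parser.keywords

# Hypothetical main page of a small site.
SAMPLE = (
    "<html><head><title>Joe's Coffee</title>"
    '<meta name="keywords" content="coffee, espresso, cafe">'
    "</head><body>Welcome!</body></html>"
)

print(extract_keywords(SAMPLE))
```

The point to notice is that the parser simply trusts whatever the page author typed into the tag. Nothing checks the keywords against the visible content of the page.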

This gentlemen’s agreement worked well until the web started to commercialize and web designers realized they could put anything they wanted in the ‘keywords’ tag, regardless of what their web site was actually about. This resulted in sites being found in search results for keywords that had little or nothing to do with their actual content and also reduced the efficacy of the search engine as a useful way to find relevant content online.

In Part 2, how search engines responded.