BIO150Y Optimal Information Foraging  

More about the Internet and Search Engines

Oh, what a tangled Web we weave
when first we practice to deceive.
Sir Walter Scott


What is the Internet?

Internet

The Internet is a huge network of millions of computers, spread across many countries and organizations, that are connected via a common set of protocols.

The Internet consists of:

  • World Wide Web: collection of HTML pages authored by anyone; most visible part of the Internet
  • Email: all email communications travel via the Internet
  • FTP sites: servers from which programs and other files can be transferred from one computer to another
  • Deep Web: non-textual information such as images, multimedia files, and video; also includes information stored in databases that must be searched individually, such as an online library catalogue or eBay
  • Library on the Web: the 14,000+ e-journals, 300+ periodical indexes, and other resources the U of T Library subscribes to

How do search engines work?

All search engines consist of 3 basic parts:

  1. Spider (or robot, worm, crawler, wanderer): a software programme that travels the World Wide Web to build lists of words found on Web sites. This process is called Web crawling. Its usual starting points are heavily used servers and very popular Web pages. It indexes the words on a page and then follows every link found within the site to travel to other pages. It can spread quickly, covering about 100 pages per second.

    Search engines differ in the way words are found and processed. For example, Google spiders pay special attention to words occurring in the title, subtitles, and meta tags. They do not index very common words such as "a", "an", and "the". Other search engine spiders index all words but pay special attention to the 100 most frequently used words on the page as well as the first 20 lines of text.

    Meta tags: Behind-the-scenes HTML code in which the page owner specifies keywords and concepts describing the Web page. Sometimes the words in meta tags have little to do with the content of the page and are instead popular words included to attract the attention of spiders, so that the page shows up in search engine results lists. Spiders now try to correlate meta tags with page content to avoid this.

  2. Index: The spiders store each word, along with the URL where it was found, in an index. To add meaning, a weight or rank is assigned to each entry. This may depend on where the word came from (title, meta tag, etc.) and its relation to other words on the page. Again, search engines differ in the method they use to weight or rank information in their indexes.

    These data are encoded to save storage space. The combination of efficient indexing and efficient storage makes it possible to get results quickly even when the user enters a complicated search or when the results list is huge.

  3. Search interface: This is where you, the user, enter a query or search statement. The search engine matches the words of the query to the words in its index and retrieves a set of results (or hits). The results list places the most useful or relevant Web sites at the top.
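
The spider described in part 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any real search engine is implemented: PAGES is a hypothetical two-page "Web" held in memory (a real spider would fetch pages over the network), and the names PageParser, crawl, and unsupported_keywords are invented for this example. The sketch collects title words, body words, meta keywords, and links from each page, follows the links to reach further pages, and then lists meta keywords that never appear in the page text.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "Web": URL -> HTML source (assumption for this sketch).
PAGES = {
    "http://example.org/": (
        '<html><head><title>Foraging</title>'
        '<meta name="keywords" content="biology, ecology"></head>'
        '<body><p>Optimal foraging theory.</p>'
        '<a href="http://example.org/b">next</a></body></html>'
    ),
    "http://example.org/b": (
        '<html><head><title>Bees</title></head>'
        '<body><p>Bees forage for nectar.</p>'
        '<a href="http://example.org/">home</a></body></html>'
    ),
}


class PageParser(HTMLParser):
    """Collects links, meta keywords, title words, and body words from one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.keywords = []
        self.title_words = []
        self.body_words = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            content = attrs.get("content") or ""
            self.keywords = [k.strip().lower() for k in content.split(",") if k.strip()]
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Lowercase each word and strip surrounding punctuation.
        words = [w.strip(".,;:!?\"'()").lower() for w in data.split()]
        words = [w for w in words if w]
        target = self.title_words if self._in_title else self.body_words
        target.extend(words)


def crawl(start_url, pages):
    """Visit start_url, then follow every link found, visiting each page once."""
    seen = {start_url}
    queue = deque([start_url])
    found = {}
    while queue:
        url = queue.popleft()
        html = pages.get(url)
        if html is None:  # link to a page we cannot reach
            continue
        parser = PageParser()
        parser.feed(html)
        found[url] = parser
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return found


def unsupported_keywords(page):
    """Meta keywords that never occur in the page's title or body text."""
    text = set(page.title_words) | set(page.body_words)
    return [k for k in page.keywords if k not in text]


result = crawl("http://example.org/", PAGES)
```

Here crawl reaches the second page only by following the link on the first. On the first page, unsupported_keywords flags both meta keywords, because neither "biology" nor "ecology" occurs in the visible text. A real spider's check of meta tags against page content would be more sophisticated than this exact-word comparison.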

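Parts 2 and 3 above (the index and the search interface) can be sketched in Python as follows. Everything specific here is an assumption made for illustration: the two documents in DOCS are hypothetical, the weights in WEIGHTS (a title word counts three times as much as a body word) are invented, and the names build_index and search are not taken from any real search engine. The sketch skips very common words, stores a weight for each word and URL pair, and ranks the hits for a query by total weight.

```python
from collections import defaultdict

STOP_WORDS = {"a", "an", "the"}        # very common words that are not indexed
WEIGHTS = {"title": 3.0, "body": 1.0}  # assumed weights, chosen for illustration

# Hypothetical documents: URL -> {field: text}
DOCS = {
    "http://example.org/bees": {
        "title": "Bees and nectar",
        "body": "The bees forage for nectar in a meadow",
    },
    "http://example.org/ants": {
        "title": "Ants",
        "body": "Ants forage for seeds and nectar",
    },
}


def build_index(docs):
    """Map each word to the URLs containing it, with an accumulated weight."""
    index = defaultdict(lambda: defaultdict(float))
    for url, fields in docs.items():
        for field, text in fields.items():
            for word in text.lower().split():
                if word not in STOP_WORDS:
                    index[word][url] += WEIGHTS[field]
    return index


def search(index, query):
    """Return URLs matching any query word, highest total weight first."""
    terms = [t for t in query.lower().split() if t not in STOP_WORDS]
    scores = defaultdict(float)
    for term in terms:
        for url, weight in index.get(term, {}).items():
            scores[url] += weight
    # Sort by descending score; ties are broken alphabetically by URL.
    return sorted(scores, key=lambda u: (-scores[u], u))


index = build_index(DOCS)
hits = search(index, "the nectar")
```

For the query "the nectar", the word "the" is dropped as a stop word. The bees page ranks first because "nectar" appears in its title as well as its body, while on the ants page it appears in the body only. Matching any query word, rather than requiring all of them, is a simplification of this sketch.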

That seems straightforward! Or is it?

Internet bias

The line between content and advertisement is not clearly drawn on the Web. Search engines offer advertisers a variety of ways to buy their way onto search results pages:

  • Banner ads: graphic or textual banners are a staple feature of all major search engines

  • Content deals: the promotion of advertisers' content on search results pages, usually, but not always, separate from the list of results

  • Paid placement: guarantees the advertiser a particular position in the main search results: above, below, or alongside editorial links

  • Paid inclusion: the advertiser pays to be included in the search engine's index, with no guarantee of a particular position in search results

See why you need to evaluate what you find? Which results come from index word matching, and which come from paid advertising?



© 2002 University of Toronto. All rights reserved.
Comments to BIO150Y staff