
Internet searches made simple

Jack Smith, Senior Editor, Plant Engineering Magazine -- Plant Engineering, 6/1/2001

There are hundreds of millions of web pages accessible via the internet. Some pages are easy to find; others are not. Unless you are given a specific uniform resource locator (URL), or are extremely lucky, you must search through the masses for the pages that contain the information you need. A search engine is the tool for finding that information.

Search engines are either general-purpose commercial services or site-specific tools. Site-specific searches use database-querying tools or employ a general-purpose external search engine. Regardless of the type, knowing how search engines work enables you to make the most of these powerful tools.

How do search engines locate information? How can you create searches that provide the information you seek? This article provides the answers.

How search engines work

A search engine uses special software tools, commonly referred to as robots or spiders, to assemble lists of the words found on web sites. The process by which a spider builds its lists is called web crawling.

Spiders usually begin by looking at lists of heavily used servers and popular pages. Beginning with a popular site, a spider indexes the words on its pages and follows every link found within the site. The spider spreads out quickly across the most widely used areas of the web, much like a true spider weaves its web.
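
The crawling behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "web" here is a small hypothetical dictionary (TOY_WEB) rather than live pages, and the breadth-first order is one reasonable way to follow links, not the method of any particular search engine.

```python
from collections import deque

# Hypothetical miniature "web": each page maps to the links found on it.
TOY_WEB = {
    "popular.example/home": ["popular.example/news", "other.example/a"],
    "popular.example/news": ["popular.example/home"],
    "other.example/a": ["other.example/b"],
    "other.example/b": [],
}

def crawl(start, web):
    """Visit pages breadth-first from a starting page, following each link once."""
    visited = []
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        visited.append(page)
        for link in web.get(page, []):
            if link not in seen:  # skip pages that are already queued or visited
                seen.add(link)
                queue.append(link)
    return visited

print(crawl("popular.example/home", TOY_WEB))
```

Starting from the popular page, the sketch reaches every page that is linked, directly or indirectly, and never visits the same page twice.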

For example, when the spider for Google.com, which began as an academic search engine, looked at an HTML page, it made a list of the words within the page and noted where the words were found. Words in positions of relative importance, such as those occurring in the title, subtitles, and meta-tags, were noted for special consideration during subsequent user searches. This spider was designed to index every significant word on a page except for the articles "a," "an," and "the."
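
A minimal sketch of this kind of indexing follows. It assumes a page is just a title and a body string, and the function and variable names (index_page, ARTICLES) are made up for illustration; it is not the code of any real spider.

```python
import re
from collections import defaultdict

ARTICLES = {"a", "an", "the"}  # the only words this sketch skips

def index_page(url, title, body, index):
    """Record (url, field, position) for every word that is not an article."""
    for field, text in (("title", title), ("body", body)):
        words = re.findall(r"[a-z0-9]+", text.lower())
        for position, word in enumerate(words):
            if word not in ARTICLES:
                index[word].append((url, field, position))

index = defaultdict(list)
index_page("example.com/pumps", "The Pump Guide", "A pump moves the fluid.", index)

# Each entry notes where the word was found, including whether it was in the title.
print(index["pump"])
```

Because the field is stored with each entry, a later search can give extra weight to words that appeared in the title.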

Other spiders work differently. Some approaches are devised to make the spider operate faster, allow more efficient searches, or both. The Lycos spider keeps track of words in the title, subheads, links, and the 100 words used most frequently on the page, along with every word in the first 20 lines of text. AltaVista indexes every word on a page, including the articles.

Other systems place importance on meta-tags. These identifiers allow page owners to specify key words and concepts that aid in indexing the page. Meta-tags can direct the search engine in selecting among several possible meanings to find the correct word or words.

The task of finding information on web pages is never actually completed. Since the web is always changing, the spiders are always crawling. Regardless, the search engine must store the retrieved information in a useful way. The factors important to making gathered data accessible to users are the information stored with the data and the method by which the information is indexed.

A search engine would be of limited use if it merely stored the word and the URL where it was found. So, other factors must also be weighed.

Was the word used in an important or trivial way? How many times was the word used? Was it used only once on the page? Does the page contain links to other pages containing the word? It is necessary to provide a list that "ranks" the most useful pages at the top of the search results list.
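
One way to picture ranking is a simple score that counts occurrences and gives more credit to important positions. The sketch below is an assumption-laden toy: the weight of 5 for a title hit is arbitrary, the page records are invented, and real engines weigh many more factors.

```python
def score(page, term):
    """Toy relevance score: title hits count more than body hits (weights are arbitrary)."""
    term = term.lower()
    title_hits = page["title"].lower().split().count(term)
    body_hits = page["body"].lower().split().count(term)
    return 5 * title_hits + body_hits

# Hypothetical pages used only for this illustration.
pages = [
    {"url": "a.example", "title": "Valve basics", "body": "a valve controls flow"},
    {"url": "b.example", "title": "Plant news", "body": "valve valve valve recall"},
    {"url": "c.example", "title": "Motors", "body": "no match here"},
]

# Keep only pages that mention the term, most useful first.
ranked = sorted(
    (p for p in pages if score(p, "valve") > 0),
    key=lambda p: score(p, "valve"),
    reverse=True,
)
print([p["url"] for p in ranked])
```

Here the page with the word in its title outranks the page that merely repeats the word in its text, and the page without the word is left out entirely.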

An index allows information to be found as quickly as possible. To build an index, a formula assigns a numerical value to each word. The formula is designed to distribute the entries evenly across a fixed number of divisions, rather than following the uneven distribution of words across the alphabet. This process is called hashing, and the result is a hash table.

Because there are more words that begin with some letters than with others, finding a word with a frequently used initial letter would take longer than a word with an infrequently used letter. Hashing reduces the average amount of time necessary to locate an entry and separates the index from the actual entry.

The hash table contains the calculated and assigned hash number, along with a pointer to the actual data. These data can be sorted to maximize storage efficiency. This indexing/storage combination enables quick results, even with complicated searches.
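
The idea can be sketched as follows. This is a simplified illustration, not how any particular engine stores its index: the bucket count of 8 is arbitrary, zlib.crc32 stands in for whatever hashing formula an engine really uses, and the class name HashIndex is invented.

```python
import zlib

NUM_BUCKETS = 8  # arbitrary number of divisions for this sketch

def bucket_for(word):
    """Turn a word into a bucket number that does not depend on its first letter."""
    return zlib.crc32(word.encode("utf-8")) % NUM_BUCKETS

class HashIndex:
    def __init__(self):
        self.buckets = [[] for _ in range(NUM_BUCKETS)]
        self.records = []  # the actual data, stored separately from the index

    def add(self, word, urls):
        self.records.append(urls)
        # The index entry holds only the word and a pointer (list position) to the data.
        self.buckets[bucket_for(word)].append((word, len(self.records) - 1))

    def lookup(self, word):
        # Only one bucket has to be scanned, not the whole index.
        for stored_word, record_pos in self.buckets[bucket_for(word)]:
            if stored_word == word:
                return self.records[record_pos]
        return None

idx = HashIndex()
idx.add("pump", ["a.example"])
idx.add("plant", ["b.example", "c.example"])
print(idx.lookup("plant"))
```

Words that share an initial letter, such as "pump" and "plant", are assigned buckets by the formula rather than by the alphabet, which is what keeps lookup time even.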

Basic searches

To search through an index, you must first build a query. A query can be as simple as a single word, or it can be a complicated combination of words and operators.

To use a search engine, such as Yahoo! or Lycos, navigate to the opening page. The opening page for Yahoo! is either www.yahoo.com or just yahoo.com. Your browser should take you there regardless of whether or not you enter the www. Access Lycos by typing in either www.lycos.com or lycos.com into the window of your browser.

When the search engine portal appears, type your search word or phrase into the search window. Typing the phrase "Plant Engineering magazine" into the search window of Lycos, then clicking on search, produced four web sites based on user selection traffic. However, 95,872 web pages were found in a search of the complete Lycos catalog or index.

Regardless of how efficient a plant engineer is, none has the time to visit more than 90,000 web hits. It quickly becomes necessary to limit your search.

Most search engines allow you to narrow your search. In Lycos, click the checkbox labeled search these results to "drill down" within your search criteria.

For example, checking this box, then typing "Information Engineering" into the search box and clicking the search button resulted in 447 sites from the entire Lycos catalog. That is far fewer than 95,872, but probably still not a manageable number. To limit the number of hits, quotation marks were placed around the search subject "Information Engineering" to force Lycos to return only pages that contain the specific phrase in the exact order within the quotes.

Basically, a word or a phrase can initiate a simple search. And simple searches can be useful. However, to be effective, sometimes a complex or advanced search is necessary.

Boolean operators

Building a complex query requires the use of Boolean operators. Boolean operators allow you to refine and extend the terms of the search.

Several Boolean operators are commonly used.

  • AND—Any terms joined by AND must appear in the pages or documents. Some search engines substitute the operator "+" for the word AND.
  • OR—At least one of the terms joined by OR must appear in the pages or documents.
  • NOT—The term or terms following NOT must not appear in the pages or documents. Some search engines substitute the operator "-" for the word NOT.
  • FOLLOWED BY—One of the terms must be directly followed by the other.
  • NEAR—One of the terms must be within a specified number of words of the other.
  • Quotation Marks—The words between the quotation marks are treated as a phrase, and that phrase must be found within the document or file in the exact order.
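
The operators listed above map naturally onto set operations and word positions. The following sketch assumes three invented documents and a fixed NEAR distance of three words; it illustrates the logic only and does not reproduce the syntax of any search engine.

```python
# Hypothetical documents used only for this illustration.
DOCS = {
    1: "plant engineering magazine covers maintenance",
    2: "information engineering for the plant floor",
    3: "maintenance software and plant engineering news",
}

def docs_with(term):
    """Return the set of document numbers that contain the term as a whole word."""
    return {doc_id for doc_id, text in DOCS.items() if term in text.split()}

def has_phrase(text, phrase):
    """True if the words of the phrase appear together, in the exact order."""
    words, target = text.split(), phrase.split()
    return any(words[i:i + len(target)] == target
               for i in range(len(words) - len(target) + 1))

def near(text, first, second, distance):
    """True if the two words occur within `distance` words of each other."""
    words = text.split()
    pos_first = [i for i, w in enumerate(words) if w == first]
    pos_second = [i for i, w in enumerate(words) if w == second]
    return any(abs(i - j) <= distance for i in pos_first for j in pos_second)

and_result = docs_with("plant") & docs_with("maintenance")    # AND: both terms
or_result = docs_with("magazine") | docs_with("information")  # OR: at least one term
not_result = docs_with("plant") - docs_with("magazine")       # NOT: exclude a term
phrase_result = {d for d, t in DOCS.items() if has_phrase(t, "plant engineering")}
near_result = {d for d, t in DOCS.items() if near(t, "plant", "maintenance", 3)}

print(and_result, or_result, not_result, phrase_result, near_result)
```

FOLLOWED BY can be treated as a two-word phrase match, so has_phrase covers it as well.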

The "+" symbol is especially helpful when you do a search and then find yourself overwhelmed with information.

Using Yahoo!, the following was typed into the search window: plant engineering magazine+ information engineering+cmms/eam

Yahoo! returned 10 web site matches, most of which were relevant (Fig. 2). This number is much more usable than the 95,872 or even the 447 results returned by the earlier searches, which were not limited by Boolean operators.

Often, you may need a search engine to find pages that have one word on them but not another word. The "-" symbol allows this type of search.

Using Yahoo!, the following was typed into the search window:

plant engineering magazine+information engineering+eam-cmms.

Yahoo! returned nine web site matches, which included the "EAM" term and excluded the "CMMS" term. Eliminate terms you know are not of interest to get the best results from the "-" operator.
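
To make the effect of the "+" and "-" symbols concrete, the sketch below splits a query like the one above into required and excluded phrases and tests a piece of text against them. The parsing rule is an assumption made for illustration: it treats every "+" or "-" as an operator, so it would mishandle hyphenated words, and it uses simple substring matching. It does not describe how Yahoo! actually interprets queries.

```python
import re

def parse_query(query):
    """Split a query into required and excluded phrases at '+' and '-' symbols."""
    required, excluded = [], []
    # The first phrase has no leading symbol, so it is treated as required.
    for sign, phrase in re.findall(r"([+-]?)([^+-]+)", query):
        phrase = phrase.strip().lower()
        if not phrase:
            continue
        (excluded if sign == "-" else required).append(phrase)
    return required, excluded

def matches(text, required, excluded):
    """True if the text contains every required phrase and no excluded phrase."""
    text = text.lower()
    return all(p in text for p in required) and not any(p in text for p in excluded)

required, excluded = parse_query(
    "plant engineering magazine+information engineering+eam-cmms"
)
print(required, excluded)
```

A page mentioning all three required phrases passes the test, while the same page is rejected as soon as it also mentions "cmms".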

Advanced searches

For most users, general searching techniques using Boolean or symbol operators are sufficient. However, if more searching power is required, the following commands are useful.

MATCH ANY

Occasionally you want web pages that contain any of the terms for which you are searching. Some search engines do this automatically. It is not necessary to enter a special operator. Those search engines include AltaVista, Excite, GoTo, Go, LookSmart, Netscape, Snap, WebCrawler, and Yahoo!. AOL Search, HotBot, Lycos, and MSN Search have "match any" as a menu item adjacent to the search window. You must use the Boolean operator OR with Northern Light. Google does not support the MATCH ANY command.

It should be noted that most search engines automatically list pages with all your search terms first, then some of your terms.

MATCH ALL

MATCH ALL returns only web pages that contain all your search terms. The search engines for which MATCH ALL is automatic include AOL Search, Google, HotBot, Lycos, MSN Search, and Northern Light. You must use the "+" operator with all other engines. Almost all the major search engines support the "+" operator as a command.
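
The difference between MATCH ANY and MATCH ALL, and the habit of listing pages with all the terms first, can be sketched as below. The page texts and function names are invented for illustration, and matching is on whole words only.

```python
def count_matches(text, terms):
    """Count how many of the search terms appear in the text as whole words."""
    words = set(text.lower().split())
    return sum(1 for term in terms if term.lower() in words)

def match_any(text, terms):
    return count_matches(text, terms) >= 1

def match_all(text, terms):
    return count_matches(text, terms) == len(terms)

# Hypothetical page texts used only for this illustration.
pages = ["pump catalog", "pump repair guide", "valve catalog"]
terms = ["pump", "repair"]

any_hits = [p for p in pages if match_any(p, terms)]
all_hits = [p for p in pages if match_all(p, terms)]

# List pages containing all the terms first, then pages containing only some.
ordered = sorted(any_hits, key=lambda p: count_matches(p, terms), reverse=True)
print(ordered)
```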

Title search

Many search engines allow you to search within a web page's HTML title. For example, this page has an HTML title similar to this:

<title>Internet searches made simple</title>.

There are several ways to execute a title search, depending on the search engine used. AltaVista, GoTo, HotBot, Go, MSN Search, Northern Light, and Snap require title: in the search window. It is important to include the colon in the command. An identifying word, phrase, or entire title follows the colon when you type it into the search window. Yahoo! requires "t:" instead of "title:". The Lycos title search is available on a menu on its advanced search page. AOL Search, Excite, Google, LookSmart, Netscape, and WebCrawler do not support searching for HTML titles.
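
A title search can be pictured as a query prefix that restricts matching to one field. In the sketch below, the page records and the search function are hypothetical, and only the title: form is handled; it is an illustration rather than any engine's implementation.

```python
# Hypothetical indexed pages used only for this illustration.
PAGES = [
    {"url": "a.example", "title": "Internet searches made simple", "body": "How spiders work"},
    {"url": "b.example", "title": "Pump maintenance", "body": "Internet searches help find parts"},
]

def search(query, pages):
    """Search titles only when the query starts with 'title:'; otherwise search all text."""
    prefix = "title:"
    if query.lower().startswith(prefix):
        needle = query[len(prefix):].strip().lower()
        return [p["url"] for p in pages if needle in p["title"].lower()]
    needle = query.lower()
    return [p["url"] for p in pages
            if needle in p["title"].lower() or needle in p["body"].lower()]

print(search("title:internet searches", PAGES))
print(search("internet searches", PAGES))
```

The same phrase finds both pages in an ordinary search but only the first page when the search is limited to titles.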

Site search

Sometimes you may want to control which sites are included or excluded from a search. This ability is a powerful search engine feature.

This feature allows you to:

  • See all the pages indexed from a specific domain
  • See all the pages indexed from a specific domain that contain a word or phrase
  • Use include and exclude commands along with specific domain searches
  • Include or exclude domains such as .edu for educational institutions, .gov for governmental, .org for organizational, or .us for domains located in the United States (or the appropriate code for any other country). Each country has a unique suffix. For example, the suffix for the United Kingdom is .uk.

GoTo, HotBot, MSN Search, and Snap support the domain: syntax, which enables you to specify the domain to include or exclude. AltaVista requires the term host: while Go requires site:. To perform a site search with Lycos, you must access a menu on the advanced search page.
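
Including or excluding a domain comes down to comparing the host name of each result with the domain of interest. The sketch below uses Python's standard urllib.parse module; the URLs and the in_domain helper are made up for illustration.

```python
from urllib.parse import urlparse

def in_domain(url, domain):
    """True if the URL's host is the given domain or falls under it (for example, .edu)."""
    host = urlparse(url).hostname or ""
    domain = domain.lstrip(".").lower()
    return host == domain or host.endswith("." + domain)

# Hypothetical search results used only for this illustration.
results = [
    "http://www.example.edu/pumps",
    "http://plant.example.com/pumps",
    "http://www.example.co.uk/pumps",
]

only_edu = [u for u in results if in_domain(u, ".edu")]           # include a domain
without_uk = [u for u in results if not in_domain(u, ".uk")]      # exclude a domain
print(only_edu, without_uk)
```

Checking for a leading dot before the suffix keeps a host such as notedu.com from being mistaken for a member of the .edu domain.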

Wildcards

The asterisk (*) can be used as a wildcard for searches or certain other data operations. Wildcards are used to search for plurals or variations of words. They also come in handy if you are unsure of the exact spelling of a word.

AOL Search, AltaVista, HotBot, MSN Search, Northern Light, Snap, and Yahoo! support wildcard searches and use the * operator. Excite, Google, GoTo, Go, LookSmart, Lycos, and WebCrawler do not support the use of wildcards.
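
The effect of the asterisk can be tried out with Python's standard fnmatch module, which matches text against patterns containing *. The word list is invented for illustration, and fnmatch also gives special meaning to ? and square brackets, which the search engines above may not.

```python
from fnmatch import fnmatchcase

# Hypothetical indexed words used only for this illustration.
words = ["engine", "engineer", "engineering", "england", "pump", "pumps"]

# "engine*" matches the word itself plus any variation that begins with it.
variations = [w for w in words if fnmatchcase(w, "engine*")]

# "pump*" picks up the plural as well as the singular.
plurals = [w for w in words if fnmatchcase(w, "pump*")]

print(variations, plurals)
```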

Search engine showdown

This table compares the popular search engines and lists accepted Boolean operators, defaults, case sensitivity, and other important parameters.

| Search engine | Boolean | Default | Proximity | Truncation | Case sensitive | Fields | Limits | Stop words | Sorting |
|---|---|---|---|---|---|---|---|---|---|
| All the Web | +, - | AND | Phrase | No | No | Title, URL, link, more | Language, domains | No | Relevance |
| Google | -, OR | AND | Phrase | No | No | Intitle, inurl, more | Language, domain | Yes, + searches | Relevance, on citation |
| Lycos | +, - | AND | Phrase | No | No | Title, URL, link, more | Language, domain | No | Relevance |
| Northern Light | AND, OR, NOT, (), +, - | AND | Phrase | Yes * %, auto plurals | No | Title, URL, more | Doc type, date, more | No | Custom folders, date |
| iWon | AND, OR, NOT, (), +, - | AND | Phrase | Yes * ? | Yes | Title, link, domain | Date | Yes | Relevance, site |
| AltaVista Simple | +, - | <5: AND; >4: OR | Phrase | Yes * | Yes | Title, URL, link, more | Language | Yes | AskJeeves, RealNames, Relevance |
| AltaVista Adv. | AND, OR, AND NOT, () | Phrase | Phrase, near | Yes * | Yes | Title, URL, link, more | Language, date | No | Relevance, if used |
| HotBot | AND, OR, NOT, (), +, - | AND | Phrase | Yes * | Yes | Title, more | Language, date, more | Yes | Relevance, site |
| NBCi | AND, OR, NOT, (), +, - | AND | Phrase | Yes * | Yes | Title, more | Language, date, more | Yes | Relevance |

The smaller search engines

| Search engine | Boolean | Default | Proximity | Truncation | Case sensitive | Fields | Limits | Stop words | Sorting |
|---|---|---|---|---|---|---|---|---|---|
| Excite | AND, OR, NOT, (), +, - | OR | Phrase | No | No | No | Language, domain | Yes | Relevance, site |
| Magellan | AND, OR, NOT, (), +, - | OR | Phrase | No | No | No | No | Yes | Relevance |
| WebCrawler | AND, OR, OR NOT, (), +, - | OR | Phrase, near, adj | No | No | No | No | Yes | Relevance |

(Courtesy of Greg R. Notess. Used with permission.)

 

Popular general-purpose search engines

The following is a list of popular general-purpose search engines and web directories.

  • AltaVista—altavista.com
  • Dogpile—dogpile.com
  • Excite—excite.com
  • FAST—alltheweb.com
  • Go—go.com
  • Google—google.com
  • HotBot—hotbot.com
  • Intelliseek—profusion.com
  • LookSmart—looksmart.com
  • Lycos—lycos.com
  • Magellan—magellan.excite.com
  • Mamma—mamma.com
  • Matilda—aaa.com.au
  • Metacrawler—metacrawler.com
  • Northern Light—northernlight.com
  • Open Directory Project—dmoz.org
  • Search.com—search.com
  • Snap—snap.com
  • WebCrawler—webcrawler.com
  • Yahoo!—yahoo.com