Internet searches made simple

By Jack Smith, Senior Editor, Plant Engineering Magazine, June 1, 2001

There are hundreds of millions of web pages accessible via the internet. Some pages are easy to find; others are not. Unless you are given a specific uniform resource locator (URL), or are extremely lucky, you must search through the masses for the pages that contain the information you need. That search requires a search engine.

Search engines are either general-purpose commercial services or site specific. Site-specific search uses database-querying tools or employs a general-purpose external search engine. Regardless of the type, knowing how search engines work enables you to make the most of these powerful tools.

How do search engines locate information? How can you create searches that provide the information you seek? This article provides the answers.

How search engines work

A search engine uses special software tools, commonly referred to as robots or spiders, to assemble lists of the words found on web sites. Web crawling is the term for the process by which a spider builds its lists.

Spiders usually begin by looking at lists of heavily used servers and popular pages. Beginning with a popular site, a spider indexes the words on its pages and follows every link found within the site. The spider spreads out quickly across the most widely used areas of the web, much like a true spider weaves its web.
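The crawling process described above can be sketched in a few lines of Python. This is only an illustration, not how any particular search engine is built: the miniature "web" (TOY_WEB), its addresses, and the function name crawl are all hypothetical, and a real spider would fetch pages over the network.

```python
from collections import deque

# Hypothetical miniature "web": address -> (page text, links found on the page).
TOY_WEB = {
    "popular.example/home": ("plant engineering news",
                             ["popular.example/cmms", "other.example/"]),
    "popular.example/cmms": ("cmms software guide", ["popular.example/home"]),
    "other.example/": ("maintenance tips", []),
}


def crawl(start_url, web):
    """Start at one page, list its words, then follow every link found."""
    word_lists = {}
    queue = deque([start_url])
    seen = {start_url}
    while queue:
        url = queue.popleft()
        text, links = web[url]
        word_lists[url] = text.split()
        for link in links:
            # Visit each known page only once, even if many pages link to it.
            if link in web and link not in seen:
                seen.add(link)
                queue.append(link)
    return word_lists


pages = crawl("popular.example/home", TOY_WEB)
```

Starting from the popular page, the sketch reaches every page linked from it, directly or indirectly, and records one word list per page.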

For example, when the spider for Google.com, which began as an academic search engine, looked at an HTML page, it made a list of the words within the page and noted where each word was found. Words in positions of relative importance, such as the title, subtitles, and meta-tags, were noted for special consideration during subsequent user searches. This spider was designed to index every significant word on a page, leaving out the articles “a,” “an,” and “the.”
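A minimal sketch of this kind of word list follows. It is an assumption-laden illustration rather than the actual spider: the function name index_page is hypothetical, and only two locations (title and body) are distinguished, whereas the spider described above also noted subtitles and meta-tags.

```python
import re

ARTICLES = {"a", "an", "the"}


def index_page(title, body):
    """Map each significant word to the places where it occurs."""
    entries = {}
    for field, text in (("title", title), ("body", body)):
        words = re.findall(r"[a-z0-9]+", text.lower())
        for position, word in enumerate(words):
            if word in ARTICLES:
                continue  # the articles are not indexed
            entries.setdefault(word, []).append((field, position))
    return entries


idx = index_page("The Pump Handbook",
                 "A pump moves the fluid. The pump needs an inspection.")
```

Each entry records both the field and the word position, so a later search can give extra weight to a word found in the title.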

Other spiders work differently. Some approaches are devised to make the spider operate faster, allow more efficient searches, or both. The Lycos spider keeps track of words in the title, subheads, links, and the 100 words used most frequently on the page, along with every word in the first 20 lines of text. AltaVista indexes every word on a page, including the articles.
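The selective approach attributed to Lycos above can be approximated as follows. This is a rough sketch under stated assumptions: the function name selective_terms and its parameters are hypothetical, and subheads and links are left out for brevity.

```python
from collections import Counter


def selective_terms(title, body_lines, top_n=100, first_lines=20):
    """Keep title words, the most frequent words, and words in the opening lines."""
    words = [w.lower() for line in body_lines for w in line.split()]
    frequent = {w for w, _ in Counter(words).most_common(top_n)}
    early = {w.lower() for line in body_lines[:first_lines] for w in line.split()}
    title_words = {w.lower() for w in title.split()}
    return title_words | frequent | early


# Small limits are used here only to make the effect visible on a tiny page.
terms = selective_terms("Pump Guide",
                        ["seal leak", "bearing bearing wear"],
                        top_n=1, first_lines=1)
```

With these small limits, the word "wear" is dropped because it is neither in the title, nor among the most frequent words, nor in the opening line.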

Other systems place importance on meta-tags. These identifiers allow page owners to specify key words and concepts that aid in indexing the page. When a word has several possible meanings, meta-tags can guide the search engine toward the intended one.

The task of finding information on web pages is never actually completed. Since the web is always changing, the spiders are always crawling. Regardless, the search engine must store the retrieved information in a useful way. The factors important to making gathered data accessible to users are the information stored with the data and the method by which the information is indexed.

A search engine would be of limited use if it merely stored the word and the URL where it was found. So, other factors must also be weighed.

Was the word used in an important or trivial way? How many times was the word used? Was it used only once on the page? Does the page contain links to other pages containing the word? The answers allow the search engine to “rank” the most useful pages at the top of the search results list.
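One simple way to turn these questions into a ranking is sketched below. The weighting is purely an assumption for illustration: a title occurrence counts five times as much as an occurrence in the body, and the names score, rank, and title_weight are hypothetical.

```python
def score(page, term, title_weight=5):
    """Count occurrences of the term, giving title occurrences extra weight."""
    term = term.lower()
    body_hits = page["body"].lower().split().count(term)
    title_hits = page["title"].lower().split().count(term)
    return body_hits + title_weight * title_hits


def rank(pages, term):
    """Return the addresses of matching pages, most useful first."""
    scored = [(score(p, term), p["url"]) for p in pages]
    ordered = sorted(scored, key=lambda item: -item[0])
    return [url for points, url in ordered if points > 0]


PAGES = [
    {"url": "b", "title": "Seals", "body": "pump seal pump pump"},
    {"url": "a", "title": "Pump basics", "body": "pump pump seal"},
    {"url": "c", "title": "Valves", "body": "valve"},
]
ranking = rank(PAGES, "pump")
```

Page "a" uses the word less often than page "b" but has it in the title, so it is listed first; page "c" never uses the word and is left out.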

An index allows information to be found as quickly as possible. To build an index, a formula assigns a numerical value to each word. The formula is designed to distribute the entries evenly across a predetermined number of divisions, so the distribution differs from the uneven distribution of words across the alphabet. This process is called hashing, and the result is a hash table.

Because more words begin with some letters than with others, an alphabetical index would take longer to find a word with a frequently used initial letter than a word with an infrequently used one. Hashing evens out this difference, which reduces the average amount of time necessary to locate an entry. It also separates the index from the actual entry.

The hash table contains the calculated hash number, along with a pointer to the actual data. These data can be sorted to maximize storage efficiency. This indexing/storage combination enables quick results, even with complicated searches.
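The idea can be sketched as a small hash table in Python. This is an illustration only: the class name HashIndex, the use of a CRC-32 checksum as the hashing formula, and the choice of eight divisions are all assumptions, not details from any real search engine.

```python
import zlib


class HashIndex:
    """Tiny hash table: the index holds pointers, the data is stored separately."""

    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]
        self.records = []  # the actual data, kept apart from the index

    def _bucket(self, word):
        # The formula: turn the word into a number, then pick a division.
        return zlib.crc32(word.encode("utf-8")) % len(self.buckets)

    def add(self, word, urls):
        self.records.append(urls)
        pointer = len(self.records) - 1
        self.buckets[self._bucket(word)].append((word, pointer))

    def lookup(self, word):
        # Only one division is examined, regardless of the word's first letter.
        for stored, pointer in self.buckets[self._bucket(word)]:
            if stored == word:
                return self.records[pointer]
        return None


index = HashIndex()
index.add("pump", ["a.example", "b.example"])
index.add("seal", ["c.example"])
```

A lookup computes the same number again and checks a single division, so the time needed does not depend on how many other words share the same initial letter.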

Basic searches

To search through an index, you must first build a query. A query can be as simple as a single word, or it can be a complicated combination of words and operators.

To use a search engine, such as Yahoo! or Lycos, navigate to its opening page. The opening page for Yahoo! is either www.yahoo.com or just yahoo.com; your browser should take you there whether or not you enter the www. Access Lycos by typing either www.lycos.com or lycos.com into the address window of your browser.

When the search engine portal appears, type your search word or phrase into the search window. Typing the phrase “Plant Engineering magazine” into the Lycos search window and then clicking the search button produced four web sites based on user selection traffic. However, a search of the complete Lycos catalog, or index, found 95,872 web pages.

No plant engineer, however efficient, has the time to visit more than 90,000 hits. It quickly becomes necessary to limit your search.

Most search engines allow you to narrow your search. In Lycos, click the checkbox labeled “search these results” to “drill down” within your search criteria.

For example, checking this box, typing “Information Engineering” into the search box, and clicking the search button returned 447 sites from the entire Lycos catalog. That is a far cry from 95,872, but probably still not a manageable number. In an effort to limit the number of hits, quotation marks were included around the search subject “Information Engineering” to force Lycos to return only pages that contain the specific phrase in the exact order within the quotes.

Basically, a word or a phrase can initiate a simple search. And simple searches can be useful. However, to be effective, sometimes a complex or advanced search is necessary.

Boolean operators

Building a complex query requires the use of Boolean operators. Boolean operators allow you to refine and extend the terms of the search.

Several operators are commonly used.

  • AND—Any terms joined by AND must appear in the pages or documents. Some search engines substitute the operator “+” for the word AND.

  • OR—At least one of the terms joined by OR must appear in the pages or documents.

  • NOT—The term or terms following NOT must not appear in the pages or documents. Some search engines substitute the operator “-” for the word NOT.

  • FOLLOWED BY—One of the terms must be directly followed by the other.

  • NEAR—One of the terms must be within a specified number of words of the other.

  • Quotation Marks—The words between the quotation marks are treated as a phrase, and that phrase must be found within the document or file in the exact order.
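The position-based operators in this list (FOLLOWED BY, NEAR, and quotation marks) can be illustrated with a short sketch that works on a page's word list. The function names are hypothetical, and real search engines differ in how they define the distance for NEAR.

```python
def positions(words, term):
    """Return every position at which the term occurs."""
    return [i for i, w in enumerate(words) if w == term]


def followed_by(words, first, second):
    """FOLLOWED BY: the second term comes directly after the first."""
    return any(i + 1 < len(words) and words[i + 1] == second
               for i in positions(words, first))


def near(words, a, b, distance):
    """NEAR: the two terms lie within the given number of words of each other."""
    return any(abs(i - j) <= distance
               for i in positions(words, a)
               for j in positions(words, b))


def phrase(words, terms):
    """Quotation marks: the terms appear together in the exact order."""
    n = len(terms)
    return any(words[i:i + n] == terms for i in range(len(words) - n + 1))


WORDS = "plant engineering magazine covers information engineering".split()
```

For this word list, phrase(WORDS, ["information", "engineering"]) is true, while the same two words in the reverse order do not match.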

The “+” symbol is especially helpful when a search leaves you overwhelmed with information.

Using Yahoo!, the following was typed into the search window:

plant engineering magazine+information engineering+cmms/eam

Yahoo! returned 10 web site matches, most of which were relevant (Fig. 2). This number is far more usable than the 95,872 pages, or even the 447 sites, returned by searches that were not limited by Boolean operators.

Often, you need to find pages that contain one word but not another. The “-” symbol allows this type of search.

Using Yahoo!, the following was typed into the search window:

plant engineering magazine+information engineering+eam-cmms

Yahoo! returned nine web site matches, which included the “EAM” term and excluded the “CMMS” term. To get the best results from the “-” operator, eliminate terms you know are not of interest.
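Behind the scenes, AND, OR, and NOT (or “+” and “-”) amount to combining sets of pages. The sketch below shows the idea with a made-up index; the page numbers, the terms, and the function name search are hypothetical, and it does not reproduce how Yahoo! or any other engine parses a query.

```python
# Hypothetical index: term -> set of page numbers that contain it.
INDEX = {
    "eam": {1, 2, 3},
    "cmms": {2, 4},
    "maintenance": {1, 2, 4},
}


def search(index, required=(), excluded=(), optional=()):
    """Combine page sets: required = AND/+, optional = OR, excluded = NOT/-."""
    result = set().union(*index.values())  # start with every known page
    for term in required:
        result &= index.get(term, set())
    if optional:
        result &= set().union(*(index.get(t, set()) for t in optional))
    for term in excluded:
        result -= index.get(term, set())
    return result


eam_without_cmms = search(INDEX, required=["eam"], excluded=["cmms"])
```

Requiring "eam" and excluding "cmms" leaves pages 1 and 3, which mirrors the effect of the “-” search described above.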

      Advanced searches

      For most users, general searching techniques using Boolean or symbol operators are sufficient. However, if more searching power is required, the following commands are useful.

      MATCH ANY

      Occasionally you want web pages that contain any of the terms for which you are searching. Some search engines do this automatically. It is not necessary to enter a special operator. Those search engines include AltaVista, Excite, GoTo, Go, LookSmart, Netscape, Snap, WebCrawler, and Yahoo!. AOL Search, HotBot, Lycos, and MSN Search have “match any” as a menu item adjacent to the search window. You must use the Boolean operator OR with Northern Light. Google does not support the MATCH ANY command.

      Most search engines automatically list pages that contain all your search terms first, followed by pages that contain only some of them.

      MATCH ALL

      MATCH ALL returns only web pages that contain all your search terms. The search engines for which MATCH ALL is automatic include AOL Search, Google, HotBot, Lycos, MSN Search, and Northern Light. You must use the “+” operator with all other engines. Almost all the major search engines support the “+” operator as a command.
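The difference between the two commands, and the usual ordering of results, can be sketched as follows. The function names and the sample pages are hypothetical.

```python
def match_all(page_words, terms):
    """MATCH ALL: every search term must be on the page."""
    return all(t in page_words for t in terms)


def match_any(page_words, terms):
    """MATCH ANY: at least one search term must be on the page."""
    return any(t in page_words for t in terms)


def order_results(pages, terms):
    """List pages with more of the terms first; drop pages with none."""
    counted = [(sum(t in words for t in terms), url)
               for url, words in pages.items()]
    ordered = sorted(counted, key=lambda item: -item[0])
    return [url for hits, url in ordered if hits]


SAMPLE = {
    "a": {"pump"},
    "b": {"pump", "seal"},
    "c": {"valve"},
}
results = order_results(SAMPLE, ["pump", "seal"])
```

Page "b" contains both terms and is listed ahead of page "a", which contains only one; page "c" contains neither and is omitted.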

      Title search

      Many search engines allow you to search within a web page’s HTML title. For example, this page has an HTML title similar to this:

      <title>Internet searches made simple</title>

      There are several ways to execute a title search, depending on the search engine used. AltaVista, GoTo, HotBot, Go, MSN Search, Northern Light, and Snap require the command “title:” in the search window. The colon is a required part of the command; an identifying word, phrase, or the entire title follows it. Yahoo! requires “t:” instead of “title:”. The Lycos title search is available from a menu on its advanced search page. AOL Search, Excite, Google, LookSmart, Netscape, and WebCrawler do not support searching for HTML titles.
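A title search can be sketched with Python's standard html.parser module. The helper names (TitleParser, html_title, title_matches) are hypothetical, and the sketch handles only the “title:” form of the command.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text found between <title> and </title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def html_title(html):
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip()


def title_matches(html, query):
    """Check a query such as 'title:searches' against the page's HTML title."""
    prefix = "title:"
    if not query.lower().startswith(prefix):
        return False
    wanted = query[len(prefix):].strip().lower()
    return wanted in html_title(html).lower()


PAGE = ("<html><head><title>Internet searches made simple</title></head>"
        "<body>Search engines explained</body></html>")
```

The word "engines" appears in the body of this sample page but not in its title, so a title search for it finds nothing.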

      Site search

      Sometimes you may want to control which sites are included or excluded from a search. This ability is a powerful search engine feature.

      This feature allows you to:

    • See all the pages indexed from a specific domain

    • See all the pages indexed from a specific domain that contain a word or phrase

    • Use include and exclude commands along with specific domain searches

    • Include or exclude domains such as .edu for educational institutions, .gov for governmental, .org for organizational, or .us for domains located in the United States (or the appropriate code for any other country). Each country has a unique suffix. For example, the suffix for the United Kingdom is .uk.

      GoTo, HotBot, MSN Search, and Snap support the “domain:” syntax, which lets you specify a domain to include or exclude. AltaVista requires “host:”, while Go requires “site:”. To perform a site search with Lycos, use the menu on the advanced search page.
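Including or excluding domains comes down to comparing the end of each page's host name. The sketch below uses Python's standard urllib.parse module; the function names and the sample addresses are hypothetical.

```python
from urllib.parse import urlparse


def in_domain(url, domain):
    """True if the address belongs to the domain (for example 'edu' or 'example.gov')."""
    host = (urlparse(url).hostname or "").lower()
    domain = domain.lower().lstrip(".")
    return host == domain or host.endswith("." + domain)


def site_filter(urls, include=None, exclude=()):
    """Keep addresses inside the included domain and outside the excluded ones."""
    kept = []
    for url in urls:
        if include and not in_domain(url, include):
            continue
        if any(in_domain(url, d) for d in exclude):
            continue
        kept.append(url)
    return kept


URLS = [
    "http://www.example.edu/a",
    "http://lab.example.gov/b",
    "http://shop.example.co.uk/c",
]
only_edu = site_filter(URLS, include=".edu")
```

Matching on a leading dot keeps a host such as "notexample.edu.evil.com" from being mistaken for an .edu address.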

        Wildcards

        The asterisk (*) can be used as a wildcard in searches. Wildcards are used to search for plurals or other variations of a word. They also come in handy if you are unsure of the exact spelling of a word.

        AOL Search, AltaVista, HotBot, MSN Search, Northern Light, Snap, and Yahoo! support wildcard searches and use the * operator. Excite, Google, GoTo, Go, LookSmart, Lycos, and WebCrawler do not support the use of wildcards.
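Wildcard matching can be tried out with Python's standard fnmatch module, which treats * as "any run of characters." The function name wildcard_search is hypothetical; note also that fnmatch gives special meaning to ? and square brackets, which a search engine may not.

```python
from fnmatch import fnmatchcase


def wildcard_search(words, pattern):
    """Return the words that match a pattern containing the * wildcard."""
    pattern = pattern.lower()
    return [w for w in words if fnmatchcase(w.lower(), pattern)]


VOCABULARY = ["engine", "engineer", "engineers", "engineering", "plant"]
matches = wildcard_search(VOCABULARY, "engineer*")
```

The pattern "engineer*" picks up the singular, the plural, and "engineering" in one search, but not "engine."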

Search engine showdown

This table compares the popular search engines and lists accepted Boolean operators, defaults, case sensitivity, and other important parameters. (Courtesy of Greg R. Notess. Used with permission.)

Search engine | Boolean | Default | Proximity | Truncation | Case | Fields | Limits | Stop words | Sorting
All the Web | +, - | AND | Phrase | No | No | Title, URL, link, more | Language, domains | No | Relevance
Google | -, OR | AND | Phrase | No | No | Intitle, inurl, more | Language, domain | Yes, + searches | Relevance, on citation
Lycos | +, - | AND | Phrase | No | No | Title, URL, link, more | Language, domain | No | Relevance
Northern Light | AND, OR, NOT, (), +, - | AND | Phrase | Yes * %, auto plurals | No | Title, URL, more | Doc type, date, more | No | Custom folders, date
iWon | AND, OR, NOT, (), +, - | AND | Phrase | Yes * ? | Yes | Title, link, domain | Date | Yes | Relevance, site
AltaVista Simple | +, - | <5: AND; >4: OR | Phrase | Yes * | Yes | Title, URL, link, more | Language | Yes | Relevance, AskJeeves, RealNames
AltaVista Adv. | AND, OR, AND NOT, () | Phrase | Phrase, near | Yes * | Yes | Title, URL, link, more | Language, date | No | Relevance, if used
HotBot | AND, OR, NOT, (), +, - | AND | Phrase | Yes * | Yes | Title, more | Language, date, more | Yes | Relevance, site
NBCi | AND, OR, NOT, (), +, - | AND | Phrase | Yes * | Yes | Title, more | Language, date, more | Yes | Relevance

The smaller search engines

Search engine | Boolean | Default | Proximity | Truncation | Case | Fields | Limits | Stop words | Sorting
Excite | AND, OR, NOT, (), +, - | OR | Phrase | No | No | No | Language, domain | Yes | Relevance, site
Magellan | AND, OR, NOT, (), +, - | OR | Phrase | No | No | No | No | Yes | Relevance
WebCrawler | AND, OR, OR NOT, (), +, - | OR | Phrase, near, adj | No | No | No | No | Yes | Relevance

        Popular general-purpose search engines

        The following is a list of popular general-purpose search engines and web directories.

        AltaVista—altavista.com

        Dogpile—dogpile.com

        Excite—excite.com

        FAST—alltheweb.com

        Go—go.com

        Google—google.com

        HotBot—hotbot.com

        Intelliseek—profusion.com

        LookSmart—looksmart.com

        Lycos—lycos.com

        Magellan—magellan.excite.com

        Mamma—mamma.com

        Matilda—aaa.com.au

        Metacrawler—metacrawler.com

        Northern Light—northernlight.com

        Open Directory Project—dmoz.org

        Search.com—search.com

        Snap—snap.com

        WebCrawler—webcrawler.com

        Yahoo!—yahoo.com