Using a Search Engine -- Set Theory for Beginners

Introduction, Set Theory and Logic, What is a Set, New Terms, Union, Intersection

ANDs and ORs, De Morgan's Laws, Search Engine Links


One soon finds out that offering a single word to a Web Search Engine results in being buried under a mountain of information, most of it useless. Searches generally have to be refined in order to peel away extra irrelevant information. However, each Web Search Engine uses its own syntax for refining a search. A good way to cope with this is to learn how to use set theory in your own thinking about the refinement process. Then you can translate the rules defining the sets of pages in which you are interested into the syntax of the engine you are using. This page explains set theoretic concepts and shows how they would be used in a search.

Top of Page, Back

Set Theory, Logic, and Searching

It helps to have a vocabulary to use in thinking about problems. For example, one uses a search engine to retrieve a SET of pages that meet a criterion or rule we give it. How do we go about thinking through what the search engine is doing? Why should we learn about what it is doing? To use it most effectively, we need to have a kind of image of the way in which it works. That way, we can (hopefully) make it do what we want done.

The purpose of this section is to help those learning how to use a search engine. The languages of logic and set theory serve to describe what search engines do. HOW they operate is another story, one which does not concern us here. Search engine designers all seem to have worked out their own individual languages for users to employ in "talking to" the search engine and telling it what they want it to do.

A Set What is common to all search engines is that they operate in terms of SETS of pages. Learning the language of SETS, and their UNIONS and INTERSECTIONS permits you to work effectively with any of the search engines.

A search can be visualized in terms of SETS of pages, and the UNIONS and INTERSECTIONS of these sets. A page is identified by its URL, its Universal Resource Locator, address. We shall refer to a set of pages as a set of URLs. Using Venn diagrams can help to clarify the way in which an "advanced" search works. We shall use them here.

Top of Page , Back

What is a Set?

The idea of a SET of objects, a CLASS of things, is one familiar to almost everyone. Some examples are: a herd of horses, a flotilla of ships, and a deck of cards. Sets are given NAMES --- the H-BAR-H herd, the Normandy Invasion Fleet, the Yahoo collection of URL's. We shall be concerned with SETS of Web Pages.

Sometimes sets are defined by enumerating the list of objects comprising them. For example:

A = (Alex, Bob, Carl).

More commonly they are defined by a rule that everyone can understand. For example,

"all of the URLs of Web Pages in Yahoo's collection that have the word "rural" on them."

A Set

A rule includes objects in a set because the objects all appear to have some property in common that defines them as being of interest to the observer. Sets of Web pages can be defined in terms of what words appear on them. For example, the set of pages containing the word "dinosaur", or the set of pages which have the phrase "chamber music" on them.

In set A, above, there were three elements, Alex, Bob, and Carl. If N is the set of boys in set A whose names begin with either A or B, then N is a SUBSET of A, since every member of N is in A. Since there is at least one element in A which is not in N (Carl), we call N a PROPER subset of A.

Let G be defined as "all the girls who are in set A". But there are no girls in A, so G = (null), the NULL set, sometimes called the EMPTY set. This empty set is always a subset of every set. It is important to note that the empty set is not the same as zero (which is actually an element in the set of numbers).

The UNIVERSE of objects of the type we are talking about is referred to as S. For our purposes this will generally be all of the URLs in the collection being searched. We note here that the collection being searched will not necessarily be all of the URLs accessible to the engine. For instance, Yahoo has already classified its URLs according to a subject matter scheme. You can choose to search just one subject matter area.

Top of Page, Back

Some New Terms

Let us define some very specific, new terms.

The Union of Two Sets

Union Let U denote the UNION of two sets. The UNION consists of those items which meet the inclusion rule of either set or both sets. It only omits those which meet neither rule. Think of it as the "OR" of the two sets. It is denoted R OR M.

To get R OR M, the UNION of R and M, one takes all the URLs that have the word "male" on them and, from the remainder, ADDS those which have the word "rural" on them. Some URLs can be in R and also in M. Others are just in R. Still others are only in M. All three are included in the UNION. Finally, some URLS are in neither. They are excluded. If you want, you can start with the "rurals" and then add the "males".

Top of Page, Back

The Intersection of Two Sets

Let I be the INTERSECTION of two sets. The INTERSECTION consists of only those items which are present in BOTH sets. It leaves out those which meet only one of the inclusion rules. It leaves out those which meet neither rule. Think of it as the "AND" of the two sets. I is denoted R AND M.

Intersection For two sets to be DISJOINT, there can be no items which are in both sets. In our example the sets R and M are not DISJOINT.

The INTERSECTION of two DISJOINT sets is null.

Top of Page , Back

ANDs and ORs

Since EACH of our set definition rules is, by itself, exhaustive, we have:

A setLet's look at the Venn diagrams again, which illustrate the two partially overlapping subsets of our file S. The first shows the set R of "rural" URLs as the shaded area.

The partial overlap of the set of the "rural" URLs with those having the word "male" on them can be seen in the diagram.

ComplementIn the second diagram, R', the COMPLEMENT or R, is shown. There are some URLs that have the word "male", and which do NOT have the word "rural" that are part of R' Other "male" URL's are in the set R.

The third diagram displays the UNION of the two subsets. Some URLs have both characteristics and some have only one.

The fourth diagram shows the INTERSECTION. This figure shows each region of the diagram expressed as the intersection of a subset or its complement with the corresponding subset or its complement


Diagram five shows the four different parts of a Venn display of two overlapping subsets; those that are just in R, those that are in M, those that are in both, and those which are in neither.Parts

It is important to note that set R and set M are not disjoint (mutually exclusive).
Rather, they are partially overlapping; that is, the result R AND M is not (null).

Put into English, the set of URLs with the words "rural" AND the word "male" in them is not empty.
If two sets are disjoint, however, their intersection is the null set.

The UNION and INTERSECTION operators are COMMUTATIVE. The UNION of R and M is the same as the UNION of M and R.

Similarly, the INTERSECTION of R and M is the same as the INTERSECTION of M and R. Symbolically: parts

  • R OR M = M OR R
  • R AND M = M AND R
  • With three sets, these relations still hold.

    One can see that the UNION operator behaves something like the logical operator OR, and something like addition, since the areas comprising the two subsets are added together.

    The INTERSECTION operator is something like AND, and, although it is a little less obvious, has something in common with multiplication. The UNION and INTERSECTION operators are commutative and associative. But, unlike multiplication and addition with numbers, the set operators have two distributive laws, not one. Assume T is the set of URL's that have the word "tiger" on them:

  • (R OR M) OR T = R OR (M OR T)
  • (R AND M) AND T = R AND (M AND T)
  • In English, the first of these asserts that combining the set of URLs which have either the word "rural" or the word "male" with those having the word "tiger" is the same as combining the set of "rural" URLs with those having either the word "male" or the word "tiger" on them.

    In the second case, first selecting out the set of URLs having both "male" and "rural" on them and from them selecting those having "tiger" on them, gives you the same end result as if you took the "rurals" first and from them selected those that were both "male" and "tiger".

    The usefulness of the diagrams and the symbols should be clear to anyone who has tried diligently to follow the imprecise English formulation of this example. In combining sets, you do the operations inside the parentheses first.

    The empty set (null) acts like a zero, and the universal set S (which in this example is the entire corpus of URLs being searched) acts like the number 1, but the analogy is not perfect.

  • R OR (null) = R (rural + nothing = rural)
  • R AND S = R ( rural items in the S collection = themselves)
  • R AND (null) = (null) (rural items, that are also members of some other set that is empty, form an empty set).
  • R OR S = S (rural items plus the whole collection = the whole collection).
  • R OR R' = S (the union of R with its complement is everything)
  • R AND R' = (null)
  • In the collection, every URL is either "rural" or not "rural" so "rural" + not "rural" = everything. There is no URL which is both "rural" and not "rural.

    A set of retrieved documents is defined by a RULE. A rule is a proposition, or assertion about the nature of the words that are contained on the page comprising the document. The set is comprised of those URLs for whom the proposition is true.

    The logic is easily extended. A rule is written for each desired subset. These are combined using UNIONs and INTERSECTIONs to provide an expression that retrieves at least approximately the documents desired. COMPLEMENTS are retrieved using the negation operator, NOT.

    Top of Page , Back

    De Morgan's Laws

    DeMorgan's laws help us to simplify a complex set description. They are:

    The complement of the intersection of two sets is the union of the complements, taken separately. The complement of the union of two sets is the intersection of their separate complements.

    In actual practice, one needs to "OR" the various synonyms for each concept. Then, one must "AND" concepts together in order to narrow down the search to just the configuration being sought.

    Top of Page , Back

    Now, go try a few search engines!

    [Yahoo] [Lycos] [Excite] [Webcrawler] [Infoseek]

    If you have comments or suggestions, you can email me at sonquist@sscf.ucsb.edu

    This page created with Netscape Navigator Gold. Maintained by J.A.Sonquist. Last updated 1/11/2000