Introduction, Set Theory and Logic, What is a Set, New Terms, Union, Intersection

One soon finds out that offering a single word to a Web Search Engine results in being buried under a mountain of information, most of it useless. Searches generally have to be refined in order to peel away extra irrelevant information. However, each Web Search Engine uses its own syntax for refining a search. A good way to cope with this is to learn how to use set theory in your own thinking about the refinement process. Then you can translate the rules defining the sets of pages in which you are interested into the syntax of the engine you are using. This page explains set theoretic concepts and shows how they would be used in a search.

**Set Theory, Logic,
and Searching**

It helps to have a vocabulary to use in thinking about problems. For
example, one uses a search engine to retrieve a SET of pages that meet
a **criterion** or **rule** we give it. How do we go about thinking
through what the search engine is doing? Why should we learn about what
it is doing? To use it most effectively, we need to have a kind of image
of the way in which it works. That way, we can (hopefully) make it do what
we want done.

The purpose of this section is to help those learning how to use a search
engine. The languages of logic and set theory serve to describe what search
engines do. HOW they operate is another story, one which does not concern
us here. Search engine designers all seem to have worked out their own
individual languages for users to employ in "talking to" the
search engine and telling it what they want it to do.

What is common to all search engines is that they operate in terms of SETS of
pages. Learning the language of SETS, and their UNIONS and INTERSECTIONS
permits you to work effectively with any of the search engines.

A search can be visualized in terms of SETS of pages, and the UNIONS and INTERSECTIONS of these sets. A page is identified by its URL, its Universal Resource Locator, address. We shall refer to a set of pages as a set of URLs. Using Venn diagrams can help to clarify the way in which an "advanced" search works. We shall use them here.

The idea of a SET of objects, a CLASS of things, is one familiar to
almost everyone. Some examples are: a **herd** of horses, a **flotilla**
of ships, and a **deck** of cards. Sets are given NAMES --- the H-BAR-H
herd, the Normandy Invasion Fleet, the Yahoo collection of URL's. We shall
be concerned with SETS of Web Pages.

Sometimes sets are defined by enumerating the list of objects comprising
them. For example:

A = (Alex, Bob, Carl).

More commonly they are defined by a **rule** that everyone can understand.
For example,

"all of the URLs of Web Pages in Yahoo's collection that have the word "rural" on them."

A rule includes objects in a set because the objects all appear to have
some **property in common** that defines them as being of interest to
the observer. Sets of Web pages can be defined in terms of what words appear
on them. For example, the set of pages containing the word "dinosaur",
or the set of pages which have the phrase "chamber music" on
them.

In set A, above, there were three elements, Alex, Bob, and Carl. If
N is the set of boys in set A whose names begin with either A or B, then
N is a **SUBSET** of A, since every member of N is in A. Since there
is at least one element in A which is not in N (Carl), we call N a **PROPER**
subset of A.

Let G be defined as "all the girls who are in set A". But
there are no girls in A, so G = (null), the **NULL** set, sometimes
called the EMPTY set. This empty set is always a subset of every set. It
is important to note that the empty set is not the same as zero (which
is actually an element in the set of numbers).

The UNIVERSE of objects of the type we are talking about is referred
to as S. For our purposes this will generally be **all of the URLs in
the collection being searched**. We note here that the collection being
searched will not necessarily be all of the URLs accessible to the engine.
For instance, Yahoo has already classified its URLs according to a subject
matter scheme. You can choose to search just one subject matter area.

Let us define some very specific, new terms.

- S = the collection of URLs being searched.
- R = the set of all URLs in S containing the word "rural".
- M = the set of all URLs in S containing the word "male".

- M' = the COMPLEMENT of M -- that is, those NOT containing the word "male" (note that these may or may not contain the word "rural").
- R' = the COMPLEMENT of R -- those URLS NOT containing the word "rural"

(note that these may or may not contain the word "male"). - S' = (null), the COMPLEMENT of S; all the URLs NOT in the collection being searched.

Let U denote the UNION of two sets. The UNION consists of those items which meet the inclusion rule of either set or both sets. It only omits those which meet neither rule. Think of it as the "OR" of the two sets. It is denoted R OR M.

To get R OR M, the UNION of R and M, one takes all the URLs that have the word "male" on them and, from the remainder, ADDS those which have the word "rural" on them. Some URLs can be in R and also in M. Others are just in R. Still others are only in M. All three are included in the UNION. Finally, some URLS are in neither. They are excluded. If you want, you can start with the "rurals" and then add the "males".

Let I be the INTERSECTION of two sets. The INTERSECTION consists of only those items which are present in BOTH sets. It leaves out those which meet only one of the inclusion rules. It leaves out those which meet neither rule. Think of it as the "AND" of the two sets. I is denoted R AND M.

For two sets to be DISJOINT, there can be no items which are in both sets.
In our example the sets R and M are **not** DISJOINT.

The INTERSECTION of two DISJOINT sets is **null**.

Since EACH of our set definition rules is, by itself, exhaustive, we have:

- R AND R' = (null)
- M AND M' = (null)
- R OR R' = S
- M OR M' = S

Let's look at the Venn diagrams again, which illustrate the two partially overlapping subsets of our file S. The first shows the set R of "rural" URLs as the shaded area.

The partial overlap of the set of the "rural" URLs with those
having the word "male" on them can be seen in the diagram.

In
the second diagram, R', the COMPLEMENT or R, is shown. There are some URLs
that have the word "male", and which do NOT have the word "rural"
that are part of R' Other "male" URL's are in the set R.

The third diagram displays the UNION of the two subsets. Some URLs have both characteristics and some have only one.

The fourth diagram shows the INTERSECTION. This figure shows each region of the diagram expressed as the intersection of a subset or its complement with the corresponding subset or its complement

Diagram five shows the four different parts of a Venn display of two overlapping subsets; those that are just in R, those that are in M, those that are in both, and those which are in neither.

It is important to note that set R and set M are not disjoint (mutually
exclusive).

Rather, they are partially overlapping; that is, the result R AND M is
not (null).

Put into English, the set of URLs with the words "rural" AND
the word "male" in them is not empty.

If two sets are disjoint, however, their intersection is the null set.

The UNION and INTERSECTION operators are COMMUTATIVE. The UNION of R and M is the same as the UNION of M and R.

Similarly, the INTERSECTION of R and M is the same as the INTERSECTION of M and R. Symbolically:

With three sets, these relations still hold.

One can see that the UNION operator behaves something like the logical operator OR, and something like addition, since the areas comprising the two subsets are added together.

The INTERSECTION operator is something like AND, and, although it is a little less obvious, has something in common with multiplication. The UNION and INTERSECTION operators are commutative and associative. But, unlike multiplication and addition with numbers, the set operators have two distributive laws, not one. Assume T is the set of URL's that have the word "tiger" on them:

In English, the first of these asserts that combining the set of URLs which have either the word "rural" or the word "male" with those having the word "tiger" is the same as combining the set of "rural" URLs with those having either the word "male" or the word "tiger" on them.

In the second case, first selecting out the set of URLs having both "male" and "rural" on them and from them selecting those having "tiger" on them, gives you the same end result as if you took the "rurals" first and from them selected those that were both "male" and "tiger".

The usefulness of the diagrams and the symbols should be clear to anyone who has tried diligently to follow the imprecise English formulation of this example. In combining sets, you do the operations inside the parentheses first.

The empty set (null) acts like a zero, and the universal set S (which in this example is the entire corpus of URLs being searched) acts like the number 1, but the analogy is not perfect.

In the collection, every URL is either "rural" or not "rural" so "rural" + not "rural" = everything. There is no URL which is both "rural" and not "rural.

A set of retrieved documents is defined by a RULE. A rule is a proposition, or assertion about the nature of the words that are contained on the page comprising the document. The set is comprised of those URLs for whom the proposition is true.

The logic is easily extended. A rule is written for each desired subset. These are combined using UNIONs and INTERSECTIONs to provide an expression that retrieves at least approximately the documents desired. COMPLEMENTS are retrieved using the negation operator, NOT.

DeMorgan's laws help us to simplify a complex set description. They are:

- (M AND R)' = M' OR R'
- (M OR R)' = M' AND R'

The complement of the intersection of two sets is the union of the complements, taken separately. The complement of the union of two sets is the intersection of their separate complements.

In actual practice, one needs to "OR" the various synonyms for each concept. Then, one must "AND" concepts together in order to narrow down the search to just the configuration being sought.

Now, go try a few search engines!

[Yahoo] [Lycos] [Excite] [Webcrawler] [Infoseek]

If you have comments or suggestions, you can email me at *sonquist@sscf.ucsb.edu*

**This page created with Netscape Navigator Gold.**
**Maintained by J.A.Sonquist. Last updated 1/11/2000**

[BACK]