(Taken from the Syracuse University Center for Science & Technology -- http://florin.syr.edu/webarch/searchpro/boolean_tutorial.html)
Don't let the name scare you. Boolean searching is not that difficult to understand; it just takes some getting used to.
There are many ways of searching for documents, papers, books, etc. on a given topic, but most of them require the information to be better organized and more standardized than documents on the web are. As a result, users like you are stuck doing a lot of free-text searching, that is, looking for documents that contain the words you think will be in the document you are seeking.
Most search engines give you the option of entering a boolean search, that is, one that uses AND and OR and looks something like this: "White house" AND (schedule OR calendar) AND (events OR tours). It is very difficult to run successful free-text searches without understanding how to build search strings like this. Without them, you do not have enough power to narrow your search to a reasonable number of potentially useful documents.
Boolean concepts are often explained with Venn diagrams (the circle plots below). The Venn diagrams in this tutorial are meant to represent the information space of the web. A circle with a word in it shows the subset of web documents that contain that concept. When there are two concepts shown in a single diagram, the overlap of the circles represents documents that contain both concepts. For example:
The diagram above shows the information space of documents that talk about dogs and cats. Some documents (the blue region) talk about dogs and not cats, some (the yellow region) talk about cats and not dogs, and some talk about both (the green region).
Basic AND and OR
In a boolean search, AND and OR sometimes seem to mean the opposite of what they mean in everyday language, so try not to think of boolean strings as English sentences.
AND means "I want only documents that contain both words."
OR means "I want documents that contain either word. I don't care which word."
So, going back to the cat and dog diagram: if I want the yellow areas in each of the following diagrams, I would use the search string shown next to it.
documents that talk about dogs
boolean search: dogs
documents that talk about cats
boolean search: cats
documents that talk about both cats and dogs
boolean search: cats AND dogs
documents that talk about either cats or dogs
boolean search: cats OR dogs
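One way to picture these four searches is as set operations: AND is an intersection and OR is a union. The sketch below is only an illustration, assuming a tiny made-up collection in which each document is reduced to the set of words it contains. The document names and the `search` helper are hypothetical and do not come from any real search engine.

```python
# Hypothetical mini collection: document name -> set of words it contains.
DOCS = {
    "doc1": {"dogs", "bones"},
    "doc2": {"cats", "mice"},
    "doc3": {"dogs", "cats"},
    "doc4": {"birds"},
}

def search(word):
    """Return the names of all documents that contain the given word."""
    return {name for name, words in DOCS.items() if word in words}

dogs = search("dogs")    # boolean search: dogs
cats = search("cats")    # boolean search: cats
both = cats & dogs       # boolean search: cats AND dogs (intersection)
either = cats | dogs     # boolean search: cats OR dogs (union)

print(sorted(both))
print(sorted(either))
```

Note that the AND result is never larger than either single-word result, while the OR result is never smaller.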
Parentheses ( )
Before we can go any farther we have to introduce a concept you may have seen in algebra: using parentheses to make sure the computer knows exactly what we mean. If we want to search for more than two concepts, or use both AND and OR in one search, we have to tell the computer what part of the search to execute first. Items contained within parentheses ( ) are always interpreted or executed first.
For example: if I look for articles about using leashes with dogs or using leashes with cats with the search "cats OR dogs AND leashes", I may not retrieve the documents I intend. The computer does not read from left to right the way humans do. In fact it has a completely different way of looking at this search: most search engines interpret the ANDs first, followed by the ORs. So what I would really get out of this search is documents that talk about dogs with leashes along with any document about cats! Why? The diagrams below will demonstrate:
Our information space looks like this:
If we interpret the search "cats OR dogs AND leashes" the way the computer does, we would first AND the circles for dogs and leashes, the yellow area. Then we would OR that resulting area with the circle for cats resulting in documents in both the yellow and orange areas.
To solve this problem we tell the computer that the OR is to be interpreted first. So instead we use the search string "(dogs OR cats) AND leashes". This string would result in the yellow information space below being retrieved.
Even when you think the computer is going to do what you want, it is always safer to use parentheses if there is even a chance of confusion. The parentheses will also help you read your own searches.
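The difference between the two leash searches can be checked with a small sketch. It assumes a made-up collection of five documents and a hypothetical `search` helper. Python's set operators happen to follow the same rule described above: `&` (AND) is applied before `|` (OR) unless parentheses say otherwise.

```python
# Hypothetical mini collection: document name -> set of words it contains.
DOCS = {
    "cat_care": {"cats", "food"},
    "cat_leash": {"cats", "leashes"},
    "dog_leash": {"dogs", "leashes"},
    "dog_food": {"dogs", "food"},
    "horse_leash": {"horses", "leashes"},
}

def search(word):
    """Return the names of all documents that contain the given word."""
    return {name for name, words in DOCS.items() if word in words}

cats, dogs, leashes = search("cats"), search("dogs"), search("leashes")

# "cats OR dogs AND leashes": the AND is evaluated first, then the OR.
without_parens = cats | dogs & leashes

# "(dogs OR cats) AND leashes": the parentheses force the OR first.
with_parens = (dogs | cats) & leashes

print(sorted(without_parens))
print(sorted(with_parens))
```

In this sketch the first search also returns "cat_care", a cat document that never mentions leashes, while the parenthesized search returns only the two leash documents.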
There are times when parentheses are not needed:
- Only using AND: "dogs AND fur AND fleas AND collars AND rugs"
- Only using OR: "fleas OR gnats OR tsetse OR ticks"
Intermediate AND and OR
We are going to get into some more difficult searches now, using some more complex information spaces. Hopefully by the end of this section you will be able to construct searches for almost every situation you encounter in your Web searches. Some of these may seem a little tough at first, but take it slow and study the diagrams. Sometimes it may help to imagine what kind of documents could fall into each part of the diagrams.
With AND, the terms can appear anywhere in the document. Within a long document, a lot of different words will create combinations that are not really discussed in the document. For example, a certain document may have "white barn" in the first paragraph and "red wagons" twenty paragraphs later. If I do a search for "red AND barn", I will get this document even though it has nothing to do with red barns or barns that are painted red.
So instead of the above search, I really want to use "red NEAR barn". This means I will get documents with sentences like "Barry headed down the path to the red barn" or "we took some red clothes out of the barn". The tolerance of NEAR varies by search engine, with a range of 9 to 15 words being typical. Note that this feature is sometimes called WITHIN, but the three engines below all use the syntax NEAR.
- AltaVista: 10 words
- WebCrawler: user controlled ("red NEAR/15 barn")
- Open Text: 80 Characters (letters and spaces)
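To make the idea of NEAR concrete, here is a minimal sketch of a word-based proximity check. It is not how any of the engines above actually implement NEAR. The `near` function, its `max_words` parameter, and the choice to measure distance as the difference between word positions are all assumptions made for illustration.

```python
import re

def near(text, word_a, word_b, max_words=10):
    """Return True if word_a and word_b occur within max_words words of each other.

    Distance is the difference between word positions (an assumption;
    real engines may count differently, e.g. in characters).
    """
    words = re.findall(r"[a-z']+", text.lower())
    positions_a = [i for i, w in enumerate(words) if w == word_a]
    positions_b = [i for i, w in enumerate(words) if w == word_b]
    return any(abs(i - j) <= max_words for i in positions_a for j in positions_b)

close = "Barry headed down the path to the red barn"
# A long document: "barn" at the start, "red" more than 40 words later.
far = "The white barn stood empty. " + "filler " * 40 + "Two red wagons rolled by."

print(near(close, "red", "barn"))
print(near(far, "red", "barn"))
```

Both sample texts contain the two words, so a plain "red AND barn" search would match both. Only the first passes the proximity check.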
Quoted Strings: or how to really narrow your search
In the above red barn example I had the option of replacing red NEAR barn with "red barn". This really would have reduced the number of documents I retrieved. It may have reduced it too much, since there are many ways of writing about red barns.
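A quoted string asks for the words to appear next to each other, in order. The sketch below illustrates that under simple assumptions: text is lowercased and split into words, and the hypothetical `contains_phrase` helper looks for the phrase as a consecutive run of words. Real engines may treat punctuation and word forms differently.

```python
import re

def tokens(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z']+", text.lower())

def contains_phrase(text, phrase):
    """Return True if the words of phrase appear consecutively, in order, in text."""
    words = tokens(text)
    target = tokens(phrase)
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))

print(contains_phrase("Barry headed down the path to the red barn", "red barn"))
print(contains_phrase("we took some red clothes out of the barn", "red barn"))
print(contains_phrase("the barn was painted red", "red barn"))
```

The third sentence is clearly about a red barn, yet the quoted search misses it. That is the sense in which a quoted string can narrow a search too much.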
NOT, use it sparingly
What does NOT do? NOT tells the search engine to throw out any documents that contain that word. This command is usually too powerful to use.
For example: say while I was doing my leash research I kept getting a whole bunch of documents about people who went for walks with their dogs and their llamas. Say also that all of these authors were obsessed with their new llama leashes and never seemed to get around to talking about dog leashes. I may be tempted to change my search to ignore any document that contains the word "llama". But I may be eliminating the very documents I really want to get. What if the foremost expert on leashes always dedicates her papers to her pet llama, or her name is Mildred P. Llama? The NOT directive is completely non-discriminatory; it only takes one single instance of a word to eliminate a document from your retrieved set.
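In set terms, NOT is a difference: every document containing the word is removed, no matter why the word is there. The sketch below assumes three made-up documents and a hypothetical `search` helper that matches whole words only.

```python
# Hypothetical mini collection: document name -> text of the document.
DOCS = {
    "llama_walks": "my dogs and my llama love their new llama leashes",
    "dog_leash_guide": "choosing leashes for large dogs",
    "expert_paper": "a study of dog leashes dedicated to my pet llama",
}

def search(word):
    """Return the names of all documents whose text contains the word."""
    return {name for name, text in DOCS.items() if word in text.split()}

# boolean search: leashes AND NOT llama (set difference)
result = search("leashes") - search("llama")

print(sorted(result))
```

The search does drop "llama_walks", but it also drops "expert_paper", the very document a leash researcher would want, because of a single mention in the dedication.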
Be aware of redundant terms
Sometimes you may find yourself creating searches with redundant terms. They aren't really harmful, but it is important that you understand what is happening, so if the search isn't bringing back the documents you expect, you can edit it properly.
Say I am searching in the information space shown in this diagram:
Say I want documents about collies and cats growing up together, the yellow area shown here:
I may be tempted to use the search "dogs AND cats AND collies", but the word "dogs" is redundant, since all of the documents about collies are also about dogs. The search "cats AND collies" gets me the same yellow area.
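The redundancy can be checked with sets. This sketch assumes made-up document names and relies on the one fact stated above: every document about collies is also about dogs, so the collies set is a subset of the dogs set.

```python
# Hypothetical result sets for each single-word search.
dogs = {"doc1", "doc2", "doc3", "doc4"}
collies = {"doc3", "doc4"}   # every collie document is also a dog document
cats = {"doc4", "doc5"}

# The assumption the argument rests on: collies is a subset of dogs.
assert collies <= dogs

long_search = dogs & cats & collies    # dogs AND cats AND collies
short_search = cats & collies          # cats AND collies

print(long_search == short_search)
```

If the subset assumption did not hold (say, a document that used "collies" without ever using the word "dogs"), the two searches could return different results, which is worth remembering when a search does not bring back what you expect.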