Help for Broker Queries
The Harvest
Broker handles many types of queries. The simplest query is
a single keyword, such as:
lightbulb
Searching for common words (like "computer" or "html") may take a lot of
time. It is often helpful to use more powerful queries. Harvest supports
many different index/search engines, with varying capabilities. At present,
our most powerful (and commonly used) search engine is
Glimpse, which supports:
- case-insensitive and case-sensitive matches
- match parts of words, whole words or multiple word phrases (like
"resource discovery")
- Boolean (AND/OR) combinations of keywords
- approximate matches (e.g., allowing spelling errors)
- structured queries (which allow you to constrain matches to certain
fields)
- show matched lines or entire matching records (e.g., for citations)
- specify limits on the number of matches returned
- a limited form of regular expressions (e.g., allowing "wild card"
expressions that match all words ending in a particular suffix)
- negation of selections using the NOT operator
The different types of queries (and how to use them) are discussed below.
Note that you use the same syntax regardless of what index/search engine is
running in a particular Broker, but that not all engines support all of the
above features. In particular, some of the Brokers use WAIS, which
sometimes searches faster than Glimpse but supports only Boolean keyword
queries and the ability to specify result set limits.
The different options - case-sensitivity, approximate matching,
the ability to show matched lines vs. entire matching records
and the ability to specify match count limits -
can all be specified with buttons and menus in the Broker query forms.
A structured query has the form:
tag-name : value
where tag-name is a Content Summary attribute name, and value
is the search value within the attribute. If you click on a Content
Summary, you will see what attributes are available for a particular
Broker. A list of common attributes is shown
here.
Keyword searches and structured queries can be combined using Boolean
operators (AND and OR) to form complex queries. Without parentheses,
Boolean operators are evaluated from left to right.
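To illustrate what left-to-right evaluation means in practice, here is a minimal Python sketch. It is not part of Harvest or Glimpse. The function name evaluate_left_to_right and the dictionary of per-keyword matches are assumptions made only for this example.

```python
# Illustrative only: evaluate a flat Boolean query strictly left to right,
# as the help text describes for queries without parentheses.
# This is NOT Harvest code; all names here are hypothetical.

def evaluate_left_to_right(tokens, matches):
    """tokens:  keywords and operators, e.g. ["a", "OR", "b", "AND", "c"]
    matches: dict saying whether one object contains each keyword."""
    result = matches[tokens[0]]
    for i in range(1, len(tokens), 2):
        operator, keyword = tokens[i], tokens[i + 1]
        if operator == "AND":
            result = result and matches[keyword]
        elif operator == "OR":
            result = result or matches[keyword]
        else:
            raise ValueError(f"unsupported operator: {operator}")
    return result

# An object that contains "Arizona" but neither "desert" nor "windsurfing".
obj = {"Arizona": True, "desert": False, "windsurfing": False}

# Left to right: (Arizona OR desert) AND windsurfing
left_to_right = evaluate_left_to_right(
    ["Arizona", "OR", "desert", "AND", "windsurfing"], obj)

# For comparison: the grouping Arizona OR (desert AND windsurfing)
and_first = obj["Arizona"] or (obj["desert"] and obj["windsurfing"])

print(left_to_right, and_first)  # False True
```

The two groupings disagree for this object, so when the grouping matters it is safer to write the parentheses explicitly.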
For multiple word phrases or regular expressions, you need to enclose the
string in double quotes, e.g.,
"internet resource discovery"
or
"discov.*"
Examples
Simple keyword search query:
Arizona
This query will return all objects in the Broker containing
the word Arizona.
Boolean query:
Arizona AND desert
This query will return all objects in the Broker that contain both words
anywhere in the object in any order. For simple keywords, the Boolean
operator AND can be omitted because AND is the default operator for
simple keywords.
Negated query:
Arizona AND NOT desert
This query will return all objects in the Broker that contain the word Arizona
and do not contain the word desert.
Phrase query:
"Arizona desert"
This query will return all objects in the Broker that contain
Arizona desert as a phrase. Notice that you need to put
double quotes around the phrase.
Boolean queries with phrases:
"Arizona desert" AND windsurfing
Simple Structured query:
Title : windsurfing
This query will return all objects in the Broker where the Title
attribute contains the value windsurfing.
Complex query:
"Arizona desert" AND (Title : windsurfing)
This query will return all objects in the Broker that contain the phrase
Arizona desert and where
the Title attribute of the same object
contains the value windsurfing.
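If you generate query strings from a program, the quoting and structured-query rules above can be captured in a few small helpers. The following Python sketch only assembles text in the syntax shown in the examples. The helper names (term, structured, all_of) are hypothetical and do not belong to any Harvest interface. The sketch does not detect regular expressions, so those would still have to be quoted by the caller.

```python
# Illustrative helpers for assembling Broker query strings.
# All function names are hypothetical; only the query syntax comes
# from the examples in this help page.

def term(text):
    """Wrap a multiple-word phrase in double quotes; leave single words as-is."""
    return f'"{text}"' if any(ch.isspace() for ch in text) else text

def structured(tag_name, value):
    """Build a structured query of the form  tag-name : value."""
    return f"{tag_name} : {term(value)}"

def all_of(*parts):
    """Combine query parts with the Boolean operator AND."""
    return " AND ".join(parts)

query = all_of(term("Arizona desert"),
               f"({structured('Title', 'windsurfing')})")
print(query)  # "Arizona desert" AND (Title : windsurfing)
```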
Query options selected by menus or buttons
These checkboxes allow some control of the query specification.
- Case insensitive:
By selecting this checkbox the query will become case insensitive (lower
case and upper case letters are treated as the same). Otherwise, the query
will be case sensitive. The default is case insensitive.
- Keywords match on word boundaries:
By selecting this checkbox, keywords will match on word boundaries.
Otherwise, a keyword will match part of a word (or phrase). For example,
"network" will match "networking", "sensitive" will match "insensitive",
and "Arizona desert" will match "Arizona desertness". The default is to
match keywords on word boundaries.
- Number of errors allowed:
Glimpse allows the search to contain a number of errors. An error is
either a deletion, insertion, or substitution of a single character.
The Best Match option will find the match(es) with the fewest
errors. The default is 0 (zero) errors.
Note: The previous three options do not apply to attribute names.
Attribute names are always case insensitive and allow no errors.
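The error count described above (deletions, insertions, and substitutions of single characters) corresponds to what is usually called edit distance. The Python sketch below uses the textbook dynamic-programming method to count such errors between two words. It illustrates the definition only; it is not how Glimpse itself performs approximate matching, and the names edit_distance and within_errors are assumptions for this example.

```python
# Illustrative only: count single-character deletions, insertions and
# substitutions needed to turn one word into another (edit distance).
# Glimpse uses its own matching algorithm; these names are hypothetical.

def edit_distance(a, b):
    # previous[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]

def within_errors(keyword, word, max_errors):
    """True if word matches keyword with at most max_errors errors."""
    return edit_distance(keyword, word) <= max_errors

print(edit_distance("desert", "dessert"))             # 1 (one insertion)
print(within_errors("windsurfing", "windsurfng", 0))  # False
print(within_errors("windsurfing", "windsurfng", 1))  # True
```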
Result set presentation
These checkboxes allow some control of presentation of the query return.
- Display matched lines (from content summaries):
By selecting this checkbox, the result set presentation will contain the
lines of the Content Summary that matched the query. Otherwise, the
matched lines will not be displayed. The default is to display the matched
lines.
- Display object descriptions (if available):
Some objects have short, one-line descriptions associated with them. By
selecting this checkbox, the descriptions will be presented. Otherwise,
the object descriptions will not be displayed. The default is to display
object descriptions.
- Display links to indexed content summary:
A link to the indexed content summary will be displayed if this checkbox
is selected. This is useful for debugging Harvest. The default is not
to show links to the indexed content summary.
Regular Expressions
Some types of regular expressions are supported by Glimpse.
A regular expression search can be much slower than other searches.
The following is a partial list of possible patterns.
(For more details see the
Glimpse manual pages.)
- ^joe will match "joe" at the beginning of a line.
- joe$ will match "joe" at the end of a line.
- [a-ho-z] matches any character between a and h or
between o and z.
- . matches any single character except newline.
- c* matches zero or more occurrences of the character "c".
- .* matches any sequence of zero or more characters (any number of wild cards).
- \* matches the character "*" (\ escapes any of the above
special characters).
Regular expressions are currently limited to approximately 30 characters,
not including meta characters. Regular expressions will generally not
cross word boundaries (because only words are stored in the index). So,
for example, "lin.*ing" will find "linking" or "flinching," but not "linear
programming."
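The word-boundary behavior can be imitated with Python's re module, which is used here only as a stand-in: its regular expression syntax is not identical to Glimpse's, and the function matching_words is a hypothetical helper. The sketch applies the pattern to each word separately, as an index that stores only words would.

```python
# Illustrative only: apply a pattern word by word, imitating an index
# that stores individual words. Python's "re" is a stand-in for Glimpse's
# own (more limited) regular expressions; matching_words is hypothetical.
import re

def matching_words(pattern, text):
    regex = re.compile(pattern)
    # Split the text into words, then test each word on its own.
    return [word for word in re.findall(r"\w+", text) if regex.search(word)]

text = "linking and flinching are unlike linear programming"
print(matching_words("lin.*ing", text))  # ['linking', 'flinching']

# Applied to the whole phrase instead, the same pattern would match,
# because ".*" could then span the space between the two words.
print(re.search("lin.*ing", "linear programming") is not None)  # True
```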
Each Broker can support different attributes, depending on the data it
holds. Below we list a set of the most common attributes. Clicking on a
hypertext link below will provide a brief explanation about each.
$Id: queryhelp.html,v 2.6 2002/08/30 11:32:40 sxw Exp $