posted on 3:25 PM, July 12, 2009
The search system is based on a simple search index that uses 2 tables:
The search index is constructed by finding all content that is unique to a URL (e.g., the content objects that are page-specific), stripping out all markup, and counting the occurrences of each remaining word in the text. For web pages, we also index words in filenames, titles, descriptions, and keywords.
Indexing of a web site consists of:
Make a search object:
my $s = ExSite::Search->new;
You must first index your
To generate a search form:
my $form_html = $s->search_form($term,$title,$width);
The parameters are all optional.
To perform a search on the terms in a search string:
my $results_html = $s->do_search($searchstring);
To get just the list of search hits:
my $results_html = $s->display_results( $s->search($searchstring) );
The Search plug-in provides a simple interface to these functions.
The search system breaks each block of content down to a stream of plain text. All tags and non-text content (such as scripts and CSS) are removed, to leave just the human-readable words and text on the page. Then we strip out all punctuation and other non-word characters to leave just alphanumeric text and whitespace. We convert the text to lower case, and break it out into individual terms, splitting on whitespace. This has a few consequences that may be important for the developer to understand, such as:
Each term is then counted, and the count is multiplied by a weight factor for that content block. The resulting score determines how significant a hit on that term is for that URL.
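The tokenizing and scoring steps described above can be sketched in plain Perl. This is a standalone illustration, not the ExSite::Search implementation: the helper names `tokenize` and `score_terms` are hypothetical, and the sketch assumes that punctuation is deleted outright rather than replaced with a space.

```perl
use strict;
use warnings;

# Reduce a block of HTML to lower-case word terms.
# Hypothetical helper; not part of ExSite::Search.
sub tokenize {
    my ($html) = @_;
    my $text = $html;
    # Drop script and style blocks entirely, contents included.
    $text =~ s{<(script|style)\b.*?</\1\s*>}{ }gis;
    # Drop all remaining tags.
    $text =~ s{<[^>]*>}{ }g;
    # Keep only alphanumerics and whitespace (assumption: punctuation is deleted).
    $text =~ s{[^A-Za-z0-9\s]}{}g;
    # Lower-case, split on whitespace, and discard empty fields.
    return grep { length } split /\s+/, lc $text;
}

# Count each term and multiply by the weight of the content block.
sub score_terms {
    my ($html, $weight) = @_;
    my %score;
    $score{$_} += $weight for tokenize($html);
    return \%score;
}

my $scores = score_terms(
    '<h1>Surf Report</h1><script>var x = 1;</script><p>Surf\'s up; surf today!</p>',
    2,
);
print "$_: $scores->{$_}\n" for sort keys %$scores;
```

With a weight of 2, "surf" appears twice and scores 4, while "Surf's" collapses to the separate term "surfs" under the assumption above. This is the kind of side effect of punctuation stripping that a developer may need to keep in mind.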
Search terms can optionally be prefixed with a + or - character, which changes the search rules:
You can combine these for some extra logical control over your searches. For example:
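The prefix handling can be sketched roughly as follows. This is a standalone illustration under stated assumptions, not the module's own parser: `parse_query` is a hypothetical helper, and the sketch assumes that `+` marks a term that must be present, `-` marks a term that must be absent, and unprefixed terms are optional.

```perl
use strict;
use warnings;

# Split a search string into required, excluded, and optional terms.
# Hypothetical helper; the actual rules are implemented in ExSite::Search.
sub parse_query {
    my ($query) = @_;
    my %terms = (required => [], excluded => [], optional => []);
    for my $word (split ' ', lc $query) {
        # Peel off a leading + or - before stripping other punctuation.
        my $prefix = $word =~ s/^([+-])// ? $1 : '';
        $word =~ s/[^a-z0-9]//g;
        next unless length $word;
        my $bucket = $prefix eq '+' ? 'required'
                   : $prefix eq '-' ? 'excluded'
                   :                  'optional';
        push @{ $terms{$bucket} }, $word;
    }
    return \%terms;
}

my $q = parse_query('+surfing "big waves" -99');
print "$_: @{ $q->{$_} }\n" for qw(required excluded optional);
```

In this sketch the quotes around "big waves" are dropped and the two words become separate optional terms, and "-99" is read as an exclusion of the term 99.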
Certain terms can be ignored entirely by the search index. These skipwords are simply not inserted into the index, no matter how often or where they appear. They are ignored in search queries, and attempts to search for just these terms will find nothing.
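A minimal sketch of skipword filtering, applied to both indexed terms and query terms. The skipword list and the helper name `drop_skipwords` are made up for illustration; the real list comes from the system configuration.

```perl
use strict;
use warnings;

# Illustrative skipword list only; the real list comes from configuration.
my %skip = map { $_ => 1 } qw(a an and of the to);

# Remove skipwords from a list of already-normalized terms.
sub drop_skipwords {
    my (@terms) = @_;
    return grep { !$skip{$_} } @terms;
}

my @indexed = drop_skipwords(qw(the history of surfing));
my @query   = drop_skipwords(qw(of the));
print "indexed: @indexed\n";
print "query has ", scalar(@query), " searchable terms\n";
```

A query made up only of skipwords is left with no searchable terms, which is why such a search finds nothing.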
There are two ways to define the list of skipwords. Method 1 is to simply list them in the configuration parameter:
search.skipwords += foo
search.skipwords = skipwords.txt
This file will be sought in the
You cannot search for partial words. For example, "surf" does not match "surfing".
Quotes are ignored, and any words in a quoted phrase are searched for individually.
Searches for negative numbers, e.g. "-99", will be understood to mean "exclude 99 from the search results".
The search system does not index alt text (alt attributes) on images.
The search system does not index any plug-ins that have not been configured as a service.
Only English skipwords are provided.