Search.pm
The search system is based on a simple search index that uses 2 tables:
- searchurl
every searchable URL is represented once here
- searchterm
every indexed word at each URL is represented once here, with a weight that is used for calculating relevance. The weight is based on the location of the word at the URL, and the number of occurrences.
The search index is constructed by finding all content that is unique to a URL (e.g. the content objects that are page-specific), stripping out all mark-up, and counting the incidence of each remaining word in the text. For web pages, we also index words in filenames, titles, descriptions, and keywords.
Indexing of a web site consists of:
- indexing the regular content on each page
The search system finds this content and indexes it automatically. It ignores content that is not page-specific, such as text that comes from templates, menus, and so forth.
- indexing the content in each plug-in
Individual plug-ins can devise their own content indexing logic. To make a plug-in's search indexing tool available to the system, the plug-in must be defined as a site service, and must reply to the ``Search'' ioctl command with a code reference. This code reference is the plug-in's search indexer. It will be invoked with three parameters:
- an ExSite::Search object
- Using this, the plug-in can add terms into the search index.
- an ExSite::Section object
- Using this, the plug-in can constrain which section's content gets indexed.
- an ExSite::Page object
- This indicates the page that should be used to deliver the plug-in's content in search results. In other words, the plug-in should index URLs that generate this page, albeit with alternate query string parameters.
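The plug-in indexing contract described above can be sketched roughly as follows. This is an illustration only: MockSearch stands in for ExSite::Search, its add_terms() method is hypothetical, and the section and page are plain hashes rather than ExSite::Section and ExSite::Page objects. Only the ``Search'' ioctl request and the three indexer parameters come from the description above.

```perl
use strict;
use warnings;

# Stand-in for ExSite::Search so the sketch runs on its own.
# add_terms() is a hypothetical method, not part of the real class.
package MockSearch;

sub new { return bless { index => {} }, shift }

sub add_terms {
    my ($self, $url, @terms) = @_;
    $self->{index}{$url}{$_}++ for @terms;
    return;
}

# Hypothetical plug-in with a few records of its own to index.
package MyPlugin;

my @records = (
    { id => 1, section_id => 1, text => 'first item' },
    { id => 2, section_id => 1, text => 'second item' },
    { id => 3, section_id => 2, text => 'other section' },
);

sub new { return bless {}, shift }

# Reply to the "Search" ioctl request with a code reference to the indexer.
sub ioctl {
    my ($self, $request) = @_;
    return \&search_index if $request eq 'Search';
    return;
}

# The indexer receives the search object, the section, and the page.
sub search_index {
    my ($search, $section, $page) = @_;
    my $count = 0;
    for my $rec (@records) {
        # Constrain indexing to the requested section.
        next unless $rec->{section_id} == $section->{section_id};
        # Index a URL that generates the given page with a different query string.
        my $url = "$page->{url}?item=$rec->{id}";
        $search->add_terms($url, split ' ', lc $rec->{text});
        $count++;
    }
    return $count;
}

package main;

my $search  = MockSearch->new;
my $section = { section_id => 1 };
my $page    = { url => '/catalog' };

my $indexer = MyPlugin->new->ioctl('Search');
my $indexed = ref($indexer) eq 'CODE' ? $indexer->($search, $section, $page) : 0;
print "indexed $indexed URLs\n";
```

A real plug-in would replace the in-memory records with its own data and call whatever indexing methods ExSite::Search actually provides.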
Usage
Make a search object:
my $s = new ExSite::Search;
You must first index your site(s) before you can perform any searches:
$s->index_site($section);
$section can be a section ID or a section datahash.
To generate a search form:
my $form_html = $s->search_form($term,$title,$width);
The parameters are all optional. $term is a term to prepopulate
the search field with. $title is a title/heading. $width is
the size of the search field (in characters).
To perform a search on the terms in a search string:
my $results_html = $s->do_search($searchstring);
To get just the list of search hits:
my $results_html = $s->display_results( $s->search($searchstring) );
The Search plug-in provides a simple interface to these functions.
Search Term Rules
The search system breaks each block of content down to a stream of plain text. All tags and non-text content (such as scripts and CSS) are removed, to leave just the human-readable words and text on the page. Then we strip out all punctuation and other non-word characters to leave just alphanumeric text and whitespace. We convert the text to lower case, and break it out into individual terms, splitting on whitespace. This has a few consequences that may be important for the developer to understand, such as:
- hyphenated words such as ``over-easy'' will be broken into two terms,
``over'' and ``easy''.
- contractions such as ``haven't'' will be broken into two terms,
``haven'' and ``t''.
- it only works on pages/sites that use languages that delimit
their words by whitespace and punctuation. Other languages, such as
Chinese, will not work.
Each term is then counted, and the count is multiplied by a weight factor for that content block. The resulting score determines how significant a hit on that term is for that URL.
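The term rules above can be sketched as follows. This is a minimal approximation, not the module's actual code: the function names are hypothetical, the regular expressions assume ASCII text, and the weight of 2 is an arbitrary example value.

```perl
use strict;
use warnings;

# Reduce a block of HTML to a list of lower-case alphanumeric terms.
sub tokenize {
    my ($html) = @_;
    my $text = $html;
    $text =~ s/<(script|style)\b.*?<\/\1\s*>/ /gis;   # drop scripts and CSS
    $text =~ s/<[^>]*>/ /g;                            # drop remaining tags
    $text =~ s/[^A-Za-z0-9\s]/ /g;                     # punctuation becomes whitespace
    return split ' ', lc $text;                        # split on whitespace
}

# Count each term and multiply by the weight factor of the content block.
sub score_terms {
    my ($html, $weight) = @_;
    my %score;
    $score{$_} += $weight for tokenize($html);
    return \%score;
}

my $score = score_terms("<p>Eggs over-easy. We haven't any eggs!</p>", 2);
for my $term (sort keys %$score) {
    print "$term => $score->{$term}\n";
}
```

In this sketch ``over-easy'' yields the terms ``over'' and ``easy'', and ``haven't'' yields ``haven'' and ``t'', matching the consequences listed above.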
Advanced Searching Options
Search terms can optionally be prefixed with a + or - character, which changes the search rules:
- term
- The term is desired, but optional, in the search results. Since at least one term must produce a hit, if only one optional term is given, then it is effectively a required term. If more than one optional term is given, at least one of them is required.
- +term
- The term is required in the search results. Results that do not contain this term will not be reported.
- -term
- The term is forbidden in the search results. Results that contain this term will not be reported.
You can combine these for some extra logical control over your searches. For example:
- foo bar
- Search for ``foo'' or ``bar''. (But pages that have both terms will tend to be more relevant.)
- +foo +bar
- Search for ``foo'' and ``bar''.
- +foo bar
- Search for ``foo'' and optionally ``bar''. (I.e., search for ``foo'', but if ``bar'' is also found, it will increase the relevance of the hit.)
- foo -bar
- Searches for pages containing ``foo'', but excludes pages containing ``bar'' from the results.
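The prefix rules above can be illustrated with a small sketch. The function names and the representation of a page as a hash of its indexed terms are assumptions made for this example. It only decides whether a page is a hit and does not compute relevance.

```perl
use strict;
use warnings;

# Split a search string into required (+), forbidden (-), and optional terms.
sub parse_query {
    my ($query) = @_;
    my %q = (required => [], forbidden => [], optional => []);
    for my $word (split ' ', lc $query) {
        my ($prefix, $rest) = $word =~ /^([+-]?)(.*)$/;
        my $kind = $prefix eq '+' ? 'required'
                 : $prefix eq '-' ? 'forbidden'
                 :                  'optional';
        # Apply the same term rules as the indexer: keep alphanumeric runs only.
        push @{ $q{$kind} }, grep { length } split /[^a-z0-9]+/, $rest;
    }
    return \%q;
}

# Decide whether a page, given as a hash of its indexed terms, is a hit.
sub is_hit {
    my ($q, $terms) = @_;
    return 0 if grep { $terms->{$_} } @{ $q->{forbidden} };    # forbidden term present
    return 0 if grep { !$terms->{$_} } @{ $q->{required} };    # required term missing
    return 1 if @{ $q->{required} };                           # all required terms hit
    return (grep { $terms->{$_} } @{ $q->{optional} }) ? 1 : 0; # need at least one hit
}

my $q = parse_query('+foo bar -baz');
print is_hit($q, { foo => 2, bar => 1 }) ? "hit\n" : "miss\n";
print is_hit($q, { foo => 2, baz => 1 }) ? "hit\n" : "miss\n";
```

A query that contains only forbidden terms matches nothing in this sketch, since at least one term must produce a hit.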
Skipwords
Certain terms can be ignored entirely by the search index. These skipwords are simply not inserted into the index, no matter how often or where they appear. They are ignored in search queries, and attempts to search for just these terms will find nothing.
There are two ways to define the list of skipwords. Method 1 is to
simply list them in the configuration parameter
$config{search}{skipwords}. You can add to this list using the
configuration file notation:
search.skipwords += foo
search.skipwords += bar
If the search.skipwords parameter is not an array of words, but is
just a scalar string, that string is understood to be the name of a file
containing the skipwords, one per line. For example:
search.skipwords = skipwords.txt
This file will be sought in the conf subdirectory of cgi-bin.
A fairly comprehensive sample file is included with ExSite, containing
over 500 words that by themselves carry little meaning and therefore
do not help to distinguish one search topic from another. This file
may be edited or replaced as needed.
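The two ways of defining skipwords could be handled along these lines. This is a sketch under assumptions: load_skipwords() and its $conf_dir parameter are hypothetical names, and the real module may read its configuration differently.

```perl
use strict;
use warnings;

# Build a lookup hash of skipwords from a setting that is either an array
# reference of words or the name of a file containing one word per line.
sub load_skipwords {
    my ($setting, $conf_dir) = @_;
    my @words;
    if (ref($setting) eq 'ARRAY') {
        @words = @$setting;
    }
    elsif (defined $setting && length $setting) {
        my $path = "$conf_dir/$setting";    # $conf_dir is an assumed parameter
        open my $fh, '<', $path or die "Cannot read $path: $!";
        @words = <$fh>;
        close $fh;
    }
    my %skip;
    for my $word (@words) {
        $word =~ s/^\s+|\s+$//g;            # trim whitespace and newlines
        $skip{ lc $word } = 1 if length $word;
    }
    return \%skip;
}

# Array form, as produced by repeated "search.skipwords += ..." lines.
my $skip = load_skipwords([qw(the and of)]);
my @kept = grep { !$skip->{$_} } qw(the history of surfing);
print "@kept\n";
```

The file form would be called as load_skipwords('skipwords.txt', $conf_dir), where $conf_dir points at the conf subdirectory of cgi-bin.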
Limitations
You cannot search for partial words. For example ``surf'' does not match ``surfing''.
Quotes are ignored, and any words in a quoted phrase are searched for individually.
Searches for negative numbers, e.g. ``-99'', will be understood to mean ``exclude '99' from the search results''.
It does not index alt attributes on images.
It does not index any plug-ins that have not been configured as a service.
Only English skipwords are provided.