Search.pm

Working with the search index
posted on 3:25 PM, July 12, 2009

The search system is based on a simple search index that uses 2 tables:

  • searchurl
    every searchable URL is represented once here

  • searchterm
    every indexed word at each url is represented once here, with a weight that is used for calculating relevance. The weight is based on the location of the word at the URL, and the number of occurrences.

The search index is constructed by finding all content that is unique to a URL (e.g. the content objects that are page-specific), stripping out all mark-up, and counting the occurrences of each remaining word in the text. For web pages, we also index words in filenames, titles, descriptions, and keywords.

Indexing of a web site consists of:

  1. indexing the regular content on each page
    The search system finds this content and indexes it automatically. It ignores content that is not page-specific, such as text that comes from templates, menus, and so forth.

  2. indexing the content in each plug-in
    Individual plug-ins can devise their own content indexing logic. To make a plug-in's search indexing tool available to the system, the plug-in must be defined as a site service, and must reply to the ``Search'' ioctl command with a code reference. This code reference is the plug-in's search indexer. It will be invoked with three parameters:

      • an ExSite::Search object
        Using this, the plug-in can add terms into the search index.

      • an ExSite::Section object
        Using this, the plug-in can constrain which section's content gets indexed.

      • an ExSite::Page object
        This indicates the page that should be used to deliver the plug-in's content in search results. In other words, the plug-in should index URLs that generate this page, albeit with alternate query string parameters.
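
The ioctl handshake described above can be sketched roughly as follows. This is only an illustration: the package name My::Plugin, the indexer name index_content, and the way the ioctl request is passed in are assumptions, and the three objects are empty stand-ins rather than real ExSite objects. The methods a real indexer would call on the ExSite::Search object to add terms are not covered here, so the sketch only reports what it was given.

```perl
use strict;
use warnings;

# Hypothetical plug-in. Only the ``Search'' ioctl convention and the three
# indexer parameters come from the text; every name here is illustrative.
package My::Plugin;

sub new { return bless {}, shift }

# Reply to ioctl requests. For "Search", return a code reference to the indexer.
sub ioctl {
    my ($this, $request) = @_;
    return \&index_content if $request eq 'Search';
    return;
}

# The indexer is invoked with the three objects described above.
sub index_content {
    my ($search, $section, $page) = @_;
    # A real indexer would walk the plug-in's content for $section and add
    # terms through $search, pointing each hit at a URL that generates $page.
    # This stand-in only reports the classes of the objects it received.
    return join ', ', map { ref $_ } ($search, $section, $page);
}

package main;

my $indexer = My::Plugin->new->ioctl('Search');

# Empty stand-in objects (assumption: the real system passes live objects).
my @args = map { bless({}, $_) } qw(ExSite::Search ExSite::Section ExSite::Page);
my $report = $indexer->(@args);
print "$report\n";
```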

Usage

Make a search object:

    my $s = ExSite::Search->new;

You must first index your site(s) before you can perform any searches:

    $s->index_site($section);

$section can be a section ID or a section datahash.

To generate a search form:

    my $form_html = $s->search_form($term,$title,$width);

The parameters are all optional. $term is a term to prepopulate the search field with. $title is a title/heading. $width is the size of the search field (in characters).

To perform a search on the terms in a search string:

    my $results_html = $s->do_search($searchstring);

To get just the list of search hits:

    my $results_html = $s->display_results( $s->search($searchstring) );

The Search plug-in provides a simple interface to these functions.

Search Term Rules

The search system breaks each block of content down to a stream of plain text. All tags and non-text content (such as scripts and CSS) are removed, to leave just the human-readable words and text on the page. Then we strip out all punctuation and other non-word characters to leave just alphanumeric text and whitespace. We convert the text to lower case, and break it out into individual terms, splitting on whitespace. This has a few consequences that may be important for the developer to understand.

Each term is then counted, and the count is multiplied by a weight factor for that content block. The resulting score determines how significant a hit on that term is for that URL.
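
A minimal sketch of these steps, for illustration only. The function names extract_terms and score_terms are hypothetical, and the choice to replace stripped characters with spaces (rather than deleting them) is an assumption; the actual module may differ in its details.

```perl
use strict;
use warnings;

# Reduce a block of HTML to a list of lower-case terms.
sub extract_terms {
    my ($html) = @_;
    my $text = $html;
    $text =~ s{<(script|style)\b.*?</\1\s*>}{ }gis;  # drop scripts and CSS
    $text =~ s{<[^>]*>}{ }g;                          # drop remaining tags
    $text =~ s{[^A-Za-z0-9\s]}{ }g;   # assumption: non-word chars become spaces
    return grep { length } split /\s+/, lc $text;
}

# Count each term and multiply the count by the weight of the content block.
sub score_terms {
    my ($html, $weight) = @_;
    my %score;
    $score{$_} += $weight for extract_terms($html);
    return \%score;
}

my $score = score_terms('<h1>Surf Report</h1><p>Surf is up, surf now!</p>', 2);
printf "%s => %d\n", $_, $score->{$_} for sort keys %$score;
```

In this example ``surf'' occurs three times in a block with weight 2, so it scores 6, while each of the other terms scores 2.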

Advanced Searching Options

Search terms can optionally be prefixed with a + or - character, which changes the search rules:

  • term
    The term is desired, but optional, in the search results. Since at least one term must produce a hit, if only one optional term is given, then it is effectively a required term. If more than one optional term is given, at least one of them is required.

  • +term
    The term is required in the search results. Results that do not contain this term will not be reported.

  • -term
    The term is forbidden in the search results. Results that contain this term will not be reported.

You can combine these for some extra logical control over your searches. For example:

  • foo bar
    Search for ``foo'' or ``bar''. (But pages that have both terms will tend to be more relevant.)

  • +foo +bar
    Search for ``foo'' and ``bar''.

  • +foo bar
    Search for ``foo'' and optionally ``bar''. (I.e. search for ``foo'', but if ``bar'' is also found, it will increase the relevance of the hit.)

  • foo -bar
    Search for pages containing ``foo'', but exclude pages containing ``bar'' from the results.
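
The prefix rules can be sketched as a small query parser and matcher. This is a hedged illustration, not the module's own code: parse_query and page_matches are hypothetical names, a page is modelled as a plain hash of indexed terms, and the sketch assumes (following the ``+foo bar'' example) that optional terms only become mandatory when no required terms are given.

```perl
use strict;
use warnings;

# Split a search string into required, optional, and forbidden terms.
sub parse_query {
    my ($query) = @_;
    my %q = (required => [], optional => [], forbidden => []);
    for my $word (split ' ', lc $query) {
        if    ($word =~ /^\+(.+)/) { push @{ $q{required} },  $1 }
        elsif ($word =~ /^-(.+)/)  { push @{ $q{forbidden} }, $1 }
        else                       { push @{ $q{optional} },  $word }
    }
    return \%q;
}

# Decide whether one URL's terms (hashref of term => score) satisfy the query.
sub page_matches {
    my ($q, $terms) = @_;
    return 0 if grep {  exists $terms->{$_} } @{ $q->{forbidden} };
    return 0 if grep { !exists $terms->{$_} } @{ $q->{required} };
    return 1 if @{ $q->{required} };
    # No required terms: at least one optional term must produce a hit.
    my $hits = grep { exists $terms->{$_} } @{ $q->{optional} };
    return $hits ? 1 : 0;
}

my %page = (foo => 3, baz => 1);    # illustrative index entries for one URL
for my $query ('foo bar', '+foo +bar', '+foo bar', 'foo -baz') {
    my $ok = page_matches(parse_query($query), \%page);
    print "$query: ", ($ok ? 'hit' : 'no hit'), "\n";
}
```

Relevance ranking is left out of the sketch; it only decides whether a page is reported at all.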

Skipwords

Certain terms can be ignored entirely by the search index. These skipwords are simply not inserted into the index, no matter how often or where they appear. They are ignored in search queries, and attempts to search for just these terms will find nothing.

There are two ways to define the list of skipwords. Method 1 is to simply list them in the configuration parameter $config{search}{skipwords}. You can add to this list using the configuration file notation:

    search.skipwords += foo
    search.skipwords += bar

Method 2 is to keep the skipwords in a file. If the search.skipwords parameter is not an array of words, but is just a scalar string, that string is understood to be the name of a file containing the skipwords, one per line. For example:

    search.skipwords = skipwords.txt

This file will be sought in the conf subdirectory of cgi-bin. A fairly comprehensive sample file is included with ExSite, containing over 500 words that by themselves carry little meaning and therefore do not help to distinguish one search topic from another. This file may be edited or replaced as needed.
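
The two configuration styles could be handled along the following lines. This is a sketch under stated assumptions: load_skipwords and drop_skipwords are hypothetical helpers, the configuration directory is passed in explicitly, and skipwords are lower-cased on the assumption that they are compared against lower-cased index terms.

```perl
use strict;
use warnings;

# $setting stands in for $config{search}{skipwords}: either an array
# reference of words, or a scalar naming a file with one word per line.
sub load_skipwords {
    my ($setting, $conf_dir) = @_;
    my @words;
    if (ref $setting eq 'ARRAY') {
        @words = @$setting;
    }
    elsif (defined $setting && length $setting) {
        my $path = "$conf_dir/$setting";
        open my $fh, '<', $path or die "Cannot read $path: $!";
        chomp(@words = <$fh>);
        close $fh;
    }
    my %skip;
    $skip{ lc($_) } = 1 for grep { /\S/ } @words;   # ignore blank lines
    return \%skip;
}

# Remove skipwords from a list of terms before indexing or searching.
sub drop_skipwords {
    my ($skip, @terms) = @_;
    return grep { !$skip->{$_} } @terms;
}

my $skip = load_skipwords([qw(the and of)], 'conf');
my @kept = drop_skipwords($skip, qw(the history of surfing));
print "@kept\n";
```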

Limitations

You cannot search for partial words. For example ``surf'' does not match ``surfing''.

Quotes are ignored, and any words in a quoted phrase are searched for individually.

Searches for negative numbers, e.g. ``-99'', will be understood to mean ``exclude '99' from the search results''.

The search system does not index the alt text (alt attributes) of images.

It does not index any plug-ins that have not been configured as a service.

Only English skipwords are provided.
