Content Indexing

Collections

The Search Cartridge indexes documents into so-called collections. Indexing a document into a particular collection can speed up later searches. On websites in two or more languages, for example, only the content for a particular language needs to be searched if the language is a search criterion. However, it is also possible to search several or all collections.

When CMS Fiona is installed, two collections are created, one for the editorial and one for the live side. Further collections can be created using a Tcl command of the Search Server. When doing this, a pre-set configuration is used for the new collection. It includes, among other things, country-specific settings such as the character set of the indexed documents and the names of the document fields whose contents are to be returned in search results. The structure of a collection cannot be changed once it has been created.

If the Template Engine is used on the live server, a collection pair is used for indexing and searching. While search requests are served from the collection that is currently online, updated content is indexed into the offline collection. Such a collection pair is called a switchable collection.
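
The idea of a switchable collection can be pictured with a small Python model. This is only an illustrative sketch: the class name SwitchableCollection, its methods, and the in-memory dictionaries are assumptions made for this example and are not part of the Search Cartridge or Template Engine API. How the real system brings the offline collection up to date before switching is not shown here.

```python
class SwitchableCollection:
    """Toy model of a collection pair: one online, one offline (hypothetical)."""

    def __init__(self):
        # Two in-memory "collections", each mapping document ID -> text.
        self._pair = [{}, {}]
        self._online = 0  # index of the collection that serves searches

    def index(self, doc_id, text):
        # Updated content always goes into the offline collection.
        self._pair[1 - self._online][doc_id] = text

    def search(self, term):
        # Search requests are served only from the online collection.
        online = self._pair[self._online]
        return sorted(doc_id for doc_id, text in online.items() if term in text)

    def switch(self):
        # Swap roles: the freshly indexed collection goes online.
        self._online = 1 - self._online


pair = SwitchableCollection()
pair.index("2001", "new press release")
before = pair.search("press")  # offline content is not visible yet
pair.switch()
after = pair.search("press")   # visible after the switch
```

In this model, search requests never see a half-updated index, because indexing and searching always work on different collections.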

If the Template Engine is not available on the live server, the live server collections can be created by the Content Manager. Documents that are exported using the exportSubtree command are automatically indexed if this option has been enabled in the system configuration.

Indexed Data

In the editorial system, the Search Cartridge indexes the versions of files as well as some important file fields. You can configure the kinds of versions to be indexed – edited, released and archived versions may be combined as desired.

On the live server, the Template Engine indexes the exported, UTF-8 encoded documents including all their meta-data before applying the configured export encoding to them. If a file has been exported using data from other files (such as frame sets or layouts for the main content), only the meta-data of the main file are indexed.

Frame sets and their frames are jointly indexed as a single document. Therefore, the frame set is included in the search results if an associated frame matches the search query. If the Template Engine is not available on the live system, the Content Manager can create the indexes during the static export.

Since the Search Cartridge not only indexes the version fields but also the most important file fields, even searching for fields such as the file name or file format may produce search results.

In the values of fields containing HTML text (such as body), the SGML comments <!-- noindex --> and <!-- /noindex --> have the effect that the text between these comments is not indexed. Comments themselves are never indexed.
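
The effect of the noindex comments can be approximated in a few lines of Python. This is a sketch of the described behaviour, not the Search Cartridge's actual implementation; the function name indexable_text and the tolerance for variable whitespace inside the comments are assumptions.

```python
import re

# Text between the noindex markers, including the markers themselves.
NOINDEX_BLOCK = re.compile(
    r"<!--\s*noindex\s*-->.*?<!--\s*/noindex\s*-->", re.DOTALL
)
# Any remaining SGML comment.
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def indexable_text(html):
    """Return the part of an HTML field value that would remain for indexing."""
    without_blocks = NOINDEX_BLOCK.sub("", html)
    return COMMENT.sub("", without_blocks)


body = (
    "<p>Public text</p>"
    "<!-- noindex --><p>Navigation</p><!-- /noindex -->"
    "<!-- note --><p>More</p>"
)
result = indexable_text(body)
```

Here the navigation paragraph and the ordinary comment are both dropped, while the text before and after them is kept.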

Each time a value of an indexed file or version field is altered or the file status changes due to workflow actions such as Release or Unrelease, the Content Manager indexes the respective version for the search in the editorial system. For the search on the live system, either the Template Engine or the Content Manager carries out the indexing of the web documents during the export.

Document Zones and Fields

The Search Engine Server gets the data of a version or web document to be indexed as an XML document in a request. Each attribute in such a document corresponds to an XML element. The custom version field abstract, for example, is stored in the XML file as follows:

<abstract>Summary of the document</abstract>
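
The mapping from fields to XML elements can be sketched in Python as follows. The root element name document and the function name fields_to_xml are assumptions chosen for illustration; the actual structure of the indexing request may differ.

```python
import xml.etree.ElementTree as ET


def fields_to_xml(fields, root_name="document"):
    """Serialize a field dictionary so that each field becomes one XML element."""
    root = ET.Element(root_name)  # root element name is an assumption
    for name, value in fields.items():
        ET.SubElement(root, name).text = value
    return ET.tostring(root, encoding="unicode")


xml_text = fields_to_xml(
    {"title": "Annual Report", "abstract": "Summary of the document"}
)
```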

After the Search Engine Server has passed the document that is to be indexed to the search module and the latter has indexed it, all the indexed fields of the CMS file have become so-called zones. Zones are named document areas that can be searched.

You can explicitly restrict search queries to one or more zones in order to search for documents that contain the search term in these zones. Such a search is called attributed, because it is not applied to the whole document but only to selected areas. If a search request is not explicitly restricted to particular zones, all zones are searched.
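
The difference between an attributed search and a search over all zones can be illustrated with a toy in-memory index. The function search, its zones parameter, and the sample documents are hypothetical; this is not the query syntax of the Search Engine Server.

```python
def search(documents, term, zones=None):
    """Return IDs of documents containing term in the given zones.

    If zones is None, all zones of each document are searched.
    """
    hits = []
    for doc_id, doc_zones in documents.items():
        names = doc_zones.keys() if zones is None else zones
        if any(term.lower() in doc_zones.get(name, "").lower() for name in names):
            hits.append(doc_id)
    return hits


documents = {
    "1": {"title": "Search basics", "body": "How indexing works"},
    "2": {"title": "Export", "body": "Search the exported pages"},
}
everywhere = search(documents, "search")                   # all zones
title_only = search(documents, "search", zones=["title"])  # attributed search
```

Restricting the query to the title zone returns only the first document, whereas the unrestricted query also finds the term in the body of the second one.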

While document zones enable you to search through specific document parts, document fields are used for enriching the search result of each found document with the information you want to display on the results pages. In the standard configuration, for example, the version field title not only becomes a zone, but in addition to this, its content is stored in the document field title. This enables the Search Cartridge to include the titles of the documents in the search results, for clients to use them as intended.

Document zones and fields can be configured as desired (see Configuring Collections). A version of a CMS file can have any number of version fields. During indexing the fields are transformed into the same number of document zones, provided that the configuration does not exclude zones from indexing or restrict indexing to certain zones.

In contrast to this, all indexed documents always have all document fields. If the configuration defines, for example, that the content of a zone is to be stored in a particular field, this field is always included in the indexed document, even if the respective zone is not present in the document. In this case, the field remains empty.
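
A minimal sketch of this rule, assuming a hypothetical configuration table ZONE_TO_FIELD that maps zone names to document field names:

```python
# Hypothetical configuration: zone name -> document field that stores its content.
ZONE_TO_FIELD = {"title": "title", "abstract": "abstract"}


def document_fields(zones):
    """Build the document fields; a field stays empty if its zone is missing."""
    return {field: zones.get(zone, "") for zone, field in ZONE_TO_FIELD.items()}


# This document has no abstract zone, and body is not configured as a field.
fields = document_fields({"title": "Annual Report", "body": "Full text"})
```

The resulting dictionary contains both configured fields, with an empty abstract, and no entry for body.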

A client that sends a search request to the Search Engine Server can explicitly name the fields whose content it wants to be included in the search result (see Search Requests). The zones and fields available in the standard configuration can be found in section Content Search.

Pre-Processing During Indexing

When the Search Engine Server indexes documents, you can have every document pre-processed by a script or a program. A pre-processor can be used, for example, to add information not included in the CMS file versions themselves to the documents to be indexed. Pre-processing can also be helpful if you need to alter the encoding of the documents, i.e. to change it to the required format (UTF-8).

The Search Engine Server passes the documents to be indexed to the pre-processor without prior modification, i.e. the script or program receives the original indexing request. The pre-processor modifies the data as required and returns it to the Search Engine Server, which sends it to the Autonomy search module for indexing.
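
The core of such a pre-processor could look like the following Python sketch, which adds extra elements to a document before it is indexed. The function name preprocess, the element names document and department, and the sample data are assumptions; how the Search Engine Server invokes the pre-processor and passes the request to it is not covered here.

```python
import xml.etree.ElementTree as ET


def preprocess(request_xml, extra):
    """Add extra elements to an indexing document and return the modified XML."""
    root = ET.fromstring(request_xml)
    for name, value in extra.items():
        ET.SubElement(root, name).text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical input document and additional information.
original = "<document><title>Annual Report</title></document>"
modified = preprocess(original, {"department": "Finance"})
```

The existing elements are left untouched; only the additional information is appended, so it can later be indexed like any other field.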

Identification of Indexed Contents

In the process of indexing, the search module assigns each document a unique identifier, the document ID. This ID is stored as a field in the search index and is returned in the search results. The search module receives the identifiers from the Search Engine Server, which in turn has received them from the respective client. Thus, the client decides which CMS file or version field is used as the document ID.

While the Content Management Server uses version IDs as document identifiers, the Template Engine uses file IDs. Thus, in the editorial system, the ID of an indexed document corresponds to a version ID, whereas in the live system, the document ID is a file ID.

The document ID enables the client (e.g. the user interface of the editorial system or the Template Engine) to retrieve additional information about the document. In addition to this ID, other important items are indexed as document fields by default, for example the paths of the CMS files and the titles of the relevant CMS file versions. Thus, a client can extract the file paths from the search results to create result pages on which the titles of the retrieved documents are linked to the corresponding web pages.
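
As an illustration, a client could turn search results into linked titles roughly as follows. The result structure (a list of dictionaries with the keys id, path and title), the function name result_links, and the sample hit are assumptions made for this sketch; the actual response format is described in Search Requests.

```python
from html import escape


def result_links(results, base_url=""):
    """Render one HTML link per search hit from its path and title fields."""
    links = []
    for hit in results:
        href = escape(base_url + hit["path"], quote=True)
        links.append('<a href="{}">{}</a>'.format(href, escape(hit["title"])))
    return links


# Hypothetical search result with document ID, path and title fields.
results = [
    {"id": "4711", "path": "/company/news/index.html", "title": "News & Events"}
]
links = result_links(results)
```

Escaping the title and the path keeps characters such as the ampersand from breaking the generated HTML.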