Integrating an External Preprocessor

General Notes

An external preprocessor lets you modify documents before they are indexed. This makes it possible to convert binary data to text, or to generate or extract metadata (from images, for example) for indexing, so that searches find the documents concerned more reliably. You can define as many preprocessors as you require.

Documents of any MIME type can be associated with a preprocessor in the indexing section of the system configuration. Any suitable program can serve as an external preprocessor, and arguments can optionally be passed to it.

Functionality

The preprocessor program receives the document to be indexed via stdin from the Search Server. The document passed to the preprocessor is a serialized XML document. The preprocessor modifies it in the desired way and returns it to the Search Server via stdout. The Search Server then indexes the modified document. An example:

Original data:

<ses-indexDoc docId="2148" collection="cm-contents"
    mimeType="application/vnd.ms-excel">
<title encoding="plain">An example with Excel data</title>
<keyword encoding="plain">Example</keyword>
<blob encoding="stream" mimeType="application/vnd.ms-excel">
  /Fiona_671/instance/default/tmp/externalPreprocessor/1.dat
</blob>
</ses-indexDoc>

Modified data:

<ses-indexDoc docId="2148" collection="cm-contents"
    mimeType="application/vnd.ms-excel">
<title encoding="plain">Excel data as text</title>
<keyword encoding="plain">Example</keyword>
<blob encoding="stream" mimeType="text/plain">
  /Fiona_671/instance/default/tmp/text_data.dat
</blob>
</ses-indexDoc>
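
The round trip described above (read the index document, change it, write it back) can be sketched as follows. Python is used purely for illustration, since any suitable program can act as a preprocessor. The function names preprocess and main, the sample document, and the title change itself are hypothetical. Whether the Search Server expects an XML declaration or a particular character encoding in the returned document is not stated here, so the sketch makes no attempt to produce one.

```python
import sys
import xml.etree.ElementTree as ET


def preprocess(xml_text: str) -> str:
    """Return the index document with a rewritten title field (illustrative change only)."""
    root = ET.fromstring(xml_text)
    title = root.find("title")
    if title is not None and title.get("encoding") == "plain":
        # Hypothetical modification: normalize the title to upper case.
        title.text = (title.text or "").strip().upper()
    return ET.tostring(root, encoding="unicode")


def main() -> None:
    # As a real preprocessor, the program reads the document from stdin
    # and writes the modified document to stdout.
    sys.stdout.write(preprocess(sys.stdin.read()))


# Demonstration on an in-memory document instead of stdin.
SAMPLE = (
    '<ses-indexDoc docId="2148" collection="cm-contents" mimeType="text/plain">'
    '<title encoding="plain">quarterly figures</title>'
    "</ses-indexDoc>"
)
print(preprocess(SAMPLE))
```

To use such a script as an actual preprocessor, main() would be called at the end of the file in place of the demonstration lines.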

The XML document contains the fields to be indexed (the names of the XML elements) as well as their values (the content of the XML elements). A field value is either contained directly in the element's content or stored in encoded form. The encoding is given by the encoding attribute of the field element, which has one of the following values:

  • plain: The field value is the content of the XML element.
  • base64: The field value can be determined by base64-decoding the content of the XML element.
  • stream: The field value is contained in the file whose path is specified in the content of the XML element.
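
The three encodings listed above can be resolved with a small helper. The following is a minimal sketch in Python (chosen for illustration only). The function name field_value is hypothetical, and treating a missing encoding attribute as plain is an assumption, as is trimming whitespace around the path of a streamed value.

```python
import base64
import os
import tempfile
import xml.etree.ElementTree as ET


def field_value(element: ET.Element) -> bytes:
    """Return the raw value of a field element according to its encoding attribute."""
    encoding = element.get("encoding", "plain")  # assumption: default is plain
    content = element.text or ""
    if encoding == "plain":
        return content.encode("utf-8")
    if encoding == "base64":
        return base64.b64decode(content)
    if encoding == "stream":
        # The element content is a file path; surrounding whitespace is trimmed.
        with open(content.strip(), "rb") as handle:
            return handle.read()
    raise ValueError(f"unsupported encoding: {encoding!r}")


# Demonstration: the same value in all three encodings.
plain = ET.fromstring('<title encoding="plain">Hello</title>')
encoded = ET.fromstring('<note encoding="base64">SGVsbG8=</note>')
with tempfile.NamedTemporaryFile("wb", suffix=".dat", delete=False) as tmp:
    tmp.write(b"Hello")
streamed = ET.Element("blob", encoding="stream")
streamed.text = f"\n  {tmp.name}\n"

values = [field_value(e) for e in (plain, encoded, streamed)]
os.unlink(tmp.name)
print(values)
```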

For all encodings except plain, the MIME type of the field value is given by the mimeType attribute of the field element. If the MIME type changes during preprocessing, the mimeType attribute must be set to the MIME type of the resulting field value. If the encoding is not plain, a field value is only indexed if its MIME type matches text/*. In other words, a preprocessor that produces base64-encoded or streamed field values must set their MIME type to a text type.
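
The MIME type rule can be expressed as a simple check, together with the attribute updates a preprocessor has to make after a conversion. This is a sketch in Python for illustration. The function names will_be_indexed and mark_as_text_stream are hypothetical, and the check merely restates the rule above; it is not the Search Server's own implementation.

```python
import xml.etree.ElementTree as ET


def will_be_indexed(field: ET.Element) -> bool:
    """Restate the rule: values that are not plain need a text/* MIME type."""
    if field.get("encoding", "plain") == "plain":
        return True
    return field.get("mimeType", "").startswith("text/")


def mark_as_text_stream(field: ET.Element, text_path: str) -> None:
    """Point the field at a converted text file and update its attributes."""
    field.text = text_path
    field.set("encoding", "stream")
    field.set("mimeType", "text/plain")


blob = ET.fromstring(
    '<blob encoding="stream" mimeType="application/vnd.ms-excel">'
    "/Fiona_671/instance/default/tmp/externalPreprocessor/1.dat</blob>"
)
before = will_be_indexed(blob)
mark_as_text_stream(blob, "/Fiona_671/instance/default/tmp/text_data.dat")
after = will_be_indexed(blob)
print(before, after)
```

Before the update the streamed Excel blob fails the check; once mimeType is set to text/plain, it passes.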

Configuration

The preprocessor to be used, the MIME types to which it is applied, and the arguments to be passed to it can be specified in the indexing.xml configuration file. The corresponding section might look like this, for example:

  ...
  <contentPreprocessors type="list">
    <preprocessor>
      <mimeTypes type="list">
        <mimeType>application/pdf</mimeType>
      </mimeTypes>
      <processor type="external">
        bin/tclsh
      </processor>
      <processorArguments type="list">
        <argument>/Fiona_671/instance/default/script/custom/pdf2TxtWrapper.tcl</argument>
      </processorArguments>
    </preprocessor>
    ...
  </contentPreprocessors>
  ...

Here, the Tcl interpreter is specified as the preprocessor program. The name of the script to be executed is passed to it as an argument in the processorArguments element. Since the script cannot be loaded during server startup, it should not be placed in the serverCmds or clientCmds directory.
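
To check which program and arguments are configured for which MIME type, the section can be read with a few lines of code. The sketch below (Python, for illustration) works on the fragment shown above. The function name preprocessor_map is hypothetical, and the layout of indexing.xml outside this fragment is not shown here, so the code only looks for preprocessor elements wherever they occur.

```python
import xml.etree.ElementTree as ET

# The configuration fragment from above, without the elided parts.
CONFIG = """
<contentPreprocessors type="list">
  <preprocessor>
    <mimeTypes type="list">
      <mimeType>application/pdf</mimeType>
    </mimeTypes>
    <processor type="external">
      bin/tclsh
    </processor>
    <processorArguments type="list">
      <argument>/Fiona_671/instance/default/script/custom/pdf2TxtWrapper.tcl</argument>
    </processorArguments>
  </preprocessor>
</contentPreprocessors>
"""


def preprocessor_map(config_xml: str) -> dict:
    """Map each configured MIME type to [program, argument, ...]."""
    mapping = {}
    for entry in ET.fromstring(config_xml).iter("preprocessor"):
        program = (entry.findtext("processor") or "").strip()
        arguments = [
            (arg.text or "").strip()
            for arg in entry.iterfind("processorArguments/argument")
        ]
        for mime in entry.iterfind("mimeTypes/mimeType"):
            mapping[(mime.text or "").strip()] = [program, *arguments]
    return mapping


print(preprocessor_map(CONFIG))
```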

The following sample script, pdf2TxtWrapper.tcl, demonstrates how a PDF document, which is contained in the blob field of the XML document, can be read and converted to text. Please note that no preprocessor is required for the Search Server to index PDF documents.

# Libraries
package require dom
package require base64
proc safeInterp {args} {}
source [file join [file dirname [info script]]\
    ../../../share/script/common/clientCmds/util.tcl]

# Read Data
set xmlRequest [read stdin]

# Parse XML
set docNode [::dom::DOMImplementation parse $xmlRequest]
set rootNode [::dom::document cget $docNode -documentElement]

# Select and handle element "blob"
set blobElement [lindex [::dom::selectNode $rootNode descendant::blob] 0]
array set attributes [array get [$blobElement cget -attributes]]
set blobTextNode [$blobElement cget -firstChild]
if {$blobTextNode ne ""} {
  set value [$blobTextNode cget -nodeValue]
  if {$value ne ""} {
    switch $attributes(encoding) {
      plain {
        # shouldn't happen with pdf
        set blob $value
      }
      base64 {
        set blob [::base64::decode $value]
      }
      stream {
        # the element content is a file path; trim surrounding whitespace
        set blobFile [string trim $value]
      }
    }
    set deletePdfFile 0
    if {![info exists blobFile]} {
      set blobFile "/tmp/convert_me_[pid].pdf"
      writeFile $blobFile $blob
      set deletePdfFile 1
    }
    set textFile "/tmp/converted_[pid].txt"
    # convert using ps2ascii
    if {![catch {
      exec ps2ascii $blobFile $textFile
    }]} {
      # modify the dom tree
      $blobTextNode configure -nodeValue $textFile
      ::dom::element setAttribute $blobElement mimeType "text/plain"
      ::dom::element setAttribute $blobElement encoding stream
    }
    if {$deletePdfFile} {
      file delete -force $blobFile
    }
  }
}
# serialize the tree without a trailing newline
set xmlToReturn [string trimright [::dom::DOMImplementation serialize $docNode] "\n"]
# drop the second line if it starts a DOCTYPE declaration
set lines [split $xmlToReturn "\n"]
if {[string match "<!D*" [lindex $lines 1]]} {
  set xmlToReturn [join [lreplace $lines 1 1] "\n"]
}
# return the (modified) xml data
puts -nonewline $xmlToReturn
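
Because a preprocessor communicates only via stdin and stdout, it can be tried out independently of the Search Server by piping a sample document through it. The following sketch (Python, for illustration) does this with a stand-in program that simply copies its input. The helper name run_preprocessor and the sample document are hypothetical.

```python
import subprocess
import sys
import xml.etree.ElementTree as ET


def run_preprocessor(command: list, document: str) -> str:
    """Feed a document to a preprocessor on stdin and return what it writes to stdout."""
    result = subprocess.run(
        command, input=document, capture_output=True, text=True, check=True
    )
    return result.stdout


# Stand-in preprocessor: a one-line program that copies stdin to stdout.
ECHO = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read())"]

SAMPLE = (
    '<ses-indexDoc docId="1" collection="cm-contents" mimeType="text/plain">'
    '<title encoding="plain">Test</title>'
    "</ses-indexDoc>"
)

output = run_preprocessor(ECHO, SAMPLE)
ET.fromstring(output)  # raises ParseError if the result is not well-formed XML
print(output == SAMPLE)
```

To exercise a real preprocessor, ECHO would be replaced with the configured program and its arguments, for example the Tcl interpreter followed by the path of the wrapper script.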