Using external parsers
======================

Since version 2.1, indexer can use external parsers to index
additional file types (MIME types).

A parser is any executable program that converts one MIME type
to text/plain or text/html. For example, if you have PostScript
files, you can use the ps2ascii parser (filter), which reads a PostScript
file from stdin and writes ASCII text to stdout.

We assume the parser sends its output to stdout. If it does not, you have to
write a small shell script that puts the results on stdout. Please feel free to
contribute your scripts and parser configurations to devel@search.udm.net.
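
The wrapper idea above can be sketched as follows. This is only a minimal
illustration: it assumes a hypothetical converter that is called as
"converter INPUT OUTPUT" and writes its result to OUTPUT. The function name
to_stdout is invented for this example; adapt the call to the real
interface of your converter.

```shell
#!/bin/sh
# Hypothetical wrapper for a converter that can only write to a file.
# Assumed interface (not from any real tool): converter INPUT OUTPUT

to_stdout() {
    parser=$1                   # converter command
    input=$2                    # file to convert
    out=$(mktemp) || return 1   # scratch file for the converter's output

    # Run the converter, then copy its result to stdout.
    if "$parser" "$input" "$out"; then
        cat "$out"
        status=0
    else
        status=1
    fi

    rm -f "$out"                # always clean up the scratch file
    return $status
}

# When run as a script: wrapper.sh CONVERTER INPUT
if [ "$#" -ge 2 ]; then
    to_stdout "$1" "$2"
fi
```

Saved as a script, such a wrapper could then be used as the parser command,
with $1 passed as the input file argument.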

Many parsers cannot read from stdin and require a file. In this case indexer
creates a temporary file in /tmp and removes it when the parser exits. Use the
$1 macro in the parser command line to substitute the file name. For example,
the command line for the "catdoc" MS Word to ASCII converter may look like this:

/usr/bin/catdoc -a $1
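
To try a parser command line by hand before adding it to indexer.conf, the
temporary-file behaviour described above can be imitated in the shell. The
sketch below is an approximation for testing only, not indexer's actual
implementation; the function name run_parser is made up for this example.

```shell
#!/bin/sh
# run_parser 'COMMAND LINE WITH $1' < document
# Saves stdin to a temporary file in /tmp, runs the command with $1 set to
# that file name, and removes the file when the command exits.

run_parser() {
    cmd=$1
    tmp=$(mktemp /tmp/parser.XXXXXX) || return 1

    cat > "$tmp"                  # store the document in the temp file

    # "sh -c" sets $0 to "parser" and $1 to the temp file name,
    # so a literal $1 inside the command line is substituted.
    sh -c "$cmd" parser "$tmp"
    status=$?

    rm -f "$tmp"                  # remove the temp file after the parser exits
    return $status
}

# Example (requires catdoc to be installed):
#   run_parser '/usr/bin/catdoc -a $1' < report.doc
```

Note the single quotes around the command line: they keep your interactive
shell from expanding $1 before run_parser sees it.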

Some parsers produce output in a charset different from that of the input.
Specify the output charset so that indexer converts the parser's output to
the proper charset.

The parser command line is optional. Without it, you can still change the
charset or MIME type of a document. For example, to change the MIME type
text/tab-separated-values to text/plain:

# Note - we do not use parser command line
Mime	text/tab-separated-values text/plain


How to set up parsers
=====================

1. Configure web server
-----------------------

Configure your web server to send the appropriate "Content-Type" header.
For Apache, have a look at the mime.types file; most MIME types are already
defined there.
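
One way to check what the server actually sends is to look at the response
headers. The helper below is a small sketch that extracts the Content-Type
value from headers read on stdin; the function name content_type and the
URL in the usage comment are placeholders chosen for this example.

```shell
#!/bin/sh
# content_type: read HTTP response headers on stdin and print the value
# of the first Content-Type header (header name matched case-insensitively).

content_type() {
    tr -d '\r' |                                   # strip CR from CRLF line ends
    awk -F': *' 'tolower($1) == "content-type" { print $2; exit }'
}

# Usage with curl (-s: silent, -I: fetch headers only); the URL is hypothetical:
#   curl -sI http://localhost/docs/manual.ps | content_type
```

The printed value is the MIME type that should appear as the source type in
the matching parser definition.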

2. Edit indexer.conf
--------------------

Uncomment or add lines with parser definitions.
Each line has the following format:

# Parser definition format
Mime <from_mime> <to_mime>[;charset] ["command line [$1]"]
         \           \         \            \         `- temporary file name
          \           \         \            `- full UNIX command line       
           \           \         `- parser's output character set
            \           `- output mime type. text/plain or text/html
             `- source mime type

For example, the following line defines a parser for man pages:

# I use deroff for parsing man pages ( *.man )
Mime application/x-troff-man	text/plain	"deroff"

One more example:

# I like catdoc, but sometimes it produces garbage.
Mime application/msword	text/plain;cp1251	"catdoc -a $1"
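
After editing, it can help to list the parser definitions that are active,
that is, not commented out. The following is a minimal sketch; list_parsers
is a name invented here, and it only matches the keyword spelled "Mime" as
in the examples above.

```shell
#!/bin/sh
# list_parsers FILE: print the lines of FILE that start with "Mime"
# (leading whitespace allowed). Lines beginning with "#" do not match.

list_parsers() {
    grep -E '^[[:space:]]*Mime[[:space:]]' "$1"
}

# Usage (the path to indexer.conf depends on your installation):
#   list_parsers indexer.conf
```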
