

1 Introduction
**************

1.1 Background
==============

   The _mairix_ program arose from a need to index and search hundreds or
thousands of email messages in an efficient way.  It started off
supporting just Maildir format folders, but now MH format is also
supported.

   I use the _mutt_ email client.  There are many features I like about
mutt, including

   * Speed (typical of many command line tools versus GUI counterparts)

   * Threaded message display

   * Customizability (e.g. varying my signature depending on who I'm
     replying to)

   * Little things, like if I reply to a message I wrote, it starts a
     new message to the same recipients (obvious, but which other
     mailers do this?)

   _mutt_ has a feature called _limit_, where the display of messages in
the current folder can be filtered based on matching regular
expressions in particular parts of the messages.  I find this really
useful.  But there is a snag - it only works on the current folder.  If
you have messages spread across many folders, you're out of luck with
limit.  OK - so why not keep all messages in a single folder?  The
problem is that the performance drops badly.  (I think this is true
regardless of folder format - mbox, maildir etc, though probably worse
for some formats than others.)

   So on the one hand, we want small folders to keep the performance
high.  But on the other hand, we want useful searching.

   I use the maildir format for my folders.  This scheme has one file
per message.  For my inboxes(1), I like this for two reasons:

   * Fast deletion of messages I don't want to keep (spam, circulars,
     mailing list threads I'm not interested in etc).  (Compare mbox,
     where the whole file would need to be rewritten.)

   * No locking issues whatever.  (Maybe I'm over cautious, but I don't
     really trust all that locking stuff to protect a single mbox file
     in all cases, and a single file seems just too vulnerable to
     corruption.)

   Since I'm using maildir for inboxes, I just use it for all my
folders, for uniformity.

   So, I hear you ask, if you use a one-file-per-message format, why
not just use find + egrep to search for messages?  I saw the following
problems with this:

   * What if I want to find all messages to/cc me, from Homer Simpson,
     dated between 1 and 2 months ago, with the word "wubble" in the
     body?  This would involve a pretty nasty set of regexps in a
     pipeline of separate egreps (and bear in mind, headers could be
     split over line boundaries...)

   * What if the message body has quoted-printable (or worse, base64)
     transfer encoding?  The egrep for "wubble" could come very unstuck.

   * How would the matching messages be conveniently arranged into a new
     folder to allow browsing with mutt?

   * What if I wanted to see all messages in the same threads as those
     matching the above condition?

   * If I had thousands of messages, this wasn't going to be quick,
     especially if I wanted to keep tuning the search condition.(2)

   So find + egrep was a non-starter.  I looked around for other
technology.  I found _grepmail_, but this only works for mbox format
folders, and involved scanning each message every time (so lost on the
speed issue).

   I decided that this was going to be my next project, and mairix was
born.  By the way, the name comes by abbreviating _MAildIR IndeX_.

   ---------- Footnotes ----------

   (1) of which I have many, because I (naturally) use _procmail_ to
split my incoming mail

   (2) This may be a non-issue for people with the latest technology
under their desk, but I have a 1996 model 486 at home.

2 Installation
**************

   There is not much to this.

   Edit the `Makefile' to set `CC', `CFLAGS' and `prefix' as you want.

   Type `make'.

   Type `make install' (for which you may need to be root).

   Type `make docs' (or `make mairix.txt', `make mairix.html' or
whatever.)

   Create a `~/.mairixrc' file.  An example is included in the file
`mairixrc.eg'.  Just copy that to `~/.mairixrc' and edit it.

3 Use
*****

3.1 Overview of use
===================

   _mairix_ has two modes of use: index building and searching.  The
searching mode runs whenever the command line contains any expressions
to search for.  Otherwise, the indexing mode is run.

   The output of the search mode is usually placed in a _virtual
folder_.  This is just a normal maildir directory (i.e. containing
`new', `tmp' and `cur' subdirectories), or a MH directory, so you can
open it as a normal folder in your mail program.  You configure the
path for this virtual folder in your `~/.mairixrc' file.  mairix will
populate the virtual folder with symbolic links pointing to the paths
of the real messages that were matched by the search expression.(1)

   If desired, mairix can produce just a list of files that match the
search expression and omit the building of the virtual folder.  This
mode of operation may be useful in communicating the results of the
search to other programs.
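
   As a rough illustration of what a maildir-format virtual folder
amounts to on disk, the following shell sketch builds one by hand in a
temporary directory.  The folder names, the message file name, the link
name and the choice of `cur' for the link are all made up for the
example; mairix picks its own link names, and this is not its actual
code.

```shell
# Build a throwaway mail hierarchy (all names here are hypothetical).
base=$(mktemp -d)
mkdir -p "$base/inbox/cur" "$base/inbox/new" "$base/inbox/tmp"
mkdir -p "$base/vfolder/cur" "$base/vfolder/new" "$base/vfolder/tmp"

# A stand-in for a real message file in a real folder.
msg="$base/inbox/cur/1000000000.1234.host"
printf 'Subject: wubble\n\nHello.\n' > "$msg"

# What a search hit amounts to: a symlink in the virtual folder that
# points back at the real message.
ln -s "$msg" "$base/vfolder/cur/hit-0001"

# The link target shows which real message was matched.
readlink "$base/vfolder/cur/hit-0001"
```

   Opening the `vfolder' directory in a mail program would then show
the linked message as if it lived there, while the original stays in
its own folder.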

   ---------- Footnotes ----------

   (1) Although symlinks use up more inodes than hard links, I decided
they were more useful because they make it possible to see the filenames
of the original messages via `ls -l'.

3.2 Indexing strategy and search capabilities
=============================================

   _mairix_ works exclusively in terms of _words_.  The index that's
built in non-search mode contains a table of which words occur in which
messages.  Hence, the search capability is based on finding messages
that contain particular words.  _mairix_ defines a word as any string of
alphanumeric characters + underscore.  Any whitespace, punctuation,
hyphens etc are treated as word boundaries.

   _mairix_ has special handling for the To:, Cc: and From: headers.
Besides the normal word scan, these headers are scanned a second time,
where the characters `@', `-' and `.' are also treated as word
characters.  This allows most (if not all) email addresses to appear in
the database as single words.  So if you have a mail from
wibble@foobar.zzz, it will match on both these searches

     mairix f:foobar
     mairix f:wibble@foobar.zzz
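
   The two scans can be approximated with `tr'.  This is only a sketch
of the word rules as described above, not mairix's own scanner; the
function names `split_words' and `split_address_words' are invented for
the example.

```shell
# Normal scan: anything that is not alphanumeric or underscore is a
# word boundary.
split_words() {
    printf '%s\n' "$1" | tr -cs 'A-Za-z0-9_' '\n' | sed '/^$/d'
}

# Second scan for To:, Cc: and From: values: '@', '-' and '.' also
# count as word characters, so an address stays in one piece.
split_address_words() {
    printf '%s\n' "$1" | tr -cs 'A-Za-z0-9_@.-' '\n' | sed '/^$/d'
}

split_words 'wibble@foobar.zzz'          # three separate words
split_address_words 'wibble@foobar.zzz'  # one word: the whole address
```

   Between them the two scans yield `wibble', `foobar', `zzz' and
`wibble@foobar.zzz', which is why both of the searches above match.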

   It should be clear by now that the searching cannot be used to find
messages matching general regular expressions.  Personally, I don't
find that much use anyway for locating old messages - I'm far more
likely to remember particular keywords that were in the messages, or
details of the recipients, or the approximate date.

   It's also worth pointing out that there is no 'locality' information
stored, so you can't search for messages that have one word 'close' to
some other word.  For every message and every word, there is a simple
yes/no condition stored - whether the message contains the word in a
particular header or in the body.  So far this has proved to be
adequate.  mairix has a similar feel to using an Internet search engine.
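
   To make the idea of a word table concrete, here is a toy version in
shell.  It is a sketch under heavy assumptions: the two "messages" are
made up, the index is a flat text file rather than mairix's database
format, and it ignores the header/body distinction and any transfer
decoding.  The names `build_index' and `lookup' are invented for the
example.

```shell
# Two tiny stand-in "messages" (contents are made up for the example).
work=$(mktemp -d)
mkdir "$work/mail"
printf 'Subject: null pointer\n\nA wubble appeared.\n' > "$work/mail/msg1"
printf 'Subject: chrony\n\nNo wubble here, just a pointer.\n' > "$work/mail/msg2"

# Emit one "word<TAB>message" line per distinct word per message.
build_index() {
    for f in "$1"/*; do
        tr -cs 'A-Za-z0-9_' '\n' < "$f" | sort -u |
            awk -v m="$(basename "$f")" 'NF { print $0 "\t" m }'
    done
}

# List the messages that contain the given word.
lookup() {
    awk -F '\t' -v w="$2" '$1 == w { print $2 }' "$1"
}

build_index "$work/mail" > "$work/index"
lookup "$work/index" wubble    # both messages contain it
lookup "$work/index" chrony    # only msg2 contains it
```

   Because each entry is a plain yes/no fact about one word and one
message, there is nothing in such a table from which word order or
closeness could be recovered.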

   There are three further searching criteria that are supported
(besides word searching):

   * Searching for messages whose Date: header is in a particular range

   * Searching for messages whose size is in a particular range.  (I
     see this being used mainly for finding 'huge' messages, as you're
     most likely to want to cull these to recover disc space.)

   * Searching for messages with a particular substring in their paths.
     You can use this feature to limit the search to particular
     folders in your mail hierarchy, for example.

3.3 The `~/.mairixrc' file
==========================

   This file contains information about where you keep your maildir
folders, where you want the index file to be stored and where you want
the virtual folder to be, into which the search mode places the
symlinks.

   mairix searches for this file at `~/.mairixrc' unless you specify the
`-f' command line option.

   If a # character appears in the file, the rest of that line is
ignored.  This allows you to specify comments.

   There are 3 entries (`base', `vfolder' and `database') that must
appear in the file, except that `vfolder' may be left out when
`vfolder_format' is `raw'.  Also, either `folders' or `mh_folders' (or
both) must appear.  Optionally, the `vfolder_format' entry may appear.
An example illustrates:

     base=/home/richard/mail
     folders=new-mail:new-chrony:new-lojban:new-jbofihe
     folders=recent...:ancient...
     mh_folders=an_mh_folder
     vfolder=vfolder
     vfolder_format=maildir
     database=/home/richard/.mairix_database

   The keys are as follows:

base
     This is the path to the common parent directory of all your
     maildir and MH folders.

folders
     This is a colon-separated list of the Maildir folders (relative to
     `base') that you want indexed.  Any entry that ends `...' is
     recursively scanned to find any Maildir folders underneath it.

     More than one line starting with `folders' can be included.  In
     this case, mairix joins the lines together with colons as though a
     single list of folders had been given on a single very long line.

     If a folder name contains a colon, you can write this by using the
     sequence `\:' to escape the colon.  Otherwise, the backslash
     character is treated normally.  (If the folder name actually
     contains the sequence `\:', you're out of luck.)

mh_folders
     This is a colon-separated list of the MH folders (relative to
     `base') that you want indexed.  Any entry that ends `...' is
     recursively scanned to find any MH folders underneath it.

     More than one line starting with `mh_folders' can be included.  In
     this case, mairix joins the lines together with colons as though a
     single list of folders had been given on a single very long line.

vfolder
     This defines the name of the _virtual_ folder (within the directory
     specified by `base') into which the search mode writes its output.
     If the vfolder_format used is `raw', then this setting is not used
     and may be excluded.

vfolder_format
     This defines the type of folder used for the _virtual folder_
     where the search results go.  There are three valid settings for
     this: `mh', `maildir' or `raw'.  If the `raw' setting is used then
     mairix will just print out the path names of the files that match
     and no virtual folder will be created.  `maildir' is the default
     if this option is not defined.  The setting is case-insensitive.

database
     This defines the path where mairix's index database is kept.  You
     can keep this file anywhere you like.

   It is illegal to have a folder listed twice.  Once mairix has built
a list of all the messages currently in your folders, it will search
for duplicates before proceeding.  If any duplicates are found (arising
from the same folder being specified twice), it will give an error
message and exit.  This is to prevent corrupting the index database
file.
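
   As a further sketch, the shell fragment below writes a second sample
file that uses the `raw' format and an escaped colon.  All the paths and
folder names are hypothetical, and the file goes to a scratch location
rather than `~/.mairixrc' so that nothing real is overwritten.  The
`vfolder' entry is left out on the strength of the description of
`vfolder' above.

```shell
# Write a sample rc file to a scratch location (paths are hypothetical).
rc=$(mktemp)
cat > "$rc" <<'EOF'
# Common parent of every folder listed below.
base=/home/richard/mail
# 'lists...' is scanned recursively for Maildir folders.
folders=inbox:lists...
# A folder whose name really is 'odd:name'.
folders=odd\:name
# Print the matching paths instead of building a virtual folder,
# so no 'vfolder' entry is given.
vfolder_format=raw
database=/home/richard/.mairix_database
EOF

# Point mairix at it with the -f option, for example:
#   mairix -f "$rc"              (build or update the index)
#   mairix -f "$rc" s:chrony     (print paths of matching messages)
echo "wrote $rc"
```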

3.4 Setting up the virtual folder
=================================

   The virtual folder needs to exist before you can run the search mode.

   If you've got `vfolder_format=maildir' (the default), you can just
create the necessary directory structure:

     mkdir -p /home/richard/mail/vfolder
     mkdir /home/richard/mail/vfolder/new
     mkdir /home/richard/mail/vfolder/cur
     mkdir /home/richard/mail/vfolder/tmp

   If you've got `vfolder_format=mh', the best strategy probably
depends on your mail client.  For mutt, you could either do

     mkdir -p /home/richard/mail/vfolder
     touch /home/richard/mail/vfolder/.mh_sequences

which seems to work, or, within mutt, you could set `mbox_type' to
`mh' and save a message to `+vfolder' to have mutt set up the structure
for you.

   If you use Sylpheed, the best way seems to be to create the new
folder from within Sylpheed.  This seems to be all you need to do.

3.5 Command line options
========================

   The command line syntax is

     mairix [-f path] [-p] [-v] [-t] [-a] [-o vfolder] [expr1] ... [exprn]

   The `-f' or `--rcfile' flag allows a different path to the
`mairixrc' file to be given, replacing the default of `~/.mairixrc'.

   The `-p' or `--purge' flag is used in indexing mode.  Indexing works
incrementally.  When new messages are found, they are scanned and
information about the words they contain is appended onto the existing
information.  When messages are deleted, holes are normally left in the
message sequence.  These holes take up space in the database file.
This flag will compress the deleted paths out of the database to save
space.

   The `-v' or `--verbose' flag is used in indexing mode.  It causes
more information to be shown during the indexing process.  In search
mode, it causes debug information to be shown if there are problems
creating the symlinks.  (Normally this information is suppressed because
it would be an annoyance: if a message matches multiple queries when
using `-a', mairix will try to create the same symlink multiple times.
The repeated attempts fail, which is what prevents the same message
being shown multiple times in the virtual folder.)

   The `-t' or `--threads' option applies to search mode.  Normally,
only the messages matching all the specified expressions are included
in the _virtual folder_ that is built.  With the `-t' flag, any message
in the same thread as one of the matched messages will be included too.
Note, the threading is based on processing the Message-ID, In-Reply-To
and References headers in the messages.  Some mailers don't generate
these headers in a co-operative way and will cause problems with this
threading support.  (Outlook seems to be one culprit.)

   The `-a' or `--augment' option also applies to search mode.
Normally, the first action of the search mode is to clear any existing
message links from the virtual folder.  With the `-a' flag, this step is
suppressed.  It allows the folder contents to be built up by matching
with 2 or more diverse sets of match expressions.  If this mode is
used, and a message matches multiple queries, only a single symlink
will be created for it.

   The `-o' or `--vfolder' option is used in search mode to select a
virtual folder other than the one specified in the `mairixrc' file.
The path given after this flag is relative to the folder base directory
given in the `mairixrc' file, in the same way as the `vfolder' entry in
that file is.  So if your `mairixrc' file contains

     base=/home/foobar/Mail

   and you run mairix like this

     mairix -o vfolder2 make+money+fast

   mairix will find some of your saved junk emails and put the results
into `/home/foobar/Mail/vfolder2'.

   The search mode runs when there is at least one search expression.
Search expressions can take forms such as (in increasing order of
complexity):

   * A date expression.  This matches all messages whose `Date:' header
     lies within the given range.  Note, the time of day and timezone
     of the `Date:' header are ignored for simplicity.  For example, to
     match all messages sent between 3 months ago and 1 month ago the
     following command can be used:

          mairix d:3m-1m

     To match all messages older than 2 years, the following command
     can be used:

          mairix d:-2y

     To match all messages newer than 2 weeks, the following command
     can be used:

          mairix d:2w-

   * A size expression.  This matches all messages whose size in bytes
     is in a particular range.  For example, to match all messages
     bigger than 1 megabyte, the following command can be used:

          mairix z:1m-

     To match all messages between 10kbytes and 20kbytes in size, the
     following command can be used:

          mairix z:10k-20k

   * A word, e.g. `pointer'.  This matches any message with the word
     `pointer' in the To, Cc, From or Subject headers, or in the
     message body.(1)

   * A word in a particular part of the message, e.g. `s:pointer'.  This
     matches any message with the word `pointer' in the subject.  The
     qualifiers for this are:

    t:pointer
          to match `pointer' in the To: header,

    c:pointer
          to match `pointer' in the Cc: header,

    a:pointer
          to match `pointer' in the To:, Cc: or From: headers (`a'
          meaning `address'),

    f:pointer
          to match `pointer' in the From: header,

    s:pointer
          to match `pointer' in the Subject: header,

    b:pointer
          to match `pointer' in the message body.

     Multiple fields may be specified, e.g. `sb:pointer' to match in the
     Subject: header or the body.

   * A negated word, e.g. `s:~pointer'.  This matches all messages that
     don't have the word `pointer' in the subject line.

   * A substring match, e.g. `s:point='.  This matches all messages
     containing a word in their subject line where the word has `point'
     as a substring, e.g. `pointer', `disappoint'.

   * An approximate match, e.g. `s:point=1'.  This matches all messages
     containing a word in their subject line where the word has `point'
     as a substring with at most one error, e.g. `jointed' contains
     `joint' which can be got from `point' with one letter changed.  An
     error can be a single letter changed, inserted or deleted.

   * A disjunction, e.g. `s:pointer,dereference'.  This matches all
     messages with one or both of the words `pointer' and `dereference'
     in their subject lines.

   * Each disjunction may be a conjunction, e.g.
     `s:null+pointer,dereference=2' matches all messages whose subject
     lines either contain both the words `null' and `pointer', or
     contain the word `dereference' with up to 2 errors (or both).

   * A path expression.  This matches all messages with a particular
     substring in their path.  The syntax is very similar to that for
     words within the message (above), and all the rules for `+', `,',
     approximate matching etc are the same.  The word prefix used for a
     path expression is `p:'.  Examples:

          mairix p:/archive/

     matches all messages with `/archive/' in their path, and

          mairix p:wibble=1 s:wibble=1

     matches all messages with `wibble' in their path and in their
     subject line, allowing up to 1 error in each case (the errors may
     be different for a particular message.)

     Path expressions always use substring matches and never exact
     matches (it's very unlikely you want to type in the whole of a
     message path as a search expression!)  There is a limit of 32
     characters on the match expression.


   The binding order of the constructions is:

  1. Individual command line arguments define separate conditions which
     are AND-ed together

  2. Within a single argument, the letters before the colon define which
     message parts the expression applies to.  If there is no colon,
     the expression applies to all the headers listed earlier and the
     body.

  3. After the colon, commas delineate separate disjuncts, which are
     OR-ed together.

  4. Each disjunct may contain separate conjuncts, which are separated
     by plus signs.  These conditions are AND-ed together.

  5. Each conjunct may start with a tilde to negate it, and may be
     followed by an equals sign to indicate a substring match,
     optionally followed by an integer to define the maximum number of
     errors allowed.
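
   The binding order can be made visible with a small shell function
that takes one word-search argument apart and prints its structure.
This is only an illustration of the rules listed above, not mairix's own
parser: the name `explain_expr' and the output layout are invented, it
does not handle date (`d:') or size (`z:') expressions, and treating a
bare `=' as "zero errors" is an assumption.

```shell
# Print the structure of one word-search argument.
explain_expr() {
    arg=$1
    # Letters before the first colon select the message parts.
    case $arg in
        *:*) fields=${arg%%:*}; rest=${arg#*:} ;;
        *)   fields=all;        rest=$arg ;;
    esac
    echo "fields: $fields"
    saved_ifs=$IFS
    set -f                      # no filename globbing while splitting
    IFS=,
    for disjunct in $rest; do   # commas separate OR-ed alternatives
        echo "  disjunct: $disjunct"
        IFS=+
        for conjunct in $disjunct; do   # plus signs separate AND-ed terms
            negated=no
            case $conjunct in
                '~'*) negated=yes; conjunct=${conjunct#?} ;;
            esac
            case $conjunct in
                *=*) word=${conjunct%%=*}
                     errors=${conjunct#*=}
                     match="substring max_errors=${errors:-0}" ;;
                *)   word=$conjunct
                     match=exact ;;
            esac
            echo "    conjunct: word=$word negated=$negated match=$match"
        done
    done
    set +f
    IFS=$saved_ifs
}

explain_expr 's:null+pointer,dereference=2'
explain_expr 's:~pointer'
```

   For the first call this prints two disjuncts, the first holding the
two conjuncts `null' and `pointer', the second holding `dereference' as
a substring match with up to 2 errors.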


   Now some examples.  Suppose my email address is
<richard@doesnt.exist>.

   The following will match all messages newer than 3 months from me
with the word `chrony' in the subject line:

     mairix d:3m- f:richard+doesnt+exist s:chrony

   Suppose I don't mind a few spurious matches on the address, I want a
wider date range, and I suspect that some messages I replied to might
have had the subject keyword spelt wrongly (let's allow up to 2 errors):

     mairix d:6m- f:richard s:chrony=2

   ---------- Footnotes ----------

   (1) Message body is taken to mean any body part of type text/plain
or text/html.  For text/html, text within meta tags is ignored.  In
particular, the URLs inside <A HREF="..."> tags are not currently
indexed.  Non-text attachments are ignored.  If there's an attachment
of type message/rfc822, this is parsed and the match is performed on
this sub-message too.  If a hit occurs, the enclosing message is
treated as having a hit.

