.TH ANNOYANCE-FILTER 1 "4 AUG 2004"
.UC 4
.SH NAME
annoyance-filter \- automatically detect junk mail
.nh
.SH SYNOPSIS
.B annoyance-filter
[
.I options
]
.SH DESCRIPTION
.B annoyance-filter
uses Bayesian statistics to determine the probability an E-mail message
is junk based on an analysis of its contents compared to collections of
known junk and legitimate E-mail.
.PP
The current version of this program is always posted at:
.ce 1
http://www.fourmilab.ch/annoyance-filter/
Please visit this page for news about the program and to download the
latest version.
.PP
The project is hosted on SourceForge, where you will find the CVS source
code repository and release archives:
.ce 1
http://sourceforge.net/projects/annoyancefilter/
.SH USAGE
.B annoyance-filter
has a multitude of options which permit it to be used in many different
ways, but the most common application involves
.I training
the program with collections of legitimate and junk mail in order to
create a
.I dictionary
which indicates the probability that words identify a message as junk or
non-junk (legitimate).
Training must be done before the program is used to classify incoming
mail, but need be done subsequently only when adding messages to the
training collections.
As long as the overall content of the mail, junk and legitimate, which
you receive remains pretty much the same, there's no need to retrain, but
the ability to do so allows the program to automatically adapt to
evolving message content, which is particularly characteristic of junk
mail.
.PP
Suppose you have a collection of legitimate mail (in other words, mail
you wish to read) in a file named
.I m\-good
and a collection of junk mail (that which you don't wish to read) in file
.IR m\-junk .
These collections may be in ``Unix mail folder'' format, which is simply
the text of one or more E-mail messages concatenated together in a single
text file, or may be the names of directories containing files, each of
which may be a single E-mail message or a Unix mail folder.
In either case, if a message file is compressed with
.BR gzip ,
it will be automatically uncompressed on the fly.
Directories of messages may not, however, contain other directories of
messages.
.PP
To train
.B annoyance-filter
with these collections and create a dictionary, use a command like:
.PP
.ce 1
.BI "annoyance-filter \-\-mail " m-good " \-\-junk " m-junk " \-\-prune \-\-write " dict.bin
.PP
where
.I dict.bin
is the name of the dictionary file you wish to create.
.PP
Now that the dictionary has been created, you can use it on subsequent
runs to compute the probability a message is junk and classify it
accordingly.
Suppose you have an E-mail message in the file
.IR mail.txt .
To compute its junk probability and display it on standard output, use
the command:
.PP
.ce 1
.BI "annoyance-filter \-\-read " dict.bin " \-\-test " mail.txt
.PP
To integrate
.B annoyance-filter
into a mail processing system such as
.BR procmail ,
you'll usually want to run it as a
.I filter
which reads incoming messages from standard input (piped there by the
mail processing system), classifies them and adds annotations to the
message header indicating the classification, then writes the message
with header annotations to standard output.
The mail processing system may then examine the header annotations and
route the message accordingly.
To filter a message, again assuming the dictionary created by the
training run is in the file
.IR dict.bin ,
use the command:
.PP
.ce 1
.BI "annoyance-filter \-\-read " dict.bin " \-\-transcript \- \-\-test \-"
.PP
Here the
.B \-\-transcript
option requests that the input message be copied to an output file, in
this case standard output, specified by
.RB `` \- ''.
The message itself is read from standard input, as indicated by the
.RB `` \- ''
argument to the
.B \-\-test
option.
.SH OPTIONS
\"#include "annoyance-filter.w" "Options."
.SH "EXIT STATUS"
The program exits with a status of 0 when processing is successfully
completed, 1 when an error (I/O or file access in most cases) occurs, and
2 to indicate a command line syntax error.
If the
.B \-\-classify
option is specified, an exit status of 0 identifies the message tested as
legitimate mail, 3 marks it as junk, and a status of 4 is returned for
messages which cannot be confidently classified as either mail or junk.
.SH FILES
Files are read or written as requested by options on the command line;
all options which read or write files take a
.I fname
argument which gives the file name.
The
.BR \-\-classify ,
.BR \-\-junk ,
.BR \-\-mail ,
.BR \-\-test ,
and
.B \-\-transcript
options interpret an argument of
.RB `` \- ''
as denoting standard input or output.
.PP
On systems which provide the required services and utilities, arguments
to the
.B \-\-junk
and
.B \-\-mail
options may be compressed files or the name of a directory containing one
or more messages which will be read as if logically concatenated.
Messages in the directory may be compressed or uncompressed.
.PP
Error messages and diagnostic output generated when the
.B \-\-verbose
option is specified are written to standard error.
.SH BUGS
Millions, doubtless.
This is a program which must cope with whatever garbage is fed to it from
mail folders, trying to make the best of it.
When it messes up, your efforts in identifying the message which caused
the problem and submitting a verbatim copy of it with your bug report are
much appreciated.
.PP
Please report bugs to
.BR bugs @ fourmilab.ch
and include
.B annoyance-filter
in the Subject line.
Thanks in advance.
.ne 10
.SH AUTHOR
.ce 2
John Walker
http://www.fourmilab.ch/
.PP
This software is in the public domain.
Permission to use, copy, modify, and distribute this software and its
documentation for any purpose and without fee is hereby granted, without
any conditions or restrictions.
This software is provided ``as is'' without express or implied warranty.
.SH "SEE ALSO"
.BR gnuplot (1),
.BR gs (1),
.BR gzip (1),
.BR netpbm (1),
.BR procmail (1),
.BR xpdf (1)
.PP
.B annoyance-filter
is written using the
.I "Literate Programming"
(http://www.literateprogramming.com/)
methodology; the user manual, program, and internal documentation are
developed together, closely interlinked.
Whenever the program is modified, the documentation is automatically
updated, reducing the risk of divergence between what the manual says and
what the program does.
.PP
This
.B man
page is intended as a reference for the command line options and most
common applications of the program.
For comprehensive documentation, including details of how to integrate
.B annoyance-filter
with the
.B procmail
mail processing system, please refer to the complete documentation
published in PDF format, available on the Web at:
.ce 1
http://www.fourmilab.ch/annoyance-filter/annoyance-filter.pdf
.PP
If you have downloaded the
.B annoyance-filter
source distribution, the corresponding version of
.B \%annoyance-filter.pdf
is included in the archive.
You can read PDF files with Acrobat Reader (a free download from
http://www.adobe.com/acrobat/readstep.html) or the
.B xpdf
or Ghostscript
.RB ( gs )
utilities.
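.SH EXAMPLES
The statuses listed under
.B "EXIT STATUS"
can be acted on from a shell script.
The following is a minimal sketch and not part of the program: the
function name
.I classify_label
and the labels it prints are hypothetical, and the invocation shown in
the comments assumes that
.B \-\-classify
is combined with
.B \-\-read
in the same way as
.B \-\-test
in the commands under
.BR USAGE .
.PP
.nf
```shell
# Hypothetical helper: translate an annoyance-filter exit status
# (as documented under EXIT STATUS for --classify) into a label.
# The function name and the labels are illustrative only.
classify_label() {
    case "$1" in
        0) echo legitimate ;;    # message judged to be legitimate mail
        3) echo junk ;;          # message judged to be junk
        4) echo undetermined ;;  # could not be confidently classified
        1) echo error ;;         # I/O or file access error
        2) echo usage-error ;;   # command line syntax error
        *) echo unknown ;;       # any status not documented
    esac
}

# Assumed invocation (dict.bin and mail.txt must already exist):
#   annoyance-filter --read dict.bin --classify mail.txt
#   classify_label $?
```
.fi
.PP
A delivery script could branch on the printed label, or test the exit
status directly, to decide where a message is filed.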