Data filters in the AAFID(tm) system.
$Revision: 1.4 $ $Date: 1999/09/03 17:22:54 $

 ======================================================================
 This file is Copyright 1998,1999 by the Purdue Research Foundation and
 may only be used under license.  For terms of the license, see the
 file named COPYRIGHT included with this software release.
 AAFID is a trademark of the Purdue Research Foundation.
 All rights reserved.
 ======================================================================

Introduction.
-------------

One problem in the AAFID system as it was originally implemented is that
each agent is responsible for getting the data it needs from wherever
it can get it. Because data sources (even for the same kind of data)
vary widely from system to system, and even among different
configurations of the same operating system, it is difficult to
implement agents that run unchanged on several operating systems. To
do so, each agent would need to read configuration information or
somehow detect the kind of system it is running on, and adjust its
data sources accordingly.

Even worse, the methods used to obtain the data may be conceptually
different under different operating systems, even for the same kind of
data. One example is the difference between Unix and Windows NT.
Information that under Unix can be obtained simply by reading a log
file may need to be obtained by very different means (system calls,
execution of external programs, etc.) under NT.

However, all of these problems relate only to the step of obtaining
the data. Once the data is obtained, it can be processed in much the
same way, no matter what the operating system.

Filters allow the data collection step to be performed by an entity
other than the one that actually processes the data. We call "filter"
an entity that reads a specific kind of data, puts it in an
appropriate format, and feeds it to the agents that need it.

Another problem that the filters address is that several agents may
need to read from the same data source, but need different parts of
the data, depending on some matching criteria. If every agent has to
read and process the data, the filtering process (getting the data and
discarding the parts that are not needed) has to be done by each
agent. If that work is done by a filter, then the filtering is done
only once, hopefully reducing the work load on the machine where the
agents are running.

Thus, AAFID Filters are intended to perform as:

- A layer of abstraction between the physical origins of the data (log
  files, command outputs, etc.) and the agents.

- A filtering layer that allows each agent to specify the criteria by
  which to select the data it needs, and then to receive only data
  that match those criteria.


How AAFID filters work
----------------------

AAFID filters are a new type of AAFID entity that performs the
functions described above. Each filter is identified by a name
(possibly describing the type of data it produces).

Each agent should now carry as one of its parameters the names of the
filters it needs. When a transceiver is going to start a new agent, it
looks for the names of the filters that the agent needs. Then it
starts those filters, and passes information to the newly-started
agent that allows it to locate them. The agent contacts each filter,
"subscribes" to it by telling it the matching criteria for the data it
needs, and receives a file handle, from which it can read from then on
as if it were a regular file.

If a filter that an agent needs has been started already (because
another agent that started previously needed the same filter), then
the transceiver simply passes the existing contact information to the
agent. Only one filter of each type should exist at a time on each
host.
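
The start-once behaviour described above could be sketched roughly as
follows. This is only an illustration in Python, not the AAFID code
(which is written in Perl); the Transceiver class, the start_filter
callback and the socket paths are all hypothetical names invented for
the example.

```python
class Transceiver:
    """Hypothetical stand-in for the transceiver's filter bookkeeping."""

    def __init__(self, start_filter):
        # start_filter: callable that starts the named filter and
        # returns its contact information (assumed here to be a path).
        self._start_filter = start_filter
        self._contacts = {}  # filter name -> cached contact information

    def contacts_for(self, filters_needed):
        """Return contact info for each needed filter, starting it only
        if no filter of that name has been started before."""
        contacts = {}
        for name in filters_needed:
            if name not in self._contacts:
                self._contacts[name] = self._start_filter(name)
            contacts[name] = self._contacts[name]
        return contacts


started = []  # records which filters were actually started


def fake_start(name):
    # Placeholder for really launching a filter process.
    started.append(name)
    return f"/tmp/aafid-{name}.sock"  # made-up path, for illustration


transceiver = Transceiver(fake_start)
first = transceiver.contacts_for(["Ftcpw"])            # starts Ftcpw
second = transceiver.contacts_for(["Ftcpw", "Other"])  # reuses Ftcpw
```

In this sketch the second agent receives the same contact information
for "Ftcpw" as the first one, and only "Other" is newly started.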


Record semantics
----------------

One of the purposes of the filters is to implement an abstraction
layer between the physical origins of the data and the data that is
used by the agents. For example, a "Network service accesses" filter
should allow an "FTP access monitor" agent to work the same on a
Unix machine or on an NT machine. On the Unix machine, the data may be
obtained from the TCP-Wrappers log, whereas on the NT machine the data
may be obtained by some other means. However, the filters would take
care of that, and provide the agent with a consistent data
format.

Thus, each filter also knows something about the semantics of the data
it produces. Internally, the data is handled as fields, although it is
given to the agent as one text line. This also allows the agent to
specify the matching criteria in terms of the fields of the data,
which makes for a more powerful filtering mechanism.

For example, a "Network service accesses" filter (see the
classes/Filter/Ftcpw.pm file) could produce the following output:

Aug 17 18:13:05 narnia.cs.purdue.edu in.telnetd 4768 connect 723 narnia.cs.purdue.edu

This is the information that can be obtained directly from the
TCP-Wrappers log. The filter, however, knows that the line contains
the following fields:

MONTH DAY TIME DESTHOST DAEMON PID RESULT UID SRCHOST

The agent knows this format (or the output could be produced in a
self-describing format, similar to the CIDF format [1]), and is
therefore able to interpret it accordingly. The field-oriented
interpretation, however, would allow a Windows NT version of the same
filter to produce the same output, even if it has to obtain the
information through very different means.

Furthermore, the field-oriented interpretation would allow the agent
to give the following command to the filter to set the matching
criteria:

SETPATTERN Daemon => "telnetd", SrcHost => ".*\.cs\.purdue\.edu$"

That is, the agent provides a regular expression that has to be
matched for each of the specified fields. This example would cause the
filter to send to the agent only the records for telnet requests that
originated within the Purdue CS department.
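
As an illustration of the field-oriented matching, the following Python
sketch splits the example record into the fields listed above and
checks it against the example pattern. It is not the AAFID
implementation (which is in Perl). The helper names parse_record and
matches are hypothetical, and the sketch assumes that fields are
separated by whitespace, that field names are compared without regard
to case, and that a pattern matches when every listed field matches
its regular expression.

```python
import re

# Field names as listed in the text, in record order.
FIELDS = "MONTH DAY TIME DESTHOST DAEMON PID RESULT UID SRCHOST".split()


def parse_record(line):
    """Split a record line into a field dict (None if malformed)."""
    values = line.split()
    if len(values) != len(FIELDS):
        return None
    return dict(zip(FIELDS, values))


def matches(record, pattern):
    """True if every field named in the pattern matches its regex."""
    return all(
        re.search(regex, record[field.upper()])
        for field, regex in pattern.items()
    )


line = ("Aug 17 18:13:05 narnia.cs.purdue.edu in.telnetd 4768 "
        "connect 723 narnia.cs.purdue.edu")

# Same criteria as the SETPATTERN example above.
pattern = {"Daemon": "telnetd", "SrcHost": r".*\.cs\.purdue\.edu$"}

record = parse_record(line)
wanted = matches(record, pattern)
```

With the example record, wanted is true. A pattern such as
{"Daemon": "ftpd"} would not match the same record, so the filter
would not send it to an agent that set that pattern.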


Implementation specifics
------------------------

The current implementation works as follows:

- Each agent that needs information from a filter will have to define
  a new parameter (in its %PARAMETERS hash) called "FiltersNeeded"
  that contains a list of the filter names that it needs, the initial
  pattern that will be set for each one, and the function to call when
  data from the filter is received.

  In the AAS format, this is specified using the FILTERS keyword, with 
  the following format:

  FILTERS:
    FilterName1 => [ { Initial pattern }, 'SubroutineName1' ],
    FilterName2 => [ { Initial pattern }, 'SubroutineName2' ]

  The corresponding pattern is sent to each filter upon startup. When
  data is received from the filter, the subroutine is called with the
  file handle for the filter (a Unix-domain socket in the current
  implementation) and the message received from it as arguments.

- Filters use Unix-domain sockets to communicate with agents.

- When the transceiver is going to start an agent, it examines the
  agent's FiltersNeeded parameter.

- Any filters that have not been started yet are started by the
  transceiver. As part of its startup, each filter sends the
  path of its Unix-domain server socket to the transceiver. The
  transceiver caches that information.

- The transceiver gives the path of the Unix-domain server socket of
  each filter to the agent. This is done by setting the information in
  the FilterPaths parameter of the agent before running it.

- When it runs, the agent contacts the filter and sends it a
  SETPATTERN command that contains regular expressions defining the
  matching criteria for the data that the agent needs. From then on,
  the agent reads from the socket that it opened to the filter, as if
  it were reading from a regular file. As data that matches its
  criteria becomes available, the filter will provide it.

- The filter needs to keep track of all the agents that have contacted
  it, together with the patterns that each agent has
  provided. Whenever new data becomes available, the filter splits the
  new data into fields, compares it against the patterns of all the
  registered agents, and sends the data to those that match.
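
The filter side of the steps above can be sketched as follows. This is
a simplified Python illustration under several assumptions, not the
AAFID::Filter code. SketchFilter, subscribe and publish are
hypothetical names. A connected socket pair stands in for the named
Unix-domain server socket that an agent would really connect to, the
pattern is passed as an argument instead of being sent in a SETPATTERN
command, and a shortened two-field record is used.

```python
import re
import socket


class SketchFilter:
    """Keeps one (socket, pattern) entry per subscribed agent."""

    def __init__(self, field_names):
        self.field_names = field_names
        self.subscribers = []  # list of (filter-side socket, pattern)

    def subscribe(self, pattern):
        """Register an agent's pattern and return the agent's end of a
        connected socket pair (Unix-domain on Unix systems)."""
        filter_end, agent_end = socket.socketpair()
        self.subscribers.append((filter_end, pattern))
        return agent_end

    def publish(self, line):
        """Split new data into fields and send the line to every agent
        whose pattern matches all of its listed fields."""
        record = dict(zip(self.field_names, line.split()))
        for sock, pattern in self.subscribers:
            if all(re.search(regex, record.get(field, ""))
                   for field, regex in pattern.items()):
                sock.sendall((line + "\n").encode())


flt = SketchFilter(["Daemon", "SrcHost"])
telnet_agent = flt.subscribe({"Daemon": "telnetd"})
ftp_agent = flt.subscribe({"Daemon": "ftpd"})

flt.publish("in.telnetd narnia.cs.purdue.edu")
flt.publish("in.ftpd narnia.cs.purdue.edu")

# Each agent reads from its socket as if it were a regular file.
telnet_line = telnet_agent.makefile().readline()
ftp_line = ftp_agent.makefile().readline()
```

Each agent in the sketch receives only the line that matches its own
pattern, although both lines passed through the same filter.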

The base AAFID::Filter class provides all the base functionality, so
that new filters for specific types of data can be implemented with as
little coding as possible.

See the file How_to_write_filters.txt for more details on writing new
filters, and the file How_to_use_filters.txt for more details on how
an agent can use a filter.

Future ideas
------------

Some ideas that may be interesting to explore:

- Client-driven filters. That is, filters that don't send data to the
  agents as it becomes available, but that detect when an agent reads
  from its corresponding file handle, and then send the current data
  to that agent. This may be useful for filters whose data is
  available continuously (for example, process information).


References
----------

[1] Common Intrusion Detection Framework. http://seclab.cs.ucdavis.edu/cidf/
