################################################################################
#                                                                              #
#  CYRILLIC SUPPORT LIBRARY (C) 2001-2003 Pavel Novikov (pavel@ext.by)         #
#                                                                              #
################################################################################

NOTES:

  This library automatically detects the encoding* of Russian text and performs
  service tasks on that text, such as conversion between the known charsets.

  The most frequently used Russian charsets are supported (windows-1251, koi8-r,
  koi8-u, iso-8859-5, x-mac-cyrillic and ibm866). Further features are still to
  be done. Unicode support is also planned.

  This library is not "string-aware". You _MUST_ specify the block size in every
  call that processes data. The library also provides wrappers around its own
  functions to simplify common tasks.

  Like any other piece of software the library comes with NO WARRANTY.

  * - "encoding" frequently means "charset" in this document and in the comments.

INSTALL:

  --8<----------------------------------------------------------------------
  tar xvzf cyrillic.tar.gz
  cd cyrillic
  make install
  --8<----------------------------------------------------------------------

  or

  --8<----------------------------------------------------------------------
  tar xvzf cyrillic.tar.gz
  cd cyrillic
  make install PFX=/usr/local
  --8<----------------------------------------------------------------------

DEINSTALL:

  --8<----------------------------------------------------------------------
  tar xvzf cyrillic.tar.gz
  cd cyrillic
  make deinstall
  --8<----------------------------------------------------------------------

  or

  --8<----------------------------------------------------------------------
  tar xvzf cyrillic.tar.gz
  cd cyrillic
  make deinstall PFX=/usr/local
  --8<----------------------------------------------------------------------

USAGE:

  To use this library, include the following headers:

  --8<----------------------------------------------------------------------
  #include <cyrillic.h>
  #include <cyrillic_export.h>
  --8<----------------------------------------------------------------------

  Compile your software as follows:

  --8<----------------------------------------------------------------------
  cc ... -lcyrillic
  --8<----------------------------------------------------------------------

FUNCTIONALITY:

  char *_cyr_convert(char *buffer,unsigned long size,const char *table)
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    Converts <size> bytes inside the <buffer> using <table> as the mapping.

    Always returns a pointer to the <buffer>.
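
    The in-place, size-driven conversion described above can be illustrated with
    a minimal sketch. It is not the library's code: the names "sketch_convert"
    and "sketch_table_identity" are hypothetical, and the 256-entry table layout
    is an assumption, since the real layout of <table> is not documented here.

```c
#include <stddef.h>

/* Assumed table layout: one output byte for each of the 256 input bytes.
 * This is an illustration only, not the library's actual table format. */

/* Fill a table with the identity mapping (every byte maps to itself). */
void sketch_table_identity(unsigned char table[256])
{
    for (int i = 0; i < 256; i++)
        table[i] = (unsigned char)i;
}

/* Convert exactly `size` bytes of `buffer` in place. The buffer is treated
 * as a raw block: no NUL terminator is looked for or required.
 * Returns `buffer`. */
char *sketch_convert(char *buffer, unsigned long size,
                     const unsigned char table[256])
{
    for (unsigned long i = 0; i < size; i++)
        buffer[i] = (char)table[(unsigned char)buffer[i]];
    return buffer;
}
```

    Because the size is explicit, embedded zero bytes are converted like any
    other byte, which is what "not string-aware" implies.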

  unsigned int _cyr_convert_char(unsigned int c,const char *table)
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    Converts <c> using <table> as the mapping.

    Returns the converted value.

  int cyr_translate_src_encoding(const char *table)
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    Returns the numeric value of the source charset for the symbolic value
    named <table> or "CYR_TABLE_UNKNOWN" when the charset is unknown. This value
    is later used to select the proper conversion mapping.

  int cyr_translate_dst_encoding(const char *table)
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    Returns the numeric value of the destination charset for the symbolic value
    named <table> or "CYR_TABLE_UNKNOWN" when the charset is unknown. This value
    is later used to select the proper conversion mapping.

  char *cyr_convert(char *buffer,unsigned long size,int table)
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    This is a wrapper for "_cyr_convert" that uses numeric <table> values.

  unsigned int cyr_convert_char(unsigned int c,int table)
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    This is a wrapper for "_cyr_convert_char" that uses numeric <table> values.

  char *cyr_convert_dual(char *buffer,unsigned long size,const char *table_src,const char *table_dst)
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    Converts <size> bytes inside the <buffer> from charset <table_src> to
    charset <table_dst>. It uses global options that define the conversion
    behavior for unknown charsets*.

  char *cyr_convert_dualSE(char *buffer,unsigned long size,const char *table_src)
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    This is a wrapper for "cyr_convert_dual" that uses the saved destination
    charset ("cyr_dst_encoding")*.

  const char *cyr_convert_dualA(const char *buffer,unsigned long size,const char *table_src,const char *table_dst)
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    This is a wrapper for "cyr_convert_dual" that allocates a new
    memory chunk because some buffers cannot be modified. This is
    useful when adding cyrillic support to already existing code.

    If you want to avoid "malloc", use the primary functions instead.

    You _MUST_ free the returned pointer after use.

  const char *cyr_convert_dualASE(const char *buffer,unsigned long size,const char *table_src)
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    This is the same as "cyr_convert_dualSE", but it wraps "cyr_convert_dualA".

  const char *cyr_getrfc2047charset(const char *buffer)
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    Service function that returns a pointer to the symbolic charset name in an
    RFC 2047 encoded string (e.g. =?koi8-r... or "=?koi8-r...) so it can be fed
    into some of the conversion functions.
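
    As an illustration of what locating the charset name involves, here is a
    minimal sketch based on the RFC 2047 encoded-word layout
    (=?charset?encoding?text?=). The name "sketch_rfc2047_charset" and the extra
    length output are assumptions made for this example; the library function
    itself takes only <buffer>.

```c
#include <stddef.h>
#include <string.h>

/* Locate the charset token of an RFC 2047 encoded-word such as
 * "=?koi8-r?B?...?=". One optional leading double quote is skipped.
 * On success, returns a pointer to the first charset character and stores
 * the token length in *len (the token is terminated by '?', not by NUL).
 * Returns NULL when the input does not start with "=?" or when no
 * non-empty charset token followed by '?' is found. */
const char *sketch_rfc2047_charset(const char *s, size_t *len)
{
    if (s == NULL || len == NULL)
        return NULL;
    if (*s == '"')
        s++;
    if (s[0] != '=' || s[1] != '?')
        return NULL;
    s += 2;

    const char *end = strchr(s, '?');
    if (end == NULL || end == s)
        return NULL;

    *len = (size_t)(end - s);
    return s;
}
```

    The returned pointer refers into the original string, so the caller has to
    compare or copy only the first *len bytes.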

  unsigned long _cyr_score_stats(const char *table)
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    Calculates the score for the collected statistics for a specified <table>.

  int _cyr_detect_encoding()
  ~~~~~~~~~~~~~~~~~~~~~~~~~~

    Detects the charset according to the collected statistics.

    Returns the numeric charset value.

  int _cyr_detect_buffer_encoding(const char *buffer,unsigned long size)
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    Does the same for <size> bytes in the <buffer>. All previously collected
    statistics are flushed before the data is analyzed. This is basically
    a wrapper for "_cyr_detect_encoding".

  const char *cyr_detect_encoding()
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    The same as "_cyr_detect_encoding" except that it returns the symbolic value.

  const char *cyr_detect_buffer_encoding(const char *buffer,unsigned long size)
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    The same as "_cyr_detect_buffer_encoding" except that it returns the symbolic value.

  void cyr_flush_encoding_stats()
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    Resets collected statistics.

  void cyr_collect_encoding_stats(const char *buffer,unsigned long size)
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    Collects statistics data for <size> bytes of the <buffer>.
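
    The flush / collect / detect flow can be sketched as follows. This is not
    the library's scoring algorithm, which is not documented here. It swaps in
    a simple, well-known heuristic: lowercase Cyrillic letters occupy 0xE0-0xFF
    in windows-1251 and 0xC0-0xDF in koi8-r, and ordinary text is mostly
    lowercase. All "sketch_*" names are hypothetical, and only these two
    charsets are distinguished.

```c
#include <stddef.h>
#include <string.h>

/* Global byte histogram, standing in for a shared statistics buffer.
 * Like any global state, it makes these functions not thread-safe. */
static unsigned long sketch_stats[256];

/* Reset the collected statistics. */
void sketch_flush_stats(void)
{
    memset(sketch_stats, 0, sizeof sketch_stats);
}

/* Add `size` bytes of `buffer` to the histogram. */
void sketch_collect_stats(const char *buffer, unsigned long size)
{
    for (unsigned long i = 0; i < size; i++)
        sketch_stats[(unsigned char)buffer[i]]++;
}

/* Sum the histogram counts for bytes in the inclusive range [lo, hi]. */
static unsigned long sketch_range(unsigned int lo, unsigned int hi)
{
    unsigned long n = 0;
    for (unsigned int b = lo; b <= hi; b++)
        n += sketch_stats[b];
    return n;
}

/* Guess the charset from the collected statistics.
 * Returns a symbolic name, or NULL when the counts give no preference. */
const char *sketch_detect(void)
{
    unsigned long c0_df = sketch_range(0xC0, 0xDF); /* koi8-r lowercase  */
    unsigned long e0_ff = sketch_range(0xE0, 0xFF); /* win-1251 lowercase */

    if (c0_df == e0_ff)
        return NULL;
    return e0_ff > c0_df ? "windows-1251" : "koi8-r";
}
```

    Statistics can be collected from several chunks before a single detection
    call, which helps when individual chunks are too short to judge.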

  * - see "BEHAVIOR" section.

  More TBD.

BEHAVIOR:

  This library contains a number of functions that operate on Russian texts in different
  charsets. Operations such as converting data from one encoding to another are supported,
  along with the ability to detect the encoding of any Russian text block.

  The current behavior when converting data from one encoding to another is to convert
  the data into the "dos" table first and then into the desired table. Later we should
  reconsider this and speed things up by using somewhat more complicated static
  conversion tables that avoid such a double conversion. In that case, however, the
  number of tables grows dramatically: 6^2 conversion tables instead of 6*2, together
  with a somewhat more complex analysis mode. If you prefer, you can use the "low level"
  functions instead of the "high level" wrappers to avoid this behavior.
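
  The trade-off between the two-step conversion and direct tables can be sketched
  as follows. The names "sketch_convert_via_pivot" and "sketch_compose" are
  hypothetical, and the 256-entry table layout is an assumption made for this
  illustration, not the library's actual format.

```c
#include <stddef.h>

/* Two-step conversion: first map every byte into the pivot charset,
 * then map every pivot byte into the destination charset. */
char *sketch_convert_via_pivot(char *buffer, unsigned long size,
                               const unsigned char to_pivot[256],
                               const unsigned char from_pivot[256])
{
    for (unsigned long i = 0; i < size; i++)      /* pass 1: src -> pivot */
        buffer[i] = (char)to_pivot[(unsigned char)buffer[i]];
    for (unsigned long i = 0; i < size; i++)      /* pass 2: pivot -> dst */
        buffer[i] = (char)from_pivot[(unsigned char)buffer[i]];
    return buffer;
}

/* Precompute a direct src -> dst table by composing the two pivot tables.
 * A single pass with `direct` gives the same result as the two passes. */
void sketch_compose(const unsigned char to_pivot[256],
                    const unsigned char from_pivot[256],
                    unsigned char direct[256])
{
    for (int b = 0; b < 256; b++)
        direct[b] = from_pivot[to_pivot[b]];
}
```

  With six charsets, the pivot scheme needs one table into and one out of the
  pivot per charset (6*2 = 12), whereas a direct table for every ordered pair
  needs 6^2 = 36, or 30 if the identity pairs are left out.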

  The library can collect statistics from a series of Russian text chunks (using
  "cyr_collect_encoding_stats"). "cyr_collect_encoding_stats" uses the global
  statistics buffer "_CYR_ENCODING_STATS", so collecting statistics and then detecting
  the encoding is not thread-safe, even when you use the "high level" functions.
  However, all "low level" functions and those "high level" functions that don't use
  globals are thread-safe. Using the unsafe functions with threads may lead to improper
  data conversion because the global charset flags and options can be overwritten.

  For block level processing some global options are used as follows:

  const char *cyr_tmpl_encoding;
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    This option is used in some software that is not released into the public domain.
    You may use it if your software (e.g. a web interface) supports templates and you
    want to convert them from their original encoding into a user-specified one. This
    option can be set according to the configuration of your software.

    It is not used in the library code.

    Undefined by default.

  const char *cyr_mime_encoding;
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    This option is used in some software that is not released into the public domain.
    You may use it in e-mail processing when the current mime block has a charset field
    in its headers. You can set it at the beginning of the mime block body processing
    and unset it when the processing is done.

    When the block charset is unknown this option is used as a last line of defense
    before defaulting to "cyr_src_encoding". When "cyr_mime_encoding" is not set we
    use "cyr_src_encoding" instead.

    This option is used by "cyr_convert_dual" and dependants.

    Undefined by default.

  const char *cyr_src_encoding;
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    This is the very last line of defense, used for blocks whose charset is either
    unknown or undetectable. This option can be set according to the configuration of
    your software.

    This option is used by "cyr_convert_dual" and dependants.

    Undefined by default.

  const char *cyr_dst_encoding;
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    This option is used in some software that is not released into the public domain.
    You may use it to remember the "user defined" output charset. This option can be
    set according to the configuration of your software.

    This option is used by "*SE" functions and dependants.

    Undefined by default.

  const char *cyr_det_encoding;
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    This is an option that can be set according to the configuration of your software.

    When this option is set to "_CYR_DET_ENCODING_AUTO", "cyr_convert_dual" first tries
    to guess the source encoding of the block. Next, if the option is set to
    "_CYR_DET_ENCODING_SOFT" and the source charset is unknown, another guess is
    attempted. Finally, if the source charset is still unknown and either
    "cyr_mime_encoding" or "cyr_src_encoding" is specified, the source charset is set
    from "cyr_mime_encoding" or "cyr_src_encoding" (see above).

    Default is "_CYR_DET_ENCODING_AUTO".
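
    One possible reading of this fallback order is sketched below. It is an
    interpretation of the description above, not the library's code. The enum,
    the function "sketch_resolve_src" and its parameters are hypothetical:
    <given> stands for the caller's <table_src> (NULL when unknown) and
    <guessed> for the detection result (NULL when nothing was detected).

```c
#include <stddef.h>

/* Hypothetical stand-ins for the two documented detection modes. */
enum sketch_det_mode { SKETCH_DET_SOFT, SKETCH_DET_AUTO };

/* Pick the source charset name, or NULL when nothing is known.
 * Assumed order: AUTO guess, then the caller's charset, then a SOFT guess,
 * then the mime charset, then the configured default source charset. */
const char *sketch_resolve_src(enum sketch_det_mode mode,
                               const char *given, const char *guessed,
                               const char *mime, const char *src)
{
    const char *cs = NULL;

    if (mode == SKETCH_DET_AUTO)
        cs = guessed;               /* AUTO: always try a guess first     */
    if (cs == NULL)
        cs = given;                 /* charset supplied by the caller     */
    if (cs == NULL && mode == SKETCH_DET_SOFT)
        cs = guessed;               /* SOFT: guess only when unknown      */
    if (cs == NULL)
        cs = mime;                  /* last line of defense before src    */
    if (cs == NULL)
        cs = src;                   /* very last line of defense          */
    return cs;
}
```

    Whether an AUTO guess really takes precedence over a charset supplied by
    the caller is an assumption here; check cyrillic.c before relying on it.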

TODO:

  Look into the comments at the top of cyrillic.c :-]
