
                               EMBOSS: siggen
     _________________________________________________________________
   
                                Program siggen
                                       
Function

   Generates a sparse protein signature
   
Description

   siggen parses a multiple structure alignment generated by the EMBOSS
   application scopalign, together with the corresponding files of
   residue contact data generated by the EMBOSS application contacts, and
   generates a protein signature of a specified sparsity.
   
   Each position in the alignment is scored using one, or any
   combination, of up to 3 scoring schemes. A signature of, for example,
   10% sparsity would include data from the 10% of alignment positions
   with the highest scores.
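
   As an illustration of how a sparsity value translates into a set of
   alignment positions, the following Python sketch keeps the
   highest-scoring fraction of positions. It is not taken from the siggen
   source: the function name select_positions, the use of one already
   combined score per position, rounding the count up, and breaking ties
   in alignment order are all assumptions made for the example.

```python
import math


def select_positions(scores, sparsity_percent):
    """Return the indices of the top-scoring positions, in alignment order.

    scores           -- one score per alignment position (assumed to be the
                        already combined score; how siggen combines its
                        scoring schemes is not modelled here)
    sparsity_percent -- percentage of positions to keep, e.g. 10
    """
    if not 0 < sparsity_percent <= 100:
        raise ValueError("sparsity must be greater than 0 and at most 100")

    # Number of positions to keep. Rounding up is an assumption of this sketch.
    n_keep = math.ceil(len(scores) * sparsity_percent / 100)

    # Rank positions by descending score. sorted() is stable, so equal
    # scores keep their alignment order.
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])

    # Report the selected positions in alignment order.
    return sorted(ranked[:n_keep])
```

   For ten positions and a sparsity of 20, the sketch returns the two
   positions with the highest scores.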
   
   The resulting protein signature file is used by the application
   sigscan to find examples of the signature in other proteins.
   
  Signatures
  
   Signatures extend the concept of the motif as a tool for
   characterizing protein families. They consist of a set of N key
   residue positions (A1, A2 ... An), each preceded by a gap (G), thus
   G1A1G2A2...GnAn. Both a residue and a gap can be variable. A signature
   is matched to a protein sequence and scored using a dynamic
   programming algorithm which permits variability in gap distance and
   residue type. Generating a signature involves identifying residues
   associated with points of contact in interactions between secondary
   structure elements. A raw signature consists of a set of positions
   with potential key structural roles sampled from a sequence alignment
   constructed with reference to this contact data. Raw signatures are
   refined by sampling different gap-residue pairs until the specificity
   of a signature for the family cannot be further improved.
   
Usage

   Here is a sample session with siggen:

% siggen
Generates a sparse protein signature
Location of alignment files for input [./]: ./jontest
Extension of alignment files for input [.align]:
Location of contact files for input [./]: ./jontest
Extension of contact files [.con]:
% sparsity of signature [10]:
Generate a randomized signature [N]:
Substitution matrix to be used [./EBLOSUM62]:
Score alignment on basis of residue conservation [Y]:
Score alignment on basis of number of contacts [Y]:
Score alignment on basis of conservation of contacts [Y]: N
Score alignment on a combined measure of number and conservation of contacts [N]:
Ignore alignment postitions with post_similar value of 0 [Y]:
Name of signature file for output [sig.sig]:

Command line arguments

   Mandatory qualifiers (* if not always prompted):
  [-algpath]           string     Location of alignment files for input
  [-algextn]           string     Extension of alignment files for input
  [-sparsity]          integer    % sparsity of signature
  [-randomise]         bool       Generate a randomised signature
*  -seqoption          menu       Select number
*  -datafile           matrixf    Substitution matrix to be used
*  -conoption          menu       Select number
*  -filtercon          bool       Ignore alignment positions making less than
                                  a threshold number of contacts
*  -conthresh          integer    Threshold contact number
*  -conpath            string     Location of contact files for input
*  -conextn            string     Extension of contact files
*  -cpdbpath           string     Location of coordinate files for input
                                  (embl-like format)
*  -cpdbextn           string     Extension of coordinate files (embl-like
                                  format)
*  -filterpsim         bool       Ignore alignment postitions with
                                  post_similar value of 0
  [-sigpath]           string     Location of signature files for output
  [-sigextn]           string     Extension of signature files for output

   Optional qualifiers: (none)
   Advanced qualifiers: (none)
   General qualifiers:
  -help                bool       report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   

   Mandatory qualifiers                         Allowed values          Default
   [-algpath]     Location of alignment files   Any string is accepted  ./
   (Parameter 1)  for input
   [-algextn]     Extension of alignment files  Any string is accepted  .align
   (Parameter 2)  for input
   [-sparsity]    % sparsity of signature       Any integer value       10
   (Parameter 3)
   [-randomise]   Generate a randomised         Yes/No                  No
   (Parameter 4)  signature
   -seqoption     Select number                 1 (Substitution matrix) 3
                                                2 (Residue class)
                                                3 (None)
   -datafile      Substitution matrix to be     Comparison matrix file  ./EBLOSUM62
                  used                          in EMBOSS data path
   -conoption     Select number                 1 (Number)              4
                                                2 (Conservation)
                                                3 (Number and
                                                  conservation)
                                                4 (None)
   -filtercon     Ignore alignment positions    Yes/No                  No
                  making less than a threshold
                  number of contacts
   -conthresh     Threshold contact number      Any integer value       10
   -conpath       Location of contact files for Any string is accepted  /data/contacts/
                  input
   -conextn       Extension of contact files    Any string is accepted  .con
   -cpdbpath      Location of coordinate files  Any string is accepted  /data/cpdbscop/
                  for input (embl-like format)
   -cpdbextn      Extension of coordinate files Any string is accepted  .pxyz
                  (embl-like format)
   -filterpsim    Ignore alignment postitions   Yes/No                  No
                  with post_similar value of 0
   [-sigpath]     Location of signature files   Any string is accepted  ./
   (Parameter 5)  for output
   [-sigextn]     Extension of signature files  Any string is accepted  .sig
   (Parameter 6)  for output

   Optional qualifiers                          Allowed values          Default
   (none)

   Advanced qualifiers                          Allowed values          Default
   (none)
   
Input file format

   siggen reads a multiple structure alignment generated by the EMBOSS
   application scopalign, together with the corresponding files of
   residue contact data generated by the EMBOSS application contacts.
   
Output file format

   The output file (Figure 1) uses the following records. The four SCOP
   classification records are taken from the alignment input file:
   
    1. CL - Domain class. It is identical to the text given after 'Class'
       in the scop classification file (see documentation for the EMBOSS
       application scope).
    2. FO - Domain fold. It is identical to the text given after 'Fold'
       in the scop classification file (see scope documentation).
    3. SF - Domain superfamily. It is identical to the text given after
       'Superfamily' in the scop classification file (see scope
       documentation).
    4. FA - Domain family. It is identical to the text given after
       'Family' in the scop classification file (see scope
       documentation).
    5. NP - Number of signature positions.
    6. NN - Signature position number. The number given in brackets after
       this record indicates the start of the data for the relevant
       signature position.
    7. IN - Informative line about signature position. The number of
       different amino acid residues seen for this position is given
       after 'NRES', the number of different sizes of gap follows 'NGAP',
       and the window size after 'WSIZ'. When a signature is aligned to a
       protein sequence, the permissible gaps between two signature
       positions are determined by the empirical gaps and the window size
       for the C-terminal position (see sigscan.c). Two rows of data for
       the empirical residues and gaps are then given:
    8. AA - The identifier of a residue seen in this position and the
       frequency of its occurrence, delimited by ';'.
    9. GA - The size of a gap seen in this position and the frequency of
       its occurrence, delimited by ';'.
   10. // - Used to delimit data for each signature. The last line of a
       file always contains only '//'.
       
   Example excerpt from an output signature file:
     _________________________________________________________________
   
CL   All beta proteins
XX
FO   Lipocalins
XX
SF   Lipocalins
XX
FA   Fatty acid binding protein-like
XX
NP   2
XX
NN   [1]
XX
IN   NRES 3 ; NGAP 2 ; WSIZ 2
XX
AA   A ; 2
AA   V ; 1
AA   L ; 4
XX
GA   1 ; 5
GA   2 ; 2
XX
NN   [2]
XX
IN   NRES 2 ; NGAP 2 ; WSIZ 5
XX
AA   F ; 1
AA   Y ; 5
XX
GA   12 ; 3
GA   10 ; 2
XX
//
     _________________________________________________________________
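
   The record layout described above can be read with a few lines of
   Python. The sketch below is only an illustration of the format, not
   the parser used by siggen or sigscan; the function name
   parse_signature and the dictionary keys it produces are hypothetical.
   It assumes that each record starts with its two-letter code, that the
   'XX' spacer lines carry no data, and that every IN, AA and GA record
   follows an NN record.

```python
def parse_signature(text):
    """Parse one signature entry into a dictionary (illustrative sketch)."""
    # Hypothetical key names for the four SCOP classification records.
    scop_keys = {"CL": "class", "FO": "fold", "SF": "superfamily", "FA": "family"}
    signature = {"positions": []}
    current = None  # data for the signature position being read

    for line in text.splitlines():
        code, _, value = line.strip().partition(" ")
        value = value.strip()

        if code == "//":
            break  # end of this signature
        if code in scop_keys:
            signature[scop_keys[code]] = value
        elif code == "NP":
            signature["npos"] = int(value)
        elif code == "NN":
            # "[1]" -> 1; starts the data for a new signature position
            current = {"number": int(value.strip("[]")), "residues": [], "gaps": []}
            signature["positions"].append(current)
        elif code == "IN":
            # "NRES 3 ; NGAP 2 ; WSIZ 2" -> nres=3, ngap=2, wsiz=2
            for field in value.split(";"):
                key, number = field.split()
                current[key.lower()] = int(number)
        elif code == "AA":
            residue, freq = (part.strip() for part in value.split(";"))
            current["residues"].append((residue, int(freq)))
        elif code == "GA":
            size, freq = (part.strip() for part in value.split(";"))
            current["gaps"].append((int(size), int(freq)))
        # Any other line (for example the 'XX' spacer) is ignored.

    return signature
```

   Applied to the excerpt above, the sketch yields two positions, the
   first holding the residues A, V and L with their frequencies.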
   
   Important
   
    1. If a signature file is written by hand, the gap data must be
       listed in order of increasing gap size.
    2. In the current implementation, window size records always have the
       value of 0. These should be changed manually unless a very rigid
       pattern is required. A future implementation will provide a range
       of methods for generating values of window size depending upon the
       alignment (window size is identified by the WSIZ record in the
       signature output file).
    3. siggen assumes that an identifier in the input alignment is a
       standard SCOP domain identifier if it is 7 characters long and its
       first character is 'd' or 'D'. In this case the contact data for
       that chain will be parsed. Otherwise contact data for chain 1 will
       be parsed.
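
   Two of the rules above are simple enough to check before running the
   program. The Python sketch below expresses them as small helper
   functions; both names are hypothetical and neither function is part
   of EMBOSS. The first applies only the test stated in note 3 (length
   and first character) and does not try to extract the chain. The
   second assumes that "increasing" in note 1 means strictly increasing.

```python
def is_scop_domain_id(identifier):
    """True if the identifier passes the test in note 3:
    7 characters long and starting with 'd' or 'D'."""
    return len(identifier) == 7 and identifier[0] in "dD"


def gaps_in_increasing_order(gaps):
    """True if (gap_size, frequency) pairs are listed by increasing size.

    Strictly increasing order is an assumption of this sketch.
    """
    sizes = [size for size, _frequency in gaps]
    return all(a < b for a, b in zip(sizes, sizes[1:]))
```

   For example, a made-up identifier such as 'd1abca_' passes the first
   check, and the gap list [(1, 5), (2, 2)] passes the second.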
       
Data files

   siggen reads in a protein residue comparison matrix. By default, this
   is EBLOSUM62.
   
   EMBOSS data files are distributed with the application and stored in
   the standard EMBOSS data directory, which is defined by the EMBOSS
   environment variable EMBOSS_DATA.
   
   To see the available EMBOSS data files, run:
   
% embossdata -showall

   To fetch one of the data files (for example 'Exxx.dat') into your
   current directory for you to inspect or modify, run:

% embossdata -fetch -file Exxx.dat

   Users can provide their own data files in their own directories.
   Project-specific files can be put in the current directory or, for
   tidier directory listings, in a subdirectory called ".embossdata".
   Files for all EMBOSS runs can be put in the user's home directory or,
   again, in a subdirectory called ".embossdata".
   
   The directories are searched in the following order:
     * . (your current directory)
     * .embossdata (under your current directory)
     * ~/ (your home directory)
     * ~/.embossdata
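
   The search order listed above can be sketched in Python as follows.
   This is an illustration only and does not reproduce the EMBOSS lookup
   code: the function name find_data_file and its cwd and home
   parameters are hypothetical, and the sketch covers only the four
   directories in the list, returning None if the file is in none of
   them.

```python
from pathlib import Path


def find_data_file(name, cwd=None, home=None):
    """Return the first match for a data file in the listed directories.

    cwd and home default to the real current and home directories; they
    are parameters only so that the sketch is easy to try out.
    """
    cwd = Path(cwd) if cwd is not None else Path.cwd()
    home = Path(home) if home is not None else Path.home()

    # Same order as the list above.
    search_dirs = [cwd, cwd / ".embossdata", home, home / ".embossdata"]

    for directory in search_dirs:
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None  # not found in any of the four directories
```

   A copy of a matrix file placed in the current directory is therefore
   found before a copy of the same name in ~/.embossdata.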
       
Notes

   None.
   
References

   Ison JC, Blades MJ, Bleasby AJ, Daniel SC, Parish JH "Key residues
   approach to the definition of protein families and analysis of sparse
   family signatures" (2000) PROTEINS: Structure, Function and Genetics
   40:330-341
   
Warnings

   None.
   
Diagnostic Error Messages

   None.
   
Exit status

   It always exits with status 0.
   
Known bugs

   None.
   
See also

   Program name  Description
   contacts      Reads coordinate files and writes contact files
   dichet        Parse dictionary of heterogen groups
   interface     Reads coordinate files and writes inter-chain contact files
   psiblasts     Runs PSI-BLAST given scopalign alignments
   scopalign     Generate alignments for SCOP families
   seqsort       Removes ambiguities from a set of hits resulting from a
                 database search
   sigscan       Scans a sparse protein signature against swissprot
   
Author(s)

   This application was written by Jon Ison (jison@hgmp.mrc.ac.uk)
   
History

   Written (June 2001) - Jon Ison.
   
Target users

   This program is intended to be used by everyone and everything, from
   naive users to embedded scripts.
   
Comments
