
                              EMBOSS: sigscan
     _________________________________________________________________
   
                                Program sigscan
                                       
Function

   Scans a sparse protein signature against swissprot
   
Description

   sigscan scans a signature, such as that generated by the EMBOSS
   application siggen, against a protein sequence database and generates
   files of scored hits and corresponding alignments. An (optionally
   grouped) scop families file can be provided, in which case a
   classification of hits is given in the signature hits output file.
   See the documentation for the EMBOSS application psiblasts for an
   explanation of the scop families file, and the documentation for
   groups for information on how to group it.
   
  Signatures
  
   Signatures extend the concept of the motif as a tool for
   characterizing protein families. They consist of a set of N key
   residue positions (A1, A2 ... An) preceded by gaps (G), thus
   G1A1G2A2...GnAn. Both a residue and a gap can be variable. A signature
   is matched to a protein sequence and scored using a dynamic
   programming algorithm which permits variability in gap distance and
   residue type. Generating a signature involves identifying residues
   associated with points of contact in interactions between secondary
   structure elements. A raw signature consists of a set of positions
   with potential key structural roles sampled from a sequence alignment
   constructed with reference to this contact data. Raw signatures are
   refined by sampling different gap-residue pairs until the specificity
   of a signature for the family cannot be further improved.
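
   The matching idea can be illustrated with a small sketch. This is
   not sigscan's implementation: the SigPos class, the score values and
   the rule that the first gap is measured from the start of the
   sequence are all assumptions made for illustration, and the sketch
   only accepts gap lengths and residues listed in the signature (it
   omits the substitution matrix and gap penalties).

```python
from dataclasses import dataclass

NEG = float("-inf")


@dataclass
class SigPos:
    """One signature position: a variable gap followed by a variable residue.

    Hypothetical layout, not the sigscan signature file format.
    """
    gap_scores: dict      # gap length -> score
    residue_scores: dict  # residue letter -> score


def match_signature(signature, sequence):
    """Best total score for placing all signature positions, in order.

    prev[j] holds the best score with the latest placed signature
    position sitting on residue j of the sequence (-inf if impossible).
    """
    n = len(sequence)
    first = signature[0]
    prev = [NEG] * n
    for j in range(n):
        # Assumption: the first gap counts residues before position j.
        g = first.gap_scores.get(j)
        r = first.residue_scores.get(sequence[j])
        if g is not None and r is not None:
            prev[j] = g + r

    for pos in signature[1:]:
        cur = [NEG] * n
        for j in range(n):
            r = pos.residue_scores.get(sequence[j])
            if r is None:
                continue
            for gap, g in pos.gap_scores.items():
                i = j - gap - 1  # index of the previous key residue
                if i >= 0 and prev[i] > NEG:
                    cur[j] = max(cur[j], prev[i] + g + r)
        prev = cur
    return max(prev)


# Two-position toy signature with made-up scores.
toy_signature = [
    SigPos(gap_scores={0: 1, 1: 2}, residue_scores={"A": 5}),
    SigPos(gap_scores={2: 3}, residue_scores={"C": 4, "D": 1}),
]
toy_score = match_signature(toy_signature, "MAKKC")
```

   A result of -inf means that no placement satisfies every position of
   the signature, which corresponds loosely to requiring a complete
   signature-sequence fit.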
   
Usage

   Here is a sample session with sigscan:

% sigscan
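
   For scripted use, the qualifiers documented under "Command line
   arguments" can be supplied explicitly. The sketch below only builds
   the argument list; it assumes the usual EMBOSS "-qualifier value"
   syntax and uses the default values listed in this document. The
   helper name sigscan_command is hypothetical.

```python
import shlex

# Defaults as listed in the qualifier table of this document.
DEFAULTS = {
    "sigin": "test.sig",
    "database": "./test.seq",
    "targetf": "test.fam",
    "thresh": 20,
    "sub": "./EBLOSUM62",
    "gapo": 10.0,
    "gape": 0.5,
    "nterm": 1,
    "nhits": 100,
    "hitsf": "test.hits",
    "alignf": "test.align",
}


def sigscan_command(**overrides):
    """Return an argv list for sigscan, with defaults overridden by keyword."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown qualifier(s): {sorted(unknown)}")
    options = {**DEFAULTS, **overrides}
    argv = ["sigscan"]
    for name, value in options.items():
        argv += [f"-{name}", str(value)]
    return argv


# Show the command line as it would be typed at the shell prompt.
print(shlex.join(sigscan_command(nhits=50, nterm=2)))
```

   The returned list can be passed to subprocess.run() on a system
   where sigscan is installed.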

Command line arguments

   Mandatory qualifiers:
  [-sigin]             infile     Name of signature file for input
  [-database]          seqall     Name of sequence database to search
  [-targetf]           infile     Name of (optionally grouped) scop families
                                  file for input
  [-thresh]            integer    Minimum length (residues) of overlap
                                  required for two hits with the same code to
                                  be counted as the same hit.
  [-sub]               matrixf    Residue substitution matrix
  [-gapo]              float      The gap insertion penalty is the score taken
                                  away when a gap is created. The best value
                                  depends on the choice of comparison matrix.
                                  The default value assumes you are using the
                                  EBLOSUM62 matrix for protein sequences, and
                                  the EDNAMAT matrix for nucleotide sequences.
  [-gape]              float      The gap extension penalty is added to the
                                  standard gap penalty for each base or
                                  residue in the gap. This is how long gaps
                                  are penalized. Usually you will expect a few
                                  long gaps rather than many short gaps, so
                                  the gap extension penalty should be lower
                                  than the gap penalty. An exception is where
                                  one or both sequences are single reads with
                                  possible sequencing errors in which case you
                                  would expect many single base gaps. You can
                                  get this result by setting the gap open
                                  penalty to zero (or very low) and using the
                                  gap extension penalty to control gap
                                  scoring.
   -nterm              menu       Select number
  [-nhits]             integer    Number of hits to output
  [-hitsf]             outfile    Name of signature hits file for output
  [-alignf]            outfile    Name of signature alignments file for output

   Optional qualifiers: (none)
   Advanced qualifiers: (none)
   General qualifiers:
  -help                bool       report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   

   Mandatory qualifiers

   [-sigin] (Parameter 1)
      Description:    Name of signature file for input
      Allowed values: Input file
      Default:        test.sig

   [-database] (Parameter 2)
      Description:    Name of sequence database to search
      Allowed values: Readable sequence(s)
      Default:        ./test.seq

   [-targetf] (Parameter 3)
      Description:    Name of (optionally grouped) scop families file
                      for input
      Allowed values: Input file
      Default:        test.fam

   [-thresh] (Parameter 4)
      Description:    Minimum length (residues) of overlap required for
                      two hits with the same code to be counted as the
                      same hit.
      Allowed values: Any integer value
      Default:        20

   [-sub] (Parameter 5)
      Description:    Residue substitution matrix
      Allowed values: Comparison matrix file in EMBOSS data path
      Default:        ./EBLOSUM62

   [-gapo] (Parameter 6)
      Description:    The gap insertion penalty is the score taken away
                      when a gap is created. The best value depends on
                      the choice of comparison matrix. The default value
                      assumes you are using the EBLOSUM62 matrix for
                      protein sequences, and the EDNAMAT matrix for
                      nucleotide sequences.
      Allowed values: Floating point number from 1.0 to 100.0
      Default:        10.0 for any sequence

   [-gape] (Parameter 7)
      Description:    The gap extension penalty is added to the standard
                      gap penalty for each base or residue in the gap.
                      This is how long gaps are penalized. Usually you
                      will expect a few long gaps rather than many short
                      gaps, so the gap extension penalty should be lower
                      than the gap penalty. An exception is where one or
                      both sequences are single reads with possible
                      sequencing errors in which case you would expect
                      many single base gaps. You can get this result by
                      setting the gap open penalty to zero (or very low)
                      and using the gap extension penalty to control gap
                      scoring.
      Allowed values: Floating point number from 0.0 to 10.0
      Default:        0.5 for any sequence

   -nterm
      Description:    Select number
      Allowed values: 1 (Align anywhere and allow only complete
                        signature-sequence fit)
                      2 (Align anywhere and allow partial
                        signature-sequence fit)
                      3 (Use empirical gaps only)
      Default:        1

   [-nhits] (Parameter 8)
      Description:    Number of hits to output
      Allowed values: Any integer value
      Default:        100

   [-hitsf] (Parameter 9)
      Description:    Name of signature hits file for output
      Allowed values: Output file
      Default:        test.hits

   [-alignf] (Parameter 10)
      Description:    Name of signature alignments file for output
      Allowed values: Output file
      Default:        test.align

   Optional qualifiers
      (none)

   Advanced qualifiers
      (none)
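
   The interplay of the -gapo and -gape values can be made concrete
   with a small calculation. This is a literal reading of the qualifier
   descriptions above (the open penalty is charged once and the
   extension penalty once per residue in the gap), not a statement of
   how sigscan computes gap costs internally; the function name is
   hypothetical.

```python
def affine_gap_penalty(length, gapo=10.0, gape=0.5):
    """Penalty for a single gap of `length` residues.

    Assumption: open penalty once, plus the extension penalty for each
    residue in the gap. Defaults are the documented -gapo and -gape values.
    """
    if length <= 0:
        return 0.0
    return gapo + gape * length


# One gap of four residues versus four separate single-residue gaps.
one_long_gap = affine_gap_penalty(4)
four_short_gaps = 4 * affine_gap_penalty(1)
```

   With the default values, one long gap costs far less than the same
   number of gapped residues spread over several short gaps, which is
   why the extension penalty is normally set lower than the open
   penalty.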
   
Input file format

   An excerpt from a signature hits file is shown below (Figure 1). The
   records used are as follows:
   
    1. DE - bibliographic information. The text 'Results of signature
       search' is always given.
    2. Four SCOP classification records are given:
          + CL - Domain class. It is identical to the text given after
            'Class' in the scop classification file (see documentation
            for the EMBOSS application scope).
          + FO - Domain fold. It is identical to the text given after
            'Fold' in the scop classification file (see scope
            documentation).
          + SF - Domain superfamily. It is identical to the text given
            after 'Superfamily' in the scop classification file (see
            scope documentation).
          + FA - Domain family. It is identical to the text given after
            'Family' in the scop classification file (see scope
            documentation).
    3. HI - hit data. The data are as follows (column numbers are given
       in parentheses).
          + (i) HI is always given.
          + (ii) Rank-order of the hit.
          + (iii) Database identifier code.
          + (iv) The group number of the protein if a grouped scop
            families file was provided or '.' otherwise.
          + (v) The primary classification of the hit. For true hits
            (genuine relatives to the signature) one of 'TRAIN',
            'PSIBLAST' or 'OTHER' is given. Otherwise, 'CROSS', 'FALSE'
            or 'UNKNOWN' is given ('.' is given if a scop families file
            was not provided).
          + (vi) The secondary classification of the hit, either 'FALSE',
            'TRUE' or 'UNKNOWN' ('.' is given if a scop families file was
            not provided). The classes of hits are defined below.
          + (vii) Score of sequence-signature match.
          + (viii) E-value (see below).
    4. XX - used for spacing.
    5. // - The file ends with a line containing '//' only.
       
   Example excerpt from a signature hits file:
     _________________________________________________________________
   
DE   Results of signature search
XX
CL   All alpha proteins
XX
FO   Globin-like
XX
SF   Globin-like
XX
FA   Globins
XX
XX
HI   1    1RBPDFG   1    TRUE     TRUE    234  0.0001
HI   2    1GFT35J   3    TRUE     TRUE    234  0.0008
HI   3    1KJUFGH   1    TRUE     TRUE    224  0.0108
HI   4    1GYU15R   2    CLOSE    TRUE    220  0.1876
HI   5    1LKI89O   2    CLOSE    TRUE    203  0.6787
HI   6    1QRTY58   1    TRUE     TRUE    199  0.9978
HI   7    2IOM78G   1    FALSE    FALSE   198  1.0844
HI   8    1SZR234   1    CLOSE    TRUE    198  1.4343
HI   9    3PONI57   1    DISTANT  FALSE   197  2.8849
HI  10    1PHDJBS   3    CLOSE    TRUE    190  2.9872
HI  11    1HIOHDW   1    UNKNOWN  UNKNOWN 160  5.8676
HI  12    199976T   1    CLOSE    TRUE    140  8.8346
XX
//
     _________________________________________________________________
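
   The record layout described above can be read with a short parser.
   This is a minimal sketch that assumes whitespace-separated columns
   in the order listed for the HI record; the Hit type and the
   parse_hits function are hypothetical names, not part of EMBOSS.

```python
from typing import NamedTuple, Optional


class Hit(NamedTuple):
    rank: int             # (ii) rank-order of the hit
    code: str             # (iii) database identifier code
    group: Optional[int]  # (iv) group number, None when '.' is given
    primary: str          # (v) primary classification
    secondary: str        # (vi) secondary classification
    score: float          # (vii) score of sequence-signature match
    evalue: float         # (viii) E-value


def parse_hits(text):
    """Return (header, hits) parsed from the text of a signature hits file."""
    header = {}
    hits = []
    for line in text.splitlines():
        tag, _, rest = line.strip().partition(" ")
        rest = rest.strip()
        if tag == "//":  # end-of-file marker
            break
        if tag in ("DE", "CL", "FO", "SF", "FA"):
            header[tag] = rest
        elif tag == "HI":
            rank, code, group, primary, secondary, score, evalue = rest.split()
            hits.append(Hit(
                int(rank), code,
                None if group == "." else int(group),
                primary, secondary, float(score), float(evalue),
            ))
        # 'XX' spacer lines and blank lines are ignored.
    return header, hits
```

   Calling parse_hits() on the text of a hits file returns the DE and
   SCOP records as a dictionary and the HI records as a list in file
   order.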
   
Output file format

   An excerpt from an alignment file is shown below (Figure 2). The
   records used are as follows:
   
   (1) The DE, CL, FO, SF, FA, XX and // records have the same meaning as
   in the hits file (above).
   
   (2) Other lines contain either a fragment of protein sequence
   preceded by an accession number, or a fragment of an alignment of a
   signature to the protein sequence (signature positions are marked with
   a '*'). The two numbers on either side of the sequence are the begin
   and end residue numbers for that line.
   
   Example excerpt from a signature alignment file
     _________________________________________________________________
   
DE   Results of signature search
XX
CL   Alpha and beta proteins (a/b)
XX
FO   alpha/beta-Hydrolases
XX
SF   alpha/beta-Hydrolases
XX
FA   Acetylcholinesterase-like
XX
OPSD_HUMAN      1        MNGTEGPNFYVPFSNATGVVRSPFEYPQYYLAEPWQFSMLAAYMF 45
SIGNATURE       -        ---------*------------*---------------*------
OPSD_XENLA      1        MNGTEGPNFYVPMSNKTGVVRSPFDYPQYYLAEPWQYSALAAYMF 45
SIGNATURE       -        --------*-------------*----------------*-----
XX
OPSD_HUMAN      46       LLIVLGFPINFLTLYVTVQHKKLRTPLNYILLNLAVADLFMVLGG 90
SIGNATURE       -        --------------*--------------*------------*--
OPSD_XENLA      46       LLILLGLPINFMTLFVTIQHKKLRTPLNYILLNLVFANHFMVLCG 90
SIGNATURE       -        --------------*--------------*------------*--
XX
OPSD_HUMAN      91       FTSTLYTSLHGYFVFGPTGCNLEGFFATLGGEIALWSLVVLAIER 135
SIGNATURE       -        ---------*--*--------------------------**----
OPSD_XENLA      91       FTVTMYTSMHGYFIFGPTGCYIEGFFATLGGEVALWSLVVLAVER 135
SIGNATURE       -        ---------*----*-------------------------**---
XX
//
     _________________________________________________________________
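
   As an illustration of how the alignment records fit together, the
   sketch below pairs each sequence line with the SIGNATURE line that
   follows it and reports the residues under each '*'. It assumes the
   layout shown in the example (four whitespace-separated fields on a
   sequence line, three on a SIGNATURE line) and that sequence
   fragments contain no gap characters; the function name is
   hypothetical.

```python
HEADER_TAGS = {"DE", "CL", "FO", "SF", "FA", "XX", "//"}


def signature_residues(text):
    """Map each sequence code to a list of (residue number, residue) pairs
    found at positions marked '*' on the following SIGNATURE line."""
    marked = {}
    pending = None  # most recent sequence line: (code, start, fragment)
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] in HEADER_TAGS:
            continue
        if fields[0] == "SIGNATURE":
            if len(fields) == 3 and pending is not None:
                code, start, fragment = pending
                for offset, mark in enumerate(fields[2]):
                    if mark == "*" and offset < len(fragment):
                        marked.setdefault(code, []).append(
                            (start + offset, fragment[offset])
                        )
            pending = None
        elif len(fields) == 4 and fields[1].isdigit() and fields[3].isdigit():
            pending = (fields[0], int(fields[1]), fields[2])
    return marked
```

   The residue numbers returned are computed from the begin number on
   each sequence line, so they are only meaningful under the no-gaps
   assumption stated above.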
   
   Definition of classes of hit
   
   The primary classification is an objective definition of the hit and
   has one of the following values:
   
   TRAIN - the sequence was included in the original alignment from which
   the signature was generated.
   
   PSIBLAST - A protein which was detected by psiblast (see psiblasts.c)
   to be a homologue to at least one of the proteins in the family from
   which the signature was derived. Such proteins are identified by the
   'PSIBLAST' record in the scop families file.
   
   OTHER - A true member of the family but not a homologue as detected by
   psi-blast. Such proteins may have been found from the literature and
   manually added to the scop families file or may have been detected by
   the EMBOSS program swissparse (see swissparse.c). They are identified
   in the scop families file by the 'OTHER' record.
   
   CROSS - A protein which is homologous to a protein of the same fold,
   but different family, of the proteins from which the signature was
   derived.
   
   FALSE - A homologue to a protein with a different fold to the family
   of the signature.
   
   UNKNOWN - The protein is not known to be CROSS, FALSE or a true hit
   (TRAIN, PSIBLAST or OTHER).
   
   The secondary classification is provided for convenience and has a
   value as follows:
   
   Hits of TRAIN, PSIBLAST and OTHER classification are all listed as
   TRUE.
   
   Hits of CROSS, FALSE or UNKNOWN objective classification are listed as
   CROSS, FALSE or UNKNOWN respectively.
   
   The secondary (subjective) column allows for hand-annotation of the
   hits files, so that proteins of UNKNOWN objective classification can
   be re-classified by a human expert as TRUE, FALSE or CROSS, or
   otherwise left as UNKNOWN, for the purpose of generating signature
   performance plots with the EMBOSS application sigplot.
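
   The rules for deriving the secondary classification from the primary
   one can be written out directly. This sketch follows the rules given
   in this section and passes '.' through unchanged for the case where
   no scop families file was provided; the function name is
   hypothetical and this is not the code used by sigscan.

```python
TRUE_HIT_CLASSES = {"TRAIN", "PSIBLAST", "OTHER"}
NON_TRUE_CLASSES = {"CROSS", "FALSE", "UNKNOWN"}


def secondary_classification(primary):
    """Initial secondary classification for a given primary classification."""
    if primary in TRUE_HIT_CLASSES:
        return "TRUE"
    if primary in NON_TRUE_CLASSES:
        return primary
    if primary == ".":  # no scop families file was provided
        return "."
    raise ValueError(f"unrecognised primary classification: {primary!r}")
```

   A hand-annotated hits file may later differ from this mapping, since
   an UNKNOWN hit can be re-classified by a human expert.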
   
Data files

   None.
   
Notes

   Important - sigscan presumes that SCOP family names are unique. If
   this were not the case, changes to ajXyzClassifyHits would have to be
   made.
   
   Important - In the case where a signature file is generated by hand,
   it is essential that the gap data given is listed in order of
   increasing gap size.
   
References

   Ison JC, Blades MJ, Bleasby AJ, Daniel SC, Parish JH "Key residues
   approach to the definition of protein families and analysis of sparse
   family signatures" (2000) PROTEINS: Structure, Function and Genetics
   40:330-341
   
Warnings

   None.
   
Diagnostic Error Messages

   None.
   
Exit status

   It always exits with status 0.
   
Known bugs

   None.
   
See also

   Program name Description
   contacts Reads coordinate files and writes contact files
   dichet Parse dictionary of heterogen groups
   interface Reads coordinate files and writes inter-chain contact files
   psiblasts Runs PSI-BLAST given scopalign alignments
   scopalign Generate alignments for SCOP families
   seqsort Removes ambiguities from a set of hits resulting from a
   database search
   siggen Generates a sparse protein signature
   
Author(s)

   This application was written by Jon Ison (jison@hgmp.mrc.ac.uk)
   
History

   Written (July 2001) - Jon Ison
   
Target users

   This program is intended to be used by everyone and everything, from
   naive users to embedded scripts.
   
Comments
