
                              EMBOSS: nrscope
     _________________________________________________________________
   
                                Program nrscope
                                       
Function

   Converts redundant EMBL-format SCOP file to non-redundant one
   
Description

   Nearly all proteins have structural similarities with other proteins
   and, in some of these cases, share a common evolutionary origin. A
   knowledge of these relationships is crucial to our understanding of
   the evolution of proteins and of development. It will also play an
   important role in the analysis of the sequence data that is being
   produced by worldwide genome projects.
   
   The SCOP database aims to provide a detailed and comprehensive
   description of the structural and evolutionary relationships between
   all proteins whose structure is known, including all entries in the
   Protein Data Bank (PDB).
   
   nrscope reads in the EMBL-like format SCOP classification file
   generated by the EMBOSS application scope, and writes a file of
   non-redundant domains in the same format. Domain sequences are
   extracted from the clean domain coordinate files generated by the
   EMBOSS application domainer.
   
   The current version of nrscope removes redundancy at the level of the
   SCOP family, i.e. entries belonging to the same family will be
   non-redundant. For each SCOP family in turn, every pair of domain
   sequences is aligned using the EMBOSS implementation of the Needleman
   and Wunsch global alignment algorithm. If a pair of sequences achieves
   greater than a threshold percentage sequence identity (specified by
   the user), the shorter sequence is discarded. The user must specify
   gap insertion and extension penalties and a residue substitution
   matrix for use in the alignments.
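   
   The family-level filtering described above can be sketched as
   follows. This is a minimal illustration and not the nrscope source:
   the names percent_identity and filter_family are hypothetical, a
   simple match/mismatch score stands in for a substitution matrix such
   as EBLOSUM62, and percentage identity is assumed to be the number of
   identical columns divided by the alignment length. The order in which
   pairs are compared and the handling of equal-length sequences are
   also assumptions.

```python
from itertools import combinations

# Each cell holds (score, identical columns, alignment length).
NEG = (float("-inf"), 0, 0)

def _step(cell, delta, ident):
    """Extend an alignment by one column."""
    return (cell[0] + delta, cell[1] + ident, cell[2] + 1)

def percent_identity(a, b, gap_open=10.0, gap_extend=0.5,
                     match=1.0, mismatch=-1.0):
    """Global alignment with affine gaps; returns % identity (assumed definition)."""
    n, m = len(a), len(b)
    # M: ends in an aligned pair, X: ends in a gap in b, Y: ends in a gap in a.
    M = [[NEG] * (m + 1) for _ in range(n + 1)]
    X = [[NEG] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG] * (m + 1) for _ in range(n + 1)]
    M[0][0] = (0.0, 0, 0)
    for i in range(1, n + 1):
        X[i][0] = (-(gap_open + (i - 1) * gap_extend), 0, i)
    for j in range(1, m + 1):
        Y[0][j] = (-(gap_open + (j - 1) * gap_extend), 0, j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = a[i - 1] == b[j - 1]
            diag = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            M[i][j] = _step(diag, match if same else mismatch, int(same))
            X[i][j] = max(_step(M[i - 1][j], -gap_open, 0),
                          _step(Y[i - 1][j], -gap_open, 0),
                          _step(X[i - 1][j], -gap_extend, 0))
            Y[i][j] = max(_step(M[i][j - 1], -gap_open, 0),
                          _step(X[i][j - 1], -gap_open, 0),
                          _step(Y[i][j - 1], -gap_extend, 0))
    best = max(M[n][m], X[n][m], Y[n][m])
    return 100.0 * best[1] / best[2] if best[2] else 0.0

def filter_family(seqs, threshold=95.0, **penalties):
    """seqs maps domain id -> sequence for ONE family.

    Returns (retained ids, rejected ids), both in input order.
    """
    rejected = set()
    for id1, id2 in combinations(seqs, 2):
        if id1 in rejected or id2 in rejected:
            continue  # assumption: discarded domains are not compared again
        if percent_identity(seqs[id1], seqs[id2], **penalties) > threshold:
            # Discard the shorter sequence; on equal length the later
            # entry is dropped (assumption).
            rejected.add(id1 if len(seqs[id1]) < len(seqs[id2]) else id2)
    retained = [d for d in seqs if d not in rejected]
    return retained, [d for d in seqs if d in rejected]

# Made-up identifiers and sequences, for illustration only.
family = {"dom1": "ACDEFGHIKL", "dom2": "ACDEFGHIK", "dom3": "WWWWWYYYYY"}
retained, rejected = filter_family(family, threshold=80.0)
```

   In this toy family the first two sequences align with nine identical
   columns out of ten, which exceeds the 80% threshold, so the shorter
   one is rejected; the third shares no residues with the first and is
   retained.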
   
Usage

   Here is a sample session with nrscope:
   
% nrscope
Converts redundant EMBL-format SCOP file to non-redundant one
Name of scop file for input (embl-like format) [Escop.dat]: /data/scop/Escop.dat
Name of non-redundant scop file for output (embl-like format) [EscopNR.dat]: EscopNR.test
Location of clean domain coordinate files for input (embl-like format) [./]: /data/cpdbscop/
File extension of clean domain coordinate files [.pxyz]:
The % sequence identity redundancy threshold [95]: 95
Residue substitution file [EBLOSUM62]:
Gap insertion penalty [10]: 20
Gap extension penalty [0.5]: 1
Name of log file for the build [nrscope.log]: EscopNR.log
D3SDHA_
D3SDHB_
D3HBIA_
D3HBIB_
D4SDHA_
D4SDHB_
D4HBIA_
D4HBIB_
D5HBIA_
D5HBIB_

Command line arguments

   Mandatory qualifiers:
  [-scopin]            infile     Name of scop file for input (embl-like
                                  format)
  [-dpdb]              string     Location of clean domain coordinate files
                                  for input (embl-like format)
  [-extn]              string     File extension of clean domain coordinate
                                  files
  [-thresh]            float      The % sequence identity redundancy threshold
  [-datafile]          matrixf    Residue substitution matrix
  [-gapopen]           float      The gap insertion penalty is the score taken
                                  away when a gap is created. The best value
                                  depends on the choice of comparison matrix.
                                  The default value assumes you are using the
                                  EBLOSUM62 matrix for protein sequences, and
                                  the EDNAFULL matrix for nucleotide
                                  sequences.
  [-gapextend]         float      The gap extension penalty is added to the
                                  standard gap penalty for each base or
                                  residue in the gap. This is how long gaps
                                  are penalized. Usually you will expect a few
                                  long gaps rather than many short gaps, so
                                  the gap extension penalty should be lower
                                  than the gap penalty. An exception is where
                                  one or both sequences are single reads with
                                  possible sequencing errors in which case you
                                  would expect many single base gaps. You can
                                  get this result by setting the gap open
                                  penalty to zero (or very low) and using the
                                  gap extension penalty to control gap
                                  scoring.
  [-scopout]           outfile    Name of non-redundant scop file for output
                                  (embl-like format)
  [-errf]              outfile    Name of log file for the build

   Optional qualifiers: (none)
   Advanced qualifiers: (none)
   General qualifiers:
  -help                bool       report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   

   Mandatory qualifiers

   [-scopin] (Parameter 1)
      Description:     Name of scop file for input (embl-like format)
      Allowed values:  Input file
      Default:         Escop.dat
   [-dpdb] (Parameter 2)
      Description:     Location of clean domain coordinate files for
                       input (embl-like format)
      Allowed values:  Any string is accepted
      Default:         ./
   [-extn] (Parameter 3)
      Description:     File extension of clean domain coordinate files
      Allowed values:  Any string is accepted
      Default:         .pxyz
   [-thresh] (Parameter 4)
      Description:     The % sequence identity redundancy threshold
      Allowed values:  Any numeric value
      Default:         95.0
   [-datafile] (Parameter 5)
      Description:     Residue substitution matrix
      Allowed values:  Comparison matrix file in EMBOSS data path
      Default:         EBLOSUM62
   [-gapopen] (Parameter 6)
      Description:     The gap insertion penalty is the score taken away
                       when a gap is created. The best value depends on
                       the choice of comparison matrix. The default value
                       assumes you are using the EBLOSUM62 matrix for
                       protein sequences, and the EDNAFULL matrix for
                       nucleotide sequences.
      Allowed values:  Floating point number from 1.0 to 100.0
      Default:         10.0 for any sequence
   [-gapextend] (Parameter 7)
      Description:     The gap extension penalty is added to the standard
                       gap penalty for each base or residue in the gap.
                       This is how long gaps are penalized. Usually you
                       will expect a few long gaps rather than many short
                       gaps, so the gap extension penalty should be lower
                       than the gap penalty. An exception is where one or
                       both sequences are single reads with possible
                       sequencing errors, in which case you would expect
                       many single base gaps. You can get this result by
                       setting the gap open penalty to zero (or very low)
                       and using the gap extension penalty to control gap
                       scoring.
      Allowed values:  Floating point number from 0.0 to 10.0
      Default:         0.5 for any sequence
   [-scopout] (Parameter 8)
      Description:     Name of non-redundant scop file for output
                       (embl-like format)
      Allowed values:  Output file
      Default:         EscopNR.dat
   [-errf] (Parameter 9)
      Description:     Name of log file for the build
      Allowed values:  Output file
      Default:         nrscope.log

   Optional qualifiers

   (none)

   Advanced qualifiers

   (none)
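   
   For scripted runs, the qualifiers listed above can be supplied on the
   command line instead of answering the prompts. The sketch below
   assembles such a command line as an argument list. The helper name
   build_nrscope_command is hypothetical and not part of EMBOSS; the
   qualifier names, defaults and the two penalty ranges are the ones
   documented in this section, and the example values are those of the
   sample session.

```python
def build_nrscope_command(scopin, scopout, dpdb="./", extn=".pxyz",
                          thresh=95.0, datafile="EBLOSUM62",
                          gapopen=10.0, gapextend=0.5, errf="nrscope.log"):
    """Return an argv list for nrscope (hypothetical helper, not part of EMBOSS)."""
    # Ranges are the ones documented for -gapopen and -gapextend.
    if not 1.0 <= gapopen <= 100.0:
        raise ValueError("gapopen must be between 1.0 and 100.0")
    if not 0.0 <= gapextend <= 10.0:
        raise ValueError("gapextend must be between 0.0 and 10.0")
    options = [("-scopin", scopin), ("-dpdb", dpdb), ("-extn", extn),
               ("-thresh", thresh), ("-datafile", datafile),
               ("-gapopen", gapopen), ("-gapextend", gapextend),
               ("-scopout", scopout), ("-errf", errf)]
    argv = ["nrscope"]
    for name, value in options:
        argv += [name, str(value)]
    return argv

# Values taken from the sample session in the Usage section.
cmd = build_nrscope_command("/data/scop/Escop.dat", "EscopNR.test",
                            dpdb="/data/cpdbscop/", gapopen=20, gapextend=1,
                            errf="EscopNR.log")
# To run it (requires an EMBOSS installation):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

   Building a list rather than a single string avoids shell quoting
   problems when paths contain spaces.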
   
Input file format

   The EMBL-like format SCOP classification file generated by the EMBOSS
   application scope is as follows:
   
   Records (4) to (8) are used to describe the position of the domain in
   the scop hierarchy.
   
    1. ID - Domain identifier code. This is a 7-character code that
       uniquely identifies the domain in scop. It is identical to the
       first 7 characters of a line in the scop classification file. The
       first character is always 'D', the next four characters are the
       PDB identifier code, the sixth character is the PDB chain
       identifier to which the domain belongs (a '.' is given in cases
       where the domain is composed of multiple chains, a '_' is given
       where a chain identifier was not specified in the PDB file) and
       the final character is the number of the domain in the chain (for
       chains comprising more than one domain) or '_' (the chain
       comprises a single domain only).
    2. EN - PDB identifier code. This is the 4-character PDB identifier
       code of the PDB entry containing the domain.
    3. OS - Source of the protein. It is identical to the text given
       after 'Species' in the scop classification file.
    4. CL - Domain class. It is identical to the text given after 'Class'
       in the scop classification file.
    5. FO - Domain fold. It is identical to the text given after 'Fold'
       in the scop classification file.
    6. SF - Domain superfamily. It is identical to the text given after
       'Superfamily' in the scop classification file.
    7. FA - Domain family. It is identical to the text given after
       'Family' in the scop classification file.
    8. DO - Domain name. It is identical to the text given after
       'Protein' in the scop classification file.
    9. NC - Number of chains comprising the domain (usually 1). If the
       number of chains is greater than 1, then the domain entry will
       have a section containing a CN and a CH record (see below) for
       each chain.
   10. CN - Chain number. The number given in brackets after this record
       indicates the start of the data for the relevant chain.
   11. CH - Domain definition. The character given before CHAIN is the
       PDB chain identifier (a '.' is given in cases where a chain
       identifier was not specified in the scop classification file).
       The strings before START and END give the start and end positions
       respectively of the domain in the PDB file (a '.' is given in
       cases where a position was not specified). Note that the start and
       end positions refer to the residue numbering given in the original
       PDB file and therefore must be treated as strings.
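       
   As an illustration of the ID record, the sketch below splits a
   7-character domain identifier code by position: the leading 'D', four
   characters of PDB identifier code, one chain character and one domain
   character. The names DomainId and parse_domain_id are hypothetical
   and are not part of EMBOSS; the two example codes are taken from the
   log excerpt in the Output file format section.

```python
from typing import NamedTuple, Optional

class DomainId(NamedTuple):
    pdb_code: str                  # 4-character PDB identifier code
    chain: Optional[str]           # None for '.' (multiple chains) or '_' (unspecified)
    multi_chain: bool              # True when the chain character is '.'
    domain_number: Optional[str]   # None for '_' (chain is a single domain)

def parse_domain_id(code):
    """Split a 7-character scop domain identifier code into its parts."""
    if len(code) != 7 or code[0] != "D":
        raise ValueError("expected a 7-character code starting with 'D'")
    pdb_code, chain, domain = code[1:5], code[5], code[6]
    return DomainId(
        pdb_code=pdb_code,
        chain=None if chain in (".", "_") else chain,
        multi_chain=(chain == "."),
        domain_number=None if domain == "_" else domain,
    )

single_chain = parse_domain_id("D2HDDA_")
no_chain_given = parse_domain_id("D1ENH__")
```

   The domain number is kept as a string, in the same spirit as the
   start and end positions of the CH record.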
       
Output file format

   The output file has the same format as the input EMBL-like format SCOP
   classification file generated by the EMBOSS application scope except
   that it contains non-redundant domains, as explained in the
   Description section.
   
   nrscope generates a log file, an excerpt of which is shown below. The
   first two lines describe the level in the SCOP hierarchy at which
   redundancy was removed (always 'FAMILIES' for the current
   implementation) and the value of the redundancy threshold. The file
   then contains a section for each SCOP family. Each section contains a
   line with the record '//' immediately followed by the name of the SCOP
   family, and two lines containing 'Retained' and 'Rejected'
   respectively. Domain identifier codes of domains that were discarded
   by nrscope are listed under 'Rejected' while those that appear in the
   output file are listed under 'Retained'. The text 'WARN filename not
   found' is given in cases where a clean domain coordinate file could
   not be found and 'WARN Empty family' where no files for an entire
   family could be found. 'ERROR filename file read error' will be given
   when an error was encountered during a file read.
     _________________________________________________________________
   
FAMILIES are non-redundant
95% redundancy threshold
// Homeodomain
Retained
D2HDDA_
D1AKHA_
D1MNMC_
Rejected
D2HDDB_
D1ENH__
D3HDDA_
WARN  d3hdda_.pxyz not found
// Di-haem cytohrome c peroxidase
WARN  ds005__.pxyz not found
WARN  Empty family
// Nuclear receptor coactivator Src-1
Retained
D2PRGC_
Rejected
     _________________________________________________________________
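   
   A log file with the layout described above can be summarised with a
   short script. The sketch below is a hypothetical helper (the name
   parse_nrscope_log is not part of EMBOSS) and assumes only the layout
   shown in the excerpt: header lines, then one '//' line per family
   followed by 'Retained' and 'Rejected' lists, with WARN and ERROR
   lines possible anywhere inside a family section.

```python
def parse_nrscope_log(text):
    """Parse nrscope log text into (header lines, list of family dicts)."""
    header, families = [], []
    current, section = None, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("//"):
            # A new family section starts here.
            current = {"family": line[2:].strip(), "retained": [],
                       "rejected": [], "messages": []}
            families.append(current)
            section = None
        elif current is None:
            header.append(line)            # lines before the first family
        elif line in ("Retained", "Rejected"):
            section = line.lower()
        elif line.startswith(("WARN", "ERROR")):
            current["messages"].append(line)
        elif section:
            current[section].append(line)  # a domain identifier code
    return header, families

# The excerpt shown above, copied verbatim.
EXCERPT = """FAMILIES are non-redundant
95% redundancy threshold
// Homeodomain
Retained
D2HDDA_
D1AKHA_
D1MNMC_
Rejected
D2HDDB_
D1ENH__
D3HDDA_
WARN  d3hdda_.pxyz not found
// Di-haem cytohrome c peroxidase
WARN  ds005__.pxyz not found
WARN  Empty family
// Nuclear receptor coactivator Src-1
Retained
D2PRGC_
Rejected
"""
header, families = parse_nrscope_log(EXCERPT)
rejected_total = sum(len(f["rejected"]) for f in families)
```

   Counting the 'Rejected' entries per family in this way gives a quick
   check of how much redundancy was removed at a given threshold.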
   
Data files

   None.
   
Notes

   None.
   
References

   None.
   
Warnings

   None.
   
Diagnostic Error Messages

   None.
   
Exit status

   It always exits with status 0.
   
Known bugs

   None.
   
See also

   Program name   Description
   cutgextract    Extract data from CUTG
   domainer       Build domain coordinate files
   pdbtosp        Convert raw swissprot:pdb equivalence file to embl-like
                  format
   printsextract  Extract data from PRINTS
   prosextract    Builds the PROSITE motif database for patmatmotifs to
                  search
   rebaseextract  Extract data from REBASE
   scope          Convert raw scop classification file to embl-like format
   scopparse      Reads raw-, and writes EMBL-like, scop classification
                  files
   seqnr          Converts redundant database results to a non-redundant
                  set of hits
   tfextract      Extract data from TRANSFAC
   
Author(s)

   This application was written by Jon Ison (jison@hgmp.mrc.ac.uk)
   
History

   Written (Jan 2001) - Jon Ison.
   
Target users

   This program is intended to be run by EMBOSS site maintainers or those
   responsible for setting up and maintaining protein 3D structural data
   for use by others.
   
Comments
