
                              EMBOSS: domainer
     _________________________________________________________________
   
                               Program domainer
                                       
Function

   Build domain coordinate files
   
Description

   Nearly all proteins have structural similarities with other proteins
   and, in some of these cases, share a common evolutionary origin. A
   knowledge of these relationships is crucial to our understanding of
   the evolution of proteins and of development. It will also play an
   important role in the analysis of the sequence data that is being
   produced by worldwide genome projects.
   
   The SCOP database aims to provide a detailed and comprehensive
   description of the structural and evolutionary relationships between
   all proteins whose structure is known, including all entries in the
   Protein Data Bank (PDB).
   
   domainer reads an EMBL-like format SCOP classification file generated
   by the EMBOSS applications scope or nrscope, and EMBL-like format
   clean protein coordinate files generated by the coorde application
   (not currently in EMBOSS; email Jon Ison, jison@hgmp.mrc.ac.uk). For
   each domain in the SCOP classification file, domainer writes clean
   domain coordinate files in EMBL-like and PDB formats. Each output file
   contains coordinates for a single SCOP domain. Where multiple models
   were determined, the data in the domain files correspond to the first
   model. In the rare cases where a domain comprises more than one chain,
   the data are presented as belonging to a single chain (i.e. a single
   sequence, chain identifier, etc. are given).
   
Usage

   Here is a sample session with domainer:
   

% domainer
Build domain coordinate files
Name of scop file for input (embl-like format) [Escop.dat]: /data/scop/Escop.dat
Location of coordinate files for input (embl-like format) [./]: /data/cpdb/
Location of coordinate files for output (embl-like format) [./]:
Extension of coordinate files (embl-like format) [.pxyz]:
Location of coordinate files for output (pdb format) [./]:
Extension of coordinate files (pdb format) [.ent]:
Name of log file for the embl-like format build [domainer.log1]: log.1
Name of log file for the pdb format build [domainer.log2]: log.2
D3SDHA_
D3SDHB_
D3HBIA_
D3HBIB_
D4SDHA_
D4SDHB_
D4HBIA_
D4HBIB_
D5HBIA_
D5HBIB_
D7HBIA_
D7HBIB_

Command line arguments

   Mandatory qualifiers:
  [-scop]              infile     Name of scop file for input (embl-like
                                  format)
  [-cpdb]              string     Location of coordinate files for input
                                  (embl-like format)
  [-cpdbscop]          string     Location of coordinate files for output
                                  (embl-like format)
  [-cpdbextn]          string     Extension of coordinate files (embl-like
                                  format)
  [-pdbscop]           string     Location of coordinate files for output (pdb
                                  format)
  [-pdbextn]           string     Extension of coordinate files (pdb format)
  [-cpdberrf]          outfile    Name of log file for the embl-like format
                                  build
  [-pdberrf]           outfile    Name of log file for the pdb format build

   Optional qualifiers: (none)
   Advanced qualifiers: (none)
   General qualifiers:
  -help                bool       report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   

   Mandatory qualifiers:
   Qualifier    Parameter  Description                           Allowed values          Default
   [-scop]      1          Name of scop file for input           Input file              Escop.dat
                           (embl-like format)
   [-cpdb]      2          Location of coordinate files for      Any string is accepted  ./
                           input (embl-like format)
   [-cpdbscop]  3          Location of coordinate files for      Any string is accepted  ./
                           output (embl-like format)
   [-cpdbextn]  4          Extension of coordinate files         Any string is accepted  .pxyz
                           (embl-like format)
   [-pdbscop]   5          Location of coordinate files for      Any string is accepted  ./
                           output (pdb format)
   [-pdbextn]   6          Extension of coordinate files         Any string is accepted  .ent
                           (pdb format)
   [-cpdberrf]  7          Name of log file for the embl-like    Output file             domainer.log1
                           format build
   [-pdberrf]   8          Name of log file for the pdb format   Output file             domainer.log2
                           build

   Optional qualifiers: (none)
   Advanced qualifiers: (none)
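
   As a minimal sketch, the qualifiers and defaults listed above can be
   assembled into a non-interactive command line. The helper
   build_domainer_command is hypothetical (it is not part of EMBOSS), and
   the sketch assumes the usual EMBOSS "-qualifier value" syntax; it only
   builds the argument list and does not run the program.

```python
from typing import List

# Qualifier names and default values as documented for domainer.
DEFAULTS = {
    "scop": "Escop.dat",
    "cpdb": "./",
    "cpdbscop": "./",
    "cpdbextn": ".pxyz",
    "pdbscop": "./",
    "pdbextn": ".ent",
    "cpdberrf": "domainer.log1",
    "pdberrf": "domainer.log2",
}


def build_domainer_command(**overrides: str) -> List[str]:
    """Return an argv list for domainer (hypothetical helper)."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown qualifier(s): {sorted(unknown)}")
    settings = {**DEFAULTS, **overrides}
    argv = ["domainer"]
    for name in DEFAULTS:  # keep the documented parameter order
        argv += [f"-{name}", settings[name]]
    return argv


# Same choices as the sample session: custom input paths and log names.
cmd = build_domainer_command(
    scop="/data/scop/Escop.dat",
    cpdb="/data/cpdb/",
    cpdberrf="log.1",
    pdberrf="log.2",
)
print(" ".join(cmd))
```

   The resulting list could be passed to subprocess.run. Because domainer
   always exits with status 0 (see Exit status), the two log files, rather
   than the return code, would need to be inspected for problems.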
   
Input file format

   The EMBL-like format used for the input clean protein data (and the
   output domain format) uses the following records:
   
   (1) ID - Either the 4-character PDB identifier code (for clean protein
   coordinate files) or the 7-character domain identifier code taken from
   SCOP (for domain coordinate files; see the documentation for the EMBOSS
   application scope for further information).
   
   (2) DE - compound information. Text from the COMPND records of the
   original pdb file is given.
   
   (3) OS - protein source information. Text from the SOURCE records of
   the original pdb file is given.
   
   (4) EX - experimental information. The text 'nmr_or_model' (for
   nuclear magnetic resonance and model structures) or 'xray' (for
   structures determined by X-ray crystallography) appears as appropriate
   after the text 'METHOD'. The resolution of X-ray structures, or '0'
   for structures of type 'nmr_or_model', is given after 'RESO'. The
   number of models and the number of polypeptide chains are given after
   'NMOD' and 'NCHA' respectively. For domain coordinate files, a value
   of 1 is always given for both. Following the EX record, the file has a
   section containing CN, IN and SQ records (see below) for each chain.
   
   (5) CN - chain number. The number given in brackets after this record
   indicates the start of a section of chain-specific data.
   
   (6) IN - chain-specific data. The character given after ID is the PDB
   chain identifier (a '.' is given in cases where a chain identifier was
   not specified in the pdb file or, for domain coordinate files, where
   the domain comprises more than one chain). The number of amino acid
   residues comprising the chain (or the chains of which a domain is
   composed) is given after NR. The numbers of atoms in heterogens and
   water molecules are given after NH and NW respectively. Domain
   coordinate files do not include coordinates for these groups, so a
   value of 0 is always given.
   
   (7) SQ - protein sequence. The number of residues is given before AA
   on the first line. The protein sequence is given on subsequent lines.
   
   (8) CO - coordinate data. The columns of the records are as follows.
    1. CO is always given.
    2. Model number (always 1 for domain coordinate files).
    3. Chain number (always 1 for domain coordinate files).
    4. Either P (a protein atom), H (a heterogen atom) or W (an atom in a
       water molecule).
    5. Position of the residue in the protein sequence given in the SQ
       record (for protein atoms) or a sequential count of the atoms (for
       heterogens and water).
    6. Residue number according to the original pdb file, or a
       sequential count of the atoms (for heterogens and water).
    7. Single character amino acid code or a '.' (for heterogens and
       water).
    8. 3-character residue identifier code.
    9. Atom type.
   10. The x orthogonal coordinate.
   11. The y orthogonal coordinate.
   12. The z orthogonal coordinate.
   13. Occupancy.
   14. Temperature factor.
       
   (9) XX - Used for spacing.
   
   (10) // - Given on the last line of the file only.
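
   The column layout of the CO record can be illustrated with a short
   parser. This is a sketch only: the CoRecord class and parse_co function
   are hypothetical names, and the code assumes that the 14 columns are
   separated by whitespace and that no column contains embedded spaces.
   The original PDB residue number is kept as text as a precaution, since
   this document does not say whether it is always a plain integer.

```python
from dataclasses import dataclass


@dataclass
class CoRecord:
    model: int        # column 2: model number
    chain: int        # column 3: chain number
    group: str        # column 4: 'P', 'H' or 'W'
    seq_pos: int      # column 5: position in the SQ sequence (or atom count)
    pdb_resnum: str   # column 6: residue number from the original pdb file
    aa1: str          # column 7: single-character code or '.'
    res3: str         # column 8: 3-character residue identifier
    atom: str         # column 9: atom type
    x: float          # columns 10-12: orthogonal coordinates
    y: float
    z: float
    occupancy: float  # column 13
    temp_factor: float  # column 14


def parse_co(line: str) -> CoRecord:
    """Parse one whitespace-separated CO line (assumed layout)."""
    f = line.split()
    if len(f) != 14 or f[0] != "CO":
        raise ValueError(f"not a 14-column CO record: {line!r}")
    if f[3] not in ("P", "H", "W"):
        raise ValueError(f"unexpected atom group {f[3]!r}")
    return CoRecord(
        model=int(f[1]), chain=int(f[2]), group=f[3],
        seq_pos=int(f[4]), pdb_resnum=f[5], aa1=f[6], res3=f[7], atom=f[8],
        x=float(f[9]), y=float(f[10]), z=float(f[11]),
        occupancy=float(f[12]), temp_factor=float(f[13]),
    )


record = parse_co("CO   1    1    P    1     1     V    VAL    N"
                  "      7.155   17.725 4.424     1.00    37.82")
print(record.res3, record.atom, record.x, record.y, record.z)
```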
   
Output file format

   The PDB format is explained at:
   http://www.rcsb.org/pdb/docs/format/pdbguide2.2/guide2.2_frame.html
   domainer writes the following records in PDB format:
   
   (1) HEADER - bibliographic information. The text 'CLEANED-UP PDB FILE
   FOR SCOP DOMAIN XXXXXXX' is always given (where XXXXXXX is a
   7-character domain identifier code).
   
   (2) TITLE - bibliographic information. The text ' THIS FILE IS MISSING
   MOST RECORDS FROM THE ORIGINAL PDB FILE' is always given.
   
   (3) COMPND - compound information. The COMPND records from the
   original pdb file are given.
   
   (4) SOURCE - protein source information. The SOURCE records from the
   original PDB file are given.
   
   (5) REMARK - remark records. Remark records are used for spacing. One
   REMARK line containing the protein resolution is always given.
   
   (6) SEQRES - protein sequence.
   
   (7) ATOM - atomic coordinates.
   
   (8) TER - indicates the end of a chain.
   
   The following is an example of an excerpt from an output clean domain
   coordinate file (PDB format):
     _________________________________________________________________
   
HEADER     CLEANED-UP PDB FILE FOR SCOP DOMAIN D1HBBA_
TITLE      THIS FILE IS MISSING MOST RECORDS FROM THE ORIGINAL PDB FILE
COMPND     HEMOGLOBIN A (DEOXY, LOW SALT, 100MM CL)
SOURCE     HUMAN (HOMO SAPIENS)
REMARK
REMARK     RESOLUTION. 1.90  ANGSTROMS.
REMARK
SEQRES   1 A  141  VAL LEU SER PRO ALA ASP LYS THR ASN VAL LYS ALA ALA
SEQRES   2 A  141  TRP GLY LYS VAL GLY ALA HIS ALA GLY GLU TYR GLY ALA
SEQRES   3 A  141  GLU ALA LEU GLU ARG MET PHE LEU SER PHE PRO THR THR
SEQRES   4 A  141  LYS THR TYR PHE PRO HIS PHE ASP LEU SER HIS GLY SER
SEQRES   5 A  141  ALA GLN VAL LYS GLY HIS GLY LYS LYS VAL ALA ASP ALA
SEQRES   6 A  141  LEU THR ASN ALA VAL ALA HIS VAL ASP ASP MET PRO ASN
SEQRES   7 A  141  ALA LEU SER ALA LEU SER ASP LEU HIS ALA HIS LYS LEU
SEQRES   8 A  141  ARG VAL ASP PRO VAL ASN PHE LYS LEU LEU SER HIS CYS
SEQRES   9 A  141  LEU LEU VAL THR LEU ALA ALA HIS LEU PRO ALA GLU PHE
SEQRES  10 A  141  THR PRO ALA VAL HIS ALA SER LEU ASP LYS PHE LEU ALA
SEQRES  11 A  141  SER VAL SER THR VAL LEU THR SER LYS TYR ARG
ATOM      1  N   VAL A   1       7.155  17.725   4.424  1.00 37.82           N
ATOM      2  CA  VAL A   1       7.854  18.800   3.718  1.00 35.10           C
ATOM      3  C   VAL A   1       9.366  18.565   3.754  1.00 31.92           C
ATOM      4  O   VAL A   1       9.861  17.961   4.721  1.00 35.01           O
ATOM      5  CB  VAL A   1       7.529  20.168   4.360  1.00 47.63           C
ATOM      6  CG1 VAL A   1       7.806  21.300   3.369  1.00 62.84           C
ATOM      7  CG2 VAL A   1       6.136  20.244   4.936  1.00 54.85           C
ATOM      8  N   LEU A   2      10.032  19.062   2.731  1.00 27.38           N
ATOM      9  CA  LEU A   2      11.496  18.967   2.657  1.00 23.24           C
ATOM     10  C   LEU A   2      12.077  20.110   3.496  1.00 22.99           C
ATOM     11  O   LEU A   2      11.672  21.259   3.289  1.00 25.22           O
ATOM     12  CB  LEU A   2      11.924  19.005   1.204  1.00 18.04           C
ATOM     13  CG  LEU A   2      11.563  17.855   0.286  1.00 17.80           C
ATOM     14  CD1 LEU A   2      12.166  18.109  -1.097  1.00 20.08           C
ATOM     15  CD2 LEU A   2      12.116  16.542   0.839  1.00 13.84           C
ATOM     16  N   SER A   3      12.979  19.784   4.391  1.00 22.22           N
ATOM     17  CA  SER A   3      13.652  20.792   5.257  1.00 20.53           C
ATOM     18  C   SER A   3      14.871  21.318   4.505  1.00 18.31           C
ATOM     19  O   SER A   3      15.273  20.709   3.496  1.00 17.73           O
ATOM     20  CB  SER A   3      14.084  20.042   6.534  1.00 17.61           C
     _________________________________________________________________
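
   The ATOM lines in the excerpt above follow the fixed-column layout of
   the PDB format, so they can be read by slicing rather than by splitting
   on whitespace. The sketch below uses a hypothetical parse_atom function
   and the standard PDB column positions for the ATOM record; it assumes
   that domainer output keeps to those positions, as the excerpt does.

```python
def parse_atom(line: str) -> dict:
    """Read the main fields of a PDB ATOM line by column position."""
    if not line.startswith("ATOM"):
        raise ValueError(f"not an ATOM record: {line!r}")
    return {
        "serial": int(line[6:11]),        # columns 7-11
        "name": line[12:16].strip(),      # columns 13-16
        "res_name": line[17:20].strip(),  # columns 18-20
        "chain_id": line[21],             # column 22
        "res_seq": int(line[22:26]),      # columns 23-26
        "x": float(line[30:38]),          # columns 31-38
        "y": float(line[38:46]),          # columns 39-46
        "z": float(line[46:54]),          # columns 47-54
        "occupancy": float(line[54:60]),  # columns 55-60
        "temp_factor": float(line[60:66]),  # columns 61-66
    }


# Build a sample line from explicit field widths so the columns are exact.
sample = "ATOM  %5d %-4s %3s %1s%4d    %8.3f%8.3f%8.3f%6.2f%6.2f" % (
    1, " N", "VAL", "A", 1, 7.155, 17.725, 4.424, 1.00, 37.82)
atom = parse_atom(sample)
print(atom["res_name"], atom["name"], atom["x"], atom["y"], atom["z"])
```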
   
   domainer also writes out the clean domain coordinate files in
   EMBL-like format. This format is described in the Input file format
   section of this document, because the same format is used for the
   input clean protein data and the output clean domain data.
   
   The following is an example of an excerpt from an output clean domain
   coordinate file (EMBL-like format):
     _________________________________________________________________
   
ID   D1HBBA_
XX
DE   Co-ordinates for SCOP domain D1HBBA_
XX
OS   See Escop.dat for domain classification
XX
EX   METHOD xray; RESO 1.90; NMOD 1; NCHA 1;
XX
CN   [1]
XX
IN   ID A; NR 141; NH 0; NW 0;
XX
SQ   SEQUENCE   141 AA;  15127 MW;  5EC7DB1E CRC32;
     VLSPADKTNV KAAWGKVGAH AGEYGAEALE RMFLSFPTTK TYFPHFDLSH GSAQVKGHGK
     KVADALTNAV AHVDDMPNAL SALSDLHAHK LRVDPVNFKL LSHCLLVTLA AHLPAEFTPA
     VHASLDKFLA SVSTVLTSKY R
XX
CO   1    1    P    1     1     V    VAL    N      7.155   17.725 4.424     1.00    37.82
CO   1    1    P    1     1     V    VAL    CA     7.854   18.800 3.718     1.00    35.10
CO   1    1    P    1     1     V    VAL    C      9.366   18.565 3.754     1.00    31.92
CO   1    1    P    1     1     V    VAL    O      9.861   17.961 4.721     1.00    35.01
CO   1    1    P    1     1     V    VAL    CB     7.529   20.168 4.360     1.00    47.63
CO   1    1    P    1     1     V    VAL    CG1    7.806   21.300 3.369     1.00    62.84
CO   1    1    P    1     1     V    VAL    CG2    6.136   20.244 4.936     1.00    54.85
CO   1    1    P    2     2     L    LEU    N     10.032   19.062 2.731     1.00    27.38
CO   1    1    P    2     2     L    LEU    CA    11.496   18.967 2.657     1.00    23.24
CO   1    1    P    2     2     L    LEU    C     12.077   20.110 3.496     1.00    22.99
CO   1    1    P    2     2     L    LEU    O     11.672   21.259 3.289     1.00    25.22
     _________________________________________________________________
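
   The EX and IN records in the excerpt above carry their values as
   'KEY value;' pairs. A minimal sketch of reading them follows; the
   function name parse_pairs is hypothetical, and the code assumes that
   every pair is terminated by a semicolon, as in the excerpt.

```python
from typing import Dict


def parse_pairs(line: str) -> Dict[str, str]:
    """Turn 'EX   METHOD xray; RESO 1.90;' into {'METHOD': 'xray', ...}."""
    body = line[2:].strip()  # drop the two-character record code
    pairs = {}
    for item in body.split(";"):
        item = item.strip()
        if not item:
            continue  # skip the empty piece after the final ';'
        key, _, value = item.partition(" ")
        pairs[key] = value.strip()
    return pairs


ex = parse_pairs("EX   METHOD xray; RESO 1.90; NMOD 1; NCHA 1;")
chain_info = parse_pairs("IN   ID A; NR 141; NH 0; NW 0;")
print(ex["METHOD"], float(ex["RESO"]), int(chain_info["NR"]))
```

   Values are returned as text so that the caller decides how to convert
   them (for example RESO to a float and NR to an integer).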
   
   domainer generates a log file for each build (EMBL-like format and PDB
   format); an excerpt is shown below. If there is a problem in
   processing a domain, three lines are written, containing the record
   '//', the domain identifier code and an error message respectively.
   The text 'WARN filename not found' is given in cases where a clean
   coordinate file could not be found. 'ERROR filename file read error'
   or 'ERROR filename file write error' is reported when an error was
   encountered during a file read or write respectively. Various other
   error messages may also be given (in case of difficulty email Jon
   Ison, jison@hgmp.mrc.ac.uk).
     _________________________________________________________________
   
//
DS002__
WARN  Could not open for reading cpdb file s002.pxyz
//
DS003__
WARN  Could not open for reading cpdb file s003.pxyz
     _________________________________________________________________
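
   The three-line layout of the log entries can be read back as follows.
   This is a sketch under the assumption that every entry consists of
   exactly '//', a domain identifier and one message line whose first
   word is the severity (WARN or ERROR); parse_domainer_log is a
   hypothetical name.

```python
from typing import List, Tuple


def parse_domainer_log(text: str) -> List[Tuple[str, str, str]]:
    """Return (domain, severity, message) for each three-line entry."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    entries = []
    i = 0
    while i < len(lines):
        if lines[i] == "//" and i + 2 < len(lines):
            domain = lines[i + 1]
            severity, _, message = lines[i + 2].partition(" ")
            entries.append((domain, severity, message.strip()))
            i += 3
        else:
            i += 1  # skip anything that does not start an entry
    return entries


log_text = """//
DS002__
WARN  Could not open for reading cpdb file s002.pxyz
//
DS003__
WARN  Could not open for reading cpdb file s003.pxyz
"""
for domain, severity, message in parse_domainer_log(log_text):
    print(domain, severity, message)
```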
   
Data files

   The ready-made input and output data for domainer may be downloaded
   from the HGMP:
   
   EMBL-like format clean protein coordinate files
   EMBL-like format clean domain coordinate files
   PDB-format clean domain coordinate files
   
Notes

   None.
   
References

   None.
   
Warnings

   None.
   
Diagnostic Error Messages

   None.
   
Exit status

   It always exits with status 0.
   
Known bugs

   None.
   
See also

   Program name    Description
   cutgextract     Extract data from CUTG
   nrscope         Converts redundant EMBL-format SCOP file to non-redundant one
   pdbtosp         Convert raw swissprot:pdb equivalence file to embl-like format
   printsextract   Extract data from PRINTS
   prosextract     Builds the PROSITE motif database for patmatmotifs to search
   rebaseextract   Extract data from REBASE
   scope           Convert raw scop classification file to embl-like format
   scopparse       Reads raw-, and writes EMBL-like, scop classification files
   seqnr           Converts redundant database results to a non-redundant set of hits
   tfextract       Extract data from TRANSFAC
   
Author(s)

   This application was written by Jon Ison (jison@hgmp.mrc.ac.uk)
   
History

   Written (Jan 2001) - Jon Ison.
   
Target users

   This program is intended to be run by EMBOSS site maintainers or those
   responsible for setting up and maintaining protein 3D structural data
   for use by others.
   
Comments
