
                            EMBOSS: patmatmotifs
     _________________________________________________________________
   
                             Program patmatmotifs
                                       
Function

   Search a PROSITE motif database with a protein sequence
   
Description

   patmatmotifs takes a protein sequence and compares it to the PROSITE
   database of motifs.
   
   For a description of PROSITE, we can do no better than to quote the
   PROSITE user's documentation: 
   
   PROSITE is a method of determining what is the function of
   uncharacterized proteins translated from genomic or cDNA sequences. It
   consists of a database of biologically significant sites and patterns
   formulated in such a way that with appropriate computational tools it
   can rapidly and reliably identify to which known family of protein (if
   any) the new sequence belongs.
   
   In some cases the sequence of an unknown protein is too distantly
   related to any protein of known structure to detect its resemblance by
   overall sequence alignment, but it can be identified by the occurrence
   in its sequence of a particular cluster of residue types which is
   variously known as a pattern, motif, signature, or fingerprint. These
   motifs arise because of particular requirements on the structure of
   specific region(s) of a protein which may be important, for example,
   for their binding properties or for their enzymatic activity. These
   requirements impose very tight constraints on the evolution of those
   limited (in size) but important portion(s) of a protein sequence. To
   paraphrase Orwell, in Animal Farm, we can say that "some regions of a
   protein sequence are more equal than others" !
   
   The use of protein sequence patterns (or motifs) to determine the
   function(s) of proteins is becoming very rapidly one of the essential
   tools of sequence analysis. This reality has been recognized by many
   authors, as it can be illustrated from the following citations from
   two of the most well known experts of protein sequence analysis, R.F.
   Doolittle and A.M. Lesk:
      "There are many short sequences that are often (but not always)
      diagnostics of certain binding properties or active sites. These can be
      set into a small subcollection and searched against your sequence (4)".

      "In some cases, the structure and function of an unknown protein which
      is too distantly related to any protein of known structure to detect
      its affinity by overall sequence alignment may be identified by its
      possession of a particular cluster of residues types classified as a
      motifs. The motifs, or templates, or fingerprints, arise because of
      particular requirements of binding sites that impose very tight
      constraint on the evolution of portions of a protein sequence (5)."

   The home web page of PROSITE is: http://www.expasy.ch/prosite/
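
   PROSITE patterns are written in a compact notation in which elements
   are separated by '-', 'x' stands for any residue, '[..]' lists the
   allowed residues, '{..}' lists forbidden residues and '(n)' or '(n,m)'
   gives a repeat count. The sketch below translates such a pattern into
   a Python regular expression and applies it to the sequence fragment
   printed in the sample output further down this page. It is only an
   illustration: the function name prosite_to_regex and the offset
   constant are assumptions made for this example, only the common
   pattern elements are handled, and this is not necessarily how
   patmatmotifs or prosextract process patterns internally.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern into a Python regular expression.

    Hypothetical helper for illustration. Handles x, [..], {..},
    (n) / (n,m) repeats and the < and > anchors; nothing else.
    """
    pattern = pattern.strip().rstrip(".")
    element_re = re.compile(
        r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?"
    )
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):
            prefix, element = "^", element[1:]
        if element.endswith(">"):
            suffix, element = "$", element[:-1]
        m = element_re.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = m.groups()
        if core == "x":
            core = "."                          # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # forbidden residues
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


# Consensus pattern and sequence fragment copied from the sample report.
PATTERN = "N-G-x-[DE](2)-x-[LIVMF]-C-[ST]-x(11,12)-[PAG]-D"
FRAGMENT = "HGRHGNGLEETICSARCTDNLDDPSRADVYKPQ"
# The report marks position 282 under the sixth residue of the fragment,
# so the fragment is assumed to start at sequence position 277.
FRAGMENT_START = 277

regex = prosite_to_regex(PATTERN)
match = re.search(regex, FRAGMENT)
start = FRAGMENT_START + match.start()
end = FRAGMENT_START + match.end() - 1
print(regex)
print(match.group(), start, end)
```

   Under these assumptions the match covers positions 282 to 304, a
   length of 23 residues, which agrees with the Start, End and Length
   values in the sample report.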
   
   A search of a protein sequence against the PROSITE database commonly
   reports many matches to short motifs that indicate post-translational
   modification sites, such as glycosylation, myristylation and
   phosphorylation sites. These matches are often unwanted and are not
   reported by default. You can turn reporting of these short motifs on
   by giving the '-noprune' option on the command line.
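
   The effect of pruning can be pictured as a filter over the list of
   matches. The sketch below is only an illustration of that idea: the
   motif names are the ones listed in the help text of the -prune
   qualifier, but the function prune_hits, the tuple layout of a hit, the
   case-insensitive comparison and the positions of the two
   modification-site hits are all assumptions made for this example.

```python
# Motif names taken from the -prune qualifier description on this page.
SIMPLE_MOTIFS = {
    "myristyl",
    "asn_glycosylation",
    "camp_phospho_site",
    "pkc_phospho_site",
    "ck2_phospho_site",
    "tyr_phospho_site",
}


def prune_hits(hits, prune=True):
    """Drop hits to the simple modification-site motifs when prune is True.

    `hits` is a hypothetical list of (motif_name, start, end) tuples; it
    is not a data structure used by patmatmotifs itself. Names are
    compared case-insensitively, which is an assumption.
    """
    if not prune:
        return list(hits)
    return [hit for hit in hits if hit[0].lower() not in SIMPLE_MOTIFS]


hits = [
    ("11S_SEED_STORAGE", 282, 304),   # from the sample report
    ("PKC_PHOSPHO_SITE", 10, 12),     # made-up position, illustration only
    ("MYRISTYL", 40, 45),             # made-up position, illustration only
]

pruned = prune_hits(hits)                  # default behaviour (-prune)
unpruned = prune_hits(hits, prune=False)   # corresponds to -noprune
print(pruned)
print(len(unpruned))
```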
   
   Your EMBOSS administrator must have set up the local EMBOSS PROSITE
   database using the utility 'prosextract' before this program will run.
   
Usage

   Here is a sample session with patmatmotifs.
   
% patmatmotifs -full
Matching Prosite Motif Database to a single sequence.
Input sequence: sw:12s1_arath
Output file [12s1_arath.patmatmotifs]:
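
   When patmatmotifs is called from a script rather than interactively,
   the same run can be described by passing the qualifiers documented on
   this page on the command line. The sketch below only builds the
   argument list. The helper name build_command is an assumption made
   for this example, and the qualifiers used are limited to those that
   appear on this page.

```python
import shlex


def build_command(sequence, outfile, full=False, prune=True, rformat=None):
    """Build an argument list for patmatmotifs (hypothetical helper)."""
    cmd = ["patmatmotifs", "-sequence", sequence, "-outfile", outfile]
    if full:
        cmd.append("-full")          # full documentation for matches
    if not prune:
        cmd.append("-noprune")       # also report the simple motifs
    if rformat is not None:
        cmd += ["-rformat", rformat]  # report format name, e.g. "gff"
    return cmd


cmd = build_command("sw:12s1_arath", "12s1_arath.patmatmotifs", full=True)
print(shlex.join(cmd))
```

   The resulting list can be handed to subprocess.run(cmd) on a system
   where EMBOSS is installed. Since the exit status is documented below
   as always being 0, a script may prefer to check that the output file
   was written rather than rely on the return code.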

Command line arguments

   Mandatory qualifiers:
  [-sequence]          sequence   Sequence USA
  [-outfile]           report     (no help text) report value

   Optional qualifiers:
   -full               bool       Provide full documentation for matching
                                  patterns
   -[no]prune          bool       Ignore simple patterns. If this is true then
                                  these simple post-translational
                                  modification sites are not reported:
                                  myristyl, asn_glycosylation,
                                  camp_phospho_site, pkc_phospho_site,
                                  ck2_phospho_site, and tyr_phospho_site.

   Advanced qualifiers: (none)
   General qualifiers:
  -help                bool       report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   

   Mandatory qualifiers                           Allowed values     Default
   [-sequence]    Sequence USA                    Readable sequence  Required
   (Parameter 1)
   [-outfile]     (no help text) report value     Report file
   (Parameter 2)

   Optional qualifiers                            Allowed values     Default
   -full          Provide full documentation for  Yes/No             No
                  matching patterns
   -[no]prune     Ignore simple patterns. If this Yes/No             Yes
                  is true then these simple
                  post-translational modification
                  sites are not reported:
                  myristyl, asn_glycosylation,
                  camp_phospho_site,
                  pkc_phospho_site,
                  ck2_phospho_site, and
                  tyr_phospho_site.

   Advanced qualifiers                            Allowed values     Default
   (none)
   
Input file format

   A protein sequence USA.
   
Output file format

   The output is a standard EMBOSS report file.
   
   The results can be output in one of several styles by using the
   command-line qualifier -rformat xxx, where 'xxx' is replaced by the
   name of the required format. The available format names are: embl,
   genbank, gff, pir, swiss, trace, listfile, dbmotif, diffseq, excel,
   feattable, motif, regions, seqtable, simple, srs, table, tagseq
   
   See:
   http://www.uk.embnet.org/Software/EMBOSS/Themes/ReportFormats.html for
   further information on report formats.
   
   By default patmatmotifs writes a 'dbmotif' report file.
   
   The output from the sample session in the Usage section follows:
     _________________________________________________________________
   
########################################
# Program: patmatmotifs
# Rundate: Thu Apr 11 13:53:51 2002
# Report_file: 12s1_arath.patmatmotifs
########################################

#=======================================
#
# Sequence: 12S1_ARATH     from: 1   to: 472
# HitCount: 1
#
# Full: Yes
# Prune: Yes
# Data_file: /packages/emboss_dev/gwilliam/emboss/emboss/emboss/data/PROSITE/prosite.lines
#
#=======================================

Length = 23
Start = position 282 of sequence
End = position 304 of sequence


Motif = 11S_SEED_STORAGE

HGRHGNGLEETICSARCTDNLDDPSRADVYKPQ
     |                     |
   282                     304


#---------------------------------------
#
# Motif: 11S_SEED_STORAGE
# Count: 1
#
# **********************************************
# * 11-S plant seed storage proteins signature *
# **********************************************
#
# Plant seed storage proteins, whose  principal function appears to be the major
# nitrogen  source for the developing plant,  can be classified, on the basis of
# their structure, into different families.  11-S are non-glycosylated  proteins
# which form hexameric structures [1,2].  Each of the subunits in the hexamer is
# itself composed of an acidic and a basic chain derived from a single precursor
# and linked  by a  disulfide bond.   This  structure is  shown in the following
# representation.
#
#                    +-------------------------+
#                    |                         |
#         xxxxxxxxxxxCxxxxxxxxxxxxxxxxxxxxxxNGxCxxxxxxxxxxxxxxxxxxxxxxx
#                                           *********
#         <------Acidic-subunit-------------><-----Basic-subunit------>
#         <-----------------About-480-to-500-residues----------------->
#
# 'C': conserved cysteine involved in a disulfide bond.
# '*': position of the pattern.
#
# Proteins that belong to the 11-S family are: pea and broad bean legumins, rape
# cruciferin, rice glutelins,  cotton beta-globulins, soybean glycinins, pumpkin
# 11-S globulin, oat globulin, sunflower helianthinin G3, etc.
#
# As a signature  pattern  for  this  family of proteins we used the region that
# includes the  conserved  cleavage  site between  the acidic and basic subunits
# (Asn-Gly) and a  proximal cysteine residue which is involved in the interchain
# disulfide bond.
#
# -Consensus pattern: N-G-x-[DE](2)-x-[LIVMF]-C-[ST]-x(11,12)-[PAG]-D
#                     [C is involved in a disulfide bond]
# -Sequences known to belong to this class detected by the pattern: ALL.
# -Other sequence(s) detected in SWISS-PROT: NONE.
# -Last update: June 1994 / Pattern and text revised.
#
# [ 1] Hayashi M., Mori H., Nishimura M., Akazawa T., Hara-Nishimura I.
#      Eur. J. Biochem. 172:627-632(1988).
# [ 2] Shotwell M.A., Afonso C., Davies E., Chesnut R.S., Larkins B.A.
#      Plant Physiol. 87:698-704(1988).
#
# ***************
#
#
#---------------------------------------
     _________________________________________________________________
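
   The hit summary in the report above (the Length, Start, End and Motif
   lines) is regular enough to be read by a script. The sketch below
   assumes that every hit in a 'dbmotif' report is laid out exactly as in
   the sample; that layout is inferred from this one example, and other
   -rformat styles will look different. The function name parse_hits and
   the dictionary keys are assumptions made for this example.

```python
import re

# Hit summary copied from the sample report on this page.
SAMPLE = """\
Length = 23
Start = position 282 of sequence
End = position 304 of sequence


Motif = 11S_SEED_STORAGE
"""

# One hit: Length, Start, End and Motif lines, separated by whitespace
# (including any blank lines between them).
HIT_RE = re.compile(
    r"Length = (?P<length>\d+)\s+"
    r"Start = position (?P<start>\d+) of sequence\s+"
    r"End = position (?P<end>\d+) of sequence\s+"
    r"Motif = (?P<motif>\S+)"
)


def parse_hits(text):
    """Return one dict per hit found in dbmotif-style report text."""
    return [
        {
            "motif": m.group("motif"),
            "start": int(m.group("start")),
            "end": int(m.group("end")),
            "length": int(m.group("length")),
        }
        for m in HIT_RE.finditer(text)
    ]


parsed = parse_hits(SAMPLE)
for hit in parsed:
    print(hit["motif"], hit["start"], hit["end"], hit["length"])
```

   For the sample report, End - Start + 1 equals the reported Length,
   which makes a simple consistency check when reading such files.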
   
Data files

   patmatmotifs automatically reads the PROSITE data and documentation
   files. These files must first be generated and formatted by running
   prosextract.
   
Notes

   This program is only useful when prosextract has been run beforehand.
   
References

   If you want to refer to PROSITE in a publication you can do so by
   citing:
   
   Bairoch A., Bucher P., Hofmann K. The PROSITE database, its status in
   1997. Nucleic Acids Res. 25:217-221(1997).
   
   Other references:
   
    1. Bairoch, A., Bucher P. (1994) PROSITE: recent developments.
       Nucleic Acids Research, Vol 22, No.17 3583-3589.
    2. Bairoch, A., (1992) PROSITE: a dictionary of sites and patterns in
       proteins. Nucleic Acids Research, Vol 20, Supplement, 2013-2018.
    3. Peek, J., O'Reilly, T., Loukides, M., (1997) Unix Power Tools, 2nd
       Edition.
    4. Doolittle R.F. (In) Of URFs and ORFs: a primer on how to analyze
       derived amino acid sequences., University Science Books, Mill
       Valley, California, (1986).
    5. Lesk A.M. (In) Computational Molecular Biology, Lesk A.M., Ed.,
       pp17-26, Oxford University Press, Oxford (1988).
       
Warnings

   Your EMBOSS administrator must have set up the local EMBOSS PROSITE
   database using the utility 'prosextract' before this program will run.
   
Diagnostic Error Messages

   The error message:
   
"Either EMBOSS_DATA undefined or PROSEXTRACT needs running"

   indicates that your local EMBOSS administrator has not yet correctly
   set up the local EMBOSS PROSITE database using the utility
   'prosextract'.
   
Exit status

   It always exits with status 0.
   
Known bugs

   None.
   
See also

    Program name                        Description
   antigenic      Finds antigenic sites in proteins
   digest         Protein proteolytic enzyme or reagent cleavage digest
   fuzzpro        Protein pattern search
   fuzztran       Protein pattern search after translation
   helixturnhelix Report nucleic acid binding motifs
   oddcomp        Finds protein sequence regions with a biased composition
   patmatdb       Search a protein sequence with a motif
   pepcoil        Predicts coiled coil regions
   preg           Regular expression search of a protein sequence
   pscan          Scans proteins using PRINTS
   sigcleave      Reports protein signal cleavage sites
   
Author(s)

   This application was written by Sinead O'Leary
   (soleary@hgmp.mrc.ac.uk)
   
History

   Completed May 13 1999.
   
Target users

   This program is intended to be used by everyone and everything, from
   naive users to embedded scripts.
   
Comments
