
                                EMBOSS: preg
     _________________________________________________________________
   
                                 Program preg
                                       
Function

   Regular expression search of a protein sequence
   
Description

   preg searches one or more protein sequences for matches to a regular
   expression.
   
   A regular expression is a way of specifying an ambiguous pattern to
   search for. Regular expressions are commonly used in some computer
   programming languages and may be more familiar to some users than to
   others.
   
   The following is a short guide to regular expressions in EMBOSS:
   
   ^
          use this at the start of a pattern to insist that the pattern
          can only match at the start of a sequence. (eg. '^M' matches a
          methionine at the start of the sequence)
          
   $
          use this at the end of a pattern to insist that the pattern can
          only match at the end of a sequence (eg. 'R$' matches an
          arginine at the end of the sequence)
          
   ()
          groups a pattern. This is commonly used with '|' (eg.
          '(ACD)|(VWY)' matches either the first pattern 'ACD' or the
          second pattern 'VWY')
          
   |
          This is the OR operator to enable a match to be made to either
          one pattern OR another. There is no AND operator in this
          version of regular expressions.
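
   The anchors, grouping and OR operator described above can be tried
   out with a short sketch. It uses Python's re module as a stand-in for
   preg, on the assumption that both treat these basic constructs the
   same way. The sequence is invented for illustration, and positions
   are reported counting from 1.

```python
import re

# Hypothetical protein sequence, invented for illustration only.
seq = "MACDKLVWYR"

# '^M' matches a methionine only at the start of the sequence.
starts_with_met = re.search(r"^M", seq) is not None

# 'R$' matches an arginine only at the end of the sequence.
ends_with_arg = re.search(r"R$", seq) is not None

# '(ACD)|(VWY)' matches either alternative; collect every hit as
# (start position counted from 1, matched text).
hits = [(m.start() + 1, m.group()) for m in re.finditer(r"(ACD)|(VWY)", seq)]

print(starts_with_met, ends_with_arg, hits)
```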
          
   The following quantifier characters specify the number of times that
   the character before them (in this case 'x') matches:
   
   x?
          matches 0 or 1 times (ie, '' or 'x')
          
   x*
          matches 0 or more times (ie, '' or 'x' or 'xx' or 'xxx', etc)
          
   x+
          matches 1 or more times (ie, 'x' or 'xx' or 'xxx', etc)
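
   A minimal sketch of the three quantifiers, again using Python's re
   module on the assumption that it behaves like preg for these simple
   cases. The helper name matches() and the test strings are invented
   for the example.

```python
import re

def matches(pattern, text):
    """Return True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

candidates = ("AG", "AxG", "AxxG")

# 'x?' accepts zero or one 'x' between 'A' and 'G'.
optional = [matches(r"Ax?G", s) for s in candidates]

# 'x*' accepts zero or more 'x'.
star = [matches(r"Ax*G", s) for s in candidates]

# 'x+' accepts one or more 'x'.
plus = [matches(r"Ax+G", s) for s in candidates]

print(optional, star, plus)
```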
          
   Quantifiers can follow any of the following types of character
   specification:
   
   x
          any single literal character (eg 'A')
          
   \x
          the character after the backslash is taken literally instead
          of having its normal regular expression meaning. This is
          commonly used to turn off the special meaning of the characters
          '^$()|?*+[]-.'. It may be especially useful when searching for
          gap characters in a sequence (eg '\.' matches only a dot
          character '.')
          
   [xy]
          match one of the characters 'x' or 'y'. You may have one or
          more characters in this set.
          
   [x-z]
          match any one of the set of characters starting with 'x' and
          ending in 'z' in ASCII order (eg '[A-G]' matches any one of:
          'A', 'B', 'C', 'D', 'E', 'F', 'G')
          
   [^x-z]
          matches any one character that is not in the set of characters
          from 'x' to 'z' in ASCII order (eg '[^A-G]' matches anything
          EXCEPT any one of: 'A', 'B', 'C', 'D', 'E', 'F', 'G')
          
   .
          the dot character matches any single character (eg: 'A.G'
          matches 'AAG', 'AaG', 'AZG', 'A-G', 'A G', etc.)
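
   The character specifications above can be compared side by side in
   the following sketch. It relies on Python's re module, which is
   assumed here to agree with preg for escapes, sets, ranges and the
   dot. The gapped sequence fragment is invented for illustration.

```python
import re

# Hypothetical gapped sequence fragment, invented for illustration.
seq = "A.GHAZG"

# '\.' matches only a literal dot, such as a gap character.
gap_positions = [m.start() + 1 for m in re.finditer(r"\.", seq)]

# '[A-G]' matches any one character from 'A' to 'G' in ASCII order.
in_range = re.findall(r"[A-G]", seq)

# '[^A-G]' matches any one character outside that range.
out_of_range = re.findall(r"[^A-G]", seq)

# 'A.G' matches 'A', then any single character, then 'G'.
dot_hits = re.findall(r"A.G", seq)

print(gap_positions, in_range, out_of_range, dot_hits)
```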
          
   Combining some of these features gives these examples from the PROSITE
   patterns database:
'[STAGCN][RKH][LIVMAFY]$'

   which is the 'Microbodies C-terminal targeting signal'.
'LP.TG[STGAVDE]'

   which is the 'Gram-positive cocci surface proteins anchoring
   hexapeptide'.
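
   To see how these two patterns behave, the sketch below applies them
   to two short sequences with Python's re module, which is assumed to
   interpret them as preg does. Both sequences are invented so that
   each contains one match; they are not real database entries.

```python
import re

# Patterns quoted in the text above.
MICROBODY_SIGNAL = r"[STAGCN][RKH][LIVMAFY]$"
LPXTG_ANCHOR = r"LP.TG[STGAVDE]"

# Hypothetical sequences, invented for illustration only.
peroxisomal_like = "MKTAYIAKQRSKL"
surface_like = "MKKLPETGEEAAV"

m1 = re.search(MICROBODY_SIGNAL, peroxisomal_like)
m2 = re.search(LPXTG_ANCHOR, surface_like)

# Positions are reported counting from 1.
signal_hit = (m1.start() + 1, m1.group())
anchor_hit = (m2.start() + 1, m2.group())

print(signal_hit, anchor_hit)
```

   The '$' in the first pattern restricts the match to the last three
   residues, so the same tripeptide elsewhere in a sequence would not
   be reported.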
   
   Regular expressions are case-sensitive. The pattern 'AAAA' will not
   match the sequence 'aaaa'.
   
Usage

   Here is a sample session with preg.
   
% preg
regular expression search of a protein sequence
Input sequence(s): sw:*_rat
Output file [100k_rat.preg]: stdout
Regular expression pattern: IA[QWF]A

Command line arguments

   Mandatory qualifiers:
  [-sequence]          seqall     Sequence database USA
  [-pattern]           regexp     Regular expression pattern
  [-outfile]           outfile    Output file name

   Optional qualifiers: (none)
   Advanced qualifiers: (none)
   General qualifiers:
  -help                bool       report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   

   Mandatory qualifiers                       Allowed values          Default
   [-sequence]    Sequence database USA       Readable sequence(s)    Required
   (Parameter 1)
   [-pattern]     Regular expression pattern  Any regular expression  Required
   (Parameter 2)                              pattern is accepted
   [-outfile]     Output file name            Output file             <sequence>.preg
   (Parameter 3)
   Optional qualifiers                        Allowed values          Default
   (none)
   Advanced qualifiers                        Allowed values          Default
   (none)
   
Input file format

   Any protein sequence.
   
Output file format

   Here is the output from the example run:
     _________________________________________________________________
   
preg search of sw:*_rat with pattern IA[QWF]A
Matches in 100K_RAT
       100K_RAT   390 IAQA
Matches in 5H6_RAT
        5H6_RAT   289 IAQA
Matches in ACDS_RAT
       ACDS_RAT   282 IAQA
Matches in ANX2_RAT
       ANX2_RAT    70 IAFA
Matches in APB3_RAT
       APB3_RAT   336 IAQA
Matches in AQP9_RAT
       AQP9_RAT    44 IAQA
Matches in ATHA_RAT
       ATHA_RAT   122 IAFA
Matches in CD14_RAT
       CD14_RAT   178 IAQA
Matches in CIKE_RAT
       CIKE_RAT   231 IAFA
Matches in CLCB_RAT
       CLCB_RAT    90 IAQA
Matches in CTR1_RAT
       CTR1_RAT   590 IAFA
Matches in CYGF_RAT
       CYGF_RAT   359 IAQA
Matches in DPY2_RAT
       DPY2_RAT   264 IAQA
Matches in ENOB_RAT
       ENOB_RAT   327 IAQA
Matches in ERBP_RAT
       ERBP_RAT    40 IAFA
Matches in GLPK_RAT
       GLPK_RAT   392 IAFA
Matches in GPV_RAT
        GPV_RAT   529 IAQA
Matches in IRKB_RAT
       IRKB_RAT    93 IAFA
Matches in KGP2_RAT
       KGP2_RAT   477 IAFA
Matches in NPX1_RAT
       NPX1_RAT   407 IAWA
Matches in NTDO_RAT
       NTDO_RAT   160 IAWA
Matches in NTSE_RAT
       NTSE_RAT   180 IAWA
Matches in PAX8_RAT
       PAX8_RAT   188 IAQA
Matches in SRA4_RAT
       SRA4_RAT   491 IAWA
Matches in SYNP_RAT
       SYNP_RAT    43 IAFA
Matches in TGN3_RAT
       TGN3_RAT   330 IAFA
Matches in TGR3_RAT
       TGR3_RAT   792 IAFA
Matches in UDB2_RAT
       UDB2_RAT   325 IAWA
Matches in UDB3_RAT
       UDB3_RAT   325 IAWA
Matches in UDB6_RAT
       UDB6_RAT   325 IAWA
Matches in UDBC_RAT
       UDBC_RAT   325 IAWA
Matches in VMT2_RAT
       VMT2_RAT   462 IAFA
     _________________________________________________________________
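
   For scripts that consume this output, a minimal parsing sketch in
   Python follows. It assumes that every hit line consists of exactly
   three whitespace-separated fields (sequence name, position, matched
   text), as in the listing above, and that matches contain no spaces.
   The names Hit and parse_preg_output are hypothetical and are not
   part of EMBOSS.

```python
from typing import List, NamedTuple

class Hit(NamedTuple):
    name: str      # sequence identifier
    position: int  # position column as printed by preg
    match: str     # matched residues

def parse_preg_output(text: str) -> List[Hit]:
    """Collect the hit lines from preg output, skipping all other lines."""
    hits = []
    for line in text.splitlines():
        fields = line.split()
        # Hit lines have three fields with a numeric second field; the
        # title, 'Matches in ...' and rule lines do not.
        if len(fields) == 3 and fields[1].isdigit():
            hits.append(Hit(fields[0], int(fields[1]), fields[2]))
    return hits

# First lines of the example output shown above.
sample = """\
preg search of sw:*_rat with pattern IA[QWF]A
Matches in 100K_RAT
       100K_RAT   390 IAQA
Matches in ANX2_RAT
       ANX2_RAT    70 IAFA
"""

hits = parse_preg_output(sample)
print(hits)
```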
   
Data files

   None.
   
Notes

   None.
   
References

   None.
   
Warnings

   Regular expressions are case-sensitive. The pattern 'AAAA' will not
   match the sequence 'aaaa'.
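
   The effect can be reproduced with Python's re module, which is also
   case-sensitive by default. The sketch below shows one possible
   workaround, converting the sequence to upper case before searching.
   This is an illustration outside preg itself, and the sequence is
   invented for the example.

```python
import re

pattern = r"AAAA"
# Hypothetical lower-case sequence, invented for illustration.
seq = "gcaaaatt"

# The upper-case pattern does not match lower-case residues.
direct_hit = re.search(pattern, seq) is not None

# Converting the sequence to upper case first lets the search succeed.
normalised_hit = re.search(pattern, seq.upper()) is not None

print(direct_hit, normalised_hit)
```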
   
Diagnostic Error Messages

   None.
   
Exit status

   It always exits with a status of 0.
   
Known bugs

   None.
   
See also

    Program name                        Description
   antigenic      Finds antigenic sites in proteins
   digest         Protein proteolytic enzyme or reagent cleavage digest
   fuzzpro        Protein pattern search
   fuzztran       Protein pattern search after translation
   helixturnhelix Report nucleic acid binding motifs
   oddcomp        Finds protein sequence regions with a biased composition
   patmatdb       Search a protein sequence with a motif
   patmatmotifs   Search a PROSITE motif database with a protein sequence
   pepcoil        Predicts coiled coil regions
   pscan          Scans proteins using PRINTS
   sigcleave      Reports protein signal cleavage sites
   
   Other EMBOSS programs allow you to search for simple patterns and may
   be easier for the user who has never used regular expressions before:
   
     * fuzznuc - Nucleic acid pattern search
     * fuzzpro - Protein pattern search
     * fuzztran - Protein pattern search after translation
       
Author(s)

   This application was written by Peter Rice (pmr@sanger.ac.uk)
   Informatics Division, The Sanger Centre, Wellcome Trust Genome Campus,
   Hinxton, Cambridge, CB10 1SA, UK.
   
History

   Written (1999) - Peter Rice
   
Target users

   This program is intended to be used by everyone and everything, from
   naive users to embedded scripts.
   
Comments
