
                               EMBOSS: getorf
     _________________________________________________________________
   
                                Program getorf
                                       
Function

   Finds and extracts open reading frames (ORFs)
   
Description

   This program finds and outputs the sequences of open reading frames
   (ORFs).
   
   The ORFs can be defined as regions of a specified minimum size between
   STOP codons or between START and STOP codons.
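
   The two ORF definitions can be illustrated with a minimal Python
   sketch. It is not getorf's implementation: the function name
   orfs_in_frame and its parameters are hypothetical, only one forward
   frame is scanned, the codon sets are those of the Standard code
   (table 0), and the reported coordinates are assumed to exclude the
   closing STOP codon.

```python
STOPS = {"TAA", "TAG", "TGA"}   # STOP codons of the Standard code
STARTS = {"ATG"}                # the only START codon in table 0


def orfs_in_frame(seq, frame, minsize=30, need_start=False):
    """Return 1-based inclusive (start, end) pairs for one forward frame.

    need_start=False: regions between STOP codons (or a sequence end).
    need_start=True:  regions from the first START codon to the next STOP.
    """
    seq = seq.upper()
    found = []
    # Index just past the last complete codon in this frame.
    last = len(seq) - (len(seq) - frame) % 3
    region_start = frame
    for pos in range(frame, last + 1, 3):
        if pos < last and seq[pos:pos + 3] not in STOPS:
            continue
        # A STOP codon (or the sequence end) closes the current region.
        begin = region_start
        if need_start:
            # Trim the region so that it begins at its first START codon.
            begin = next((i for i in range(region_start, pos, 3)
                          if seq[i:i + 3] in STARTS), None)
        if begin is not None and pos - begin >= minsize:
            found.append((begin + 1, pos))
        region_start = pos + 3
    return found
```

   For the toy sequence ATGAAATAGCCCATGGGGTTTTAA, frame 0 and a minimum
   size of 6, the sketch returns [(1, 6), (10, 21)] for regions between
   STOP codons and [(1, 6), (13, 21)] when a START codon is required.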
   
   The ORFs can be output as the nucleotide sequence or as the
   translation.
   
   The program can also output the region around the START codon, the
   initial STOP codon or the ending STOP codon of an ORF, for those
   analysing the properties of these regions.
   
   The START and STOP codons are defined in the Genetic Code tables. A
   suitable Genetic Code table can be selected for the organism you are
   investigating.
   
Usage

   Here is a sample session with getorf.

% getorf -minsize 300
Input sequence: embl:eclaci
Output sequence [eclaci.orf]:

Command line arguments

   Mandatory qualifiers:
  [-sequence]          seqall     Sequence database USA
  [-outseq]            seqoutall  Output sequence(s) USA

   Optional qualifiers:
   -table              menu       Code to use
   -minsize            integer    Minimum nucleotide size of ORF to report
   -find               menu       This is a small menu of possible output
                                  options. The first four options are to
                                  select either the protein translation or the
                                  original nucleic acid sequence of the open
                                  reading frame. There are two possible
                                  definitions of an open reading frame: it can
                                  either be a region that is free of STOP
                                  codons or a region that begins with a START
                                  codon and ends with a STOP codon. The last
                                  three options are probably only of interest
                                  to people who wish to investigate the
                                  statistical properties of the regions around
                                  potential START or STOP codons. The last
                                  option assumes that ORF lengths are
                                  calculated between two STOP codons.

   Advanced qualifiers:
   -[no]methionine     bool       START codons at the beginning of protein
                                  products will usually code for Methionine,
                                  despite what the codon will code for when it
                                  is internal to a protein. This qualifier
                                  sets all such START codons to code for
                                  Methionine by default.
   -circular           bool       Is the sequence circular
   -[no]reverse        bool       Set this to be false if you do not wish to
                                  find ORFs in the reverse complement of the
                                  sequence.
   -flanking           integer    If you have chosen one of the options of the
                                  type of sequence to find that gives the
                                  flanking sequence around a STOP or START
                                  codon, this allows you to set the number of
                                  nucleotides either side of that codon to
                                  output. If the region of flanking
                                  nucleotides crosses the start or end of the
                                  sequence, no output is given for this codon.

   General qualifiers:
  -help                bool       report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
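
   The behaviour described for -flanking can be sketched in Python as
   follows. The function name flanking_region is hypothetical, and the
   sketch assumes that the flanking nucleotides are counted outward from
   the two edges of the codon; getorf itself may count differently.

```python
def flanking_region(seq, codon_start, flanking=100):
    """Return the codon at 0-based index codon_start plus `flanking`
    nucleotides on either side, or None if the window would cross the
    start or end of the sequence."""
    left = codon_start - flanking
    right = codon_start + 3 + flanking
    if left < 0 or right > len(seq):
        return None          # no output is given for this codon
    return seq[left:right]
```

   For example, flanking_region("AACCATGGGTT", 4, flanking=2) returns
   "CCATGGG", while flanking=5 returns None because the window would
   start before the first nucleotide.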
   

   Mandatory qualifiers:

   [-sequence] (Parameter 1)
          Sequence database USA
          Allowed values: Readable sequence(s)
          Default: Required

   [-outseq] (Parameter 2)
          Output sequence(s) USA
          Allowed values: Writeable sequence(s)
          Default: <sequence>.format

   Optional qualifiers:

   -table
          Code to use
          Allowed values:
            0 (Standard)
            1 (Standard (with alternative initiation codons))
            2 (Vertebrate Mitochondrial)
            3 (Yeast Mitochondrial)
            4 (Mold, Protozoan, Coelenterate Mitochondrial and
              Mycoplasma/Spiroplasma)
            5 (Invertebrate Mitochondrial)
            6 (Ciliate Macronuclear and Dasycladacean)
            9 (Echinoderm Mitochondrial)
            10 (Euplotid Nuclear)
            11 (Bacterial)
            12 (Alternative Yeast Nuclear)
            13 (Ascidian Mitochondrial)
            14 (Flatworm Mitochondrial)
            15 (Blepharisma Macronuclear)
            16 (Chlorophycean Mitochondrial)
            21 (Trematode Mitochondrial)
            22 (Scenedesmus obliquus)
            23 (Thraustochytrium Mitochondrial)
          Default: 0

   -minsize
          Minimum nucleotide size of ORF to report
          Allowed values: Any integer value
          Default: 30

   -find
          This is a small menu of possible output options. The first four
          options are to select either the protein translation or the
          original nucleic acid sequence of the open reading frame. There
          are two possible definitions of an open reading frame: it can
          either be a region that is free of STOP codons or a region that
          begins with a START codon and ends with a STOP codon. The last
          three options are probably only of interest to people who wish
          to investigate the statistical properties of the regions around
          potential START or STOP codons. The last option assumes that
          ORF lengths are calculated between two STOP codons.
          Allowed values:
            0 (Translation of regions between STOP codons)
            1 (Translation of regions between START and STOP codons)
            2 (Nucleic sequences between STOP codons)
            3 (Nucleic sequences between START and STOP codons)
            4 (Nucleotides flanking START codons)
            5 (Nucleotides flanking initial STOP codons)
            6 (Nucleotides flanking ending STOP codons)
          Default: 0

   Advanced qualifiers:

   -[no]methionine
          START codons at the beginning of protein products will usually
          code for Methionine, despite what the codon will code for when
          it is internal to a protein. This qualifier sets all such START
          codons to code for Methionine by default.
          Allowed values: Yes/No
          Default: Yes

   -circular
          Is the sequence circular
          Allowed values: Yes/No
          Default: No

   -[no]reverse
          Set this to be false if you do not wish to find ORFs in the
          reverse complement of the sequence.
          Allowed values: Yes/No
          Default: Yes

   -flanking
          If you have chosen one of the options of the type of sequence
          to find that gives the flanking sequence around a STOP or START
          codon, this allows you to set the number of nucleotides either
          side of that codon to output. If the region of flanking
          nucleotides crosses the start or end of the sequence, no output
          is given for this codon.
          Allowed values: Any integer value
          Default: 100
   
Input file format

   Any nucleic acid sequence USA.
   
Output file format

   The output is a sequence file containing predicted open reading frames
   longer than the minimum size, which defaults to 30 bases or 10 amino
   acids.
   
   The results from the example run are:
   
>ECLACI_1 [735 - 1112] E. coli laci gene (codes for the lac repressor).
GHRSHCDAGCQRSDGAGRNARHYRVRAARWCGYLGSGIRRYRRQLMLYPAVNHHQTGFSP
AGANQRGPLAATLSGPGGEGQSAVARLTGEKKNHPGAQYANRLSPRVGRFINAAGTTGFP
TGKRAV
>ECLACI_2 [1 - 1110] E. coli laci gene (codes for the lac repressor).
PEESQFRVVNVKPVTLYDVAEYAGVSYQTVSRVVNQASHVSAKTREKVEAAMAELNYIPN
RVAQQLAGKQSLLIGVATSSLALHAPSQIVAAIKSRADQLGASVVVSMVERSGVEACKAA
VHNLLAQRVSGLIINYPLDDQDAIAVEAACTNVPALFLDVSDQTPINSIIFSHEDGTRLG
VEHLVALGHQQIALLAGPLSSVSARLRLAGWHKYLTRNQIQPIAEREGDWSAMSGFQQTM
QMLNEGIVPTAMLVANDQMALGAMRAITESGLRVGADISVVGYDDTEDSSCYIPPSTTIK
QDFRLLGQTSVDRLLQLSQGQAVKGNQLLPVSLVKRKTTLAPNTQTASPRALADSLMQLA
RQVSRLESGQ*
>ECLACI_3 [465 - 49] E. coli laci gene (codes for the lac repressor).
RRNISAGSFHSNGILVIQRIVNDQPTDALREKIVHRRFTGFDAASFYHRHHHAGTQLIGA
RFNRRDNLRRRVQGQTGGGNANQQRLFARQLLCHAVGNVIQLRHRRFHFFPRFRRNVAGL
VHHAGNGLIRDTGILCDIV

   All output ORF sequences are written to the specified output file.
   
   The name of the ORF sequences is constructed from the name of the
   input sequence with an underscore character ('_') and a unique ordinal
   number of the ORF found appended. The description of the output ORF
   sequence is constructed from the description of the input sequence
   with the start and end positions of the ORF prepended.
   
   The unique number appended to the name is used only to create new
   unique sequence names. It carries no further information about the
   order, position or sense-strand of the ORFs.
   
   If the ORF has been found in the reverse sense, then the start
   position will be larger than the end position. The numbering uses the
   forward-sense positions, but read in the reverse sense. For example,
   >ECLACI_3 [465 - 49] in the output above is a reverse-sense ORF
   running from position 465 to 49.
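
   A script that post-processes the output can recover the positions and
   the sense of each ORF from its header line. The following Python
   sketch assumes the header layout shown in the example output above;
   the function name parse_orf_header and the returned field names are
   hypothetical.

```python
import re

# >NAME_N [START - END] description
HEADER = re.compile(
    r"^>(?P<name>\S+)_(?P<number>\d+) "
    r"\[(?P<start>\d+) - (?P<end>\d+)\]\s*(?P<desc>.*)$"
)


def parse_orf_header(line):
    """Split one getorf header line into its parts."""
    m = HEADER.match(line.strip())
    if m is None:
        raise ValueError("not a getorf-style header: %r" % line)
    start, end = int(m["start"]), int(m["end"])
    return {
        "source": m["name"],            # name of the input sequence
        "number": int(m["number"]),     # ordinal number, no other meaning
        "start": start,
        "end": end,
        # A start larger than the end marks a reverse-sense ORF.
        "strand": "-" if start > end else "+",
        "description": m["desc"],
    }
```

   Applied to the header of ECLACI_3 above, the sketch reports start 465,
   end 49 and strand "-".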
   
Data files

   The START and STOP codons used by getorf are defined in the Genetic
   Code data files. By default, Genetic Code file EGC.0 is used.
   
   The default file EGC.0 is the 'Standard Code' with the rarely used
   alternate START codons omitted, so it has only the normal 'AUG' START
   codon. The 'Standard Code' with the rarely used alternate START codons
   included is Genetic Code file EGC.1.
   
   It is expected that users will sometimes wish to customise a Genetic
   Code file. To do this, use the program embossdata.
   
   EMBOSS data files are distributed with the application and stored in
   the standard EMBOSS data directory, which is defined by the EMBOSS
   environment variable EMBOSS_DATA.
   
   To see the available EMBOSS data files, run:
   
% embossdata -showall

   To fetch one of the data files (for example 'Exxx.dat') into your
   current directory for you to inspect or modify, run:

% embossdata -fetch -file Exxx.dat

   Users can provide their own data files in their own directories.
   Project specific files can be put in the current directory, or for
   tidier directory listings in a subdirectory called ".embossdata".
   Files for all EMBOSS runs can be put in the user's home directory, or
   again in a subdirectory called ".embossdata".
   
   The directories are searched in the following order:
     * . (your current directory)
     * .embossdata (under your current directory)
     * ~/ (your home directory)
     * ~/.embossdata
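
   The search order above can be sketched in Python as follows. The
   function name find_data_file is hypothetical, and the sketch covers
   only the four user directories listed; it is assumed that a caller
   would fall back to the standard EMBOSS data directory when nothing is
   found.

```python
from pathlib import Path


def find_data_file(name, cwd=None, home=None):
    """Return the first match for `name` in the user directories,
    searched in the documented order, or None if there is no match."""
    cwd = Path(cwd) if cwd is not None else Path.cwd()
    home = Path(home) if home is not None else Path.home()
    candidates = [
        cwd,                    # .
        cwd / ".embossdata",    # .embossdata
        home,                   # ~/
        home / ".embossdata",   # ~/.embossdata
    ]
    for directory in candidates:
        path = directory / name
        if path.is_file():
            return path
    return None
```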
       
   The Genetic Code data files are based on the NCBI genetic code tables.
   Their names and descriptions are:
   
   EGC.0
          Standard (Differs from EGC.1 in that it only has initiation site
          'AUG')
          
   EGC.1
          Standard
          
   EGC.2
          Vertebrate Mitochondrial
          
   EGC.3
          Yeast Mitochondrial
          
   EGC.4
          Mold, Protozoan, Coelenterate Mitochondrial and
          Mycoplasma/Spiroplasma
          
   EGC.5
          Invertebrate Mitochondrial
          
   EGC.6
          Ciliate Macronuclear and Dasycladacean
          
   EGC.9
          Echinoderm Mitochondrial
          
   EGC.10
          Euplotid Nuclear
          
   EGC.11
          Bacterial
          
   EGC.12
          Alternative Yeast Nuclear
          
   EGC.13
          Ascidian Mitochondrial
          
   EGC.14
          Flatworm Mitochondrial
          
   EGC.15
          Blepharisma Macronuclear
          
   EGC.16
          Chlorophycean Mitochondrial
          
   EGC.21
          Trematode Mitochondrial
          
   EGC.22
          Scenedesmus obliquus
          
   EGC.23
          Thraustochytrium Mitochondrial
          
   The format of these files is very simple.
   
   It consists of several lines of optional comments, each starting with
   a '#' character.
   
   These are followed by the line: 'Genetic Code [n]', where 'n' is the
   number of the genetic code file.
   
   This is followed by the description of the code and then by five lines
   giving the IUPAC one-letter code of the translated amino acid, the
   start codons (indicated by an 'M') and the three bases of the codon,
   lined up one on top of the other.
   
   For example:

------------------------------------------------------------------------------
# Genetic Code Table
#
# Obtained from: http://www.ncbi.nlm.nih.gov/collab/FT/genetic_codes.html
# and: http://www3.ncbi.nlm.nih.gov/htbin-post/Taxonomy/wprintgc?mode=c
#
# Differs from Genetic Code [1] only in that the initiation sites have been
# changed to only 'AUG'

Genetic Code [0]
Standard

AAs  =   FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
Starts = -----------------------------------M----------------------------
Base1  = TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
Base2  = TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
Base3  = TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
------------------------------------------------------------------------------
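
   Because the five data lines are aligned column by column, a file in
   this format can be turned into a codon table by reading the columns
   together. The following Python sketch shows one way to do this; the
   function name parse_genetic_code is hypothetical and the sketch is not
   the parser used by EMBOSS.

```python
def parse_genetic_code(text):
    """Return (codon -> amino acid dict, set of START codons)."""
    rows = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, '#' comments, and the title and description lines.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        rows[key.strip()] = value.strip()
    # Column i of Base1, Base2 and Base3 spells out codon i.
    codons = ["".join(bases)
              for bases in zip(rows["Base1"], rows["Base2"], rows["Base3"])]
    table = dict(zip(codons, rows["AAs"]))
    starts = {codon for codon, flag in zip(codons, rows["Starts"])
              if flag == "M"}
    return table, starts
```

   Applied to the EGC.0 example above, the sketch yields 64 codons, maps
   'TAA', 'TAG' and 'TGA' to '*' and reports 'ATG' as the only START
   codon. Note that the files use DNA letters, so the 'AUG' START codon
   appears as 'ATG'.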

Notes

   None.
   
References

   None.
   
Warnings

   None.
   
Diagnostic Error Messages

   None.
   
Exit status

   It always exits with status 0.
   
Known bugs

   None.
   
See also

   Program name               Description
   checktrans   Reports STOP codons and ORF statistics of a protein
                sequence
   marscan      Finds MAR/SAR sites in nucleic sequences
   plotorf      Plot potential open reading frames
   showorf      Pretty output of DNA translations
   wobble       Wobble base plot
       
Author(s)

   This application was written by Gary Williams
   (gwilliam@hgmp.mrc.ac.uk)
   
History

   2000 - written - Gary Williams
   
Target users

   This program is intended to be used by everyone and everything, from
   naive users to embedded scripts.
   
Comments
