
                              EMBOSS: diffseq
     _________________________________________________________________
   
                                Program diffseq
                                       
Function

   Find differences (SNPs) between nearly identical sequences
   
Description

   diffseq takes two overlapping, nearly identical sequences and reports
   the differences between them, together with any features that overlap
   with these regions. GFF files of the differences in each sequence are
   also produced.
   
   diffseq should be of value when looking for SNPs, differences between
   strains of an organism and anything else that requires the differences
   between sequences to be highlighted.
   
   The sequences can be very long. The program finds all exact matches of
   sequence words of size 10 (the default word size) between the two
   sequences. It then reduces these to a minimal set of non-overlapping
   matches by sorting the matches by size (largest first) and, for each
   match in turn, removing any smaller matches that overlap it. The
   result is a set of the longest ungapped alignments between the two
   sequences that do not overlap with each other. The mismatched regions
   between these matches are reported.
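
   The steps described above can be sketched in a few lines of Python.
   This is only an illustration of the description, not the EMBOSS
   source: the function names are invented for the example, and the real
   program may treat overlapping matches differently in detail.

```python
from collections import defaultdict


def word_matches(a, b, wordsize=10):
    """Return maximal ungapped exact matches as (length, start_a, start_b)."""
    index = defaultdict(list)
    for j in range(len(b) - wordsize + 1):
        index[b[j:j + wordsize]].append(j)
    matches = []
    for i in range(len(a) - wordsize + 1):
        for j in index.get(a[i:i + wordsize], ()):
            # If the preceding bases also match, this word is part of a
            # longer match that was already started further left.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            n = wordsize
            while i + n < len(a) and j + n < len(b) and a[i + n] == b[j + n]:
                n += 1
            matches.append((n, i, j))
    return matches


def longest_non_overlapping(matches):
    """Keep the largest matches first, dropping smaller ones that overlap."""
    kept = []
    for n, i, j in sorted(matches, reverse=True):
        if all((i + n <= ki or ki + kn <= i) and (j + n <= kj or kj + kn <= j)
               for kn, ki, kj in kept):
            kept.append((n, i, j))
    return sorted(kept, key=lambda m: m[1])


def differences(a, b, wordsize=10):
    """Return the mismatched regions between consecutive kept matches."""
    kept = longest_non_overlapping(word_matches(a, b, wordsize))
    return [(a[i1 + n1:i2], b[j1 + n1:j2])
            for (n1, i1, j1), (n2, i2, j2) in zip(kept, kept[1:])]
```

   For example, differences("gattacaggcttaaccgt", "gattacaggattaaccgt", 4)
   yields a single difference, c in the first sequence against a in the
   second.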
   
   It should be possible to find differences between sequences that are
   megabytes long.
   
Usage

   Here is a sample session with diffseq:

% diffseq tembl:ap000504 tembl:af129756
Find differences (SNPs) between nearly identical sequences
Word size [10]:
Output file [ap000504.diffseq]:

Command line arguments

   Mandatory qualifiers:
  [-asequence]         sequence   Sequence USA
  [-bsequence]         sequence   Sequence USA
   -wordsize           integer    Word size
   -outfile            report     Output report file

   Optional qualifiers:
   -afeatout           featout    File for output of first sequence's normal
                                  tab delimited gff's
   -bfeatout           featout    File for output of second sequence's normal
                                  tab delimited gff's
   -columns            bool       The default format for the output report
                                  file is to have several lines per difference
                                  giving the sequence positions, sequences
                                  and features.
                                  If this option is set true then the output
                                  report file's format is changed to a set of
                                  columns and no feature information is given.

   Advanced qualifiers: (none)
   General qualifiers:
  -help                bool       report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   

   Qualifier      Description                  Allowed values           Default

   Mandatory qualifiers
   [-asequence]   Sequence USA                 Readable sequence        Required
   (Parameter 1)
   [-bsequence]   Sequence USA                 Readable sequence        Required
   (Parameter 2)
   -wordsize      Word size                    Integer 2 or more        10
   -outfile       Output report file           Report file

   Optional qualifiers
   -afeatout      File for output of first     Writeable feature table  $(asequence.name).diffgff
                  sequence's normal tab
                  delimited gff's
   -bfeatout      File for output of second    Writeable feature table  $(bsequence.name).diffgff
                  sequence's normal tab
                  delimited gff's
   -columns       The default format for the   Yes/No                   No
                  output report file is to
                  have several lines per
                  difference giving the
                  sequence positions,
                  sequences and features. If
                  this option is set true
                  then the output report
                  file's format is changed to
                  a set of columns and no
                  feature information is
                  given.

   Advanced qualifiers
   (none)
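
   When all mandatory qualifiers are supplied on the command line, the
   program can be driven from a script. The following is a minimal
   sketch, assuming EMBOSS is installed and diffseq is on the PATH; the
   helper names are invented for the example, and only qualifiers listed
   above are used.

```python
import subprocess


def diffseq_command(asequence, bsequence, outfile, wordsize=10):
    """Build the argument list for a diffseq run (hypothetical helper)."""
    if wordsize < 2:
        # The qualifier table allows an integer of 2 or more.
        raise ValueError("wordsize must be 2 or more")
    return [
        "diffseq",
        "-asequence", asequence,
        "-bsequence", bsequence,
        "-wordsize", str(wordsize),
        "-outfile", outfile,
    ]


def run_diffseq(asequence, bsequence, outfile, wordsize=10):
    """Run diffseq; raises CalledProcessError if the program fails."""
    subprocess.run(diffseq_command(asequence, bsequence, outfile, wordsize),
                   check=True)
```

   For the sample session shown earlier, the call would be
   run_diffseq("tembl:ap000504", "tembl:af129756", "ap000504.diffseq").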
   
Input file format

   This program reads in two nucleic acid sequence USAs or two protein
   sequence USAs.
   
Output file format

   A report of the differences between the two sequences is produced,
   together with any features that overlap with these differing regions.
   
   The output is a standard EMBOSS report file.
   
   The results can be output in one of several styles by using the
   command-line qualifier -rformat xxx, where 'xxx' is replaced by the
   name of the required format. The available format names are: embl,
   genbank, gff, pir, swiss, trace, listfile, dbmotif, diffseq, excel,
   feattable, motif, regions, seqtable, simple, srs, table, tagseq
   
   See:
   http://www.uk.embnet.org/Software/EMBOSS/Themes/ReportFormats.html for
   further information on report formats.
   
   By default diffseq writes a 'diffseq' report file.
  __________________________________________________________________________

########################################
# Program: diffseq
# Rundate: Mon Feb 11 13:16:56 2002
# Report_file: ap000504.diffseq
# Additional_files: 2
# 1: AP000504.diffgff (Feature file for first sequence)
# 2: AF129756.diffgff (Feature file for second sequence)
########################################

#=======================================
#
# Sequence: AP000504     from: 1   to: 100000
# HitCount: 119
#
# Compare: AF129756     from: 1   to: 184666
#
# AP000504 overlap starts at 1
# AF129756 overlap starts at 6036
#
# (AP000504) start end length sequence
# (AF129756) start end length sequence
#
#
#
#=======================================


AP000504 847-847 Length: 1
Sequence: a
Sequence: t
AF129756 6882-6882 Length: 1

AP000504 1795-1795 Length: 1
Sequence: g
Sequence: a
AF129756 7830-7830 Length: 1

AP000504 2273-2273 Length: 1
Sequence: t
Sequence:
Feature: repeat_region 7920-8351 rpt_family='MSTB'
AF129756 8307 Length: 0

AP000504 2466-2466 Length: 1
Sequence: g
Sequence: a
Feature: repeat_region 8391-8686 rpt_family='AluSg'
AF129756 8500-8500 Length: 1

AP000504 2655-2658 Length: 4
Sequence: tgtg
Sequence:
Feature: repeat_region 8687-8731 rpt_family='(CA)n'
AF129756 8688 Length: 0

AP000504 4914 Length: 0
Sequence:
Sequence: gtgtgtgtgtgtgtgtgt
Feature: repeat_region 10910-10972 rpt_family='(CA)n'
AF129756 10945-10962 Length: 18

AP000504 4951-4953 Length: 3
Sequence: aaa
Sequence: tat
Feature: repeat_region 10991-11020 rpt_family='AT_rich'
AF129756 10999-11001 Length: 3

AP000504 6600-6600 Length: 1
Sequence: t
Sequence:
Feature: repeat_region 12628-12930 rpt_family='AluSq'
AF129756 12647 Length: 0


etc.


AP000504 97273-97274 Length: 2
Sequence: aa
Sequence:
Feature: repeat_region 103299-103402 rpt_family='AluSq'
AF129756 103302 Length: 0

AP000504 97716-97716 Length: 1
Sequence: a
Sequence: g
AF129756 103744-103744 Length: 1

AP000504 97827-97827 Length: 1
Sequence: c
Sequence: t
Feature: repeat_region 103784-104083 rpt_family='AluSx'
AF129756 103855-103855 Length: 1

#---------------------------------------
#
# Overlap_end: 100000 in AP000504
# Overlap_end: 106028 in AF129756
#
# SNP_count: 86
# Transitions: 58
# Transversions: 28
#
#
#---------------------------------------
  __________________________________________________________________________

   The first line is the title giving the names of the sequences used.
   
   The next two non-blank lines state the positions in each sequence
   where the detected overlap between them starts.
   
   There then follows a set of reports of the mismatches between the
   sequences.
   Each report consists of 4 or more lines.
     * The first line has the name of the first sequence followed by the
       start and end positions of the mismatched region in that sequence,
       followed by the length of the mismatched region. If the mismatched
       region is of zero length in this sequence, then only the position
       of the last matching base before the mismatch is given.
     * If a feature of the first sequence overlaps with this mismatch
       region, then one or more lines starting with 'Feature:' come next
       with the type, position and tag field of the feature.
     * Next is a line starting "Sequence:" giving the sequence of the
       mismatch in the first sequence.
       
   This is followed by the equivalent information for the second
   sequence, but in the reverse order, namely 'Sequence:' line,
   'Feature:' lines and line giving the position of the mismatch in the
   second sequence.
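
   The layout described above can be read back with a short script. The
   following Python sketch is an illustration only: the function names
   and the dictionary keys are invented for the example, and it assumes
   that reports are separated by blank lines and that header and footer
   lines start with '#', as in the sample output.

```python
import re

# Matches "NAME 2466-2466 Length: 1" and the zero-length form
# "NAME 8688 Length: 0", where only one position is given.
POSITION = re.compile(r"^(\S+)\s+(\d+)(?:-(\d+))?\s+Length:\s*(\d+)$")


def _region(match, sequence, features):
    name, start, end, length = match.groups()
    return {"name": name, "start": int(start),
            "end": int(end) if end else None, "length": int(length),
            "sequence": sequence, "features": features}


def parse_block(lines):
    """Turn the lines of one mismatch report into a pair of dictionaries."""
    first = POSITION.match(lines[0])
    last = POSITION.match(lines[-1])
    if first is None or last is None:
        raise ValueError("unrecognised position line in %r" % (lines,))
    sequences = []
    features = ([], [])  # first sequence, second sequence
    for line in lines[1:-1]:
        if line.startswith("Sequence:"):
            sequences.append(line[len("Sequence:"):].strip())
        elif line.startswith("Feature:"):
            # Feature lines before any 'Sequence:' line belong to the
            # first sequence, later ones to the second.
            which = 0 if not sequences else 1
            features[which].append(line[len("Feature:"):].strip())
    if len(sequences) != 2:
        raise ValueError("expected two 'Sequence:' lines in %r" % (lines,))
    return (_region(first, sequences[0], features[0]),
            _region(last, sequences[1], features[1]))


def parse_report(text):
    """Return one (first, second) pair per mismatch report in the text."""
    records, block = [], []
    for raw in text.splitlines() + [""]:
        line = raw.strip()
        if line.startswith("#"):
            continue
        if line:
            block.append(line)
        elif block:
            records.append(parse_block(block))
            block = []
    return records
```

   A zero-length region has "end" set to None, because only the position
   of the last matching base is given for it.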
   
   At the end of the report are two non-blank lines giving the positions
   in each sequence where the detected overlap between them ends.
   
   The last three lines of the report give the count of SNPs (defined as
   a change of one nucleotide to one other nucleotide; deletions,
   insertions and multi-base changes are not counted).
   
   The counts of transitions (pyrimidine to pyrimidine or purine to
   purine) and transversions (pyrimidine to purine or purine to
   pyrimidine) are also given.
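
   The classification can be illustrated with a small Python function.
   This is a sketch of the definitions above, not the program's own code;
   the function name is invented for the example.

```python
PURINES = {"a", "g"}
PYRIMIDINES = {"c", "t"}


def classify_snp(first, second):
    """Return 'transition', 'transversion', or None if not a SNP."""
    first, second = first.lower(), second.lower()
    bases = PURINES | PYRIMIDINES
    # Only single-base substitutions count; insertions, deletions and
    # multi-base changes are not SNPs under the definition above.
    if len(first) != 1 or len(second) != 1 or first == second:
        return None
    if first not in bases or second not in bases:
        return None
    same_class = (first in PURINES) == (second in PURINES)
    return "transition" if same_class else "transversion"
```

   In the sample output, the difference at position 847 (a against t) is
   a transversion and the one at position 1795 (g against a) is a
   transition.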
   
   It should be noted that not all features are reported.
   
   The 'source' feature found in all EMBL/Genbank feature table entries
   is not reported. It covers the whole sequence, so it overlaps every
   difference found in that sequence and is therefore uninformative.
   
   The translation information of CDS features is often extremely long
   and does not add useful information to the report. It has therefore
   been removed from the output report.
   
   If no regions of alignment are found, the following output is given:
  __________________________________________________________________________

########################################
# Program: diffseq
# Rundate: Mon Feb 11 13:21:20 2002
# Report_file: ap000504.diffseq
# Additional_files: 2
# 1: AP000504.diffgff (Feature file for first sequence)
# 2: fred.diffgff (Feature file for second sequence)
########################################

#=======================================
#
# Sequence: AP000504     from: 1   to: 100000
# HitCount: 0
#=======================================


#---------------------------------------
#
# No regions of alignment found.
#
#
#---------------------------------------
  __________________________________________________________________________

   If the -rformat table qualifier is given then the output is given in a
   columnar format.
   
   The columns are separated by one or more spaces or TAB characters in
   the order:
   
     * sequence 1 name
     * sequence 1 start
     * sequence 1 end
     * score (always 0.000)
     * sequence 2 start
     * sequence 2 end
     * sequence 2 difference length
     * sequence 2 name
     * sequence 2 difference sequence
     * first feature
     * second feature
       
   For example:
  __________________________________________________________________________

########################################
# Program: diffseq
# Rundate: Mon Feb 11 13:28:25 2002
# Report_file: ap000504.diffseq
# Additional_files: 2
# 1: AP000504.diffgff (Feature file for first sequence)
# 2: AF129756.diffgff (Feature file for second sequence)
########################################

#=======================================
#
# Sequence: AP000504     from: 1   to: 100000
# HitCount: 119
#
# Compare: AF129756     from: 1   to: 184666
#
# AP000504 overlap starts at 1
# AF129756 overlap starts at 6036
#
# (AP000504) start end length sequence
# (AF129756) start end length sequence
#
#
#
#=======================================

USA               Start     End   Score start  end    length name   sequence first_feature second_feature
AP000504            847     847   0.000 6882   6882   1      AF129756 t        .             .
AP000504           1795    1795   0.000 7830   7830   1      AF129756 a        .             .
AP000504           2273    2273   0.000 8307   8306   .      AF129756 .        .             repeat_region 7920-8351 rpt_family='MSTB'
AP000504           2466    2466   0.000 8500   8500   1      AF129756 a        .             repeat_region 8391-8686 rpt_family='AluSg'
AP000504           2655    2658   0.000 8688   8687   .      AF129756 .        .             repeat_region 8687-8731 rpt_family='(CA)n'
AP000504           4914    4913   0.000 10945  10962  18     AF129756 gtgtgtgtgtgtgtgtgt .             repeat_region 10910-10972 rpt_family='(CA)n'


etc.

AP000504          93860   93860   0.000 99890  99890  1      AF129756 g        .             .
AP000504          95451   95451   0.000 101481 101481 1      AF129756 t        .             .
AP000504          96650   96650   0.000 102680 102680 1      AF129756 t        .             .
AP000504          97273   97274   0.000 103302 103301 .      AF129756 .        .             repeat_region 103299-103402 rpt_family='AluSq'
AP000504          97716   97716   0.000 103744 103744 1      AF129756 g        .             .
AP000504          97827   97827   0.000 103855 103855 1      AF129756 t        .             repeat_region 103784-104083 rpt_family='AluSx'

#---------------------------------------
#
# Overlap_end: 100000 in AP000504
# Overlap_end: 106028 in AF129756
#
# SNP_count: 86
# Transitions: 58
# Transversions: 28
#
#
#---------------------------------------
  __________________________________________________________________________
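
   A row of the columnar output can be split into its fields as sketched
   below. This Python fragment is illustrative only: the function name
   and dictionary keys are invented, each record is assumed to occupy a
   single line, and '.' is assumed to mark an empty value. Because a
   feature description can itself contain spaces, the two feature columns
   are returned together as one unsplit string.

```python
def parse_column_row(row):
    """Split one columnar diffseq row into a dictionary of fields."""
    fields = row.split(None, 9)
    if len(fields) < 10:
        raise ValueError("expected at least 10 columns in %r" % row)
    (name1, start1, end1, score, start2, end2,
     length, name2, sequence, features) = fields

    def empty_to_none(value):
        # '.' is taken to be a placeholder for "no value".
        return None if value == "." else value

    length = empty_to_none(length)
    return {
        "name1": name1, "start1": int(start1), "end1": int(end1),
        "score": float(score),
        "start2": int(start2), "end2": int(end2),
        "length": int(length) if length is not None else None,
        "name2": name2,
        "sequence": empty_to_none(sequence),
        "features": features.strip(),
    }
```

   Lines starting with '#' and the column header line are not data rows
   and should be skipped before calling the function.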

   If no regions of alignment are found, the following output is given:
  __________________________________________________________________________

########################################
# Program: diffseq
# Rundate: Mon Feb 11 13:34:34 2002
# Report_file: ap000504.diffseq
# Additional_files: 2
# 1: AP000504.diffgff (Feature file for first sequence)
# 2: fred.diffgff (Feature file for second sequence)
########################################

#=======================================
#
# Sequence: AP000504     from: 1   to: 100000
# HitCount: 0
#=======================================

USA               Start     End   Score start  end    length name   sequence first_feature second_feature

#---------------------------------------
#
# No regions of alignment found.
#
#
#---------------------------------------
  __________________________________________________________________________

Data files

Notes

   If you run out of memory, use a larger word size.
   
   Using a larger word size increases the distance between mismatches
   that will be reported as one event. Thus a word size of 50 will report
   two SNPs that are within 50 bases of each other as one mismatch.
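
   This follows from the matching step: a match needs at least one full
   word of identical bases, so two differences separated by fewer
   identical bases than the word size cannot have a match between them.
   A minimal Python sketch of that arithmetic (the function name is
   invented for the example, and chance matches elsewhere in the
   sequences are ignored):

```python
def reported_as_one_event(pos1, pos2, wordsize=10):
    """True if two single-base differences are too close to be separated."""
    # Number of identical bases lying strictly between the two positions.
    identical_between = abs(pos2 - pos1) - 1
    # A word of 'wordsize' identical bases must fit between them for a
    # match to separate the two differences.
    return identical_between < wordsize
```

   For two SNPs at positions 100 and 130, reported_as_one_event(100, 130,
   wordsize=50) is true, while with the default word size of 10 the two
   are reported separately.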
   
References

   None.
   
Warnings

   None.
   
Diagnostic Error Messages

   None.
   
Exit status

   It always exits with status 0.
   
Known bugs

   None.
   
See also

   A graphical dotplot of the matches used in this program can be
   displayed using the program dotpath.
   
Author(s)

   This application was written by Gary Williams
   (gwilliam@hgmp.mrc.ac.uk)
   
History

   Written 15th Aug 2000 - Gary Williams.
   18th Aug 2000 - Added writing out GFF files of the mismatched regions
   
Target users

   This program is intended to be used by everyone and everything, from
   naive users to embedded scripts.
   
Comments
