EMBOSS: merger
merger uses a global alignment algorithm (Needleman & Wunsch) to optimally align the two sequences and then creates the merged sequence from the alignment. When there is a mismatch in the alignment between the two sequences, the base included in the merged sequence is taken from the sequence that has the better local sequence quality score. The following heuristic is used to find the sequence quality score:
If one of the bases is an 'N', then the other sequence's base is used. Otherwise:
A window around the disputed base is used to find the local quality score. The window size is increased from 5 to 10 to 20 bases, or until there is a clear decision on the best choice. If there is still no best choice with a window of 20, then the base in the first sequence is used.
To calculate the quality of a window of a sequence around a base:
N.B. This heavily discriminates against the iffy bits at the end of sequence reads.
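The heuristic described above can be sketched in Python. This is a minimal illustration, not merger's actual source: the point values (2 for an unambiguous base, 1 for an ambiguity code, 0 for an 'N' or a position off the end of the sequence), the exclusion of the disputed base itself from the window, and all function names are assumptions made for the example.

```python
UNAMBIGUOUS = set("ACGTU")
AMBIGUOUS = set("MRWSYKVHDB")


def base_points(base):
    """Assumed scoring: 2 for an unambiguous base, 1 for an ambiguity code, else 0."""
    b = base.upper()
    if b in UNAMBIGUOUS:
        return 2
    if b in AMBIGUOUS:
        return 1
    return 0  # 'N' and anything unrecognised


def window_quality(seq, pos, window):
    """Mean points over `window` positions either side of `pos`.

    Positions off the end of the sequence contribute 0 points but still
    count towards the length, which penalises bases near the read ends.
    """
    total = 0
    for i in range(pos - window, pos + window + 1):
        if i == pos:
            continue  # assumption: the disputed base itself is not scored
        if 0 <= i < len(seq):
            total += base_points(seq[i])
    return total / (2 * window)


def choose_base(seq_a, pos_a, seq_b, pos_b):
    """Pick the base to keep at a mismatch between seq_a[pos_a] and seq_b[pos_b]."""
    a, b = seq_a[pos_a], seq_b[pos_b]
    if a.upper() == "N":
        return b
    if b.upper() == "N":
        return a
    for window in (5, 10, 20):  # widen the window until one side clearly wins
        qa = window_quality(seq_a, pos_a, window)
        qb = window_quality(seq_b, pos_b, window)
        if qa > qb:
            return a
        if qb > qa:
            return b
    return a  # no clear decision: fall back to the first sequence
```

Under these assumptions, a disputed base at the very start of a read has half of its window off the end of the sequence, so it loses to a base that sits in the middle of the other read.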
This program was originally written to aid in the reconstruction of mRNA sequences that had been sequenced from both ends as 5' and 3' ESTs (cDNA), e.g. when joining two reads produced by primer-walking sequencing.
Care should be taken to reverse-complement one of the sequences (e.g. using the qualifier '-sreverse2') if this is required to get them both in the correct orientation.
Because it uses a Needleman & Wunsch alignment, the memory required may exceed the available memory when attempting to merge large (cosmid-sized or greater) sequences.
The gap open and gap extension penalties have been set higher than usual (50 and 5, respectively). These values were determined experimentally to give the best results with a set of poor-quality EST test sequences.
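To make the orientation requirement concrete, the following Python sketch shows what putting a sequence into the opposite orientation means. The function name is hypothetical, and only the bases a, c, g, t and n are handled here; IUPAC ambiguity codes are deliberately left out of this illustration.

```python
# Translation table for the plain DNA alphabet (upper and lower case).
_COMPLEMENT = str.maketrans("acgtnACGTN", "tgcanTGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (a, c, g, t, n only)."""
    return seq.translate(_COMPLEMENT)[::-1]
```

For instance, `reverse_complement("gtgaatgaa")` gives `"ttcattcac"`, which is the same stretch of DNA read from the opposite strand.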
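A rough feel for the memory problem can be had from a back-of-envelope estimate. The sketch below assumes a full dynamic-programming matrix with one 4-byte score per cell; merger's real memory layout is not described here, so both the function name and the bytes-per-cell figure are assumptions.

```python
def full_matrix_bytes(len_a, len_b, bytes_per_cell=4):
    """Estimate memory for a full (len_a + 1) x (len_b + 1) alignment matrix."""
    return (len_a + 1) * (len_b + 1) * bytes_per_cell


# Two hypothetical 40 kb (roughly cosmid-sized) sequences:
estimate = full_matrix_bytes(40_000, 40_000)
print(f"{estimate / 1e9:.1f} GB")  # about 6.4 GB under these assumptions
```

The point of the estimate is that memory grows with the product of the two sequence lengths, so doubling both lengths quadruples the requirement.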
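The effect of the high penalties can be illustrated with a small affine gap cost calculation. The convention used below (the first gap position costs the gap open penalty and each further position costs the gap extension penalty) is an assumption for illustration, as is the function name; the lower penalty pair in the comparison is a hypothetical value, not one taken from this document.

```python
def gap_cost(length, gap_open=50.0, gap_extend=5.0):
    """Affine gap cost, assuming open + extend * (length - 1)."""
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


# merger's defaults versus a hypothetical lower penalty pair:
for n in (1, 3, 10):
    print(n, gap_cost(n), gap_cost(n, gap_open=10.0, gap_extend=0.5))
```

With penalties this high, an alignment has to gain many matching bases before opening a gap pays off, which suits noisy reads where spurious gaps would otherwise be introduced.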
% merger
Input sequence: embl:eclacy
Second sequence: embl:eclaca
Output sequence [eclacy.fasta]:
Output file [stdout]:
Global: ECLACY vs ECLACA
Score: 795.00
ECLACY 1 ttccagctgagcgccggtcgctaccattaccagttggtctggtgt 45
ECLACA
.................... until ......................
ECLACY 1306 agcggccccggcccgctttccctgctgcgtcgtcaggtgaatgaa 1350
|||||||||
ECLACA 1 gtgaatgaa 9
ECLACY 1351 gtcgcttaagcaatcaatgtcggatgcggcgcgacgcttatccga 1395
|||||||||||||||||||||||||||||||||||||||||||||
ECLACA 10 gtcgcttaagcaatcaatgtcggatgcggcgcgacgcttatccga 54
ECLACY 1396 ccaacatatcataacggagtgatcgcattgaacatgccaatgacc 1440
|||||||||||||||||||||||||||||||||||||||||||||
ECLACA 55 ccaacatatcataacggagtgatcgcattgaacatgccaatgacc 99
ECLACY 1441 gaaagaataagagcaggcaagctatttaccgatatgtgcgaaggc 1485
|||||||||||||||||||||||||||||||||||||||||||||
ECLACA 100 gaaagaataagagcaggcaagctatttaccgatatgtgcgaaggc 144
ECLACY 1486 ttaccggaaaaaaga 1500
|||||||||||||||
ECLACA 145 ttaccggaaaaaagacttcgtgggaaaacgttaatgtatgagttt 189
ECLACY
ECLACA 190 aatcactcgcatccatcagaagttgaaaaaagagaaagcctgatt 234
ECLACY
Typically, one of the sequences will need to be reverse-complemented to put it into the correct orientation to make it join. For example:
% merger file1.seq file2.seq -sreverse2 -outseq merged.seq
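When merger is driven from a script, the command line can be assembled programmatically. The sketch below only builds the argument list, using the qualifiers documented on this page (-sreverse2, -outseq, -outfile, -gapopen, -gapextend); the helper name and its parameters are hypothetical, and actually running the command assumes that EMBOSS is installed and on the PATH.

```python
def merger_command(seq_a, seq_b, outseq, outfile=None,
                   reverse_second=False, gapopen=None, gapextend=None):
    """Build an argument list for the merger program (hypothetical helper)."""
    cmd = ["merger", seq_a, seq_b]
    if reverse_second:
        cmd.append("-sreverse2")  # reverse-complement the second sequence
    cmd += ["-outseq", outseq]
    if outfile is not None:
        cmd += ["-outfile", outfile]
    if gapopen is not None:
        cmd += ["-gapopen", str(gapopen)]
    if gapextend is not None:
        cmd += ["-gapextend", str(gapextend)]
    return cmd


cmd = merger_command("file1.seq", "file2.seq", "merged.seq", reverse_second=True)
print(" ".join(cmd))
# To execute it (requires EMBOSS):
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.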
Mandatory qualifiers:
[-seqa] sequence Sequence USA
[-seqb] sequence Sequence USA
[-outseq] seqout Output sequence USA
-outfile align Output alignment and explanation
Optional qualifiers:
-datafile matrixf Matrix file
-gapopen float Gap opening penalty
-gapextend float Gap extension penalty
Advanced qualifiers: (none)
General qualifiers:
-help bool report command line options. More
information on associated and general
qualifiers can be found with -help -verbose
| Qualifier | Description | Allowed values | Default |
|---|---|---|---|
| Mandatory qualifiers | | | |
| [-seqa] (Parameter 1) | Sequence USA | Readable sequence | Required |
| [-seqb] (Parameter 2) | Sequence USA | Readable sequence | Required |
| [-outseq] (Parameter 3) | Output sequence USA | Writeable sequence | <sequence>.format |
| -outfile | Output alignment and explanation | Alignment file | stdout |
| Optional qualifiers | | | |
| -datafile | Matrix file | Comparison matrix file in EMBOSS data path | EBLOSUM62 for protein, EDNAFULL for DNA |
| -gapopen | Gap opening penalty | Number from 1.000 to 100.000 | 50.0 |
| -gapextend | Gap extension penalty | Number from 0.100 to 10.000 | 5 |
| Advanced qualifiers | | | |
| (none) | | | |
The output report file contains descriptions of the positions where there is a mismatch in the alignment and shows the alignment. Where there is a mismatch in the alignment, the chosen base is written in uppercase.
An example report file showing mismatches follows:
# j1 position base j2 position base Using
12 'G' 1 'a' 'G'
13 'C' 2 'a' 'C'
14 'G' 3 'a' 'G'
16 'T' 5 'a' 'T'
20 'T' 9 'g' 'T'
23 'G' 12 'c' 'G'
24 'C' 13 'g' 'C'
41 'G' 30 't' 'G'
57 't' 46 'C' 'C'
Global: j1 vs j2
Score: 188.00
j1 1 gtatggtcgatGCGaTgcgTatGCtGacgttAgcggcggcGatat 45
| ||| || | ||||| |||||||| ||||
j2 1 aaaaagcggatcgtnacgttngcggcggctatat 34
j1 46 attgcgagctatgatgctnatcgtngc 72
||||||||||| |||||| ||||| ||
j2 35 attgcgagctaCgatgctGatcgtAgcgtacgttgaaagctacta 79
j1
j2 80 ctatgctgtgctagctgacgtagc 103
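The mismatch lines at the top of such a report are regular enough to be read by a script. The following Python sketch parses them under the assumption that each line has exactly the layout shown in the example above (position, quoted base, position, quoted base, quoted chosen base); the names `Mismatch` and `parse_mismatches` are hypothetical.

```python
import re
from typing import NamedTuple


class Mismatch(NamedTuple):
    pos_a: int    # position in the first sequence
    base_a: str   # base in the first sequence
    pos_b: int    # position in the second sequence
    base_b: str   # base in the second sequence
    used: str     # base written to the merged sequence


_LINE = re.compile(r"^\s*(\d+)\s+'(.)'\s+(\d+)\s+'(.)'\s+'(.)'\s*$")


def parse_mismatches(report_text):
    """Return the mismatch records found in a merger report.

    Lines that do not match the assumed layout (the '#' header, the
    alignment itself) are skipped.
    """
    records = []
    for line in report_text.splitlines():
        m = _LINE.match(line)
        if m:
            pa, ba, pb, bb, used = m.groups()
            records.append(Mismatch(int(pa), ba, int(pb), bb, used))
    return records


def from_first_sequence(rec):
    """True if the chosen base came from the first sequence."""
    return rec.used.upper() == rec.base_a.upper()
```

Applied to the example report, a count of `from_first_sequence` results gives a quick indication of which read dominated the disputed positions.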
| Program name | Description |
|---|---|
| cons | Creates a consensus from multiple alignments |
| megamerger | Merge two large overlapping nucleic acid sequences |