needleall |
Wiki
The master copies of EMBOSS documentation are available at http://emboss.open-bio.org/wiki/Appdocs on the EMBOSS Wiki.Please help by correcting and extending the Wiki pages.
Function
Many-to-many pairwise alignments of two sequence setsDescription
needleall reads a set of input sequences and compares them all to one or more sequences, writing their optimal global sequence alignments to file. It uses the Needleman-Wunsch alignment algorithm to find the optimum alignment (including gaps) of two sequences along their entire length. The algorithm uses a dynamic programming method to ensure the alignment is optimum, by exploring all possible alignments and choosing the best. A scoring matrix is read that contains values for every possible residue or nucleotide match. Needleall finds the alignment with the maximum possible score where the score of an alignment is equal to the sum of the matches taken from the scoring matrix, minus penalties arising from opening and extending gaps in the aligned sequences. The substitution matrix and gap opening and extension penalties are user-specified.
Algorithm
The Needleman-Wunsch algorithm is a member of the class of algorithms that can calculate the best score and alignment of two sequences in the order of mn steps, where n and m are the sequence lengths. These dynamic programming algorithms were first developed for protein sequence comparison by Needleman and Wunsch, though similar methods were independently devised during the late 1960's and early 1970's for use in the fields of speech processing and computer science.
An important problem is the treatment of gaps, i.e., spaces inserted to optimise the alignment score. A penalty is subtracted from the score for each gap opened (the 'gap open' penalty) and a penalty is subtracted from the score for the total number of gap spaces multiplied by a cost (the 'gap extension' penalty). Typically, the cost of extending a gap is set to be 5-10 times lower than the cost for opening a gap.
Penalty for a gap of n positions is calculated using the following formula:
gap opening penalty + (n - 1) * gap extension penalty
In a Needleman-Wunsch global alignment, the entire length of each sequence is aligned. The sequences might be partially overlapping or one sequence might be aligned entirely internally to the other. There is no penalty for the hanging ends of the overlap. In bioinformatics, it is usually reasonable to assume that the sequences are incomplete and there should be no penalty for failing to align the missing bases.
Usage
Here is a sample session with needleall
% needleall -minscore 40 -stdout -auto ../data/test1_illumina.fastq Illumina_DpnII_Gex_PCR_Primer_2 FC12044_91407_8_200_406_24 45 (41.0) Illumina_NlaIII_Gex_PCR_Primer_2 FC12044_91407_8_200_406_24 45 (41.0) Illumina_Small_RNA_PCR_Primer_2 FC12044_91407_8_200_406_24 45 (41.0) Illumina_DpnII_Gex_Adapters1_1 FC12044_91407_8_200_106_131 35 (40.5) Illumina_Paired_End_DNA_Adapters1_1 FC12044_91407_8_200_57_85 35 (41.0) Illumina_DpnII_Gex_Adapters1_1 FC12044_91407_8_200_154_436 31 (42.0) Illumina_Genomic_DNA_PCR_Primers1_1 FC12044_91407_8_200_83_511 64 (42.0) Illumina_Paired_End_DNA_PCR_Primers1_1 FC12044_91407_8_200_83_511 64 (42.0) Illumina_DpnII_Gex_Adapters1_2 FC12044_91407_8_200_303_427 33 (40.5) Illumina_DpnII_Gex_PCR_Primer_2 FC12044_91407_8_200_303_427 51 (40.5) Illumina_DpnII_Gex_sequencing_primer FC12044_91407_8_200_303_427 38 (44.5) Illumina_NlaIII_Gex_Adapters1_2 FC12044_91407_8_200_303_427 36 (40.5) Illumina_NlaIII_Gex_PCR_Primer_2 FC12044_91407_8_200_303_427 51 (40.5) Illumina_NlaIII_Gex_sequencing_primer FC12044_91407_8_200_303_427 39 (40.5) Illumina_Small_RNA_5p_Adapter FC12044_91407_8_200_303_427 33 (40.5) Illumina_Small_RNA_PCR_Primer_2 FC12044_91407_8_200_303_427 51 (40.5) Illumina_Small_RNA_sequencing_primer FC12044_91407_8_200_303_427 38 (44.5) Illumina_Paired_End_DNA_Adapters1_1 FC12044_91407_8_200_553_135 33 (44.5) Illumina_DpnII_Gex_PCR_Primer_2 FC12044_91407_8_200_139_74 51 (46.0) Illumina_DpnII_Gex_sequencing_primer FC12044_91407_8_200_139_74 38 (42.0) Illumina_NlaIII_Gex_PCR_Primer_2 FC12044_91407_8_200_139_74 51 (46.0) Illumina_Small_RNA_PCR_Primer_2 FC12044_91407_8_200_139_74 51 (46.0) Illumina_Small_RNA_sequencing_primer FC12044_91407_8_200_139_74 38 (42.0) #--------------------------------------- #--------------------------------------- |
Go to the input files for this example
Go to the output files for this example
Command line arguments
Many-to-many pairwise alignments of two sequence sets Version: EMBOSS:6.4.0.0 Standard (Mandatory) qualifiers: [-asequence] seqset Sequence set filename and optional format, or reference (input USA) [-bsequence] seqall Sequence(s) filename and optional format, or reference (input USA) -gapopen float [10.0 for any sequence] The gap open penalty is the score taken away when a gap is created. The best value depends on the choice of comparison matrix. The default value assumes you are using the EBLOSUM62 matrix for protein sequences, and the EDNAFULL matrix for nucleotide sequences. (Floating point number from 1.0 to 100.0) -gapextend float [0.5 for any sequence] The gap extension, penalty is added to the standard gap penalty for each base or residue in the gap. This is how long gaps are penalized. Usually you will expect a few long gaps rather than many short gaps, so the gap extension penalty should be lower than the gap penalty. An exception is where one or both sequences are single reads with possible sequencing errors in which case you would expect many single base gaps. You can get this result by setting the gap open penalty to zero (or very low) and using the gap extension penalty to control gap scoring. (Floating point number from 0.0 to 10.0) [-outfile] align [*.needleall] Output alignment file name (default -aformat score) Additional (Optional) qualifiers: -datafile matrixf [EBLOSUM62 for protein, EDNAFULL for DNA] This is the scoring matrix file used when comparing sequences. By default it is the file 'EBLOSUM62' (for proteins) or the file 'EDNAFULL' (for nucleic sequences). These files are found in the 'data' directory of the EMBOSS installation. -endweight boolean [N] Apply end gap penalties. -endopen float [10.0 for any sequence] The end gap open penalty is the score taken away when an end gap is created. The best value depends on the choice of comparison matrix. The default value assumes you are using the EBLOSUM62 matrix for protein sequences, and the EDNAFULL matrix for nucleotide sequences. (Floating point number from 1.0 to 100.0) -endextend float [0.5 for any sequence] The end gap extension, penalty is added to the end gap penalty for each base or residue in the end gap. (Floating point number from 0.0 to 10.0) -minscore float [1.0 for any sequence] Minimum alignment score to report an alignment. (Floating point number from -10.0 to 100.0) -errorfile outfile [needleall.error] Error file to be written to Advanced (Unprompted) qualifiers: -[no]brief boolean [Y] Brief identity and similarity Associated qualifiers: "-asequence" associated qualifiers -sbegin1 integer Start of each sequence to be used -send1 integer End of each sequence to be used -sreverse1 boolean Reverse (if DNA) -sask1 boolean Ask for begin/end/reverse -snucleotide1 boolean Sequence is nucleotide -sprotein1 boolean Sequence is protein -slower1 boolean Make lower case -supper1 boolean Make upper case -sformat1 string Input sequence format -sdbname1 string Database name -sid1 string Entryname -ufo1 string UFO features -fformat1 string Features format -fopenfile1 string Features file name "-bsequence" associated qualifiers -sbegin2 integer Start of each sequence to be used -send2 integer End of each sequence to be used -sreverse2 boolean Reverse (if DNA) -sask2 boolean Ask for begin/end/reverse -snucleotide2 boolean Sequence is nucleotide -sprotein2 boolean Sequence is protein -slower2 boolean Make lower case -supper2 boolean Make upper case -sformat2 string Input sequence format -sdbname2 string Database name -sid2 string Entryname -ufo2 string UFO features -fformat2 string Features format -fopenfile2 string Features file name "-outfile" associated qualifiers -aformat3 string Alignment format -aextension3 string File name extension -adirectory3 string Output directory -aname3 string Base file name -awidth3 integer Alignment width -aaccshow3 boolean Show accession number in the header -adesshow3 boolean Show description in the header -ausashow3 boolean Show the full USA in the alignment -aglobal3 boolean Show the full sequence in alignment "-errorfile" associated qualifiers -odirectory string Output directory General qualifiers: -auto boolean Turn off prompts -stdout boolean Write first file to standard output -filter boolean Read first file from standard input, write first file to standard output -options boolean Prompt for standard and additional values -debug boolean Write debug output to program.dbg -verbose boolean Report some/full command line options -help boolean Report command line options and exit. More information on associated and general qualifiers can be found with -help -verbose -warning boolean Report warnings -error boolean Report errors -fatal boolean Report fatal errors -die boolean Report dying program messages -version boolean Report version number and exit |
Qualifier | Type | Description | Allowed values | Default |
---|---|---|---|---|
Standard (Mandatory) qualifiers | ||||
[-asequence] (Parameter 1) |
seqset | Sequence set filename and optional format, or reference (input USA) | Readable set of sequences | Required |
[-bsequence] (Parameter 2) |
seqall | Sequence(s) filename and optional format, or reference (input USA) | Readable sequence(s) | Required |
-gapopen | float | The gap open penalty is the score taken away when a gap is created. The best value depends on the choice of comparison matrix. The default value assumes you are using the EBLOSUM62 matrix for protein sequences, and the EDNAFULL matrix for nucleotide sequences. | Floating point number from 1.0 to 100.0 | 10.0 for any sequence |
-gapextend | float | The gap extension, penalty is added to the standard gap penalty for each base or residue in the gap. This is how long gaps are penalized. Usually you will expect a few long gaps rather than many short gaps, so the gap extension penalty should be lower than the gap penalty. An exception is where one or both sequences are single reads with possible sequencing errors in which case you would expect many single base gaps. You can get this result by setting the gap open penalty to zero (or very low) and using the gap extension penalty to control gap scoring. | Floating point number from 0.0 to 10.0 | 0.5 for any sequence |
[-outfile] (Parameter 3) |
align | Output alignment file name | (default -aformat score) | <*>.needleall |
Additional (Optional) qualifiers | ||||
-datafile | matrixf | This is the scoring matrix file used when comparing sequences. By default it is the file 'EBLOSUM62' (for proteins) or the file 'EDNAFULL' (for nucleic sequences). These files are found in the 'data' directory of the EMBOSS installation. | Comparison matrix file in EMBOSS data path | EBLOSUM62 for protein EDNAFULL for DNA |
-endweight | boolean | Apply end gap penalties. | Boolean value Yes/No | No |
-endopen | float | The end gap open penalty is the score taken away when an end gap is created. The best value depends on the choice of comparison matrix. The default value assumes you are using the EBLOSUM62 matrix for protein sequences, and the EDNAFULL matrix for nucleotide sequences. | Floating point number from 1.0 to 100.0 | 10.0 for any sequence |
-endextend | float | The end gap extension, penalty is added to the end gap penalty for each base or residue in the end gap. | Floating point number from 0.0 to 10.0 | 0.5 for any sequence |
-minscore | float | Minimum alignment score to report an alignment. | Floating point number from -10.0 to 100.0 | 1.0 for any sequence |
-errorfile | outfile | Error file to be written to | Output file | needleall.error |
Advanced (Unprompted) qualifiers | ||||
-[no]brief | boolean | Brief identity and similarity | Boolean value Yes/No | Yes |
Associated qualifiers | ||||
"-asequence" associated seqset qualifiers | ||||
-sbegin1 -sbegin_asequence |
integer | Start of each sequence to be used | Any integer value | 0 |
-send1 -send_asequence |
integer | End of each sequence to be used | Any integer value | 0 |
-sreverse1 -sreverse_asequence |
boolean | Reverse (if DNA) | Boolean value Yes/No | N |
-sask1 -sask_asequence |
boolean | Ask for begin/end/reverse | Boolean value Yes/No | N |
-snucleotide1 -snucleotide_asequence |
boolean | Sequence is nucleotide | Boolean value Yes/No | N |
-sprotein1 -sprotein_asequence |
boolean | Sequence is protein | Boolean value Yes/No | N |
-slower1 -slower_asequence |
boolean | Make lower case | Boolean value Yes/No | N |
-supper1 -supper_asequence |
boolean | Make upper case | Boolean value Yes/No | N |
-sformat1 -sformat_asequence |
string | Input sequence format | Any string | |
-sdbname1 -sdbname_asequence |
string | Database name | Any string | |
-sid1 -sid_asequence |
string | Entryname | Any string | |
-ufo1 -ufo_asequence |
string | UFO features | Any string | |
-fformat1 -fformat_asequence |
string | Features format | Any string | |
-fopenfile1 -fopenfile_asequence |
string | Features file name | Any string | |
"-bsequence" associated seqall qualifiers | ||||
-sbegin2 -sbegin_bsequence |
integer | Start of each sequence to be used | Any integer value | 0 |
-send2 -send_bsequence |
integer | End of each sequence to be used | Any integer value | 0 |
-sreverse2 -sreverse_bsequence |
boolean | Reverse (if DNA) | Boolean value Yes/No | N |
-sask2 -sask_bsequence |
boolean | Ask for begin/end/reverse | Boolean value Yes/No | N |
-snucleotide2 -snucleotide_bsequence |
boolean | Sequence is nucleotide | Boolean value Yes/No | N |
-sprotein2 -sprotein_bsequence |
boolean | Sequence is protein | Boolean value Yes/No | N |
-slower2 -slower_bsequence |
boolean | Make lower case | Boolean value Yes/No | N |
-supper2 -supper_bsequence |
boolean | Make upper case | Boolean value Yes/No | N |
-sformat2 -sformat_bsequence |
string | Input sequence format | Any string | |
-sdbname2 -sdbname_bsequence |
string | Database name | Any string | |
-sid2 -sid_bsequence |
string | Entryname | Any string | |
-ufo2 -ufo_bsequence |
string | UFO features | Any string | |
-fformat2 -fformat_bsequence |
string | Features format | Any string | |
-fopenfile2 -fopenfile_bsequence |
string | Features file name | Any string | |
"-outfile" associated align qualifiers | ||||
-aformat3 -aformat_outfile |
string | Alignment format | Any string | score |
-aextension3 -aextension_outfile |
string | File name extension | Any string | |
-adirectory3 -adirectory_outfile |
string | Output directory | Any string | |
-aname3 -aname_outfile |
string | Base file name | Any string | |
-awidth3 -awidth_outfile |
integer | Alignment width | Any integer value | 0 |
-aaccshow3 -aaccshow_outfile |
boolean | Show accession number in the header | Boolean value Yes/No | N |
-adesshow3 -adesshow_outfile |
boolean | Show description in the header | Boolean value Yes/No | N |
-ausashow3 -ausashow_outfile |
boolean | Show the full USA in the alignment | Boolean value Yes/No | N |
-aglobal3 -aglobal_outfile |
boolean | Show the full sequence in alignment | Boolean value Yes/No | Y |
"-errorfile" associated outfile qualifiers | ||||
-odirectory | string | Output directory | Any string | |
General qualifiers | ||||
-auto | boolean | Turn off prompts | Boolean value Yes/No | N |
-stdout | boolean | Write first file to standard output | Boolean value Yes/No | N |
-filter | boolean | Read first file from standard input, write first file to standard output | Boolean value Yes/No | N |
-options | boolean | Prompt for standard and additional values | Boolean value Yes/No | N |
-debug | boolean | Write debug output to program.dbg | Boolean value Yes/No | N |
-verbose | boolean | Report some/full command line options | Boolean value Yes/No | Y |
-help | boolean | Report command line options and exit. More information on associated and general qualifiers can be found with -help -verbose | Boolean value Yes/No | N |
-warning | boolean | Report warnings | Boolean value Yes/No | Y |
-error | boolean | Report errors | Boolean value Yes/No | Y |
-fatal | boolean | Report fatal errors | Boolean value Yes/No | Y |
-die | boolean | Report dying program messages | Boolean value Yes/No | Y |
-version | boolean | Report version number and exit | Boolean value Yes/No | N |
Input file format
needleall reads in two nucleotide or protein sequences inputs. Both can be one or more sequences. All sequences in the first ionput are aligned to all sequences in the second input.
The input is a standard EMBOSS sequence query (also known as a 'USA').
Major sequence database sources defined as standard in EMBOSS installations include srs:embl, srs:uniprot and ensembl
Data can also be read from sequence output in any supported format written by an EMBOSS or third-party application.
The input format can be specified by using the command-line qualifier -sformat xxx, where 'xxx' is replaced by the name of the required format. The available format names are: gff (gff3), gff2, embl (em), genbank (gb, refseq), ddbj, refseqp, pir (nbrf), swissprot (swiss, sw), dasgff and debug.
See: http://emboss.sf.net/docs/themes/SequenceFormats.html for further information on sequence formats.
Input files for usage example
File: illumina_adapter_primer.fa
>Illumina_Genomici_DNA_Adapters1_1 GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG >Illumina_Genomic_DNA_Adapters1_2 ACACTCTTTCCCTACACGACGCTCTTCCGATCT >Illumina_Genomic_DNA_PCR_Primers1_1 AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT >Illumina_Genomic_DNA_PCR_Primers1_2 CAAGCAGAAGACGGCATACGAGCTCTTCCGATCT >Illumina_Genomic_DNA_sequencing_primer ACACTCTTTCCCTACACGACGCTCTTCCGATCT >Illumina_Paired_End_DNA_Adapters1_1 GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG >Illumina_Paired_End_DNA_Adapters1_2 ACACTCTTTCCCTACACGACGCTCTTCCGATCT >Illumina_Paired_End_DNA_PCR_Primers1_1 AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT >Illumina_Paired_End_DNA_PCR_Primers1_2 CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT >Illumina_Paired_End_DNA_sequencing_primer_1 ACACTCTTTCCCTACACGACGCTCTTCCGATCT >Illumina_Paired_End_DNA_sequencing_primer_2 CGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT >Illumina_DpnII_Gex_Adapters1_1 GATCGTCGGACTGTAGAACTCTGAAC >Illumina_DpnII_Gex_Adapters1_2 ACAGGTTCAGAGTTCTACAGTCCGAC >Illumina_DpnII_Gex_Adapters2_1 CAAGCAGAAGACGGCATACGA >Illumina_DpnII_Gex_Adapters2_2 TCGTATGCCGTCTTCTGCTTG >Illumina_DpnII_Gex_PCR_Primer_1 CAAGCAGAAGACGGCATACGA >Illumina_DpnII_Gex_PCR_Primer_2 AATGATACGGCGACCACCGACAGGTTCAGAGTTCTACAGTCCGA >Illumina_DpnII_Gex_sequencing_primer CGACAGGTTCAGAGTTCTACAGTCCGACGATC >Illumina_NlaIII_Gex_Adapters1_1 TCGGACTGTAGAACTCTGAAC >Illumina_NlaIII_Gex_Adapters1_2 ACAGGTTCAGAGTTCTACAGTCCGACATG >Illumina_NlaIII_Gex_Adapters2_1 CAAGCAGAAGACGGCATACGANN >Illumina_NlaIII_Gex_Adapters2_2 TCGTATGCCGTCTTCTGCTTG >Illumina_NlaIII_Gex_PCR_Primer_1 CAAGCAGAAGACGGCATACGA >Illumina_NlaIII_Gex_PCR_Primer_2 AATGATACGGCGACCACCGACAGGTTCAGAGTTCTACAGTCCGA >Illumina_NlaIII_Gex_sequencing_primer CCGACAGGTTCAGAGTTCTACAGTCCGACATG >Illumina_Small_RNA_RT_Primer CAAGCAGAAGACGGCATACGA >Illumina_Small_RNA_5p_Adapter GTTCAGAGTTCTACAGTCCGACGATC >Illumina_Small_RNA_3p_Adapter TCGTATGCCGTCTTCTGCTTGT >Illumina_Small_RNA_PCR_Primer_1 CAAGCAGAAGACGGCATACGA >Illumina_Small_RNA_PCR_Primer_2 AATGATACGGCGACCACCGACAGGTTCAGAGTTCTACAGTCCGA >Illumina_Small_RNA_sequencing_primer CGACAGGTTCAGAGTTCTACAGTCCGACGATC |
File: test1_illumina.fastq
@FC12044_91407_8_200_406_24 GTTAGCTCCCACCTTAAGATGTTTA +FC12044_91407_8_200_406_24 SXXTXXXXXXXXXTTSUXSSXKTMQ @FC12044_91407_8_200_720_610 CTCTGTGGCACCCCATCCCTCACTT +FC12044_91407_8_200_720_610 OXXXXXXXXXXXXXXXXXTSXQTXU @FC12044_91407_8_200_345_133 GATTTTTTAACAATAAACGTACATA +FC12044_91407_8_200_345_133 OQTOOSFORTFFFIIOFFFFFFFFF @FC12044_91407_8_200_106_131 GTTGCCCAGGCTCGTCTTGAACTCC +FC12044_91407_8_200_106_131 XXXXXXXXXXXXXXSXXXXISTXQS @FC12044_91407_8_200_916_471 TGATTGAAGGTAGGGTAGCATACTG +FC12044_91407_8_200_916_471 XXXXXXXXXXXXXXXUXXUSXXTXW @FC12044_91407_8_200_57_85 GCTCCAATAGCGCAGAGGAAACCTG +FC12044_91407_8_200_57_85 XFXMXSXXSXXXOSQROOSROFQIQ @FC12044_91407_8_200_10_437 GCTGCTTGGGAGGCTGAGGCAGGAG +FC12044_91407_8_200_10_437 USXSXXXXXXUXXXSXQXXUQXXKS @FC12044_91407_8_200_154_436 AGACCTTTGGATACAATGAACGACT +FC12044_91407_8_200_154_436 MKKMQTSRXMSQTOMRFOOIFFFFF @FC12044_91407_8_200_336_64 AGGGAATTTTAGAGGAGGGCTGCCG +FC12044_91407_8_200_336_64 STQMOSXSXSQXQXXKXXXKFXFFK @FC12044_91407_8_200_620_233 TCTCCATGTTGGTCAGGCTGGTCTC +FC12044_91407_8_200_620_233 XXXXXXXXXXXXXXXXXXXXXSXSW @FC12044_91407_8_200_902_349 TGAACGTCGAGACGCAAGGCCCGCC +FC12044_91407_8_200_902_349 XMXSSXMXXSXQSXTSQXFKSKTOF @FC12044_91407_8_200_40_618 CTGTCCCCACGGCGGGGGGGCCTGG +FC12044_91407_8_200_40_618 TXXXXSXXXXXXXXXXXXXRKFOXS @FC12044_91407_8_200_83_511 GATGTACTCTTACACCCAGACTTTG +FC12044_91407_8_200_83_511 SOXXXXXUXXXXXXQKQKKROOQSU @FC12044_91407_8_200_76_246 TCAAGGGTGGATCTTGGCTCCCAGT +FC12044_91407_8_200_76_246 XTXTUXXXXXRXXXTXXSUXSRFXQ @FC12044_91407_8_200_303_427 TTGCGACAGAGTTTTGCTCTTGTCC +FC12044_91407_8_200_303_427 XXQROXXXXIXFQXXXOIQSSXUFF @FC12044_91407_8_200_31_299 TCTGCTCCAGCTCCAAGACGCCGCC +FC12044_91407_8_200_31_299 XRXTSXXXRXXSXQQOXQTSQSXKQ @FC12044_91407_8_200_553_135 TACGGAGCCGCGGGCGGGAAAGGCG +FC12044_91407_8_200_553_135 XSQQXXXXXXXXXXSXXMFFQXTKU @FC12044_91407_8_200_139_74 CCTCCCAGGTTCAAGCGATTATCCT +FC12044_91407_8_200_139_74 RMXUSXTXXQXXQUXXXSQISISSO @FC12044_91407_8_200_108_33 GTCATGGCGGCCCGCGCGGGGAGCG +FC12044_91407_8_200_108_33 OOOSSXXSXXOMKMOFMKFOKFFFF @FC12044_91407_8_200_980_965 ACAGTGGGTTCTTAAAGAAGAGTCG +FC12044_91407_8_200_980_965 TOSSRXXXSSMSXMOMXIRXOXFFS @FC12044_91407_8_200_981_857 AACGAGGGGCGCGACTTGACCTTGG +FC12044_91407_8_200_981_857 RXMSSXXXXSXQXQXFSXQFQKMXS @FC12044_91407_8_200_8_865 TTTCCCACCCCAGGAAGCCTTGGAC +FC12044_91407_8_200_8_865 XXXFKOROMKOORMIMRIIKKORFF @FC12044_91407_8_200_292_484 TCAGCCTCCGTGCCCAGCCCACTCC +FC12044_91407_8_200_292_484 XQXOSXXXXXUXXXXIXXXXQTOXF @FC12044_91407_8_200_675_16 CTCGGGAGGCTGAGGCAGGGGGGTT +FC12044_91407_8_200_675_16 OXTXXXSXXQXXOXXKMXXMXOKQF @FC12044_91407_8_200_285_136 CCAAATCTTGAATTGTAGCTCCCCT +FC12044_91407_8_200_285_136 OSXOQXXXXXSXXUXXTXXXXTRMS |
Output file format
The output is a standard EMBOSS alignment file.
The results can be output in one of several styles by using the command-line qualifier -aformat xxx, where 'xxx' is replaced by the name of the required format. Some of the alignment formats can cope with an unlimited number of sequences, while others are only for pairs of sequences.
The available multiple alignment format names are: multiple, simple, fasta, msf, clustal, mega, meganon, nexus,, nexusnon, phylip, phylipnon, selex, treecon, tcoffee, debug, srs.
The available pairwise alignment format names are: pair, markx0, markx1, markx2, markx3, markx10, match, sam, bam, score, srspair
See: http://emboss.sf.net/docs/themes/AlignFormats.html for further information on alignment formats.
Output files for usage example
File: needleall.error
Alignment score (21.5) is less than minimum score(40.0) for sequences Illumina_Genomici_DNA_Adapters1_1 vs FC12044_91407_8_200_406_24 Alignment score (24.5) is less than minimum score(40.0) for sequences Illumina_Genomic_DNA_Adapters1_2 vs FC12044_91407_8_200_406_24 Alignment score (31.0) is less than minimum score(40.0) for sequences Illumina_Genomic_DNA_PCR_Primers1_1 vs FC12044_91407_8_200_406_24 Alignment score (25.5) is less than minimum score(40.0) for sequences Illumina_Genomic_DNA_PCR_Primers1_2 vs FC12044_91407_8_200_406_24 Alignment score (24.5) is less than minimum score(40.0) for sequences Illumina_Genomic_DNA_sequencing_primer vs FC12044_91407_8_200_406_24 Alignment score (16.5) is less than minimum score(40.0) for sequences Illumina_Paired_End_DNA_Adapters1_1 vs FC12044_91407_8_200_406_24 Alignment score (24.5) is less than minimum score(40.0) for sequences Illumina_Paired_End_DNA_Adapters1_2 vs FC12044_91407_8_200_406_24 Alignment score (31.0) is less than minimum score(40.0) for sequences Illumina_Paired_End_DNA_PCR_Primers1_1 vs FC12044_91407_8_200_406_24 Alignment score (21.0) is less than minimum score(40.0) for sequences Illumina_Paired_End_DNA_PCR_Primers1_2 vs FC12044_91407_8_200_406_24 Alignment score (24.5) is less than minimum score(40.0) for sequences Illumina_Paired_End_DNA_sequencing_primer_1 vs FC12044_91407_8_200_406_24 Alignment score (21.0) is less than minimum score(40.0) for sequences Illumina_Paired_End_DNA_sequencing_primer_2 vs FC12044_91407_8_200_406_24 Alignment score (14.5) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_Adapters1_1 vs FC12044_91407_8_200_406_24 Alignment score (24.5) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_Adapters1_2 vs FC12044_91407_8_200_406_24 Alignment score (12.0) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_Adapters2_1 vs FC12044_91407_8_200_406_24 Alignment score (12.0) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_Adapters2_2 vs FC12044_91407_8_200_406_24 Alignment score (12.0) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_PCR_Primer_1 vs FC12044_91407_8_200_406_24 Alignment score (23.5) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_sequencing_primer vs FC12044_91407_8_200_406_24 Alignment score (12.5) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_Adapters1_1 vs FC12044_91407_8_200_406_24 Alignment score (27.0) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_Adapters1_2 vs FC12044_91407_8_200_406_24 Alignment score (12.0) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_Adapters2_1 vs FC12044_91407_8_200_406_24 Alignment score (12.0) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_Adapters2_2 vs FC12044_91407_8_200_406_24 Alignment score (12.0) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_PCR_Primer_1 vs FC12044_91407_8_200_406_24 Alignment score (27.5) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_sequencing_primer vs FC12044_91407_8_200_406_24 Alignment score (12.0) is less than minimum score(40.0) for sequences Illumina_Small_RNA_RT_Primer vs FC12044_91407_8_200_406_24 Alignment score (23.5) is less than minimum score(40.0) for sequences Illumina_Small_RNA_5p_Adapter vs FC12044_91407_8_200_406_24 Alignment score (13.0) is less than minimum score(40.0) for sequences Illumina_Small_RNA_3p_Adapter vs FC12044_91407_8_200_406_24 Alignment score (12.0) is less than minimum score(40.0) for sequences Illumina_Small_RNA_PCR_Primer_1 vs FC12044_91407_8_200_406_24 Alignment score (23.5) is less than minimum score(40.0) for sequences Illumina_Small_RNA_sequencing_primer vs FC12044_91407_8_200_406_24 Alignment score (17.5) is less than minimum score(40.0) for sequences Illumina_Genomici_DNA_Adapters1_1 vs FC12044_91407_8_200_720_610 Alignment score (31.5) is less than minimum score(40.0) for sequences Illumina_Genomic_DNA_Adapters1_2 vs FC12044_91407_8_200_720_610 Alignment score (31.5) is less than minimum score(40.0) for sequences Illumina_Genomic_DNA_PCR_Primers1_1 vs FC12044_91407_8_200_720_610 Alignment score (20.5) is less than minimum score(40.0) for sequences Illumina_Genomic_DNA_PCR_Primers1_2 vs FC12044_91407_8_200_720_610 Alignment score (31.5) is less than minimum score(40.0) for sequences Illumina_Genomic_DNA_sequencing_primer vs FC12044_91407_8_200_720_610 Alignment score (0.0) is less than minimum score(40.0) for sequences Illumina_Paired_End_DNA_Adapters1_1 vs FC12044_91407_8_200_720_610 Alignment score (31.5) is less than minimum score(40.0) for sequences Illumina_Paired_End_DNA_Adapters1_2 vs FC12044_91407_8_200_720_610 Alignment score (31.5) is less than minimum score(40.0) for sequences Illumina_Paired_End_DNA_PCR_Primers1_1 vs FC12044_91407_8_200_720_610 Alignment score (33.5) is less than minimum score(40.0) for sequences Illumina_Paired_End_DNA_PCR_Primers1_2 vs FC12044_91407_8_200_720_610 Alignment score (31.5) is less than minimum score(40.0) for sequences Illumina_Paired_End_DNA_sequencing_primer_1 vs FC12044_91407_8_200_720_610 Alignment score (33.5) is less than minimum score(40.0) for sequences Illumina_Paired_End_DNA_sequencing_primer_2 vs FC12044_91407_8_200_720_610 Alignment score (20.0) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_Adapters1_1 vs FC12044_91407_8_200_720_610 Alignment score (9.0) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_Adapters1_2 vs FC12044_91407_8_200_720_610 Alignment score (11.0) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_Adapters2_1 vs FC12044_91407_8_200_720_610 Alignment score (15.5) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_Adapters2_2 vs FC12044_91407_8_200_720_610 Alignment score (11.0) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_PCR_Primer_1 vs FC12044_91407_8_200_720_610 Alignment score (10.0) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_PCR_Primer_2 vs FC12044_91407_8_200_720_610 Alignment score (15.0) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_sequencing_primer vs FC12044_91407_8_200_720_610 Alignment score (20.0) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_Adapters1_1 vs FC12044_91407_8_200_720_610 Alignment score (9.5) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_Adapters1_2 vs FC12044_91407_8_200_720_610 Alignment score (7.0) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_Adapters2_1 vs FC12044_91407_8_200_720_610 Alignment score (15.5) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_Adapters2_2 vs FC12044_91407_8_200_720_610 [Part of this file has been deleted for brevity] Alignment score (13.0) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_Adapters1_1 vs FC12044_91407_8_200_675_16 Alignment score (17.0) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_Adapters1_2 vs FC12044_91407_8_200_675_16 Alignment score (13.5) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_Adapters2_1 vs FC12044_91407_8_200_675_16 Alignment score (11.0) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_Adapters2_2 vs FC12044_91407_8_200_675_16 Alignment score (13.5) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_PCR_Primer_1 vs FC12044_91407_8_200_675_16 Alignment score (17.5) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_PCR_Primer_2 vs FC12044_91407_8_200_675_16 Alignment score (22.0) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_sequencing_primer vs FC12044_91407_8_200_675_16 Alignment score (13.0) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_Adapters1_1 vs FC12044_91407_8_200_675_16 Alignment score (17.0) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_Adapters1_2 vs FC12044_91407_8_200_675_16 Alignment score (13.5) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_Adapters2_1 vs FC12044_91407_8_200_675_16 Alignment score (11.0) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_Adapters2_2 vs FC12044_91407_8_200_675_16 Alignment score (13.5) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_PCR_Primer_1 vs FC12044_91407_8_200_675_16 Alignment score (17.5) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_PCR_Primer_2 vs FC12044_91407_8_200_675_16 Alignment score (21.5) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_sequencing_primer vs FC12044_91407_8_200_675_16 Alignment score (13.5) is less than minimum score(40.0) for sequences Illumina_Small_RNA_RT_Primer vs FC12044_91407_8_200_675_16 Alignment score (15.0) is less than minimum score(40.0) for sequences Illumina_Small_RNA_5p_Adapter vs FC12044_91407_8_200_675_16 Alignment score (7.0) is less than minimum score(40.0) for sequences Illumina_Small_RNA_3p_Adapter vs FC12044_91407_8_200_675_16 Alignment score (13.5) is less than minimum score(40.0) for sequences Illumina_Small_RNA_PCR_Primer_1 vs FC12044_91407_8_200_675_16 Alignment score (17.5) is less than minimum score(40.0) for sequences Illumina_Small_RNA_PCR_Primer_2 vs FC12044_91407_8_200_675_16 Alignment score (22.0) is less than minimum score(40.0) for sequences Illumina_Small_RNA_sequencing_primer vs FC12044_91407_8_200_675_16 Alignment score (21.0) is less than minimum score(40.0) for sequences Illumina_Genomici_DNA_Adapters1_1 vs FC12044_91407_8_200_285_136 Alignment score (17.5) is less than minimum score(40.0) for sequences Illumina_Genomic_DNA_Adapters1_2 vs FC12044_91407_8_200_285_136 Alignment score (30.0) is less than minimum score(40.0) for sequences Illumina_Genomic_DNA_PCR_Primers1_1 vs FC12044_91407_8_200_285_136 Alignment score (16.5) is less than minimum score(40.0) for sequences Illumina_Genomic_DNA_PCR_Primers1_2 vs FC12044_91407_8_200_285_136 Alignment score (17.5) is less than minimum score(40.0) for sequences Illumina_Genomic_DNA_sequencing_primer vs FC12044_91407_8_200_285_136 Alignment score (7.0) is less than minimum score(40.0) for sequences Illumina_Paired_End_DNA_Adapters1_1 vs FC12044_91407_8_200_285_136 Alignment score (17.5) is less than minimum score(40.0) for sequences Illumina_Paired_End_DNA_Adapters1_2 vs FC12044_91407_8_200_285_136 Alignment score (30.0) is less than minimum score(40.0) for sequences Illumina_Paired_End_DNA_PCR_Primers1_1 vs FC12044_91407_8_200_285_136 Alignment score (21.0) is less than minimum score(40.0) for sequences Illumina_Paired_End_DNA_PCR_Primers1_2 vs FC12044_91407_8_200_285_136 Alignment score (17.5) is less than minimum score(40.0) for sequences Illumina_Paired_End_DNA_sequencing_primer_1 vs FC12044_91407_8_200_285_136 Alignment score (18.5) is less than minimum score(40.0) for sequences Illumina_Paired_End_DNA_sequencing_primer_2 vs FC12044_91407_8_200_285_136 Alignment score (27.5) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_Adapters1_1 vs FC12044_91407_8_200_285_136 Alignment score (13.5) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_Adapters1_2 vs FC12044_91407_8_200_285_136 Alignment score (6.0) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_Adapters2_1 vs FC12044_91407_8_200_285_136 Alignment score (17.5) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_Adapters2_2 vs FC12044_91407_8_200_285_136 Alignment score (6.0) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_PCR_Primer_1 vs FC12044_91407_8_200_285_136 Alignment score (12.0) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_PCR_Primer_2 vs FC12044_91407_8_200_285_136 Alignment score (17.5) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_sequencing_primer vs FC12044_91407_8_200_285_136 Alignment score (26.5) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_Adapters1_1 vs FC12044_91407_8_200_285_136 Alignment score (14.5) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_Adapters1_2 vs FC12044_91407_8_200_285_136 Alignment score (2.0) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_Adapters2_1 vs FC12044_91407_8_200_285_136 Alignment score (17.5) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_Adapters2_2 vs FC12044_91407_8_200_285_136 Alignment score (6.0) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_PCR_Primer_1 vs FC12044_91407_8_200_285_136 Alignment score (12.0) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_PCR_Primer_2 vs FC12044_91407_8_200_285_136 Alignment score (15.5) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_sequencing_primer vs FC12044_91407_8_200_285_136 Alignment score (6.0) is less than minimum score(40.0) for sequences Illumina_Small_RNA_RT_Primer vs FC12044_91407_8_200_285_136 Alignment score (15.5) is less than minimum score(40.0) for sequences Illumina_Small_RNA_5p_Adapter vs FC12044_91407_8_200_285_136 Alignment score (17.5) is less than minimum score(40.0) for sequences Illumina_Small_RNA_3p_Adapter vs FC12044_91407_8_200_285_136 Alignment score (6.0) is less than minimum score(40.0) for sequences Illumina_Small_RNA_PCR_Primer_1 vs FC12044_91407_8_200_285_136 Alignment score (12.0) is less than minimum score(40.0) for sequences Illumina_Small_RNA_PCR_Primer_2 vs FC12044_91407_8_200_285_136 Alignment score (17.5) is less than minimum score(40.0) for sequences Illumina_Small_RNA_sequencing_primer vs FC12044_91407_8_200_285_136 |
The Identity: is the percentage of identical matches between the two sequences over the reported aligned region (including any gaps in the length).
The Similarity: is the percentage of matches between the two sequences over the reported aligned region (including any gaps in the length).
Data files
For protein sequences EBLOSUM62 is used for the substitution matrix. For nucleotide sequence, EDNAFULL is used. Others can be specified.
EMBOSS data files are distributed with the application and stored in the standard EMBOSS data directory, which is defined by the EMBOSS environment variable EMBOSS_DATA.
To see the available EMBOSS data files, run:
% embossdata -showall
To fetch one of the data files (for example 'Exxx.dat') into your current directory for you to inspect or modify, run:
% embossdata -fetch -file Exxx.dat
Users can provide their own data files in their own directories. Project specific files can be put in the current directory, or for tidier directory listings in a subdirectory called ".embossdata". Files for all EMBOSS runs can be put in the user's home directory, or again in a subdirectory called ".embossdata".
The directories are searched in the following order:
- . (your current directory)
- .embossdata (under your current directory)
- ~/ (your home directory)
- ~/.embossdata
Notes
needleall is a true implementation of the Needleman-Wunsch algorithm and so produces a full path matrix. It therefore cannot be used with genome sized sequences unless you've a lot of memory and a lot of time.References
- Needleman, S. B. and Wunsch, C. D. (1970) J. Mol. Biol. 48, 443-453.
- Kruskal, J. B. (1983) An overview of squence comparison In D. Sankoff and J. B. Kruskal, (ed.), Time warps, string edits and macromolecules: the theory and practice of sequence comparison, pp. 1-44 Addison Wesley.
Warnings
needleall is for aligning pairs of sequences over their entire length. This works best with closely related sequences. If you use needleall to align very distantly-related sequences, it will produce a result but much of the alignment may have little or no biological significance.A true Needleman Wunsch implementation like needleall needs memory proportional to the product of the sequence lengths. For two sequences of length 10,000,000 and 1,000 it therefore needs memory proportional to 10,000,000,000 characters. Two arrays of this size are produced, one of ints and one of floats so multiply that figure by 8 to get the memory usage in bytes. That doesn't include other overheads. Therefore only use water and needle for accurate alignment of reasonably short sequences.
The first input sequence set is loaded completely into memory. When comparing large numbers (or lengths) of sequences, the smallest set should be the first input to make the most efficient use of memory.
If you run out of memory, try using stretcher instead.
Diagnostic Error Messages
Uncaught exception Assertion failed raised at ajmem.c:xxx
Probably means you have run out of memory. Try using stretcher if this happens.
Exit status
0 upon successful completion.Known bugs
None.See also
Program name | Description |
---|---|
est2genome | Align EST sequences to genomic DNA sequence |
needle | Needleman-Wunsch global alignment of two sequences |
stretcher | Needleman-Wunsch rapid global alignment of two sequences |
When you want an alignment that covers the whole length of two sequences, use needle.
When you are trying to find the best region of similarity between two sequences, use water.
stretcher is a more suitable program to use to find global alignments of very long sequences.
Author(s)
Mahmut UludagEuropean Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
Please report all bugs to the EMBOSS bug team (emboss-bug © emboss.open-bio.org) not to the original author.