EMBOSS at CSC

EMBOSS: needleall

Wiki

The master copies of EMBOSS documentation are available at http://emboss.open-bio.org/wiki/Appdocs on the EMBOSS Wiki.

Please help by correcting and extending the Wiki pages.

Function

Many-to-many pairwise alignments of two sequence sets

Description

needleall reads a set of input sequences and compares them all to one or more sequences, writing their optimal global sequence alignments to file. It uses the Needleman-Wunsch alignment algorithm to find the optimal alignment (including gaps) of two sequences along their entire length. The algorithm uses dynamic programming to guarantee that the alignment is optimal, effectively considering all possible alignments and choosing the best. A scoring matrix is read that contains a value for every possible residue or nucleotide pairing. needleall finds the alignment with the maximum possible score, where the score of an alignment is the sum of the match values taken from the scoring matrix, minus the penalties for opening and extending gaps in the aligned sequences. The substitution matrix and the gap opening and extension penalties are user-specified.

Algorithm

The Needleman-Wunsch algorithm is a member of the class of algorithms that can calculate the best score and alignment of two sequences in a number of steps on the order of mn, where n and m are the sequence lengths. These dynamic programming algorithms were first developed for protein sequence comparison by Needleman and Wunsch, though similar methods were independently devised during the late 1960s and early 1970s for use in the fields of speech processing and computer science.
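
As a rough illustration of the mn-step dynamic programming described above, the following minimal Python sketch computes only the score of a global alignment with affine gaps. It is not the EMBOSS implementation: the function name global_score is hypothetical, the match and mismatch values of 5 and -4 are assumed stand-ins for a full scoring matrix, a gap of k positions is assumed to cost gap_open + (k - 1) * gap_extend, and end gaps are penalized like internal gaps (needleall does not apply end gap penalties unless -endweight is set).

```python
NEG_INF = float("-inf")


def global_score(a, b, match=5.0, mismatch=-4.0, gap_open=10.0, gap_extend=0.5):
    """Score-only global alignment with affine gaps (hypothetical helper).

    Runs in time proportional to len(a) * len(b).
    """
    n, m = len(a), len(b)
    # M: last column pairs a[i-1] with b[j-1]
    # X: last column pairs a[i-1] with a gap
    # Y: last column pairs a gap with b[j-1]
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = -(gap_open + (i - 1) * gap_extend)
    for j in range(1, m + 1):
        Y[0][j] = -(gap_open + (j - 1) * gap_extend)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Opening a gap costs gap_open; lengthening one costs gap_extend.
            X[i][j] = max(M[i - 1][j] - gap_open,
                          Y[i - 1][j] - gap_open,
                          X[i - 1][j] - gap_extend)
            Y[i][j] = max(M[i][j - 1] - gap_open,
                          X[i][j - 1] - gap_open,
                          Y[i][j - 1] - gap_extend)
    return max(M[n][m], X[n][m], Y[n][m])


print(global_score("GATTACA", "GATTACA"))
print(global_score("ACGT", "AGT"))
```

Three tables are kept so that opening a gap and extending an existing gap can be charged differently; a single table suffices only when every gap position costs the same.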

An important problem is the treatment of gaps, i.e., spaces inserted to optimise the alignment score. A penalty is subtracted from the score for each gap opened (the 'gap open' penalty) and a penalty is subtracted from the score for the total number of gap spaces multiplied by a cost (the 'gap extension' penalty). Typically, the cost of extending a gap is set to be 5-10 times lower than the cost for opening a gap.

The penalty for a gap of n positions is calculated using the following formula:

gap opening penalty + (n - 1) * gap extension penalty
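
The formula can be checked with a few lines of Python. This is only a sketch: the function name gap_penalty is hypothetical, and the default values are the needleall defaults for -gapopen (10.0) and -gapextend (0.5) listed under the command line arguments.

```python
def gap_penalty(n, gap_open=10.0, gap_extend=0.5):
    """Penalty for one gap of n positions: gap_open + (n - 1) * gap_extend."""
    if n < 1:
        raise ValueError("a gap must span at least one position")
    return gap_open + (n - 1) * gap_extend


# One gap of four positions versus four separate single-position gaps.
one_long_gap = gap_penalty(4)
four_short_gaps = 4 * gap_penalty(1)
print(one_long_gap, four_short_gaps)
```

With these defaults a single long gap is much cheaper than several short ones, which is why the extension penalty is usually set well below the opening penalty.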

In a Needleman-Wunsch global alignment, the entire length of each sequence is aligned. The sequences might be partially overlapping or one sequence might be aligned entirely internally to the other. There is no penalty for the hanging ends of the overlap. In bioinformatics, it is usually reasonable to assume that the sequences are incomplete and there should be no penalty for failing to align the missing bases.

Usage

Here is a sample session with needleall:


% needleall -minscore 40 -stdout -auto ../data/test1_illumina.fastq 

Illumina_DpnII_Gex_PCR_Primer_2 FC12044_91407_8_200_406_24 45 (41.0)
Illumina_NlaIII_Gex_PCR_Primer_2 FC12044_91407_8_200_406_24 45 (41.0)
Illumina_Small_RNA_PCR_Primer_2 FC12044_91407_8_200_406_24 45 (41.0)
Illumina_DpnII_Gex_Adapters1_1 FC12044_91407_8_200_106_131 35 (40.5)
Illumina_Paired_End_DNA_Adapters1_1 FC12044_91407_8_200_57_85 35 (41.0)
Illumina_DpnII_Gex_Adapters1_1 FC12044_91407_8_200_154_436 31 (42.0)
Illumina_Genomic_DNA_PCR_Primers1_1 FC12044_91407_8_200_83_511 64 (42.0)
Illumina_Paired_End_DNA_PCR_Primers1_1 FC12044_91407_8_200_83_511 64 (42.0)
Illumina_DpnII_Gex_Adapters1_2 FC12044_91407_8_200_303_427 33 (40.5)
Illumina_DpnII_Gex_PCR_Primer_2 FC12044_91407_8_200_303_427 51 (40.5)
Illumina_DpnII_Gex_sequencing_primer FC12044_91407_8_200_303_427 38 (44.5)
Illumina_NlaIII_Gex_Adapters1_2 FC12044_91407_8_200_303_427 36 (40.5)
Illumina_NlaIII_Gex_PCR_Primer_2 FC12044_91407_8_200_303_427 51 (40.5)
Illumina_NlaIII_Gex_sequencing_primer FC12044_91407_8_200_303_427 39 (40.5)
Illumina_Small_RNA_5p_Adapter FC12044_91407_8_200_303_427 33 (40.5)
Illumina_Small_RNA_PCR_Primer_2 FC12044_91407_8_200_303_427 51 (40.5)
Illumina_Small_RNA_sequencing_primer FC12044_91407_8_200_303_427 38 (44.5)
Illumina_Paired_End_DNA_Adapters1_1 FC12044_91407_8_200_553_135 33 (44.5)
Illumina_DpnII_Gex_PCR_Primer_2 FC12044_91407_8_200_139_74 51 (46.0)
Illumina_DpnII_Gex_sequencing_primer FC12044_91407_8_200_139_74 38 (42.0)
Illumina_NlaIII_Gex_PCR_Primer_2 FC12044_91407_8_200_139_74 51 (46.0)
Illumina_Small_RNA_PCR_Primer_2 FC12044_91407_8_200_139_74 51 (46.0)
Illumina_Small_RNA_sequencing_primer FC12044_91407_8_200_139_74 38 (42.0)

#---------------------------------------
#---------------------------------------
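
The score-format lines above can be read back programmatically. The following Python sketch assumes that each line holds two sequence names, an integer (taken here to be the alignment length), and the score in parentheses. That column interpretation, and the names ScoreLine and parse_score_output, are assumptions made for illustration.

```python
import re
from typing import NamedTuple


class ScoreLine(NamedTuple):
    seq_a: str
    seq_b: str
    length: int   # assumed to be the alignment length
    score: float


_SCORE_RE = re.compile(r"^(\S+)\s+(\S+)\s+(\d+)\s+\(([-+]?\d+(?:\.\d+)?)\)$")


def parse_score_output(text):
    """Return a ScoreLine for every data line; skip blanks and '#' lines."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        found = _SCORE_RE.match(line)
        if found:
            rows.append(ScoreLine(found.group(1), found.group(2),
                                  int(found.group(3)), float(found.group(4))))
    return rows


sample = """\
Illumina_DpnII_Gex_PCR_Primer_2 FC12044_91407_8_200_406_24 45 (41.0)
Illumina_DpnII_Gex_Adapters1_1 FC12044_91407_8_200_106_131 35 (40.5)

#---------------------------------------
"""
rows = parse_score_output(sample)
best = max(rows, key=lambda r: r.score)
print(best.seq_a, best.score)
```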

Command line arguments

Many-to-many pairwise alignments of two sequence sets
Version: EMBOSS:6.4.0.0

   Standard (Mandatory) qualifiers:
  [-asequence]         seqset     Sequence set filename and optional format,
                                  or reference (input USA)
  [-bsequence]         seqall     Sequence(s) filename and optional format, or
                                  reference (input USA)
   -gapopen            float      [10.0 for any sequence] The gap open penalty
                                  is the score taken away when a gap is
                                  created. The best value depends on the
                                  choice of comparison matrix. The default
                                  value assumes you are using the EBLOSUM62
                                  matrix for protein sequences, and the
                                  EDNAFULL matrix for nucleotide sequences.
                                  (Floating point number from 1.0 to 100.0)
   -gapextend          float      [0.5 for any sequence] The gap extension,
                                  penalty is added to the standard gap penalty
                                  for each base or residue in the gap. This
                                  is how long gaps are penalized. Usually you
                                  will expect a few long gaps rather than many
                                  short gaps, so the gap extension penalty
                                  should be lower than the gap penalty. An
                                  exception is where one or both sequences are
                                  single reads with possible sequencing
                                  errors in which case you would expect many
                                  single base gaps. You can get this result by
                                  setting the gap open penalty to zero (or
                                  very low) and using the gap extension
                                  penalty to control gap scoring. (Floating
                                  point number from 0.0 to 10.0)
  [-outfile]           align      [*.needleall] Output alignment file name
                                  (default -aformat score)

   Additional (Optional) qualifiers:
   -datafile           matrixf    [EBLOSUM62 for protein, EDNAFULL for DNA]
                                  This is the scoring matrix file used when
                                  comparing sequences. By default it is the
                                  file 'EBLOSUM62' (for proteins) or the file
                                  'EDNAFULL' (for nucleic sequences). These
                                  files are found in the 'data' directory of
                                  the EMBOSS installation.
   -endweight          boolean    [N] Apply end gap penalties.
   -endopen            float      [10.0 for any sequence] The end gap open
                                  penalty is the score taken away when an end
                                  gap is created. The best value depends on
                                  the choice of comparison matrix. The default
                                  value assumes you are using the EBLOSUM62
                                  matrix for protein sequences, and the
                                  EDNAFULL matrix for nucleotide sequences.
                                  (Floating point number from 1.0 to 100.0)
   -endextend          float      [0.5 for any sequence] The end gap
                                  extension, penalty is added to the end gap
                                  penalty for each base or residue in the end
                                  gap. (Floating point number from 0.0 to
                                  10.0)
   -minscore           float      [1.0 for any sequence] Minimum alignment
                                  score to report an alignment. (Floating
                                  point number from -10.0 to 100.0)
   -errorfile          outfile    [needleall.error] Error file to be written
                                  to

   Advanced (Unprompted) qualifiers:
   -[no]brief          boolean    [Y] Brief identity and similarity

   Associated qualifiers:

   "-asequence" associated qualifiers
   -sbegin1            integer    Start of each sequence to be used
   -send1              integer    End of each sequence to be used
   -sreverse1          boolean    Reverse (if DNA)
   -sask1              boolean    Ask for begin/end/reverse
   -snucleotide1       boolean    Sequence is nucleotide
   -sprotein1          boolean    Sequence is protein
   -slower1            boolean    Make lower case
   -supper1            boolean    Make upper case
   -sformat1           string     Input sequence format
   -sdbname1           string     Database name
   -sid1               string     Entryname
   -ufo1               string     UFO features
   -fformat1           string     Features format
   -fopenfile1         string     Features file name

   "-bsequence" associated qualifiers
   -sbegin2            integer    Start of each sequence to be used
   -send2              integer    End of each sequence to be used
   -sreverse2          boolean    Reverse (if DNA)
   -sask2              boolean    Ask for begin/end/reverse
   -snucleotide2       boolean    Sequence is nucleotide
   -sprotein2          boolean    Sequence is protein
   -slower2            boolean    Make lower case
   -supper2            boolean    Make upper case
   -sformat2           string     Input sequence format
   -sdbname2           string     Database name
   -sid2               string     Entryname
   -ufo2               string     UFO features
   -fformat2           string     Features format
   -fopenfile2         string     Features file name

   "-outfile" associated qualifiers
   -aformat3           string     Alignment format
   -aextension3        string     File name extension
   -adirectory3        string     Output directory
   -aname3             string     Base file name
   -awidth3            integer    Alignment width
   -aaccshow3          boolean    Show accession number in the header
   -adesshow3          boolean    Show description in the header
   -ausashow3          boolean    Show the full USA in the alignment
   -aglobal3           boolean    Show the full sequence in alignment

   "-errorfile" associated qualifiers
   -odirectory         string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write first file to standard output
   -filter             boolean    Read first file from standard input, write
                                  first file to standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options and exit. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages
   -version            boolean    Report version number and exit

Qualifier	Type	Description	Allowed values	Default
Standard (Mandatory) qualifiers
[-asequence] (Parameter 1)	seqset	Sequence set filename and optional format, or reference (input USA)	Readable set of sequences	Required
[-bsequence] (Parameter 2)	seqall	Sequence(s) filename and optional format, or reference (input USA)	Readable sequence(s)	Required
-gapopen	float	The gap open penalty is the score taken away when a gap is created. The best value depends on the choice of comparison matrix. The default value assumes you are using the EBLOSUM62 matrix for protein sequences, and the EDNAFULL matrix for nucleotide sequences.	Floating point number from 1.0 to 100.0	10.0 for any sequence
-gapextend	float	The gap extension penalty is added to the standard gap penalty for each base or residue in the gap. This is how long gaps are penalized. Usually you will expect a few long gaps rather than many short gaps, so the gap extension penalty should be lower than the gap penalty. An exception is where one or both sequences are single reads with possible sequencing errors, in which case you would expect many single base gaps. You can get this result by setting the gap open penalty to zero (or very low) and using the gap extension penalty to control gap scoring.	Floating point number from 0.0 to 10.0	0.5 for any sequence
[-outfile] (Parameter 3)	align	Output alignment file name	(default -aformat score)	*.needleall
Additional (Optional) qualifiers
-datafile	matrixf	This is the scoring matrix file used when comparing sequences. By default it is the file 'EBLOSUM62' (for proteins) or the file 'EDNAFULL' (for nucleic sequences). These files are found in the 'data' directory of the EMBOSS installation.	Comparison matrix file in EMBOSS data path	EBLOSUM62 for protein EDNAFULL for DNA
-endweight	boolean	Apply end gap penalties.	Boolean value Yes/No	No
-endopen	float	The end gap open penalty is the score taken away when an end gap is created. The best value depends on the choice of comparison matrix. The default value assumes you are using the EBLOSUM62 matrix for protein sequences, and the EDNAFULL matrix for nucleotide sequences.	Floating point number from 1.0 to 100.0	10.0 for any sequence
-endextend	float	The end gap extension penalty is added to the end gap penalty for each base or residue in the end gap.	Floating point number from 0.0 to 10.0	0.5 for any sequence
-minscore	float	Minimum alignment score to report an alignment.	Floating point number from -10.0 to 100.0	1.0 for any sequence
-errorfile	outfile	Error file to be written to	Output file	needleall.error
Advanced (Unprompted) qualifiers
-[no]brief	boolean	Brief identity and similarity	Boolean value Yes/No	Yes
Associated qualifiers
"-asequence" associated seqset qualifiers
-sbegin1 -sbegin_asequence	integer	Start of each sequence to be used	Any integer value	0
-send1 -send_asequence	integer	End of each sequence to be used	Any integer value	0
-sreverse1 -sreverse_asequence	boolean	Reverse (if DNA)	Boolean value Yes/No	N
-sask1 -sask_asequence	boolean	Ask for begin/end/reverse	Boolean value Yes/No	N
-snucleotide1 -snucleotide_asequence	boolean	Sequence is nucleotide	Boolean value Yes/No	N
-sprotein1 -sprotein_asequence	boolean	Sequence is protein	Boolean value Yes/No	N
-slower1 -slower_asequence	boolean	Make lower case	Boolean value Yes/No	N
-supper1 -supper_asequence	boolean	Make upper case	Boolean value Yes/No	N
-sformat1 -sformat_asequence	string	Input sequence format	Any string
-sdbname1 -sdbname_asequence	string	Database name	Any string
-sid1 -sid_asequence	string	Entryname	Any string
-ufo1 -ufo_asequence	string	UFO features	Any string
-fformat1 -fformat_asequence	string	Features format	Any string
-fopenfile1 -fopenfile_asequence	string	Features file name	Any string
"-bsequence" associated seqall qualifiers
-sbegin2 -sbegin_bsequence	integer	Start of each sequence to be used	Any integer value	0
-send2 -send_bsequence	integer	End of each sequence to be used	Any integer value	0
-sreverse2 -sreverse_bsequence	boolean	Reverse (if DNA)	Boolean value Yes/No	N
-sask2 -sask_bsequence	boolean	Ask for begin/end/reverse	Boolean value Yes/No	N
-snucleotide2 -snucleotide_bsequence	boolean	Sequence is nucleotide	Boolean value Yes/No	N
-sprotein2 -sprotein_bsequence	boolean	Sequence is protein	Boolean value Yes/No	N
-slower2 -slower_bsequence	boolean	Make lower case	Boolean value Yes/No	N
-supper2 -supper_bsequence	boolean	Make upper case	Boolean value Yes/No	N
-sformat2 -sformat_bsequence	string	Input sequence format	Any string
-sdbname2 -sdbname_bsequence	string	Database name	Any string
-sid2 -sid_bsequence	string	Entryname	Any string
-ufo2 -ufo_bsequence	string	UFO features	Any string
-fformat2 -fformat_bsequence	string	Features format	Any string
-fopenfile2 -fopenfile_bsequence	string	Features file name	Any string
"-outfile" associated align qualifiers
-aformat3 -aformat_outfile	string	Alignment format	Any string	score
-aextension3 -aextension_outfile	string	File name extension	Any string
-adirectory3 -adirectory_outfile	string	Output directory	Any string
-aname3 -aname_outfile	string	Base file name	Any string
-awidth3 -awidth_outfile	integer	Alignment width	Any integer value	0
-aaccshow3 -aaccshow_outfile	boolean	Show accession number in the header	Boolean value Yes/No	N
-adesshow3 -adesshow_outfile	boolean	Show description in the header	Boolean value Yes/No	N
-ausashow3 -ausashow_outfile	boolean	Show the full USA in the alignment	Boolean value Yes/No	N
-aglobal3 -aglobal_outfile	boolean	Show the full sequence in alignment	Boolean value Yes/No	Y
"-errorfile" associated outfile qualifiers
-odirectory	string	Output directory	Any string
General qualifiers
-auto	boolean	Turn off prompts	Boolean value Yes/No	N
-stdout	boolean	Write first file to standard output	Boolean value Yes/No	N
-filter	boolean	Read first file from standard input, write first file to standard output	Boolean value Yes/No	N
-options	boolean	Prompt for standard and additional values	Boolean value Yes/No	N
-debug	boolean	Write debug output to program.dbg	Boolean value Yes/No	N
-verbose	boolean	Report some/full command line options	Boolean value Yes/No	Y
-help	boolean	Report command line options and exit. More information on associated and general qualifiers can be found with -help -verbose	Boolean value Yes/No	N
-warning	boolean	Report warnings	Boolean value Yes/No	Y
-error	boolean	Report errors	Boolean value Yes/No	Y
-fatal	boolean	Report fatal errors	Boolean value Yes/No	Y
-die	boolean	Report dying program messages	Boolean value Yes/No	Y
-version	boolean	Report version number and exit	Boolean value Yes/No	N
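
When needleall is driven from a script, the allowed ranges in the table above can be checked before the program is started. The Python sketch below only assembles an argument list and does not run anything; the function name build_needleall_command and the file names in the example are hypothetical, while the qualifiers and ranges are the ones listed above.

```python
# Allowed ranges copied from the qualifier table.
RANGES = {
    "gapopen": (1.0, 100.0),
    "gapextend": (0.0, 10.0),
    "minscore": (-10.0, 100.0),
}


def build_needleall_command(asequence, bsequence, outfile,
                            gapopen=10.0, gapextend=0.5, minscore=1.0):
    """Return a needleall argument list after validating numeric qualifiers."""
    values = {"gapopen": gapopen, "gapextend": gapextend, "minscore": minscore}
    for name, value in values.items():
        low, high = RANGES[name]
        if not low <= value <= high:
            raise ValueError(f"-{name} must be between {low} and {high}, got {value}")
    cmd = ["needleall", "-asequence", asequence, "-bsequence", bsequence]
    for name, value in values.items():
        cmd += [f"-{name}", str(float(value))]
    # -auto turns off prompts, as in the sample session.
    cmd += ["-outfile", outfile, "-auto"]
    return cmd


cmd = build_needleall_command("adapters.fa", "reads.fastq", "hits.needleall",
                              minscore=40)
print(" ".join(cmd))
```

If EMBOSS is installed, a list built this way could be passed to subprocess.run(cmd, check=True).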

Input file format

needleall reads two nucleotide or protein sequence inputs, each of which can contain one or more sequences. All sequences in the first input are aligned to all sequences in the second input.
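
The many-to-many pairing can be pictured with a short Python sketch. The function name enumerate_pairs is hypothetical, and the ordering (every sequence of the first input against one sequence of the second input, then on to the next) is an assumption based on the order of messages in the example error file, not a documented guarantee.

```python
def enumerate_pairs(first_ids, second_ids):
    """List every (first, second) pair, grouped by sequence of the second input."""
    return [(a, b) for b in second_ids for a in first_ids]


adapters = ["adapter_1", "adapter_2", "adapter_3"]
reads = ["read_1", "read_2"]
pairs = enumerate_pairs(adapters, reads)
print(len(pairs))
print(pairs[:3])
```

The number of alignments is the product of the two input sizes, so run time grows quickly when both inputs are large.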

The input is a standard EMBOSS sequence query (also known as a 'USA').

Major sequence database sources defined as standard in EMBOSS installations include srs:embl, srs:uniprot and ensembl.

Data can also be read from sequence output in any supported format written by an EMBOSS or third-party application.

The input format can be specified by using the command-line qualifier -sformat xxx, where 'xxx' is replaced by the name of the required format. The available format names are: gff (gff3), gff2, embl (em), genbank (gb, refseq), ddbj, refseqp, pir (nbrf), swissprot (swiss, sw), dasgff and debug.

See: http://emboss.sf.net/docs/themes/SequenceFormats.html for further information on sequence formats.

Input files for usage example

File: illumina_adapter_primer.fa

>Illumina_Genomici_DNA_Adapters1_1
GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG
>Illumina_Genomic_DNA_Adapters1_2
ACACTCTTTCCCTACACGACGCTCTTCCGATCT
>Illumina_Genomic_DNA_PCR_Primers1_1
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
>Illumina_Genomic_DNA_PCR_Primers1_2
CAAGCAGAAGACGGCATACGAGCTCTTCCGATCT
>Illumina_Genomic_DNA_sequencing_primer
ACACTCTTTCCCTACACGACGCTCTTCCGATCT
>Illumina_Paired_End_DNA_Adapters1_1
GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG
>Illumina_Paired_End_DNA_Adapters1_2
ACACTCTTTCCCTACACGACGCTCTTCCGATCT
>Illumina_Paired_End_DNA_PCR_Primers1_1
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
>Illumina_Paired_End_DNA_PCR_Primers1_2
CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT
>Illumina_Paired_End_DNA_sequencing_primer_1
ACACTCTTTCCCTACACGACGCTCTTCCGATCT
>Illumina_Paired_End_DNA_sequencing_primer_2
CGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT
>Illumina_DpnII_Gex_Adapters1_1
GATCGTCGGACTGTAGAACTCTGAAC
>Illumina_DpnII_Gex_Adapters1_2
ACAGGTTCAGAGTTCTACAGTCCGAC
>Illumina_DpnII_Gex_Adapters2_1
CAAGCAGAAGACGGCATACGA
>Illumina_DpnII_Gex_Adapters2_2
TCGTATGCCGTCTTCTGCTTG
>Illumina_DpnII_Gex_PCR_Primer_1
CAAGCAGAAGACGGCATACGA
>Illumina_DpnII_Gex_PCR_Primer_2
AATGATACGGCGACCACCGACAGGTTCAGAGTTCTACAGTCCGA
>Illumina_DpnII_Gex_sequencing_primer
CGACAGGTTCAGAGTTCTACAGTCCGACGATC
>Illumina_NlaIII_Gex_Adapters1_1
TCGGACTGTAGAACTCTGAAC
>Illumina_NlaIII_Gex_Adapters1_2
ACAGGTTCAGAGTTCTACAGTCCGACATG
>Illumina_NlaIII_Gex_Adapters2_1
CAAGCAGAAGACGGCATACGANN
>Illumina_NlaIII_Gex_Adapters2_2
TCGTATGCCGTCTTCTGCTTG
>Illumina_NlaIII_Gex_PCR_Primer_1
CAAGCAGAAGACGGCATACGA
>Illumina_NlaIII_Gex_PCR_Primer_2
AATGATACGGCGACCACCGACAGGTTCAGAGTTCTACAGTCCGA
>Illumina_NlaIII_Gex_sequencing_primer
CCGACAGGTTCAGAGTTCTACAGTCCGACATG
>Illumina_Small_RNA_RT_Primer
CAAGCAGAAGACGGCATACGA
>Illumina_Small_RNA_5p_Adapter
GTTCAGAGTTCTACAGTCCGACGATC
>Illumina_Small_RNA_3p_Adapter
TCGTATGCCGTCTTCTGCTTGT
>Illumina_Small_RNA_PCR_Primer_1
CAAGCAGAAGACGGCATACGA
>Illumina_Small_RNA_PCR_Primer_2
AATGATACGGCGACCACCGACAGGTTCAGAGTTCTACAGTCCGA
>Illumina_Small_RNA_sequencing_primer
CGACAGGTTCAGAGTTCTACAGTCCGACGATC

File: test1_illumina.fastq

@FC12044_91407_8_200_406_24
GTTAGCTCCCACCTTAAGATGTTTA
+FC12044_91407_8_200_406_24
SXXTXXXXXXXXXTTSUXSSXKTMQ
@FC12044_91407_8_200_720_610
CTCTGTGGCACCCCATCCCTCACTT
+FC12044_91407_8_200_720_610
OXXXXXXXXXXXXXXXXXTSXQTXU
@FC12044_91407_8_200_345_133
GATTTTTTAACAATAAACGTACATA
+FC12044_91407_8_200_345_133
OQTOOSFORTFFFIIOFFFFFFFFF
@FC12044_91407_8_200_106_131
GTTGCCCAGGCTCGTCTTGAACTCC
+FC12044_91407_8_200_106_131
XXXXXXXXXXXXXXSXXXXISTXQS
@FC12044_91407_8_200_916_471
TGATTGAAGGTAGGGTAGCATACTG
+FC12044_91407_8_200_916_471
XXXXXXXXXXXXXXXUXXUSXXTXW
@FC12044_91407_8_200_57_85
GCTCCAATAGCGCAGAGGAAACCTG
+FC12044_91407_8_200_57_85
XFXMXSXXSXXXOSQROOSROFQIQ
@FC12044_91407_8_200_10_437
GCTGCTTGGGAGGCTGAGGCAGGAG
+FC12044_91407_8_200_10_437
USXSXXXXXXUXXXSXQXXUQXXKS
@FC12044_91407_8_200_154_436
AGACCTTTGGATACAATGAACGACT
+FC12044_91407_8_200_154_436
MKKMQTSRXMSQTOMRFOOIFFFFF
@FC12044_91407_8_200_336_64
AGGGAATTTTAGAGGAGGGCTGCCG
+FC12044_91407_8_200_336_64
STQMOSXSXSQXQXXKXXXKFXFFK
@FC12044_91407_8_200_620_233
TCTCCATGTTGGTCAGGCTGGTCTC
+FC12044_91407_8_200_620_233
XXXXXXXXXXXXXXXXXXXXXSXSW
@FC12044_91407_8_200_902_349
TGAACGTCGAGACGCAAGGCCCGCC
+FC12044_91407_8_200_902_349
XMXSSXMXXSXQSXTSQXFKSKTOF
@FC12044_91407_8_200_40_618
CTGTCCCCACGGCGGGGGGGCCTGG
+FC12044_91407_8_200_40_618
TXXXXSXXXXXXXXXXXXXRKFOXS
@FC12044_91407_8_200_83_511
GATGTACTCTTACACCCAGACTTTG
+FC12044_91407_8_200_83_511
SOXXXXXUXXXXXXQKQKKROOQSU
@FC12044_91407_8_200_76_246
TCAAGGGTGGATCTTGGCTCCCAGT
+FC12044_91407_8_200_76_246
XTXTUXXXXXRXXXTXXSUXSRFXQ
@FC12044_91407_8_200_303_427
TTGCGACAGAGTTTTGCTCTTGTCC
+FC12044_91407_8_200_303_427
XXQROXXXXIXFQXXXOIQSSXUFF
@FC12044_91407_8_200_31_299
TCTGCTCCAGCTCCAAGACGCCGCC
+FC12044_91407_8_200_31_299
XRXTSXXXRXXSXQQOXQTSQSXKQ
@FC12044_91407_8_200_553_135
TACGGAGCCGCGGGCGGGAAAGGCG
+FC12044_91407_8_200_553_135
XSQQXXXXXXXXXXSXXMFFQXTKU
@FC12044_91407_8_200_139_74
CCTCCCAGGTTCAAGCGATTATCCT
+FC12044_91407_8_200_139_74
RMXUSXTXXQXXQUXXXSQISISSO
@FC12044_91407_8_200_108_33
GTCATGGCGGCCCGCGCGGGGAGCG
+FC12044_91407_8_200_108_33
OOOSSXXSXXOMKMOFMKFOKFFFF
@FC12044_91407_8_200_980_965
ACAGTGGGTTCTTAAAGAAGAGTCG
+FC12044_91407_8_200_980_965
TOSSRXXXSSMSXMOMXIRXOXFFS
@FC12044_91407_8_200_981_857
AACGAGGGGCGCGACTTGACCTTGG
+FC12044_91407_8_200_981_857
RXMSSXXXXSXQXQXFSXQFQKMXS
@FC12044_91407_8_200_8_865
TTTCCCACCCCAGGAAGCCTTGGAC
+FC12044_91407_8_200_8_865
XXXFKOROMKOORMIMRIIKKORFF
@FC12044_91407_8_200_292_484
TCAGCCTCCGTGCCCAGCCCACTCC
+FC12044_91407_8_200_292_484
XQXOSXXXXXUXXXXIXXXXQTOXF
@FC12044_91407_8_200_675_16
CTCGGGAGGCTGAGGCAGGGGGGTT
+FC12044_91407_8_200_675_16
OXTXXXSXXQXXOXXKMXXMXOKQF
@FC12044_91407_8_200_285_136
CCAAATCTTGAATTGTAGCTCCCCT
+FC12044_91407_8_200_285_136
OSXOQXXXXXSXXUXXTXXXXTRMS

Output file format

The output is a standard EMBOSS alignment file.

The results can be output in one of several styles by using the command-line qualifier -aformat xxx, where 'xxx' is replaced by the name of the required format. Some of the alignment formats can cope with an unlimited number of sequences, while others are only for pairs of sequences.

The available multiple alignment format names are: multiple, simple, fasta, msf, clustal, mega, meganon, nexus, nexusnon, phylip, phylipnon, selex, treecon, tcoffee, debug, srs.

The available pairwise alignment format names are: pair, markx0, markx1, markx2, markx3, markx10, match, sam, bam, score, srspair.

See: http://emboss.sf.net/docs/themes/AlignFormats.html for further information on alignment formats.

Output files for usage example

File: needleall.error

Alignment score (21.5) is less than minimum score(40.0) for sequences Illumina_Genomici_DNA_Adapters1_1 vs FC12044_91407_8_200_406_24
Alignment score (24.5) is less than minimum score(40.0) for sequences Illumina_Genomic_DNA_Adapters1_2 vs FC12044_91407_8_200_406_24
Alignment score (31.0) is less than minimum score(40.0) for sequences Illumina_Genomic_DNA_PCR_Primers1_1 vs FC12044_91407_8_200_406_24
Alignment score (25.5) is less than minimum score(40.0) for sequences Illumina_Genomic_DNA_PCR_Primers1_2 vs FC12044_91407_8_200_406_24
Alignment score (24.5) is less than minimum score(40.0) for sequences Illumina_Genomic_DNA_sequencing_primer vs FC12044_91407_8_200_406_24
Alignment score (16.5) is less than minimum score(40.0) for sequences Illumina_Paired_End_DNA_Adapters1_1 vs FC12044_91407_8_200_406_24
Alignment score (24.5) is less than minimum score(40.0) for sequences Illumina_Paired_End_DNA_Adapters1_2 vs FC12044_91407_8_200_406_24
Alignment score (31.0) is less than minimum score(40.0) for sequences Illumina_Paired_End_DNA_PCR_Primers1_1 vs FC12044_91407_8_200_406_24
Alignment score (21.0) is less than minimum score(40.0) for sequences Illumina_Paired_End_DNA_PCR_Primers1_2 vs FC12044_91407_8_200_406_24
Alignment score (24.5) is less than minimum score(40.0) for sequences Illumina_Paired_End_DNA_sequencing_primer_1 vs FC12044_91407_8_200_406_24
Alignment score (21.0) is less than minimum score(40.0) for sequences Illumina_Paired_End_DNA_sequencing_primer_2 vs FC12044_91407_8_200_406_24
Alignment score (14.5) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_Adapters1_1 vs FC12044_91407_8_200_406_24
Alignment score (24.5) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_Adapters1_2 vs FC12044_91407_8_200_406_24
Alignment score (12.0) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_Adapters2_1 vs FC12044_91407_8_200_406_24
Alignment score (12.0) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_Adapters2_2 vs FC12044_91407_8_200_406_24
Alignment score (12.0) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_PCR_Primer_1 vs FC12044_91407_8_200_406_24
Alignment score (23.5) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_sequencing_primer vs FC12044_91407_8_200_406_24
Alignment score (12.5) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_Adapters1_1 vs FC12044_91407_8_200_406_24
Alignment score (27.0) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_Adapters1_2 vs FC12044_91407_8_200_406_24
Alignment score (12.0) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_Adapters2_1 vs FC12044_91407_8_200_406_24
Alignment score (12.0) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_Adapters2_2 vs FC12044_91407_8_200_406_24
Alignment score (12.0) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_PCR_Primer_1 vs FC12044_91407_8_200_406_24
Alignment score (27.5) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_sequencing_primer vs FC12044_91407_8_200_406_24
Alignment score (12.0) is less than minimum score(40.0) for sequences Illumina_Small_RNA_RT_Primer vs FC12044_91407_8_200_406_24
Alignment score (23.5) is less than minimum score(40.0) for sequences Illumina_Small_RNA_5p_Adapter vs FC12044_91407_8_200_406_24
Alignment score (13.0) is less than minimum score(40.0) for sequences Illumina_Small_RNA_3p_Adapter vs FC12044_91407_8_200_406_24
Alignment score (12.0) is less than minimum score(40.0) for sequences Illumina_Small_RNA_PCR_Primer_1 vs FC12044_91407_8_200_406_24
Alignment score (23.5) is less than minimum score(40.0) for sequences Illumina_Small_RNA_sequencing_primer vs FC12044_91407_8_200_406_24
Alignment score (17.5) is less than minimum score(40.0) for sequences Illumina_Genomici_DNA_Adapters1_1 vs FC12044_91407_8_200_720_610
Alignment score (31.5) is less than minimum score(40.0) for sequences Illumina_Genomic_DNA_Adapters1_2 vs FC12044_91407_8_200_720_610
Alignment score (31.5) is less than minimum score(40.0) for sequences Illumina_Genomic_DNA_PCR_Primers1_1 vs FC12044_91407_8_200_720_610
Alignment score (20.5) is less than minimum score(40.0) for sequences Illumina_Genomic_DNA_PCR_Primers1_2 vs FC12044_91407_8_200_720_610
Alignment score (31.5) is less than minimum score(40.0) for sequences Illumina_Genomic_DNA_sequencing_primer vs FC12044_91407_8_200_720_610
Alignment score (0.0) is less than minimum score(40.0) for sequences Illumina_Paired_End_DNA_Adapters1_1 vs FC12044_91407_8_200_720_610
Alignment score (31.5) is less than minimum score(40.0) for sequences Illumina_Paired_End_DNA_Adapters1_2 vs FC12044_91407_8_200_720_610
Alignment score (31.5) is less than minimum score(40.0) for sequences Illumina_Paired_End_DNA_PCR_Primers1_1 vs FC12044_91407_8_200_720_610
Alignment score (33.5) is less than minimum score(40.0) for sequences Illumina_Paired_End_DNA_PCR_Primers1_2 vs FC12044_91407_8_200_720_610
Alignment score (31.5) is less than minimum score(40.0) for sequences Illumina_Paired_End_DNA_sequencing_primer_1 vs FC12044_91407_8_200_720_610
Alignment score (33.5) is less than minimum score(40.0) for sequences Illumina_Paired_End_DNA_sequencing_primer_2 vs FC12044_91407_8_200_720_610
Alignment score (20.0) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_Adapters1_1 vs FC12044_91407_8_200_720_610
Alignment score (9.0) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_Adapters1_2 vs FC12044_91407_8_200_720_610
Alignment score (11.0) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_Adapters2_1 vs FC12044_91407_8_200_720_610
Alignment score (15.5) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_Adapters2_2 vs FC12044_91407_8_200_720_610
Alignment score (11.0) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_PCR_Primer_1 vs FC12044_91407_8_200_720_610
Alignment score (10.0) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_PCR_Primer_2 vs FC12044_91407_8_200_720_610
Alignment score (15.0) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_sequencing_primer vs FC12044_91407_8_200_720_610
Alignment score (20.0) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_Adapters1_1 vs FC12044_91407_8_200_720_610
Alignment score (9.5) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_Adapters1_2 vs FC12044_91407_8_200_720_610
Alignment score (7.0) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_Adapters2_1 vs FC12044_91407_8_200_720_610
Alignment score (15.5) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_Adapters2_2 vs FC12044_91407_8_200_720_610


  [Part of this file has been deleted for brevity]

Alignment score (13.0) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_Adapters1_1 vs FC12044_91407_8_200_675_16
Alignment score (17.0) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_Adapters1_2 vs FC12044_91407_8_200_675_16
Alignment score (13.5) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_Adapters2_1 vs FC12044_91407_8_200_675_16
Alignment score (11.0) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_Adapters2_2 vs FC12044_91407_8_200_675_16
Alignment score (13.5) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_PCR_Primer_1 vs FC12044_91407_8_200_675_16
Alignment score (17.5) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_PCR_Primer_2 vs FC12044_91407_8_200_675_16
Alignment score (22.0) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_sequencing_primer vs FC12044_91407_8_200_675_16
Alignment score (13.0) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_Adapters1_1 vs FC12044_91407_8_200_675_16
Alignment score (17.0) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_Adapters1_2 vs FC12044_91407_8_200_675_16
Alignment score (13.5) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_Adapters2_1 vs FC12044_91407_8_200_675_16
Alignment score (11.0) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_Adapters2_2 vs FC12044_91407_8_200_675_16
Alignment score (13.5) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_PCR_Primer_1 vs FC12044_91407_8_200_675_16
Alignment score (17.5) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_PCR_Primer_2 vs FC12044_91407_8_200_675_16
Alignment score (21.5) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_sequencing_primer vs FC12044_91407_8_200_675_16
Alignment score (13.5) is less than minimum score(40.0) for sequences Illumina_Small_RNA_RT_Primer vs FC12044_91407_8_200_675_16
Alignment score (15.0) is less than minimum score(40.0) for sequences Illumina_Small_RNA_5p_Adapter vs FC12044_91407_8_200_675_16
Alignment score (7.0) is less than minimum score(40.0) for sequences Illumina_Small_RNA_3p_Adapter vs FC12044_91407_8_200_675_16
Alignment score (13.5) is less than minimum score(40.0) for sequences Illumina_Small_RNA_PCR_Primer_1 vs FC12044_91407_8_200_675_16
Alignment score (17.5) is less than minimum score(40.0) for sequences Illumina_Small_RNA_PCR_Primer_2 vs FC12044_91407_8_200_675_16
Alignment score (22.0) is less than minimum score(40.0) for sequences Illumina_Small_RNA_sequencing_primer vs FC12044_91407_8_200_675_16
Alignment score (21.0) is less than minimum score(40.0) for sequences Illumina_Genomici_DNA_Adapters1_1 vs FC12044_91407_8_200_285_136
Alignment score (17.5) is less than minimum score(40.0) for sequences Illumina_Genomic_DNA_Adapters1_2 vs FC12044_91407_8_200_285_136
Alignment score (30.0) is less than minimum score(40.0) for sequences Illumina_Genomic_DNA_PCR_Primers1_1 vs FC12044_91407_8_200_285_136
Alignment score (16.5) is less than minimum score(40.0) for sequences Illumina_Genomic_DNA_PCR_Primers1_2 vs FC12044_91407_8_200_285_136
Alignment score (17.5) is less than minimum score(40.0) for sequences Illumina_Genomic_DNA_sequencing_primer vs FC12044_91407_8_200_285_136
Alignment score (7.0) is less than minimum score(40.0) for sequences Illumina_Paired_End_DNA_Adapters1_1 vs FC12044_91407_8_200_285_136
Alignment score (17.5) is less than minimum score(40.0) for sequences Illumina_Paired_End_DNA_Adapters1_2 vs FC12044_91407_8_200_285_136
Alignment score (30.0) is less than minimum score(40.0) for sequences Illumina_Paired_End_DNA_PCR_Primers1_1 vs FC12044_91407_8_200_285_136
Alignment score (21.0) is less than minimum score(40.0) for sequences Illumina_Paired_End_DNA_PCR_Primers1_2 vs FC12044_91407_8_200_285_136
Alignment score (17.5) is less than minimum score(40.0) for sequences Illumina_Paired_End_DNA_sequencing_primer_1 vs FC12044_91407_8_200_285_136
Alignment score (18.5) is less than minimum score(40.0) for sequences Illumina_Paired_End_DNA_sequencing_primer_2 vs FC12044_91407_8_200_285_136
Alignment score (27.5) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_Adapters1_1 vs FC12044_91407_8_200_285_136
Alignment score (13.5) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_Adapters1_2 vs FC12044_91407_8_200_285_136
Alignment score (6.0) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_Adapters2_1 vs FC12044_91407_8_200_285_136
Alignment score (17.5) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_Adapters2_2 vs FC12044_91407_8_200_285_136
Alignment score (6.0) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_PCR_Primer_1 vs FC12044_91407_8_200_285_136
Alignment score (12.0) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_PCR_Primer_2 vs FC12044_91407_8_200_285_136
Alignment score (17.5) is less than minimum score(40.0) for sequences Illumina_DpnII_Gex_sequencing_primer vs FC12044_91407_8_200_285_136
Alignment score (26.5) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_Adapters1_1 vs FC12044_91407_8_200_285_136
Alignment score (14.5) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_Adapters1_2 vs FC12044_91407_8_200_285_136
Alignment score (2.0) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_Adapters2_1 vs FC12044_91407_8_200_285_136
Alignment score (17.5) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_Adapters2_2 vs FC12044_91407_8_200_285_136
Alignment score (6.0) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_PCR_Primer_1 vs FC12044_91407_8_200_285_136
Alignment score (12.0) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_PCR_Primer_2 vs FC12044_91407_8_200_285_136
Alignment score (15.5) is less than minimum score(40.0) for sequences Illumina_NlaIII_Gex_sequencing_primer vs FC12044_91407_8_200_285_136
Alignment score (6.0) is less than minimum score(40.0) for sequences Illumina_Small_RNA_RT_Primer vs FC12044_91407_8_200_285_136
Alignment score (15.5) is less than minimum score(40.0) for sequences Illumina_Small_RNA_5p_Adapter vs FC12044_91407_8_200_285_136
Alignment score (17.5) is less than minimum score(40.0) for sequences Illumina_Small_RNA_3p_Adapter vs FC12044_91407_8_200_285_136
Alignment score (6.0) is less than minimum score(40.0) for sequences Illumina_Small_RNA_PCR_Primer_1 vs FC12044_91407_8_200_285_136
Alignment score (12.0) is less than minimum score(40.0) for sequences Illumina_Small_RNA_PCR_Primer_2 vs FC12044_91407_8_200_285_136
Alignment score (17.5) is less than minimum score(40.0) for sequences Illumina_Small_RNA_sequencing_primer vs FC12044_91407_8_200_285_136

Identity is the percentage of identical matches between the two sequences over the reported aligned region (including any gaps in the length).

Similarity is the percentage of matches between the two sequences over the reported aligned region (including any gaps in the length).
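
The two percentages can be illustrated with a minimal Python sketch. This is not needleall's own code: the function names, the '-' gap character and the toy score table are assumptions made for illustration, and counting a column as "similar" when its residue pair has a positive substitution score is likewise an assumption about how a match is judged.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of alignment columns holding identical residues.

    The denominator is the full alignment length, gaps included.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        return 0.0
    identical = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap
    )
    return 100.0 * identical / len(aligned_a)


def percent_similarity(aligned_a, aligned_b, scores, gap="-"):
    """Percentage of columns whose residue pair has a positive score.

    `scores` is a hypothetical dict mapping residue pairs to values,
    standing in for a substitution matrix such as EDNAFULL.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        return 0.0
    similar = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # a gap column never counts as a match
        # Look the pair up in either order; unknown pairs score 0.
        if scores.get((x, y), scores.get((y, x), 0)) > 0:
            similar += 1
    return 100.0 * similar / len(aligned_a)


# Toy score table (illustrative values only, not a real matrix).
toy_scores = {
    ("A", "A"): 5, ("C", "C"): 5, ("G", "G"): 5, ("T", "T"): 5,
    ("C", "G"): 1,
}

identity = percent_identity("ACGT-A", "ACCTGA")
similarity = percent_similarity("ACGT-A", "ACCTGA", toy_scores)
```

In this toy alignment four of the six columns are identical and one further column has a positive score, so similarity is higher than identity. Both values use the same denominator, the alignment length including gap columns.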

Data files

For protein sequences, EBLOSUM62 is used as the substitution matrix. For nucleotide sequences, EDNAFULL is used. Other matrices can be specified.

EMBOSS data files are distributed with the application and stored in the standard EMBOSS data directory, which is defined by the EMBOSS environment variable EMBOSS_DATA.

To see the available EMBOSS data files, run:

% embossdata -showall

To fetch one of the data files (for example 'Exxx.dat') into your current directory so that you can inspect or modify it, run:

% embossdata -fetch -file Exxx.dat

Users can provide their own data files in their own directories. Project-specific files can be put in the current directory or, for tidier directory listings, in a subdirectory called ".embossdata". Files for all EMBOSS runs can be put in the user's home directory or, again, in a subdirectory called ".embossdata".

The directories are searched in the following order:

. (your current directory)
.embossdata (under your current directory)
~/ (your home directory)
~/.embossdata
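
The search order above can be sketched in Python as follows. This is an illustration only, not the EMBOSS implementation: the function name find_data_file and its parameters are hypothetical, and the sketch covers only the four user directories listed here, not the standard EMBOSS data directory.

```python
from pathlib import Path


def find_data_file(name, cwd=None, home=None):
    """Return the first existing copy of `name`, or None if there is none.

    Directories are checked in the order listed in the text:
    current directory, ./.embossdata, home directory, ~/.embossdata.
    `cwd` and `home` default to the real locations and can be
    overridden (a hypothetical convenience for testing).
    """
    cwd = Path.cwd() if cwd is None else Path(cwd)
    home = Path.home() if home is None else Path(home)
    candidates = [
        cwd / name,
        cwd / ".embossdata" / name,
        home / name,
        home / ".embossdata" / name,
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None
```

Because the first hit wins, a project-specific copy in the current directory takes precedence over a copy of the same file kept in the home directory.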

Notes

needleall is a true implementation of the Needleman-Wunsch algorithm and so produces a full path matrix. It therefore cannot be used with genome-sized sequences unless you have a lot of memory and a lot of time.

References

Needleman, S. B. and Wunsch, C. D. (1970) J. Mol. Biol. 48, 443-453.
Kruskal, J. B. (1983) An overview of sequence comparison. In D. Sankoff and J. B. Kruskal (eds.), Time warps, string edits and macromolecules: the theory and practice of sequence comparison, pp. 1-44. Addison Wesley.

Warnings

needleall is for aligning pairs of sequences over their entire length. This works best with closely related sequences. If you use needleall to align very distantly-related sequences, it will produce a result but much of the alignment may have little or no biological significance.

A true Needleman-Wunsch implementation like needleall needs memory proportional to the product of the sequence lengths. For two sequences of length 10,000,000 and 1,000 it therefore needs memory proportional to 10,000,000,000 characters. Two arrays of this size are produced, one of ints and one of floats, so multiply that figure by 8 to get the memory usage in bytes. That does not include other overheads. Therefore, only use needleall (like water and needle) for accurate alignment of reasonably short sequences.
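
The arithmetic above can be checked with a short Python sketch. The function name path_matrix_bytes is hypothetical. The default of 8 bytes per cell follows the text (one int array plus one float array, assumed here to be 4 bytes per element each), and other overheads are ignored.

```python
def path_matrix_bytes(len_a, len_b, bytes_per_cell=8):
    """Rough lower bound on memory for a full path matrix, in bytes.

    One cell per pair of positions, with `bytes_per_cell` covering
    both arrays (assumed 4-byte int + 4-byte float). Overheads are
    not included.
    """
    if len_a < 0 or len_b < 0:
        raise ValueError("sequence lengths must be non-negative")
    return len_a * len_b * bytes_per_cell


# The example from the text: 10,000,000 x 1,000 residues.
estimate = path_matrix_bytes(10_000_000, 1_000)
print(f"{estimate:,} bytes (~{estimate / 1e9:.0f} GB)")
# prints "80,000,000,000 bytes (~80 GB)"
```

An estimate like this, made before a run, shows quickly whether a pair of sequences is too long for a full path matrix on the available machine.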

The first input sequence set is loaded completely into memory. When comparing large numbers (or lengths) of sequences, the smaller set should be the first input to make the most efficient use of memory.

If you run out of memory, try using stretcher instead.

Diagnostic Error Messages

Uncaught exception
 Assertion failed
 raised at ajmem.c:xxx

This probably means that you have run out of memory. Try using stretcher if this happens.

Exit status

0 upon successful completion.

Known bugs

None.

Author(s)

Mahmut Uludag
European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK

History

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.

Comments

None
