Assembly CAP2
This mode of assembly uses the global assembly program CAP2, developed by Xiaoqiu Huang (Huang, X. An improved sequence assembly program. Genomics 33, 21-31 (1996)).
The CAP2 program can be run from the Gap4 interface through the "Assembly" menu or as a stand-alone program.
The CAP2 program, for use with Gap4, must be obtained via ftp from the author, Xiaoqiu Huang.
Email Xiaoqiu Huang (huang@cs.mtu.edu), stating that you want CAP2 for
use with Gap4 and naming the operating system for which you need the
program (one of: SunOS 4.1.1; Solaris 2.4; DEC OSF/1 V3.0 and Digital
Unix; Irix 5.3). He will then contact you to arrange retrieval of the
binary file, which is called cap2_s. Make this file executable (e.g.
chmod a+x cap2_s) and move it to the directory
$STADENROOT/$MACHINE-bin. The CAP2 options on the "Assembly" menu
should now be available.
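To check that the binary ended up in the directory named above, a minimal Python sketch such as the following can build the expected path from the STADENROOT and MACHINE environment variables and test whether the file is executable. The function names are illustrative and are not part of the Staden package; the MACHINE value used when trying it out is whatever your installation defines.

```python
import os


def cap2_install_path(env):
    """Return $STADENROOT/$MACHINE-bin/cap2_s for the given environment mapping."""
    bin_dir = os.path.join(env["STADENROOT"], env["MACHINE"] + "-bin")
    return os.path.join(bin_dir, "cap2_s")


def cap2_is_installed(env):
    """True if cap2_s exists at the expected location and is executable."""
    path = cap2_install_path(env)
    return os.path.isfile(path) and os.access(path, os.X_OK)


if __name__ == "__main__":
    # Only report when both variables are actually set in this shell.
    if "STADENROOT" in os.environ and "MACHINE" in os.environ:
        print(cap2_install_path(os.environ), cap2_is_installed(os.environ))
```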
- Perform CAP2 assembly
- Import CAP2 assembly data
- Perform and import CAP2 assembly
- Stand alone CAP2 assembly
Perform CAP2 assembly
The assembly works on either a file or list of reading names in experiment file format (see section Experiment File). CAP2 assembles the readings and the alignments are written to the output window. Irrespective of the original file format, new reading files are written in the destination directory in experiment file format. If the destination directory does not already exist, then it is created. These new files contain the additional information required to recreate the same assembly within Gap4. This is done by the addition of an AP line. See section Directed Assembly.
It is also possible to tell the program to identify chimeric fragments, report repeat structures and resolve them by setting the "Find repeats/chimerics" radiobutton to "Yes". If this is set to "No", these tasks are not performed. At the present time, CAP2 can only resolve direct repeats and not reverse repeats.
Import CAP2 assembly
This mode imports the aligned sequences produced by CAP2 assembly into Gap4 and maintains the same alignment. Importing the files requires the directory containing the newly aligned readings, i.e. the destination directory used in "Perform CAP2 assembly". Readings that are not entered are written to a "list" or "file" specified in the "Save failures" entry box. This mode is functionally equivalent to "Directed assembly". See section Directed Assembly.
Perform and import CAP2 assembly
This mode combines the two previous modes, "Perform CAP2 assembly" and "Import CAP2 assembly". The assembled readings are written to the destination directory and are then automatically imported from this directory into Gap4.
Stand alone CAP2 assembly
Alternatively, the program can be run as a stand-alone program with the following command line arguments:
cap2_s -{format} file_of_filenames [-r] [-out destination_directory]
{format} is the format of the input and is either experiment file format or fasta format. Legal values are exp, EXP, fasta or FASTA.
file_of_filenames is the name of the file containing the reading names to be assembled for experiment files or a single file of readings in fasta format.
destination_directory is the name of the directory to which the new experiment files are written. The default directory is "assemble".
-r is optional and is equivalent to the "Find repeats/chimerics" option above.
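The argument rules above can be sketched as a small Python helper that assembles the command line and rejects an unknown format. This is only an illustration of the documented syntax: the helper name and the file name reads.fasta are hypothetical, and the sketch does not run cap2_s itself.

```python
VALID_FORMATS = {"exp", "EXP", "fasta", "FASTA"}


def build_cap2_command(fmt, input_file, find_repeats=False, out_dir=None):
    """Build: cap2_s -{format} file_of_filenames [-r] [-out destination_directory]."""
    if fmt not in VALID_FORMATS:
        raise ValueError("format must be one of exp, EXP, fasta or FASTA")
    cmd = ["cap2_s", "-" + fmt, input_file]
    if find_repeats:
        # Equivalent to setting "Find repeats/chimerics" to "Yes".
        cmd.append("-r")
    if out_dir is not None:
        # When omitted, cap2_s uses its default directory, "assemble".
        cmd.extend(["-out", out_dir])
    return cmd


if __name__ == "__main__":
    print(" ".join(build_cap2_command("fasta", "reads.fasta", True, "assemble")))
```

The returned list can be passed directly to subprocess.run once cap2_s is on the search path.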
Further details about CAP2
The comments provided with CAP2 by Huang are detailed below.
copyright (c) 1995-96 Xiaoqiu Huang and Michigan Technological University
No part of this program may be distributed without prior written
permission of the author.
Xiaoqiu Huang
Department of Computer Science
Michigan Technological University
Houghton, MI 49931
E-mail: huang@cs.mtu.edu
Proper attribution of the author as the source of the software would
be appreciated:
Huang, X. (1996)
An Improved Sequence Assembly Program
Genomics, 33:21-31.
The CAP2 program assembles short DNA fragments into long sequences.
CAP2 contains a number of improvements to the original version
described in Genomics 14, pages 18-25, 1992. These improvements are:
o Use of a more efficient filter for quickly detecting pairs of
fragments that could not overlap.
o Accurate evaluation of overlap strengths through the use
of internally generated fragment-specific confidence vectors.
o Identification of fragments from repetitive sequences and
resolution of ambiguities in assembly of those fragments.
o Identification of chimeric fragments.
o Automated refinement of poorly aligned regions of fragment
alignments
A chimeric fragment is made of two short pieces from non-adjacent
regions of the DNA molecule. CAP2 may report a repeat structure like:
F1 5' flanking
F2 5' flanking
I1 Internal
I2 Internal
I3 Internal
T1 3' flanking
T2 3' flanking
where F1, F2, I1, I2, I3, T1 and T2 are fragment names. The
structure means that I1, I2 and I3 are from two copies of
a repetitive element, F1 and F2 flank the two copies at their
5' end, T1 and T2 flank them at their 3' end.
CAP2 produces the two copies in the final sequence by
resolving the ambiguities in the repeat structure.
CAP2 is efficient in computer memory: a large number of DNA
fragments can be assembled. The time requirement is acceptable;
for example, CAP2 took 1.5 hours to assemble 829 fragments of a total
of 393 kb nucleotides into a single contig on a Sun SPARC 5.
The program is written in C and runs on Sun workstations.
The CAP2 program can be run with the -r option. If this option
is specified, then the program identifies chimeric fragments,
reports repeat structures and resolves them.
Otherwise, these tasks are not performed.
Large integer values should be used for MATCH, MISMAT, EXTEND.
The comments given above are for CAP2. Written on Feb. 11, 95.
Acknowledgements
Kathryn Beal found a bug in the Filter procedure.
The array elen was not always initialized.
Below is a description of the parameters in the #define section of CAP.
Two specially chosen sets of substitution scores and indel penalties
are used by the dynamic programming algorithm: heavy set for regions
of low sequencing error rates and light set for fragment ends of high
sequencing error rates. (Use integers only.)
Heavy set:        Light set:
MATCH  =  2       MATCH   =  2
MISMAT = -6       LTMISM  = -3
EXTEND =  4       LTEXTEN =  2
In the initial assembly, any overlap must be of length at least OVERLEN,
and any overlap/containment must be of identity percentage at least
PERCENT. After the initial assembly, the program attempts to join
contigs together using weak overlaps. Two contigs are merged if the
score of the overlapping alignment is at least CUTOFF. The value for
CUTOFF is chosen according to the value for MATCH.
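The acceptance rules in this paragraph can be sketched in Python as follows. This is an interpretation, not CAP2's code: it assumes identity is measured as matching columns over aligned columns, it reads the text as applying the OVERLEN length test to overlaps only and the PERCENT test to both overlaps and containments, and the numeric values used when calling it are hypothetical because the text does not state them.

```python
def accept_in_initial_assembly(aligned_len, matches, overlen, percent,
                               containment=False):
    """Decide whether an overlap or containment is usable in the initial assembly."""
    if aligned_len <= 0:
        return False
    identity = 100.0 * matches / aligned_len
    if identity < percent:
        return False
    # Assumption: the minimum length applies to overlaps, not containments.
    return containment or aligned_len >= overlen


def should_merge_contigs(alignment_score, cutoff):
    """After the initial assembly, merge two contigs if the score reaches CUTOFF."""
    return alignment_score >= cutoff


if __name__ == "__main__":
    # Hypothetical thresholds, chosen only to exercise the functions.
    print(accept_in_initial_assembly(40, 38, overlen=20, percent=75))
    print(should_merge_contigs(30, cutoff=25))
```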
POS5 and POS3 are fragment positions such that the 5' end between base 1
and base POS5, and the 3' end after base POS3 are of high sequencing
error rates, say more than 5%. For mismatches and indels occurring in
the two ends, light penalties are used.
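As a minimal sketch of how the two score sets relate to POS5 and POS3, the Python below picks the light set for positions in either fragment end and the heavy set elsewhere. The score values are those listed above; the POS5 and POS3 values in the example are hypothetical, and how CAP2 combines the positions of two aligned fragments is not described here, so the sketch looks at a single fragment position only.

```python
HEAVY = {"match": 2, "mismatch": -6, "extend": 4}   # MATCH, MISMAT, EXTEND
LIGHT = {"match": 2, "mismatch": -3, "extend": 2}   # MATCH, LTMISM, LTEXTEN


def in_low_quality_end(pos, pos5, pos3):
    """True for 1-based positions 1..pos5 (5' end) or after pos3 (3' end)."""
    return pos <= pos5 or pos > pos3


def scores_at(pos, pos5, pos3):
    """Return the light set in the fragment ends and the heavy set elsewhere."""
    return LIGHT if in_low_quality_end(pos, pos5, pos3) else HEAVY


if __name__ == "__main__":
    # Hypothetical end positions for a fragment of a few hundred bases.
    for position in (1, 30, 31, 400, 401):
        print(position, scores_at(position, pos5=30, pos3=400))
```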
Acknowledgments
The function diff() of Gene Myers is modified and used here.
A file of input fragments looks like:
>G019uabh
ATACATCATAACACTACTTCCTACCCATAAGCTCCTTTTAACTTGTTAAA
GTCTTGCTTGAATTAAAGACTTGTTTAAACACAAAAATTTAGAGTTTTAC
TCAACAAAAGTGATTGATTGATTGATTGATTGATTGATGGTTTACAGTAG
GACTTCATTCTAGTCATTATAGCTGCTGGCAGTATAACTGGCCAGCCTTT
AATACATTGCTGCTTAGAGTCAAAGCATGTACTTAGAGTTGGTATGATTT
ATCTTTTTGGTCTTCTATAGCCTCCTTCCCCATCCCCATCAGTCTTAATC
AGTCTTGTTACGTTATGACTAATCTTTGGGGATTGTGCAGAATGTTATTT
TAGATAAGCAAAACGAGCAAAATGGGGAGTTACTTATATTTCTTTAAAGC
>G028uaah
CATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTGAATTAAAGACTTGTT
TAAACACAAAATTTAGACTTTTACTCAACAAAAGTGATTGATTGATTGAT
TGATTGATTGATGGTTTACAGTAGGACTTCATTCTAGTCATTATAGCTGC
TGGCAGTATAACTGGCCAGCCTTTAATACATTGCTGCTTAGAGTCAAAGC
ATGTACTTAGAGTTGGTATGATTTATCTTTTTGGTCTTCTATAGCCTCCT
TCCCCATCCCATCAGTCT
>G022uabh
TATTTTAGAGACCCAAGTTTTTGACCTTTTCCATGTTTACATCAATCCTG
TAGGTGATTGGGCAGCCATTTAAGTATTATTATAGACATTTTCACTATCC
CATTAAAACCCTTTATGCCCATACATCATAACACTACTTCCTACCCATAA
GCTCCTTTTAACTTGTTAAAGTCTTGCTTGAATTAAAGACTTGTTTAAAC
ACAAAATTTAGACTTTTACTCAACAAAAGTGATTGATTGATTGATTGATT
GATTGAT
>G023uabh
AATAAATACCAAAAAAATAGTATATCTACATAGAATTTCACATAAAATAA
ACTGTTTTCTATGTGAAAATTAACCTAAAAATATGCTTTGCTTATGTTTA
AGATGTCATGCTTTTTATCAGTTGAGGAGTTCAGCTTAATAATCCTCTAC
GATCTTAAACAAATAGGAAAAAAACTAAAAGTAGAAAATGGAAATAAAAT
GTCAAAGCATTTCTACCACTCAGAATTGATCTTATAACATGAAATGCTTT
TTAAAAGAAAATATTAAAGTTAAACTCCCCTATTTTGCTCGTTTTTGCTT
ATCTAAAATACATTCTGCACAATCCCCAAAGATTGATCATACGTTAC
>G006uaah
ACATAAAATAAACTGTTTTCTATGTGAAAATTAACCTANNATATGCTTTG
CTTATGTTTAAGATGTCATGCTTTTTATCAGTTGAGGAGTTCAGCTTAAT
AATCCTCTAAGATCTTAAACAAATAGGAAAAAAACTAAAAGTAGAAAATG
GAAATAAAATGTCAAAGCATTTCTACCACTCAGAATTGATCTTATAACAT
GAAATGCTTTTTAAAAGAAAATATTAAAGTTAAACTCCCC
A string after ">" is the name of the following fragment.
Only the five upper-case letters A, C, G, T and N are allowed
to appear in fragment data. No other characters are allowed.
A common mistake is the use of lower case letters in a fragment.
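Since stray characters are described as a common mistake, it may help to check a fragment file before running the program. The following Python sketch reads ">name" records and reports any character other than A, C, G, T and N; the function names are illustrative and the check is independent of CAP2 itself.

```python
ALLOWED = set("ACGTN")


def parse_fragments(text):
    """Parse '>name' headers followed by sequence lines into {name: sequence}."""
    fragments = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            fragments[name] = ""
        elif name is not None:
            # Drop any whitespace inside the line before appending.
            fragments[name] += "".join(line.split())
    return fragments


def invalid_characters(sequence):
    """Return the sorted list of characters that are not A, C, G, T or N."""
    return sorted(set(sequence) - ALLOWED)


if __name__ == "__main__":
    sample = ">good\nACGTN\nACGT\n>bad\nacgtX\n"
    for frag_name, seq in parse_fragments(sample).items():
        print(frag_name, invalid_characters(seq))
```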
To run the program, type a command of form
cap2 file_of_filenames [-r]
The output goes to the terminal screen. So redirection of the
output into a file is necessary. The output consists of three parts:
overview of contigs at fragment level, detailed display of contigs
at nucleotide level, and consensus sequences.
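Because the output goes to the terminal, a script that drives the program has to capture it. The Python sketch below writes a command's standard output to a file; since cap2 may not be installed, the demo uses the Python interpreter as a stand-in command, and the function names and the output file name are hypothetical.

```python
import os
import subprocess
import sys
import tempfile


def run_and_save(cmd, output_path):
    """Run cmd, writing its standard output to output_path; return the exit status."""
    with open(output_path, "w") as out:
        completed = subprocess.run(cmd, stdout=out)
    return completed.returncode


def demo():
    """Exercise run_and_save with the Python interpreter standing in for cap2."""
    stand_in = [sys.executable, "-c", "print('CONSENSUS SEQUENCES')"]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "cap2.out")
        status = run_and_save(stand_in, path)
        with open(path) as handle:
            return status, handle.read().strip()


if __name__ == "__main__":
    print(demo())
```

For a real run, the command list would follow the form given above, for example ["cap2", "file_of_filenames", "-r"].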
The output of CAP on the sample input data looks like:
'+' = direct orientation; '-' = reverse complement
OVERLAPS CONTAINMENTS
******************* Contig 1 ********************
G022uabh+
G019uabh+
G028uaah+ is in G019uabh+
G023uabh-
G006uaah- is in G023uabh-
DETAILED DISPLAY OF CONTIGS
******************* Contig 1 ********************
. : . : . : . : . : . :
G022uabh+ TATTTTAGAGACCCAAGTTTTTGACCTTTTCCATGTTTACATCAATCCTGTAGGTGATTG
____________________________________________________________
consensus TATTTTAGAGACCCAAGTTTTTGACCTTTTCCATGTTTACATCAATCCTGTAGGTGATTG
. : . : . : . : . : . :
G022uabh+ GGCAGCCATTTAAGTATTATTATAGACATTTTCACTATCCCATTAAAACCCTTTATGCCC
____________________________________________________________
consensus GGCAGCCATTTAAGTATTATTATAGACATTTTCACTATCCCATTAAAACCCTTTATGCCC
. : . : . : . : . : . :
G022uabh+ ATACATCATAACACTACTTCCTACCCATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTG
G019uabh+ ATACATCATAACACTACTTCCTACCCATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTG
G028uaah+                          CATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTG
____________________________________________________________
consensus ATACATCATAACACTACTTCCTACCCATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTG
. : . : . : . : . : . :
G022uabh+ AATTAAAGACTTGTTTAAACACAAAA-TTTAGACTTTTACTCAACAAAAGTGATTGATTG
G019uabh+ AATTAAAGACTTGTTTAAACACAAAAATTTAGAGTTTTACTCAACAAAAGTGATTGATTG
G028uaah+ AATTAAAGACTTGTTTAAACACAAAA-TTTAGACTTTTACTCAACAAAAGTGATTGATTG
____________________________________________________________
consensus AATTAAAGACTTGTTTAAACACAAAA-TTTAGACTTTTACTCAACAAAAGTGATTGATTG
. : . : . : . : . : . :
G022uabh+ ATTGATTGATTGATTGAT
G019uabh+ ATTGATTGATTGATTGATGGTTTACAGTAGGACTTCATTCTAGTCATTATAGCTGCTGGC
G028uaah+ ATTGATTGATTGATTGATGGTTTACAGTAGGACTTCATTCTAGTCATTATAGCTGCTGGC
____________________________________________________________
consensus ATTGATTGATTGATTGATGGTTTACAGTAGGACTTCATTCTAGTCATTATAGCTGCTGGC
. : . : . : . : . : . :
G019uabh+ AGTATAACTGGCCAGCCTTTAATACATTGCTGCTTAGAGTCAAAGCATGTACTTAGAGTT
G028uaah+ AGTATAACTGGCCAGCCTTTAATACATTGCTGCTTAGAGTCAAAGCATGTACTTAGAGTT
____________________________________________________________
consensus AGTATAACTGGCCAGCCTTTAATACATTGCTGCTTAGAGTCAAAGCATGTACTTAGAGTT
. : . : . : . : . : . :
G019uabh+ GGTATGATTTATCTTTTTGGTCTTCTATAGCCTCCTTCCCCATCCCCATCAGTCTTAATC
G028uaah+ GGTATGATTTATCTTTTTGGTCTTCTATAGCCTCCTTCCCCATCCC-ATCAGTCT
____________________________________________________________
consensus GGTATGATTTATCTTTTTGGTCTTCTATAGCCTCCTTCCCCATCCCCATCAGTCTTAATC
. : . : . : . : . : . :
G019uabh+ AGTCTTGTTACGTTATGACT-AATCTTTGGGGATTGTGCAGAATGTTATTTTAGATAAGC
G023uabh-       GTAACGT-ATGA-TCAATCTTTGGGGATTGTGCAGAATGT-ATTTTAGATAAGC
____________________________________________________________
consensus AGTCTTGTAACGTTATGACTCAATCTTTGGGGATTGTGCAGAATGTTATTTTAGATAAGC
. : . : . : . : . : . :
G019uabh+ AAAA-CGAGCAAAAT-GGGGAGTT-A-CTT-A-TATTT-CTTT-AAA--GC
G023uabh- AAAAACGAGCAAAATAGGGGAGTTTAACTTTAATATTTTCTTTTAAAAAGCATTTCATGT
G006uaah-                 GGGGAGTTTAACTTTAATATTTTCTTTTAAAAAGCATTTCATGT
____________________________________________________________
consensus AAAAACGAGCAAAATAGGGGAGTTTAACTTTAATATTTTCTTTTAAAAAGCATTTCATGT
. : . : . : . : . : . :
G023uabh- TATAAGATCAATTCTGAGTGGTAGAAATGCTTTGACATTTTATTTCCATTTTCTACTTTT
G006uaah- TATAAGATCAATTCTGAGTGGTAGAAATGCTTTGACATTTTATTTCCATTTTCTACTTTT
____________________________________________________________
consensus TATAAGATCAATTCTGAGTGGTAGAAATGCTTTGACATTTTATTTCCATTTTCTACTTTT
. : . : . : . : . : . :
G023uabh- AGTTTTTTTCCTATTTGTTTAAGATCGTAGAGGATTATTAAGCTGAACTCCTCAACTGAT
G006uaah- AGTTTTTTTCCTATTTGTTTAAGATCTTAGAGGATTATTAAGCTGAACTCCTCAACTGAT
____________________________________________________________
consensus AGTTTTTTTCCTATTTGTTTAAGATCGTAGAGGATTATTAAGCTGAACTCCTCAACTGAT
. : . : . : . : . : . :
G023uabh- AAAAAGCATGACATCTTAAACATAAGCAAAGCATATTTTTAGGTTAATTTTCACATAGAA
G006uaah- AAAAAGCATGACATCTTAAACATAAGCAAAGCATATNNT-AGGTTAATTTTCACATAGAA
____________________________________________________________
consensus AAAAAGCATGACATCTTAAACATAAGCAAAGCATATTTTTAGGTTAATTTTCACATAGAA
. : . : . : . : . : . :
G023uabh- AACAGTTTATTTTATGTGAAATTCTATGTAGATATACTATTTTTTTGGTATTTATT
G006uaah- AACAGTTTATTTTATGT
____________________________________________________________
consensus AACAGTTTATTTTATGTGAAATTCTATGTAGATATACTATTTTTTTGGTATTTATT
CONSENSUS SEQUENCES
>Contig 1
TATTTTAGAGACCCAAGTTTTTGACCTTTTCCATGTTTACATCAATCCTGTAGGTGATTG
GGCAGCCATTTAAGTATTATTATAGACATTTTCACTATCCCATTAAAACCCTTTATGCCC
ATACATCATAACACTACTTCCTACCCATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTG
AATTAAAGACTTGTTTAAACACAAAATTTAGACTTTTACTCAACAAAAGTGATTGATTG
ATTGATTGATTGATTGATGGTTTACAGTAGGACTTCATTCTAGTCATTATAGCTGCTGGC
AGTATAACTGGCCAGCCTTTAATACATTGCTGCTTAGAGTCAAAGCATGTACTTAGAGTT
GGTATGATTTATCTTTTTGGTCTTCTATAGCCTCCTTCCCCATCCCCATCAGTCTTAATC
AGTCTTGTAACGTTATGACTCAATCTTTGGGGATTGTGCAGAATGTTATTTTAGATAAGC
AAAAACGAGCAAAATAGGGGAGTTTAACTTTAATATTTTCTTTTAAAAAGCATTTCATGT
TATAAGATCAATTCTGAGTGGTAGAAATGCTTTGACATTTTATTTCCATTTTCTACTTTT
AGTTTTTTTCCTATTTGTTTAAGATCGTAGAGGATTATTAAGCTGAACTCCTCAACTGAT
AAAAAGCATGACATCTTAAACATAAGCAAAGCATATTTTTAGGTTAATTTTCACATAGAA
AACAGTTTATTTTATGTGAAATTCTATGTAGATATACTATTTTTTTGGTATTTATT
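The fragment-level overview at the start of this output is regular enough to parse. As a minimal sketch, assuming each overview line has the form "name+" or "name-", optionally followed by "is in" and a second name with its orientation, the Python below extracts the fragment name, its orientation and any containing fragment. The function name and the returned dictionary layout are illustrative.

```python
import re

# name, orientation, and optionally "is in" followed by the containing fragment
OVERVIEW_LINE = re.compile(r"^(\S+?)([+-])(?:\s+is in\s+(\S+?)([+-]))?$")


def parse_overview(lines):
    """Return one dict per fragment line; other lines are ignored."""
    entries = []
    for raw in lines:
        match = OVERVIEW_LINE.match(raw.strip())
        if match:
            name, strand, parent, _parent_strand = match.groups()
            entries.append({
                "name": name,
                "reverse": strand == "-",   # '-' = reverse complement
                "contained_in": parent,     # None for a plain overlap
            })
    return entries


if __name__ == "__main__":
    overview = [
        "******************* Contig 1 ********************",
        "G022uabh+",
        "G028uaah+ is in G019uabh+",
        "G023uabh-",
    ]
    for entry in parse_overview(overview):
        print(entry)
```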
This page is maintained by staden-package. Last generated on 25 April 2003.
URL: http://www.mrc-lmb.cam.ac.uk/pubseq/manual/gap4_unix_87.html