This document describes the TCW assembly process. Note that TCW cannot assemble raw RNA-seq reads. Rather, assembly in TCW serves primarily the following purposes:
Assembly of the Demo Project

This demo uses the project demoAsm. The assembly process uses blast and cap3 (see Installation). IMPORTANT NOTE: The Blast path MUST be defined in HOSTS.cfg using the blast_path parameter; see Install.
Final assembly summary

Below is the summary (the numbers can differ slightly depending on the machine, the number of CPUs, and the blast version).
>>>Assembly Statistics 27-Mar-16 19:31:29
DATASET #SEQS #SINGLETONS #BURIED
Illumina 112 78 (69%) 13 (11%)
Sanger 98 7 (7%) 1 (1%)
Total reads: 210
Total buried: 14 Initial buries: 14 Buried during assembly: 0
Contig sizes (#reads)
Counts =2 3-5 6-10 11-20 21-50 51-100 101-1000 >1000
#Contigs 6 12 4 1 1 0 0 0
Contig lengths (bp)
Length 1-100 101-500 501-1000 1001-2000 2001-3000 3001-4000 4001-5000 >5000
#Contigs 0 60 19 21 2 1 0 1
Total contigs: 104
Contigs(>1 seq): 24
Single mate-pair: 5
Singletons: 80
Finished in 1m:40s
Transcript with counts and EST libraries

When assembling transcripts with counts and EST libraries, a resulting contig with one or more transcript sequences and one or more ESTs will add the transcript counts and add the aligned EST.

Choosing Read Names

Using consistent and well-chosen read names makes data analysis much easier, and is essential for some aspects of TCW.

The name of the read is the string immediately following the ">" in the fasta file. For example, if your fasta file contains the lines

>ZM_BFa0001A01.f
AAGATCCGCCTCATTCACACCCCCATCTACCTAGCTAGCTAGTTTACCAAAAAAAAATCTGGCCACA
GGGATGCGGTGGCGGCTGCAGCCGGCGCCGGCGCCGACGCTGCTCCTCGTCCTGCTGGTG
>ZM_BFa0001A01.r
AAAAAGCAAAATACAAACCAAGCTCCAGTTCCAATACATTACTCTAGCACAAGCTTTCAG
CACATTACAAAGTAGGAACCAAGACCACCCAAGCTCCAATCACACTACAATTCATCACCA

then the two read names are ZM_BFa0001A01.f and ZM_BFa0001A01.r.

Naming guidelines:
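The read-name rule described above (the name is the string immediately following the ">") can be sketched in Python. This is only a minimal illustration, not TCW's own parser: the function name read_names is hypothetical, and cutting the name at the first whitespace is an assumption based on common fasta practice rather than something stated here.

```python
def read_names(fasta_text):
    """Collect the read name from every header line of a fasta string."""
    names = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # The name is the string immediately following ">".
            # Assumption: anything after the first whitespace is a
            # description and is not part of the name.
            fields = line[1:].split()
            if fields:
                names.append(fields[0])
    return names


# The two headers from the example above, each with its first sequence line.
example = """>ZM_BFa0001A01.f
AAGATCCGCCTCATTCACACCCCCATCTACCTAGCTAGCTAGTTTACCAAAAAAAAATCTGGCCACA
>ZM_BFa0001A01.r
AAAAAGCAAAATACAAACCAAGCTCCAGTTCCAATACATTACTCTAGCACAAGCTTTCAG
"""

names = read_names(example)
print(names)
```

Running a check like this over a fasta file before loading it is one way to spot duplicate or inconsistent read names early.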
The Assembly Process

The following are the main stages of TCW assembly, organized by the headings that they print to the screen. The sample durations are for a 700k read assembly using six 2.4 GHz CPUs and the default settings.
2 In a TC (transitive closure), each contig must have an overlap with at least one other contig in the TC.

Assembly Parameters

Note, the default parameters have been extensively tested and you will probably not want to change them. Most of the parameters for assembly are available for change through the

Calculation of SNPs and extras

The parameters described in this section can be set by editing the LIB.cfg file.

A SNP is possible when one or more reads have a different base at some location than is found in the consensus. However, base-calling errors can lead to many false positives, so TCW applies two screens to the possible SNPs. First, at least two reads must contain the SNP (you can change this with the SNP_CONFIRM parameter). Second, a probability score is applied. The probability ('p-value') is computed using a binomial score based on the number of confirming reads, the depth of the contig at that base, and the estimated basecall error rate. The error rate is estimated from mismatches seen in the clique assembly, or it can be set using BASECALL_ERROR_RATE. The p-value threshold can also be set using SNP_SCORE.

When some reads contain extra bases that are not in the consensus sequence generated by cap3, TCW uses another probability score to determine whether to regard the extras as "real" and add a pad character (*) to the consensus. The score is computed in the same way as for SNPs, and uses the config parameters EXTRA_CONFIRM, EXTRA_RATE, EXTRA_SCORE. Extras not determined to be real are stored in the database and shown in the UI.

Troubleshooting
For assembly, the database must support InnoDB tables

TCW checks this using the "show engines" command in MySQL. If the InnoDB engine is not listed as supported, this error is shown; however, you can still perform all TCW functions except for assembly.

The most common cause of this problem is a mismatch in the InnoDB log file size. The MySQL error log will contain messages like

InnoDB: Error: log file ./ib_logfile0 is of different size 0 5242880 bytes
InnoDB: than specified in the .cnf file 0 104857600 bytes!

The solution is to delete this log file and restart MySQL.

Doing an assembly, the database is very slow

The default parameters of MySQL are not suitable for large high-performance databases. In particular, innodb_buffer_pool_size must be increased. 100M is sufficient for one large project, but for many large projects it should be 1G at minimum. For more, see http://dev.mysql.com/doc/refman/5.0/en/innodb-buffer-pool.html. Note, this only affects usage during an assembly, when InnoDB tables are used.
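To check the InnoDB requirement by hand, the output of "show engines" can be interpreted along the following lines. This is a minimal Python sketch, not TCW's actual check: the function name innodb_supported and the sample rows are hypothetical, and treating the Support values YES and DEFAULT as "supported" is an assumption. The rows are assumed to arrive as tuples whose first two fields are the Engine and Support columns.

```python
def innodb_supported(engine_rows):
    """Return True if the InnoDB row reports Support of YES or DEFAULT.

    engine_rows: iterable of tuples whose first two fields are the
    Engine and Support columns of MySQL's SHOW ENGINES output.
    """
    for row in engine_rows:
        engine, support = row[0], row[1]
        if engine.lower() == "innodb":
            # Assumption: YES and DEFAULT both mean the engine is usable;
            # NO and DISABLED mean it is not.
            return support.upper() in ("YES", "DEFAULT")
    # InnoDB is not listed at all.
    return False


# Hypothetical rows, for illustration only.
working = [("MyISAM", "YES"), ("InnoDB", "DEFAULT"), ("MEMORY", "YES")]
broken = [("MyISAM", "DEFAULT"), ("InnoDB", "NO"), ("MEMORY", "YES")]

print(innodb_supported(working))
print(innodb_supported(broken))
```

If the check fails on a server that should support InnoDB, the log file size mismatch described above is the first thing to look for in the MySQL error log.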
Email Comments To: tcw@agcol.arizona.edu