sTCW Reproduce

Reproduce sTCW overview

This page describes how to obtain the table results that correspond to the statistics in the overview. The following shorthand is used:

"Column:x" indicates that column x should be selected for viewing in the table.
#Seqs is the number of sequences, which is listed at the top of the overview.
"Stats" is the "Show Column Stats" on the "Table..." drop-down.

Always clear filters before setting new ones!

INPUT

Most of the INPUT section is data supplied by the user with runSingleTCW. The following are computed:

Counts:
SIZE	All Seqs	Column:Counts for all conditions; Stats, column:Sum
Sequences:
AVG-len	All Seqs	Column:Length; Stats, column:Average
MED-len	All Seqs	Column:Length; Stats, column:Median. The median in the overview and in the table may differ slightly because they are computed differently.
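
As an illustration of why two medians of the same lengths can disagree, the sketch below compares two common conventions from Python's standard library. How TCW computes either value is not stated here, so this is only an assumption about the kind of difference involved; the sample lengths are made up.

```python
from statistics import mean, median, median_high

# Hypothetical sequence lengths, e.g. copied from the Length column.
lengths = [200, 350, 500, 1200]

average = mean(lengths)

# Convention 1: with an even number of values, average the two middle values.
median_interpolated = median(lengths)

# Convention 2: with an even number of values, take the upper middle value.
median_upper = median_high(lengths)

print(average, median_interpolated, median_upper)
```

With an odd number of values both conventions return the same middle element, so any difference is small on large tables.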

ANNOTATION

Hit statistics:

Column	Search	Obtain number
Sequences with hits	Filters: Annotation: Annotated	Number of rows
Unique hits	AnnoDB Hits: Seq:None(slow)	Hits # above table
Total sequence hits	AnnoDB Hits: Seq:None(slow)	Pairs # above table
Bases covered by hit	AnnoDB Hits: Seq:Best Bits	Unselect "Group by Hit ID"; column:Align; Stats, column:Sum; for NT, multiply by 3
Total bases for NT-sTCWdbs and residues for AA-sTCWdbs	All Seqs	Column:Length; Stats, column:Sum
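
The "Bases covered by hit" row is a sum followed by a unit conversion. A minimal sketch, assuming the Align values have been copied or exported from the table as integers; the function name and the sample values are illustrative and not part of TCW:

```python
def bases_covered(align_lengths, is_nt):
    """Sum the Align column; for an NT-sTCWdb multiply the sum by 3."""
    total = sum(align_lengths)
    return total * 3 if is_nt else total

# Hypothetical Align values for two best-bits hits.
align = [100, 250]
print(bases_covered(align, is_nt=True))   # NT-sTCWdb
print(bases_covered(align, is_nt=False))  # AA-sTCWdb
```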

Annotation databases:

The first column is the DBtype-taxonomy of each annoDB (e.g. SP-plants: SP is SwissProt, plants is the taxonomy).

Column	Search	Obtain number
ONLY	Filters: Annotation: Annotated, Best Bits, Enter DBtype and Taxonomy for ANNODB, General <=1 annoDB	Number of rows
The following all use AnnoDB Hits panel with the correct ANNODB selected from the AnnoDBs panel.
BITS	Seq:Best Bits	Seqs # above table
ANNO	Seq:Best Anno	Seqs # above table
UNIQUE	Seq:None(slow)	Hits # above table
TOTAL	Seq:None(slow)	Pairs # above table
AVG %SIM	Seq:None(slow)	Unselect "Group by Hit ID"; column:%Sim; Stats, column:Average
Rank=1 is the best hit for a sequence for a given annoDB.
HAS HIT	Seq:Rank=1	Seqs # above table; percentage of total #Seqs
AVG %SIM	Seq:Rank=1	Unselect "Group by Hit ID"; column:%Sim; Stats, column:Average
Cover >=N	Seq:Rank=1,%Sim>=N,%HitCov^*>=N	Seqs # above table; percentage of HAS HIT

^*HitCov is the difference between the hit stop and start coordinates divided by the length of the protein.
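
The HitCov definition and the "Cover >=N" test can be written out as follows. This is a sketch under assumptions: the result is scaled to a percentage because the filter is labeled %HitCov, and the function and parameter names are hypothetical.

```python
def hit_cov_percent(hit_start, hit_stop, protein_length):
    """(hit stop - hit start) / protein length, scaled to percent (assumed)."""
    return 100.0 * (hit_stop - hit_start) / protein_length

def passes_cover(sim_percent, hit_cov, n):
    """True when both %Sim and %HitCov are at least N."""
    return sim_percent >= n and hit_cov >= n

# Hypothetical hit: aligned to positions 11..211 of a 400-residue protein.
cov = hit_cov_percent(11, 211, 400)
print(cov, passes_cover(62.0, cov, 50))
```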

Top 15 species from total: N

The N is the number of unique species based on the first two words of the species name. From "AnnoDB Hits":

Select "Species", select "Two words", enter the first two words of the species name next to "Find", select "Find", then select the entry on the left and add it to the right.
Select "Best Bits", "Best Anno" or "None" for the three numbers shown.
Select "BUILD TABLE".
Use the number listed beside "Pairs".
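
The count N of unique species, keyed on the first two words of the species name, can be sketched as below. The species strings are made-up examples, and matching here is case-sensitive, which may or may not agree with how TCW compares names.

```python
def two_word_species(name):
    """Return the first two whitespace-separated words of a species name."""
    return " ".join(name.split()[:2])

def count_unique_species(names):
    """Number of distinct two-word species keys."""
    return len({two_word_species(n) for n in names})

# Hypothetical species names as they might appear on hits.
names = [
    "Arabidopsis thaliana",
    "Arabidopsis thaliana (Mouse-ear cress)",
    "Oryza sativa subsp. japonica",
    "Oryza sativa subsp. indica",
]
print(count_unique_species(names))
```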

Gene ontology statistics:

The counts of GO terms include assigned obsolete terms (level=0).

Column	Search	Obtain number
Unique GOs	GO Annotation: no filters	Results number
Unique hits with GOs	AnnoDB Hits: Seq:None(slow); GO,etc:Has GO	Hits number at top of table
Sequences with GOs	Filters: Annotation: Annotated, Best with GO	Number of rows
Seq best hit has GOs	AnnoDB Hits: Seq:Best Bits; GO,etc:Has GO	Seqs number at top of table
biological_process	GO Annotation: Level: biological_process	Number GOs at top of table
molecular_function	GO Annotation: Level: molecular_function	Number GOs at top of table
cellular_component	GO Annotation: Level: cellular_component	Number GOs at top of table
is_a, part_of	GO Annotation: no filters	Export..., Each GO's parents with relations, grep (see footnote^*)

^* From a terminal, run 'grep is_a GOeachParents.tsv | wc'; the first number printed by wc is the line count. Repeat with is_a replaced with part_of.
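
For readers without grep, the same line count can be obtained with a few lines of Python. This is a sketch: the sample rows below are invented, and the real column layout of GOeachParents.tsv is not assumed, since the function only checks whether a line contains the term.

```python
def count_lines_containing(lines, term):
    """Count lines that contain term, like 'grep term file | wc -l'."""
    return sum(1 for line in lines if term in line)

# Hypothetical rows standing in for the exported file contents.
sample = [
    "GO:0000001\tis_a\tGO:0000002",
    "GO:0000001\tpart_of\tGO:0000003",
    "GO:0000004\tis_a\tGO:0000005",
]
print(count_lines_containing(sample, "is_a"))
print(count_lines_containing(sample, "part_of"))
```

To run it on the exported file, open the file and pass the file object as `lines`, once per term.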

EXPRESSION

The following sections may not exist if the input had no count files or the DE methods were not executed.

TPM:*	Filters: Counts and TPM: select Condition under Exclude; set "At Most" to 5000.	Continue using 1000, 100, 50, 5, 2 where the previous results are subtracted from the current. The results are for intervals >=N to <M.
Differential Expression:	Filters: Differential Expression: select the DE column, then enter the number, e.g. 1E-5.	These counts are cumulative.
GO enrichment:	GO Annotation: Enrich: enter the p-value threshold (e.g. 0.05); select green button next to "for" and select the p-value to filter on.	These counts are cumulative.

* If "RPKM" is at the top of the Overview instead of "TPM", then RPKM was computed instead of TPM.

SEQUENCES

If the sequences (e.g. ESTs) have been assembled, there are statistics on buried, mate-pairs, etc. Most of them are columns, so select the column, then use "Show Column Stats" to view the sum. The following covers only the three sections that exist for all NT-sTCWdbs.

Sequence lengths	Filters: General: Length: change >= to <=. Filter for each cutoff 100, 500, 1000, 2000, 3000, 4000, 5000.	The counts of the intervals to the left need to be subtracted from the number of rows.
ORF lengths	Filters: SNPs and ORFs: Has ORF >= N. Filter for each cutoff 5001, 4001, 3001, 2001, 1001, 501, 101.	The counts of the intervals to the right need to be subtracted from the number of rows.
Quality	Filters: General: Has Ns: Yes; Columns: General: Ns	The resulting rows are the #n>0; sort the rows in descending order on the Ns column, and scroll down to the row number that starts having #n>10.
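
For the Sequence lengths and ORF lengths rows, each filter returns a cumulative number of rows, and the per-interval count is the difference from the previous filter. A minimal sketch of that subtraction, assuming the row counts are listed in the order the cutoffs were applied (ascending for Sequence lengths, descending for ORF lengths); the numbers are invented:

```python
def interval_counts(cumulative_rows):
    """Turn cumulative row counts into per-interval counts.

    cumulative_rows[i] is the number of rows returned by the i-th filter,
    in the order the cutoffs were applied.
    """
    counts = []
    previous = 0
    for rows in cumulative_rows:
        counts.append(rows - previous)
        previous = rows
    return counts

# Hypothetical row counts for Length <= 100, <= 500, <= 1000.
rows_per_cutoff = [10, 60, 150]
print(interval_counts(rows_per_cutoff))
```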

ORF stats:

Column	Search	Obtain number
The following use the Basic Sequence. Select "TCW" and enter the Substring indicated.
Is Longest ORF	!Lg	#Seqs minus Result number
Markov Best Score	!Mk	#Seqs minus Result number
All of the above	$	Result number
ORF=Hit	ORF=Hit	Result number
ORF=Hit with Ends	ORF=Hit+	Result number
Multi-frame	Multi	Result number
Stops in Hit	Stop	Result number
>=9 Ns in ORF	9N	Result number
The following use the Filters section: SNPs and ORFs.
Has Hit	Protein confirmation	Number of rows
Both Ends	Has Ends (Start&Stop codons)	Number of rows
ORF>=300	Has ORF >= 300	Number of rows

GC content:

The only reproducible number is the GC Content, which is the %GC over the entire sequence.

GC content	Seq Table	Column:%GC; Stats, column:Average
Note: the number may differ slightly due to round-off error.

The "Pos i" column is the percent of G or C in the ith position of the CDS codons.
The %GC is the percent of G's and C's over the sequence length.
The CpG-O/E is the ratio observed/expected [(#CpG/(#G*#C))*length].
The UTRs can be viewed in the Sequence Detail alignments, but there is no column for it.
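
The three definitions above can be checked on a single sequence with a short script. This is a sketch written from the formulas as stated, not TCW's own code; the function names are hypothetical, ambiguous bases are not handled, and the CDS is assumed to start in frame.

```python
def gc_percent(seq):
    """Percent of G and C over the sequence length."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)

def cpg_obs_exp(seq):
    """CpG observed/expected: (#CpG / (#G * #C)) * length."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    if g == 0 or c == 0:
        return 0.0  # ratio is undefined; 0.0 is an arbitrary choice here
    return seq.count("CG") / (g * c) * len(seq)

def gc_by_codon_position(cds):
    """Percent G or C at codon positions 1, 2 and 3 of an in-frame CDS."""
    return [gc_percent(cds[i::3]) for i in range(3)]

print(gc_percent("ACGTCGGA"))
print(cpg_obs_exp("ACGTCGGA"))
print(gc_by_codon_position("ATGGCC"))
```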