Reproduce sTCW overview
This describes how to obtain, from the results tables, the numbers that correspond to the statistics
in the overview. The following shorthand is used:
- The "Column:x" indicates that x should be selected for viewing in the table.
- #Seqs is the number of sequences, which is listed at the top of the overview.
- "Stats" is the "Show Column Stats" on the "Table..." drop-down.
Always clear filters before setting new ones!
INPUT
Most of the INPUT section is data supplied by the user with runSingleTCW. The following
are computed:
Counts:
SIZE | All Seqs | Column:Counts for all conditions; Stats, column:Sum
Sequences:
AVG-len | All Seqs | Column:Length; Stats, column:Average
MED-len | All Seqs | Column:Length; Stats, column:Median
The median shown in the overview and the one from "Show Column Stats" may differ slightly because they are computed differently.
ANNOTATION
Hit statistics:
Column | Search | Obtain number
---|---|---
Sequences with hits | Filters: Annotation: Annotated | Number of rows
Unique hits | AnnoDB Hits: Seq:None(slow) | Hits # above table
Total sequence hits | AnnoDB Hits: Seq:None(slow) | Pairs # above table
Bases covered by hit | AnnoDB Hits: Seq:Best Bits | Unselect "Group by Hit ID"; column:Align; Stats, column:Sum; for NT, multiply by 3
Total bases for NT-sTCWdbs and residues for AA-sTCWdbs | All Seqs | Column:Length; Stats, column:Sum
Annotation databases:
The first column is the DBtype-taxonomy of each annoDB (e.g. SP-plants: SP is SwissProt, plants is the taxonomy).
Column | Search | Obtain number
---|---|---
ONLY | Filters: Annotation: Annotated, Best Bits; enter DBtype and Taxonomy for ANNODB; General <=1 annoDB | Number of rows
The following all use the AnnoDB Hits panel with the correct ANNODB selected from the AnnoDBs panel. | |
BITS | Seq:Best Bits | Seqs # above table
ANNO | Seq:Best Anno | Seqs # above table
UNIQUE | Seq:None(slow) | Hits # above table
TOTAL | Seq:None(slow) | Pairs # above table
AVG %SIM | Seq:None(slow) | Unselect "Group by Hit ID"; column:%Sim; Stats, column:Average
Rank=1 is the best hit for a sequence for a given annoDB. | |
HAS HIT | Seq:Rank=1 | Seqs # above table; percentage of total #Seqs
AVG %SIM | Seq:Rank=1 | Unselect "Group by Hit ID"; column:%Sim; Stats, column:Average
Cover >=N | Seq:Rank=1,%Sim>=N,%HitCov*>=N | Seqs # above table; percentage of HAS HIT
*HitCov is the difference between the hit stop and start coordinates divided by the length of the protein.
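The HitCov footnote can be read as the small calculation below. This is a minimal sketch, not TCW's own code: the function name `hit_coverage_percent` and its parameters are hypothetical, and scaling the ratio to a percentage (so it can be compared with a `%HitCov>=N` cutoff) is an assumption.

```python
def hit_coverage_percent(hit_start: int, hit_stop: int, protein_length: int) -> float:
    """Hit coverage as described in the footnote, scaled to a percentage.

    Assumption: the coordinates are positions on the hit protein, and the
    ratio is multiplied by 100 to match a %HitCov>=N style cutoff.
    """
    if protein_length <= 0:
        raise ValueError("protein_length must be positive")
    return (hit_stop - hit_start) / protein_length * 100.0


# Made-up example: a hit aligned from position 11 to 191 of a 200-residue protein.
coverage = hit_coverage_percent(11, 191, 200)
print(f"{coverage:.1f}")  # 90.0
```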
Top 15 species from total: N
The N is the number of unique species based on the first two words of the
species name. From "AnnoDB Hits":
- Select "Species", select "Two words", enter the first two words of the species name next to "Find", select "Find", then select the entry on the
left and add it to the right.
- Select "Best Bits", "Best Anno" or "None" for the three numbers shown.
- BUILD TABLE
- Use the number listed beside "Pairs".
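To illustrate what "unique species based on the first two words" means, here is a minimal sketch. The helper names and the sample species strings are hypothetical, and the comparison is case-sensitive, which may or may not match what TCW does.

```python
from typing import Iterable


def species_key(species_name: str) -> str:
    """Return the first two whitespace-separated words of a species name."""
    return " ".join(species_name.split()[:2])


def count_unique_species(species_names: Iterable[str]) -> int:
    """Count distinct species after truncating each name to two words."""
    return len({species_key(name) for name in species_names})


# Hypothetical species strings; subspecies and common names collapse
# onto the same two-word key.
names = [
    "Arabidopsis thaliana",
    "Arabidopsis thaliana (Mouse-ear cress)",
    "Oryza sativa subsp. japonica",
    "Oryza sativa subsp. indica",
    "Zea mays",
]
print(count_unique_species(names))  # 3
```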
Gene ontology statistics:
The counts of GO terms include assigned obsolete terms (level=0).
Column | Search | Obtain number
---|---|---
Unique GOs | GO Annotation: no filters | Results number
Unique hits with GOs | AnnoDB Hits: Seqs:None; GO,etc:Has GO | Hits number at top of table
Sequences with GOs | Filters: Annotation: Annotated, Best with GO | Number of rows
Seq best hit has GOs | AnnoDB Hits: Seq:Best Bits; GO,etc:Has GO | Seqs number at top of table
biological_process | GO Annotation: Level: biological_process | Number GOs at top of table
molecular_function | GO Annotation: Level: molecular_function | Number GOs at top of table
cellular_component | GO Annotation: Level: cellular_component | Number GOs at top of table
is_a, part_of | GO Annotation: no filters | Export..., Each GO's parents with relations, grep (see footnote*)
* From a terminal, run 'grep is_a GOeachParents.tsv | wc'; the first number printed is the line count. Repeat with
is_a replaced by part_of.
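Where grep is not available, the same line counts can be obtained with a short script. This is a sketch under the assumption that the exported file is plain text with one relation per line; the function names are hypothetical, and the match is a simple substring test, as with grep.

```python
from pathlib import Path


def count_lines_containing(path: str, text: str) -> int:
    """Count the lines of a file that contain `text` (like `grep text file | wc -l`)."""
    with Path(path).open(encoding="utf-8") as handle:
        return sum(1 for line in handle if text in line)


def relation_counts(path: str) -> dict:
    """Return the number of lines mentioning is_a and part_of."""
    return {rel: count_lines_containing(path, rel) for rel in ("is_a", "part_of")}
```

For example, `relation_counts("GOeachParents.tsv")` returns a dictionary with one count per relation for the exported file.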
EXPRESSION
The following sections may not exist if the input had no count files or the DE methods
were not executed.
TPM:* | Filters: Counts and TPM: select Condition under Exclude; set "At Most" to 5000. | Continue using 1000, 100, 50, 5, 2 where the previous results are subtracted from the current. The results are for intervals >=N to <M.
Differential Expression: | Filters: Differential Expression: select the DE column, then enter the number, e.g. 1E-5. | These counts are cumulative.
GO enrichment: | GO Annotation: Enrich: enter the p-value threshold (e.g. 0.05); select the green button next to "for" and select the p-value to filter on. | These counts are cumulative.
* If "RPKM" is at the top of the Overview instead of "TPM", then RPKM was computed instead of TPM.
SEQUENCES
If the sequences (e.g. ESTs) have been assembled, there are statistics on buried,
mate-pairs, etc. Most of them are columns, so select the column, then use
"Show Column Stats" to view the sum. The following covers only the three sections
that exist for all NT-sTCWdbs.
Sequence lengths | Filters: General: Length: change >= to <=. Filter for each cutoff 100, 500, 1000, 2000, 3000, 4000, 5000. | The counts of the intervals to the left need to be subtracted from the number of rows.
ORF lengths | Filters: SNPs and ORFs: Has ORF >= N. Filter for each cutoff 5001, 4001, 3001, 2001, 1001, 501, 101. | The counts of the intervals to the right need to be subtracted from the number of rows.
Quality | Filters: General: Has Ns: Yes; Columns: General: Ns | The resulting rows are the #n>0; sort the rows in descending order on the Ns column, and scroll down to the row number that starts having #n>10.
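The subtraction described for the sequence and ORF length intervals can be sketched as follows. The function name and the row counts are made up for illustration; the only assumption is that each filter returns a superset of the rows returned by the previous cutoff, so the counts never decrease.

```python
def interval_counts(cumulative_rows):
    """Turn cumulative row counts into per-interval counts.

    `cumulative_rows` holds the number of rows returned for each cutoff, in
    the order the cutoffs are applied. Subtracting the previous cumulative
    count is the same as subtracting all earlier intervals.
    """
    counts = []
    previous = 0
    for rows in cumulative_rows:
        if rows < previous:
            raise ValueError("cumulative counts must not decrease")
        counts.append(rows - previous)
        previous = rows
    return counts


# Made-up row counts for Length <= 100, 500, 1000, 2000.
length_rows = [40, 250, 600, 900]
print(interval_counts(length_rows))  # [40, 210, 350, 300]
```

The same function applies to the ORF cutoffs when the row counts are listed in the order 5001, 4001, and so on, since each lower cutoff includes the rows of the higher ones.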
ORF stats:
Column | Search | Obtain number
---|---|---
The following use the Basic Sequence. Select "TCW" and enter the Substring indicated. | |
Is Longest ORF | !Lg | #Seqs minus Result number
Markov Best Score | !Mk | #Seqs minus Result number
All of the above | $ | Result number
ORF=Hit | ORF=Hit | Result number
ORF=Hit with Ends | ORF=Hit+ | Result number
Multi-frame | Multi | Result number
Stops in Hit | Stop | Result number
>=9 Ns in ORF | 9N | Result number
The following use the Filters section: SNPs and ORFs. | |
Has Hit | Protein confirmation | Number of rows
Both Ends | Has Ends (Start&Stop codons) | Number of rows
ORF>=300 | Has ORF >= 300 | Number of rows
GC content:
The only reproducible number is the GC Content, which is the %GC over the entire sequence.
GC content | Seq Table | Column:%GC; Stats, column:Average
Note that the number may differ slightly due to round-off error.
- The "Pos i" column is the percent of G or C in the ith position of the CDS codons.
- The %GC is the percent of G's and C's over the sequence length.
- The CpG-O/E is the ratio of observed to expected CpG: (#CpG/(#G*#C))*length.
- The UTRs can be viewed in the Sequence Detail alignments, but there is no column for it.
|
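The three definitions above can be written out as a short sketch. The function names are hypothetical and this is not TCW's implementation. Two assumptions are made: the CDS passed to the codon-position function starts in frame at its first base, and the CpG ratio is reported as 0.0 when the sequence has no G or no C.

```python
def gc_percent(sequence: str) -> float:
    """Percent of G and C bases over the sequence length."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return (seq.count("G") + seq.count("C")) / len(seq) * 100.0


def cpg_observed_expected(sequence: str) -> float:
    """CpG observed/expected ratio: (#CpG / (#G * #C)) * length."""
    seq = sequence.upper()
    g_count = seq.count("G")
    c_count = seq.count("C")
    if g_count == 0 or c_count == 0:
        return 0.0  # assumption: undefined ratio reported as zero
    return seq.count("CG") / (g_count * c_count) * len(seq)


def gc_percent_by_codon_position(cds: str) -> list:
    """Percent GC at codon positions 1, 2 and 3 of an in-frame CDS."""
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    result = []
    for offset in range(3):
        bases = seq[offset:usable:3]
        result.append(gc_percent(bases) if bases else 0.0)
    return result


print(round(gc_percent("ACGCGT"), 2))          # 66.67
print(cpg_observed_expected("ACGCGT"))         # 3.0
print(gc_percent_by_codon_position("ATGGCC"))  # [50.0, 50.0, 100.0]
```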