RunDE computes differential expression (DE) between conditions for an existing singleTCW database,
as follows:
- DE per sequence: R-scripts for EdgeR, DESeq2.
- It is important that you study the documentation for the respective DE method to determine the
best approach for your data. You may either use one of the existing scripts,
customize an existing script, or create a new one.
- The R-scripts for DE sections explains how runDE writes the data to R variables,
executes the R-script, then reads the results into sTCWdb.
- GO enhancement: R-script for GOseq.
This document uses the "demoTra" project as an example. The
User Guide explains how to built it.
Contents:
- Installing R and packages
- Running the demo - Highly recommended to do this first
- DE between conditions
- Methods and Options
- Execute
- GO enhancement
- Remove
- R-scripts
- R directory permissions
1. Installing R and packages
Installing R
Installing rJava and R packages
From the command line, type:
>runDE tra
Or, if you type runDE with no arguments, the sTCW graphical chooser will appear
with all sTCW databases listed; select the demoTra database.
Either way, this brings up the runDE interface shown below.
Define a pairwise comparison by selecting "Stem" for Group 1 and "Root" for Group 2, as shown.
Check the box Save results in p-value column; the text box will display the column name "StRo", which can be changed.
Assuming you have installed edgeR, leave the
Differential Expression R-script as edgeRglm.R .
Execute Group 1 - Group 2. See Execute DE for the output.
The GO enrichment R-script will be goSeqBH.R by default, which
performs BH multiple hypothesis testing correction. However,
this results in 0 p-value<0.05 for the demo,
hence, for the demo only, change this to goSeqNoFDR.R. Then select
Execute. See Execute GO output.
|
|
> ****************************************************************
******** Start DE execution for column: StRo *********
****************************************************************
Replicates:
5 Stem
5 Root
Collecting count data (may take several minutes)
Using CPM filter > 1 for >= 2
1 filtered sequences
Assigning R variables
gc: GC values of sequences
rowNames: sequences (row) names
grpNames: group (column) names
repNames: replicate names
counts: counts of sequences
countData <- array(counts,dim=c(210,10))
rm(counts)
rownames(countData) <- rowNames
nGroup1 <- 5
nGroup2 <- 5
Start R-script
source('R-scripts/edgeRglm.R')
Loading required package: limma
Using traditional glm (quasi-likelihood F-tests)
results
rowNames
R-script done
Number of DE results:
<1e-5 <1e-4 <0.001 <0.01 <0.05
6 17 32 65 97
Saving 210 scores for StRo
Adding column to database...
.
Finished DE execution for StRo 0m:0s (8Mb)
Complete all Group1-Group2 for tra
The console is in R, you may run R commands -- q() or Cntl-C when done, or perform another Execute.
The R console remains open and you can now manually explore the data using R, if desired. However, it easiest to explore it be
selecting "Close" on runDE, which closed the java program but not R. Then do the following:
>q()
Save workspace image? [y/n/c]: y
R
This saves all variables written by runDE and the scripts (in this case edgeRglm.R). By restarting R, you can now
explore the data, e.g.
> library(edgeR)
> plotBCV(y)
> plotMDS(y, labels=grpNames)
Trace output to stderr and Log file (append): projects/DElogs/tra.log Size: 11.0kb
****************************************************************
******** 1/1. GO enrichment for column: RoOl *********
****************************************************************
Assigning R variables
6,651 Numbers GOs
208 Sequences with GOs from 211 (98.6%)
seqNames: sequence names
seqLens: sequence lengths
nSeqs <- 211
seqDEs: DE binary vector (110 p-value < 0.05)
names(seqDEs) <- seqNames
seqGOs <- vector("list",nSeqs)
For all n: seqGOs[[n]] <- c(gonum list)
names( seqGOs) = seqNames ...
Start R-script
source('R-scripts/goSeqNoFDR.R')
Using manually entered categories.
Calculating the p-values...
oResults
goNums
R-script done
Saving 6,651 values to database
172 GO p-values < 0.05
Finished GO enrichment for RoOl 0m:1s
Summary:
ColName #DEseq (%) DEseq%GO DEavgLen DEstdLen avgPWF stdPWF #<0.05
RoOl 110 (52.9%) 82.3% 1073.89 679.84 0.528607 0.073576 172
Complete GO enrichment for tra
The console is in R, you may run R commands -- q() or Cntl-C when done, or perform another Execute.
The R console remains open and you can now manually explore the data using R, if desired.
However, it easiest to explore it be selecting Close on runDE, which closed the java program but not R.
Then do the following:
>q()
Save workspace image? [y/n/c]: y
R
This saves all variables written by runDE and the script. By restarting R, you can now explore the data for the
last enrichment computation, e.g.
> library(goseq)
> plotPWF(pwf)
GO output summary
At the end of execution, a summary will be printed to the terminal and log file. For example, if
all enrichment p-values are computed for demoTra, the following will be output:
Summary:
ColName #DEseq (%) DEseq%GO DEavgLen DEstdLen avgPWF stdPWF #<0.05
StRo 97 (46.6%) 80.7% 1075.12 703.22 0.464116 0.000063 311
StOl 100 (48.1%) 79.5% 991.49 595.44 0.480768 0.065060 48
RoOl 110 (52.9%) 82.3% 1073.89 679.84 0.528607 0.073576 172
ColName | The DE column used as input for the GO enrichment.
| #DEseq (%) | The number of DE sequences for the DE column, where a sequence is DE based on the value
supplied (Top or p-value).
| DEseq%GO | The percent of unique direct or inherited GOs for the set of DE sequences. Due to inheritance,
this tends to be a high percentage.
| DEavgLen | The average length of the DE sequences.
| DEstdLen | The standard deviation of the length of the DE sequences.
| avgPWF | The average of the PWF (Probability Weighting Function).
| stdPWF | The standard deviation of the PWF.
| #<0.05 | The number of GOs that have a enrichment p-value <0.05.
|
Note the low stdPWF for StRo indicating the values are very close;
the plotPWF for it is non-existent. This may be too small of a sample size.
View DE Results
To view the resulting p-values, run 'viewSingleTCW tra' from the command line,
and click >> Columns. The column options has a section
Differential Expression with the DE columns listed, e.g. the DE columns for
demoTra are shown on the right. There is also
a section for Conditions, which has the "Stem", "Root, and "Oleaf" column options.
Select the desired columns (e.g. "Stem", "Root", "StRo"); you may then select a filter
(e.g. all sequences that have an absolute p-value < 0.05).
Note in the below figure that if Stem < Root, the p-value StRo is negative, else it is positive.
|
viewSingleTCW DE section with all pairs computed.
|
The highlighting and significant digits can be changed by the
Decimal Display panel.
View GO Enhancement Results
3. DE between conditions
R-script
The default R-script is edgeR.R, however, there are other scripts in the R-script directory, or the
user can supply there own (see R-script for DE).
Pre-filter
This example uses a database with 48k sequences and 5 replicates for root and stem.
The three filtering options were run with defaults (table on lower left)
before input to edgeR (results in figure on lower right).
The first option filtered 7,333 sequences and the second option filtered 1,290;
with edgeR, the number of analyzed sequences effects the resulting DE value.
Filter | Description | Column1
|
CPM2 >=[n] for >[m]
| Filter all sequences that do not have more than m replicates with CPM >= n (used m=2 and n=1)
| RoSt
|
Count >[n]
| Filter all sequences that do not have any replicates with a count > n (used n=10)
| RoSt1
|
None | No filtering | RoSt2
|
1 Column in table on right.
2 CPM (Counts per million) = (count/library size)*1M.
|
The DE value (RoSt) is set to negative if Root<Stem.
The columns prefixed with '#' are the counts, which are used by the filters.
The columns with no prefix, Root and Stem, are the TPM values.
|
In the figure on the right:
- DE value 3.00: the filtered sequences.
- DE value 2.00: assigned an NA from the DE computation.
To view the first 50 filtered sequences that do not have 'all' zero count values,
execute runDE with the ID or database name and a -v,
e.g. runDE OlR -v.
Using CPM filter > 1 for >= 2, the following shows the first few filtered sequences:
Rep Names Root1 Root2 Root3 Root4 Root5 Stem1 Stem2 Stem3 Stem4 Stem5
Lib Sizes 25267201 26691329 20806376 24815333 29748103 36131457 25997359 21874134 25691566 31637621
SeqID Counts : CpM
OlR_000003 41 2 16 8 8 7 7 7 4 6 : 1.6 0.1 0.8 0.3 0.3 0.2 0.3 0.3 0.2 0.2
OlR_000016 23 5 37 21 17 26 16 12 13 23 : 0.9 0.2 1.8 0.8 0.6 0.7 0.6 0.5 0.5 0.7
OlR_000053 17 12 10 26 18 30 12 15 25 27 : 0.7 0.4 0.5 1.0 0.6 0.8 0.5 0.7 1.0 0.9
OlR_000063 18 4 21 15 10 24 20 19 24 26 : 0.7 0.1 1.0 0.6 0.3 0.7 0.8 0.9 0.9 0.8
OlR_000066 8 3 1 0 0 20 15 13 34 28 : 0.3 0.1 0.0 0.0 0.0 0.6 0.6 0.6 1.3 0.9
OlR_000083 11 12 13 22 12 11 15 2 3 6 : 0.4 0.4 0.6 0.9 0.4 0.3 0.6 0.1 0.1 0.2
OlR_000100 14 6 21 15 12 7 10 8 2 8 : 0.6 0.2 1.0 0.6 0.4 0.2 0.4 0.4 0.1 0.3
OlR_000162 11 10 7 17 5 8 15 7 18 10 : 0.4 0.4 0.3 0.7 0.2 0.2 0.6 0.3 0.7 0.3
OlR_000181 25 1 24 10 7 6 13 3 4 7 : 1.0 0.0 1.2 0.4 0.2 0.2 0.5 0.1 0.2 0.2
Using Count > 10, the following shows the first few filtered sequences.
SeqID Root1 Root2 Root3 Root4 Root5 Stem1 Stem2 Stem3 Stem4 Stem5
OlR_000238 4 1 4 1 4 0 0 0 0 0
OlR_000264 8 2 5 2 7 1 0 0 0 1
OlR_000489 0 0 3 0 2 2 1 2 0 1
OlR_000519 0 0 2 0 0 0 0 0 0 0
OlR_000625 5 0 3 2 1 0 1 1 0 1
OlR_000642 7 3 6 4 9 5 4 1 9 8
Fixed dispersion
If there are not biological replicates, then it is necessary to set the dispersion. If it is not set, TCW will use 0.1 by default.
The most meaningful DE comparison is between exactly two conditions, as illustrated in the example above.
Groups of more than one condition may also be selected, however
the conditions in each group will be treated as replicates, leading (probably) to an overestimation
of biological variability and underestimation of DE.
If multi-condition testing is desired, e.g. to find sequences reduced in condition A compared to
any or all of conditions B,C,D, then the recommended process is to compute the individual
pair DEs A-B, A-C, A-D, and then use filter options in viewSingleTCW
to perform the screen.
There are four options to add DE values to the sTCW database, as explained in the following four sections.
The DE p-value columns will be displayed in viewSingleTCW in the order created.
- Only check conditions in Group 1.
- Results will be saved in the database with auto-generated column names.
- Execute: Each condition selected in group 1 will be pairwise compared; any selections
in Group 2 are ignored.
For example, if all three conditions (Stem, Root, Oleaf) were selected, then it would
compute:
Stem Root StRo
Stem Oleaf StOl
Root Oleaf RoOl
where the last column is the generated column name.
Group 1 - Group 2
- Select at least one condition from each group.
- If you want to store the results in the database, check Save results in P-value column.
If the groups have been selected, a column name will be auto-generated which you can change.
When you change the groups for a subsequent execution, uncheck-then-check the column box
to auto-generated a name for the new groups.
- Execute: The condition(s) in Group 1 will be compared with the condition(s) in group 2.
- Any checked conditions in Group 1 and Group 2 are ignored; the conditions to
compare are specified in the file.
- Results will be saved in the database with the provided column names.
- Execute: All pairs from the file will be compared with the selected method
and options.
File format: Each line has 3 columns which represent Group 1, Group 2 and the column name to use, respectively.
For example:
Root Stem RoSt
Root Oleaf RoOl
Stem Oleaf StmLea
Root:Stem Oleaf RoSt_Ol
The last row shows how to select two or more from Group 1; no spaces are allowed between the ":".
Group 2 can only have one condition.
Keep the column names short but meaningful. A file similar to the above is in the projects/demoTra directory,
filename traDE.tsv to allow user to try this option.
The first line defines the two groups and the p-value column name,
using the same format as the "Get Pairs from File". All remaining lines have a sequence name followed by the p-value. For example,
#Example p-value file
Root Stem RoSt
tra_001 0.14
tra_002 0.14
tra_003 0.65
The p-values are read as absolute values since runDE makes values negative if Cond1<Cont2.
4. GO enrichment (over-represented categories)
| Go to top |
If GO annotations are loaded (see User Guide), you can
compute p-values for enrichment of DE in GO categories (terms/IDs).
- R-script: Use the default goSeqBH.R or supply your own.
- P-value columns: If one or more DE p-values have been added to the database,
the <---> box will say All p-values;
this can be changed to compute GO enrichment for just one p-value column.
- P-value cutoff: A p-value of 0.05 is the default; this applies to the DE p-value so that
all sequences with a p-value<0.05 will be considered DE (which assumes FDR correction was applied).
Since the DE p-values still contain considerable
uncertainty as absolute measures of probability, there is an option to defined DE
in terms of ranking, i.e., top 10% (the ranking is ordered by the p-values).
The result is a p-value column for the GOs, having
the same names as the corresponding sequence DE columns, accessible through the Basic Filters GO Annotation function of
viewSingleTCW.
When a DE p-value column is removed, the corresponding GO p-value column will be removed.
Removing a GO p-value column does not effect the corresponding DE column.
In both cases, either all columns can be removed or a selected column.
6. R-scripts
When the Execute is selected, TCW performs the following:
- Writes the necessary information to the R environment based on the options selected.
- Runs the R-script using the source command.
- Reads the results variable from the R environment, which should contain the p-values.
TCW writes the following variables to the R environment (the following assignments are from the above demo example):
> gc: GC values of sequences
rowNames: sequences (row) names
grpNames: group (column) names
repNames: replicate names
counts: counts of sequences
countData <- array(counts,dim=c(210,10))
rm(counts)
rownames(countData) <- rowNames
nGroup1 <- 5
nGroup2 <- 5
The variables displayed with a ":" after them are assigned with the assign function, whereas the others are the exact assignment.
The most important variables are countData and grpNames, e.g.
> head(countData)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
tra_001 1017 594 1222 1209 1315 378 1002 1649 826 1195
tra_002 272 239 431 400 368 101 206 151 109 185
tra_003 3830 5185 4847 4857 5451 1859 2506 1334 1541 2307
tra_004 1707 1088 2429 2210 2334 529 919 1103 810 1427
tra_005 479 369 439 444 565 114 192 151 232 351
tra_006 1122 923 1381 1320 1482 632 839 905 670 823
> grpNames
[1] "Stem" "Stem" "Stem" "Stem" "Stem" "Root" "Root" "Root" "Root" "Root"
The R-script should put the p-values in the results variable, e.g.
> head(results)
[1] 1.537634e-07 1.695981e-07 2.478791e-07 2.478791e-07 4.026519e-06
[6] 4.600047e-06
The order should correspond to the input order; if changed, then change the order in rowNames to correspond.
An example script is:
# edgeR glm method unless no replicates
library(edgeR)
y <- DGEList(counts=countData,group=grpNames)
y <- calcNormFactors(y)
if (nGroup1==1 && nGroup2==1) {
writeLines("Using classic with fixed dispersion")
et <- exactTest(y, dispersion=disp)
res <- topTags(et, n=nrow(et), adjust.method="BH")
} else {
writeLines("Using traditional glm")
design <- model.matrix(~grpNames)
y <- estimateDisp(y, design)
fit <- glmQLFit(y,design)
qlf <- glmQLFTest(fit,coef=2)
res <- topTags(qlf, n=nrow(qlf), adjust.method="BH")
}
# Columns are: logFC logCPM F PValue FDR
results <- res$table$FDR
rowNames <- rownames(res)
This works like the R-scripts for DE, with the variables shown below in the GO enrichment
example output:
Assigning R variables
6,651 Numbers GOs
208 Sequences with GOs from 211 (98.6%)
seqNames: sequence names
seqLens: sequence lengths
nSeqs <- 211
seqDEs: DE binary vector (97 p-value < 0.05)
names(seqDEs) <- seqNames
seqGOs <- vector("list",nSeqs)
For all n: seqGOs[[n]] <- c(gonum list)
names( seqGOs) = seqNames
The GOs assigned to each sequence are both the direct and indirect.
The p-values to be loaded must be in oResults and
the associated GO numbers are in goNums (note that TCW removes the "GO:" prefix and just used the number).
If the R-script file name contains "goseq" (case-insensitive) , it expects the variable pwf (case-sensitive) to exist.
The following is the goSeqBH.R script:
# goSeq with Benjamini and Hochberg FDR
suppressPackageStartupMessages(library(goseq))
pwf <- nullp(seqDEs,'','',seqLens,FALSE)
GO.wall <- goseq(pwf,'','',seqGOs)
goNums <- GO.wall$category
oResults <- p.adjust(GO.wall$over_represented_pvalue, method="BH")
FDR
According to the GOseq Manual: "Having performed the GO analysis, you may now wish to
interpret the results. If you wish to identify categories significantly enriched/unenriched
below some p-value cutoff, it is necessary to first apply some kind of multiple hypothesis
testing correction. For example, GO categories over enriched using a .05 FDR cutoff
[Benjamini and Hochberg, 1995] are:"
> enriched.GO=GO.wall$category[p.adjust(GO.wall$over_represented_pvalue, + method="BH")<.05]
Hence, the values loaded into "oResults" for the TCW supplied goSeqBH.R:
oResults <- p.adjust(GO.wall$over_represented_pvalue, method="BH")
The p.adjust methods are: c("holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none")
demoTra: The TCW supplied demo result in 0 p-values<0.05.
In order to view some p-values<0.05 results
for the demo, use goSeqNoFDR.R, which has no multiple testing correction.
If you use your own R-script, make sure the name reflects its function. The R-script names are
shown in the TCW Overview, as illustrated below:
DE (Differential Expression) computation:
Column Method Conditions
StRo edgeRglm.R Stem : Root
StOl edgeRglm.R Stem : Oleaf
RoOl edgeRglm.R Root : Oleaf
GO enrichment computation:
Column Method Cutoff
StRo goSeqNoFDR.R 0.0500
StOl goSeqNoFDR.R 0.0500
RoOl goSeqNoFDR.R 0.0500
The following was written in 2014, so may be dated.
Unless you plan to run as root (not recommended),
or will be the only one running runDE, you will also want to set up R so that one user
can install packages to be used by all users. Note: you
do NOT need R to view the DE results with viewSingleTCW.
When you install an R package, it can go either to a shared location or to a user-specific location.
Installing to the shared location is better, as others can then use the package, however in the default
installation of R the shared directories are owned by the root user and their permissions must be
changed to allow ordinary users to install shared packages. This is done as follows (unless you have
sudo access, you will need an administrator to do this):
- Run R (by typing "R")
- In R, use Sys.getenv("R_HOME") to find the R installation directory (usually /usr/lib64/R)
- Exit from R with command "q()"
- ls /usr/lib64/R (or wherever the R_HOME was)
- The directory listing should have a library subdirectory; if not, R is not installed right
- Change permissions on the library directory
(and everything under it) such that ordinary users can write to it.
Ideally there is an appropriate user group and you can use chgrp -R <group> library
and chmod -R g+w library to give this group write permissions. Otherwise,
chmod -R 777 library will be necessary.
Also, you can check the R-devel installation using ls <R_HOME>/etc. There
should be a file Renviron in this etc directory; if not, or if etc is
not there, then R-devel was not installed.
|