To prepare for annotation with runSingleTCW, it is necessary to download the databases to be compared against.
The runAS program provides support for downloading the taxonomic and full UniProts along with mapping from the UniProt IDs to GO, KEGG, Pfam, EC, and InterPro.
Tested: runAS has been tested on Linux and MacOS. If you have any problems,
please let me know at tcw at agcol.arizona.edu.
Terminology:
The term "AnnoDB" refers to any database that will be used for annotation, i.e. the sequences in
TCW will be searched against all AnnoDB databases and the hits stored in the single TCW database
(sTCWdb) for query.
Requirements:
→ runAS uses curl for downloading annoDBs and the GO database.
You can get curl on most Linux machines with 'sudo yum install curl' (or your distribution's package manager),
and MacOS comes with it. If curl is not installed, runAS will prompt you as shown on the right;
if you select Continue, it will perform the download with its own Java code, though it may take
longer and is less robust, e.g. it is more susceptible to problems caused by network latency.
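Before starting runAS, you can confirm that curl is on your PATH. The following is a minimal Python sketch of such a check; it is not part of TCW, and the function name find_curl is hypothetical.

```python
import shutil


def find_curl():
    """Return the full path to the curl executable, or None if it is not on the PATH."""
    return shutil.which("curl")


if __name__ == "__main__":
    path = find_curl()
    if path:
        print(f"curl found at {path}")
    else:
        # Per the text above, runAS then offers its slower built-in Java download.
        print("curl not found; install it or let runAS use its own Java download")
```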
Processing steps: The TCW runAS will perform the following:
- Create the directory under projects/DBfasta for the downloads and generated FASTA files.
- Download the selected Taxonomic UniProts .dat files and create FASTA files.
- Download the selected full UniProt .dat file and create a FASTA file of the sequences not found in any of the downloaded
taxonomic files.
- Create GO database, which contains mappings from UniProt IDs to GO, KEGG, EC, Pfam and InterPro.
- Download go-basic.obo from http://current.geneontology.org/ontology/
- Create a local mySQL GO database (GOdb) with the information from this file.
- Add information to the GOdb from the .fasta and .dat files in the UniProt directory.
- Create the file projects/AnnoDBs_UniProt_<date>.cfg to be imported to runSingleTCW.
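To illustrate the "create FASTA files" step, here is a minimal Python sketch that pulls the entry name, accession, description, species and sequence out of UniProt flat-file (.dat) records. It is not the runAS code: the function name dat_to_fasta is hypothetical, only the first AC, DE "Full=" and OS lines are used, and the header layout is assumed to follow the UniProt description line shown later in this document.

```python
def dat_to_fasta(dat_lines, db_type="sp"):
    """Yield (header, sequence) pairs from the lines of a UniProt .dat file.

    Simplified sketch: multi-line OS/DE fields and evidence tags are ignored.
    """
    entry, acc, desc, species, seq = "", "", "", "", []
    in_seq = False
    for line in dat_lines:
        code, text = line[:2], line[5:].rstrip()
        if code == "ID":
            entry = text.split()[0]
        elif code == "AC" and not acc:
            acc = text.split(";")[0].strip()          # first accession only
        elif code == "DE" and not desc and "Full=" in text:
            desc = text.split("Full=", 1)[1].rstrip(";")
        elif code == "OS" and not species:
            species = text.rstrip(".")
        elif code == "SQ":
            in_seq = True                             # sequence lines follow
        elif code == "//":                            # end of record
            yield f">{db_type}|{acc}|{entry} {desc} OS={species}", "".join(seq)
            entry, acc, desc, species, seq = "", "", "", "", []
            in_seq = False
        elif in_seq:
            seq.append(line.replace(" ", "").strip())


if __name__ == "__main__":
    # Toy record with made-up values, for illustration only.
    toy = [
        "ID   TEST_ORGAN              Reviewed;          10 AA.\n",
        "AC   P00000;\n",
        "DE   RecName: Full=Example protein;\n",
        "OS   Example organism.\n",
        "SQ   SEQUENCE   10 AA;\n",
        "     MHPKVQALLS\n",
        "//\n",
    ]
    for header, sequence in dat_to_fasta(toy):
        print(header)
        print(sequence)
```

For a real download, the lines could be read with gzip.open(path, "rt") so that the .dat.gz file never needs to be uncompressed on disk.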
Important:
- Memory and Time: This can take a lot of memory and time, so make sure to read this section.
- What AnnoDBs to use: To reduce the memory and time, make sure to read this section.
- Creating AnnoDBs from other...: Other databases, such as NCBI nr, can be used for annotation but they will not have GO, KEGG, EC, Pfam, or InterPro.
The TCW package provides subsets of UniProt for annotating the demo.
In order to add GO annotations, a local GO mySQL 'demo' database needs to be created.
- From the TCW_4 directory, execute:
./runAS -d
The "-d" will cause it to enter the demo parameters, as shown on the right.
The highlighted entries already exist. It is only necessary to build the GO database.
- Execute Build GO.
→ The GO tables are available for the demo, i.e. they will not be downloaded.
→ When you select Build GO, a popup will say
"GO files exist. Build GO database Only", click Continue.
→ Building the GO database takes 3 to 10 minutes.
Details about the Demo setup
In the projects/DBfasta directory, there are the sub-directories UniProt_demo and GO_obodemo,
which contain the following:
GO_obodemo:
go_basic.obo
UniProt_demo:
sp_bacteria/ sp_fungi/ sp_plants/ tr_plants/
sp_full/ sp_invertebrates/ tr_invertebrates/
Each taxonomic directory has a .dat and a .fasta file, which are very small subsets of the
original UniProt taxonomic .dat file.
Typically, all you need to do is make sure you have an internet connection open and that you have enough disk space
(see Memory), then start the interface shown on the lower left by typing at the command line: ./runAS
- The TCW Annotation Directories define where the files will be put. TCW provides defaults as shown on the right;
it is recommended you use the defaults.
- Select the taxonomic databases you want to use, then select Build Tax,
which downloads the respective .dat.gz files and creates FASTA files.
- Select the full databases you want to use, then select Build Full,
which downloads the respective .dat.gz file and creates a subset FASTA file that only contains
the sequences NOT in the downloaded taxonomic FASTA files. See Full subsets for more detail.
- Select Build GO, which downloads the GO database, creates a local mySQL GO database
with a mapping of the UniProts from your downloaded set. This uses the information in HOSTS.cfg.
- Select AnnoDB.cfg, which writes a file called projects/AnnoDBs_UniProt_<date>.cfg
that contains all the information downloaded; this can be used as input to runSingleTCW
(see Import AnnoDBs).
Check: The Check function automatically runs on startup and after any Build.
It highlights everything that has been done.
For example, the figure above shows that fungi and plant SwissProt have been downloaded and processed.
To force a check, or to view the UniProts in an existing goDB, select the Check button.
A log of the processing is written to projects/DBfasta/logs/runAS.log.
See the log file for an example.
Important points:
- runAS will not replace an existing downloaded file:
It will overwrite a .fasta file, but never a .dat file. If you want a .dat file downloaded again,
you must remove it yourself.
- Build GOdb only after all desired taxonomic and full databases are downloaded: It is important that you create the GO database right after downloading
the UniProt files so that they correspond. It is also important that you have downloaded all desired taxonomic and
full UniProt databases.
- Only download what you need!
See Memory and Time and What AnnoDBs to use.
- runMultiTCW: If multiple sTCWdbs are to be compared using multiTCW,
it is important they use the same set of AnnoDBs and GO database (see Entering AnnoDBs).
Full subsets:
When you select Build Full, a pop-up similar to the one on the right will be shown,
where only the taxonomic names will be shown that correspond to downloaded taxonomic .dat files.
This allows you to create different subsets. Typically, you will only want one subset, which is the
one corresponding to the taxonomic files downloaded.
The FASTA file will have a suffix indicating what
subset it corresponds to. For example, the selection on the right would create the file uniprot_sprot_xBFxIxPxxV.fasta,
where the 10 characters represent the 10 taxonomic databases in alphabetic order, and the capital letters represent the
taxonomic sequences removed (Bacteria, Fungi, Invertebrate, Plant, Virus).
Details: You may unselect all entries and it will create a FASTA file of all sequences. When runSingleTCW loads UniProt IDs,
it only loads the first occurrence of a UniProt ID, so duplicates will not cause errors. However, by using the proper subset, processing is faster and
the e-values are lower since there are fewer sequences in the database.
You may create new subsets at any time as it does not affect the GOdb. Only one file will be shown in the AnnoDB.cfg file.
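The subset suffix can be reproduced with a few lines of Python. This is a sketch of the naming rule described above, not the runAS code; the function names are hypothetical, and the list of 10 taxonomic names is an assumption taken from UniProt's taxonomic divisions.

```python
# Assumed: the 10 UniProt taxonomic divisions, in alphabetic order.
TAXONOMIES = [
    "archaea", "bacteria", "fungi", "human", "invertebrates",
    "mammals", "plants", "rodents", "vertebrates", "viruses",
]


def subset_suffix(removed):
    """Return the 10-character suffix: a capital letter for each removed
    taxonomy, and 'x' for each taxonomy that was not removed."""
    unknown = set(removed) - set(TAXONOMIES)
    if unknown:
        raise ValueError(f"unknown taxonomy: {sorted(unknown)}")
    return "".join(t[0].upper() if t in removed else "x" for t in TAXONOMIES)


def subset_filename(removed, prefix="uniprot_sprot"):
    """Return the FASTA file name for the given set of removed taxonomies."""
    return f"{prefix}_{subset_suffix(removed)}.fasta"


if __name__ == "__main__":
    selected = {"bacteria", "fungi", "invertebrates", "plants", "viruses"}
    print(subset_filename(selected))
```

With the five taxonomies of the example above selected, the sketch gives uniprot_sprot_xBFxIxPxxV.fasta; with nothing selected it gives a suffix of ten 'x' characters.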
Check: Select to update the highlighting, as discussed below. The Check function is automatically run on startup,
and when any of the three "Builds" are executed.
Label Highlights
- At the top:
- If the UniProt directory label is highlighted in blue, it exists.
- If the GO directory label is highlighted in pink, it exists but the GO OBO file has not been downloaded.
If the GO directory label is highlighted in blue, the GO OBO file has been downloaded.
- On the middle right:
- If the GO Database label is highlighted in blue, the GO database exists.
Taxonomic and Full UniProt Highlights
Clear checkbox: If a Taxonomic is clear, then neither the .dat file nor the .fasta file exists for it.
When you check the box followed by Build Tax, you will need to confirm a popup that states "Download SP - xxx",
where xxx will be the list of files to download. The download is always automatically followed by creating the .fasta files.
The same applies to the Full checkboxes.
Pink checkbox: If the .dat file exists, but the .fasta file does not, the checkbox will be highlighted pink.
Check the pink box(es) and run Build Tax in order to create the .fasta file only.
The same applies to the Full checkboxes.
Blue checkbox: If both the .dat file and the .fasta file exist, the checkbox will be highlighted blue.
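The three checkbox states amount to a small decision rule over which files are present. The Python sketch below restates it; it is not the runAS Check code, the function names are hypothetical, and the file patterns (.dat or .dat.gz, .fasta or .fasta.gz) are assumptions. The case of a .fasta file without a .dat file is not described in this document, so the sketch reports it as "undefined".

```python
from pathlib import Path


def checkbox_state(dat_exists, fasta_exists):
    """Map file presence to the highlight described for the checkboxes."""
    if dat_exists and fasta_exists:
        return "blue"       # downloaded and FASTA created
    if dat_exists:
        return "pink"       # downloaded, FASTA still to be created
    if not fasta_exists:
        return "clear"      # nothing downloaded yet
    return "undefined"      # FASTA without .dat: not covered by the text


def directory_state(tax_dir):
    """Inspect one taxonomic directory (e.g. sp_fungi) and return its state."""
    tax_dir = Path(tax_dir)
    has_dat = any(tax_dir.glob("*.dat")) or any(tax_dir.glob("*.dat.gz"))
    has_fasta = any(tax_dir.glob("*.fasta")) or any(tax_dir.glob("*.fasta.gz"))
    return checkbox_state(has_dat, has_fasta)
```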
File Structure
For each taxonomic and full UniProt that you downloaded, a directory will be created under the UniProt directory.
For example,
./TCW/projects/DBfasta/UniProt_Dec2021%> ls *
sp_archaea:
uniprot_sprot_archaea.dat.gz uniprot_sprot_archaea.fasta
sp_full:
uniprot_sprot.dat.gz uniprot_sprot_AxxxxxxxxV.fasta
sp_viruses:
uniprot_sprot_viruses.dat.gz uniprot_sprot_viruses.fasta
When you run the BLAST or DIAMOND search programs from runSingleTCW,
the formatted files will be placed in the corresponding directory.
Compress Fasta: If you plan on using DIAMOND as the search program, you may compress
the fasta files after download, e.g.
cd projects/DBfasta/UniProt_<date>
gzip */*.fasta
GO (Gene Ontology)
The go-basic.obo file is downloaded from
http://current.geneontology.org/ontology/.
Database: This text entry on the runAS interface is the name of the GO MySQL database that
will be created; you will enter this name in runSingleTCW.
The processing steps are as follows:
- Download the GO Basic OBO file to GO directory.
- Build a GO specific MySQL database (referred to as GOdb) with the contents of the file.
- Add the UniProts from all subdirectories under the UniProt directory
(e.g. projects/DBfasta/UniProt_Mar2021) to the GOdb.
runAS does not remove files that are no longer necessary, which are the files downloaded from the internet:
- All "dat.gz" files in the UniProt directories, as the information has been transferred
to the FASTA files and GO database.
- The GO directory, as the information has been transferred to the GO database.
These files can be removed, as runSingleTCW uses the FASTA files in the UniProt directories and
the GO mySQL database. However, if you do not have a space problem, keep them just for insurance; when UniProt
does the monthly update, your downloaded files will no longer be available on their site.
For the FASTA files that you will be using DIAMOND to search against, you can gzip them as
DIAMOND can search against gzipped files.
When you're calculating space, remember that the BLAST and DIAMOND programs will
format the .fasta file, which takes up even more space. For example:
/TCW/projects/DBfasta/UniProt_Dec2021/sp_full% ls -hlG
-rw-r--r-- 1 cari staff 597M Dec 20 07:07 uniprot_sprot.dat.gz
-rw-r--r-- 1 cari staff 54M Dec 20 15:55 uniprot_sprot_xBFxIxPxxV.fasta
-rw-r--r-- 1 cari staff 55M Dec 20 16:15 uniprot_sprot_xBFxIxPxxV.fasta.dmnd
Taxonomic
Downloads were performed on 6-Jun-2021 onto a Linux machine with a ~500 Mbps download connection and 128Gb of RAM
on a Sunday afternoon. Note that download times can vary considerably.
File | .dat Size | Download | .fasta Size [1] | Creation
---|---|---|---|---
uniprot_sprot_bacteria.dat.gz | 203Mb | 0m:27s | 150Mb | 0m:25s
uniprot_sprot_fungi.dat.gz | 49Mb | 0m:15s | 21Mb | 0m:04s
uniprot_sprot_invertebrates.dat.gz | 34Mb | 0m:05s | 14Mb | 0m:02s
uniprot_sprot_plants.dat.gz | 51Mb | 0m:10s | 21Mb | 0m:04s
uniprot_sprot_viruses.dat.gz | 16Mb | 0m:06s | 9Mb | 0m:01s
uniprot_sprot.dat.gz | 587Mb | 1m:09s | 55Mb [2] | 1m:43s
uniprot_trembl_bacteria.dat.gz | 87Gb | 1h:57m:02s | 64Gb | 2h:24m:45s
uniprot_trembl_fungi.dat.gz | 8.3Gb | 13m:41s | 7.4Gb | 13m:33s
uniprot_trembl_invertebrates.dat.gz | 7.8Gb | 12m:05s | 6.8Gb | 12m:50s
uniprot_trembl_plants.dat.gz | 12Gb | 16m:21s | 10.3Gb | 18m:57s
uniprot_trembl_viruses.dat.gz | 3.5Gb | 5m:20s | 2.3Gb | 5m:21s

[1] When TCW extracts the sequence into a FASTA file, it is not written in a gzipped format.
However, if you are going to use DIAMOND, you can gzip them (the gzipped uniprot_trembl_bacteria.fasta file is 31Gb).
[2] The subset, i.e. full SwissProt minus all downloaded taxonomic entries.
GO database
It takes less than a minute to download the GO file. The time it takes to build the GO
database is proportional to the number of UniProts to be processed. For example,
Machine | AnnoDBs | Time | Database size
---|---|---|---
MacOS Catalina | SwissProt Plant and Full, TrEMBL Plant | 24m:47s | 2.7Gb
Linux (as specified above) | The 11 taxonomic and full listed above | 10h:42m:27s [1] | 26Gb [1]

[1] Most of the time, 8h:23m:30s, was for loading the uniprot_trembl_bacteria.dat,
which would also account for the database size.
Strong suggestions:
- Only download what is relevant!
- Download all relevant SwissProt files and the Full SwissProt UniProt.
- Download only the most relevant TrEMBL files, and never the Full TrEMBL UniProt unless absolutely necessary.
- Do not perform constant downloads; it is a drain on the UniProt servers.
The UniProts do not change that fast, and re-downloading can change the 'best' hits in TCW, which can disturb any ongoing analysis.
Evidence: In order to show that it is sufficient to just download the most relevant databases,
the following test was performed.
The dataset used for the following tests is from de novo assembled sequences from
Andropogon gerardii, which is related to Sorghum. It was downloaded from
Dryad and published by
Hoffman and Smith (2017).
The full dataset had >60k transcripts, which was reduced to 27,085 for faster tests.
Four annotations were compared:
Annotation | AnnoDBs | #Annotated
---|---|---
#1 | sp_plants, tr_plants, sp_full | 25,049 (92.5%)
#2 | #1 + sp_virus, sp_fungi, sp_invertebrate, sp_bacteria | 25,052 (92.5%)
#3 | #2 + tr_virus, tr_fungi, tr_invertebrate, tr_bacteria, tr_full | 25,070 (92.6%)
#4 | #1 + nr | 25,160 (92.9%)

Using only sp_plants, tr_plants, and sp_full, 92.5% of the transcripts were annotated compared
with 92.9% using the entire NR database.
If your organism is not closely related to any model organism, then there will likely be a bigger difference.
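The percentages above follow directly from the counts and the 27,085 transcripts in the reduced dataset. The short Python sketch below recomputes them; the variable and function names are hypothetical.

```python
TOTAL_TRANSCRIPTS = 27085  # size of the reduced dataset

# Number of annotated transcripts for each of the four annotations.
ANNOTATED = {"#1": 25049, "#2": 25052, "#3": 25070, "#4": 25160}


def percent(count, total=TOTAL_TRANSCRIPTS):
    """Return count as a percentage of total."""
    return 100.0 * count / total


if __name__ == "__main__":
    for label, count in ANNOTATED.items():
        print(f"{label}: {count:,} ({percent(count):.1f}%)")
    # Extra transcripts annotated by adding nr to annotation #1.
    print("gain from nr:", ANNOTATED["#4"] - ANNOTATED["#1"])
```

Adding the entire nr database to annotation #1 annotates 111 more transcripts, about 0.4% of the dataset.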
Creating AnnoDBs from other databases
UniProt and NCBI-nr descriptor lines work with TCW. For other databases, you will need to make sure they
have a TCW-accepted descriptor line.
Description lines
The description line is the ">" line that describes the subsequent sequence in a FASTA file.
From it, runSingleTCW extracts:
- DBtype: used in naming the tab output file and is used in viewSingleTCW
to aid in identifying where the hitID is from.
- hitID: the unique identifier of the hit.
- description: generally the functional description
- species: the species
UniProt
>sp|Q9V2L2|1A1D_PYRAB Putative 1-ami OS=Pyrococcus abyssi GN=PYRAB00630 PE=3 SV=1v
- The 'sp' is the DBtype. For TrEMBL, the first two characters would be 'tr'.
- The third entry of the first string is the identifier (e.g. 1A1D_PYRAB)
- The string up to the OS is the description.
- The string after the "OS=" is the species.
NCBI nr (See Download NR)
>XP_642837.1 hypothetical protein DDB_G0276911 [Dictyostelium discoideum AX4]
- The first entry is the identifier (e.g. XP_642837.1). Note, there is no longer a way to detect
the database origin within the file, hence, the DBtype will be the generic 'PR' for protein.
- The text from the first space to the first "[" is the description.
- The text within the "[]" is the species.
As it does not have a "type code", its type will default to "PR". If the taxonomy is given as "nr", the
TCW abbreviation for this database will be PRnr.
Generic
If you have a file other than UniProt or nr, format the descriptor lines as follows:
>CC|ID description OS=species
- CC is the type code, and will be used as the DBtype in TCW.
- ID is the unique identifier
- Everything up to the OS is the description
- Everything after the OS is the species
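The three description-line formats can be summarized in one small parser. The Python sketch below follows the rules listed above; it is not the runSingleTCW parser, and the function name parse_descriptor is hypothetical. One assumption goes beyond the text: for UniProt lines, the species is cut at the next two-letter tag such as "GN=".

```python
import re


def parse_descriptor(line, default_type="PR"):
    """Split a FASTA description line into DBtype, hitID, description, species."""
    text = line.lstrip(">").strip()
    first, _, rest = text.partition(" ")
    parts = first.split("|")
    if len(parts) == 3:                      # UniProt: type|accession|identifier
        db_type, hit_id = parts[0], parts[2]
    elif len(parts) == 2:                    # Generic: CC|ID
        db_type, hit_id = parts[0], parts[1]
    else:                                    # no type code, e.g. NCBI nr
        db_type, hit_id = default_type, first

    if "OS=" in rest:                        # UniProt and Generic
        description, _, tail = rest.partition("OS=")
        # Assumption: stop the species at the next tag such as " GN=".
        species = re.split(r"\s[A-Z]{2}=", tail, maxsplit=1)[0]
    elif "[" in rest and rest.endswith("]"):  # NCBI nr: description [species]
        description, _, tail = rest.partition("[")
        species = tail[:-1]
    else:
        description, species = rest, ""
    return {
        "DBtype": db_type,
        "hitID": hit_id,
        "description": description.strip(),
        "species": species.strip(),
    }


if __name__ == "__main__":
    print(parse_descriptor(
        ">XP_642837.1 hypothetical protein DDB_G0276911 [Dictyostelium discoideum AX4]"))
```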
Example 1: The TCW perl script scripts/formatPlantTFDB.pl takes as input a file from
PlantTFDB,
which has header lines like:
>KFK36254.1 Arabis alpina|G2-like|G2-like family protein
and converts them to header lines:
>tf|G2_like_1 G2-like family protein {KFK36254.1} OS=Arabis alpina
The DBtype will be "tf". If the taxonomy entered into runSingleTCW is "plants",
the abbreviation for this database will be TFpla.
Example 2: The TCW python script scripts/formatNCBIrna.py takes as
input an RNA file from NCBI, which has header lines like:
>XM_002436391.2 PREDICTED: Sorghum bicolor GDP-mannose 4,6 dehydratase 1 (LOC8069086), mRNA
and converts them to header lines:
>XM_002436391.2 GDP-mannose 4,6 dehydratase 1 (LOC8069086), mRNA OS=Sorghum bicolor
As this does not have a type code at the beginning, its type will default to "NT".
If the taxonomy is entered as "sb", the abbreviation for this database will be NTsb.
The script can be modified to add a type code.
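The conversion in Example 2 can be sketched in Python as follows. This is not the code of scripts/formatNCBIrna.py; the function name and the type_code parameter are hypothetical, and the sketch assumes the species is always the first two words after the optional "PREDICTED: " prefix.

```python
def convert_ncbi_rna_header(line, type_code=None):
    """Rewrite an NCBI RNA header as '>ID description OS=species'.

    Assumption: the species is the first two words (genus and species)
    following the identifier and an optional 'PREDICTED: ' prefix.
    """
    text = line.lstrip(">").strip()
    seq_id, _, rest = text.partition(" ")
    prefix = "PREDICTED: "
    if rest.startswith(prefix):
        rest = rest[len(prefix):]
    words = rest.split(" ")
    species = " ".join(words[:2])
    description = " ".join(words[2:])
    if type_code:                      # optional type code, e.g. 'nt|ID'
        seq_id = f"{type_code}|{seq_id}"
    return f">{seq_id} {description} OS={species}"


if __name__ == "__main__":
    header = (">XM_002436391.2 PREDICTED: Sorghum bicolor "
              "GDP-mannose 4,6 dehydratase 1 (LOC8069086), mRNA")
    print(convert_ncbi_rna_header(header))
```

Passing a type_code shows one way a type code could be added, so that the DBtype no longer defaults to "NT".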
Entering AnnoDBs and GOs into runSingleTCW
Execute ./runSingleTCW and select your project.
→ Select Import Anno and a file chooser will pop up. Select either of the following to enter the names of the
UniProt in the AnnoDB table and the GO database:
projects/AnnoDBs_UniProt_<date>.cfg | This will use the AnnoDBs & GO written by AnnoDB.cfg.
projects/<project-name>/sTCW.cfg | This will use the AnnoDBs & GO used by another project.
→ Now you are ready to run Annotate with the UniProt and GO you just downloaded.
AnnoDBs can be entered using the Add button, where the taxonomy is defined. They can also be changed with Edit.
The GO database and GO slim category are defined or changed in the Options menu.
Why use taxonomic databases instead of the full UniProt
viewSingleTCW refers to the annoDBs by the 'DBtype' and 'taxonomy', with them
combined into 'DBtax'. The DBtype and taxonomy can be queried on and columns of the data viewed.
The "sp" is SwissProt and the "tr" is "TrEMBL".
The following shows an example of a set of hit proteins:
The following shows a table of sequences:
The following shows the details of a specific sequence:
The following is an example record in the OBO file:
[Term]
id: GO:0000785
name: chromatin
namespace: cellular_component
alt_id: GO:0000789
alt_id: GO:0000790
alt_id: GO:0005717
def: "The ordered and organized complex of DNA, protein, ....
comment: Chromosomes include parts that are not part of ....
synonym: "chromosome scaffold" RELATED []
synonym: "cytoplasmic chromatin" NARROW []
synonym: "nuclear chromatin" NARROW []
xref: NIF_Subcellular:sao1615953555
is_a: GO:0110165 ! cellular anatomical entity
relationship: part_of GO:0005694 ! chromosome
TCW parses for the following keywords:
Keyword | AmiGO term | TCW term | Example
---|---|---|---
id | Accession | GO ID | GO:0000785
name | Name | Description | chromatin
namespace | Ontology | Domain | cellular_component
is_a | is_a | is_a | GO:0110165
relationship: part_of | ? | part_of | GO:0005694
alt_id | Alternate ID | Alternate ID | GO:0000790
 | replaced by | Replaced by | GO:0000785
is_obsolete: true | Name: obsolete | Description: obsolete | obsolete replicative cell aging
Views in AmiGO and TCW:
NOTES:
- UniProt occasionally uses the Alternate IDs and has a few Obsolete GO terms.
- I cannot guarantee that AmiGO always treats "alt_id" as specified here.
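As an illustration of the keyword table above, the following Python sketch reads [Term] stanzas from an OBO file and collects those keywords. It is not the TCW parser; the function name and dictionary keys are hypothetical, and replaced_by is the OBO tag assumed for the "Replaced by" row.

```python
def parse_obo_terms(lines):
    """Yield one dictionary per [Term] stanza in an OBO file."""
    term = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("["):              # a new stanza begins
            if term:
                yield term
            # Only [Term] stanzas are kept; [Typedef] etc. are skipped.
            term = ({"alt_id": [], "is_a": [], "part_of": [], "obsolete": False}
                    if line == "[Term]" else None)
        elif term is not None and ": " in line:
            key, value = line.split(": ", 1)
            if key in ("id", "name", "namespace", "replaced_by"):
                term[key] = value
            elif key == "alt_id":
                term["alt_id"].append(value)
            elif key == "is_a":
                term["is_a"].append(value.split()[0])     # drop '! name'
            elif key == "relationship" and value.startswith("part_of "):
                term["part_of"].append(value.split()[1])
            elif key == "is_obsolete":
                term["obsolete"] = (value == "true")
    if term:
        yield term


if __name__ == "__main__":
    record = [
        "[Term]",
        "id: GO:0000785",
        "name: chromatin",
        "namespace: cellular_component",
        "alt_id: GO:0000789",
        "is_a: GO:0110165 ! cellular anatomical entity",
        "relationship: part_of GO:0005694 ! chromosome",
    ]
    for t in parse_obo_terms(record):
        print(t["id"], t["name"], t["is_a"], t["part_of"])
```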
To download the UniProt files without runAS:
- Go to UniProt Downloads.
- In the second line from the top, it says "For downloading complete data sets we recommend
using ftp.uniprot.org." Click the ftp.uniprot.org.
- This brings up the UniProt download directories in a Finder window. You may view it as "Guest".
- Click "Current_release", "knowledgebase". Here you will see "complete" and "taxonomic_divisions".
The NCBI-nr database can be downloaded:
- NCBI nr
(https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA).
- As of 24-Jan-21, it is 89GB and took 1h:45m to download.
- It is called nr.gz; since the File Chooser requires a FASTA suffix,
rename it: mv nr.gz nr.fa.gz
GO Basic OBO file:
http://geneontology.org/docs/download-ontology/