AGCoL runAS - Annotation Setup Guide

To prepare for annotation with runSingleTCW, it is necessary to download the databases to compare against. The runAS program provides support for downloading the taxonomic and full UniProts along with mappings from the UniProt IDs to GO, KEGG, Pfam, EC, and InterPro.

Tested: runAS has been tested on Linux and MacOS. If you have any problems, please let me know at tcw at agcol.arizona.edu.

Contents:
  1. Overview
  2. Running the demo - Recommended
  3. Using runAS
  4. Details and file structure
  5. Cleanup
  6. Memory and Time
  7. What AnnoDBs to use
  8. Creating AnnoDBs from other databases (e.g. NCBI-nr)
  9. Entering AnnoDBs and GOs into runSingleTCW
  10. Why use taxonomic databases
  11. Parsing go-basic.obo
  12. Links to relevant databases

Overview

Terminology:

The term "AnnoDB" refers to any database that will be used for annotation, i.e. the sequences in TCW will be searched against all AnnoDB databases and the hits stored in the single TCW database (sTCWdb) for query.

Requirements:

runAS uses curl for downloading AnnoDBs and the GO database.
You can install curl on most Linux machines with 'sudo yum install curl', and MacOS comes with it. If you cannot install it, runAS will prompt you as shown on the right; if you select Continue, it will perform the download with its own Java code, though this may take longer and is less robust, e.g. it is more susceptible to problems caused by network latency.
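
Before starting runAS, you can check whether curl is on your PATH. The following is a minimal Python sketch and is not part of TCW; the function name have_curl is hypothetical.

```python
import shutil


def have_curl() -> bool:
    """Return True if a 'curl' executable is found on the PATH."""
    return shutil.which("curl") is not None


if __name__ == "__main__":
    if have_curl():
        print("curl found at", shutil.which("curl"))
    else:
        # Per the guide, runAS then offers its slower built-in Java download.
        print("curl not found on PATH")
```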

Processing steps: The TCW runAS will perform the following:

  1. Create the directory under projects/DBfasta for the downloads and generated FASTA files.
  2. Download the selected Taxonomic UniProts .dat files and create FASTA files.
  3. Download the selected full UniProt .dat file and create a FASTA file of the sequences not found in any of the downloaded taxonomic files.
  4. Create GO database, which contains mappings from UniProt IDs to GO, KEGG, EC, Pfam and InterPro.
    1. Download go-basic.obo from http://current.geneontology.org/ontology/
    2. Create a local mySQL GO database (GOdb) with the information from this file.
    3. Add information to the GOdb from the .fasta and .dat files in the UniProt directory.
  5. Create the file projects/AnnoDBs_UniProt_<date>.cfg to be imported to runSingleTCW.
Important:
  • Memory and Time: This can take a lot of memory and time, so make sure to read this section.
  • What AnnoDBs to use: To reduce the memory and time, make sure to read this section.
  • Creating AnnoDBs from other...: Other databases, such as NCBI nr, can be used for annotation but they will not have GO, KEGG, EC, Pfam, or InterPro.

Running the demo

The TCW package provides subsets of UniProt for annotating the demo. In order to add GO annotations, a local GO mySQL 'demo' database needs to be created.
  1. From the TCW_4 directory, execute:
      ./runAS -d
    The "-d" will cause it to enter the demo parameters, as shown on the right. The highlighted entries already exist. It is only necessary to build the GO database.
     
     
  2. Execute Build GO.
    → The GO tables are available for the demo, i.e. they will not be downloaded.
    → When you select Build GO, a popup will say "GO files exist. Build GO database Only"; click Continue.
    → Building the GO database takes 3 to 10 minutes.

Details about the Demo setup

In the projects/DBfasta directory, there are the sub-directories UniProt_demo and GO_obodemo, which contain the following:
  GO_obodemo:
     go_basic.obo

  UniProt_demo:
    sp_bacteria/  sp_fungi/          sp_plants/       tr_plants/
    sp_full/      sp_invertebrates/  tr_invertebrates/
Each taxonomic directory has a .dat and a .fasta file, which are very small subsets of the original UniProt taxonomic .dat file.

Using runAS

Typically, all you need to do is make sure you have an internet connection open and that you have enough disk space (see Memory), then start the interface shown on the lower left by typing at the command line: ./runAS
  1. The TCW Annotation Directories define where the files will be put. TCW provides defaults as shown on the right; it is recommended you use the defaults.
     
  2. Select the taxonomic databases you want to use, then select Build Tax, which downloads the respective .dat.gz files and creates FASTA files.
     
  3. Select the full databases you want to use, then select Build Full, which downloads the respective .dat.gz file and creates a subset FASTA file that only contains the sequences not in the downloaded taxonomic FASTA files. See Full subsets for more detail.
     
  4. Select Build GO, which downloads the GO OBO file and creates a local MySQL GO database with mappings for the UniProts in your downloaded set. This uses the information in HOSTS.cfg.
     
  5. Select AnnoDB.cfg, which writes a file called projects/AnnoDBs_UniProt_<date>.cfg that contains all the information downloaded; this can be used as input to runSingleTCW (see Import AnnoDBs).
     
Check: The Check function automatically runs on startup and after any Build. It highlights everything that has been done. For example, the figure above shows that fungi and plant SwissProt have been downloaded and processed.

To force a check, or to view the UniProts in an existing goDB, select the Check button.

A log of the processing is written to projects/DBfasta/logs/runAS.log. See the log file for an example.


Important points:
  • runAS will not replace an existing downloaded file: It will overwrite a .fasta file, but never a .dat file. If you want a .dat file downloaded again, you must remove it yourself.
     
  • Build GOdb only after all desired taxonomic and full databases are downloaded: It is important that you create the GO database right after downloading the UniProt files so that they correspond. It is also important that you have downloaded all desired taxonomic and full UniProt databases.
     
  • Only download what you need! See Memory and Time and What AnnoDBs to use.
     
  • runMultiTCW: If multiple sTCWdbs are to be compared using runMultiTCW, it is important that they use the same set of AnnoDBs and GO database (see Entering AnnoDBs).
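
The first point above (a downloaded .dat file is never replaced) can be sketched as follows. This is only an illustration in Python, not the runAS code: the function name curl_command is hypothetical, and the URL layout of the UniProt download site is an assumption that should be checked against the site itself.

```python
from pathlib import Path
from typing import List, Optional

# Assumed location of the taxonomic division files on the UniProt download site.
BASE = ("https://ftp.uniprot.org/pub/databases/uniprot/current_release/"
        "knowledgebase/taxonomic_divisions")


def curl_command(db: str, taxon: str, uniprot_dir: Path) -> Optional[List[str]]:
    """Return a curl command for one .dat.gz file, or None if it already exists."""
    name = f"uniprot_{db}_{taxon}.dat.gz"          # e.g. uniprot_sprot_plants.dat.gz
    prefix = "sp" if db == "sprot" else "tr"       # directory names: sp_plants, tr_plants
    dest = uniprot_dir / f"{prefix}_{taxon}" / name
    if dest.exists():
        return None                                # never overwrite a downloaded file
    return ["curl", "--fail", "--location", "--output", str(dest), f"{BASE}/{name}"]
```

The function only builds the command; to download the file again, remove the existing .dat.gz file first, as described above.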

Full subsets:
When you select Build Full, a pop-up similar to the one on the right will be shown; it lists only the taxonomic names that correspond to downloaded taxonomic .dat files. This allows you to create different subsets. Typically, you will only want one subset, which is the one corresponding to the taxonomic files downloaded.

The FASTA file will have a suffix indicating what subset it corresponds to. For example, the selection on the right would create the file uniprot_sprot_xBFxIxPxxV.fasta, where the 10 characters represent the 10 taxonomic databases in alphabetic order, and the capital letters represent the taxonomic sequences removed (Bacteria, Fungi, Invertebrate, Plant, Virus).

Details: You may unselect all entries, in which case a FASTA file of all sequences is created. When runSingleTCW loads UniProt IDs, it only loads the first occurrence of a UniProt ID, so duplicates will not cause errors. However, by using the proper subset, processing is faster and the e-values are lower since there are fewer sequences in the database. You may create new subsets at any time as it does not affect the GOdb. Only one file will be shown in the AnnoDB.cfg file.
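
The file suffix described above can be reproduced with a short Python sketch. The list of the 10 taxonomic division names is an assumption based on the UniProt taxonomic divisions, and the function names are hypothetical; this is not the runAS code.

```python
# Assumed names of the 10 UniProt taxonomic divisions, in alphabetical order.
TAXA = ["archaea", "bacteria", "fungi", "human", "invertebrates",
        "mammals", "plants", "rodents", "vertebrates", "viruses"]


def subset_suffix(removed):
    """One character per division: its capital initial if removed, else 'x'."""
    return "".join(t[0].upper() if t in removed else "x" for t in TAXA)


def subset_filename(removed, prefix="uniprot_sprot"):
    """Name of the full-subset FASTA file for the given removed divisions."""
    return f"{prefix}_{subset_suffix(removed)}.fasta"


print(subset_filename({"bacteria", "fungi", "invertebrates", "plants", "viruses"}))
# prints uniprot_sprot_xBFxIxPxxV.fasta
```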


Details and file structure

Check: Select to update the highlighting, as discussed below. The Check function is automatically run on startup, and when any of the three "Builds" are executed.

Label Highlights

  • At the top:
    • If the UniProt directory label is highlighted in blue, it exists.
    • If the GO directory label is highlighted in pink, it exists but the GO OBO file has not been downloaded.
      If the GO directory label is highlighted in blue, the GO OBO file has been downloaded.
  • On the middle right:
    • If the GO Database label is highlighted in blue, the GO database exists.
Taxonomic and Full UniProt Highlights

Clear checkbox: If a taxonomic checkbox is clear, then neither the .dat file nor the .fasta file exists for it. When you check the box followed by Build Tax, you will need to confirm a popup that states "Download SP - xxx", where xxx will be the list of files to download. The download is always automatically followed by creating the .fasta files. The same applies to the Full checkboxes.

Pink checkbox: If the .dat file exists, but the .fasta file does not, the checkbox will be highlighted pink. Check the pink box(es) and run Build Tax in order to create the .fasta file only. The same applies to the Full checkboxes.

Blue checkbox: If both the .dat file and the .fasta file exist, the checkbox will be highlighted blue.
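
The three checkbox states can be summarized with a small Python sketch that inspects one taxonomic or full directory. It is an illustration only, not the Check function itself; the function name checkbox_state is hypothetical, and the guide does not say how a directory with a .fasta file but no .dat file is shown, so the sketch reports that case as "undefined".

```python
from pathlib import Path


def checkbox_state(tax_dir: Path) -> str:
    """Classify a directory such as sp_plants as 'clear', 'pink' or 'blue'."""
    if not tax_dir.is_dir():
        return "clear"
    names = [p.name for p in tax_dir.iterdir()]
    has_dat = any(n.endswith((".dat", ".dat.gz")) for n in names)
    has_fasta = any(n.endswith((".fasta", ".fasta.gz")) for n in names)
    if has_dat and has_fasta:
        return "blue"        # downloaded and FASTA created
    if has_dat:
        return "pink"        # downloaded, FASTA still to be created
    if has_fasta:
        return "undefined"   # assumption: case not described in the guide
    return "clear"           # nothing downloaded yet
```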

File Structure

For each taxonomic and full UniProt that you downloaded, a directory will be created under the UniProt directory. For example,

  ./TCW/projects/DBfasta/UniProt_Dec2021%> ls *
  sp_archaea:
  uniprot_sprot_archaea.dat.gz     uniprot_sprot_archaea.fasta

  sp_full:
  uniprot_sprot.dat.gz     uniprot_sprot_AxxxxxxxxV.fasta

  sp_viruses:
  uniprot_sprot_viruses.dat.gz      uniprot_sprot_viruses.fasta
When you run the BLAST or DIAMOND search programs from runSingleTCW, the formatted files will be placed in the corresponding directory.

Compress Fasta: If you plan on using DIAMOND as the search program, you may compress the fasta files after download, e.g.

  cd projects/DBfasta/UniProt_<date>
  gzip */*.fasta

GO (Gene Ontology)

The go-basic.obo file is downloaded from http://current.geneontology.org/ontology/.

Database: This text entry on the runAS interface is the name of the GO MySQL database that will be created; you will enter this name in runSingleTCW.

The processing steps are as follows:

  1. Download the GO Basic OBO file to the GO directory.
  2. Build a GO-specific MySQL database (referred to as GOdb) with the contents of the file.
  3. Add the UniProts from all subdirectories under the UniProt directory (e.g. projects/DBfasta/UniProt_Mar2021) to the GOdb.

Clean up

runAS does not remove files that are no longer necessary, which are the files downloaded from the internet:
  • All "dat.gz" files in the UniProt directories, as the information has been transferred to the FASTA files and GO database.
  • The GO directory, as the information has been transferred to the GO database.
These files can be removed, as runSingleTCW uses the FASTA files in the UniProt directories and the GO MySQL database. However, if you do not have a space problem, keep them for insurance; when UniProt does the monthly update, your downloaded files will no longer be available on their site.

For the FASTA files that you will be using DIAMOND to search against, you can gzip them as DIAMOND can search against gzipped files.

When you are calculating space, remember that the BLAST and DIAMOND programs will format the .fasta file, which takes up even more space. For example:

  /TCW/projects/DBfasta/UniProt_Dec2021/sp_full% ls -hlG
  -rw-r--r--  1 cari  staff   597M Dec 20 07:07 uniprot_sprot.dat.gz
  -rw-r--r--  1 cari  staff    54M Dec 20 15:55 uniprot_sprot_xBFxIxPxxV.fasta
  -rw-r--r--  1 cari  staff    55M Dec 20 16:15 uniprot_sprot_xBFxIxPxxV.fasta.dmnd

Memory and Time

Taxonomic

The downloads below were performed on 6-Jun-2021, a Sunday afternoon, onto a Linux machine with a ~500 Mbps download connection and 128Gb of RAM. Note that download times can vary considerably.

  File                                  .dat Size  Download     .fasta Size [1]  Creation
  uniprot_sprot_bacteria.dat.gz         203Mb      0m:27s       150Mb            0m:25s
  uniprot_sprot_fungi.dat.gz            49Mb       0m:15s       21Mb             0m:04s
  uniprot_sprot_invertebrates.dat.gz    34Mb       0m:05s       14Mb             0m:02s
  uniprot_sprot_plants.dat.gz           51Mb       0m:10s       21Mb             0m:04s
  uniprot_sprot_viruses.dat.gz          16Mb       0m:06s       9Mb              0m:01s
  uniprot_sprot.dat.gz                  587Mb      1m:09s       55Mb [2]         1m:43s
  uniprot_trembl_bacteria.dat.gz        87Gb       1h:57m:02s   64Gb             2h:24m:45s
  uniprot_trembl_fungi.dat.gz           8.3Gb      13m:41s      7.4Gb            13m:33s
  uniprot_trembl_invertebrates.dat.gz   7.8Gb      12m:05s      6.8Gb            12m:50s
  uniprot_trembl_plants.dat.gz          12Gb       16m:21s      10.3Gb           18m:57s
  uniprot_trembl_viruses.dat.gz         3.5Gb      5m:20s       2.3Gb            5m:21s
[1] When TCW extracts the sequence into a FASTA file, it is not written in a gzipped format. However, if you are going to use DIAMOND, you can zip them (the uniprot_trembl_bacteria.fasta zipped file is 31Gb).
[2] The subset, i.e. full SwissProt minus all downloaded taxonomic entries.
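
To compare the timings above or to estimate your own, the durations can be converted to seconds. The following Python sketch is only a convenience for reading the table; the function names are hypothetical, and the conversion of 1 Gb to 1024 Mb is an assumption about the units used.

```python
import re

UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}


def to_seconds(text: str) -> int:
    """Convert a duration such as '1h:57m:02s' or '0m:27s' to seconds."""
    return sum(int(n) * UNIT_SECONDS[u] for n, u in re.findall(r"(\d+)([hms])", text))


def rate_mb_per_s(size_gb: float, duration: str) -> float:
    """Average rate in Mb per second (assumes 1 Gb = 1024 Mb)."""
    return size_gb * 1024 / to_seconds(duration)


print(to_seconds("1h:57m:02s"))          # prints 7022
print(rate_mb_per_s(87, "1h:57m:02s"))   # average rate for uniprot_trembl_bacteria.dat.gz
```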

GO database

It takes less than a minute to download the GO file. The time it takes to build the GO database is proportional to the number of UniProts to be processed. For example,

  Machine                      AnnoDBs                                  Time              Database size
  MacOS Catalina               SwissProt Plant and Full, TrEMBL Plant   24m:47s           2.7Gb
  Linux (as specified above)   The 11 taxonomic and full listed above   10h:42m:27s [1]   26Gb [1]
[1] Most of the time, 8h:23m:30s, was for loading the uniprot_trembl_bacteria.dat, which would also account for the database size.

What AnnoDBs to use

Strong suggestions:
  • Only download what is relevant!
    • Download all relevant SwissProt files and the Full SwissProt UniProt.
    • Download only the most relevant TrEMBL files, and never the Full TrEMBL UniProt unless absolutely necessary.
  • Do not perform constant downloads; it is a drain on the UniProt servers.
    The UniProts do not change that fast, and a new download can change the 'best' hits in TCW, which can disturb any on-going analysis.
Evidence: In order to show that it is sufficient to just download the most relevant databases, the following test was performed.

The dataset used for the following tests is from de novo assembled sequences from Andropogon gerardii, which is related to Sorghum. It was downloaded from Dryad and published by Hoffman and Smith (2017). The full dataset had >60k transcripts, which was reduced to 27,085 for faster tests.

Four annotations were compared:

  Annotation  AnnoDBs                                                           #Annotated
  #1          sp_plants, tr_plants, sp_full                                     25,049 (92.5%)
  #2          #1 + sp_virus, sp_fungi, sp_invertebrate, sp_bacteria             25,052 (92.5%)
  #3          #2 + tr_virus, tr_fungi, tr_invertebrate, tr_bacteria, tr_full    25,070 (92.6%)
  #4          #1 + nr                                                           25,160 (92.9%)

Using only sp_plants, tr_plants, and sp_full, 92.5% of the transcripts were annotated, compared with 92.9% when the entire NR database was added. If your organism is not closely related to any model organism, then there will likely be a bigger difference.
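
The percentages follow directly from the counts and the 27,085 transcripts in the reduced dataset. A short Python check, using only the numbers given above:

```python
TOTAL = 27085  # transcripts in the reduced test dataset

annotated = {"#1": 25049, "#2": 25052, "#3": 25070, "#4": 25160}


def percent(count: int, total: int = TOTAL) -> str:
    """Percentage of annotated transcripts, rounded to one decimal place."""
    return f"{100 * count / total:.1f}%"


for label, count in annotated.items():
    print(label, count, percent(count))

# Extra transcripts annotated by adding nr to annotation #1.
print(annotated["#4"] - annotated["#1"])   # prints 111
```

In other words, adding the entire nr database to annotation #1 annotated 111 additional transcripts out of 27,085.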

Creating AnnoDBs from other databases

UniProt and NCBI-nr descriptor lines work with TCW. For other databases, you will need to make sure they have a TCW-accepted descriptor line.

Description lines

The description line is the ">" line that describes the subsequent sequence in a FASTA file. From it, runSingleTCW extracts:

  • DBtype: used in naming the tab output file and is used in viewSingleTCW to aid in identifying where the hitID is from.
  • hitID: the unique identifier of the hit.
  • description: generally the functional description
  • species: the species

UniProt
  >sp|Q9V2L2|1A1D_PYRAB Putative 1-ami OS=Pyrococcus abyssi GN=PYRAB00630 PE=3 SV=1v
  • The 'sp' is the DBtype. For TrEMBL, the first two characters would be 'tr'.
  • The third entry of the first string is the identifier (e.g. 1A1D_PYRAB)
  • The string up to the OS is the description.
  • The string after the "OS=" is the species.
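
The rules above can be illustrated with a minimal Python sketch. It is not the runSingleTCW parser; the function name is hypothetical, and the sketch assumes the species ends where the next two-letter KEY= field (e.g. GN=) begins.

```python
import re


def parse_uniprot_header(line: str) -> dict:
    """Split a UniProt FASTA '>' line into DBtype, hitID, description, species."""
    first, _, rest = line.lstrip(">").partition(" ")
    dbtype, _accession, hit_id = first.split("|")      # e.g. sp|Q9V2L2|1A1D_PYRAB
    description, _, after_os = rest.partition(" OS=")
    # Assumption: the species stops at the next KEY= field such as GN= or PE=.
    species = re.split(r"\s+[A-Z]{2}=", after_os, maxsplit=1)[0]
    return {"DBtype": dbtype, "hitID": hit_id,
            "description": description.strip(), "species": species.strip()}


hit = parse_uniprot_header(
    ">sp|Q9V2L2|1A1D_PYRAB Putative 1-ami OS=Pyrococcus abyssi GN=PYRAB00630 PE=3 SV=1")
print(hit["DBtype"], hit["hitID"], hit["species"])
# prints sp 1A1D_PYRAB Pyrococcus abyssi
```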

NCBI nr (See Download NR)

  >XP_642837.1 hypothetical protein DDB_G0276911 [Dictyostelium discoideum AX4]
  • The first entry is the identifier (e.g. XP_642837.1). Note, there is no longer a way to detect the database origin within the file, hence, the DBtype will be the generic 'PR' for protein.
  • The text from the first space to the first "[" is the description.
  • The text within the "[]" is the species.
As it does not have a "type code", its type will default to "PR". If the taxonomy is given as "nr", the TCW abbreviation for this database will be PRnr.
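
A corresponding Python sketch for the nr descriptor line is shown here. Again, this is an illustration under the rules stated above and not the TCW code; the function name is hypothetical.

```python
def parse_nr_header(line: str, dbtype: str = "PR") -> dict:
    """Split an NCBI nr '>' line into hitID, description and species."""
    hit_id, _, rest = line.lstrip(">").partition(" ")   # first entry is the identifier
    description, _, after = rest.partition("[")         # text up to the first "["
    species = after.partition("]")[0]                   # text within the "[]"
    return {"DBtype": dbtype, "hitID": hit_id,
            "description": description.strip(), "species": species.strip()}


hit = parse_nr_header(
    ">XP_642837.1 hypothetical protein DDB_G0276911 [Dictyostelium discoideum AX4]")
print(hit["hitID"], "|", hit["description"], "|", hit["species"])
# prints XP_642837.1 | hypothetical protein DDB_G0276911 | Dictyostelium discoideum AX4
```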

Generic

If you have a file other than UniProt or nr, format the descriptor lines as follows:

  >CC|ID description OS=species
  • CC is the type code, and will be used as the DBtype in TCW.
  • ID is the unique identifier
  • Everything up to the OS is the description
  • Everything after the OS is the species

Example 1: The TCW Perl script scripts/formatPlantTFDB.pl takes as input a file from PlantTFDB, which has header lines like:

  >KFK36254.1 Arabis alpina|G2-like|G2-like family protein
and converts them to header lines:
  >tf|G2_like_1 G2-like family protein {KFK36254.1} OS=Arabis alpina
The DBtype will be "tf". If the taxonomy entered into runSingleTCW is "plants", the abbreviation for this database will be TFpla.
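
For readers who prefer Python, the conversion in Example 1 might look like the following sketch. It is not a port of formatPlantTFDB.pl: the function name is hypothetical, and the way the identifier is built (family name with "-" replaced by "_" plus a running number per family) is an assumption inferred from the single example above.

```python
from collections import Counter


def convert_planttfdb(headers):
    """Rewrite PlantTFDB '>' lines as '>tf|ID description {accession} OS=species'."""
    seen = Counter()
    converted = []
    for line in headers:
        accession, _, rest = line.lstrip(">").partition(" ")
        species, family, description = rest.split("|", 2)
        key = family.replace("-", "_")
        seen[key] += 1   # assumption: IDs are numbered per family in input order
        converted.append(
            f">tf|{key}_{seen[key]} {description} {{{accession}}} OS={species}")
    return converted


print(convert_planttfdb([">KFK36254.1 Arabis alpina|G2-like|G2-like family protein"])[0])
# prints >tf|G2_like_1 G2-like family protein {KFK36254.1} OS=Arabis alpina
```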

Example 2: The TCW Python script scripts/formatNCBIrna.py takes as input an RNA file from NCBI, which has header lines like:

  >XM_002436391.2 PREDICTED: Sorghum bicolor GDP-mannose 4,6 dehydratase 1 (LOC8069086), mRNA
and converts them to header lines:
  >XM_002436391.2 GDP-mannose 4,6 dehydratase 1 (LOC8069086), mRNA OS=Sorghum bicolor
As this does not have a type code at the beginning, its type will default to "NT". If the taxonomy is entered as "sb", the abbreviation for this database will be NTsb. The script can be modified to add a type code.
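
The conversion in Example 2, including the optional type code, could be sketched as below. This is not the contents of formatNCBIrna.py; the function name is hypothetical, and the sketch assumes the species is a two-word name at the start of the text, after an optional "PREDICTED: " prefix.

```python
def convert_ncbi_rna(line: str, type_code: str = "") -> str:
    """Rewrite an NCBI RNA '>' line as '>[CC|]ID description OS=species'."""
    accession, _, rest = line.lstrip(">").partition(" ")
    if rest.startswith("PREDICTED: "):
        rest = rest[len("PREDICTED: "):]
    # Assumption: the species is the first two words (genus and species epithet).
    genus, epithet, description = rest.split(" ", 2)
    prefix = f"{type_code}|" if type_code else ""
    return f">{prefix}{accession} {description} OS={genus} {epithet}"


print(convert_ncbi_rna(
    ">XM_002436391.2 PREDICTED: Sorghum bicolor GDP-mannose 4,6 dehydratase 1 (LOC8069086), mRNA"))
# prints >XM_002436391.2 GDP-mannose 4,6 dehydratase 1 (LOC8069086), mRNA OS=Sorghum bicolor
```

Passing a type code, e.g. convert_ncbi_rna(line, "nt"), would produce a header that starts with ">nt|XM_002436391.2"; the code "nt" is only an example.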

Entering AnnoDBs and GOs into runSingleTCW

Execute ./runSingleTCW and select your project.

→ Select Import Anno; a file chooser will pop up. Select either of the following to enter the names of the UniProts in the AnnoDB table and the GO database:

  • projects/AnnoDBs_UniProt_<date>.cfg: This will use the AnnoDBs & GO written by AnnoDB.cfg.
  • projects/<project-name>/sTCW.cfg: This will use the AnnoDBs & GO used by another project.

→ Now you are ready to run Annotate with the UniProt and GO you just downloaded.

AnnoDBs can be entered using the Add button, where the taxonomy is defined. They can also be changed with Edit.

The GO database and GO slim category are defined or changed in the Options menu.

Why use taxonomic databases instead of the full UniProt

viewSingleTCW refers to the annoDBs by the 'DBtype' and 'taxonomy', with them combined into 'DBtax'. The DBtype and taxonomy can be queried on and columns of the data viewed. The "sp" is SwissProt and the "tr" is "TrEMBL".

The following shows an example of a set of hit proteins:

The following shows a table of sequences:

The following shows the details of a specific sequence:

Parsing go-basic.obo

The following is an example record in the OBO file:
  [Term]
  id: GO:0000785
  name: chromatin
  namespace: cellular_component
  alt_id: GO:0000789
  alt_id: GO:0000790
  alt_id: GO:0005717
  def: "The ordered and organized complex of DNA, protein, ....
  comment: Chromosomes include parts that are not part of  ....
  synonym: "chromosome scaffold" RELATED []
  synonym: "cytoplasmic chromatin" NARROW []
  synonym: "nuclear chromatin" NARROW []
  xref: NIF_Subcellular:sao1615953555
  is_a: GO:0110165 ! cellular anatomical entity
  relationship: part_of GO:0005694 ! chromosome
TCW parses for the following keywords:
  Keyword                 AmiGO term      TCW term               Example
  id                      Accession       GO ID                  GO:0000785
  name                    Name            Description            chromatin
  namespace               Ontology        Domain                 cellular_component
  is_a                    is_a            is_a                   GO:0110165
  relationship: part_of   ?               part_of                GO:0005694
  alt_id                  Alternate ID    Alternate ID           GO:0000790
                          replaced by     Replaced by            GO:0000785
  is_obsolete: true       Name: obsolete  Description: obsolete  obsolete replicative cell aging

[Figure: views in AmiGO and TCW]

NOTES:
  1. UniProt occasionally uses the Alternate IDs and has a few Obsolete GO terms.
  2. I cannot guarantee that AmiGO always treats "alt_id" as specified here.
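
To make the keyword handling concrete, here is a minimal Python sketch that reads [Term] records from an OBO file and keeps the keywords listed above. It is not the TCW parser; the function name and the dictionary layout are hypothetical.

```python
def parse_obo_terms(lines):
    """Yield one dict per [Term] record with the keywords discussed above."""
    term = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("["):                    # a new record starts
            if term is not None:
                yield term
            # Only [Term] records are kept; others (e.g. [Typedef]) are skipped.
            term = {"alt_id": [], "is_a": [], "part_of": []} if line == "[Term]" else None
        elif term is not None and ": " in line:
            key, value = line.split(": ", 1)
            if key in ("id", "name", "namespace"):
                term[key] = value
            elif key == "alt_id":
                term["alt_id"].append(value)
            elif key == "is_a":
                term["is_a"].append(value.split(" ! ")[0])   # drop the "! name" part
            elif key == "relationship" and value.startswith("part_of "):
                term["part_of"].append(value.split()[1])
            elif key == "is_obsolete":
                term["is_obsolete"] = (value == "true")
    if term is not None:
        yield term
```

For the example record above, the sketch yields one term with id GO:0000785, three alt_id values, one is_a parent (GO:0110165) and one part_of parent (GO:0005694). A typical call would be list(parse_obo_terms(open("go-basic.obo"))).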

Links to relevant databases

To download the UniProt files without runAS:
  • Go to UniProt Downloads.
  • In the second line from the top, it says "For downloading complete data sets we recommend using ftp.uniprot.org." Click the ftp.uniprot.org.
  • This brings up the UniProt download directories in a Finder window. You may view it as "Guest".
  • Click "Current_release", "knowledgebase". Here you will see "complete" and "taxonomic_divisions".
The NCBI-nr database can be downloaded:
  • NCBI nr (https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA).
  • As of 24-Jan-21, it is 89GB and took 1h:45m to download.
  • It is called nr.gz; since the File Chooser requires a FASTA suffix, rename it: mv nr.gz nr.fa.gz
GO Basic OBO file: http://geneontology.org/docs/download-ontology/

Email Comments To: tcw@agcol.arizona.edu