SyMAP NCBI

NCBI convert to SyMAP

This document corresponds to the ConvertNCBI release with SyMAP v5.0.2 and updated for SyMAP v5.1.2 (dated August-2022). It has been tested on these genomes.

NCBI supplies FASTA formatted files for genome sequence and GFF3 formatted files for the annotation, where FASTA and GFF3 files are the input to SyMAP. However, using them directly can cause problems. The following provides a simple scheme to produce only the files necessary.

Contents

1.	Download
2.	Convert files
3.	Load files into SyMAP

5.	Scaffolds
6.	Editing the script
4.	What the ConvertNCBI script does

Download

Go to NCBI.
As shown in Fig 1:
1. Select "Genome" from the pull-down at the top.
2. Enter you genome name followed by "Search". You should see a page similar to Fig 2.
The following are two approaches to download the FASTA and GFF files:
1. As shown in Fig 2:
  1. Use the genome link beside the "Download sequences in FASTA format for genome,..."
  2. Use the GFF link beside the "Download genome annotation in GFF...".
  3. Process as described in Convert A.
2. As shown in Fig 2 and Fig 3:
  1. Go to the NCBI Datasets page (link at bottom of Fig 2). This will bring up a window as shown in the background of Fig 3.
  2. Select the assembly you want, then click on the Action dots, which brings up a menu; select Download. That brings up a window as shown in Fig 3.
  3. Select the file source, then genome FASTA and annotation GFF for download.
  4. Process as described in Convert B.

If NCBI only provides the Genbank format, try using bioconvert to convert to GFF3, followed by the SyMAP converter. Alternatively, try Galaxy.

Fig 1. Search the NCBI site.

Fig 2. Download A: download the genome and GFF files.

Fig 3. Download B: from NCBI Datasets, select the genome and GFF files.

ConvertNCBI optional flags

Flag	Description	Details	Default
-s	Include Scaffolds	Any sequence not labeled a "chromosome" will be written as a scaffold. See section Scaffolds	No scaffolds
-m	Hard-mask	NCBI genome sequences are soft-masked (generally for download A only), which is changed to hard masked	Leave as soft-mask
-p	1st protein name	1st protein name for the gene will be at the end of the `Attributes` using the `protein=` keyword. This can be searched using the Queries	Do not include
-pa	All protein names	Same as above, but all protein names for the gene will be included in the `protein=` keyword.	Do not include
-l	Use linkage groups	Search 'linkage' instead of 'chromosome'	Use chromosomes
-v	Verbose	Print out header lines of skipped sequences	No print

Beware: There are variations in the text associated with the FASTA ">" header lines. The rules used by this script are as follows:

If the ">" line contains "chromosome N", where N={number, X, Y or roman numeral}, than it is identified as such.
If the ">" line contains "scaffold" or "contig", than it is identified as a scaffold (if "chromosome" is also on the line, it is still treated as a scaffold).
All other ">" entries are considered "other".

It is a real good idea to execute the following on your fasta file:

   zgrep ">" [prefix]._genomic.fna.gz

This shows you the lines and confirms that the above rules will work for your species. If not, then you will need to edit the script to work (or ask me to at symap@agcol.arizona.edu).

Convert files

Go to top

The following conversions were tested on the Oryza Sativa NCBI files on 15-Aug-22. The download A genome sequence was soft-masked, but the download B was hard-masked.

Convert files from FASTA and GFF (download A)

Go to the symap_5/data/seq directory.
Make a subdirectory for your species and move the FASTA and GFF files into the directory.
From the seq directory, type the following at the command line to copy the ConvertNCBI script:
```
cp ../../scripts/ConvertNCBI*.class .
chmod 755 *.class
```
Execute
```
java ConvertNCBI <species>
```

Example

From the symap_5 directory:

> cd data/seq
> mkdir riceA
> cd riceA
> mv ~/Download/GCF_001433935.1_IRGSP-1.0_genomic.fna.gz .
> mv ~/Download/GCF_001433935.1_IRGSP-1.0_genomic.gff.gz .
> cd ..
> cp ../../scripts/ConvertNCBI*.class .
> chmod 755 *.class
> java ConvertNCBI riceA

This results in the following contents:

data/seq
	ConvertNCBI.class
	ConvertNCBI$Gene.class
data/seq/riceA/
      GCF_001433935.1_IRGSP-1.0_genomic.fna.gz
      GCF_001433935.1_IRGSP-1.0_genomic.gff.gz
      annotation/
         anno.gff
         gap.gff
      sequence/
         genomic.fna

The output gives useful details of the annotation (e.g. see riceA details); if the details do not appear right, you may need to edit the script for your genomes.

Convert files from ncbi_dataset.zip (download B)

This approach does not seem to soft-mask the sequence. Also, you need to check the resulting directory to make sure there is just one sub-directory.

Follow all steps from Download A, except for step 2, do the following:

Make a subdirectory for your species and move the ncbi_dataset.zip file to the species directory and unzip it.

Example

From the symap_5 directory:

> cd data/seq
> mkdir riceB
> cd riceB
> mv ~/Download/ncbi_dataset.zip .
unzip ncbi_dataset.zip
Archive:  ncbi_dataset.zip
  inflating: README.md
  inflating: ncbi_dataset/data/assembly_data_report.jsonl
  inflating: ncbi_dataset/data/GCA_001433935.1/chr1.fna
  inflating: ncbi_dataset/data/GCA_001433935.1/chr2.fna
  ...
  inflating: ncbi_dataset/data/GCA_001433935.1/genomic.gff
  inflating: ncbi_dataset/data/GCA_001433935.1/sequence_report.jsonl
  inflating: ncbi_dataset/data/dataset_catalog.json

> cd ..
> cp ../../scripts/ConvertNCBI*.class .
> chmod 755 *.class
> java ConvertNCBI riceB

This results in the following contents (some NCBI files not listed):

data/seq
	ConvertNCBI.class
	ConvertNCBI$Gene.class
data/seq/riceB/
	ncbi_dataset.zip
	ncbi_dataset/data/GCA_001433935.1/
		chr1.fna
		...
	annotation/
		anno.gff
		gap.gff
	sequence/
		genomic.fna

The output gives useful details of the annotation (e.g. see riceB details); if the details do not appear right, you may need to edit the script for your genomes.

Load files into SyMAP

Go to top

The above scenario puts the files in the default SyMAP directories. When you start up SyMAP, you will see your projects listed on the left of the panel (e.g as shown for demos). Check the projects you want to load, which will cause them to be shown on the right of the symap window and continue as described in the System Guide.

Scaffolds

Go to top

By default, the ConvertNCBI script creates the genomic.fna file with only the chromosomes. However, you can have it also include the scaffolds by using the "-s" flag, e.g. using Oryza Sativa

java ConvertNCBI rice -s

This will include all chromosomes (prefix 'C') and scaffolds (prefix 's') in the genomic.fna file. Beware, there can be many tiny scaffolds. If they all aligned in SyMAP, it causes the display to be very cluttered. Hence, it is best to just align the largest ones (e.g. the longest 30); merge them if possible, then try the smaller ones. You should set the following SyMAP project parameters:

grp_prefix needs to be blank as there is no common prefix now.
min_size should be set to only load the largest scaffolds. To determine the value to use, run the lenFasta.pl script, e.g. from the seq directory and using rice as an example:
```
cp ../../scripts/lenFasta.pl .
perl lenFasta.pl riceA/sequence/genomic.fna
```

As of 12-Aug-22 (GCF_001433935.1_IRGSP-1.0_genomic.fna), rice has 58 sequences where 12 are chromosomes, 43 are scaffolds and 3 are other. The script outputs all their sorted lengths followed by the following table:

Read riceA/sequence/genomic.fna and print sorted lengths
Read 55 sequences
   N    Length  Seqid
   1  43270923  >C01 NC_029256.1
   2  36413819  >C03 NC_029258.1
   3  35937250  >C02 NC_029257.1
...
Values for min_len (assuming no duplicate lengths):
#Seqs  min_len
   10 27531856
   20    19457
   30    11447
   40    10311
   50     7140

To align the top 30 sequences (12 chromosomes, 18 of the largest scaffolds), this says to set min_size to 11447.

Editing the script

Go to top

You may want the make changes such as what attributes are included. Therefore, the ConvertNCBI.java code is supplied in the scripts directory. It is very simply written, it does not use external libraries and only uses common programming techniques.

Once you make your changes, execute:

   javac ConvertNCBI.java

You will need to have JDK installed to use the 'javac' command.

What the ConvertNCBI script does

Go to top

The following occurs in the data/seq/<project directory name> where "project directory name" is the argument supplied to ConvertNCBI. The following is for Download A, but works similarly for Download B. It assumes no parameters are set (e.g. -s for scaffolds).

Reads the file ending in '.fna.gz' (or '.fna') and writes a new file called sequence/genomic.fna with the following changes:
1. Sequences must have the word "chromosome" in their ">" header line in order to be copied (unless -l or -s flags).
2. The header line is replaced with ">ChrN", where N comes from the header line "chromosome N", e.g. "chromosome 1", etc. For example,
```
>NC_029256.1 Oryza sativa Japonica Group cultivar Nipponbare chromosome 1, IRGSP-1.0
```
  is replaced with:
```
>Chr1  NC_029256.1
```
3. Gaps of >30,000 are written to the annotation/gap.gff file (30,000 is hard-coded in ConvertNCBI script).
Reads the file ending in 'gff.gz' (or .gff) and writes the file annotation/anno.gff. The gff3 format has 9 columns, where the first is the 'seqid', the third is the 'type' (e.g. feature 'gene'), the last column is a semicolon-delimited keyword=value attribute list. The input file is processed as follows:
1. The type=gene with attribute gene-biotype=protein-coding are processed. The gene line is written to the anno.gff file with the following changes:
  1. The first column 'seqid' is replace with the 'ChrN' value assigned when reading the '.fna' file.
  2. A subset of the attributes are written, where the product keyword value is modified as follows:
    1. If there are multiple mRNA lines for a gene where the values are different, they are concatenated together.
    2. If there are multiple mRNA lines for a gene where the only difference is the variant, then only the variant difference is show, e.g.
```
product=monocopper oxidase-like protein SKU5%2C transcript variant X2, X1, X3
```
2. The first type=mRNA line for a gene is written to the anno.gff file followed by its type=exon, where they each are written with the new seqid and a subset of the attributes.
3. If the "-p" flag is set, then the ID= from the first type=CDS for the gene is extracted, and the "cds-" removed. From the files I have tested, this name corresponds with the ">" name in the NCBI protein file. Likewise for the "-pa" flags, but all unique ID's from the gene's type=CDS records are made into a comma-delimited list; note, this can cause a long list.

Go to top

Email Comments To: symap@agcol.arizona.edu