Other input files to SyMAP
This page covers using FASTA and GFF files from sources other than NCBI or Ensembl.

Requirements

The requirements are listed in Sequence files and Annotation files.

Run Summarize files to check that your files meet the requirements for input to SyMAP; if they do, you may input them to SyMAP directly.

The xToSymap conversion programs cannot help if your GFF file does not meet the specified requirements. In that case, you will need to write your own conversion program, or you can send email to symap@agcol.arizona.edu and I will do it for you.

The main reason for conversion is to provide shortened chromosome/scaffold names. Long sequence names clutter everything, so find a way to edit your files to shorten them. You may be able to use one of the xToSymap conversion programs for your data.
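If you do end up writing your own conversion, the renaming step can be sketched as below. This is only a minimal illustration, not what xToSymap does: the function names and the "Chr" prefix with a running number are assumptions made for the example, and the sketch drops any description text after the sequence name. The key point is that the GFF sequence column must be renamed with the same mapping as the FASTA headers.

```python
def shorten_fasta_names(lines, prefix="Chr"):
    """Rename FASTA headers to prefix + running number (hypothetical scheme).

    Returns (new_lines, mapping), where mapping is old name -> new name.
    """
    mapping = {}
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            tokens = line[1:].split()
            old = tokens[0] if tokens else ""   # name is the first token after '>'
            new = f"{prefix}{len(mapping) + 1:02d}"
            mapping[old] = new
            out.append(">" + new)
        else:
            out.append(line)
    return out, mapping


def rename_gff_seqids(lines, mapping):
    """Apply the same old -> new mapping to column 1 of a GFF file."""
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if line and not line.startswith("#"):
            cols = line.split("\t")
            cols[0] = mapping.get(cols[0], cols[0])  # leave unknown names as-is
            line = "\t".join(cols)
        out.append(line)
    return out
```

Run Summarize files on the renamed files afterwards to confirm they still meet the requirements.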

xToSymap NCBI and Ensembl conversions

Some important differences between the NCBI and Ensembl files:

  1. Protein_coding attribute: NCBI uses gene_biotype= whereas Ensembl uses biotype=. Since this is the keyword that says whether the gene is "protein_coding", it is VERY important.
     
  2. Description attribute: Early NCBI files did not have a description= keyword for the gene, and only had product= keywords for the mRNAs (now they have both). In contrast, Ensembl only has the description= keyword for the gene.
     
  3. File suffix: NCBI uses the suffixes ".fna" and ".gff", whereas Ensembl uses ".fa" and ".gff3". Each conversion requires the correct suffixes, which prevents running the wrong conversion.
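
The first and third differences above can be checked by hand before running a conversion. The following is a rough sketch under stated assumptions, not SyMAP's own logic: the function names are made up for the example, it assumes a standard 9-column tab-delimited GFF with semicolon-separated key=value attributes, and it looks only at "gene" features.

```python
def biotype_keyword(gff_lines):
    """Return 'gene_biotype', 'biotype', or None, taken from the first
    gene feature that carries either keyword."""
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "gene":
            continue
        # Compare whole keys so 'biotype' does not match inside 'gene_biotype'.
        keys = {f.split("=", 1)[0].strip() for f in cols[8].split(";") if "=" in f}
        if "gene_biotype" in keys:
            return "gene_biotype"   # NCBI style
        if "biotype" in keys:
            return "biotype"        # Ensembl style
    return None


def conversion_for_suffixes(fasta_name, gff_name):
    """Map the file suffix pair to the conversion that expects it."""
    if fasta_name.endswith(".fna") and gff_name.endswith(".gff"):
        return "NCBI"
    if fasta_name.endswith(".fa") and gff_name.endswith(".gff3"):
        return "Ensembl"
    return None
```

If the keyword and the suffixes point to different sources, rename the files only after confirming which keyword set the GFF actually uses.
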
When you run Summarize files on recent files, the end of its output will be one of the following:
Input type: NCBI chromosome prefix
            NCBI mRNA 'product' keyword; NCBI 'gene_biotype' keyword
or
Input type: Ensembl like 'Number, X, Y, Roman'
            Ensembl gene 'description' keyword; Ensembl 'biotype' keyword
The NCBI chromosome prefix is precisely "NC_". However, if your file has its own prefix but otherwise adheres to the NCBI keywords, you can enter an "Only prefix" with the NCBI convert.

The Ensembl 'Number, X, Y, Roman' is when the FASTA header line starts with the chromosome identifier, e.g. '>1' or '>X' or '>XI'.
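
To make the two naming styles concrete, here is a small sketch that classifies a single FASTA header line. It is an approximation written for this page, not the test that Summarize files performs: the function name, the short return labels, and the Roman numeral pattern are all assumptions.

```python
import re

# Standard Roman numeral pattern; the lookahead rejects the empty string.
ROMAN = re.compile(
    r"(?=[IVXLCDM])M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})"
)


def header_style(header):
    """Classify a FASTA header line as 'NCBI', 'Ensembl', or 'other'."""
    tokens = header[1:].split() if header.startswith(">") else []
    name = tokens[0] if tokens else ""
    if name.startswith("NC_"):
        return "NCBI"
    if re.fullmatch(r"\d+", name) or name in ("X", "Y") or ROMAN.fullmatch(name):
        return "Ensembl"
    return "other"


for h in (">NC_003070.9 chromosome 1", ">1", ">X", ">XI", ">scaffold_12"):
    print(h, "->", header_style(h))
```

Headers that come back as "other" are the ones where your own prefix, or your own conversion, comes into play.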

When you run Summarize files, if your results match the NCBI or Ensembl output, you can use the corresponding conversion.

For details on the conversion, see NCBI and Ensembl.

Email Comments To: symap@agcol.arizona.edu