This page covers using FASTA and GFF files from sources other than NCBI or Ensembl.
Requirements
The requirements are listed in Sequence files and Annotation files. Run Summarize to check that your files meet the input requirements for SyMAP; if they do, you can input them to SyMAP directly.
The
The main reason for conversion is to provide shortened chromosome/scaffold names. If your sequence names are long,
they clutter everything, so find a way to edit your files to shorten them.
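One possible way to shorten the names is sketched below in Python. It is not part of SyMAP or xToSymap: the function names and the `Chr` prefix are hypothetical choices for illustration. The sketch assigns a short name to each sequence ID in the FASTA file, then applies the same mapping to the first column of the GFF file so the two files stay consistent. It drops any description text after the ID on the FASTA header line and does not rewrite `##sequence-region` directives.

```python
def build_name_map(fasta_lines, prefix="Chr"):
    """Map each original sequence ID to a short name (prefix + counter).

    The prefix and numbering scheme are assumptions for this sketch.
    """
    name_map = {}
    for line in fasta_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if not tokens:
                continue
            seq_id = tokens[0]  # first whitespace-delimited token is the ID
            if seq_id not in name_map:
                name_map[seq_id] = f"{prefix}{len(name_map) + 1}"
    return name_map


def rename_fasta(fasta_lines, name_map):
    """Yield FASTA lines with header IDs replaced by their short names."""
    for line in fasta_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            seq_id = tokens[0] if tokens else ""
            yield ">" + name_map.get(seq_id, seq_id)
        else:
            yield line.rstrip("\n")


def rename_gff(gff_lines, name_map):
    """Yield GFF lines with column 1 (seqid) replaced by its short name."""
    for line in gff_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            yield line  # comments and directives are passed through unchanged
            continue
        cols = line.split("\t")
        cols[0] = name_map.get(cols[0], cols[0])
        yield "\t".join(cols)
```

Keeping the mapping in one dictionary means the original names can be written to a side file if they need to be recovered later.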
You may be able to use one of the
xToSymap NCBI and Ensembl conversions
Some important differences between the NCBI and Ensembl files:
- Protein_coding attribute: NCBI uses gene_biotype= whereas Ensembl uses biotype=. Since this is the keyword that says whether the gene is "protein_coding", it is VERY important.
- Description attribute: Early NCBI files did not have a description= keyword for the gene, and only had product= keywords for the mRNAs (current files have both). In contrast, Ensembl only has the description= keyword for the gene.
- File suffix: NCBI uses the suffixes ".fna" and ".gff" whereas Ensembl uses ".fa" and ".gff3". Each conversion requires the correct suffix, which prevents running the wrong conversion.
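To see which of the two conventions a file from another source follows, the differences above can be checked with a short script. The following is a minimal Python sketch, not xToSymap code: the function names are hypothetical, it assumes uncompressed file names (a ".gz" ending would hide the real suffix), and it only inspects lines whose feature type is exactly "gene".

```python
from pathlib import Path


def parse_attributes(column9):
    """Split a GFF attribute column ('key=value;key=value') into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def gene_biotype_key(gff_lines):
    """Return 'gene_biotype' (NCBI style), 'biotype' (Ensembl style), or None.

    The answer comes from the first gene line that carries either keyword.
    """
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue
        attrs = parse_attributes(cols[8])
        for key in ("gene_biotype", "biotype"):
            if key in attrs:
                return key
    return None


def suffix_style(fasta_path, gff_path):
    """Return 'NCBI' or 'Ensembl' if the suffix pair matches, else None."""
    suffixes = (Path(fasta_path).suffix, Path(gff_path).suffix)
    if suffixes == (".fna", ".gff"):
        return "NCBI"
    if suffixes == (".fa", ".gff3"):
        return "Ensembl"
    return None
```

If the keyword style and the suffix style disagree, renaming the files to the suffix that matches the keywords is probably the simplest fix.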
Input type: NCBI chromosome prefix
NCBI mRNA 'product' keyword; NCBI 'gene_biotype' keyword
or
Input type: Ensembl like 'Number, X, Y, Roman'
Ensembl gene 'description' keyword; Ensembl 'biotype' keyword
The NCBI chromosome prefix is precisely "NC_". However, if your file has its own prefix, but otherwise adheres to the NCBI keywords, you can enter a
The Ensembl
e.g. '>1' or '>X' or '>XI';
When you run
For details on the conversion, see NCBI and Ensembl.
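The two chromosome naming styles described above ("NC_" prefix versus a number, X, Y, or Roman numeral) can be told apart with a simple pattern check. This is a rough Python sketch under stated assumptions, not the rule xToSymap applies: `name_style` and its `ncbi_prefix` parameter are hypothetical, and the Roman numeral pattern is deliberately loose (it accepts any run of Roman numeral letters).

```python
import re

# A number, 'X', 'Y', or a run of Roman numeral letters (loose check).
ENSEMBL_LIKE = re.compile(r"^(\d+|X|Y|[IVXLCDM]+)$")


def name_style(header, ncbi_prefix="NC_"):
    """Classify a FASTA header line as 'NCBI', 'Ensembl', or 'other'."""
    tokens = header.lstrip(">").split()
    if not tokens:
        return "other"
    name = tokens[0]  # sequence ID without the description
    if name.startswith(ncbi_prefix):
        return "NCBI"
    if ENSEMBL_LIKE.match(name):
        return "Ensembl"
    return "other"


for h in [">NC_003070.9 chromosome 1", ">1", ">X", ">XI", ">scaffold_12"]:
    print(h, "->", name_style(h))
```

Headers that come back as "other" are the ones most likely to need renaming before either conversion fits.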
Email Comments To: symap@agcol.arizona.edu