Ensembl supplies FASTA-formatted files for the genome sequence and GFF-formatted files for the annotation. The following provides a simple scheme to produce the correctly formatted files for SyMAP.
Reasons to convert files
- Only chromosome and optional scaffold sequences are processed.
- Only the 'protein-coding' genes are processed.
- Gene attributes:
  - ID: From the input gene attributes.
  - Name: From the input gene attributes (if it is not equal to ID).
  - desc: The gene description, where escaped symbols (e.g. %3B) are replaced with the correct character. The ending "[Source:..." is removed from the input description.
  - rnaID: Equal to the first mRNA ID. Following the ID is (n), where n is the number of mRNAs for the gene.
  - proteinID (optional): Equal to the first CDS protein-id of the first mRNA.
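The desc cleanup described above can be sketched in a few lines of Python. This is only an illustration, not the code that xToSymap runs: the function name `clean_description` and the sample string are made up, and it assumes the escaped symbols are standard percent-encodings as used in GFF3 attributes.

```python
import re
from urllib.parse import unquote


def clean_description(raw: str) -> str:
    """Decode percent-escapes and drop a trailing '[Source:...' note."""
    text = unquote(raw)  # e.g. '%3B' becomes ';'
    # Remove everything from '[Source:' to the end of the description.
    text = re.sub(r"\s*\[Source:.*$", "", text)
    return text.strip()


# Hypothetical input, shaped like an Ensembl gene description.
raw = "ATP synthase subunit%3B mitochondrial [Source:Example%3BAcc:XYZ]"
print(clean_description(raw))  # ATP synthase subunit; mitochondrial
```

A description without a "[Source:..." ending or escaped symbols passes through unchanged.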
Download
- Go to Ensembl, which shows all species for which Ensembl has a genome. For plants and fungi, see EnsemblPlant and EnsemblFungi.
- Select your species.
- Select Download DNA sequence (FASTA). This takes you to an FTP site. It is recommended that you download the [prefix].dna_sm.toplevel.fa.gz file, as it contains the soft-masked chromosome sequences.
- Select the GFF3 from the "Download genes, cDNAs, ncRNA, proteins - FASTA - GFF3" line. This takes you to an FTP site. Download the [prefix].gff3.gz file.

Multiple files: Ensembl allows individual chromosome FASTA and GFF files to be downloaded,
hence, the
Convert files
- Go to the symap_5/data/seq directory.
- Make a subdirectory for your species (see Project directory) and move the FASTA and GFF files into the directory.
- Start the xToSymap program, select the appropriate options (described below), then select Convert. The FASTA file must end in ".fa" and the annotation file must end in ".gff3" (the Ensembl defaults). They may be zipped, i.e. have a '.gz' suffix.

Example directory contents after conversion:

data/seq/cabb/
    Brassica_oleracea.BOL.59.gff3.gz
    Brassica_oleracea.BOL.dna_sm.toplevel.fa.gz
    annotation/
        anno.gff
        gaps.gff
    sequence/
        genomic.fna

The terminal/log output gives useful details of the annotation, see log.
Options
Option | Description | Default |
 | Most Ensembl FASTA header lines specify the chromosome number, X, Y or Roman numeral. Only these sequences will be written. | Any that have 'chromosome' on their header line. |
Include scaffold | Any sequence with 'scaffold' in the header line will be written to the FASTA file. See Scaffolds. | Chromosomes only |
Include Mt/Pt | Mt/Pt chromosomes will be included in FASTA and GFF. Only the first occurrence will be included. | No Mt/Pt |
Only prefix | Only sequences with the specified prefix will be processed. | None |
 | A new attribute called | Do not include |
 | Print extra info, e.g. see log | No print |
2For situations needing
Rules: There are variations in the text associated with
- If Only prefix is not blank, sequences are filtered out if the seqid does not start with the prefix. Then all the following apply to the non-filtered sequences:
- Chromosomes: A sequence is considered a chromosome if: (a) the ">" is followed by a number, X, Y, or Roman numeral, or (b) the header line contains the word 'chromosome'.
  - The exception is when the header line starts with '>Mt' or '>Pt'; these will not be output unless Include Mt/Pt is selected.
  - Chromosomes are always output unless Only prefix is set and the prefix does not match.
  - Output Seqid: If the ">" line contains 'chromosome N', where N={number, X, Y or Roman numeral}, then this number is used. Otherwise, the word following 'chromosome' is used (e.g. C1).
- Scaffolds: A sequence is considered a scaffold if the header line contains the word 'scaffold'.
  - They will only be output if Include scaffold is selected.
  - Output Seqid: 'Scaf' followed by a consecutive number.
- Unknown: All other ">" entries are considered 'unknown'.
  - They will only be output if Only prefix matches the input header line seqid.
  - Output Seqid: 'Unk' followed by a consecutive number.
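The rules above can be read as a small decision procedure. The following Python sketch is one possible reading of them, not the actual xToSymap code. The names `classify` and `SeqidAssigner` are hypothetical, the Roman numeral test is deliberately loose, and the output seqid for Mt/Pt sequences and for chromosomes lacking the word 'chromosome' is an assumption, since the rules do not state it. The header lines are simplified; real Ensembl headers may be laid out differently and need their own parsing.

```python
import re

# N in "chromosome N": a number, X, Y, or a (loosely matched) Roman numeral.
CHR_ID = r"(?:\d+|X|Y|[IVXLCDM]+)"


def classify(header: str) -> str:
    """Return 'mtpt', 'chromosome', 'scaffold' or 'unknown' for a '>' line."""
    body = header[1:]
    tokens = body.split()
    first = tokens[0] if tokens else ""
    if header.startswith((">Mt", ">Pt")):
        return "mtpt"
    if re.fullmatch(CHR_ID, first) or "chromosome" in body:
        return "chromosome"
    if "scaffold" in body:
        return "scaffold"
    return "unknown"


class SeqidAssigner:
    def __init__(self, only_prefix="", include_scaffold=False, include_mtpt=False):
        self.only_prefix = only_prefix
        self.include_scaffold = include_scaffold
        self.include_mtpt = include_mtpt
        self.counts = {"Scaf": 0, "Unk": 0}
        self.seen_mtpt = set()

    def _numbered(self, label: str) -> str:
        self.counts[label] += 1
        return f"{label}{self.counts[label]}"

    def assign(self, header: str):
        """Return the output seqid, or None if the sequence is skipped."""
        tokens = header[1:].split()
        first = tokens[0] if tokens else ""
        if self.only_prefix and not first.startswith(self.only_prefix):
            return None
        kind = classify(header)
        if kind == "mtpt":
            tag = first[:2]
            # Only the first Mt and the first Pt are kept.
            if not self.include_mtpt or tag in self.seen_mtpt:
                return None
            self.seen_mtpt.add(tag)
            return tag  # assumption: the rules do not name the output seqid
        if kind == "chromosome":
            m = re.search(rf"chromosome\s+({CHR_ID})\b", header)
            if m:
                return m.group(1)
            m = re.search(r"chromosome\s+(\S+)", header)
            return m.group(1) if m else first  # 'first' is an assumption
        if kind == "scaffold":
            return self._numbered("Scaf") if self.include_scaffold else None
        # Unknown: output only when a prefix was given (and it matched above).
        return self._numbered("Unk") if self.only_prefix else None


assigner = SeqidAssigner(include_scaffold=True)
for h in [">1 dna chromosome 1", ">seqA chromosome C1", ">s100 scaffold s100", ">Mt dna"]:
    print(h, "->", assigner.assign(h))
```

Keeping the counters on the object means the 'Scaf' and 'Unk' numbers stay consecutive across one pass over the FASTA file.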
Scaffolds
By default, the Group prefix needs to be blank as there is no common prefix now. Minimum length should be set to only load the largest scaffolds.
Calculate the length using the xToSymap Lengths.
General
Load files into SyMAP
The above scenario puts the files in the default SyMAP directories.
- When you start up ./symap, you will see your projects listed on the left of the panel (e.g. demos).
- Check the projects you want to load, which will cause them to be shown on the right of the symap panel.
- For the project you want to load, open the Project Parameters panel to enter the appropriate values.
- Then select Load Project.
Editing the ConvertEnsembl script
The script scripts/ConvertEnsembl.java executes the same code as the
What the ConvertEnsembl script does
- Sequences are output according to the Option rules.
- Gaps of >30,000 are written to the annotation/gap.gff file (this value can be changed in the xToSymap interface).
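As a rough illustration of the gap step, the sketch below finds long gaps in a sequence string. It assumes that a gap is a run of 'N' or 'n' characters and that the threshold is a length in bases; neither is stated above. The function name `find_gaps` is hypothetical. Coordinates are returned 1-based and inclusive, as GFF uses.

```python
import re


def find_gaps(sequence: str, min_len: int = 30000):
    """Return (start, end) pairs for runs of N longer than min_len."""
    gaps = []
    for m in re.finditer(r"[Nn]+", sequence):
        if m.end() - m.start() > min_len:
            # Convert the 0-based half-open match span to 1-based inclusive.
            gaps.append((m.start() + 1, m.end()))
    return gaps


# Small threshold so the toy sequence shows one gap that is kept and one that is not.
seq = "ACGT" + "N" * 12 + "ACGT" + "n" * 3 + "AC"
print(find_gaps(seq, min_len=5))  # [(5, 16)]
```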
GFF: Reads the file ending in 'gff3.gz' and writes the file annotation/anno.gff. The gff3 format has 9 columns, where the first is the 'seqid', the third is the 'type' (e.g. feature 'gene'), the last column is a semicolon-delimited keyword=value 'attribute' list. The file is processed as follows:
- Only lines with the 'type' (3rd column) equal to 'gene', 'mRNA' or 'exon' are read.
- Genes with "biotype=protein_coding" are written to file, followed by the first mRNA and its exons.
- All lines have their seqID replaced with the assigned seqID used in the FASTA file. They all have a modified set of attributes written.
- The only gene attributes retained are ID, description and Name (if it is not the same as ID). The ending "[Source:..." is removed from the input description.
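The GFF steps above can be sketched as a single pass over the input lines. This is a simplified reading, not the ConvertEnsembl code. The names `parse_attrs`, `filter_gff` and `seqid_map` are hypothetical. It assumes that mRNA and exon lines carry a Parent attribute pointing at the gene and mRNA ID and that features appear in gene, mRNA, exon order. It keeps only ID and Parent on mRNA and exon lines, which is an assumption. The description cleanup and the rnaID count are left out to keep the example short.

```python
def parse_attrs(col9: str) -> dict:
    """Split a GFF3 attribute column into a keyword -> value dict."""
    return dict(kv.split("=", 1) for kv in col9.strip().split(";") if "=" in kv)


def filter_gff(lines, seqid_map):
    """Yield rewritten lines for protein-coding genes, their first mRNA and its exons."""
    genes_needing_mrna = set()  # kept genes whose first mRNA has not been seen yet
    kept_mrnas = set()          # IDs of the first mRNA of each kept gene
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9:
            continue
        seqid, ftype = cols[0], cols[2]
        if ftype not in ("gene", "mRNA", "exon") or seqid not in seqid_map:
            continue
        attrs = parse_attrs(cols[8])
        if ftype == "gene":
            if attrs.get("biotype") != "protein_coding":
                continue
            gid = attrs.get("ID", "")
            genes_needing_mrna.add(gid)
            new = {"ID": gid}
            if attrs.get("Name") and attrs["Name"] != gid:
                new["Name"] = attrs["Name"]
            if "description" in attrs:
                new["desc"] = attrs["description"]
        elif ftype == "mRNA":
            parent = attrs.get("Parent", "")
            if not parent or parent not in genes_needing_mrna:
                continue  # not a kept gene, or not its first mRNA
            genes_needing_mrna.discard(parent)
            kept_mrnas.add(attrs.get("ID", ""))
            new = {"ID": attrs.get("ID", ""), "Parent": parent}
        else:  # exon
            parent = attrs.get("Parent", "")
            if not parent or parent not in kept_mrnas:
                continue
            new = {"Parent": parent}
        cols[0] = seqid_map[seqid]  # use the seqid assigned in the FASTA output
        cols[8] = ";".join(f"{k}={v}" for k, v in new.items())
        yield "\t".join(cols)


# Made-up input lines, for illustration only.
example = [
    "##gff-version 3",
    "s9\tx\tgene\t1\t900\t.\t+\t.\tID=gene:g1;Name=ABC;biotype=protein_coding;description=kinase",
    "s9\tx\tmRNA\t1\t900\t.\t+\t.\tID=transcript:t1;Parent=gene:g1",
    "s9\tx\texon\t1\t400\t.\t+\t.\tParent=transcript:t1",
    "s9\tx\tmRNA\t1\t900\t.\t+\t.\tID=transcript:t2;Parent=gene:g1",
    "s9\tx\texon\t1\t300\t.\t+\t.\tParent=transcript:t2",
    "s9\tx\tgene\t1000\t1500\t.\t-\t.\tID=gene:g2;biotype=lncRNA",
]
for out_line in filter_gff(example, {"s9": "Scaf1"}):
    print(out_line)
```

Tracking the genes that still need an mRNA is what limits the output to the first mRNA per gene without reading the file twice.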
Email Comments To: symap@agcol.arizona.edu