Ensembl supplies FASTA-formatted files for the genome sequence and GFF-formatted files for the annotation.
The following provides a simple scheme to produce the correctly formatted files for SyMAP.
Reasons to convert
- Only chromosome and optional scaffold sequences are processed.
- Only the 'protein-coding' genes are processed.
- Gene attributes:
  ID | From the input gene attributes.
  Name | From the input gene attributes (if it is not equal to ID).
  desc | The gene description, where escaped symbols (e.g. %3B) are replaced with the correct character. The ending "[Source:..." is removed from the input description.
  rnaID | The first mRNA ID. Following the ID is (n), where n = number of mRNAs for the gene.
  proteinID | (optional) The protein-id of the 1st CDS of the 1st mRNA.
- If xToSymap has problems converting the file(s), then SyMAP will have problems loading them;
the script can be edited for your particular files. (Note: Ensembl formats are not totally consistent, so
xToSymap may not take everything into account.)
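The 'desc' clean-up described above can be sketched in a few lines of Python. This is only an illustration of the two stated steps (decoding escaped symbols such as %3B, and removing the trailing "[Source:..."), not the code xToSymap uses; the function name `clean_description` and the sample description are made up for the example.

```python
import re
from urllib.parse import unquote


def clean_description(raw: str) -> str:
    """Hypothetical helper: decode %XX escapes, then drop a trailing '[Source:...'."""
    text = unquote(raw)                          # e.g. %3B becomes ';', %2C becomes ','
    text = re.sub(r"\s*\[Source:.*$", "", text)  # remove everything from '[Source:' to the end
    return text.strip()


# Made-up description in the general style of a GFF3 'description' attribute.
example = "zinc finger%2C C2H2 type%3B putative [Source:EXAMPLE%3BAcc:EXAMPLE]"
print(clean_description(example))
```

The decoding is done before the "[Source:" text is stripped, so an escaped character inside the source note cannot interfere with the match.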
Download
- Go to Ensembl, which shows all species for which Ensembl has a genome.
For plants and fungi, see EnsemblPlant and EnsemblFungi.
- Select your species.
- Select Download DNA sequence (FASTA).
This takes you to an FTP site.
It is recommended that you download the [prefix].dna_sm.toplevel.fa.gz file, as it
contains the soft-masked chromosome sequences.
- Select the GFF3 from the
Download genes, cDNAs, ncRNA, proteins - FASTA - GFF3 line.
This takes you to an FTP site. Download the [prefix].gff3.gz file.
Multiple files: Ensembl allows individual chromosome FASTA and GFF files to be downloaded; hence, the ConvertEnsembl
option (and script) processes all .fa (or .fa.gz) and .gff3 (or .gff3.gz) files in the directory.
- Go to the symap_5/data/seq directory.
- Make a subdirectory for your species and move the FASTA and GFF files
into the directory.
- Start the xToSymap program, select the appropriate options (described below),
then select Convert. The FASTA file must end in ".fa" and the annotation file
must end in ".gff3" (the Ensembl defaults). They may be gzipped, i.e. have a '.gz' suffix.
This results in the following contents:
data/seq/cabb/
    Brassica_oleracea.BOL.59.gff3.gz
    Brassica_oleracea.BOL.dna_sm.toplevel.fa.gz
    annotation/
        anno.gff
        gaps.gff
    sequence/
        genomic.fna
The terminal/log output gives useful details of the annotation; see log.
Option | Description | Default
Only #,X,Y,I² | Most Ensembl FASTA header lines specify the chromosome number, X, Y or Roman numeral. Only these sequences will be written. | Any that have 'chromosome' on their header line.
Include scaffolds¹ | Any sequence with 'scaffold' in the header line will be written to the FASTA file. See Scaffolds. | Chromosomes only
Include Mt/Pt | Mt/Pt chromosomes will be included in FASTA and GFF. Only the first occurrence will be included. | No Mt/Pt
Only prefix² | Only sequences with the specified prefix will be processed. | None
Protein-id | A new attribute called proteinID= will be the value of the protein-id of the 1st CDS for the 1st mRNA of the gene. This can be searched using the Queries. | Do not include
Verbose | Print extra info, e.g. see log. | No print
¹ You may use Include scaffolds, and then limit the input on the SyMAP Load by setting
Minimal length in the project's Parameters.
² For situations needing Only #,X,Y,I and Only prefix, see exceptions. This mainly occurs
if the words 'chromosome' or 'scaffold' are not in the header lines.
Rules: There are variations in the text associated with the >seqid header lines. The rules
used by this script are as follows:
- If Only prefix is not blank, sequences are filtered out if the seqid does not
start with the prefix. All of the following rules then apply to the remaining sequences.
- Chromosomes: A sequence is considered a chromosome if (a) the ">" is followed by a number, X, Y, or Roman numeral,
or (b) the header line contains the word 'chromosome'.
  - The exception is a header line that starts with '>Mt' or '>Pt'; these are not output unless
  Include Mt/Pt is selected.
  - Chromosomes are always output unless Only prefix is set and the prefix does not match.
  - Output Seqid: If the ">" line contains 'chromosome N', where N={number, X, Y or Roman numeral},
  then this number is used. Otherwise, the word following 'chromosome' is used (e.g. C1).
- Scaffolds: A sequence is considered a scaffold if the header line contains the word 'scaffold'.
  - They are only output if Include scaffolds is selected.
  - Output Seqid: 'Scaf' followed by a consecutive number.
- Unknown: All other ">" entries are considered 'unknown'.
  - They are only output if Only prefix matches the input header line seqid.
  - Output Seqid: 'Unk' followed by a consecutive number.
See Summarize to help determine how to set the options for your input.
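As a rough illustration, the rules above can be written as one classification function. This is a sketch of the rules as stated here, not the xToSymap source: the function name `classify`, its parameters, the loose Roman-numeral test, and the sample header lines are all assumptions made for the example. It does not model the "only the first Mt/Pt occurrence" rule or the assignment of output seqids.

```python
import re

# Loose check for a Roman numeral; an assumption, not a full validator.
ROMAN = re.compile(r"^[IVXLCDM]+$")


def classify(header, only_prefix="", include_scaffolds=False, include_mtpt=False):
    """Return (kind, keep) for one FASTA header line, following the rules above."""
    fields = header.lstrip(">").split()
    seqid = fields[0] if fields else ""
    lower = header.lower()

    # Only prefix: drop anything whose seqid does not start with the prefix.
    if only_prefix and not seqid.startswith(only_prefix):
        return ("filtered", False)
    # Mt/Pt exception: only output when Include Mt/Pt is selected.
    if seqid.startswith(("Mt", "Pt")):
        return ("Mt/Pt", include_mtpt)
    # Chromosome: seqid is a number, X, Y or Roman numeral, or header says 'chromosome'.
    if seqid.isdigit() or seqid in ("X", "Y") or ROMAN.match(seqid) or "chromosome" in lower:
        return ("chromosome", True)
    # Scaffold: only output when Include scaffolds is selected.
    if "scaffold" in lower:
        return ("scaffold", include_scaffolds)
    # Unknown: only output when Only prefix is set (and, at this point, matched).
    return ("unknown", bool(only_prefix))


# Made-up header lines for illustration.
print(classify(">C1 dna_sm:chromosome example"))
print(classify(">Scaffold00285 dna_sm:scaffold example", include_scaffolds=True))
print(classify(">contig_7 example", only_prefix="contig"))
```

Checking a candidate header against a function like this can help decide which options to set before running Convert.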
By default, the Convert option creates the genomic.fna file with only the chromosomes.
However, you can also include the scaffolds by selecting Include scaffolds.
The genomic.fna file will then contain all chromosomes (prefix 'C') and scaffolds (prefix 's').
Beware, there can be many tiny scaffolds. If they are all aligned in SyMAP, the display becomes very cluttered.
Hence, it is best to align just the largest ones (e.g. the longest 30), merge them if possible, and then try
the smaller ones. You should set the following SyMAP project Parameters:
- Group prefix needs to be blank, as there is no longer a common prefix.
- Minimum size should be set so that only the largest scaffolds are loaded.
As of 23-July-2024, cabbage (Brassica_oleracea.BOL.dna_sm.toplevel.fa.gz)
had 9 chromosome sequences and 32,877 scaffolds. If the scaffolds are included in the converted output,
Minimum size should be set to reduce the number loaded. Calculate the Minimum size using
the xToSymap Lengths button, which outputs all of the sorted lengths followed by a summary table of lengths:
Seq#      Length  Seqid
   1  43,764,888  >C1
   2  52,886,895  >C2
   3  64,984,695  >C3
   4  53,719,093  >C4
   5  46,902,585  >C5
   6  39,822,476  >C6
   7  48,366,697  >C7
   8  41,758,685  >C8
   9  54,679,868  >C9
  10     550,871  >s01 Scaffold00285
  11     360,705  >s02 Scaffold00418
...(list rest of scaffolds with lengths)

Values for parameter 'Minimal length' (assuming no duplicate lengths):
#Seqs  Minimum length
   10         550,871
   20         193,719
   30         152,041
   40         122,531
   50          98,180
   60          85,914
   70          70,524
   80          65,697
   90          61,049
  100          58,280
To align the top 30 sequences (9 chromosomes and the 21 largest scaffolds),
the table indicates that Minimum length should be set to 152,041.
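The summary table amounts to picking the length of the N-th longest sequence. A minimal Python sketch of that calculation follows; the function name `minimum_length_for_top` is hypothetical, and it assumes (as the table heading does) that there are no duplicate lengths and that a sequence whose length equals the threshold is still loaded.

```python
def minimum_length_for_top(lengths, n):
    """Return the length of the n-th longest sequence (hypothetical helper).

    Using this value as the minimum keeps the n longest sequences,
    assuming no duplicate lengths.
    """
    ranked = sorted(lengths, reverse=True)
    if not 1 <= n <= len(ranked):
        raise ValueError("n must be between 1 and the number of sequences")
    return ranked[n - 1]


# The 9 chromosomes and first 2 scaffolds from the listing above.
lengths = [43764888, 52886895, 64984695, 53719093, 46902585, 39822476,
           48366697, 41758685, 54679868, 550871, 360705]
print(minimum_length_for_top(lengths, 10))  # 550871, matching the first row of the summary table
```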
The above scenario puts the files in the default SyMAP directories.
When you start up SyMAP, you will see your projects listed on the left of the panel
(e.g. as shown for demos).
- Check the projects you want to load, which will cause them to be shown on the right of the SyMAP window.
- For the project you want to load, open the project parameters window to enter the appropriate values.
- Then select Load Project.
The Ensembl files are not consistent in their header lines. Hence, the parsing could
be incorrect. If it is not parsing correctly (the summary output should indicate
if it is correct or not), edit the program
as described here.
What the ConvertEnsembl script does
FASTA: Reads the file ending in '.fa.gz' and writes a new file called
sequence/genomic.fna with the following changes:
- Sequences are output according to the Option rules.
- Gaps of >30,000 are written to the annotation/gap.gff file (this value can be changed
in the xToSymap interface).
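The gap step can be illustrated with a short Python sketch. It assumes that a gap is a run of 'N' (or soft-masked 'n') characters, which this page does not state explicitly, and that coordinates are reported 1-based and inclusive as in GFF; the function name `find_gaps` is hypothetical.

```python
import re


def find_gaps(sequence, min_gap=30000):
    """Yield (start, end) for runs of N/n longer than min_gap.

    Coordinates are 1-based and inclusive. Treating a gap as a run of
    N characters is an assumption made for this sketch.
    """
    for match in re.finditer(r"[Nn]+", sequence):
        if match.end() - match.start() > min_gap:
            yield (match.start() + 1, match.end())


# Tiny example with a lowered threshold so the result is easy to check by eye.
seq = "ACGT" + "N" * 5 + "ACGT" + "n" * 2 + "A"
print(list(find_gaps(seq, min_gap=3)))  # [(5, 9)]
```

The threshold is a strict "greater than", following the ">30,000" wording above.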
GFF: Reads the file ending in 'gff3.gz' and writes the file annotation/anno.gff. The
gff3 format
has 9 columns, where the first is the 'seqid', the third is the 'type' (e.g. feature 'gene'), and the
last is a semicolon-delimited keyword=value 'attribute' list. The file is processed as follows:
- Only lines with the 'type' (3rd column) equal to 'gene', 'mRNA' or 'exon' are read.
- Genes with "biotype=protein_coding" are written to the file, followed by the first mRNA and
its exons.
- All lines have their seqid replaced with the one assigned when reading the FASTA file, and all have a modified
set of attributes written.
- The only gene attributes retained are ID, description and Name (if it is not the same as ID).
It removes the ending "[Source:..." from the input description.
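To make the column layout concrete, here is a small Python sketch that splits a GFF3 line into its 9 tab-separated columns, parses the attribute column, and tests for a protein-coding gene. It illustrates the format described above rather than the script itself; the function names and the sample lines are invented for the example.

```python
from urllib.parse import unquote


def parse_attributes(column9):
    """Split a semicolon-delimited keyword=value list into a dict."""
    attrs = {}
    for item in column9.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            # Decode escapes only after splitting, so an escaped ';' (%3B) stays inside its value.
            attrs[key.strip()] = unquote(value.strip())
    return attrs


def is_protein_coding_gene(line):
    """True if the line is a 'gene' feature with biotype=protein_coding."""
    if line.startswith("#"):
        return False
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return False
    return cols[2] == "gene" and parse_attributes(cols[8]).get("biotype") == "protein_coding"


# Made-up GFF3 lines for illustration.
gene = "C1\texample\tgene\t100\t900\t.\t+\t.\tID=gene:G1;Name=abc;biotype=protein_coding"
other = "C1\texample\tncRNA_gene\t100\t900\t.\t+\t.\tID=gene:G2;biotype=lncRNA"
print(is_protein_coding_gene(gene), is_protein_coding_gene(other))
```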