SyMAP was written for diverse plant genomes with short introns, but has recently been modified to also work
with the long introns of mammalian genomes and with less diverse genomes. This is addressed in Pair Parameters.
First-time users of SyMAP should read the System Guide to try the demo and learn the details
of the input files. This document provides details of the parameters and functions.
Contents:
Project Manager
1. Build database
This provides a quick summary of the build functions.
See System Guide for details.
The projects selected from the Projects label in
the left panel are listed in the Selected section on the right.
The possible functions vary with the state, as listed below:
♦ If there is any project not loaded to the database, you will see:
|
Load All Projects
Load all projects that have not been loaded yet.
| ♦ If a project is not loaded to the database, you will see:
|
Remove from disk
| Remove your files from disk. The project will no longer be shown on the left.
|
Load project
| Loads the sequence and optional annotations to the database.
After loading results,
always verify them by selecting the View link, which
provides a summary of what has been loaded.
|
Parameters
| This brings up a panel of parameters, see Project Parameters.
After the project is loaded, you can still change the Display parameters.
| ♦ If a project is loaded into the database, you will see:
|
Remove from database
| The sequences and annotation will be removed, but the files stay on disk.
|
Reload project
| Executes Remove from database followed by Load project.
If there are alignment files, you will be asked if you want them removed;
see parameters to determine whether they should be removed.
|
Reload annotation
| Removes the annotation from the database, then loads the annotation again.
The Alignment&Synteny commands will
recognize if there is an existing alignment
and will only perform the synteny computation.
|
Parameters
| This brings up the project parameters panel.
| |
Sequence alignments are performed with
MUMmer3,
but can be changed to use
MUMmer4 (see SyMAP
MUMmer).
This section shows a table with the status of alignments between
the selected loaded projects.
Each cell in the table represents a pair of projects and the cell contains a status code showing
whether or not that pair has been aligned (codes are listed below). Note that the table
shows each pair cell twice, but only the lower cells are activated.
Clicking on a cell selects that pair of projects (the cell will be highlighted in green).
Alignment&Synteny may then be computed or viewed for the selected pair using
the function buttons.
Code | Description
✔ | Synteny for this project pair is ready to view.
A | The MUMmer alignment has been performed but the synteny computation has not been run. This status occurs if a pair is completed but then annotations are reloaded for one of the projects, or if the MUMmer files have been added by the user.
? | The alignment has not been completed. In this case, select Selected Pair (Redo) and the alignments will be completed, followed by the synteny algorithm.
(blank) | The alignment has not been started.
CPUs: Enter the maximum number of CPUs to use for an alignment. SyMAP will use
up to that number (but may use fewer, depending on the number of sequences being aligned). Alternatively,
the number of CPUs may be entered in the symap.config file or
entered as a command line argument (-p N).
Concat: By default, all sequences of the 1st genome are concatenated into one file
and then multiple files are created from the 2nd genome to be searched against the first;
the multiple files may have two or more sequential chromosomes concatenated together to
create files that are >60M. To reduce memory usage, you can uncheck Concat so that
multiple files are created for each genome,
and all files from the 1st genome are searched against the 2nd genome.
See timing difference.
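The way sequential chromosomes of the 2nd genome could be grouped into files can be pictured with the minimal Python sketch below. It is not SyMAP's code: the function name `chunk_sequences`, the greedy strategy, and the reading of ">60M" as 60,000,000 bases are assumptions made only for illustration.

```python
def chunk_sequences(lengths, min_chunk=60_000_000):
    """Greedily group sequential sequences until a chunk exceeds min_chunk bases.

    lengths: list of (name, length) tuples in file order.
    The 60,000,000 default is an assumed reading of ">60M".
    """
    chunks, current, size = [], [], 0
    for name, length in lengths:
        current.append(name)
        size += length
        if size > min_chunk:      # chunk is large enough, start a new one
            chunks.append(current)
            current, size = [], 0
    if current:                   # leftover sequences form a final, smaller chunk
        chunks.append(current)
    return chunks


# Hypothetical chromosome lengths, for illustration only.
genome2 = [("chr1", 43_000_000), ("chr2", 36_000_000),
           ("chr3", 70_000_000), ("chr4", 20_000_000)]
print(chunk_sequences(genome2))
```

With these made-up lengths, chr1 and chr2 share a file, while chr3 and chr4 each get their own.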
Alignment&Synteny (A&S)
|
All Pairs
Run (or complete) the synteny computation for all pairs in the Available Syntenies table.
However, this will not run 'self-syntenies'; those need to be done individually by selecting
the pair. Also, a draft sequence that is to be ordered can only be aligned to
one complete sequence, so that should also be done individually.
|
Selected Pair
Selected Pair (Redo)
|
Run (or complete) the A&S computation for the selected project pair.
If the pair is already complete, the button label changes to Redo, and
only the synteny computation will be rerun.
If you wish to rerun the MUMmer alignment, first use the Clear Pair function.
|
Clear Pair
|
You will be prompted to choose whether to:
(1) remove synteny from the database only, or
(2) remove synteny from the database and remove alignments.
|
Parameters
|
Set the pair parameters for the selected pair cell.
| |
For the remaining display buttons, see User Guide.
2.a Overview
Click the Parameters link to open the Parameters window shown on the lower right.
Make sure the following parameters are set
correctly before running the Alignment&Synteny (A&S) step.
Load project:
Group prefix, Minimum length,
Sequence files
Alignment&Synteny: Mask non-genes.
If any of these need to be changed after A&S, then the alignment needs to be run again (note, the
alignment can take a long time!!).
Sequence and Anno (annotation) files:
If the files are put in the default locations, nothing needs to be entered for them.
Default locations:
data/seq/<project-name>/sequence
data/seq/<project-name>/annotation
Do NOT use downloaded NCBI or Ensembl files directly.
See toSymap for an interface
to convert these. If you are using files from another source, make sure they conform to the rules for
Sequence files and
Annotation files.
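Before loading, it can help to confirm that the default directories exist. The following Python sketch is only a convenience check written for this guide, not part of SyMAP; the function names and the project name `demo_seq` are hypothetical, and it assumes it is run from the symap directory.

```python
from pathlib import Path


def default_dirs(project_name, root="."):
    """Return the default sequence and annotation directories for a project."""
    base = Path(root) / "data" / "seq" / project_name
    return {"sequence": base / "sequence", "annotation": base / "annotation"}


def missing_dirs(project_name, root="."):
    """List which of the default directories do not exist yet."""
    return [kind for kind, path in default_dirs(project_name, root).items()
            if not path.is_dir()]


# 'demo_seq' is a placeholder project name.
print(missing_dirs("demo_seq"))
```

An empty list means both default locations are present, so nothing needs to be entered for the Sequence and Anno files.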
|
|
For the following parameters, new values take immediate effect.
Parameter | Description | Default Value
| Category
|
Category label for the project. This is only used to group projects on the
left side of the Project Manager window.
| Uncategorized
| Display name
| A user-friendly name for the project. Use any combination of letters, digits and dash.
Shorter names will work better in the displays. It must be unique for the category.
| The project directory name.
| Abbreviation
| A name (must be exactly 4 characters) to be used in the column headings for Queries.
| Last 4 characters of Display name.
| Description
| Description of the project. This is only shown in the Selected section.
Do NOT use quotes, backslash or #.
|
| Group type
| How to refer to the sequences. This is shown on the Selected section.
| Chromosome
| Anno key count
| This applies to the annotation attributes columns shown in the Queries results table. See the
Annotation section below.
| 50
|
Group prefix
The term "Group" is used for any sequence type, e.g. chromosome, scaffold, contig.
1. | When a Group prefix is entered, it allows SyMAP to remove the prefix from the chromosome
names and use the remaining part as a shorter name (e.g. '1' instead of 'chr1', as shown on the lower right).
| 2. | If the sequence file has a mix of prefixes (e.g. Chr01, Scaf2345):
- If a Group prefix is entered, only sequences with that prefix will be loaded.
- Leaving the Group prefix blank loads all sequences; their prefix will not be removed
(e.g. the 'chr3' on the upper right).
| 3. | Long names really clutter the display; it is best to use a consistent prefix that SyMAP can remove,
otherwise, use really short prefixes, e.g. 's' for scaffold, 'c' for chromosome.
| 4. | You may remove the prefix after the project is loaded. For example, if your sequences had
names "Scaf1", "Scaf2", etc, and "Scaf" was NOT entered as the Group prefix before load, you may
enter it later and it will be removed from all sequence names. However, this is not reversible
and a prefix cannot be added to the sequence names.
|
|
|
This parameter is finicky; after loading project, check View and the output to the terminal
to make sure the annotation was loaded right. Also, see
xToSymap
as it may help you create the files with good prefixes.
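The effect of the Group prefix on which sequences are loaded and how they are named can be sketched as follows. This is an illustration of the rules described above, not SyMAP's implementation; the function name is hypothetical and case-sensitive matching is an assumption.

```python
def apply_group_prefix(names, prefix=""):
    """Map original sequence names to display names for the sequences that load.

    Blank prefix: every sequence loads and keeps its full name.
    Non-blank prefix: only names starting with the prefix load, and the
    prefix is stripped (case-sensitive matching is assumed here).
    """
    if not prefix:
        return {name: name for name in names}
    return {name: name[len(prefix):] for name in names if name.startswith(prefix)}


names = ["Chr01", "Chr02", "Scaf2345"]
print(apply_group_prefix(names, "Chr"))  # Scaf2345 is not loaded
print(apply_group_prefix(names))         # everything loads, names unchanged
```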
Minimum length
This must be an integer, commas are allowed (e.g. 1,000,000).
Minimum length of a group sequence to load; smaller sequences will be ignored. Note that annotations
for ignored sequences will also be ignored, but some error messages will be printed to the terminal.
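As a small illustration of the Minimum length rule, the Python sketch below parses a value that may contain commas and drops shorter sequences. The function names are hypothetical, and treating a sequence whose length equals the minimum as loaded is an assumption.

```python
def parse_min_length(text):
    """Parse an integer that may contain commas, e.g. '1,000,000'."""
    value = int(text.replace(",", ""))
    if value < 0:
        raise ValueError("Minimum length must not be negative")
    return value


def filter_by_length(seq_lengths, min_length_text):
    """Keep only sequences at least as long as the minimum length."""
    min_len = parse_min_length(min_length_text)
    return {name: n for name, n in seq_lengths.items() if n >= min_len}


# Hypothetical sequence lengths.
lengths = {"chr1": 30_000_000, "scaf12": 250_000, "scaf99": 8_000}
print(filter_by_length(lengths, "1,000,000"))
```

Any annotation belonging to the dropped sequences would be ignored as well.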
Sequence files
Select the input FASTA sequence file(s), or directories of sequence files.
For formatting, see Sequence files.
Default location: data/seq/<project-name>/sequence
If any of the above 3 are changed:
▸ If the project has already been loaded, Reload Project.
▸ If A&S has previously been run,
select Clear pair and remove the alignment files, then run A&S.
Anno keywords
A comma separated list of keywords. This can be used to reduce the annotation attribute keywords shown in the 2D display and Queries table.
See the Annotation section below.
Anno files
Select the input GFF3-formatted annotation files corresponding to your sequences. Note, using
a GFF3 file directly can cause problems; see
Annotation files.
Annotation is optional but highly recommended. Default location: data/seq/<project-name>/annotation
If either of the above are changed:
▸ If the annotation has already been loaded, Reload Annotation.
▸ If A&S has been previously run, re-run A&S (the existing alignments files will be reused).
Mask non-genes
Mask out all non-genic parts of the sequences before running MUMmer (gene annotation must
be provided). This can save time but prevents non-annotated anchors from being found. Also,
this has become less relevant since the Cluster Hits algorithms have become more
gene-aware.
▸ If this is changed after A&S, the alignment files need to be
removed with Clear pair and A&S run again.
Order against
For draft contig sets, this allows you to order them using synteny to
one of the other projects.
See Ordering details.
▸ If the draft has been aligned to the Order against sequence, but this parameter
was not set, it can be set and the A&S run with the existing alignment files.
2.f Rules for saving project parameters
Before the project is loaded, the parameters are saved to the file
data/seq/<project-name>/params
When the project is loaded, all parameters are saved to the database except the A&S parameters.
When the A&S is executed, then the corresponding parameters are saved to the database.
Any parameter that differs from its default will be shown on the View popup.
The params file parameters are shown on the Parameters window. These can only be
viewed and changed using symap (not viewSymap).
2.g Annotation
The gene annotation is shown on the
2D display and as columns in the
Queries results table. The annotation
comes from the last column, called attributes, for each gene in the GFF file. The attributes column is a semicolon-separated keyword=value list,
e.g.
ID=gene-AT1G01010;Name=NAC001;ID=rna-NM_099983.2;product=NAC domain containing protein 1
Defaults: Generally, all genes in a file have the same keywords, in which case, use the defaults.
This will cause the entire attribute to be shown for the gene in the 2D display, and the
Queries table will have columns for each keyword that has over 50 occurrences.
In the example above, the columns will be ID, Name and product (the second ID will be ignored).
If there are many different keywords in the attribute list, this causes too many columns in the Queries table.
This can be reduced by one (or both) of the following:
Anno key count: If there are many different keywords in the attribute list,
set this count N to filter out all keywords with <N occurrences.
The Anno Key Count can be modified at any time using symap (not viewSymap).
Anno keywords: The keyword=value pairs to be saved for each gene can be limited by listing
the desired keywords separated by commas. This approach also shortens the
annotation description per gene in the 2D display.
For example, if the string "ID, product" was entered,
the Name=value would not be part of any gene annotation. This must be set on Load.
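The interaction of the attributes column with Anno key count and Anno keywords can be sketched in Python as below. This is an approximation for illustration, not SyMAP's parser: the function names are hypothetical, and keeping a keyword when its count is at least N is an assumption about the exact cutoff.

```python
from collections import Counter


def parse_attributes(attr, keep=None):
    """Parse a GFF3 attributes string; the first occurrence of a keyword wins.

    keep: optional set of keywords to retain (the Anno keywords idea).
    """
    result = {}
    for field in attr.split(";"):
        if "=" not in field:
            continue
        key, value = field.split("=", 1)
        key = key.strip()
        if key in result:                 # repeated keyword is ignored
            continue
        if keep is not None and key not in keep:
            continue
        result[key] = value.strip()
    return result


def query_columns(attr_strings, key_count=50, keep=None):
    """Return keywords seen in at least key_count genes (the Anno key count idea)."""
    counts = Counter()
    for attr in attr_strings:
        counts.update(parse_attributes(attr, keep).keys())
    return [key for key, n in counts.items() if n >= key_count]


example = ("ID=gene-AT1G01010;Name=NAC001;ID=rna-NM_099983.2;"
           "product=NAC domain containing protein 1")
print(parse_attributes(example))
print(parse_attributes(example, keep={"ID", "product"}))
print(query_columns([example] * 60))
```

With the keep set `{"ID", "product"}`, the Name value is dropped from the gene's annotation, matching the "ID, product" example above.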
3.a Overview
Select a pair cell in the Available Syntenies table followed by the Parameters
button, which will pop up the window shown on the right.
|
|
|
|
If the pair cell shows ✔ or A and the parameters are changed, do the following:
Alignment | Select Clear Pair to remove the existing alignments, then Selected Pair (Redo).
| Cluster Hits or Synteny | Use Selected Pair (Redo) directly;
the existing alignments for the pair will be used.
|
The parameters are described in the following three sections. The most important parameter is for Algo2;
the Intergenic parameter must be increased for conserved genomes.
|
The default MUMmer parameters seem to work fine with SyMAP, so they probably do not
need changing.
Parameter | Description | Default
PROmer Args | Arguments for PROmer (see note 1). |
NUCmer Args | Arguments for NUCmer (see note 1). |
Self Args | Arguments to use when aligning a chromosome to itself (see note 2). |
PROmer Only | Use PROmer for all alignments (see note 3). | Off
NUCmer Only | Use NUCmer for all alignments (see note 3). | Off
Note 1. BEWARE: Entered PROmer and NUCmer arguments are NOT checked for correctness.
Note 2. When self-alignment is performed, standard arguments are used
when comparing different chromosomes. However, additional arguments may be desired
when a chromosome sequence is run against itself, e.g. --nosimplify.
Note 3. By default, PROmer is used for alignments between different
projects, while NUCmer is used for self alignments.
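The default choice of aligner and the two override checkboxes can be summarized with the short Python sketch below. It only restates the rules given above; the function name is hypothetical, and treating the two "Only" options as mutually exclusive is an assumption.

```python
def choose_aligner(project1, project2, promer_only=False, nucmer_only=False):
    """Pick 'promer' or 'nucmer' for a pair of projects.

    Default: PROmer between different projects, NUCmer for self alignment.
    The PROmer Only / NUCmer Only options override the default.
    """
    if promer_only and nucmer_only:
        # Assumption: both options cannot be selected at once.
        raise ValueError("PROmer Only and NUCmer Only cannot both be set")
    if promer_only:
        return "promer"
    if nucmer_only:
        return "nucmer"
    return "nucmer" if project1 == project2 else "promer"


# Project names are placeholders.
print(choose_aligner("rice", "sorghum"))
print(choose_aligner("rice", "rice"))
print(choose_aligner("rice", "sorghum", nucmer_only=True))
```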
Algorithm 1 (modified original, updated v5.4.0, abbreviated Algo1):
Pros This is a generic algorithm that has knowledge of genes versus intergenic hits.
It is recommended for ordering sequence contigs and when there is little or no gene annotation.
It must be used for self-synteny. It has been used on 100's of genome comparisons.
Cons It does not distinguish between exon and intron hits. It is more likely to miss good homologous gene pairs.
Parameters It does not need parameter adjustment. However, this gives no control over what hits are filtered.
Algorithm 2 (exon-intron, last update v5.4.8, abbreviated Algo2):
Pros This is a new algorithm that has explicit knowledge of gene pairs and their exon-intron structure, and shows all gene pairs with
hits unless filtered by the parameters (and a few internal filters).
It has better filtering for small intergenic and intron-only hits.
Cons It does not perform self-synteny or take NUCmer files as input.
It does not work when a given chromosome is split over multiple MUMmer files
(this will not happen when SyMAP generates the MUMmer files).
Parameters The parameters generally will need adjusting. However, this gives more control over what gene pairs
are shown and can suppress the minor intergenic hits. See Hints
below the parameter explanation.
Algorithm 1 is the default for now; this is because Algorithm 2 generally requires some
experimenting with the parameters whereas Algorithm 1 does not.
Parameter | Description | Default
| Algorithm 1 (original) | Yes
|
Top N piles
| It will retain the top N hits of a piled region, as
well as all hits with score at least 80% of the Nth hit.
| 2
| Algorithm 2 (gene-centric)
Categories: | EE | Exon-Exon
| | EI | Exon-Intron
| | En | Exon-intergenic
| | II | Intron-Intron
| | In | Intron-intergenic
| | nn | intergenic-intergenic
|
| No
|
Exon
| If the hit is EE, remove all hits that have less than N aligned bases (EQ 1).
| 100
|
Intron
| If the hit is EI, II, En or In remove all hits that have less than N aligned bases (EQ 1).
| 300
|
Intergenic
| If the hit is nn, remove all hits that have less than N aligned bases (EQ 1).
Increase this parameter for conserved genomes.
| 600
|
Keep piles
| EE, EI, En, II, In
•This ONLY applies when there is a pile of overlapping hits (Pile of Hits);
it tells the algorithm what type of cluster hits to retain if they are in a pile.
•Hits are filtered before pile analysis.
•Intergenic-intergenic pile hits, and any unchecked categories, are filtered
as described in the next row.
| EE, EI, En
|
Top N piles
| Algo2 uses the Algo1 parameter
for any unchecked categories, but in a more conservative way. It will retain the Top N hits of a piled region that are within
80% of the length of the longest hit.
| 2
|
EQ 1. Minimum matched bases = hit-length*identity, where hit-length is the maximum
of the query and target lengths reported by MUMmer, and identity
is the percent identity reported by MUMmer.
For clustered hits, EQ 1 is applied to the summed lengths and identity.
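As a worked illustration of EQ 1 with the default Exon, Intron and Intergenic values, consider the Python sketch below. It is a simplified reading of the filter for a single hit, not SyMAP's code: the names are hypothetical, the percent identity is assumed to be on a 0-100 scale, and the summing used for clustered hits is not modeled.

```python
# Default thresholds from the table above, keyed by hit category.
THRESHOLDS = {"EE": 100, "EI": 300, "II": 300, "En": 300, "In": 300, "nn": 600}


def matched_bases(query_len, target_len, pct_identity):
    """EQ 1: hit-length * identity, with hit-length = max(query, target)."""
    return max(query_len, target_len) * pct_identity / 100.0


def passes_filter(category, query_len, target_len, pct_identity,
                  thresholds=THRESHOLDS):
    """A hit is kept if it has at least N matched bases for its category."""
    return matched_bases(query_len, target_len, pct_identity) >= thresholds[category]


# A 700 bp hit at 80% identity has 560 matched bases.
print(matched_bases(700, 650, 80))
print(passes_filter("nn", 700, 650, 80))  # below the Intergenic default of 600
print(passes_filter("En", 700, 650, 80))  # above the Intron default of 300
```

The same hit is therefore removed if it is intergenic-intergenic but kept if it touches an exon, which is why raising Intergenic is the main lever for conserved genomes.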
Hint for Algorithm 1: Increasing the Top N parameter can cause too many hits and reduce synteny.
Decreasing it can remove more gene-pair hits.
Hence, try Algorithm 2 if you want more gene pairs.
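The Algo1 Top N rule (keep the top N hits of a pile plus any hit scoring at least 80% of the Nth hit) can be sketched as follows. This is an illustration under assumptions: the function name is hypothetical, and what SyMAP uses as the "score" is not specified here, so plain numbers stand in for it.

```python
def top_n_pile(scores, top_n=2, fraction=0.8):
    """Return the scores retained from one pile of overlapping hits.

    Keeps the top_n best scores, plus every further score that is at
    least `fraction` of the Nth best score.
    """
    ranked = sorted(scores, reverse=True)
    if len(ranked) <= top_n:
        return ranked
    cutoff = fraction * ranked[top_n - 1]
    return [s for i, s in enumerate(ranked) if i < top_n or s >= cutoff]


# Hypothetical scores for a pile of five hits.
print(top_n_pile([1000, 900, 750, 700, 300]))
```

Here the second-best score is 900, so the cutoff is 720: the hit scoring 750 is retained in addition to the top two, while 700 and 300 are dropped. Raising N keeps more of the pile, which is how too many hits can creep in.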
Hint for Algorithm 2: On the output to the terminal, if any chromosome pair shows over 10,000 hits, the
parameters probably need to be made more stringent; too many hits (and piles) confuses the synteny algorithm, which results
in synteny blocks not being found; it also results in very long execution time.
For highly similar genomes, it is necessary to increase the Intergenic parameter, e.g. >1000;
you may also need to increase the other parameters and uncheck EI and En to reduce hits. For distant genomes, decrease the parameters.
Suggestion: for large
genomes, experiment with the parameters on just one pair of the chromosomes.
Piles of hits: The image below shows a pile of hits on the left that link to repetitive genes on the right.
The left image shows a pile of hits in an intergenic region to multiple other regions.
There are MANY occurrences of repeats like this in the MUMmer file, which is why these piles
must be filtered; if they are not, the synteny algorithm does not perform well.
|
|
Running with ./symap -s provides additional output for both algorithms.
Wrong orientation
It can happen that all hits in a cluster have the same strand (+/+ or -/-), yet the cluster aligns to
a positive and negative gene (or vice versa). By default, Algo2 includes these hits and writes the count
to the terminal, e.g.
6,736 Both Genes - cluster strands differ from gene (Multi 1,425, Single 5,311)
If you would like to have them excluded or printed, use one of the following.
./symap -wse          # exclude
./symap -wsp          # print to terminal
./symap -wsp >ws.log  # print to file ws.log
You can view them in
Queries, where the Hit St column is "=" but the gene strands (Gst) are not equal, or the other way around.
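The wrong-orientation condition can be expressed as a simple consistency check, sketched below in Python. The function name and the "+/+" string format for the hit strands are assumptions made for this illustration; it does not reproduce how SyMAP stores or reports strands.

```python
def strands_agree(hit_strands, gene1_strand, gene2_strand):
    """Return True if the cluster's strands are consistent with the two genes.

    hit_strands: e.g. '+/+', '-/-', '+/-' or '-/+' (assumed format).
    Same-strand hits are expected between genes on the same strand,
    and opposite-strand hits between genes on opposite strands.
    """
    first, second = hit_strands.split("/")
    hits_same = first == second
    genes_same = gene1_strand == gene2_strand
    return hits_same == genes_same


print(strands_agree("+/+", "+", "-"))  # cluster strands differ from genes
print(strands_agree("+/-", "+", "-"))  # consistent
```

Pairs for which this check fails correspond to the cases counted in the terminal message above.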
Parameter | Description | Default
|
Min Dots
| Minimum number of anchors required to define a synteny block.
| 7
|
Merge Blocks
| Merge overlapping (or nearby) synteny blocks into larger blocks.
| Off
|
Selecting Merge Blocks may be beneficial when there are
many small blocks. Below is
an example where the first image does not have merged blocks and the second does (the blue dots are hits that
belong to the block).
3.e Rules for saving pairs parameters
Before the A&S is executed, the parameters are saved in
data/seq_results/<proj1-to-proj2>/params
Once the A&S is executed, the parameters are stored in the database.
The file parameters are shown on the pair Parameter window. These can only be
viewed and changed using symap (not viewSymap).
Any parameter that differs from its default will be shown on the Summary page.
BEWARE: If you run A&S, then change the PROmer or NUCmer settings, but forget to Clear Pair
before running A&S again, the parameters
on the Summary page will be wrong (SyMAP does not check for this situation).
3.f MUMmer parameters
The arguments for MUMmer are NOT checked for correctness.
To see the parameters for the default MUMmer V3 on MacOS, from the symap directory:
./ext/mummer/mac/mummer -h
./ext/mummer/mac/promer -h
./ext/mummer/mac/nucmer -h
To see the parameters for the default MUMmer V3 on Linux:
./ext/mummer/linux/mummer -h
./ext/mummer/linux/promer -h
./ext/mummer/linux/nucmer -h
If you compiled V4 in the /ext directory:
./ext/mummer4/m4/bin/mummer -h
./ext/mummer4/m4/bin/promer -h
./ext/mummer4/m4/bin/nucmer -h
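To avoid typing the platform-specific path by hand, the directory can be selected programmatically. The Python sketch below builds the help command from the paths listed above; the function names are hypothetical, and it assumes the script is run from the symap directory.

```python
import platform
from pathlib import Path


def mummer_bin_dir(symap_dir=".", use_v4=False, system=None):
    """Return the directory holding the MUMmer executables.

    system defaults to platform.system(), which is 'Darwin' on MacOS
    and 'Linux' on Linux.
    """
    system = system or platform.system()
    ext = Path(symap_dir) / "ext"
    if use_v4:
        return ext / "mummer4" / "m4" / "bin"
    if system == "Darwin":
        return ext / "mummer" / "mac"
    if system == "Linux":
        return ext / "mummer" / "linux"
    raise ValueError(f"No bundled MUMmer directory known for {system}")


def help_command(tool, **kwargs):
    """Build the argument list that prints a tool's help, e.g. for subprocess.run."""
    return [str(mummer_bin_dir(**kwargs) / tool), "-h"]


print(help_command("promer", system="Linux"))
print(help_command("nucmer", use_v4=True))
```

The returned list can be passed to `subprocess.run` to display the parameters for the installed version.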
For a detailed discussion of MUMmer running in SyMAP,
see MUMmer.