AGCoL System Quick Guide and Parameters
SyMAP was written for diverse plant genomes with short introns, but has recently been modified to work for the long introns of mammalian genomes, and less diverse genomes. This is addressed in Pair Parameters.

First-time users of SyMAP should read the System Guide to try the demo and learn the details of the input files. This document provides details of the parameters and functions.

Contents:

1. Build database
  1. Selected
  2. Available Syntenies
  3. CPU and Concat
  4. Function buttons
2. Project Parameters
  1. Overview
  2. Display
  3. Load project
  4. Load annotation
  5. Alignment&Synteny (A&S)
  6. Rules for saving project parameters
  7. Annotation
3. Pair Parameters
  1. Overview
  2. Alignment
  3. Cluster Hits
  4. Synteny
  5. Rules for saving pair parameters
  6. MUMmer parameters

[Figure: Project Manager]

1. Build database

This provides a quick summary of the build functions. See System Guide for details.

Sections: Selected | Available Synteny | Function buttons

1.a Selected


The projects selected from the Projects label in the left panel are listed in the Selected section on the right. The possible functions vary with the state, as listed below:

If any project is not loaded into the database, you will see:
   Load All Projects: Load all projects that have not been loaded yet.
If a project is not loaded into the database, you will see:
   Remove from disk: Remove your files from disk. The project will no longer be shown on the left.
   Load project: Load the sequence and optional annotations into the database.
   Parameters: Open a panel of parameters; see Project Parameters. After the project is loaded, you can still change the Display parameters.
If a project is loaded into the database, you will see:
   Remove from database: The sequences and annotation will be removed, but the files stay on disk.
   Reload project: Execute Remove from database followed by Load project. If there are alignment files, you will be asked whether you want them removed; see the parameters to determine whether they should be removed.
   Reload annotation: Remove the annotation from the database, then load the annotations again. The Alignment&Synteny commands will recognize an existing alignment and will only perform the synteny computation.
   Parameters: Open the project parameters panel.

1.b Available Syntenies

Sequence alignments are performed with MUMmer3, but can be changed to use MUMmer4 (see SyMAP MUMmer).

This section shows a table with the status of alignments between the selected loaded projects. Each cell in the table represents a pair of projects and the cell contains a status code showing whether or not that pair has been aligned (codes are listed below). Note that the table shows each pair cell twice, but only the lower cells are activated.

Clicking on a cell selects that pair of projects (the cell will be highlighted in green). Alignment&Synteny may then be computed or viewed for the selected pair using the function buttons.

Code | Description
✔ | Synteny for this project pair is ready to view.
A | The MUMmer alignment has been performed but the synteny computation has not been run. This status occurs if a pair is completed but then annotations are re-loaded for one of the projects, or if the MUMmer files have been added by the user.
? | The alignments have not been completed. In this case, select Selected Pair (Redo) and the alignments will be completed, followed by the synteny algorithm.
(blank) | The alignment has not been started.

1.c CPU and Concat

CPUs: Enter the maximum number of CPUs to use for an alignment. SyMAP will use up to that number (but may use fewer, depending on the number of sequences being aligned). Alternatively, the number of CPUs may be entered in the symap.config file or entered as a command line argument (-p N).

Concat: By default, all sequences of the 1st genome are concatenated into one file, and multiple files are created from the 2nd genome to be searched against the first. Each of these files may contain two or more sequential chromosomes concatenated together so that the file is >60M. To reduce memory usage, you can uncheck Concat so that multiple files are created for each genome, and all files from the 1st genome are searched against the 2nd genome. See timing difference.
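The grouping of sequential chromosomes into files of more than 60M can be pictured with the following minimal sketch. The exact rule SyMAP uses is not stated here, so the greedy strategy, the function name `group_sequences` and the example lengths are all assumptions made for illustration.

```python
def group_sequences(lengths, min_size=60_000_000):
    """Greedily group sequential (name, length) pairs until a group exceeds min_size.

    Hypothetical helper: this is NOT SyMAP's code, only one plausible reading
    of "sequential chromosomes concatenated together to create files >60M".
    """
    groups, current, total = [], [], 0
    for name, length in lengths:
        current.append(name)
        total += length
        if total > min_size:
            groups.append(current)
            current, total = [], 0
    if current:
        groups.append(current)  # leftover sequences form a final, smaller group
    return groups


# Made-up chromosome lengths, for illustration only.
chroms = [("chr1", 45_000_000), ("chr2", 20_000_000),
          ("chr3", 70_000_000), ("chr4", 10_000_000)]
print(group_sequences(chroms))
```

Under this reading, fewer and larger files mean fewer MUMmer runs, at the cost of more memory per run, which matches the trade-off described for the Concat checkbox.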

1.d Function Buttons

Alignment&Synteny (A&S)
All Pairs: Run (or complete) the synteny computation for all pairs in the Available Syntenies table. However, this will not run 'self-syntenies'; those need to be done individually by selecting the pair. Also, a draft sequence that is to be ordered can only be aligned to one complete sequence, so that should also be done individually.
Selected Pair / Selected Pair (Redo): Run (or complete) the A&S computation for the selected project pair. If the pair is already complete, the button label changes to Redo, and only the synteny computation will be rerun. If you wish to rerun the MUMmer alignment, first use the Clear Pair function.
Clear Pair: You will be prompted to choose whether to:
(1) remove synteny from the database only, or
(2) remove synteny from the database and remove alignments.
Parameters: Set the pair parameters for the selected pair cell.

For the remaining display buttons, see User Guide.


2. Project Parameters

Sections: Overview | Display | Load project | Load annotation | Alignment&Synteny (A&S) | Save | Annotation

2.a Overview

Click the Parameters link to open the Parameters window shown on the lower right.

 

Make sure the following parameters are set correctly before running the Alignment&Synteny (A&S) step.

Load project: Group prefix, Minimum length, Sequence files
Alignment&Synteny: Mask non-genes.

If any of these are changed after A&S, the alignment must be run again (note: the alignment can take a long time).

 

Sequence and Anno (annotation) files:
If the files are put in the default locations, nothing needs to be entered for them.

Default locations:
data/seq/<project-name>/sequence
data/seq/<project-name>/annotation

[Figure: Project options]

2.b Display

For the following parameters, new values take immediate effect.
Parameter | Description | Default Value
Category | Category label for the project. This is only used to group projects on the left side of the Project Manager window. | Uncategorized
Display name | A user-friendly name for the project. Use any combination of letters, digits and dashes. Shorter names work better in the displays. It must be unique within the category. | The project directory name
Abbreviation | A name (must be exactly 4 characters) to be used in the column headings for Queries. | Last 4 characters of Display name
Description | Description of the project. This is only shown in the Selected section. Do NOT use quotes, backslash or #. | (none)
Group type | How to refer to the sequences. This is shown in the Selected section. | Chromosome
Anno key count | This applies to the annotation attribute columns shown in the Queries results table. See the Annotation section below. | 50

2.c Load project

Group prefix
The term "Group" is used for any sequence type, e.g. chromosome, scaffold, contig.
  • When a Group prefix is entered, SyMAP removes the prefix from the chromosome names and uses the remaining part as a shorter name (e.g. '1' instead of 'chr1', as shown on the lower right).
  • If the sequence file has a mix of prefixes (e.g. Chr01, Scaf2345):
    • If a Group prefix is entered, only sequences with that prefix will be loaded.
    • Leaving the Group prefix blank loads all sequences; their prefixes will not be removed (e.g. the 'chr3' on the upper right).
  • Long names clutter the display; it is best to use a consistent prefix that SyMAP can remove. Otherwise, use very short prefixes, e.g. 's' for scaffold, 'c' for chromosome.
[Figure: Demo prefix]
  • You may remove the prefix after the project is loaded. For example, if your sequences had names "Scaf1", "Scaf2", etc., and "Scaf" was NOT entered as the Group prefix before load, you may enter it later and it will be removed from all sequence names. However, this is not reversible, and a prefix cannot be added to the sequence names.
This parameter is finicky; after loading the project, check View and the terminal output to make sure the annotation was loaded correctly.
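The Group prefix behaviour described above can be sketched in a few lines. This is an illustrative model only: the function name `apply_group_prefix` is hypothetical, and case-sensitive matching is an assumption, since the guide does not say how SyMAP compares prefixes.

```python
def apply_group_prefix(names, prefix=""):
    """Model of the Group prefix rule (hypothetical helper, not SyMAP code).

    - Blank prefix: every sequence is kept and names are left unchanged.
    - Non-blank prefix: only names starting with it are kept, and the
      prefix is stripped (matching is assumed to be case-sensitive).
    """
    if not prefix:
        return list(names)
    return [name[len(prefix):] for name in names if name.startswith(prefix)]


names = ["Chr01", "Chr02", "Scaf2345"]
print(apply_group_prefix(names, "Chr"))  # scaffold is dropped, prefix removed
print(apply_group_prefix(names))         # everything kept, names unchanged
```

Running a check like this against your own sequence names before loading can show which sequences a given prefix would exclude.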

Minimum length
Minimum length of a group sequence to load; smaller sequences will be ignored. Note that annotations for ignored sequences will also be ignored, but some error messages will be printed to the terminal.
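To preview which sequences a given Minimum length would drop, a small script along these lines can be used. It is a sketch under assumptions: the helper names are hypothetical, and whether a sequence exactly equal to the minimum is kept is not stated in this guide (the sketch keeps it).

```python
def read_fasta_lengths(lines):
    """Return {sequence_name: length} from FASTA-formatted lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # name is the first word of the header
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def keep_long_groups(lengths, min_length):
    """Keep sequences at least min_length long (boundary handling is assumed)."""
    return {name: n for name, n in lengths.items() if n >= min_length}


fasta = [">Chr01 demo", "ACGTACGTAC", "ACGT", ">Scaf7", "ACG"]
lengths = read_fasta_lengths(fasta)
print(keep_long_groups(lengths, min_length=10))
```

To run it on a real file, pass an open file handle instead of the list, e.g. `read_fasta_lengths(open(path))`, where `path` points at one of your sequence files.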

Sequence files
Select the input FASTA sequence file(s), or directories of sequence files. For formatting, see Sequence files. Default location: data/seq/<project-name>/sequence

If any of the above 3 are changed:
▸ If the project has already been loaded, Reload Project.
▸ If A&S has previously been run, select Clear pair and remove the alignment files, then run A&S.

2.d Load annotation

Anno keywords
A comma-separated list of keywords. This can be used to reduce the annotation attribute keywords shown in the 2D display and Queries table. See the Annotation section below.

Anno files
Select the input GFF3-formatted annotation files corresponding to your sequences. Note, using a GFF3 file directly can cause problems; see Annotation files. Annotation is optional but highly recommended. Default location: data/seq/<project-name>/annotation

If either of the above is changed:
▸ If the annotation has already been loaded, Reload Annotation.
▸ If A&S has been previously run, re-run A&S (the existing alignment files will be reused).

2.e Alignment&Synteny (A&S)

Mask non-genes
Mask out all non-genic parts of the sequences before running MUMmer (gene annotation must be provided). This can save time but prevents non-annotated anchors from being found. Also, this has become less relevant since the Cluster Hits algorithms have become more gene-aware.

▸ If this is changed after A&S, the alignment files need to be removed with Clear pair and A&S run again.

Order against
For draft contig sets, this allows you to order them using synteny to one of the other projects. See Ordering details.

▸ If the draft has been aligned to the Order against sequence, but this parameter was not set, it can be set afterwards and A&S can be run with the existing alignment files.

2.f Rules for saving project parameters

Before the project is loaded, the parameters are saved to the file
  data/seq/<project-name>/params
When the project is loaded, all parameters are saved to the database except the A&S parameters.
When the A&S is executed, then the corresponding parameters are saved to the database.

Any parameter that differs from its default will be shown on the View popup.

The params file parameters are shown on the Parameters window. These can only be viewed and changed using symap (not viewSymap).

2.g Annotation

The gene annotation is shown on the 2D display and as columns in the Queries results table. The annotation comes from the last column, called attributes, for each gene in the GFF file. The attributes column is a semicolon-separated keyword=value list, e.g.
   ID=gene-AT1G01010;Name=NAC001;ID=rna-NM_099983.2;product=NAC domain containing protein 1
Defaults: Generally, all genes in a file have the same keywords, in which case use the defaults. This causes the entire attribute to be shown for the gene in the 2D display, and the Queries table will have a column for each keyword that has over 50 occurrences. In the example above, the columns will be ID, Name and product (the second ID will be ignored).
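The way the attribute string above breaks down into keyword=value pairs can be illustrated with a short parser. This is a sketch, not SyMAP's loader; the function name `parse_attributes` is hypothetical, and it mirrors the statement above that a repeated keyword (the second ID) is ignored.

```python
def parse_attributes(attr):
    """Split a GFF attributes string into a dict, keeping the first
    occurrence of each keyword (hypothetical helper for illustration)."""
    pairs = {}
    for field in attr.strip().split(";"):
        if "=" not in field:
            continue  # skip empty or malformed fields
        key, value = field.split("=", 1)
        key = key.strip()
        if key not in pairs:  # a repeated keyword is ignored
            pairs[key] = value.strip()
    return pairs


attr = ("ID=gene-AT1G01010;Name=NAC001;ID=rna-NM_099983.2;"
        "product=NAC domain containing protein 1")
for key, value in parse_attributes(attr).items():
    print(key, "=", value)
```

The three keywords it yields (ID, Name, product) correspond to the three Queries columns named in the example.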

If there are many different keywords in the attribute list, the Queries table will have too many columns. This can be reduced by one (or both) of the following:

Anno key count: Set this count N to filter out all keywords with <N occurrences. The Anno key count can be modified at any time using symap (not viewSymap).

Anno keywords: The keyword=value pairs to be saved for each gene can be limited by listing the desired keywords separated by commas. This approach also reduces the annotation description per gene in the 2D display. For example, if the string "ID, product" was entered, the Name=value would not be part of any gene annotation. This must be set on Load.
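The combined effect of Anno key count and Anno keywords on the Queries columns can be sketched as follows. The function name `select_columns` and the toy gene list are hypothetical, and the sketch follows the "<N occurrences are filtered" wording, so a keyword with exactly N occurrences is kept; SyMAP's actual boundary handling may differ.

```python
from collections import Counter


def select_columns(gene_attrs, key_count=50, keywords=None):
    """Return the attribute keywords that would become columns.

    gene_attrs: list of {keyword: value} dicts, one per gene.
    key_count:  keywords with fewer occurrences are dropped.
    keywords:   optional comma-separated list limiting what is kept at all.
    (Hypothetical helper for illustration, not SyMAP code.)
    """
    wanted = None
    if keywords:
        wanted = {k.strip() for k in keywords.split(",") if k.strip()}
    counts = Counter()
    for attrs in gene_attrs:
        for key in attrs:
            if wanted is None or key in wanted:
                counts[key] += 1
    return [key for key, n in counts.items() if n >= key_count]


# Toy data: 60 genes with ID/Name/product, 3 genes with a rare "Note" keyword.
genes = ([{"ID": "g1", "Name": "a", "product": "p"}] * 60
         + [{"ID": "g2", "Note": "rare"}] * 3)
print(select_columns(genes))                         # rare keyword dropped
print(select_columns(genes, keywords="ID, product")) # Name excluded entirely
```

The key count only hides rarely used keywords, while the keyword list removes them from the stored annotation, which is why the latter must be set on Load.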


3. Pair Parameters

Sections: Overview | Alignment | Cluster Hits | Synteny | Save | MUMmer parameters

3.a Overview

Select a pair cell in the Available Syntenies table, then click the Parameters button, which opens the window shown on the right.
[Figure: pair parameters]

If the pair cell shows ✔ or A and the parameters are changed, do the following:

Changed section | Action
Alignment | Select Clear Pair to remove the existing alignments, then Selected Pair (Redo).
Cluster Hits or Synteny | Use Selected Pair (Redo) directly; the existing alignments for the pair will be used.

 

The parameters are described in the following three sections. The most important parameter is for Algo2; the Intergenic parameter must be increased for conserved genomes.

3.b Alignment

The default MUMmer parameters appear to work well with SyMAP, so they probably do not need changing.
Parameter | Description | Default
PROmer Args | Arguments for PROmer [1] | (none)
NUCmer Args | Arguments for NUCmer [1] | (none)
Self Args | Arguments to use when aligning a chromosome to itself [2] | (none)
PROmer Only | Use PROmer for all alignments [3] | Off
NUCmer Only | Use NUCmer for all alignments [3] | Off

[1] BEWARE: Entered PROmer and NUCmer arguments are NOT checked for correctness.
[2] When self-alignment is performed, standard arguments are used when comparing different chromosomes. However, additional arguments may be desired when a chromosome sequence is run against itself, e.g. --nosimplify.
[3] By default, PROmer is used for alignments between different projects, while NUCmer is used for self alignments.
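The aligner choice described in the notes above amounts to a small decision rule, sketched here. The function name `choose_aligner` is hypothetical, and treating both "Only" options being checked as an error is an assumption; the guide does not say how SyMAP handles that case.

```python
def choose_aligner(project1, project2, promer_only=False, nucmer_only=False):
    """Return "promer" or "nucmer" for a project pair.

    Default rule from the guide: PROmer between different projects,
    NUCmer for self alignments. The "Only" options override the default.
    (Hypothetical helper; the both-checked case is an assumption.)
    """
    if promer_only and nucmer_only:
        raise ValueError("PROmer Only and NUCmer Only cannot both be set")
    if promer_only:
        return "promer"
    if nucmer_only:
        return "nucmer"
    return "nucmer" if project1 == project2 else "promer"


print(choose_aligner("demo1", "demo2"))                    # different projects
print(choose_aligner("demo1", "demo1"))                    # self alignment
print(choose_aligner("demo1", "demo2", nucmer_only=True))  # override
```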

3.c Cluster Hits

Algorithm 1 (original, updated v5.4.0, abbreviated Algo1):
Pros: This is a generic algorithm that has knowledge of genes versus intergenic hits. It is recommended for ordering sequence contigs and when there is little or no gene annotation. It must be used for self-synteny. It has been used on hundreds of genome comparisons.
Cons: It does not distinguish between exon and intron hits. It is more likely to miss good homologous gene pairs.
Parameters: It does not need parameter adjustment. However, this gives no control over what hits are filtered.

Algorithm 2 (new, last update v5.4.8, abbreviated Algo2):
Pros: This is a new algorithm that has explicit knowledge of gene pairs and their exon-intron structure, and shows all gene pairs with hits unless filtered by the parameters (and a few internal filters). It has better filtering for small intergenic and intron-only hits.
Cons: It does not perform self-synteny or take NUCmer files as input. It does not work when a given chromosome is split over multiple MUMmer files (this will not happen when SyMAP generates the MUMmer files).
Parameters: The parameters generally will need adjusting. However, this gives more control over what gene pairs are shown and can suppress the minor intergenic hits. See the Hints below the parameter explanation.

Algorithm 1 is the default for now; this is because Algorithm 2 generally requires some experimenting with the parameters whereas Algorithm 1 does not.

Parameter | Description | Default
Algorithm 1 (original) | | Yes
  Top N piles | Retain the top N hits of a piled region, as well as all hits with a score of at least 80% of the Nth hit. | 2
Algorithm 2 (gene-centric) | Categories: EE = Exon-Exon, EI = Exon-Intron, En = Exon-intergenic, II = Intron-Intron, In = Intron-intergenic, nn = intergenic-intergenic | No
  Exon | If the hit is EE, remove all hits that have fewer than N aligned bases (EQ 1). | 100
  Intron | If the hit is EI, II, En or In, remove all hits that have fewer than N aligned bases (EQ 1). | 300
  Intergenic | If the hit is nn, remove all hits that have fewer than N aligned bases (EQ 1). Increase this parameter for conserved genomes. | 600
  Keep piles | Checkboxes for EE, EI, En, II, In. This ONLY applies when there is a pile of overlapping hits (Pile of Hits); it tells the algorithm what type of cluster hits to retain if they are in a pile. Hits are filtered before pile analysis. Intergenic-intergenic pile hits, and any unchecked categories, are filtered as described in the next row. | EE, EI, En
  Top N piles | Algo2 uses the Algo1 parameter for any unchecked categories, but in a more conservative way. It retains the Top N hits of a piled region that are within 80% of the length of the longest hit. | 2

EQ 1. Minimum matched bases = hit-length*identity, where hit-length is the larger of the query and target lengths reported by MUMmer, and identity is the percent identity reported by MUMmer. For clustered hits, EQ 1 is applied to the summed lengths and identity.
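As a worked illustration of EQ 1 and the three Algo2 thresholds, the sketch below computes the matched bases for a single hit and checks it against the default for its category. The names `matched_bases` and `passes_filter` are hypothetical, identity is assumed to be given as a percentage (0 to 100), and how identity is combined for clustered hits is not modelled.

```python
# Default thresholds from the table: Exon (EE), Intron (EI, II, En, In),
# Intergenic (nn).
THRESHOLDS = {"EE": 100, "EI": 300, "II": 300, "En": 300, "In": 300, "nn": 600}


def matched_bases(query_len, target_len, pct_identity):
    """EQ 1: hit-length * identity, with hit-length the larger of the two
    lengths and identity given as a percentage (assumption)."""
    return max(query_len, target_len) * pct_identity / 100.0


def passes_filter(category, query_len, target_len, pct_identity,
                  thresholds=THRESHOLDS):
    """A hit is kept when it has at least N matched bases for its category."""
    return matched_bases(query_len, target_len, pct_identity) >= thresholds[category]


# A 120/110-base hit at 90% identity gives 120 * 0.90 = 108 matched bases.
print(matched_bases(120, 110, 90.0))
print(passes_filter("EE", 120, 110, 90.0))  # 108 >= 100
print(passes_filter("nn", 120, 110, 90.0))  # 108 < 600
```

The same hit therefore survives as an exon-exon hit but is removed as an intergenic-intergenic hit, which is the intent of the graded thresholds.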

Hint for Algorithm 1: Increasing the Top N parameter can cause too many hits and reduce synteny. Decreasing it can remove more gene-pair hits. Hence, try Algorithm 2 if you want more gene pairs.
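The Algo1 Top N rule (keep the top N hits of a pile plus any hit scoring at least 80% of the Nth hit) can be sketched as below. The function name `top_n_pile` and the example scores are hypothetical; how SyMAP scores hits is not described in this guide.

```python
def top_n_pile(scores, n=2, fraction=0.8):
    """Return the pile scores retained under the Top N rule.

    Keeps the n best scores, plus any further score that is at least
    `fraction` of the n-th best (hypothetical helper, positive scores assumed).
    """
    ranked = sorted(scores, reverse=True)
    if len(ranked) <= n:
        return ranked
    cutoff = fraction * ranked[n - 1]
    return [s for s in ranked if s >= cutoff]


pile = [1000, 900, 800, 700, 300]
print(top_n_pile(pile, n=2))  # cutoff is 80% of 900 = 720
print(top_n_pile(pile, n=3))  # cutoff is 80% of 800 = 640
```

Raising N lowers the cutoff as well as lengthening the list, which is why a larger Top N can admit many more hits than the increase in N alone suggests.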

Hint for Algorithm 2: In the terminal output, if any chromosome pair shows over 10,000 hits, the parameters probably need to be made more stringent. Too many hits (and piles) confuse the synteny algorithm, which results in synteny blocks not being found and in very long execution times. For highly similar genomes, it is necessary to increase the Intergenic parameter, e.g. >1000; you may also need to increase the other parameters and uncheck EI and En to reduce hits. For distant genomes, decrease the parameters. Suggestion: for large genomes, experiment with the parameters on just one pair of chromosomes.

Piles of hits: The first image below shows a pile of hits on the left that link to repetitive genes on the right.

The second image shows a pile of hits in an intergenic region that link to multiple other regions. There are many occurrences of repeats like this in the MUMmer file, which is why these piles must be filtered; if they are not, the synteny algorithm does not perform well.

[Figure: Pile to repetitive gene]

[Figure: Pile to intergenic]

Running with ./symap -s provides additional output for both algorithms.

Wrong orientation

It can happen that all hits in a cluster have the same strand (+/+ or -/-), yet the cluster aligns to a positive and a negative gene (or vice versa). By default, Algo2 includes these hits and writes the count to the terminal, e.g.
6,736 Both Genes - cluster strands differ from gene (Multi 1,425, Single 5,311)
If you would like to have them excluded or printed, use one of the following.
./symap -wse # exclude
./symap -wsp # print to terminal
./symap -wsp >ws.log   # print to file ws.log

You can view them in the Query, where the Hit St column will be "=" but the gene strands (Gst) are not equal, or the other way around.
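The "wrong orientation" condition can be expressed as a simple comparison, sketched here. The function name `strands_disagree` and the tuple representation of strands are hypothetical; this is not how SyMAP stores hits.

```python
def strands_disagree(hit_strands, gene_strands):
    """True when the hit strands are the same but the gene strands differ,
    or the other way around (hypothetical helper for illustration).

    Each argument is a (strand1, strand2) tuple of "+" or "-".
    """
    hit_same = hit_strands[0] == hit_strands[1]
    gene_same = gene_strands[0] == gene_strands[1]
    return hit_same != gene_same


print(strands_disagree(("+", "+"), ("+", "-")))  # hit "=", genes differ
print(strands_disagree(("+", "-"), ("-", "-")))  # hit differs, genes "="
print(strands_disagree(("-", "-"), ("+", "+")))  # both "=": consistent
```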

3.d Synteny

Parameter | Description | Default
Min Dots | Minimum number of anchors required to define a synteny block. | 7
Merge Blocks | Merge overlapping (or nearby) synteny blocks into larger blocks. | Off

Selecting Merge Blocks may be beneficial when there are many small blocks. The example below shows two images: the first does not have merged blocks and the second does (the blue dots are hits that belong to the block).

[Figures: blocks not merged; blocks merged]

3.e Rules for saving pair parameters

Before the A&S is executed, the parameters are saved in
  data/seq_results/<proj1-to-proj2>/params
Once the A&S is executed, the parameters are stored in the database.

The file parameters are shown on the pair Parameter window. These can only be viewed and changed using symap (not viewSymap).

Any parameter that differs from its default will be shown on the Summary page.

BEWARE: If you run A&S, then change the PROmer or NUCmer settings, but forget to Clear Pair before running A&S again, the parameters on the Summary page will be wrong (SyMAP does not check for this situation).

3.f MUMmer parameters

The arguments for MUMmer are NOT checked for correctness.

To see the parameters for the default MUMmer V3 on MacOS, from the symap directory:

./ext/mummer/mac/mummer -h
./ext/mummer/mac/promer -h
./ext/mummer/mac/nucmer -h
To see the parameters for the default MUMmer V3 on Linux:
./ext/mummer/linux/mummer -h
./ext/mummer/linux/promer -h
./ext/mummer/linux/nucmer -h
If you compiled V4 in the /ext directory:
./ext/mummer4/m4/bin/mummer -h
./ext/mummer4/m4/bin/promer -h
./ext/mummer4/m4/bin/nucmer -h
For a detailed discussion of MUMmer running in SyMAP, see MUMmer.


Email Comments To: symap@agcol.arizona.edu