AGCoL SyMAP Interface and Parameters
SyMAP was written for diverse plant genomes with short introns, but has recently been modified to work for the long introns of mammalian genomes, and less diverse genomes. This is addressed in Pair Parameters.

Contents: Start SyMAP

1. Build database
  1. Selected
  2. Available Syntenies
  3. CPU and Concat
  4. Function buttons
2. Project Parameters
  1. Overview
  2. Display
  3. Load project
  4. Load annotation
  5. Alignment&Synteny (A&S)
  6. Rules for saving project parameters
  7. GFF Attributes
3. Pair Parameters
  1. Overview
  2. Alignment
  3. Cluster Hits
  4. Synteny
  5. Rules for saving pair parameters
  6. MUMmer parameters

Start SyMAP

To start SyMAP, type at the command line: ./symap
To view the command line options, type: ./symap -h

For the first time user of SyMAP, see:

This document provides details of the parameters and functions.

Project Manager

1. Build database

Sections:     Selected Available Synteny Function buttons

1.a Selected


The projects selected from the Projects label in the left panel are listed in the Selected section on the right. The possible functions vary with the state, as listed below:

If any project is not yet loaded into the database, you will see:
   Load All Projects: Loads all projects that have not been loaded yet.
If a project is not loaded into the database, you will see:
   Remove from disk: Removes your files from disk. The project will no longer be shown on the left.
   Load project: Loads the sequence and optional annotations into the database. After loading, always verify the results by selecting the View link, which provides a summary of what has been loaded.
   Parameters: Brings up a panel of parameters; see Project Parameters. After the project is loaded, you can still change the Display parameters.
If a project is loaded into the database, you will see:
   Remove from database: The sequences and annotation will be removed, but the files stay on disk.
   Reload project: Executes Remove from database followed by Load project. If there are alignment files, you will be asked if you want them removed; see the parameters to determine whether they should be removed.
   Reload annotation: Removes the annotation from the database, then loads the annotation again. The Alignment&Synteny commands will recognize an existing alignment and will only perform the synteny computation.
   Parameters: Brings up the project parameters panel.

1.b Available Syntenies

Sequence alignments are performed with MUMmer3, but can be changed to use MUMmer4 (see SyMAP MUMmer).

This section shows a table with the status of alignments between the selected loaded projects. Each cell in the table represents a pair of projects and the cell contains a status code showing whether or not that pair has been aligned (codes are listed below). Note that the table shows each pair cell twice, but only the lower cells are activated.

Clicking on a cell selects that pair of projects (the cell will be highlighted in green), and the buttons that can be selected are activated.

Code     Description
✔        Synteny for this project pair is ready to view.
A        The MUMmer alignment has been performed but the synteny computation has not been run. This status occurs if a pair is completed but then annotations are re-loaded for one of the projects, or if the MUMmer files have been added by the user.
?        The alignment has not been completed. In this case, select Selected Pair (Redo) and the alignments will be completed, followed by the synteny algorithm.
(blank)  The alignment has not been started.

1.c CPU and Concat

CPUs: Enter the maximum number of CPUs to use for an alignment. SyMAP will use up to that number (but may use fewer, depending on the number of sequences being aligned). Alternatively, the number of CPUs may be entered in the symap.config file or entered as a command line argument (-p N).

Concat: By default, all sequences of the 1st genome are concatenated into one file and then multiple files are created from the 2nd genome to be searched against the first. For the 2nd genome, sequential short sequences are put into one file until the file length is >60M.

Concat unchecked: To reduce memory usage, you can uncheck Concat so that multiple files are created for each genome, and all files from the 1st genome are searched against all files from the 2nd genome. For both genomes, sequential short sequences are put into one file until the file length is >60M.
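
The file-splitting rule described above can be pictured with a minimal Python sketch. It is not SyMAP's code: the function name batch_sequences is hypothetical, and it assumes that the sequence that pushes a file past the 60M limit stays in that file.

```python
from typing import Iterable, List, Tuple

MAX_FILE_BASES = 60_000_000  # the ">60M" threshold described above


def batch_sequences(
    seqs: Iterable[Tuple[str, int]], limit: int = MAX_FILE_BASES
) -> List[List[str]]:
    """Group (name, length) pairs, in input order, into files.

    Assumption: a file is closed as soon as its total length exceeds
    `limit`, so the sequence that crosses the limit stays in that file.
    """
    files: List[List[str]] = []
    current: List[str] = []
    total = 0
    for name, length in seqs:
        current.append(name)
        total += length
        if total > limit:
            files.append(current)
            current, total = [], 0
    if current:  # leftover sequences form the last file
        files.append(current)
    return files


if __name__ == "__main__":
    demo = [("s1", 30_000_000), ("s2", 40_000_000), ("s3", 10_000_000)]
    print(batch_sequences(demo))
```

With Concat checked, only the 2nd genome would be split this way; with Concat unchecked, the same grouping would apply to both genomes.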

The following statistics are from comparing Arabidopsis thaliana (119M) against Brassica rapa (297M) on macOS using 1 CPU.

                  Concatenated        Not concatenated
Hits              48819               48846
Synteny blocks    334                 334
Gene hits         46319               46348
Synteny hits      38334               38345
Finished in       1 hr 7 min 41 sec   1 hr 34 min 29 sec

1.d Function Buttons

Alignment&Synteny (A&S)
All Pairs: Run (or complete) the synteny computation for all pairs in the Available Syntenies table. However, this will not run 'self-syntenies'; those need to be done individually by selecting the pair. Also, a draft sequence that is to be ordered can only be aligned to one complete sequence, so that should also be done individually.
Selected Pair / Selected Pair (Redo): Run (or complete) the A&S computation for the selected project pair. If the pair is already complete, the button label changes to Selected Pair (Redo), and only the synteny computation will be rerun. If you wish to rerun the MUMmer alignment, first use the Clear Pair function.
Clear Pair: You will be prompted to choose whether to:
(1) remove synteny from the database only, or
(2) remove synteny from the database and remove the alignments.
Parameters: Set the pair parameters for the selected pair cell.

For the remaining display buttons, see User Guide.


2. Project Parameters

Sections:     Overview Display Load project Load annotation Alignment&Synteny Save GFF Attributes

2.a Overview

Click the Parameters link for a project to open the parameters window shown on the lower right.

Make sure the following parameters are set correctly before running the Alignment&Synteny (A&S) step.

Load project: Group prefix, Minimum length, Sequence files. See Load project.

Alignment&Synteny: Mask non-genes. See Mask.

If any of these need to be changed after A&S, the alignment must be run again. Note that the alignment can take a long time.

 

Sequence and Anno (annotation) files:
If the files are put in the default locations, nothing needs to be entered for them. Default locations:

  data/seq/<project-name>/sequence
  data/seq/<project-name>/annotation
See input for a description of the input files.
Project options


2.b Display

For the following parameters, new values take immediate effect.

Parameter: Category
  Description: Category label for the project. This is only used to group projects on the left side of the Project Manager window.
  Default: Uncategorized
Parameter: Display name
  Description: A user-friendly name for the project. Use any combination of letters, digits and dash. Shorter names will work better in the displays. It must be unique for the category.
  Default: The project directory name.
Parameter: Abbreviation
  Description: A name (must be exactly 4 characters) to be used in the column headings for Queries.
  Default: Last 4 characters of Display name.
Parameter: Description
  Description: Description of the project. This is only shown in the Selected section and View popup. Do NOT use quotes, backslash or #.
  Default: (none)
Parameter: Group type
  Description: How to refer to the sequences. This is shown in the Selected section.
  Default: Chromosome
Parameter: Anno key count
  Description: This applies to the annotation attribute columns shown in the Queries results table. See the GFF Attributes section below.
  Default: 50

2.c Load project

Group prefix
The term "Group" is used for any FASTA sequence type, e.g. chromosome, scaffold, contig. This option sounds trivial but is important for a good display, so please read the following carefully.
1. When a Group prefix is entered, SyMAP removes the prefix from the chromosome names and uses the remaining part as a shorter name (e.g. '1' instead of 'chr1', as shown on the lower right).
2. If the sequence file has a mix of prefixes (e.g. Chr01, Scaf2345):
  • If a Group prefix is entered, only sequences with that prefix will be loaded.
  • Leaving the Group prefix blank will load all sequences; their prefixes will not be removed (e.g. the 'chr3' on the upper right).
3. Long names clutter the display; it is best to use a consistent prefix that SyMAP can remove. Otherwise, use very short prefixes, e.g. 's' for scaffold, 'c' for chromosome.
4. You may remove the prefix after the project is loaded. For example, if your sequences had names "Scaf1", "Scaf2", etc., and "Scaf" was NOT entered as the Group prefix before loading, you may enter it later and it will be removed from all sequence names. However, this is not reversible, and a prefix cannot be added to the sequence names.
This parameter is finicky; after loading a project, select View for a popup of the input, and check the output to the terminal to make sure the annotation was loaded right. Also, see xToSymap as it may help you create the files with good prefixes.
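
As an illustration of the Group prefix rules above, here is a minimal Python sketch. The function name apply_group_prefix is hypothetical, and the sketch assumes the prefix match is case-sensitive, which the text above does not state.

```python
from typing import Dict, Iterable


def apply_group_prefix(names: Iterable[str], prefix: str = "") -> Dict[str, str]:
    """Map each loaded sequence name to the name shown in the display.

    - Blank prefix: every sequence is loaded and names are unchanged.
    - Non-blank prefix: only names starting with it are loaded, and the
      prefix is stripped (assumption: the comparison is case-sensitive).
    """
    loaded: Dict[str, str] = {}
    for name in names:
        if not prefix:
            loaded[name] = name
        elif name.startswith(prefix):
            loaded[name] = name[len(prefix):]
    return loaded


if __name__ == "__main__":
    fasta_names = ["Chr01", "Chr02", "Scaf2345"]
    print(apply_group_prefix(fasta_names, "Chr"))
    print(apply_group_prefix(fasta_names))
```

In this sketch, entering "Chr" drops Scaf2345 entirely, while a blank prefix keeps all three names at full length.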

Minimum length
This must be an integer; commas are allowed (e.g. 1,000,000).
This is the minimum length of the FASTA sequence that will be loaded; smaller sequences will be ignored. Note that annotations for ignored sequences will also be ignored, but some warning messages will be printed to the terminal. See xToSymap Length for help with this.
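
A minimal Python sketch of how a Minimum length value such as 1,000,000 could be interpreted and applied. The function names are hypothetical, and the sketch assumes that a sequence whose length equals the minimum is kept.

```python
from typing import Dict


def parse_min_length(text: str) -> int:
    """Parse an integer that may contain commas, e.g. '1,000,000'."""
    return int(text.replace(",", "").strip())


def filter_by_length(lengths: Dict[str, int], min_length_text: str) -> Dict[str, int]:
    """Keep sequences at least as long as the minimum; smaller ones are ignored."""
    minimum = parse_min_length(min_length_text)
    return {name: n for name, n in lengths.items() if n >= minimum}


if __name__ == "__main__":
    seq_lengths = {"chr1": 30_427_671, "scaf9": 12_500}
    print(filter_by_length(seq_lengths, "1,000,000"))
```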

Sequence files
Select the input FASTA sequence file(s), or directories of sequence files. For formatting, see Sequence files. Default location: data/seq/<project-name>/sequence

If any of the above 3 parameters are changed:
▸ If the project has already been loaded, Reload Project.
▸ If A&S has previously been run, select Clear pair and remove the alignment files, then run A&S.

2.d Load annotation

Anno keywords
A comma separated list of keywords. This can be used to reduce the annotation attribute keywords shown in the 2D display and Queries table. See the GFF Attributes section below.

Anno files
Select the input GFF3-formatted annotation files corresponding to your sequences. Note, using a GFF3 file directly can cause problems if it does not conform to what SyMAP expects; see Annotation files. Annotation is optional but highly recommended. Default location: data/seq/<project-name>/annotation

If either of the above are changed:
▸ If the annotation has already been loaded, Reload Annotation.
▸ If A&S has been previously run, re-run A&S (the existing alignment files will be reused).

2.e Alignment&Synteny

Mask non-genes
If set to 'Yes', mask out all non-genic parts of the sequences before running MUMmer (gene annotation must be provided). This can save time for alignment (MUMmer) but prevents non-annotated anchors from being found.

▸ If this is changed after A&S, the alignment files need to be removed with Clear pair and A&S run again.

Order against
For draft contig sets, this allows you to order them using synteny to one of the other projects. See Ordering details.

▸ If the draft has already been aligned to the Order against sequence but this parameter was not set, you can set it and then run A&S; the existing alignment files will be used.

2.f Rules for saving project parameters

Before the project is loaded, the parameters are saved to the file
  data/seq/<project-name>/params
The params file parameters are shown on the Project Parameters window. These can only be viewed and changed using symap (not viewSymap).

When the project is loaded, all parameters are saved to the database except the A&S parameters (Mask non-genes and Order against). When A&S is executed, the corresponding parameters are saved to the database.

Any parameter that differs from its default will be shown on the View popup.

2.g GFF Attributes

This section gives details on what GFF attributes are displayed in SyMAP, which refers to them as annotations.

The gene annotation is shown on the 2D display and as columns in the Queries results table. The attributes (annotations) come from the last column of the GFF file. The attributes are a keyword=value list, e.g.

   ID=gene-AT1G01010;Name=NAC001;ID=rna-NM_099983.2;product=NAC domain containing protein 1
Defaults: Generally, all genes in a file have the same keywords, in which case, use the defaults. This will cause the entire attribute to be shown for the gene in the 2D display, and the Queries table will have columns for each keyword that has over 50 occurrences. In the example above, the columns will be ID, Name and product (the second ID will be ignored).

If there are many different keywords in the attribute list, this causes too many columns in the Queries table. This can be reduced by one (or both) of the following:

Anno key count: If there are many different keywords in the attribute list, set this count N to filter out all keywords with <N occurrences. The Anno Key Count can be modified at any time using symap (not viewSymap).

Anno keywords: The keyword=value pairs to be saved for each gene can be limited by listing the desired keywords separated by commas. This approach also shortens the annotation description shown per gene in the 2D display. For example, if the string "ID, product" was entered, the Name=value pair would not be part of any gene annotation. This must be set before Load Project is executed.
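
The interaction of Anno key count and Anno keywords can be sketched in Python as follows. This is an illustration, not SyMAP's code: the function names are hypothetical, and it assumes a keyword is kept when it has at least N occurrences and that the first occurrence of a repeated keyword wins (matching the "second ID will be ignored" note above).

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional


def parse_attributes(attr: str) -> Dict[str, str]:
    """Split a GFF attribute column into keyword=value pairs."""
    pairs: Dict[str, str] = {}
    for field in attr.strip().split(";"):
        if "=" not in field:
            continue
        key, value = field.split("=", 1)
        key = key.strip()
        if key not in pairs:  # a repeated keyword is ignored
            pairs[key] = value.strip()
    return pairs


def query_columns(
    attrs: Iterable[str],
    key_count: int = 50,
    keywords: Optional[List[str]] = None,
) -> List[str]:
    """Return keywords that would become columns in the Queries table.

    `keywords` plays the role of Anno keywords (None means keep all);
    `key_count` plays the role of Anno key count.
    """
    counts: Counter = Counter()
    for attr in attrs:
        for key in parse_attributes(attr):
            if keywords is None or key in keywords:
                counts[key] += 1
    return [key for key, n in counts.items() if n >= key_count]


if __name__ == "__main__":
    example = ("ID=gene-AT1G01010;Name=NAC001;ID=rna-NM_099983.2;"
               "product=NAC domain containing protein 1")
    print(parse_attributes(example))
    print(query_columns([example] * 60))
    print(query_columns([example] * 60, keywords=["ID", "product"]))
```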


3. Pair Parameters

Sections:     Overview Alignment Cluster Hits Synteny Save MUMmer parameters

3.a Overview

Select a pair cell in the Available Syntenies table followed by the Parameters button, which will pop up the pair parameters window.

If a pair cell shows ✔ or A and the parameters are changed, do the following ('Section' refers to the section in the parameter panel):

Section: Alignment
  Action: Select Clear Pair to remove the existing alignments, then Selected Pair (Redo).
Section: Cluster Hits
  Action: Use Selected Pair (Redo) directly; the existing alignments for the pair will be used.
Section: Synteny
  Action: Use Selected Pair (Redo) directly; the existing alignments for the pair will be used.

 

The parameters are described in the following three sections. The most important parameter is for Algo2; the Intergenic parameter must be increased for conserved genomes.

3.b Alignment

The default MUMmer parameters appear to work well with SyMAP, so they probably do not need changing.

Parameter: PROmer Args [1]
  Description: Arguments for PROmer. See MUMmer parameters.
  Default: -
Parameter: NUCmer Args [1]
  Description: Arguments for NUCmer. See MUMmer parameters.
  Default: -
Parameter: Self Args [2]
  Description: Arguments to use when aligning a chromosome to itself.
  Default: -
Parameter: PROmer Only [3]
  Description: Use PROmer for all alignments.
  Default: Off
Parameter: NUCmer Only [3]
  Description: Use NUCmer for all alignments.
  Default: Off

[1] BEWARE: Entered PROmer and NUCmer arguments are NOT checked for correctness.
[2] When self-alignment is performed, standard arguments are used when comparing different chromosomes. However, additional arguments may be desired when a chromosome sequence is run against itself, e.g. --nosimplify.
[3] By default, PROmer is used for alignments between different projects, while NUCmer is used for self alignments.

3.c Cluster Hits

Algorithm 1 (modified original, updated v5.4.0, abbreviated Algo1):
Pros: This is a generic algorithm that has knowledge of genes versus intergenic hits. It is recommended for ordering sequence contigs and when there is little or no gene annotation. It must be used for self-synteny. It has been used on hundreds of genome comparisons.
Cons: It does not distinguish between exon and intron hits. It is more likely to miss good homologous gene pairs.
Parameters: It does not need parameter adjustment. However, this gives no control over what hits are filtered.

Algorithm 2 (exon-intron, last update v5.4.8, abbreviated Algo2):
Pros: This is a new algorithm that has explicit knowledge of gene pairs and their exon-intron structure, and shows all gene pairs with hits unless filtered by the parameters (and a few internal filters). It has better filtering for small intergenic and intron-only hits.
Cons: It does not perform self-synteny or take NUCmer files as input. It does not work when a given chromosome is split over multiple MUMmer files (this will not happen when SyMAP generates the MUMmer files).
Parameters: The parameters generally will need adjusting. However, this gives more control over what gene pairs are shown and can suppress the minor intergenic hits. See the Hints below the parameter explanation.

Algorithm 1 is the default for now; this is because Algorithm 2 generally requires some experimenting with the parameters whereas Algorithm 1 does not.

Algorithm 1 (original). Default: Yes

Parameter: Top N piles
  Description: Retains the top N hits of a pile of overlapping hits (Pile of Hits), as well as all hits with a score of at least 80% of the Nth hit.
  Default: 2

Algorithm 2 (gene-centric). Default: No

Categories:
  EE  Exon-Exon
  EI  Exon-Intron
  En  Exon-intergenic
  II  Intron-Intron
  In  Intron-intergenic
  nn  intergenic-intergenic

Parameter: Exon
  Description: If the hit is EE, remove all hits that have fewer than N aligned bases (EQ 1).
  Default: N=100
Parameter: Intron
  Description: If the hit is EI, II, En or In, remove all hits that have fewer than N aligned bases (EQ 1).
  Default: N=300
Parameter: Intergenic
  Description: If the hit is nn, remove all hits that have fewer than N aligned bases (EQ 1). Increase this parameter for conserved genomes.
  Default: N=600
Parameter: Keep piles
  Options: EE, EI, En, II, In
  Description:
  • This ONLY applies when there is a pile of overlapping hits (Pile of Hits); it tells the algorithm what type of cluster hits to retain if they are in a pile.
  • Hits are filtered before pile analysis.
  • Intergenic-intergenic pile hits, and any unchecked categories, are filtered as described for Top N piles below.
  Default: EE, EI, En
Parameter: Top N piles
  Description: Algo2 uses the Algo1 Top N parameter for any unchecked categories, but in a more conservative way. It retains the top N hits of a piled region that are within 80% of the length of the longest hit.
  Default: 2

EQ 1. Minimum matched bases = hit-length*identity, where hit-length is the maximum of the query and target lengths reported by MUMmer, and identity is the percent identity reported by MUMmer. For clustered hits, EQ 1 is applied to the summed lengths and identity.
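
As a worked illustration of EQ 1 and the Algorithm 2 length filters, here is a minimal Python sketch for a single (non-clustered) hit. The names MIN_BASES, matched_bases and keep_hit are hypothetical, the thresholds are the defaults listed above, and the sketch assumes the percent identity is expressed on a 0-100 scale.

```python
# Default thresholds from the parameter descriptions above, keyed by hit category.
MIN_BASES = {
    "EE": 100,                                     # Exon parameter
    "EI": 300, "II": 300, "En": 300, "In": 300,    # Intron parameter
    "nn": 600,                                     # Intergenic parameter
}


def matched_bases(query_len: int, target_len: int, pct_identity: float) -> float:
    """EQ 1: hit-length * identity, with hit-length = max(query, target).

    Assumption: pct_identity is a percentage (0-100), so it is divided by 100.
    """
    hit_length = max(query_len, target_len)
    return hit_length * pct_identity / 100.0


def keep_hit(category: str, query_len: int, target_len: int,
             pct_identity: float, thresholds=MIN_BASES) -> bool:
    """A hit is removed when it has fewer than N matched bases for its category."""
    return matched_bases(query_len, target_len, pct_identity) >= thresholds[category]


if __name__ == "__main__":
    print(matched_bases(120, 110, 90))
    print(keep_hit("EE", 120, 110, 90))
    print(keep_hit("nn", 500, 480, 95))
```

Raising the "nn" entry in the thresholds (for example above 1000) corresponds to increasing the Intergenic parameter for conserved genomes.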

Hint for Algorithm 1: Increasing the Top N parameter can cause too many hits and reduce synteny. Decreasing it can remove more gene-pair hits. Hence, try Algorithm 2 if you want more gene pairs.
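
The Algorithm 1 Top N rule can be sketched in Python as below. The function name top_n_pile is hypothetical, and what SyMAP uses as the hit "score" is not specified here, so the scores are treated as plain numbers.

```python
from typing import List, Sequence


def top_n_pile(scores: Sequence[float], n: int = 2, fraction: float = 0.8) -> List[float]:
    """Keep the top N scores of a pile, plus any score that is at least
    `fraction` (80% by default) of the Nth best score."""
    ranked = sorted(scores, reverse=True)
    if len(ranked) <= n:
        return ranked
    cutoff = fraction * ranked[n - 1]
    return [s for i, s in enumerate(ranked) if i < n or s >= cutoff]


if __name__ == "__main__":
    pile = [100, 90, 80, 50, 60]
    print(top_n_pile(pile))        # default Top N = 2
    print(top_n_pile(pile, n=1))   # a stricter setting
```

In this sketch a larger N both keeps more hits directly and lowers the cutoff, which is consistent with the hint that increasing Top N can produce too many hits.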

Hint for Algorithm 2: On the output to the terminal, if any chromosome pair shows over 10,000 hits, the parameters probably need to be made more stringent; too many hits (and piles) confuse the synteny algorithm, which results in synteny blocks not being found and in very long execution times. For highly similar genomes, it is necessary to increase the Intergenic parameter, e.g. >1000; you may also need to increase the other parameters and uncheck EI and En to reduce hits. For distant genomes, decrease the parameters. Suggestion: for large genomes, experiment with the parameters on just one pair of chromosomes.

Piles of hits: The image below shows a pile of hits on the left (Cabb Chr02) that link to repetitive genes on the right (Arab Chr05). These are important to keep.

The right image shows a pile of hits in an intergenic region (Cabbage Chr03) to multiple other regions (B.rapa Chr01). There are MANY occurrences of repeats like this in the MUMmer file, which is why these piles must be filtered; if they are not, the synteny algorithm does not perform well.
Pile to repetitive gene

Pile to intergenic

Running with ./symap -s provides additional output for both algorithms.

Wrong orientation

It can happen that all hits in a cluster have the same strand (+/+ or -/-), yet the cluster aligns to a positive and a negative gene (or vice versa). By default, Algo2 includes these hits and writes the count to the terminal, e.g.
6,736 Both Genes - cluster strands differ from gene (Multi 1,425, Single 5,311)
If you would like to have them excluded or printed, use one of the following.
./symap -wse # exclude
./symap -wsp # print to terminal
./symap -wsp >ws.log   # print to file ws.log

You can view them in the Queries results table, where the Hit St column will be "=" but the gene Gst values are not equal, or the other way around.
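
The "wrong orientation" condition can be expressed as a small Python check. The function names and the "+/+" style strand strings are assumptions made for illustration; they are not SyMAP's internal representation.

```python
def same_strand(pair: str) -> bool:
    """True for '+/+' or '-/-', False for '+/-' or '-/+'."""
    first, second = pair.split("/")
    return first == second


def orientation_agrees(hit_strands: str, gene_strands: str) -> bool:
    """The cluster and its two genes agree when both pairs are same-strand
    or both are opposite-strand; otherwise the cluster is 'wrong orientation'."""
    return same_strand(hit_strands) == same_strand(gene_strands)


if __name__ == "__main__":
    print(orientation_agrees("+/+", "+/-"))  # hits same strand, genes differ
    print(orientation_agrees("-/-", "+/+"))  # both pairs are same-strand
```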

3.d Synteny

Parameter: Min Dots
  Description: Minimum number of anchors required to define a synteny block.
  Default: 7
Parameter: Merge Blocks
  Description: Merge overlapping (or nearby) synteny blocks into larger blocks.
  Default: Off

Selecting Merge Blocks may be beneficial when there are many small blocks. The example below shows two images: the first does not have merged blocks and the second does (the blue dots are hits that belong to the block).


3.e Rules for saving pair parameters

Before the A&S is executed, the parameters are saved in
  data/seq_results/<proj1-to-proj2>/params
Once the A&S is executed, the parameters are stored in the database.

The file parameters are shown on the pair Parameter window. These can only be viewed and changed using symap (not viewSymap).

Any parameter that differs from its default will be shown on the Summary page.

BEWARE: If you run A&S, then change the PROmer or NUCmer settings, but forget to Clear Pair before running A&S again, the parameters on the Summary page will be wrong (SyMAP does not check for this situation).

3.f MUMmer parameters

The arguments for MUMmer are NOT checked for correctness.

To see the parameters for the default MUMmer V3 on macOS, from the symap directory:

./ext/mummer/mac/mummer -h
./ext/mummer/mac/promer -h
./ext/mummer/mac/nucmer -h
To see the parameters for the default MUMmer V3 on Linux:
./ext/mummer/linux/mummer -h
./ext/mummer/linux/promer -h
./ext/mummer/linux/nucmer -h
If you compiled V4 in the /ext directory:
./ext/mummer4/m4/bin/mummer -h
./ext/mummer4/m4/bin/promer -h
./ext/mummer4/m4/bin/nucmer -h
For a detailed discussion of MUMmer running in SyMAP, see MUMmer.
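
For scripting, the three directory layouts listed above can be selected with a short Python helper. This is only a convenience sketch: the function name mummer_dir is hypothetical, the paths are the ones shown above relative to the symap directory, and the V4 path applies only if you compiled V4 in that location.

```python
import platform
from typing import Optional


def mummer_dir(system: Optional[str] = None, use_v4: bool = False) -> str:
    """Return the MUMmer directory, relative to the symap directory.

    `system` defaults to platform.system(), which reports 'Darwin' on
    macOS and 'Linux' on Linux.
    """
    if use_v4:
        return "./ext/mummer4/m4/bin"
    system = system or platform.system()
    subdirs = {"Darwin": "mac", "Linux": "linux"}
    if system not in subdirs:
        raise ValueError(f"No bundled MUMmer directory listed for {system!r}")
    return f"./ext/mummer/{subdirs[system]}"


if __name__ == "__main__":
    # Print the help commands for the three MUMmer programs named above.
    for tool in ("mummer", "promer", "nucmer"):
        print(f"{mummer_dir('Linux')}/{tool} -h")
```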

After running MUMmer, all alignment files are removed except the ".mum" file; to prevent removal,
execute ./symap -mum.


Email Comments To: symap@agcol.arizona.edu