SyMAP System Parameters

The original SyMAP was written for diverse plant genomes with short introns, but has been modified to work for the long introns of mammalian genomes, and less diverse genomes.

This document provides details of the parameters and functions.

Start SyMAP

To start SyMAP, type at the command line: ./symap

To view the command line options: ./symap -h

For the first time user of SyMAP, see:

System Guide for setup and system requirements.
Demo to run the demo (do not skip this step!).
Create a new project.

Project Manager

1. Build database

1.a Selected

The projects selected from the Projects in the left panel are listed in the Selected section on the right. The possible functions vary with the state, as listed below:

♦ If there is any project not loaded to the database, you will see:
Load All Projects	Load all projects that have not been loaded yet.
♦ If a project is not loaded to the database, you will see:
Remove from disk	Only: Remove alignment directories from disk. All: Remove alignment and project directories from disk. Removing alignments deletes this project's alignments from `data/seq_results`; you will be prompted to confirm each one. If there are no alignments, you will only see the prompt to remove the project directory. Removing the project directory deletes `data/seq/<project-name>` from disk; you will be prompted to confirm. Once it is removed, the project will no longer be shown on the left.
Load project	Loads the sequence and optional annotations to the database. After loading results, always verify them by selecting the View link, which provides a summary of what has been loaded.
Parameters	This brings up a panel of parameters, see Project Parameters. After the project is loaded, you can still change the Display parameters.
♦ If a project is loaded into the database, you will see:
Remove from database	The project and its synteny pairs will be removed from the database, but the files stay on disk.
Reload project	Only: Reload the project only. All: Reload the project and remove its alignments from disk. If All is selected, it prompts for each alignment directory for this project before removing it. It removes the alignments and final results, but leaves the `params.txt` file. You will need to remove the alignment(s) if (1) there is a change in sequence, or (2) there is a change in the Minimum length parameter; see Load project parameters. For either option, it executes Remove from database followed by Load project.
Reload annotation	Removes the annotation from the database, then loads the annotation again. This does not affect the alignments, so they do not have to be redone. The Alignment&Synteny commands will recognize if there is an existing alignment and will only perform the synteny computation.
Parameters	This brings up the Project Parameters panel.

For any action that will remove the project or alignments from disk, a popup will appear to confirm that you want this done. If multiple alignment directories are to be removed, it will prompt for each one.

1.b Available Syntenies

Sequence alignments are performed with MUMmer3, but can be changed to use MUMmer4 (see SyMAP MUMmer).

This section shows a table with the status of alignments between the selected loaded projects. Each cell in the table represents a pair of projects and the cell contains a status code showing whether or not that pair has been aligned (codes are listed below). Note that the table shows each pair cell twice, but only the lower cells are activated.

Clicking on a cell selects that pair of projects (the cell will be highlighted in green), and the buttons that can be selected are activated.

Code	Description
✔	Synteny for this project pair is ready to view.
A	The MUMmer alignment has been performed but the synteny computation has not been run. This status occurs if a pair is completed but then annotations are re-loaded for one of the projects, or if the MUMmer files have been added by the user.
?	The alignment has not been completed. In this case, select Selected Pair (Redo) and the alignments will be completed, followed by the synteny algorithm.
(blank)	The alignment has not been started.

See Pair Parameters for additional information on the Available Syntenies and codes.

1.c Function Buttons

Alignment&Synteny (A&S)
All Pairs	Run (or complete) the synteny computation for all pairs in the Available Syntenies table. However, this will not run 'self-syntenies'; those need to be done individually by selecting the pair.
Selected Pair / Selected Pair (Redo)	Run (or complete) the A&S computation for the selected project pair. If the pair is already complete, the button label changes to Redo, and only the synteny computation will be rerun. If you wish to rerun the MUMmer alignment, first use the Clear Pair function.
Clear Pair	Only: Remove synteny from the database. All: Remove synteny from the database and alignments from disk for this pair. If you have changed the Alignment parameters, or loaded new sequence for one of the projects, you need to have the alignments removed and redone; otherwise, you can just remove the synteny from the database (this step can be skipped by selecting Selected Pair (Redo)).
Parameters	Set the pair parameters for the selected pair cell.

For the remaining display buttons, see User Guide.

1.d CPU and Verbose

CPUs: Enter the maximum number of CPUs to use for an alignment. SyMAP will use up to that number (but may use fewer, depending on the number of sequences being aligned). Alternatively, the number of CPUs may be entered in the symap.config file or entered as a command line argument (-p N).

Verbose checkbox:

If checked, detailed summary information is written as it processes the MUMmer files. The information is written both to the terminal and the logs/<proj-to-proj>/symap.log file.
If this is not checked, it will write status information repeatedly on the same terminal line.
See Demo examples.
The Verbose setting is not saved as a parameter. It can also be turned on using a command line argument: ./symap -v

2. Project Parameters

2.a Parameter panel

Click the Parameters link for a project to open the parameters panel shown on the lower right.

Make sure these two parameters are correct before running the alignment: Minimum length, Sequence files. See Load project.

Sequence and Anno (annotation) files:
If the files are put in the default locations, nothing needs to be entered for them. Default locations:

  data/seq/<project-name>/sequence
  data/seq/<project-name>/annotation

See input for a description of the input files.

2.b Display

New values take immediate effect on Save. Most of the values are shown in the Selected section, as shown on the right. The values are saved in the symap_5/data/seq/<project-name>/params.txt file.

Parameter	Description	Default Value	Shown
Category	Category label for the project. This is only used to group projects on the left side of the Manager panel. Category labels must be composed of only letters, numbers, dash, underscore, or period. Either select an existing label from the drop-down or enter a new one in the text box. Do NOT enter the same label with different capitalization, as this can cause problems.	Uncategorized	Selected
Display name	A user-friendly name for the project. Shorter names will work better in the displays. Names must be composed of only letters, numbers, dash, underscore, or period. It must be unique (ignoring case) among all Display names and project-names.	project-name	Selected and all displays.
Abbreviation	A name that is exactly 4 characters. Names must be composed of only letters, numbers, dash, underscore, or period. It does not need to be unique among other Abbreviations. It can be the same as the corresponding Display name or project-name if they are only 4 characters.	Last 4 characters of Display name.	Queries column headings, and other places where a guaranteed short name is needed.
Description	Description of the project. Do NOT use quotes, backslash or #.	New project	Selected and View
Group type	How to refer to the sequences.	Chromosome	Selected
Anno key count	This applies to the annotation attributes columns shown in the Queries results table. See the GFF Attributes section below.	50	--
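
The naming rules in the table above can be checked before entering values in the panel. The following is only a sketch of those rules as stated here, not SyMAP's own validation code; the function names are hypothetical.

```python
import re

# Allowed characters per the table above: letters, numbers, dash, underscore, period.
_ALLOWED = re.compile(r"[A-Za-z0-9._-]+")


def is_valid_name(name):
    """True if the name uses only the permitted characters (and is not empty)."""
    return bool(_ALLOWED.fullmatch(name))


def is_valid_abbreviation(abbrev):
    """An Abbreviation must be exactly 4 permitted characters."""
    return len(abbrev) == 4 and is_valid_name(abbrev)


def default_abbreviation(display_name):
    """Default Abbreviation: the last 4 characters of the Display name."""
    return display_name[-4:]


def is_unique_display_name(candidate, existing):
    """Display names must be unique, ignoring case, among the existing names."""
    lowered = {name.lower() for name in existing}
    return candidate.lower() not in lowered
```

For example, `is_unique_display_name("arab", ["Arab", "Cabb"])` is false because the comparison ignores case.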

2.c Load project

The following parameters are under the Load project section of the parameters panel.

Group prefix
The term "Group" is used for any FASTA sequence type, e.g. chromosome, scaffold, contig. This option sounds trivial but is important for a good display, so please read the following carefully.

1.	When a Group prefix is entered, it allows SyMAP to remove the prefix from the chromosome names and use the remaining part as a shorter name (e.g. '1' instead of 'chr1', as shown on the lower right).
2.	If the sequence file has a mix of prefixes (e.g. Chr01, Scaf2345): If a Group prefix is entered, only sequences with that prefix will be loaded. Leaving the Group prefix blank will load all sequences; their prefixes will not be removed (e.g. the 'chr3' in the image on the right).
3.	Long names really clutter the display; it is best to use a consistent prefix that SyMAP can remove, otherwise, use really short prefixes, e.g. 's' for scaffold, 'c' for chromosome.
4.	You may remove the prefix after the project is loaded. For example, if your sequences had names "Scaf1", "Scaf2", etc, and "Scaf" was NOT entered as the Group prefix before load, you may enter it later and it will be removed from all sequence names. However, this is not reversible and a prefix cannot be added to the sequence names.

This parameter is finicky; after loading a project, select View for a popup of the input, and check the output to the terminal to make sure the annotation was loaded correctly. Also, see xToSymap as it may help you create the files with good prefixes.
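
To make the Group prefix rules concrete, here is a small sketch of the behaviour described above. It is not SyMAP's code: the function name is hypothetical, and it assumes the prefix is matched as a case-sensitive "starts with" test, which may differ from what SyMAP actually does.

```python
def apply_group_prefix(names, prefix=""):
    """Return {original_name: display_name} for the sequences that would load.

    Assumption (hypothetical): the prefix match is a case-sensitive
    "starts with" test.
    """
    if not prefix:
        # Blank prefix: every sequence loads and names are left unchanged.
        return {name: name for name in names}
    # Prefix given: only matching sequences load, and the prefix is removed.
    return {
        name: name[len(prefix):]
        for name in names
        if name.startswith(prefix)
    }
```

With the names Chr01, Chr02 and Scaf2345, entering "Chr" would load only the two chromosomes and show them as 01 and 02; leaving the prefix blank would load all three with their full names.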

Minimum length
This must be an integer, commas are allowed (e.g. 1,000,000).
This is the minimum length of the FASTA sequence that will be loaded; smaller sequences will be ignored. Note that annotations for ignored sequences will also be ignored, but some warning messages will be printed to the terminal. See xToSymap Length for help with setting this parameter.
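
As an illustration of how Minimum length acts, the sketch below parses a value that may contain commas and drops the sequences that are too short. The function names and the dictionary of sequence lengths are hypothetical; SyMAP's own loader is not shown here.

```python
def parse_min_length(text):
    """Parse an integer that may contain commas, e.g. "1,000,000"."""
    return int(text.replace(",", "").strip())


def filter_by_length(seq_lengths, min_length):
    """Keep sequences whose length is at least min_length; smaller ones are ignored."""
    return {name: n for name, n in seq_lengths.items() if n >= min_length}
```

For example, with a Minimum length of "1,000,000", a 25 Mb chromosome is kept and a 40 kb scaffold is ignored, along with any annotation on that scaffold.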

Sequence files
Select the input FASTA sequence file(s), or directories of sequence files. For formatting, see Sequence files. Default location: data/seq/<project-name>/sequence

If either the Sequence files or Minimum length parameter is changed:
▸ If the project has already been loaded, Reload Project.
▸ If A&S has previously been run, select Clear Pair and remove the alignment files, then run A&S.

2.d Load annotation

The following parameters are under the Load annotation section of the parameters panel.

Anno keywords
A comma separated list of keywords. This can be used to reduce the annotation attribute keywords shown in the 2D display and Queries table. See the GFF Attributes section below.

Anno files
Select the input GFF3-formatted annotation files corresponding to your sequences. Note, using a GFF3 file directly can cause problems if it does not conform to what SyMAP expects; see Annotation files. Annotation is optional but highly recommended. Default location: data/seq/<project-name>/annotation

If either of the above is changed:
▸ If the annotation has already been loaded, Reload Annotation.
▸ If A&S has been previously run, re-run A&S (the existing alignments files will be reused).

2.e GFF Attributes

This section gives details on what GFF attributes are displayed in SyMAP, which refers to them as annotations.

The gene annotation is shown on the 2D display and as columns in the Queries results table. The attributes (annotations) come from the last column of the GFF file. The attributes are a keyword=value list, e.g.

   ID=gene-AT1G01010;Name=NAC001;ID=rna-NM_099983.2;product=NAC domain containing protein 1

Defaults: Generally, all genes in a file have the same keywords, in which case, use the defaults. This will cause the entire attribute to be shown for the gene in the 2D display, and the Queries table will have columns for each keyword that has over 50 occurrences. In the example above, the columns will be ID, Name and product (the second ID will be ignored).

If there are many different keywords in the attribute list, this causes too many columns in the Queries table. This can be reduced by one (or both) of the following:

Anno key count: If there are many different keywords in the attribute list, set this count N to filter out all keywords with <N occurrences. The Anno key count can be modified at any time using symap (not viewSymap).

Anno keywords: The keyword=value pairs to be saved for each gene can be limited by listing the desired keywords separated by commas. This approach also reduces the annotation description per gene in the 2D display. Referring to the example above, if the string "ID, product" was entered, the Name=value would not be part of any gene annotation. This must be set before Load Annotations is executed.
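
The two filters can be pictured with the following sketch. It is an illustration of the rules described in this section, not SyMAP's parser; the function names are hypothetical, and it assumes a keyword is kept as a column when it has at least N occurrences (the text above says both "over 50" and "<N" are filtered, so the exact boundary is an assumption).

```python
from collections import Counter


def parse_attributes(attr_column, keep=None):
    """Parse a GFF attribute column into a dict of keyword -> value.

    A repeated keyword keeps its first value (the second ID in the example
    above is ignored). If `keep` is given, only those keywords are saved,
    mirroring the Anno keywords parameter.
    """
    attrs = {}
    for field in attr_column.strip().split(";"):
        if "=" not in field:
            continue
        key, value = field.split("=", 1)
        key = key.strip()
        if key in attrs:
            continue  # later duplicates of a keyword are ignored
        if keep is not None and key not in keep:
            continue  # not listed in Anno keywords
        attrs[key] = value.strip()
    return attrs


def query_columns(gene_attrs, min_count=50):
    """Keywords that would become Queries columns under Anno key count.

    Assumption: a keyword is kept when it occurs at least min_count times.
    """
    counts = Counter(key for attrs in gene_attrs for key in attrs)
    return sorted(k for k, c in counts.items() if c >= min_count)
```

Parsing the example line with `keep={"ID", "product"}` saves only the first ID and the product, so Name would not appear for that gene.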

2.f Rules for saving project parameters

Before the project is loaded, the parameters are saved to the file

  data/seq/<project-name>/params.txt

The params.txt file parameters are shown on the Project Parameters panel. These can only be viewed and changed using symap (not viewSymap).

When the project is loaded, all parameters are saved to the database except the A&S parameters (Mask-genes and Order-against). When the A&S is executed, then the corresponding parameters are saved to the database.

Any parameter that does not have the default value will be shown on the View popup.

3. Pair Parameters

3.a Parameter panel

The Available Syntenies section explains the table in the lower right. The following provides more information in the context of the pair parameters.

The table on the right has cells that have the following completed:

Value	Alignment	Synteny
blank	No	No
A	Yes	No
✔	Yes	Yes

Alignment will not be redone if the cell contains an A. This is important because MUMmer is very time-consuming, but the synteny computation is not (see timing results); hence, one can make changes to the cluster or synteny parameters and re-run without redoing the alignments.

Select a pair cell in the Available Syntenies table followed by the Parameters button, which will pop up the panel shown on the right.

If the pair cell contains a ✔ or A and the parameters for a section are changed, do the following:

Changed Section	Action
Alignment	Select Clear Pair to remove the existing alignments, then Selected Pair (Redo).
Cluster Hits	Use Selected Pair (Redo) directly; the existing alignments for the pair will be used.
Synteny	Use Selected Pair (Redo) directly; the existing alignments for the pair will be used.

The parameters are described in the following 3 sections: Alignment, Cluster hits and Synteny.

3.b Alignment

Preparing the sequences

Each parameter is listed with its description, followed by its default where one is given.

Concat
By default, all sequences of the 1st genome are concatenated into one file, and multiple files are then created from the 2nd genome to be searched against the first. For the 2nd genome, sequential short sequences are put into one file until the file length is >60M.

Concat unchecked: To reduce memory usage, you can uncheck Concat so that multiple files are created for each genome, and all files from the 1st genome are searched against all files from the 2nd genome. For both genomes, sequential short sequences are put into one file until the file length is >60M.

See below for timing differences.

Mask <abbrev>
Mask out all non-genic parts of the sequences before running MUMmer (gene annotation must be provided).

The <abbrev> is set with the Abbreviation parameter in the Project Parameters popup, and the corresponding project's sequence is masked.

If Mask is changed after A&S, the alignment files need to be removed with Clear Pair and A&S run again.

Default: Off

Concat: The following statistics are from comparing Arabidopsis thaliana (119M) against Brassica rapa (297M) on a MacOS using 1 CPU.

	Concatenated	Not concatenated
Hits	48819	48846
Synteny blocks	334	334
Gene hits	46319	46348
Synteny hits	38334	38345
Run time	1 hour 8 minutes	1 hour 35 minutes
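
The rule that sequential short sequences are put into one file until the file length is >60M can be sketched as follows. This is a guess at the grouping logic, not SyMAP's code: the function name is hypothetical, "60M" is assumed to mean 60,000,000 bases, and a file is assumed to be closed as soon as its total length passes that limit.

```python
def batch_sequences(seq_lengths, max_file_len=60_000_000):
    """Group sequences, in input order, into lists representing files.

    seq_lengths is a list of (name, length) pairs. Sequences are appended
    to the current file until its total length exceeds max_file_len; the
    next sequence then starts a new file.
    """
    files, current, current_len = [], [], 0
    for name, length in seq_lengths:
        current.append(name)
        current_len += length
        if current_len > max_file_len:
            files.append(current)
            current, current_len = [], 0
    if current:
        files.append(current)  # remaining sequences form the last file
    return files
```

Under these assumptions, a sequence that is already longer than 60M ends up in a file by itself, while many short scaffolds share a file.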

MUMmer parameters

The default MUMmer parameters seem to work fine with SyMAP, so they probably do not need changing.

Parameter	Description	Default
PROmer Args¹	Arguments for PROmer. See MUMmer parameters.	-
NUCmer Args¹	Arguments for NUCmer. See MUMmer parameters.	-
Self Args²	Arguments to use when aligning a chromosome to itself.	-
PROmer Only³	Use PROmer for all alignments.	Off
NUCmer Only³	Use NUCmer for all alignments.	Off

¹ BEWARE: Entered PROmer and NUCmer arguments are NOT checked for correctness.
² When self-alignment is performed, standard arguments are used when comparing different chromosomes. However, additional arguments may be desired when a chromosome sequence is run against itself, e.g. --nosimplify.
³ By default, PROmer is used for alignments between different projects, while NUCmer is used for self alignments.
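
The default choice of aligner in footnote 3, and the effect of the two "Only" options, can be summarized in a few lines. This is a sketch of the stated behaviour only; the function name is hypothetical, and treating the two "Only" options as mutually exclusive is an assumption.

```python
def choose_aligner(project1, project2, promer_only=False, nucmer_only=False):
    """Return "promer" or "nucmer" for a project pair.

    Assumption (hypothetical): PROmer Only and NUCmer Only cannot both be set.
    """
    if promer_only and nucmer_only:
        raise ValueError("PROmer Only and NUCmer Only cannot both be selected")
    if promer_only:
        return "promer"
    if nucmer_only:
        return "nucmer"
    # Default: NUCmer for a self alignment, PROmer between different projects.
    return "nucmer" if project1 == project2 else "promer"
```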

3.c Cluster Hits

3.c.I Algo1 vs Algo2

Algorithm 1 (modified original, updated v5.4.0, abbreviated Algo1):
Pros	This is a generic algorithm that has knowledge of genes versus intergenic hits. It is recommended for ordering sequence contigs and when there is little or no gene annotation. It must be used for self-synteny. It has been used on 100's of genome comparisons.
Cons	It does not distinguish between exon and intron hits. It is more likely to miss good homologous gene pairs.
Parameters	It only has one parameter, which makes it easier to run, but there is no control over what hits are filtered.
Algorithm 2 (exon-intron, last update v5.6.0, abbreviated Algo2):
Pros	This is a new algorithm with explicit knowledge of gene pairs and their exon-intron structure. When there are good gene annotations for both genomes, this is definitely the superior algorithm.
Cons	It does not perform self-synteny. It does not work when a given chromosome is split over multiple MUMmer files; this will NOT happen when SyMAP generates the MUMmer files. This is also a "Pro" as it takes less memory.
Parameters	It has two sets of parameters, hence, more control over results than Algo1. See Hints below the parameter explanation. (As of v5.6.0, the parameters generally do not need adjusting.)

Algo1 is the default since it works for all inputs.

Wrong strand: The wrong strand is when all hits in a cluster are to the same strand (++/--) yet the cluster aligns to two genes on different strands (+-/-+), or vice versa.

Algo1 does include these hits. You can view them in the Queries, where the Hit St column will be different from the two gene Gst columns.

Algo2 does NOT include these hits. You can request to view the potential hits during the A&S by running it with the "-wsp" flag, i.e. ./symap -wsp. This will only show gene pairs with (1) multiple hits to exons (in one or multiple gene pairs), (2) at least one is not an overlapping gene. It is up to the user to determine what is real.
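
The wrong strand definition can be written as a small check. This is only a sketch of the definition given above, with hypothetical names; it does not reproduce the additional filtering that the "-wsp" output applies.

```python
def orientation(strand_a, strand_b):
    """"same" for ++ or --, "opposite" for +- or -+."""
    return "same" if strand_a == strand_b else "opposite"


def is_wrong_strand(hit_strands, gene_strands):
    """True if the cluster's hits disagree with the orientation of its two genes.

    hit_strands: list of (strand1, strand2) pairs, one per hit in the cluster.
    gene_strands: (strand of gene 1, strand of gene 2).
    """
    hit_orients = {orientation(a, b) for a, b in hit_strands}
    if len(hit_orients) != 1:
        return False  # empty or mixed cluster: not covered by the definition
    return hit_orients.pop() != orientation(*gene_strands)
```

For example, a cluster whose hits are all ++ or -- but whose genes are on + and - would be flagged.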

Hints about parameter settings

Hint for Algo1: Increasing the Top N parameter can cause too many hits and reduce synteny. Decreasing it can remove more gene-pair hits. Hence, try Algo2 if you want more gene pairs.

Hint for Algo2: On the output to the terminal (in Verbose mode), if any chromosome pair shows over 10,000 hits, the parameters probably need to be made more stringent; too many hits confuse the synteny algorithm, which results in synteny blocks not being found; it also results in a very long execution time.

→ Suggestion: For large genomes, experiment with the parameters on just one pair of the chromosomes. (You can use xToSymap for the split.)

→ Using the v5.6.0 algorithm, I have experimented with the datasets (1) human, chimpanzee, mouse (2) Arabidopsis, Brassica rapa, Brassica oleracea. Only B. rapa to B. oleracea needed parameter adjustment: the number of G1 hits was over 200k, which is far more than typical; increasing all parameters by a small amount reduced this to just over 100k.

3.c.II Parameter description

Each parameter is listed with its description, followed by its default where one is given.

Number Pseudo
If selected, the un-annotated ends of hits will be assigned a pseudo number. This is explained below in Pseudo Genes.

Algo1 (original)
Default: Yes

Top N piles
It will retain the top N hits of a pile of overlapping hits (Pile of Hits), as well as all hits with a score of at least 80% of the Nth hit.

Algo2 (gene-centric)

Scale
Increase a scale to filter out more clustered hits; decrease it to filter out fewer.

Gene	Determines the percentage of required gene coverage.
Exon	Determines the percentage of required exon coverage.
Len	For G2 and G1^, the minimum hit size unless the hit completely covers a gene. The default minimum is 300bp (the minimum is N × 300bp).
G0_Len	For G0^, the minimum hit size. The default minimum is 1000bp (the minimum is N × 1000bp). Suggestion: increase for closely related species.
^G2=gene to gene, G1=gene to non-gene, G0=non-gene to non-gene

A clustered hit must pass multiple rules regarding coverage. For example, the Exon coverage parameter may be made more stringent, yet there will still be some low exon scores due to good gene coverage, etc.

Default: N=1.0

Keep piles
EE, EI, En, II, In (E=exon, I=intron, n=non-gene)
• This ONLY applies when there is a pile of overlapping hits (Pile of Hits); it tells the algorithm what type of cluster hits to retain if they are in a pile.
• Hits are filtered before pile analysis.
• Intergenic-intergenic pile hits, and any unchecked categories, are filtered as described under the Algo2 Top N piles entry below.

Default: EE, EI, En

Top N piles
Algo2 uses the Algo1 Top N parameter for any unchecked categories, but in a more conservative way. It will retain the Top N hits of a piled region that are within 80% of the length of the longest hit.
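
The two Top N piles rules can be compared with a small sketch. This is my reading of the descriptions above, not SyMAP's code: the function names are hypothetical, a pile is represented simply as a list of hit scores (Algo1) or hit lengths (Algo2), and "within 80% of the length of the longest hit" is assumed to mean at least 80% of that length.

```python
def top_n_pile_algo1(scores, n, fraction=0.8):
    """Keep the top n hits, plus any further hit scoring at least
    `fraction` of the nth hit's score. Assumes n >= 1."""
    ranked = sorted(scores, reverse=True)
    if len(ranked) <= n:
        return ranked
    cutoff = fraction * ranked[n - 1]
    return ranked[:n] + [s for s in ranked[n:] if s >= cutoff]


def top_n_pile_algo2(lengths, n, fraction=0.8):
    """More conservative variant: keep at most n hits, and only those whose
    length is at least `fraction` of the longest hit (assumed reading)."""
    ranked = sorted(lengths, reverse=True)
    if not ranked:
        return []
    cutoff = fraction * ranked[0]
    return [x for x in ranked[:n] if x >= cutoff]
```

With scores 100, 90, 75, 70, 40 and N=2, the Algo1 sketch keeps 100, 90 and also 75 (which is at least 80% of 90); the Algo2 sketch never keeps more than N hits.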

3.c.III Pseudos and Piles

Pseudo genes

The end of a hit may not overlap an annotated gene; by default, this will just show a Gene# of 'N.~' where N is the chromosome number.

If Number Pseudo is selected, a pseudo Gene# will be created. The counts start after the annotated gene numbers and are suffixed by "~". For example, if the last Gene# for Chr03 is 5550 (e.g. 3.5550.), the first pseudo gene number will be 6000 (e.g. 3.6000.~).

If A&S was run without numbered pseudos, you can add them later by executing: ./symap -pseudo; the Selected Pair button will be replaced with Pseudo Only, which will just do the numbering (the Number Pseudo will need to be selected, but it will remind you). This cannot be undone; you would need to re-run A&S with the Number Pseudo unchecked to remove them.

Pros	If you would like the Queries Cluster and Report to include un-annotated hits. If you are exploring new candidate genes, numbered pseudos are easier to track. If your genome is not annotated, numbered pseudos are easier to track.
Cons	The Queries results can be easier to view with the 'N.~' as it is more distinct from a real Gene#.

If comparing more than 2 species, it makes the most sense to have them all numbered or not numbered (though a mix will work).

Piles of Hits

The image below shows a pile of hits on the left (Cabb Chr02) that link to repetitive genes on the right (Arab Chr05). These are important to keep.

The right image shows a pile of hits in an intergenic region (Cabbage Chr03) to multiple other regions (B.rapa Chr01). There are MANY occurrences of repeats like this in the MUMmer file, which is why these piles must be filtered; if they are not, the synteny algorithm does not perform well.

3.d Synteny

Parameter	Description	Default
Min Hits	Minimum number of anchors required to define a synteny block.	7
Same orient¹	All hits in a block must be in the same orientation ('+/+' or '-/-') or all in different orientation ('+/-' or '-/+').	Off
Merge²	Merge overlapping (or nearby) synteny blocks into larger blocks.	Off

¹ Same orient uses the same algorithm as the original except that it evaluates the hits for one orientation, then the other. It also adds a constraint that the correlation must be positive for blocks of hits in the same orientation, and negative for opposite orientation.

²Merge may be beneficial when there are many small blocks. Below shows an example where the first image does not have merged blocks and the second does (the blue dots are hits that belong to the block).

The following three images show the same region when evaluated with the following three parameter sets: Default (one block), Same orient (three blocks), and Same orient with Merge (two blocks). In the last image, the reverse orientation block is embedded in another block.
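
The Min Hits and Same orient constraints can be illustrated with a simple check on a candidate block. This is a sketch only: the function names are hypothetical, the hits are reduced to (position on sequence 1, position on sequence 2) pairs, and Pearson correlation is assumed as the correlation measure, which the text above does not specify.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def passes_same_orient(hits, same_orientation, min_hits=7):
    """Check a candidate block against Min Hits and the correlation sign.

    hits: list of (pos1, pos2) anchors, all of one orientation class.
    same_orientation: True for '+/+' or '-/-' hits, False for '+/-' or '-/+'.
    """
    if len(hits) < min_hits:
        return False
    r = pearson([h[0] for h in hits], [h[1] for h in hits])
    # Positive correlation is required for same orientation, negative otherwise.
    return r > 0 if same_orientation else r < 0
```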

Order against

Draft sequences may be ordered against another project. See Ordering details.

The Seq1->Seq2 and Seq2->Seq1 use the Abbreviation set in the Project parameters panel. The "->" indicates that the first sequence will be ordered against the second.

If the draft has been aligned to the Order against sequence, but this parameter was not set, it can be set and the A&S run with the existing alignment files.

3.e Rules for saving pairs parameters

Before the A&S is executed, the parameters are saved in

  data/seq_results/<proj1-to-proj2>/params.txt

Once the A&S is executed, the parameters are stored in the database.

The file parameters are shown on the pair Parameter panel. These can only be viewed and changed using symap (not viewSymap).

Any parameter that does not have the default value will be shown on the Summary page.

BEWARE: If you run A&S, then change the PROmer or NUCmer settings, but forget to Clear Pair before running A&S again, the parameters on the Summary page will be wrong (SyMAP does not check for this situation).

3.f MUMmer parameters

The arguments for MUMmer are NOT checked for correctness.

To see the parameters for the default MUMmer V3 on MacOS, from the symap directory:

./ext/mummer/mac/mummer -h
./ext/mummer/mac/promer -h
./ext/mummer/mac/nucmer -h

To see the parameters for the default MUMmer V3 on Linux:

./ext/mummer/linux/mummer -h
./ext/mummer/linux/promer -h
./ext/mummer/linux/nucmer -h

If you compiled V4 in the /ext directory:

./ext/mummer4/m4/bin/mummer -h
./ext/mummer4/m4/bin/promer -h
./ext/mummer4/m4/bin/nucmer -h

For a detailed discussion of MUMmer running in SyMAP, see MUMmer.

After running MUMmer, all alignment files are removed except the ".mum" file; to prevent removal, execute ./symap -mum.
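
If you switch between machines, the help commands listed in this section can be generated for the current platform. The sketch below only assembles the paths shown above; the function names are hypothetical, and it assumes it is run from the symap directory and that V4, if requested, was compiled in the location given above.

```python
import platform


def mummer_bin_dir(system=None, use_v4=False):
    """Return the MUMmer directory, relative to the symap directory."""
    system = system or platform.system()
    if use_v4:
        return "./ext/mummer4/m4/bin"
    if system == "Darwin":  # platform.system() reports macOS as "Darwin"
        return "./ext/mummer/mac"
    if system == "Linux":
        return "./ext/mummer/linux"
    raise ValueError("No MUMmer directory is listed for: " + system)


def help_commands(system=None, use_v4=False):
    """Command lines that print the parameters of each MUMmer program."""
    bin_dir = mummer_bin_dir(system, use_v4)
    return [f"{bin_dir}/{tool} -h" for tool in ("mummer", "promer", "nucmer")]
```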

Email Comments To: symap@agcol.arizona.edu
