AGCoL

  Parameters


The original SyMAP was written for diverse plant genomes with short introns, but has been modified to work for the long introns of mammalian genomes, and less diverse genomes.

Start SyMAP
1. Build database
  1. Selected
  2. Available Syntenies
  3. Function buttons
  4. CPU and Verbose
2. Project Parameters
  1. Parameter panel
  2. Display
  3. Load project
  4. Load annotation
  5. GFF Attributes
  6. Rules for saving project parameters
3. Pair Parameters
  1. Parameter panel
  2. Alignment
  3. Cluster Hits
  4. Synteny
  5. Rules for saving pair parameters
  6. MUMmer parameters
This document provides details of the SyMAP parameters and functions.

Start SyMAP

To start SyMAP, type at the command line: ./symap

To view the command line options: ./symap -h

First-time users of SyMAP should see:

Project Manager

1. Build database


1.a Selected

The projects selected from the Projects list in the left panel are listed in the Selected section on the right. The available functions vary with the project's state, as listed below.

If any project has not been loaded into the database, you will see:
   Load All Projects: Loads all projects that have not been loaded yet.
If a project is not loaded into the database, you will see:
   Remove from disk
      Only: Remove alignment directories from disk.
      All: Remove alignment and project directory from disk.

Remove alignments removes alignments from data/seq_results for this project. You will be prompted for each one to confirm you want it removed. If there are no alignments, you will only see the prompt to remove the project directory.

Remove from disk removes data/seq/<project-name> from disk. You will be prompted to confirm you want it removed. If removed, the project will no longer be shown on the left.

   Load project: Loads the sequence and optional annotations into the database. After loading, always verify the results by selecting the View link, which provides a summary of what has been loaded.
   Parameters: Brings up a panel of parameters; see Project Parameters. After the project is loaded, you can still change the Display parameters.
If a project is loaded into the database, you will see:
   Remove from database: The project and its synteny pairs will be removed from the database, but the files stay on disk.
   Reload project
      Only: Reload the project only.
      All: Reload the project and remove alignments from disk.

If All is selected, it prompts for each alignment directory of this project before removing it. It removes the alignments and final results, but it leaves the params.txt file. You will need to remove the alignment(s) if (1) there is a change in sequence, or (2) there is a change in the Minimum length parameter; see Load project parameters.

For either option, it executes Remove from database followed by Load project.

   Reload annotation: Removes the annotation from the database, then loads the annotations.

This does not affect the alignments, so they do not have to be redone. The Alignment&Synteny commands will recognize that there is an existing alignment and will only perform the synteny computation.

   Parameters: Brings up the Project Parameters panel.

For any action that will remove the project or alignments from disk, a popup asks you to confirm that you want this done. If multiple alignment directories will be removed, it prompts for each one.

1.b Available Syntenies

Sequence alignments are performed with MUMmer3, but can be changed to use MUMmer4 (see SyMAP MUMmer).

This section shows a table with the status of alignments between the selected loaded projects. Each cell in the table represents a pair of projects and the cell contains a status code showing whether or not that pair has been aligned (codes are listed below). Note that the table shows each pair cell twice, but only the lower cells are activated.

Clicking on a cell selects that pair of projects (the cell will be highlighted in green), and the buttons that can be selected are activated.

Code | Description
✔ | Synteny for this project pair is ready to view.
A | The MUMmer alignment has been performed but the synteny computation has not been run. This status occurs if a pair is completed but then annotations are re-loaded for one of the projects, or if the MUMmer files have been added by the user.
? | The alignments have not been completed. In this case, select Selected Pair (Redo) and the alignments will be completed, followed by the synteny algorithm.
(blank) | The alignment has not been started.

See Pair Parameters for additional information on the Available Syntenies and codes.

1.c Function Buttons

Alignment&Synteny (A&S)
All Pairs Run (or complete) the synteny computation for all pairs in the Available Syntenies table. However, this will not run 'self-syntenies'; those need to be done individually by selecting the pair.
Selected Pair
Selected Pair (Redo)
Run (or complete) the A&S computation for the selected project pair.
If the pair is already complete, the button label changes to Redo, and only the synteny computation will be rerun.
If you wish to rerun the MUMmer alignment, first use the Clear Pair function.
Clear Pair Only: remove synteny from database
All: remove synteny and alignments from disk for this pair

If you have changed the Alignment parameters, or loaded new sequence for one of the projects, you need to have the alignments removed and redone; otherwise, you can just remove the synteny from the database (this step can be skipped by selecting Selected Pair (Redo)).

Parameters Set the pair parameters for the selected pair cell.

For the remaining display buttons, see User Guide.

1.d CPU and Verbose

CPUs: Enter the maximum number of CPUs to use for an alignment. SyMAP will use up to that number (but may use fewer, depending on the number of sequences being aligned). Alternatively, the number of CPUs may be entered in the symap.config file or entered as a command line argument (-p N).

Verbose checkbox:


2. Project Parameters


2.a Parameter panel

Click the Parameters link for a project to open the parameters panel shown on the lower right.

Make sure these two parameters are correct before running the alignment: Minimum length, Sequence files. See Load project.

 

Sequence and Anno (annotation) files:
If the files are put in the default locations, nothing needs to be entered for them. Default locations:

  data/seq/<project-name>/sequence
  data/seq/<project-name>/annotation
See input for a description of the input files.

2.b Display

New values take immediate effect on Save. Most of the values are shown in the Selected section, as shown on the right. The values are saved in the symap_5/data/seq/<project-name>/params.txt file.

Category
  Description: Category label for the project. This is only used to group projects on the left side of the Manager panel. Category labels must be composed of only letters, numbers, dash, underscore, or period. Either select an existing label from the drop-down or enter a new one in the text box. Do NOT enter the same label with different capitalization, as this may cause problems.
  Default value: Uncategorized
  Shown: Selected

Display name
  Description: A user-friendly name for the project. Shorter names work better in the displays. Names must be composed of only letters, numbers, dash, underscore, or period. The name must be unique (case-insensitive) across all Display names and project-names.
  Default value: project-name
  Shown: Selected and all displays.

Abbreviation
  Description: A name that is exactly 4 characters. Names must be composed of only letters, numbers, dash, underscore, or period. Uniqueness across Abbreviations is not required. It can be the same as the corresponding Display name or project-name if they are only 4 characters.
  Default value: Last 4 characters of Display name.
  Shown: Queries column headings, and other places where a guaranteed short name is needed.

Description
  Description: Description of the project. Do NOT use quotes, backslash or #.
  Default value: New project
  Shown: Selected and View

Group type
  Description: How to refer to the sequences.
  Default value: Chromosome
  Shown: Selected

Anno key count
  Description: This applies to the annotation attribute columns shown in the Queries results table. See the GFF Attributes section below.
  Default value: 50
  Shown: --
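
The naming rules in this table can be checked before typing values into the panel. The following is a minimal sketch, not SyMAP code: the function names are hypothetical, "letters" is assumed to mean ASCII letters, and "quotes" is assumed to cover both single and double quotes.

```python
import re

# Allowed characters per the rules above: letters, numbers, dash, underscore, period.
# Assumption: "letters" means ASCII letters only.
_NAME_RE = re.compile(r"[A-Za-z0-9._-]+")


def valid_label(text: str) -> bool:
    """Check a Category label or Display name against the character rule."""
    return _NAME_RE.fullmatch(text) is not None


def valid_abbreviation(text: str) -> bool:
    """An Abbreviation uses the same characters and must be exactly 4 long."""
    return len(text) == 4 and valid_label(text)


def valid_description(text: str) -> bool:
    """A Description must not contain quotes, backslash, or '#'.

    Assumption: both single and double quotes are disallowed.
    """
    return not any(ch in text for ch in "\"'\\#")


def default_abbreviation(display_name: str) -> str:
    """Documented default: the last 4 characters of the Display name."""
    return display_name[-4:]
```

For example, `valid_abbreviation("Arab")` passes, while a Display name containing a space fails `valid_label`.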

2.c Load project

The following parameters are under the Load project section of the parameters panel.

Group prefix
The term "Group" is used for any FASTA sequence type, e.g. chromosome, scaffold, contig. This option sounds trivial but is important for a good display, so please read the following carefully.

1. When a Group prefix is entered, it allows SyMAP to remove the prefix from the chromosome names and use the remaining part as a shorter name (e.g. '1' instead of 'chr1', as shown on the lower right).
2. If the sequence file has a mix of prefixes (e.g. Chr01, Scaf2345):
  • If a Group prefix is entered, only sequences with that prefix will be loaded.
  • Leaving the Group prefix blank will load all sequences; their prefixes will not be removed (e.g. the 'chr3' in the image on the right).
3. Long names really clutter the display; it is best to use a consistent prefix that SyMAP can remove; otherwise, use really short prefixes, e.g. 's' for scaffold, 'c' for chromosome.
4. You may remove the prefix after the project is loaded. For example, if your sequences had names "Scaf1", "Scaf2", etc., and "Scaf" was NOT entered as the Group prefix before load, you may enter it later and it will be removed from all sequence names. However, this is not reversible and a prefix cannot be added to the sequence names.
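
The effect of the Group prefix on which sequences load and how they are named can be sketched as follows. This is an illustration of the rules listed above, not SyMAP's implementation; the function name is hypothetical and the match is assumed to be case-sensitive.

```python
def apply_group_prefix(names, prefix=""):
    """Map each loaded sequence name to its display name.

    With a prefix: only names starting with it are loaded, and the prefix
    is stripped. With a blank prefix: every name is loaded unchanged.
    Assumption: the prefix comparison is case-sensitive.
    """
    if not prefix:
        return {name: name for name in names}
    return {
        name: name[len(prefix):]
        for name in names
        if name.startswith(prefix)
    }
```

With names `Chr01`, `Chr02`, `Scaf2345` and the prefix `Chr`, only the two chromosomes are kept and they display as `01` and `02`.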
This parameter is finicky; after loading a project, select View for a popup of the input, and check the output to the terminal to make sure the annotation was loaded correctly. Also, see xToSymap as it may help you create the files with good prefixes.

Minimum length
This must be an integer; commas are allowed (e.g. 1,000,000).
This is the minimum length of a FASTA sequence that will be loaded; shorter sequences will be ignored. Note that annotations for ignored sequences will also be ignored, but some warning messages will be printed to the terminal. See xToSymap Length for help with setting this parameter.
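
To preview which sequences a given Minimum length would drop, the rule can be sketched as below. The function names are hypothetical, and a sequence whose length equals the minimum is assumed to be kept.

```python
def parse_min_length(text: str) -> int:
    """Parse an integer that may contain commas, e.g. '1,000,000'."""
    cleaned = text.replace(",", "").strip()
    if not cleaned.isdecimal():
        raise ValueError(f"Minimum length must be an integer: {text!r}")
    return int(cleaned)


def filter_by_length(lengths: dict, min_length: int):
    """Split {sequence name: length} into kept sequences and ignored names.

    Assumption: a sequence exactly min_length long is kept.
    """
    kept = {name: n for name, n in lengths.items() if n >= min_length}
    ignored = sorted(name for name in lengths if name not in kept)
    return kept, ignored
```

Annotations on the ignored names would also be skipped at load time, as noted above.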

Sequence files
Select the input FASTA sequence file(s), or directories of sequence files. For formatting, see Sequence files. Default location: data/seq/<project-name>/sequence

If either the Sequence files or the Minimum length parameter is changed:
▸ If the project has already been loaded, Reload Project.
▸ If A&S has previously been run, select Clear Pair and remove the alignment files, then run A&S.

2.d Load annotation

The following parameters are under the Load annotation section of the parameters panel.

Anno keywords
A comma separated list of keywords. This can be used to reduce the annotation attribute keywords shown in the 2D display and Queries table. See the GFF Attributes section below.

Anno files
Select the input GFF3-formatted annotation files corresponding to your sequences. Note, using a GFF3 file directly can cause problems if it does not conform to what SyMAP expects; see Annotation files. Annotation is optional but highly recommended. Default location: data/seq/<project-name>/annotation

If either of the above is changed:
▸ If the annotation has already been loaded, Reload Annotation.
▸ If A&S has been previously run, re-run A&S (the existing alignment files will be reused).

2.e GFF Attributes

This section gives details on what GFF attributes are displayed in SyMAP, which refers to them as annotations.

The gene annotation is shown on the 2D display and as columns in the Queries results table. The attributes (annotations) come from the last column of the GFF file. The attributes are a keyword=value list, e.g.

   ID=gene-AT1G01010;Name=NAC001;ID=rna-NM_099983.2;product=NAC domain containing protein 1
Defaults: Generally, all genes in a file have the same keywords, in which case, use the defaults. This will cause the entire attribute to be shown for the gene in the 2D display, and the Queries table will have columns for each keyword that has over 50 occurrences. In the example above, the columns will be ID, Name and product (the second ID will be ignored).

If there are many different keywords in the attribute list, this causes too many columns in the Queries table. This can be reduced by one (or both) of the following:

Anno key count: If there are many different keywords in the attribute list, set this count N to filter out all keywords with fewer than N occurrences. The Anno key count can be modified at any time using symap (not viewSymap).

Anno keywords: The keyword=value pairs saved for each gene can be limited by listing the desired keywords separated by commas. This approach also shortens the annotation description shown per gene in the 2D display. Referring to the example above, if the string "ID, product" was entered, the Name=value pair would not be part of any gene annotation. This must be set before Load Annotations is executed.
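
The two filters can be tried on a GFF attribute column before loading. The sketch below is an approximation under stated assumptions, not SyMAP's parser: the function names are hypothetical, a repeated keyword keeps only its first value (matching the "second ID will be ignored" example), and a keyword is kept as a column when it occurs at least N times.

```python
from collections import Counter


def parse_attributes(attr: str) -> dict:
    """Split 'key=value;key=value', keeping the first value of each key."""
    result = {}
    for field in attr.strip().split(";"):
        if "=" not in field:
            continue
        key, value = field.split("=", 1)
        key = key.strip()
        if key and key not in result:
            result[key] = value.strip()
    return result


def query_columns(attr_lines, key_count=50):
    """Keywords that would become Queries columns (Anno key count filter).

    Assumption: a keyword is kept when it has >= key_count occurrences.
    """
    counts = Counter()
    for line in attr_lines:
        counts.update(parse_attributes(line).keys())
    return [key for key, n in counts.items() if n >= key_count]


def keep_keywords(attrs: dict, keywords: str) -> dict:
    """Apply an Anno keywords list such as 'ID, product' to one gene."""
    wanted = {k.strip() for k in keywords.split(",") if k.strip()}
    if not wanted:
        return dict(attrs)
    return {k: v for k, v in attrs.items() if k in wanted}
```

Applied to the example attribute line with the keywords `ID, product`, the `Name` pair is dropped and the first `ID` value is retained.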

2.f Rules for saving project parameters

Before the project is loaded, the parameters are saved to the file
  data/seq/<project-name>/params.txt
The params.txt file parameters are shown on the Project Parameters panel. These can only be viewed and changed using symap (not viewSymap).

When the project is loaded, all parameters are saved to the database except the A&S parameters (Mask-genes and Order-against). When the A&S is executed, then the corresponding parameters are saved to the database.

Any parameter that does not have the default value will be shown on the View popup.


3. Pair Parameters


3.a Parameter panel

The Available Syntenies section explains the table shown on the lower right. The following provides more information in the context of the pair parameters.
The table on the right has cells that have the following completed:

Value | Alignment | Synteny
blank | No | No
A | Yes | No
✔ | Yes | Yes

 

Alignment will not be redone if the cell contains an A. This is important because MUMmer is very time-consuming, but the synteny computation is not (see timing results); hence, one can make changes to the cluster or synteny parameters and re-run without redoing the alignments.

Select a pair cell in the Available Syntenies table followed by the Parameters button, which will pop up the panel shown on the right.

If the pair cell contains a ✔ or A and the parameters for a section are changed, do the following:

Changed Section | Action
Alignment | Select Clear Pair to remove the existing alignments, then Selected Pair (Redo).
Cluster Hits | Use Selected Pair (Redo) directly; the existing alignments for the pair will be used.
Synteny | Use Selected Pair (Redo) directly; the existing alignments for the pair will be used.
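
The table above reduces to one rule: only a change in the Alignment section requires clearing the alignments first. A minimal sketch of that decision follows; the function name is hypothetical, and the step strings simply combine the button names and the All option described in Function Buttons.

```python
def redo_steps(changed_sections):
    """Return the Manager buttons to press after changing pair parameters.

    Section names follow the parameter panel: 'Alignment', 'Cluster Hits',
    'Synteny'. The returned strings are labels for illustration only.
    """
    changed = set(changed_sections)
    unknown = changed - {"Alignment", "Cluster Hits", "Synteny"}
    if unknown:
        raise ValueError(f"Unknown section(s): {sorted(unknown)}")
    if not changed:
        return []
    if "Alignment" in changed:
        # Existing MUMmer files are stale, so they must be removed first.
        return ["Clear Pair (All)", "Selected Pair (Redo)"]
    # Cluster Hits and Synteny changes reuse the existing alignments.
    return ["Selected Pair (Redo)"]
```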

 

The parameters are described in the following 3 sections: Alignment, Cluster hits and Synteny.


3.b Alignment

Preparing the sequences

Concat
  Description: By default, all sequences of the 1st genome are concatenated into one file, and multiple files are created from the 2nd genome to be searched against the first. For the 2nd genome, sequential short sequences are put into one file until the file length is >60M.
  Concat unchecked: To reduce memory usage, you can uncheck Concat so that multiple files are created for each genome, and all files from the 1st genome are searched against all files from the 2nd genome. For both genomes, sequential short sequences are put into one file until the file length is >60M.
  See below for timing differences.
  Default: On

Mask <abbrev>
  Description: Mask out all non-genic parts of the sequences before running MUMmer (gene annotation must be provided). The <abbrev> is set in the Project parameters popup Abbreviation parameter, and the corresponding project's sequence is masked. If Mask is changed after A&S, the alignment files need to be removed with Clear Pair and A&S run again.
  Default: Off
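
The grouping of sequential sequences into files can be pictured with the greedy sketch below. It is an illustration only: the function name is hypothetical, "60M" is assumed to mean 60 million bases, and a file is assumed to be closed as soon as its running total exceeds the limit.

```python
def group_sequences(sequences, limit=60_000_000):
    """Group (name, length) pairs, in order, into files.

    Sequences are appended to the current file until its total length
    exceeds the limit; then a new file is started.
    """
    files, current, total = [], [], 0
    for name, length in sequences:
        current.append(name)
        total += length
        if total > limit:
            files.append(current)
            current, total = [], 0
    if current:
        files.append(current)
    return files
```

Under these assumptions a chromosome longer than the limit ends up in a file of its own, and short scaffolds are batched together.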

Concat: The following statistics are from comparing Arabidopsis thaliana (119M) against Brassica rapa (297M) on MacOS using 1 CPU.

Statistic | Concatenated | Not concatenated
Hits | 48819 | 48846
Synteny blocks | 334 | 334
Gene hits | 46319 | 46348
Synteny hits | 38334 | 38345
Finished in | 1 hour 8 minutes | 1 hour 35 minutes

MUMmer parameters

The default MUMmer parameters seem to work well with SyMAP, so they probably do not need changing.

Parameter | Description | Default
PROmer Args [1] | Arguments for PROmer. See MUMmer parameters. | -
NUCmer Args [1] | Arguments for NUCmer. See MUMmer parameters. | -
Self Args [2] | Arguments to use when aligning a chromosome to itself. | -
PROmer Only [3] | Use PROmer for all alignments. | Off
NUCmer Only [3] | Use NUCmer for all alignments. | Off

[1] BEWARE: Entered PROmer and NUCmer arguments are NOT checked for correctness.
[2] When self-alignment is performed, standard arguments are used when comparing different chromosomes. However, additional arguments may be desired when a chromosome sequence is run against itself, e.g. --nosimplify.
[3] By default, PROmer is used for alignments between different projects, while NUCmer is used for self-alignments.

3.c Cluster Hits


3.c.I Algo1 vs Algo2

Algorithm 1 (modified original, updated v5.4.0, abbreviated Algo1):
Pros
  • This is a generic algorithm that has knowledge of genes versus intergenic hits.
  • It is recommended for ordering sequence contigs and when there is little or no gene annotation.
  • It must be used for self-synteny.
  • It has been used on 100's of genome comparisons.
Cons
  • It does not distinguish between exon and intron hits.
  • It is more likely to miss good homologous gene pairs.
Parameters
  • It has only one parameter, which makes it easier to run, but there is no control over which hits are filtered.

Algorithm 2 (exon-intron, last update v5.6.0, abbreviated Algo2):
Pros
  • This is a new algorithm with explicit knowledge of gene pairs and their exon-intron structure.
  • When there are good gene annotations for both genomes, this is definitely the superior algorithm.
Cons
  • It does not perform self-synteny.
  • It does not work when a given chromosome is split over multiple MUMmer files; this will NOT happen when SyMAP generates the MUMmer files. This is also a "Pro" as it takes less memory.
Parameters
  • It has two sets of parameters, hence more control over results than Algo1. See Hints below the parameter explanation. (As of v5.6.0, the parameters generally do not need adjusting.)

Algo1 is the default since it works for all inputs.

Wrong strand: A wrong-strand cluster is one where all hits are to the same strand (++/--) yet the cluster aligns to two genes on different strands (+-/-+), or vice versa.

Algo1 does include these hits. You can view them in the Queries, where the Hit St column will differ from the two gene Gst columns.

Algo2 does NOT include these hits. You can request to view the potential hits during the A&S by running it with the "-wsp" flag, i.e. ./symap -wsp. This will only show gene pairs with (1) multiple hits to exons (in one or multiple gene pairs), and (2) at least one gene that is not an overlapping gene. It is up to the user to determine what is real.
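
The wrong-strand condition compares the relative orientation of the hits with that of the two genes. A minimal sketch, assuming each strand pair is written as a two-character string such as "++" or "+-" (the function names are hypothetical):

```python
def same_orientation(strand_pair: str) -> bool:
    """True for '++' or '--', False for '+-' or '-+'."""
    if len(strand_pair) != 2 or any(s not in "+-" for s in strand_pair):
        raise ValueError(f"Expected two strand characters, got {strand_pair!r}")
    return strand_pair[0] == strand_pair[1]


def is_wrong_strand(hit_strands: str, gene_strands: str) -> bool:
    """A cluster is 'wrong strand' when the hit orientation and the
    orientation of the two genes disagree."""
    return same_orientation(hit_strands) != same_orientation(gene_strands)
```

For instance, hits on "++" between genes on "+-" are flagged, while hits on "--" between genes on "++" are not, since both pairs share the same relative orientation.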

Hints about parameter settings

Hint for Algo1: Increasing the Top N parameter can cause too many hits and reduce synteny. Decreasing it can remove more gene-pair hits. Hence, try Algo2 if you want more gene pairs.

Hint for Algo2: In the output to the terminal (in Verbose mode), if any chromosome pair shows over 10,000 hits, the parameters probably need to be made more stringent. Too many hits confuse the synteny algorithm, which results in synteny blocks not being found and in very long execution times.

→ Suggestion: For large genomes, experiment with the parameters on just one pair of the chromosomes. (You can use xToSymap for the split.)

→ Using the v5.6.0 algorithm, I have experimented with the datasets (1) human, chimpanzee, mouse (2) Arabidopsis, Brassica rapa, Brassica oleracea. Only B. rapa to B. oleracea needed parameter adjustment: the number of G1 hits was over 200k, which is way more than typical; by increasing all parameters a small amount, this reduced to just over 100k.

3.c.II Parameter description

Number Pseudo
  Description: If selected, the un-annotated ends of hits will be assigned a pseudo number. This is explained below in Pseudo Genes.
  Default: No

Algo1 (original)
  Default: Yes
  Top N piles
    Description: Retains the top N hits of a pile of overlapping hits (Pile of Hits), as well as all hits with a score of at least 80% of the Nth hit.
    Default: 2

Algo2 (gene-centric)
  Default: No
  Scale
    Description: Increase a scale to filter out more clustered hits; decrease it to filter out fewer.
      Gene: Determines the percentage of required gene coverage.
      Exon: Determines the percentage of required exon coverage.
      Len: For G2 and G1*, minimum size hit unless it completely covers a gene. Default minimum is 300bp (N*300bp is minimum).
      G0_Len: For G0*, minimum size hit. Default minimum is 1000bp (N*1000bp is minimum). Suggestion: increase for closely related species.
      *G2=gene to gene, G1=gene to non-gene, G0=non-gene to non-gene
      A clustered hit must pass multiple rules regarding coverage. For example, the Exon coverage parameter may be made more stringent, yet there will still be some low exon scores due to good gene coverage, etc.
    Default: N=1.0
  Keep piles
    Description: EE, EI, En, II, In (E=exon, I=intron, n=non-gene)
      • This ONLY applies when there is a pile of overlapping hits (Pile of Hits); it tells the algorithm what type of cluster hits to retain if they are in a pile.
      • Hits are filtered before pile analysis.
      • Intergenic-intergenic pile hits, and any unchecked categories, are filtered as described for Top N piles below.
    Default: EE, EI, En
  Top N piles
    Description: Algo2 uses the Algo1 Top N parameter for any unchecked categories, but in a more conservative way. It retains the Top N hits of a piled region that are within 80% of the length of the longest hit.
    Default: 2
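
The two Top N piles rules differ in what is compared against the 80% threshold. The sketch below is one reading of the descriptions above, not SyMAP's code: the function names are hypothetical, a pile is represented only by a list of scores (Algo1) or hit lengths (Algo2), and ties are kept.

```python
def top_n_pile_algo1(scores, n=2, fraction=0.8):
    """Keep the top N hits by score, plus any further hit whose score is
    at least 80% of the Nth hit's score."""
    ranked = sorted(scores, reverse=True)
    if len(ranked) <= n:
        return ranked
    cutoff = fraction * ranked[n - 1]
    return ranked[:n] + [s for s in ranked[n:] if s >= cutoff]


def top_n_pile_algo2(lengths, n=2, fraction=0.8):
    """More conservative: of the N longest hits, keep only those within
    80% of the longest hit's length."""
    ranked = sorted(lengths, reverse=True)[:n]
    if not ranked:
        return []
    return [x for x in ranked if x >= fraction * ranked[0]]
```

With the default N=2, Algo1 can keep more than two hits from a pile, whereas this reading of Algo2 never keeps more than two.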

3.c.III Pseudos and Piles

Pseudo genes
The end of a hit may not overlap an annotated gene; by default, this will just show a Gene# of 'N.~' where N is the chromosome number.

If Number Pseudo is selected, a pseudo Gene# will be created. The counts start after the annotated gene numbers and are suffixed by "~". For example, if the last Gene# for Chr03 is 5550 (e.g. 3.5550.), the first pseudo gene number will be 6000 (e.g. 3.6000.~).


If A&S was run without numbered pseudos, you can add them later by executing: ./symap -pseudo; the Selected Pair button will be replaced with Pseudo Only, which will just do the numbering (the Number Pseudo will need to be selected, but it will remind you). This cannot be undone; you would need to re-run A&S with the Number Pseudo unchecked to remove them.

Pros
  • If you would like the Queries Cluster and Report to include un-annotated hits.
  • If you are exploring new candidate genes, numbered pseudos are easier to track.
  • If your genome is not annotated, numbered pseudos are easier to track.
Cons
  • The Queries results can be easier to view with the 'N.~' as it is more distinct from a real Gene#.

If comparing more than 2 species, it makes the most sense to have them all numbered or not numbered (though a mix will work).

Piles of Hits

The image below shows a pile of hits on the left (Cabb Chr02) that link to repetitive genes on the right (Arab Chr05). These are important to keep.

The right image shows a pile of hits in an intergenic region (Cabbage Chr03) to multiple other regions (B.rapa Chr01). There are MANY occurrences of repeats like this in the MUMmer file, which is why these piles must be filtered; if they are not, the synteny algorithm does not perform well.
Pile to repetitive gene

Pile to intergenic

3.d Synteny

Parameter | Description | Default
Min Hits | Minimum number of anchors required to define a synteny block. | 7
Same orient [1] | All hits in a block must have the same relative orientation: either all same-strand ('+/+' or '-/-') or all opposite-strand ('+/-' or '-/+'). | Off
Merge [2] | Merge overlapping (or nearby) synteny blocks into larger blocks. | Off

[1] Same orient uses the same algorithm as the original except that it evaluates the hits for one orientation, then the other. It also adds a constraint that the correlation must be positive for blocks of hits in the same orientation, and negative for opposite orientation.

[2] Merge may be beneficial when there are many small blocks. Below is an example where the first image does not have merged blocks and the second does (the blue dots are hits that belong to the block).

[Images: first, blocks not merged; second, blocks merged]

The following three images show the same region when evaluated with three parameter sets: Default (one block), Same orient (three blocks), and Same orient with Merge (two blocks).
In the last image, the reverse-orientation block is embedded in another block.
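
The correlation constraint used by Same orient can be illustrated with hit coordinates on the two sequences. This is a sketch under assumptions: each hit is reduced to a (position on sequence 1, position on sequence 2) pair, the Pearson coefficient stands in for whatever correlation SyMAP computes, and the function names are hypothetical.

```python
def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if undefined)."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("Need two non-empty lists of equal length")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def orientation_consistent(hits, same_strand: bool) -> bool:
    """Same-strand blocks need a positive correlation between positions,
    opposite-strand blocks a negative one."""
    r = pearson([h[0] for h in hits], [h[1] for h in hits])
    return r > 0 if same_strand else r < 0
```

A block whose hits run in the same direction on both sequences passes as same-strand, and a block whose positions run in opposite directions passes as opposite-strand.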

Order against

Draft sequences may be ordered against another project. See Ordering details.
The Seq1->Seq2 and Seq2->Seq1 options use the Abbreviation set in the Project parameters panel. The "->" indicates that the first sequence will be ordered against the second.

If the draft has been aligned to the Order against sequence, but this parameter was not set, it can be set and the A&S run with the existing alignment files.

3.e Rules for saving pair parameters

Before the A&S is executed, the parameters are saved in
  data/seq_results/<proj1-to-proj2>/params.txt
Once the A&S is executed, the parameters are stored in the database.

The file parameters are shown on the pair Parameter panel. These can only be viewed and changed using symap (not viewSymap).

Any parameter that does not have the default value will be shown on the Summary page.

BEWARE: If you run A&S, then change the PROmer or NUCmer settings, but forget to Clear Pair before running A&S again, the parameters on the Summary page will be wrong (SyMAP does not check for this situation).

3.f MUMmer parameters

The arguments for MUMmer are NOT checked for correctness.

To see the parameters for the default MUMmer V3 on MacOS, from the symap directory:

./ext/mummer/mac/mummer -h
./ext/mummer/mac/promer -h
./ext/mummer/mac/nucmer -h
To see the parameters for the default MUMmer V3 on Linux:
./ext/mummer/linux/mummer -h
./ext/mummer/linux/promer -h
./ext/mummer/linux/nucmer -h
If you compiled V4 in the /ext directory:
./ext/mummer4/m4/bin/mummer -h
./ext/mummer4/m4/bin/promer -h
./ext/mummer4/m4/bin/nucmer -h
For a detailed discussion of MUMmer running in SyMAP, see MUMmer.
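
The directories listed above can be collected in a small helper when scripting the help commands. This is only a convenience sketch based on the paths shown in this section; the function name is hypothetical and it does not describe how SyMAP itself locates MUMmer.

```python
import platform
from pathlib import Path


def mummer_dir(symap_root, use_v4=False, system=None):
    """Return the directory holding mummer, promer and nucmer.

    symap_root is the symap directory; system defaults to the current
    platform name as reported by platform.system().
    """
    root = Path(symap_root)
    if use_v4:
        # Only valid if MUMmer4 was compiled in the ext directory.
        return root / "ext" / "mummer4" / "m4" / "bin"
    system = system or platform.system()
    if system == "Darwin":
        return root / "ext" / "mummer" / "mac"
    if system == "Linux":
        return root / "ext" / "mummer" / "linux"
    raise ValueError(f"No bundled MUMmer directory listed for {system!r}")
```

For example, joining the result with `promer` gives the executable whose `-h` output lists the PROmer arguments.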

After running MUMmer, all alignment files are removed except the ".mum" file; to prevent removal,
execute ./symap -mum.


Email Comments To: symap@agcol.arizona.edu