Data analysis plugins

Locus explorer

The locus explorer is a sequence definition database plugin. It can create schematics showing the polymorphic sites within a locus, calculate the GC content and generate aligned translated sequences.

Click ‘Locus Explorer’ from the sequence definition database contents page.

_images/locus_explorer.png

Polymorphic site analysis

Select the locus you would like to analyse in the Locus dropdown box. The page will reload.

_images/locus_explorer2.png

Select the alleles that you would like to include in the analysis. Variable length loci are limited to 2000 sequences or fewer since these need to be aligned. Select ‘Polymorphic Sites’ in the Analysis selection and click ‘Submit’.

_images/locus_explorer3.png

If an alignment is necessary, the job will be submitted to the job queue and the analysis performed. If no alignment is necessary, then the analysis is shown immediately.

The first part of the page shows the schematic.

_images/locus_explorer4.png

Clicking any of the sequence bases will calculate the exact frequencies of the different nucleotides at that position.

_images/locus_explorer5.png _images/locus_explorer6.png

The second part of the page shows a table listing nucleotide frequencies at each of the variable positions.

_images/locus_explorer7.png
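The frequency table can be reproduced offline from a set of aligned allele sequences. The following is a minimal Python sketch rather than the plugin's own code: the function name polymorphic_sites and the example alleles are hypothetical, and the input is assumed to be already aligned to a common length.

```python
from collections import Counter


def polymorphic_sites(aligned):
    """Return {1-based position: Counter of characters} for variable columns."""
    if not aligned:
        return {}
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to the same length")
    sites = {}
    for pos in range(length):
        counts = Counter(seq[pos] for seq in aligned)
        if len(counts) > 1:  # more than one character seen: the site varies
            sites[pos + 1] = counts
    return sites


# Hypothetical example alleles (already aligned)
alleles = ["ATGCCA", "ATGCTA", "ATACTA"]
for position, counts in polymorphic_sites(alleles).items():
    print(position, dict(counts))
```

Positions are reported 1-based to match the way sequence positions are usually numbered.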

Codon usage

Select the alleles that you would like to include in the analysis. Again, variable length loci are limited to 200 sequences or fewer since these need to be aligned. Click ‘Codon’.

_images/locus_explorer8.png

The GC content of the alleles will be determined and a table of the codon frequencies displayed.

_images/locus_explorer9.png
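As an illustration of what these two figures represent, the sketch below computes the GC percentage and codon counts for a list of allele sequences. It is an assumption-laden example, not the plugin's implementation: the function names are hypothetical, and the orf parameter is assumed to mean the 1-based reading frame in which counting starts.

```python
from collections import Counter


def gc_content(sequences):
    """Percentage of G and C bases across all sequences combined."""
    total = sum(len(s) for s in sequences)
    if total == 0:
        return 0.0
    gc = sum(s.upper().count("G") + s.upper().count("C") for s in sequences)
    return 100.0 * gc / total


def codon_counts(sequences, orf=1):
    """Count complete codons, reading each sequence from frame `orf` (1-3)."""
    counts = Counter()
    for s in sequences:
        s = s.upper()
        for i in range(orf - 1, len(s) - 2, 3):
            counts[s[i:i + 3]] += 1
    return counts


# Hypothetical example alleles
alleles = ["ATGAAATAA", "ATGAAGTAA"]
print(round(gc_content(alleles), 1))
print(codon_counts(alleles).most_common())
```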

Aligned translations

If a DNA coding sequence locus is selected, an aligned translation can be produced.

Select the alleles that you would like to include in the analysis. Again, variable length loci are limited to 200 sequences or fewer since these need to be aligned. Click ‘Translate’.

_images/locus_explorer10.png

An aligned amino acid sequence will be displayed.

_images/locus_explorer11.png

If there appear to be a lot of stop codons in the translation, it is possible that the orf value in the locus definition is not set correctly.
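One way to check whether a different reading frame is more plausible is to translate an allele in each forward frame and count the internal stop codons. The sketch below assumes the standard genetic code and treats the orf value as a 1-based forward frame; the function names and the example sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(seq, frame=1):
    """Translate from the given 1-based forward frame; unknown codons become X."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(frame - 1, len(seq) - 2, 3)
    )


def internal_stops_by_frame(seq):
    """Count stop codons in each forward frame, ignoring the final codon."""
    return {frame: translate(seq, frame)[:-1].count("*") for frame in (1, 2, 3)}


# Hypothetical sequence whose coding frame starts at the second base
print(internal_stops_by_frame("CATGGGTGAATAA"))
```

A frame with no internal stops where the configured frame has several suggests that the locus definition is worth reviewing.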

Field breakdown

The field breakdown plugin for isolate databases displays the frequency of each value for fields, alleles and schemes.

The breakdown function can be selected for the whole database by clicking the ‘Single field’ link in the Breakdown section of the main contents page.

_images/field_breakdown.png

Alternatively, a breakdown can be displayed of the dataset returned from a query by clicking the ‘Fields’ button in the Breakdown list at the bottom of the results table. Please note that the list of functions here may vary depending on the setup of the database.

_images/field_breakdown2.png

A chart will be displayed for the first field.

_images/field_breakdown3.png

Other fields can be chosen by selecting them in the dropdown list box.

_images/field_breakdown4.png

You can also break down loci and schemes by clicking the appropriate button. This will re-populate the dropdown list.

_images/field_breakdown5.png

The charts are dynamic and you can manipulate some aspects of them using controls shown on the screen.

Pie charts

The maximum number of segments shown can be modified by sliding the ‘Max segments’ control. Low frequency values will be grouped into a segment called ‘Others’.

_images/field_breakdown6.png
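The grouping of low frequency values can be sketched as follows. This is an illustrative example only: the function name limit_segments is hypothetical, and it assumes that the ‘Others’ segment counts towards the maximum number of segments, which may not match the chart's exact behaviour.

```python
from collections import Counter


def limit_segments(counts, max_segments):
    """Keep the most frequent values and pool the rest into 'Others'."""
    if max_segments < 1:
        raise ValueError("max_segments must be at least 1")
    ranked = Counter(counts).most_common()
    if len(ranked) <= max_segments:
        return ranked
    kept = ranked[:max_segments - 1]
    others = sum(n for _, n in ranked[max_segments - 1:])
    return kept + [("Others", others)]


# Hypothetical field values and their frequencies
serogroups = {"B": 50, "C": 30, "W": 10, "Y": 6, "X": 4}
print(limit_segments(serogroups, 3))
```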

The chart can be transformed into a donut chart by clicking the donut icon.

_images/field_breakdown7.png

The icon changes to a pie chart image (clicking this will return to the pie chart).

_images/field_breakdown7a.png

Values can also be removed from the analysis by clicking their label in the legend below the chart. The percentages of the other values will be recalculated. Clicking the label again will re-add the value.

Bar charts

Integer fields will be displayed as a bar chart.

_images/field_breakdown8.png

You can modify the height and the orientation of the chart using the controls.

Line charts

Date fields will be displayed as a line chart. By default this shows the cumulative values.

_images/field_breakdown9.png

The chart can be converted into a bar chart showing discrete values by clicking the bar chart icon.

_images/field_breakdown10.png

The icon changes to a line chart image (clicking this will return to the line chart).

_images/field_breakdown11.png

Summary tables

The field breakdown can be displayed as a summary table containing values and percentages of all values. This can be selected by clicking the table icon below the displayed chart.

_images/field_breakdown12.png

The table can be re-ordered by clicking any of the headings.

_images/field_breakdown13.png

The same table can be exported as an Excel file by clicking the Excel icon.

_images/field_breakdown14.png

Alternatively, it can be exported as a tab-delimited text file by clicking the text file icon.

_images/field_breakdown15.png

Exporting allele sequences

If a locus breakdown is being displayed, you can choose to export the allele sequences in FASTA format by clicking the FASTA file icon.

_images/field_breakdown16.png

Two field breakdown

The two field breakdown plugin displays a table breaking down one field against another, e.g. breakdown of serogroup by year.

The analysis can be selected for the whole database by clicking the ‘Two field breakdown’ link on the main contents page.

_images/two_field_breakdown.png

Alternatively, a two field breakdown can be displayed of the dataset returned from a query by clicking the ‘Two field’ button in the Breakdown list at the bottom of the results table. Please note that the list of functions here may vary depending on the setup of the database.

_images/two_field_breakdown2.png

Select the two fields you wish to break down and how you would like the values displayed (percentage/absolute values and totalling options).

_images/two_field_breakdown3.png

Click submit. The breakdown will be displayed as a table. Bar charts will also be displayed provided that the number of returned values for each field is fewer than 30.

_images/two_field_breakdown4.png

The table values can be exported in a format suitable for copying into a spreadsheet by clicking ‘Download as tab-delimited text’ underneath the table.
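The breakdown itself is a cross-tabulation of one field against another. The sketch below shows the idea on a handful of hypothetical isolate records and renders the result as tab-delimited text; the function names, field names and column layout are assumptions and are not taken from the plugin.

```python
from collections import Counter, defaultdict


def two_field_breakdown(records, row_field, column_field):
    """Count records for each combination of row_field and column_field."""
    table = defaultdict(Counter)
    for record in records:
        table[record[row_field]][record[column_field]] += 1
    return table


def to_tab_delimited(table):
    """Render the counts with a header row and a per-row total."""
    columns = sorted({col for counts in table.values() for col in counts})
    lines = ["\t".join([""] + [str(c) for c in columns] + ["Total"])]
    for row in sorted(table):
        values = [table[row][c] for c in columns]
        cells = [str(row)] + [str(v) for v in values] + [str(sum(values))]
        lines.append("\t".join(cells))
    return "\n".join(lines)


# Hypothetical isolate records
isolates = [
    {"serogroup": "B", "year": 2020},
    {"serogroup": "B", "year": 2021},
    {"serogroup": "C", "year": 2020},
    {"serogroup": "B", "year": 2020},
]
print(to_tab_delimited(two_field_breakdown(isolates, "serogroup", "year")))
```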

Note

The job will be submitted to the offline job queue if the query returns 10,000 or more isolates. In this case, the buttons to reverse the axes or to change whether values or percentages are shown will not be available.

Sequence bin breakdown

The sequence bin breakdown plugin calculates statistics based on the number and length of contigs in the sequence bin as well as the number of loci tagged for an isolate record.

The function can be selected by clicking the ‘Sequence bin’ link on the Breakdown section of the main contents page.

_images/seqbin_breakdown.png

Alternatively, it can be accessed following a query by clicking the ‘Sequence bin’ button in the Breakdown list at the bottom of the results table. Please note that the list of functions here may vary depending on the setup of the database.

_images/seqbin_breakdown2.png

Select the isolate records to analyse - these will be pre-selected if you accessed the plugin following a query. You can also select loci and/or schemes which will be used to calculate the totals and percentages of loci designated and tagged. This may be useful as a guide to assembly quality if you use a scheme of core loci where a good assembly would be expected to include all member loci. To determine the total of all loci designated or tagged, click ‘All loci’ in the scheme tree.

There is also an option to determine the mean G+C content of the sequence bin of each isolate.

Click submit.

_images/seqbin_breakdown3.png

If there are fewer than 100 isolates selected, the table will be generated immediately. Otherwise it will be submitted to the job queue.

A table of sequence bin stats will be generated.

_images/seqbin_breakdown4.png
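The kind of statistics involved can be sketched in Python as below. This is not the plugin's code: the function name seqbin_stats is hypothetical, and the inclusion of N50 is an assumption based on common assembly metrics rather than on the description above.

```python
def seqbin_stats(contigs):
    """Summarise the contig sequences (strings) of one isolate's sequence bin."""
    lengths = sorted((len(c) for c in contigs), reverse=True)
    total = sum(lengths)
    if total == 0:
        return {"contigs": len(lengths), "total_length": 0}
    # N50 (assumed metric): length of the contig at which the running total,
    # taken from the longest contig downwards, reaches half the total length
    running = 0
    n50 = 0
    for length in lengths:
        running += length
        if running * 2 >= total:
            n50 = length
            break
    gc = sum(c.upper().count("G") + c.upper().count("C") for c in contigs)
    return {
        "contigs": len(lengths),
        "total_length": total,
        "mean_length": total / len(lengths),
        "n50": n50,
        "gc_percent": 100.0 * gc / total,
    }


# Hypothetical contigs
print(seqbin_stats(["GC" * 5, "AT" * 3, "GGAA"]))
```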

You can choose to export the data in tab-delimited text or Excel formats by clicking the appropriate link at the bottom of the table.

_images/seqbin_breakdown5.png

Sequence bin records can also be accessed by clicking the ‘Display’ button for each row of the table.

_images/seqbin_breakdown6.png

Genome comparator

Genome Comparator is an optional plugin that can be enabled for specific databases. It is used to compare whole genome data of isolates within the database using either the database defined loci or the coding sequences of an annotated genome as the comparator.

Output is equivalent to a whole genome MLST profile, a distance matrix calculated based on allelic differences and a NeighborNet graph generated from this distance matrix.

Genome Comparator can be accessed on databases where it is enabled from the contents page by clicking the ‘Genome Comparator’ link.

_images/genome_comparator.png

Alternatively, it can be accessed following a query by clicking the ‘Genome Comparator’ button at the bottom of the results table. Isolates with sequence data returned in the query will be automatically selected within the Genome Comparator interface.

_images/genome_comparator2.png

Analysis using defined loci

Select the isolate genomes that you wish to analyse. These will either be in a dropdown list or, if there are too many in the database, a text input where a list can be entered. You can also upload your own genomes for analysis - these should be either a single file in FASTA format (if you have just one genome), or a zip file containing multiple FASTA files. Select either the loci from the list or a set of schemes. Press submit.

_images/genome_comparator3.png

The job will be submitted to the job queue and will start running shortly. Click the link to follow the job progress and view the output.

_images/genome_comparator4.png

There will be a series of tables displaying variable loci, colour-coded to indicate allelic differences. Finally, there will be links to a distance matrix, which can be loaded into SplitsTree for further analysis, and to a NeighborNet chart showing relatedness of isolates. Due to processing constraints on the web server, this NeighborNet is only calculated if 200 or fewer genomes are selected for analysis, but it can be generated in the stand-alone version of SplitsTree using the distance matrix if required.

_images/genome_comparator5.png

Analysis using annotated reference genome

Select the isolate genomes that you wish to analyse and then either enter a Genbank accession number for the reference genome, or select from the list of reference genomes (this list will only be present if the administrator has set it up). Selecting reference genomes will hide the locus and scheme selection forms.

_images/genome_comparator6.png

Output is similar to when comparing against defined loci, but this time every coding sequence in the annotated reference will be BLASTed against the selected genomes. Because allele designations are not defined, the allele found in the reference genome is designated allele 1, the next different sequence is allele 2 etc.

_images/genome_comparator10.png
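The allele numbering described above can be sketched as follows. The function name number_alleles and the input layout (one extracted sequence per isolate for a single locus, or None where nothing was found) are assumptions made for the example.

```python
def number_alleles(reference_seq, isolate_seqs):
    """Assign arbitrary allele numbers: the reference sequence is allele 1 and
    each new distinct sequence gets the next number in order of appearance."""
    allele_ids = {reference_seq: 1}
    designations = {}
    for isolate, seq in isolate_seqs.items():
        if seq is None:  # locus not found in this isolate
            designations[isolate] = None
            continue
        if seq not in allele_ids:
            allele_ids[seq] = len(allele_ids) + 1
        designations[isolate] = allele_ids[seq]
    return designations


# Hypothetical sequences for one coding sequence in four isolates
print(number_alleles("ATGAAA", {
    "isolate1": "ATGAAA",
    "isolate2": "ATGAAG",
    "isolate3": "ATGAAG",
    "isolate4": None,
}))
```

Because the numbers are assigned per analysis, they are only meaningful within a single run and are not comparable to allele identifiers defined in a sequence definition database.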

Include in identifiers fieldset

This selection box allows you to choose which isolate provenance fields will be included in the results table. This does not affect the output of the alignments as taxa names are limited in length by the alignment programs.

_images/genome_comparator7.png

Multiple values can be selected by clicking while holding down Ctrl.

Reference genome fieldset

This section allows you to choose a reference genome to use as the source of comparator sequences.

_images/genome_comparator8.png

There are three possibilities here:

  1. Enter accession number - Enter a Genbank accession number of an annotated reference and Genome Comparator will automatically retrieve this from Genbank.
  2. Select from list - The administrator may have selected some genomes to offer for comparison. If these are present, simply select from the list.
  3. Upload genome - Click ‘Browse’ and upload your own reference. This can be in Genbank, EMBL or FASTA format. Ensure that the filename ends in the appropriate file extension (.gb, .embl, .fas) so that it is recognized.

Parameters/options fieldset

This section allows you to modify BLAST parameters. This affects sensitivity and speed.

_images/genome_comparator9.png

  • Min % identity - This sets the threshold identity that a matching sequence has to be in order to be considered (default: 70%). Only the best match is used.
  • Min % alignment - This sets the percentage of the length of reference allele sequence that the alignment has to cover in order to be considered (default: 50%).
  • BLASTN word size - This is the length of the initial identical match that BLAST requires before extending a match (default: 20). Increasing this value improves speed at the expense of sensitivity. The default value gives good results in most cases. The default setting used to be 15 but the new default of 20 is almost as good (there was 1 difference among 2000 loci in a test run) but the analysis runs twice as fast.
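The effect of the two thresholds can be illustrated with a small filter over BLAST hits. This is a sketch under stated assumptions: the hit dictionaries and their keys (identity, alignment_length, bit_score) are hypothetical, and choosing the best match by bit score is an assumption about how "best" is defined.

```python
def best_match(hits, ref_length, min_identity=70.0, min_alignment=50.0):
    """Return the highest-scoring hit that passes both thresholds, or None."""
    passing = [
        h for h in hits
        if h["identity"] >= min_identity
        and 100.0 * h["alignment_length"] / ref_length >= min_alignment
    ]
    if not passing:
        return None
    return max(passing, key=lambda h: h["bit_score"])


# Hypothetical hits for a 100 bp reference allele
hits = [
    {"identity": 95.0, "alignment_length": 40, "bit_score": 70},    # too short
    {"identity": 65.0, "alignment_length": 100, "bit_score": 120},  # identity too low
    {"identity": 80.0, "alignment_length": 90, "bit_score": 100},   # passes both
]
print(best_match(hits, ref_length=100))
```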

Distance matrix calculation fieldset

This section provides options for the treatment of incomplete and paralogous loci when generating the distance matrix.

_images/genome_comparator11.png

For incomplete loci, i.e. those that continue beyond the end of a contig, you can:

  • Completely exclude from analysis - Any locus that is incomplete in at least one isolate will be removed from the analysis completely. Using this option means that if there is one bad genome with a lot of incomplete sequences in your analysis, a large proportion of the loci may not be used to calculate distances.
  • Treat as a distinct allele - This treats all incomplete sequences as a specific allele ‘I’. This varies from any other allele, but all incomplete sequences will be treated as though they were identical.
  • Ignore in pairwise comparison (default) - This is probably the best option. In this case, incomplete alleles are only excluded from the analysis when comparing the particular isolate that has it. Other isolates with different alleles will be properly included. The effect of this option will be to shorten the distances of isolates with poorly sequenced genomes with the others.

Paralogous loci, i.e. those with multiple good matches, can be excluded from the analysis (default). This is the safest option since there is no guarantee that differences seen between isolates at paralogous loci are real if the alternative matches are equally good. NB: Loci are also only classed as paralogous when the alternative matches identify different sequences, otherwise multiple contigs of the same sequence region would result in false positives.
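The three ways of treating incomplete loci can be compared with a short sketch. It is illustrative only: the function name allelic_distances, the mode names and the use of the string I to mark an incomplete locus in a profile are assumptions made for the example.

```python
from itertools import combinations

INCOMPLETE = "I"  # assumed marker for an incomplete locus in a profile


def allelic_distances(profiles, incomplete="ignore"):
    """Count differing loci for each pair of isolates.

    incomplete: 'exclude'  - drop any locus incomplete in at least one isolate
                'distinct' - treat 'I' as an allele that matches only another 'I'
                'ignore'   - skip a locus only in pairs where it is incomplete
    """
    if incomplete not in ("exclude", "distinct", "ignore"):
        raise ValueError("unknown mode: " + incomplete)
    names = list(profiles)
    if not names:
        return {}
    loci = list(range(len(profiles[names[0]])))
    if incomplete == "exclude":
        loci = [i for i in loci
                if all(profiles[n][i] != INCOMPLETE for n in names)]
    distances = {}
    for a, b in combinations(names, 2):
        diff = 0
        for i in loci:
            x, y = profiles[a][i], profiles[b][i]
            if incomplete == "ignore" and INCOMPLETE in (x, y):
                continue
            if x != y:
                diff += 1
        distances[(a, b)] = diff
    return distances


# Hypothetical four-locus profiles; isolate B is incomplete at the third locus
profiles = {"A": [1, 1, 1, 1], "B": [1, 2, "I", 1], "C": [2, 2, 3, 3]}
for mode in ("exclude", "distinct", "ignore"):
    print(mode, allelic_distances(profiles, mode))
```

In this example the pairwise option keeps the third locus for the A and C comparison, whereas complete exclusion discards it for every pair.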

Alignments fieldset

This section enables you to choose to produce alignments of the sequences identified.

_images/genome_comparator12.png

Available options are:

  • Produce alignments - Selecting this will produce the alignment files, as well as XMFA and FASTA outputs of aligned sequences. This will result in the analysis taking longer to run.
  • Include ref sequences in alignment - When doing analysis using an annotated reference, selecting this will include the reference sequence in the alignment files.
  • Align all loci - By default, only loci that vary among the isolates are aligned. You may however wish to align all if you would like the resultant XMFA and FASTA files to include all coding sequences.
  • Aligner - There are currently two choices of alignment algorithm (provided they have both been installed)
    • MAFFT (default) - This is the preferred option as it is significantly quicker than MUSCLE, uses less memory, and produces comparable results.
    • MUSCLE - This was originally the only choice. It is still included to enable previous analyses to be re-run and compared but it is recommended that MAFFT is used otherwise.

Core genome analysis fieldset

This section enables you to modify the inclusion threshold used to calculate whether or not a locus is part of the core genome (of the dataset).

_images/genome_comparator13.png

The default setting of 90% means that a locus is counted as core if it appears within 90% or more of the genomes in the dataset.

There is also an option to calculate the mean distance among sequences of the loci. Selecting this will also select the option to produce alignments.
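The inclusion threshold amounts to a simple proportion test per locus. The following sketch uses a hypothetical function name and input layout (a mapping from locus name to the set of isolates in which the locus was found).

```python
def core_loci(presence, n_genomes, threshold=90.0):
    """Return loci found in at least `threshold` percent of the genomes."""
    if n_genomes <= 0:
        raise ValueError("n_genomes must be positive")
    return sorted(
        locus for locus, isolates in presence.items()
        if 100.0 * len(isolates) / n_genomes >= threshold
    )


# Hypothetical dataset of 10 genomes, identified here by number
presence = {
    "locusA": set(range(10)),  # found in 10/10 genomes
    "locusB": set(range(9)),   # found in 9/10 genomes
    "locusC": set(range(8)),   # found in 8/10 genomes
}
print(core_loci(presence, n_genomes=10))
```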

Filter fieldset

This section allows you to further filter your collection of isolates and the contigs to include.

_images/genome_comparator14.png

Available options are:

  • Sequence method - Choose to only analyse contigs that have been generated using a particular method. This depends on the method being set when the contigs were uploaded.
  • Project - Only include isolates belonging to the chosen project. This enables you to select all isolates and filter to a project.
  • Experiment - Contig files can belong to an experiment. How this is used can vary between databases, but this enables you to only include contigs from a particular experiment.

Understanding the output

Distance matrix

The distance matrix is simply a count of the number of loci that differ between each pair of isolates. It is generated in NEXUS format which can be used as the input file for SplitsTree. This can be used to generate NeighborNet, Split decomposition graphs and trees offline. If 200 isolates or fewer are included in the analysis, a NeighborNet is automatically generated from this distance matrix.

Unique strains

The table of unique strains is a list of isolates that are identical at every locus. Every isolate is likely to be classed as unique if a whole genome analysis is performed, but with a constrained set of loci, such as those for MLST, this will group isolates that are indistinguishable at that level of resolution.
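Grouping isolates that share an identical profile can be sketched as below. The function name unique_strains and the profile layout are hypothetical; the sketch does not attempt to reproduce how the plugin treats missing or incomplete loci.

```python
from collections import defaultdict


def unique_strains(profiles):
    """Group isolates whose allelic profiles are identical at every locus."""
    groups = defaultdict(list)
    for isolate, profile in profiles.items():
        groups[tuple(profile)].append(isolate)
    # Largest groups first; groups of one are isolates with a unique profile
    return sorted(groups.values(), key=len, reverse=True)


# Hypothetical seven-locus MLST profiles
profiles = {
    "isolate1": [2, 3, 4, 3, 8, 4, 6],
    "isolate2": [2, 3, 4, 3, 8, 4, 6],
    "isolate3": [2, 3, 4, 3, 8, 4, 7],
}
print(unique_strains(profiles))
```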

BLAST

The BLAST plugin enables you to BLAST a sequence against any of the genomes in the database, displaying a table of matches and extracting matching sequences.

The function can be accessed by selecting the ‘BLAST’ link on the Analysis section of the main contents page.

_images/blast.png

Alternatively, it can be accessed following a query by clicking the ‘BLAST’ button in the Analysis list at the bottom of the results table. Please note that the list of functions here may vary depending on the setup of the database.

_images/blast2.png

Select the isolate records to analyse - these will be pre-selected if you accessed the plugin following a query. Paste in a sequence to query - this can be either a DNA or peptide sequence.

_images/blast3.png

Click submit.

A table of BLAST results will be displayed.

_images/blast4.png

Clicking any of the ‘extract’ buttons will display the matched sequence along with a translated sequence and flanking sequences.

_images/blast5.png _images/blast6.png

At the bottom of the results table are links to export the matching sequences in FASTA format, optionally including flanking sequences. You can also export the table in tab-delimited text or Excel formats.

_images/blast11.png

Include in results table fieldset

This selection box allows you to choose which isolate provenance fields will be included in the results table.

_images/blast7.png

Multiple values can be selected by clicking while holding down Ctrl.

Parameters fieldset

This section allows you to modify BLAST parameters. This affects sensitivity and speed.

_images/blast8.png

  • BLASTN word size - This is the length of the initial identical match that BLAST requires before extending a match (default: 11). Increasing this value improves speed at the expense of sensitivity.
  • BLASTN scoring - This is a dropdown box of combinations of identical base rewards; mismatch penalties; and gap open and extension penalties. BLASTN has a constrained list of allowed values which reflects the available options in the list.
  • Hits per isolate - By default, only the best match is shown. Increase this value to the number of hits you’d like to see per isolate.
  • Flanking length - Set the size of the upstream and downstream flanking sequences that you’d like to include.
  • Use TBLASTX - This compares the six-frame translation of your nucleotide query sequence against the six-frame translation of the contig sequences. This is significantly slower than using BLASTN.
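To illustrate the flanking length setting, the sketch below extracts a matched region together with its upstream and downstream flanks from a contig. The function name and the coordinate convention (1-based, inclusive start and end on the forward strand) are assumptions; flanks are simply truncated at the ends of the contig.

```python
def extract_with_flanking(contig, start, end, flanking=100):
    """Return (upstream, match, downstream) for a 1-based inclusive match."""
    upstream_start = max(0, start - 1 - flanking)
    upstream = contig[upstream_start:start - 1]
    match = contig[start - 1:end]
    downstream = contig[end:end + flanking]
    return upstream, match, downstream


# Hypothetical contig with a match at positions 5-8
contig = "AAAACCCCGGGGTTTT"
print(extract_with_flanking(contig, 5, 8, flanking=3))
```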

No matches

_images/blast9.png

Click this option to create a row in the table indicating that a match was not found. This can be useful when screening a large number of isolates.

Filter fieldset

This section allows you to further filter your collection of isolates and the contig sequences to include.

_images/blast10.png

Available options are:

  • Sequence method - Choose to only analyse contigs that have been generated using a particular method. This depends on the method being set when the contigs were uploaded.
  • Project - Only include isolates belonging to the chosen project. This enables you to select all isolates and filter to a project.
  • Experiment - Contig files can belong to an experiment. How this is used can vary between databases, but this enables you to only include contigs from a particular experiment.

BURST

BURST is an algorithm used to group MLST-type data based on a count of the number of profiles that match each other at specified numbers of loci. The analysis is available for both sequence definition database and isolate database schemes that have primary key fields set. The algorithm has to be specifically enabled by an administrator. Analysis is limited to 1000 or fewer records.

The plugin can be accessed following a query by clicking the ‘BURST’ button in the Analysis list at the bottom of the results table. Please note that the list of functions here may vary depending on the setup of the database.

_images/burst.png

If there are multiple schemes that can be analysed, these can then be selected along with the group definition.

_images/burst2.png

Modifying the group definition affects the size of groups and how they link together. By default, the definition is n-2 (where n is the number of loci), so for example on a 7 locus MLST scheme groups contain STs that match at 5 or more loci to any other member of the group.
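The group definition can be read as single-linkage clustering: two profiles are joined if they match at the required number of loci, and groups are the connected sets that result. The sketch below illustrates that reading with a small union-find structure. It is not the plugin's implementation, and the function name and example profiles are hypothetical.

```python
def burst_groups(profiles, min_matches=None):
    """Group profiles that match at >= min_matches loci to any other member.

    min_matches defaults to n-2, where n is the number of loci.
    """
    names = list(profiles)
    if not names:
        return []
    if min_matches is None:
        min_matches = len(profiles[names[0]]) - 2
    parent = {name: name for name in names}

    def find(x):
        # Follow parent links to the group representative, compressing the path
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            matches = sum(x == y for x, y in zip(profiles[a], profiles[b]))
            if matches >= min_matches:
                parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted((sorted(g) for g in groups.values()),
                  key=lambda g: (-len(g), g))


# Hypothetical seven-locus profiles
sts = {
    "ST1": [1, 1, 1, 1, 1, 1, 1],
    "ST2": [1, 1, 1, 1, 1, 1, 2],
    "ST3": [1, 1, 1, 1, 1, 3, 2],
    "ST4": [9, 9, 9, 9, 9, 9, 9],
    "ST5": [1, 1, 1, 1, 4, 3, 2],
}
print(burst_groups(sts))
```

ST5 matches ST1 at only four loci, but it joins the same group because it matches ST3 at six.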

Click Submit.

A series of tables will be displayed indicating the groups of profiles. Where one profile can be identified as a central genotype, i.e. the profile that has the greatest number of other profiles that are single locus variants (SLV), double locus variants (DLV) and so on, a graphical representation will be displayed. The central profile is indicated with an asterisk.

_images/burst3.png
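Identifying a central genotype as described above can be sketched by counting, for each profile in a group, how many other profiles are SLVs and DLVs. The function name is hypothetical, and the tie-breaking rule (the first profile encountered wins) is an assumption, not a documented behaviour.

```python
def central_profile(group):
    """Return the profile name with the most SLVs, using DLVs to break ties."""
    def variant_counts(name):
        slv = dlv = 0
        for other, profile in group.items():
            if other == name:
                continue
            differences = sum(x != y for x, y in zip(group[name], profile))
            if differences == 1:
                slv += 1
            elif differences == 2:
                dlv += 1
        return (slv, dlv)

    return max(group, key=variant_counts)


# Hypothetical group of seven-locus profiles
group = {
    "ST10": [1, 1, 1, 1, 1, 1, 1],
    "ST11": [2, 1, 1, 1, 1, 1, 1],
    "ST12": [1, 2, 1, 1, 1, 1, 1],
    "ST13": [1, 1, 3, 1, 1, 1, 1],
    "ST14": [2, 1, 1, 1, 1, 1, 4],
}
print(central_profile(group))
```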

SLV profiles that match the central profile are shown within a red circle surrounding the central profile. Most distant profiles (triple locus variants) may be linked with a line. Larger groups may additionally have DLV profiles. These are shown in a blue circle.

_images/burst4.png

Groups can get very large, where linked profiles form sub-groups and an attempt is made to depict these.

_images/burst5.png

Codon usage

The codon usage plugin for isolate databases calculates the absolute and relative synonymous codon usage by isolate and by locus.

The function can be selected by clicking the ‘Codon usage’ link in the Analysis section of the main contents page.

_images/codon_usage.png

Alternatively, it can be accessed following a query by clicking the ‘Codons’ button in the Analysis list at the bottom of the results table. Please note that the list of functions here may vary depending on the setup of the database.

_images/codon_usage2.png

Enter the ids of the isolate records to analyse - these will be already entered if you accessed the plugin following a query. Select the loci you would like to analyse, either from the dropdown loci list, and/or by selecting one or more schemes.

_images/codon_usage3.png

Click submit. The job will be submitted to the queue and will start running shortly. Click the link to follow the job progress and view the output.

_images/codon_usage4.png

Four tab-delimited text files will be created.

  • Absolute frequency of codon usage by isolate
  • Absolute frequency of codon usage by locus
  • Relative synonymous codon usage by isolate
  • Relative synonymous codon usage by locus

_images/codon_usage5.png
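Relative synonymous codon usage (RSCU) is conventionally the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally. The sketch below applies that conventional definition with the standard genetic code; whether the plugin handles stop codons or alternative genetic codes in the same way is not stated here, and the function name is hypothetical.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(codon_counts):
    """Return {codon: RSCU} for amino acids observed at least once."""
    synonymous = defaultdict(list)
    for codon, amino_acid in CODON_TABLE.items():
        synonymous[amino_acid].append(codon)
    values = {}
    for codons in synonymous.values():
        total = sum(codon_counts.get(c, 0) for c in codons)
        if total == 0:
            continue  # amino acid not observed, RSCU undefined
        expected = total / len(codons)
        for c in codons:
            values[c] = codon_counts.get(c, 0) / expected
    return values


# Hypothetical absolute codon counts
counts = {"AAA": 3, "AAG": 1, "ATG": 5}
result = rscu(counts)
print(result["AAA"], result["AAG"], result["ATG"])
```

A value above 1 indicates that a codon is used more often than expected under equal usage, and a value below 1 indicates the opposite.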

Unique combinations

The Unique Combinations plugin calculates the frequencies of unique field combinations within an isolate dataset. Provenance fields, composite fields, allele designations and scheme fields can be combined.

The function can be selected by clicking the ‘Unique combinations’ link in the Breakdown section of the main contents page. This will run the analysis on the entire database.

_images/unique_combinations.png

Alternatively, it can be accessed following a query by clicking the ‘Combinations’ button in the Breakdown list at the bottom of the results table. This will run the analysis on the dataset returned from the query. Please note that the list of functions here may vary depending on the setup of the database.

_images/unique_combinations2.png

Select the combination of fields to analyse, e.g. serogroup and finetyping antigens.

_images/unique_combinations3.png

Click submit. When the analysis has completed you will see a table showing the unique combinations of the selected fields along with the frequency and percentage of the combination.

_images/unique_combinations4.png
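The calculation behind this table is a frequency count over tuples of field values. A minimal sketch, with a hypothetical function name and hypothetical field names and records:

```python
from collections import Counter


def unique_combinations(records, fields):
    """Return (combination, frequency, percentage) rows, most frequent first."""
    counts = Counter(tuple(r.get(f) for f in fields) for r in records)
    total = sum(counts.values())
    return [(combo, n, 100.0 * n / total) for combo, n in counts.most_common()]


# Hypothetical isolate records
isolates = [
    {"serogroup": "B", "PorA_VR1": "7-2"},
    {"serogroup": "B", "PorA_VR1": "7-2"},
    {"serogroup": "B", "PorA_VR1": "19"},
    {"serogroup": "C", "PorA_VR1": "5"},
]
for row in unique_combinations(isolates, ["serogroup", "PorA_VR1"]):
    print(row)
```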

The table can be downloaded in tab-delimited text or Excel formats by clicking the links at the bottom of the page.

_images/unique_combinations5.png

Polymorphisms

The Polymorphisms plugin generates a Locus Explorer polymorphic site analysis on the alleles designated in an isolate dataset following a query.

The analysis is accessed by clicking the ‘Polymorphic sites’ button in the Breakdown list at the bottom of a results table following a query.

_images/polymorphisms.png

Select the locus that you would like to analyse from the list.

_images/polymorphisms2.png

Click ‘Analyse’.

A schematic of the locus is generated showing the polymorphic sites. A full description of this can be found in the Locus Explorer polymorphic site analysis section.

_images/polymorphisms3.png

Gene Presence

The Gene Presence analysis tool will determine whether loci are present or absent, incomplete, have alleles designated, or sequence regions tagged for selected isolates and loci. If a genome is present and a locus designation not set in the database, then the presence and completion status are determined by scanning the genomes. The results can be displayed as interactive pivot tables or a heatmap. The analysis is limited to 500,000 data points (locus x isolates).

The Gene Presence tool can be accessed from the contents page by clicking the ‘Gene Presence’ link.

_images/gene_presence1.png

Alternatively, it can be accessed following a query by clicking the ‘Gene Presence’ button at the bottom of the results table. Isolates returned from the query will be automatically selected within the plugin interface.

_images/gene_presence2.png

Select the isolates to include. Analysis can be performed on any selection of loci, or more conveniently, you can select a scheme in the scheme selector to include all loci belonging to that scheme.

The parameters of the BLAST query used to determine presence or absence can be modified, but in most cases the default options should work well. Click ‘Submit’ to start the analysis.

_images/gene_presence3.png

The job will be sent to the job queue. When it has finished, you will have two options to display the output: ‘Pivot Table’ or ‘Heatmap’.

_images/gene_presence4.png

Pivot Table

Clicking the ‘Pivot Table’ button will display an interactive pivot table. The default display shows the number of isolates for which each locus is present or absent.

_images/gene_presence5.png
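The default view corresponds to counting, for each locus, the isolates in which it is present or absent. The sketch below shows that aggregation on hypothetical rows; the function name and the row layout (one dictionary per locus and isolate pair with a boolean present key) are assumptions and do not describe the plugin's data format.

```python
from collections import defaultdict


def presence_pivot(rows):
    """Count isolates with each locus present or absent."""
    table = defaultdict(lambda: {"present": 0, "absent": 0})
    for row in rows:
        key = "present" if row["present"] else "absent"
        table[row["locus"]][key] += 1
    return dict(table)


# Hypothetical data points (locus x isolate)
rows = [
    {"isolate": 1, "locus": "abcZ", "present": True},
    {"isolate": 2, "locus": "abcZ", "present": True},
    {"isolate": 1, "locus": "adk", "present": True},
    {"isolate": 2, "locus": "adk", "present": False},
]
print(presence_pivot(rows))
```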

You can break down any combination of fields by dragging them from the field area at the top of the table to either of the axes. For example, to show how many isolates have alleles designated and sequence regions tagged for each locus, drag the ‘designated’ and ‘tagged’ fields to the x-axis selector.

_images/gene_presence6.png

The table will be re-drawn including these fields.

_images/gene_presence7.png

Note

If your dataset has more than 100,000 data points (locus x isolates), then be aware that combining both id (or isolate) and locus within the table will result in sluggish performance. Any other combination of fields should be fine.

Heatmap

Clicking the ‘Heatmap’ button will display an interactive heatmap. By default the display shows the presence or absence of a locus for each isolate.

Hovering the mouse cursor or touching a region will identify the isolate and locus in a tooltip.

_images/gene_presence8.png

Change the attribute that is displayed by changing the selection in the attribute dropdown box:

_images/gene_presence9.png

The heatmap does scale to the number of records required to be displayed. If you find individual points to be too small, then choose a smaller subset of data to display:

_images/gene_presence10.png

GrapeTree

GrapeTree is a tool for generating and visualising minimum spanning trees. It has been developed to handle large datasets (in the region of 1000s of genomes) and works with 1000s of loci as used in cgMLST. It uses an improved minimum spanning algorithm that is better able to handle missing data than alternative algorithms and is able to produce publication quality outputs. Datasets can include metadata which allows nodes in the resultant tree to be coloured interactively.
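For readers unfamiliar with minimum spanning trees, the sketch below builds one from allelic profiles using the classical Prim's algorithm, with missing alleles (None) skipped when counting differences. This is explicitly not GrapeTree's improved algorithm; it is a textbook illustration, and the function names and example profiles are hypothetical.

```python
def allelic_distance(p, q):
    """Number of loci that differ, skipping loci missing in either profile."""
    return sum(
        1 for x, y in zip(p, q)
        if x is not None and y is not None and x != y
    )


def minimum_spanning_tree(profiles):
    """Classical Prim's algorithm; returns a list of (from, to, distance)."""
    names = list(profiles)
    if not names:
        return []
    in_tree = [names[0]]
    edges = []
    while len(in_tree) < len(names):
        best = None
        # Find the shortest edge joining the tree to a node not yet in it
        for a in in_tree:
            for b in names:
                if b in in_tree:
                    continue
                d = allelic_distance(profiles[a], profiles[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        edges.append((best[1], best[2], best[0]))
        in_tree.append(best[2])
    return edges


# Hypothetical four-locus profiles
profiles = {
    "A": [1, 1, 1, 1],
    "B": [1, 1, 1, 2],
    "C": [1, 1, 2, 2],
    "D": [3, 3, 3, 3],
}
print(minimum_spanning_tree(profiles))
```

This quadratic search is adequate for a handful of profiles only; it is intended to show the concept rather than to scale to the dataset sizes mentioned above.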

GrapeTree can be accessed from the contents page by clicking the ‘GrapeTree’ link.

_images/grapetree.png

Alternatively, it can be accessed following a query by clicking the ‘GrapeTree’ button at the bottom of the results table. Isolates returned from the query will be automatically selected within the GrapeTree interface.

_images/grapetree2.png

Select the isolates to include. The tree can be generated from allelic profiles of any selection of loci, or more conveniently, you can select a scheme in the scheme selector to include all loci belonging to that scheme.

Additional fields can be selected to be included as metadata for use in colouring nodes - select any fields you wish to include. Multiple selections can be made by holding down shift or ctrl while selecting. Click ‘Submit’ to start the analysis.

_images/grapetree3.png

The job will be sent to the job queue. When it has finished, click the button marked ‘Launch GrapeTree’.

_images/grapetree4.png

The generated tree will be rendered in the GrapeTree application page.

_images/grapetree5.png

The image can be manipulated in various ways. These include modifying the tree layout, customising node labels and size, modifying branch lengths and collapsing branches. The image can be saved in SVG format which can be further edited in image publishing software such as Inkscape.

As an example, the default cgMLST tree (above) has been modified (below) as follows:

  • Nodes coloured by clonal complex
  • Labels removed
  • Branches collapsed where <=100 loci different
  • Node size set to 200%
  • Kurtosis (node size relative to number of isolates) set to 75%
  • Dynamic rendering allowed to run to fan out nodes

_images/grapetree6.png

Full details can be found in the GrapeTree manual.

Note

GrapeTree has been described in the following publication:

Z Zhou, NF Alikhan, MJ Sergeant, N Luhmann, C Vaz, AP Francisco, JA Carrico, M Achtman (2018) GrapeTree: Visualization of core genomic relationships among 100,000 bacterial pathogens. Genome Res 28:1395-1404.