metaXplor v1.1-RELEASE


General information

metaXplor is a scalable, distributable, fully web-interfaced application for managing, sharing and exploring metagenomic data. Being based on a flexible NoSQL data model, it places very few constraints on dataset contents, and is thus suited to handling outputs from both shotgun and metabarcoding techniques. By supporting incremental data feeding and providing means to combine filters on all imported fields, it allows for exhaustive content browsing, as well as rapid narrowing to find very specific records. The application also features various interactive data visualization tools, ways to query contents by BLASTing external sequences, and an integrated pipeline to enrich assignments with phylogenetic placements.

Project homepage / source-code: https://github.com/SouthGreenPlatform/metaXplor

Current CIRAD instance: https://metaxplor.cirad.fr

Import

An account is required to import data into metaXplor. Data can be public (visible to anybody) or private (visible only to authorized users). You can request an account from the login page. Demonstration project data is available here.

Import archive structure

The structure of the project archive should be as follows:

					 
[optional directory]
    |--[optional prefix]samples.tsv
    |--[optional prefix]assignments.tsv
    |--[optional prefix]sequences.tsv
    |--[whatever].fasta
				

You must upload project data as a .zip archive containing all 4 files.
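The layout above can be checked before uploading. The following is a minimal Python sketch, not part of metaXplor: the helper names `check_archive_members` and `build_archive` are hypothetical, and the rule "exactly one file per expected suffix" is an assumption derived from the "all 4 files" requirement.

```python
import zipfile
from pathlib import Path

# Suffixes taken from the layout above; an optional prefix may precede each one.
REQUIRED_TSV_SUFFIXES = ("samples.tsv", "assignments.tsv", "sequences.tsv")


def check_archive_members(names):
    """Return a list of problems found in a list of archive member names."""
    # Only the file name matters: files may sit inside an optional directory.
    basenames = [n.rsplit("/", 1)[-1] for n in names if not n.endswith("/")]
    problems = []
    for suffix in REQUIRED_TSV_SUFFIXES:
        # Assumption: exactly one file per suffix is expected.
        if sum(b.endswith(suffix) for b in basenames) != 1:
            problems.append(f"expected exactly one *{suffix} file")
    if sum(b.endswith(".fasta") for b in basenames) != 1:
        problems.append("expected exactly one .fasta file")
    return problems


def build_archive(files, zip_path):
    """Check the file set, then write it to a .zip archive (hypothetical helper)."""
    problems = check_archive_members([Path(f).name for f in files])
    if problems:
        raise ValueError("; ".join(problems))
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for f in files:
            # Store files at the archive root; the directory level is optional.
            archive.write(f, arcname=Path(f).name)
    return zip_path
```

Calling `check_archive_members(zipfile.ZipFile("project.zip").namelist())` on an existing archive returns an empty list when the four expected files are present.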

Project meta-information

At the beginning of the upload procedure, a form describing the project must be filled in manually. Here are the available fields:

| Fields | Description | Required |
|---|---|---|
| Project code | Project code. There can't be two projects with the same code | yes |
| Project name | Project name. There can't be two projects with the same name | yes |
| Project description | Short description of the data | no |
| Contact information | How to contact data owner / manager (email / phone) | no |
| Sequencing technology | (454, Illumina, PacBio, Sanger...) | no |
| Author(s) | Name(s) of the person(s) who generated the data | no |
| Max # accessions per assignment | How many accessions to take into account when several are provided for a single assignment | no |
| Sequencing date | Date of sequencing | no |
| Samples available | Are original samples still available (yes/no) | no |
| Assembly method | (Abyss, CAP3, SPAdes, Velvet, ...) | no |
| Publication reference(s) | References or links to publications associated with the data | no |
| Other informations | Additional information, comments ... | no |

*samples.tsv content

Files with the tsv extension are tab-delimited flat text files containing the fields described below. The order of the fields is not important as long as field names are correct (case matters). Any non-required field may contain an empty string or a dot ".".

We strongly advise using, whenever possible, standard attribute names as defined in the BioSample database.

| Fields | Description | Required | Example |
|---|---|---|---|
| sample_name | Code of the analysed sample | yes | C14_A88 |
| lat_lon | Location of the host (WGS 84 standard), formatted as latitude,longitude | Column must exist, cells may be empty | 43.5985,3.8794 |
| collection_date | Collection date of the sample, in ISO 8601 format (YYYY-mm-ddThh:mm:ss+00:00); YYYY-mm-dd is also accepted | Column must exist, cells may be empty | 2016-12-07 |
| Additional fields (unlimited) | Other information | no | - |
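The constraints on samples.tsv can be checked locally before import. The sketch below is an illustration only, not metaXplor's own validation code: `validate_samples` and `is_blank` are hypothetical names, the latitude/longitude range check is an assumption based on WGS 84, and Python's `datetime.fromisoformat` is used as a stand-in for ISO 8601 parsing (recent Python versions accept a few more ISO 8601 variants than the two formats listed above).

```python
import csv
import io
from datetime import datetime

REQUIRED_COLUMNS = ("sample_name", "lat_lon", "collection_date")


def is_blank(value):
    """Empty cells may be written as an empty string or a single dot."""
    return value is None or value.strip() in ("", ".")


def validate_samples(tsv_text):
    """Return (line_number, message) tuples describing problems in samples.tsv text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # Column names are compared exactly, since case matters.
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [(1, "missing column(s): " + ", ".join(missing))]
    problems = []
    for line_no, row in enumerate(reader, start=2):
        if is_blank(row["sample_name"]):
            problems.append((line_no, "sample_name is required"))
        if not is_blank(row["lat_lon"]):
            try:
                lat, lon = (float(part) for part in row["lat_lon"].split(","))
                in_range = -90 <= lat <= 90 and -180 <= lon <= 180
            except ValueError:
                # Raised for non-numeric parts or a wrong number of parts.
                in_range = False
            if not in_range:
                problems.append((line_no, "lat_lon must be latitude,longitude in WGS 84 ranges"))
        if not is_blank(row["collection_date"]):
            try:
                datetime.fromisoformat(row["collection_date"].strip())
            except ValueError:
                problems.append((line_no, "collection_date must be an ISO 8601 date"))
    return problems
```

An empty result means no problem was detected; each tuple otherwise points at the offending line of the file (the header being line 1).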

*sequences.tsv content

All project sequences must appear here.

Files with the tsv extension are tab-delimited flat text files containing the fields described below. The order of the fields is not important as long as field names are correct (case matters). Zero values may be omitted and replaced with an empty string, which saves space.

| Fields | Description | Required | Example |
|---|---|---|---|
| qseqid | Sequence name. This field has to be unique | yes | Contig38_C1-4_A88_(24) |
| sample | Numeric value indicating how much each sample contributed to the sequence (may be a read depth for shotgun data, or a count of cluster sequences for metabarcoding) | Column must exist for each sample, (not all) cells may be empty | 1 |

*.fasta content

A standard FASTA-format file is expected here, containing all sequences referred to in the *sequences.tsv file.
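Since the FASTA file and *sequences.tsv must describe the same set of sequences, a quick consistency check can save a failed import. This is a minimal sketch under stated assumptions: `fasta_ids` and `compare_ids` are hypothetical helpers, and the FASTA identifier is assumed to be the first whitespace-delimited word of each header line.

```python
import csv
import io
from collections import Counter


def fasta_ids(fasta_text):
    """Return the identifier of each FASTA record (first word after '>')."""
    ids = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            words = line[1:].split()
            ids.append(words[0] if words else "")
    return ids


def compare_ids(sequences_tsv_text, fasta_text):
    """Report duplicated qseqid values and identifiers present in only one file."""
    reader = csv.DictReader(io.StringIO(sequences_tsv_text), delimiter="\t")
    tsv_ids = [row["qseqid"] for row in reader]
    fa_ids = fasta_ids(fasta_text)
    counts = Counter(tsv_ids)
    return {
        # qseqid has to be unique in sequences.tsv
        "duplicate_qseqid": sorted(i for i, n in counts.items() if n > 1),
        "missing_from_fasta": sorted(set(tsv_ids) - set(fa_ids)),
        "missing_from_tsv": sorted(set(fa_ids) - set(tsv_ids)),
    }
```

All three lists are empty when the two files agree.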

*assignments.tsv content

Only assigned sequences should appear here.

Files with the tsv extension are tab-delimited flat text files containing the fields described below. The order of the fields is not important as long as field names are correct (case matters). Any non-required field may contain an empty string or a dot ".".

A bash script able to convert BLAST outputs to the expected format may be downloaded here.

| Fields | Description | Required | Example |
|---|---|---|---|
| qseqid | Sequence name. This field may appear on multiple lines when working with several assignments per sequence | yes | Contig38_C1-4_A88_(24) |
| assignment_method | Method used for assigning sequence | yes | BlastX |
| sseqid | Subject accession(s), comma-separated if multiple (assignment taxon will be set to first common ancestor in this case). NB: prefixing the accession number with "n:" (for nucleotide) or "p:" (for protein) is mandatory | depends: required in the absence of taxonomy_id | p:AUW34315.1,p:WP_081775933.1 |
| taxonomy_id | NCBI taxonomy ID | depends: required in the absence of sseqid | 2045190 |
| best_hit | Flag telling whether an assignment is considered the best hit for the current sequence AND assignment method. Any non-empty value other than a dot sets the flag to true | depends: not required if only one assignment is provided for the current sequence and method; required (and unique) otherwise | Y |
| Additional fields (unlimited) | Other information | no | ... |
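The "depends" rules above are easy to get wrong by hand, so the following Python sketch shows one way to check them before import. It is an illustration rather than metaXplor's actual validation logic; `validate_assignments` and `is_set` are hypothetical names.

```python
import csv
import io
from collections import defaultdict


def is_set(value):
    """A cell counts as filled unless it is missing, empty, or a single dot."""
    return value is not None and value.strip() not in ("", ".")


def validate_assignments(tsv_text):
    """Return (line_number, message) tuples; line_number is None for group-level issues."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    problems = []
    groups = defaultdict(list)  # (qseqid, assignment_method) -> best_hit flags
    for line_no, row in enumerate(reader, start=2):
        for field in ("qseqid", "assignment_method"):
            if not is_set(row.get(field)):
                problems.append((line_no, f"{field} is required"))
        sseqid = row.get("sseqid")
        if is_set(sseqid):
            # Every accession must carry an "n:" or "p:" prefix.
            for accession in sseqid.split(","):
                if not accession.strip().startswith(("n:", "p:")):
                    problems.append((line_no, f"accession {accession.strip()!r} lacks an n: or p: prefix"))
        elif not is_set(row.get("taxonomy_id")):
            problems.append((line_no, "either sseqid or taxonomy_id is required"))
        key = (row.get("qseqid"), row.get("assignment_method"))
        groups[key].append(is_set(row.get("best_hit")))
    for (qseqid, method), flags in groups.items():
        # With several assignments per sequence and method, exactly one is the best hit.
        if len(flags) > 1 and sum(flags) != 1:
            problems.append((None, f"{qseqid} / {method}: exactly one best_hit expected, found {sum(flags)}"))
    return problems
```

The group-level check runs after all rows are read because lines for the same sequence and method are not required to be adjacent.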

Taxonomic information is retrieved from NCBI using the ENTREZ API with values passed in the sseqid field.
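When several accessions are given for one assignment, the taxon is set to their first common ancestor. The sketch below illustrates that idea on root-to-leaf lineages; it is not metaXplor's implementation, `first_common_ancestor` is a hypothetical name, and the lineages in the usage note are simplified illustrations rather than data fetched from NCBI.

```python
def first_common_ancestor(lineages):
    """Return the deepest taxon shared by all root-to-leaf lineages, or None."""
    if not lineages:
        return None
    ancestor = None
    # Walk down all lineages in parallel and stop at the first disagreement.
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        ancestor = ranks[0]
    return ancestor
```

For example, the lineages `["root", "Bacteria", "Proteobacteria", "Gammaproteobacteria"]` and `["root", "Bacteria", "Firmicutes"]` share `"Bacteria"` as their first common ancestor.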

Permission rules

Read-access to projects is granted according to the following rules:

| User | Public database, public project | Public database, private project | Private database, public project | Private database, private project |
|---|---|---|---|---|
| Anonymous user | Y | N | N | N |
| Authenticated user without specific permissions | Y | N | Y | N |
| Authenticated user with specific permissions | Y | Y | Y | Y |
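The read-access rules can be condensed into a single function. This is a sketch that reproduces the table, not the application's actual access-control code; the function and parameter names are hypothetical.

```python
def can_read(authenticated, has_permission, database_public, project_public):
    """Return True if the user may read the project, following the table above."""
    # Authenticated users holding specific permissions can read everything.
    if authenticated and has_permission:
        return True
    # Without specific permissions, private projects are never readable.
    if not project_public:
        return False
    # Public projects are readable by any authenticated user, and by
    # anonymous users only when the database itself is public.
    return authenticated or database_public
```

An anonymous user therefore only sees public projects held in a public database, which is the first row of the table.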

Data exploration

All assigned sequences present in the system are searchable via the exploration interface, which allows working simultaneously on any combination of projects from the selected database. Color codes are applied to sample-level, sequence-level and assignment-level fields for quick identification. This versatile interface provides means to combine filters on any of the fields added via project imports. Various kinds of advanced filtering widgets are therefore offered, depending on the field’s data type.

Search results can be browsed in four different ways. The default display is a sortable table with selectable fields supporting pagination, which can be configured to group results at the sample, sequence or assignment level. Table rows are clickable and lead to a dialog box with all the information related to the selected record. The other three displays, all interactive, allow browsing search results as a taxonomic tree, a Krona pie chart, and a zoomable geographic map showing sample collection locations.

When working on projects that contain multiple assignments per sequence along with multiple assignment methods, only one method is taken into account when building taxonomy trees or pies, for consistency reasons. A method must therefore be selected to proceed, which is why the assignment filter (and the best-hit filter where applicable) is active by default in such cases. The user can still disable these filters, but should then expect result counts not to match between the table view and the taxonomy views.

Data exports

Once a dataset of interest has been selected, it may be downloaded in the same formats as supported for imports: a FASTA sequence file, and tab-delimited text files providing sample metadata, sequence composition or assignment information. Data may also be exported in the popular BIOM format, thus allowing easy manipulation of exported data in a variety of visualization or analysis tools such as Phinch or Calypso. Because this format enforces a precise and limited set of taxonomy ranks, sequence metadata are enriched with a full_taxonomy field that may include ranks beyond those defined in the BIOM format, e.g., several ranks associated with virus classification.

Exports are automatically compressed into zip archives and may be either sent to the client computer for direct download, or temporarily materialized as physical files on the web server. In the latter case, a download URL is provided, making it easy to share with collaborators or feed into external systems. Indeed, next to the export button, a "sharing" icon provides means to configure "online output tools" that metaXplor will be able to push exported data to. As an example, this feature is compatible with Galaxy data sources and thus allows transferring any exported file into a Galaxy history with a single button click. The metaXplor instance administrator may configure up to 5 default output tools, and each user may define their own custom output tool.

Phylogenetic assignment

A "Phylogenetic assignment" link is available from the main menu. It leads to a page from which this pipeline can be run on user-provided sequences only. In order to run it on sequences identified in metaXplor, you must first export a FASTA file to the web server (see the Data exports section above).

Having either provided their own FASTA file or created one by exporting metaXplor data to the web server, the user is then invited to select a reference package among those available with the system (de novo generated or obtained from paprica), or to upload a custom reference package archive. A nucleotide sequence alignment is first performed using mafft v7.313 before pplacer v1.1.alpha19 proceeds with positioning the submitted sequences onto the existing reference tree. Then, guppy v1.1.alpha19 is used for sequence classification (classify option) and to generate an XML version of the pplacer tree (fat option). Last, Archeopteryx.js is invoked to display the results interactively for the end-user to investigate. After classification is performed, users with write permissions on any involved project can save newly found assignments to the database, thus enriching its contents for the benefit of all users.
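The order of the steps described above (mafft, then pplacer, then guppy classify and guppy fat) can be sketched as a list of command lines. This is only an illustration of a typical manual run: the exact options metaXplor passes to each tool are not documented here, so the flags, file names and the `placement_commands` helper below are assumptions.

```python
def placement_commands(query_fasta, refpkg, ref_alignment, workdir="."):
    """Build the command lines for one placement run (assumed flags, hypothetical paths)."""
    combined = f"{workdir}/combined.fasta"
    jplace = f"{workdir}/combined.jplace"
    return [
        # 1. Add the query sequences to the reference alignment.
        #    mafft writes to stdout: the caller must redirect it into `combined`.
        ["mafft", "--add", query_fasta, "--keeplength", ref_alignment],
        # 2. Place the aligned queries onto the reference tree.
        ["pplacer", "-c", refpkg, "-o", jplace, combined],
        # 3. Classify the placed sequences.
        ["guppy", "classify", "-c", refpkg, jplace],
        # 4. Produce an XML tree that a viewer such as Archeopteryx.js can display.
        ["guppy", "fat", "-o", f"{workdir}/tree.xml", jplace],
    ]
```

Each entry can be passed to `subprocess.run`, with the standard output of the first command redirected to the combined alignment file.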

BLASTing external sequences against metaXplor contents

Another section in the application provides means to search for similarities between an external set of sequences and those present in the system, the latter being used as a reference bank. Available tools are BLAST v2.6.0 and Diamond v2.0.4. Job results consist of a standard BLAST output file per selected target project, which may be investigated online in an interactive manner thanks to the BlasterJS library. Matching sequences may also be downloaded in FASTA format for further analyses (e.g. alignment, viral genome reconstruction).

Several BLAST types are supported: BLASTx (comparison of a DNA query sequence, after having translated this DNA query sequence into the 6 possible frames, with a protein sequence database) with Diamond as a faster alternative, BLASTp (comparison of a protein query sequence with a protein sequence database) with Diamond as a faster alternative, BLASTn (comparison of a DNA query sequence to a DNA sequence database), tBLASTn (comparison of a protein query with a DNA database, in the 6 possible frames of the database), tBLASTx (comparison of the six-frame translations of a nucleotide query sequence with the six-frame translations of a nucleotide sequence database). This functionality was designed to provide means to quickly check whether newly obtained, locally held sequences share similarity with material already stored in previous projects.
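The correspondence between sequence types and BLAST programs listed above can be summarised as a lookup table. The sketch below only restates that paragraph; `blast_programs` is a hypothetical helper, and `diamond blastx` / `diamond blastp` refer to the Diamond subcommands offered as faster alternatives.

```python
# (query type, database type) -> applicable programs, as described above
PROGRAMS = {
    ("nucleotide", "protein"): ["blastx", "diamond blastx"],
    ("protein", "protein"): ["blastp", "diamond blastp"],
    ("nucleotide", "nucleotide"): ["blastn", "tblastx"],
    ("protein", "nucleotide"): ["tblastn"],
}


def blast_programs(query_type, database_type):
    """Return the programs that can compare the given query and database types."""
    try:
        return PROGRAMS[(query_type, database_type)]
    except KeyError:
        raise ValueError(
            f"unsupported combination: {query_type!r} query, {database_type!r} database"
        ) from None
```

For a nucleotide query against a nucleotide database, blastn compares the sequences directly whereas tblastx compares their six-frame translations.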

Configuration properties

The WEB-INF/classes/config.properties file may be used to set values for the following parameters:

NB: the webapp must be restarted for any change in config.properties to be taken into account.