HELP



The website includes a submission page with a form to be filled in with contact information, the details required to run the analysis and the files to be uploaded. The user has to provide the presumptive (or known) Mendelian disorder associated with the sample, selected from a fixed vocabulary implementing the MEDIC hierarchical disease ontology, together with the mode of inheritance and the platform used for exome target enrichment. Multiple samples can be submitted at once if they correspond to related individuals. Each sample has to be uploaded as a pair of sequence files in FastQ format.



NEW ANALYSIS

Analysis name

The user has to provide a unique name; if an analysis with the same name already exists, a warning will ask the user to change it. Since this name will be used to create the results folder, please use only letters, numbers and "_". Special characters should be avoided (e.g. spaces, /, \, |, :, ', ", etc.).
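
The naming rule above can be checked before submission with a short script. This is only a minimal sketch: the function name, the regular expression and the set of existing names are illustrative assumptions, not the check actually performed by the website.

```python
import re
from typing import Iterable

# Assumed rule, derived from the text above: letters, digits and "_" only,
# with at least one character.
_NAME_PATTERN = re.compile(r"[A-Za-z0-9_]+")


def is_valid_analysis_name(name: str, existing_names: Iterable[str] = ()) -> bool:
    """Return True if the name uses only allowed characters and is not already taken."""
    if _NAME_PATTERN.fullmatch(name) is None:
        return False
    return name not in set(existing_names)
```

For example, "family_01" would be accepted, while "my analysis" (space) or "trio/2" (slash) would be rejected.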

Family

This flag indicates that the user has multiple samples from individuals of the same family. In this case, the number of samples has to be indicated.

Disease

Controlled list of Mendelian disorders and their hierarchical relationships, extracted from the MEDIC disease vocabulary (http://www.ncbi.nlm.nih.gov/pubmed/22434833). You can search the list by starting to type a keyword or an OMIM ID (http://www.ncbi.nlm.nih.gov/omim/) and then selecting the most appropriate definition for the patient phenotype. Please note that the disease hierarchy is extracted from CTD MEDIC and includes only the child terms of MeSH ID D009358, "Congenital, Hereditary, and Neonatal Diseases and Abnormalities". If you are interested in a disease not present in our list (e.g. diseases where a genetic etiology is only suspected, unproven or recently discovered), please contact us.

If the diagnosis is unclear or ambiguous, please choose the best current diagnosis definition; it can later be edited to a more specific description. The typical case is an unclear initial diagnosis that gets clarified after the analysis: the user initially chooses a more general disease definition and changes it to a more specific one after receiving the analysis results.

Confirm Disease Association

Please check this flag only if the diagnosis of the patient(s) is validated, that is, after the mutations found have been validated or after the diagnosis has been proven by biochemical or other diagnostic tests. The diagnosis can also be confirmed after the analysis has been performed.

Mode of Inheritance

If known, please indicate the mode of inheritance of the pathological phenotype.

Target Enrichment

Please indicate the platform used for exome target enrichment. The corresponding coordinates will be used to generate target coverage statistics. Please contact us if the coordinates of your target are not present.




ANALYSIS STATUS AND REPORTS

Queued -> QC and Trimming -> Alignment and pre-processing -> Statistics generation -> Variant Calling and Annotation -> Completed


Queued

This is the initial status of every analysis. It indicates that the analysis has been created but is still queued and has not started yet.

QC and Trimming

Quality control (QC) assessment is the first step in the analysis. The program FastQC generates a quality report for each FastQ file (two for every sample). Reads are then trimmed to remove the Illumina adapter sequence and low-quality ends (with a quality score threshold of 20) using Trim Galore and cutadapt; a FastQC report is also generated for the trimmed sequences. For each sample, both FastQC reports, before and after trimming, will be available in the Results page.

Trim Galore! http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/
FastQC http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/
cutadapt http://code.google.com/p/cutadapt/
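
To illustrate what trimming low-quality ends with a threshold of 20 means, the sketch below applies a running-sum rule of the kind used by BWA and cutadapt to the 3' end of a single read. It is a simplified illustration, not the code of either tool: the function names are hypothetical and Phred+33 quality encoding is assumed.

```python
def quality_trim_index(qualities, cutoff=20):
    """Return the position at which to cut the 3' end of a read.

    Walks from the 3' end, summing (quality - cutoff), and keeps the
    position where the running sum is lowest. Stops as soon as the sum
    becomes positive, i.e. when good-quality bases dominate.
    """
    running_sum = 0
    lowest_sum = 0
    cut_at = len(qualities)
    for i in range(len(qualities) - 1, -1, -1):
        running_sum += qualities[i] - cutoff
        if running_sum > 0:
            break
        if running_sum < lowest_sum:
            lowest_sum = running_sum
            cut_at = i
    return cut_at


def trim_read(sequence, quality_string, cutoff=20, phred_offset=33):
    """Trim a read and its quality string (Phred+33 encoding is an assumption)."""
    qualities = [ord(ch) - phred_offset for ch in quality_string]
    cut_at = quality_trim_index(qualities, cutoff)
    return sequence[:cut_at], quality_string[:cut_at]
```

With this rule, a read whose last two bases have quality 10 and whose other bases have quality 30 loses only those two bases.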

Alignment and pre-processing

Paired sequencing reads are aligned to the reference genome (UCSC, hg19 build) using BWA. Post-alignment processing, including SAM conversion, sorting and duplicate removal, is performed using Picard and SAMtools. The Genome Analysis Toolkit (GATK) is then used to prepare the raw alignment for variant calling, with local realignment around small insertions and deletions (INDELs) and Base Quality Score Recalibration. The local realignment around INDELs is an important step: it finds a consensus alignment among all the reads spanning a deletion or an insertion, both to improve the sensitivity and accuracy of INDEL detection and to reduce false SNV calls due to misalignment of the flanking bases. Base Quality Score Recalibration is a procedure through which the raw quality scores provided by the instrument are recalibrated according to an empirical error model derived from the sequences. For each sample, a processed alignment file in BAM format will be available in the Results page.

BWA http://bio-bwa.sourceforge.net/
Picard http://picard.sourceforge.net/
SAMtools http://samtools.sourceforge.net/
GATK http://www.broadinstitute.org/gatk/
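
The idea behind an empirical error model can be shown with a small calculation: for a group of bases sharing the same reported quality, the observed mismatch rate is converted back to a Phred-scaled score. This is only a conceptual sketch, not the GATK procedure; the function name and the +1/+2 smoothing are assumptions made for the example.

```python
import math


def empirical_quality(mismatches, observations):
    """Phred-scaled quality estimated from observed mismatches.

    The +1 / +2 smoothing is an assumption of this sketch; it avoids
    taking the logarithm of zero when no mismatch is observed.
    """
    error_rate = (mismatches + 1) / (observations + 2)
    return -10 * math.log10(error_rate)
```

If bases reported with quality 30 show a mismatch rate closer to 1 in 100 than to 1 in 1000, their empirical quality is close to 20, and a recalibration step would lower the reported score accordingly.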

Statistics generation

The alignment module is followed by a small module that computes the read summary, target enrichment and target coverage statistics with SAMtools and BEDTools. The summary statistics will be available in the Results page.

SAMtools http://samtools.sourceforge.net/
BEDTools http://bedtools.readthedocs.org/en/latest/
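
As an illustration of target coverage statistics, the sketch below summarises a list of per-base read depths over the target regions. The function name, the input format and the depth thresholds are assumptions chosen for the example; the statistics reported in the Results page may differ.

```python
def coverage_summary(depths, thresholds=(1, 10, 20)):
    """Return the mean depth and the fraction of target bases at or above each threshold.

    `depths` holds one read depth per target base, including bases with
    zero coverage. The thresholds are illustrative values.
    """
    depths = list(depths)
    if not depths:
        raise ValueError("no target bases provided")
    total_bases = len(depths)
    mean_depth = sum(depths) / total_bases
    fractions = {
        t: sum(1 for d in depths if d >= t) / total_bases
        for t in thresholds
    }
    return mean_depth, fractions
```

Note that bases with zero coverage have to be included in the input, otherwise the fractions are overestimated.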

Variant Calling and Annotation

The identification of Single Nucleotide Variants (SNVs) and INDELs is performed separately using GATK UnifiedGenotyper, followed by Variant Quality Score Recalibration when applicable. The SNV and INDEL calls are then merged and annotated using ANNOVAR to add the following information: the position in genes and the amino acid change relative to the RefSeq gene model, presence in dbSNP and OMIM, frequency in the NHLBI Exome Variant Server and the 1000 Genomes Project, prediction of the potential damaging effect on protein activity with different algorithms, and evolutionary conservation scores. The annotated results are then imported into the variation database. The complete list of variants in VCF format will be available in the Results page.

GATK http://www.broadinstitute.org/gatk/
ANNOVAR http://www.openbioinformatics.org/annovar/

Completed

After the analysis is completed, a variation report is generated and the Results page can be reached through a link in the Analysis list. The Results page contains links to the output of each analysis module and an additional variation report in Excel format. The final variation report includes all the variations found in the samples, accompanied by the available annotations and by the allele frequencies stratified by disease group, using the MEDIC hierarchical disease vocabulary. The allele frequencies are calculated for every disease with at least 3 associated individuals. If the number of samples is not sufficient at a specific level, frequencies may only be calculated by grouping samples at higher levels of the disease hierarchy. The variation reports of all archived analyses are periodically refreshed to update the allele frequencies as new analyses are added to the database.

MEDIC http://ctdbase.org/help/diseaseDetailHelp.jsp
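
The grouping of samples along the disease hierarchy can be sketched as follows for a single variant. This is a simplified illustration under stated assumptions, not the implementation used by the service: the function name and the toy term names are hypothetical, each term is assumed to have a single parent, genotypes are assumed diploid, and each individual is counted for its own disease term and for all of its ancestor terms.

```python
from collections import defaultdict

MIN_INDIVIDUALS = 3  # minimum number of individuals per disease, as stated above


def allele_frequencies_by_disease(parent_of, genotypes):
    """Compute the allele frequency of one variant for each disease term.

    parent_of: dict mapping a term to its parent term (None for the root).
    genotypes: list of (disease_term, alt_allele_count) pairs, one per
               individual, with alt_allele_count in {0, 1, 2}.
    Only terms with at least MIN_INDIVIDUALS individuals are reported.
    """
    individuals = defaultdict(int)
    alt_alleles = defaultdict(int)
    for term, alt_count in genotypes:
        node = term
        while node is not None:  # count the individual at every level up to the root
            individuals[node] += 1
            alt_alleles[node] += alt_count
            node = parent_of.get(node)
    return {
        term: alt_alleles[term] / (2 * count)
        for term, count in individuals.items()
        if count >= MIN_INDIVIDUALS
    }
```

With two individuals in each of two sibling diseases, neither disease reaches the minimum of 3, so a frequency is reported only for their common parent term, which groups all four individuals.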

Example Variation Report

Example Report

Legend Variation Report