The Bioinformatics team at IGM is responsible for processing Next Generation Sequencing data generated internally at IGM as well as through external collaborations and partnerships. We apply standardized best-practices pipelines to over 1,000 exomes a month. All exome and genome samples are processed using a consistent alignment and variant calling pipeline, consisting of primary alignment using bwa, duplicate removal using Picard tools, indel realignment and variant calling following the best practices outlined by GATK, and variant annotation using snpEff with Ensembl annotations. The resulting calls and their underlying quality statistics are stored in AnnoDB, a database of variants, variant calls and metadata that powers ATAV analyses. With calls stored in a database, we can set and easily adjust stringent call thresholds in any analysis we perform. This ensures that alignment and variant call data are obtained in exactly the same way for every sample, and that all variant calls in the samples involved in an analysis are of comparable quality.
Data Analysis using AnnoDB/ATAV:
AnnoDB Description: Following annotation of variants, we import all variants from the individual’s VCF file into an in-house MySQL database called AnnoDB. AnnoDB is our central repository for all called variants. It is implemented as a fully normalized MySQL schema that allows rapid, complex querying. It houses sample-level variant calls, coverage depths at variant and invariant sites, and sample-associated metadata. AnnoDB also hosts information from external datasets, including the human reference genome, population allele frequencies from ExAC and EVS, CADD scores, and variant- and gene-level annotations from HGMD, OMIM and ClinVar.
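To illustrate why a normalized layout makes call thresholds easy to adjust, here is a minimal sketch. It is not the actual AnnoDB schema: the table names (sample, variant, variant_call), the column names and the demo rows are all hypothetical, and Python's built-in sqlite3 module stands in for MySQL so that the example runs without a server.

```python
import sqlite3

# Hypothetical three-table layout: samples, variants, and per-sample calls.
# This is an illustration only, not the real AnnoDB schema.
SCHEMA = """
CREATE TABLE sample (
    sample_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    phenotype TEXT
);
CREATE TABLE variant (
    variant_id INTEGER PRIMARY KEY,
    chrom      TEXT NOT NULL,
    pos        INTEGER NOT NULL,
    ref_allele TEXT NOT NULL,
    alt_allele TEXT NOT NULL,
    UNIQUE (chrom, pos, ref_allele, alt_allele)
);
CREATE TABLE variant_call (
    sample_id        INTEGER NOT NULL REFERENCES sample(sample_id),
    variant_id       INTEGER NOT NULL REFERENCES variant(variant_id),
    genotype         TEXT NOT NULL,
    depth            INTEGER NOT NULL,
    genotype_quality INTEGER NOT NULL,
    PRIMARY KEY (sample_id, variant_id)
);
"""


def build_demo_db():
    """Create an in-memory database with two made-up samples and one variant."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO sample VALUES (?, ?, ?)",
        [(1, "S1", "case"), (2, "S2", "control")],
    )
    conn.execute("INSERT INTO variant VALUES (1, '1', 12345, 'A', 'G')")
    conn.executemany(
        "INSERT INTO variant_call VALUES (?, ?, ?, ?, ?)",
        [(1, 1, "0/1", 42, 99), (2, 1, "0/1", 6, 20)],
    )
    return conn


def calls_passing(conn, min_depth, min_gq):
    """Return calls that meet the given depth and genotype-quality thresholds."""
    query = """
        SELECT s.name, v.chrom, v.pos, v.ref_allele, v.alt_allele, c.genotype
        FROM variant_call AS c
        JOIN sample  AS s ON s.sample_id  = c.sample_id
        JOIN variant AS v ON v.variant_id = c.variant_id
        WHERE c.depth >= ? AND c.genotype_quality >= ?
        ORDER BY s.name
    """
    return conn.execute(query, (min_depth, min_gq)).fetchall()
```

Because the thresholds are query parameters rather than properties of the stored data, the same calls can be re-filtered at a different stringency, for example calls_passing(conn, 10, 30) versus calls_passing(conn, 3, 20), without reprocessing any sample.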
Single-sample VCF files are parsed by an in-house AnnoDB pipeline. Variants, variant calls and associated quality data are extracted from the VCF files. Coverage information for all non-variant sites is recovered from the BAM files, binned, compressed and imported into AnnoDB. To date, AnnoDB houses over 16 billion called variants at over 125 million positions in the genome.
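The binning and compression of coverage can be pictured with a small sketch. The bin boundaries, bin labels and run-length encoding below are assumptions chosen for illustration; the source does not specify the scheme the AnnoDB pipeline actually uses.

```python
from itertools import groupby

# Hypothetical depth bins as (minimum depth, label), checked from highest to lowest.
BINS = [(200, "g"), (30, "e"), (20, "d"), (10, "c"), (3, "b"), (0, "a")]


def depth_to_bin(depth):
    """Map a per-base read depth to the label of the bin that contains it."""
    for minimum, label in BINS:
        if depth >= minimum:
            return label
    raise ValueError("depth must be non-negative")


def compress_coverage(depths):
    """Bin each depth, then run-length encode consecutive identical bins.

    Returns a list of (run_length, bin_label) tuples.
    """
    labels = (depth_to_bin(d) for d in depths)
    return [(sum(1 for _ in run), label) for label, run in groupby(labels)]


def expand_coverage(runs):
    """Invert compress_coverage back to one bin label per base."""
    labels = []
    for length, label in runs:
        labels.extend([label] * length)
    return labels
```

Binning discards exact depths but keeps the information an analysis typically needs (whether a site was adequately covered), and long stretches of similar coverage collapse into a single run, which is what makes storing invariant sites practical.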
Analysis Tool for Annotated Variants (ATAV): Both single-sample and cohort-level analysis queries are implemented and executed through our analytics engine, the Analysis Tool for Annotated Variants (ATAV). This command-line tool allows predefined user queries to be executed on the database with defined parameter settings. ATAV is designed to detect complex disease-associated rare genetic variants by performing association analysis on annotated variants derived from whole-genome or whole-exome sequencing data stored in AnnoDB. The catalog of functions and features supported by ATAV is described in our wiki: http://redmine.igm.cumc.columbia.edu/projects/atav/wiki
ATAV allows a user to run case-control analyses by performing single-variant Fisher’s Exact Tests, rare-variant burden tests across genomic regions, and linear regression for quantitative traits. Family-based analyses are also supported, including trio analysis for identifying de novo mutations, detection of compound heterozygosity, and evaluation of parental mosaicism, among others. We have used ATAV to query, analyze and interpret several thousand exome sequencing datasets. This work has identified novel genes involved in epilepsy (PMID: 23934111, 25262651), ALS (25700176) and AHC (22842232), and has supported the diagnostic interpretation of over 400 familial trios (25590979, 27148561, 26138499).
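For readers unfamiliar with the single-variant test mentioned above, the following is a minimal standard-library sketch of a two-sided Fisher's Exact Test on a 2x2 table of carriers versus non-carriers in cases and controls. It is not ATAV's implementation; the function name and argument layout are hypothetical, and the two-sided p-value uses the common convention of summing all tables that are no more probable than the observed one.

```python
from math import comb


def fisher_exact_two_sided(case_carriers, case_noncarriers,
                           control_carriers, control_noncarriers):
    """Two-sided Fisher's Exact Test p-value for a 2x2 carrier table."""
    n_cases = case_carriers + case_noncarriers
    n_controls = control_carriers + control_noncarriers
    n_carriers = case_carriers + control_carriers
    total_tables = comb(n_cases + n_controls, n_carriers)

    def prob(x):
        # Hypergeometric probability that exactly x of the carriers are cases.
        return comb(n_cases, x) * comb(n_controls, n_carriers - x) / total_tables

    p_observed = prob(case_carriers)
    lowest = max(0, n_carriers - n_controls)
    highest = min(n_cases, n_carriers)

    # Sum every table at least as extreme as the observed one; the small
    # relative tolerance guards against floating-point ties.
    p_value = sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )
    return min(1.0, p_value)
```

As a usage example with made-up counts, fisher_exact_two_sided(5, 0, 0, 5) asks how surprising it would be for all five carriers to be cases when five cases and five controls were sequenced.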
Daniel S. T. Hughes
Director of Genome Informatics & Software Engineering
Macrina M. Lobo
Zhong (Nick) Ren
Big Data Software Developer
- Bioinformatics Programmer