giab ashg webinar 160224

Genome in a Bottle Consortium February 24, 2016 Reference Materials for Human Genome Sequencing Justin Zook, Ph.D and Marc Salit, Ph.D. National Institute of Standards and Technology

Upload: genomeinabottle

Post on 17-Jan-2017


TRANSCRIPT

Page 1: Giab ashg webinar 160224

Genome in a Bottle Consortium February 24, 2016

Reference Materials for Human Genome Sequencing

Justin Zook, Ph.D., and Marc Salit, Ph.D., National Institute of Standards and Technology

Page 2: Giab ashg webinar 160224

Outline

• Genome in a Bottle (GIAB) products

• Current and future work

• Best practices for using GIAB products to benchmark variant calls

• Genome in a Bottle
– Open consortium to develop well-characterized genomes for benchmarking
– 100-150 public, private, and academic participants at workshops

Page 3: Giab ashg webinar 160224

GIAB Scope

• The Genome in a Bottle Consortium is developing the reference materials, reference methods, and reference data needed to assess confidence in human whole genome variant calls.

• Priority is authoritative characterization of human genomes.

GIAB steering committee, Aug 2015

Page 4: Giab ashg webinar 160224

Well-characterized, stable RMs

• Obtain metrics for validation, QC, QA, PT

• Determine sources and types of bias/error

• Learn to resolve difficult structural variants

• Improve reference genome assembly

• Optimization

• Enable regulated applications

Page 5: Giab ashg webinar 160224

Analytical Performance

• Use well-characterized genomic DNA reference materials to benchmark performance

• Tools to facilitate their use
– With the Global Alliance Data Working Group Benchmarking Team

Generic measurement process: Sample → gDNA isolation → Library Prep → Sequencing → Alignment/Mapping → Variant Calling → Confidence Estimates → Downstream Analysis

Page 6: Giab ashg webinar 160224

High-confidence SNP/indel calls

• Methods to develop SNP/indel call set described in manuscript

• Broad and quick adoption of call set for benchmarking
– Struck a nerve

Zook et al., Nature Biotechnology, 2014.

Page 7: Giab ashg webinar 160224

Candidate NIST Reference Materials

Genome                 PGP ID     Coriell ID   NIST ID   NIST RM #
CEPH Mother/Daughter   N/A        GM12878      HG001     RM8398
AJ Son                 huAA53E0   GM24385      HG002     RM8391 (son)/RM8392 (trio)
AJ Father              hu6E4515   GM24149      HG003     RM8392 (trio)
AJ Mother              hu8E87A9   GM24143      HG004     RM8392 (trio)
Asian Son              hu91BD69   GM24631      HG005     RM8393
Asian Father           huCA017E   GM24694      N/A       N/A
Asian Mother           hu38168C   GM24695      N/A       N/A

Note: RMs 8391 to 8393 are planned for release by end of Q2 2016

Page 8: Giab ashg webinar 160224

Dataset   AJ Son   AJ Parents   Chinese son   Chinese parents   NA12878

Illumina Paired-end X X X X X
Illumina Long Mate pair X X X X X
Illumina “moleculo” X X X X X
Complete Genomics X X X X X
Complete Genomics LFR X X X
Ion exome X X X X
BioNano X X X X
10X X X X
PacBio X X X
SOLiD single end X X X
Illumina exome X X X X
Oxford Nanopore X

Page 9: Giab ashg webinar 160224

Paper describing the data…

Page 10: Giab ashg webinar 160224

Data Release: Real-time, Open, Public Release

Individual Datasets

• Uploaded to GIAB FTP site as data are collected

• Includes raw reads, aligned reads, and variant/reference calls

• 12 datasets described in bioRxiv paper

Integrated High-confidence Calls

• Develop SNP, indel, and homozygous reference calls similar to NA12878

• Developing methods to form high-confidence calls for difficult variant types and regions

• Released calls are versioned

• Preliminary call-sets will be made available to be critiqued

Page 11: Giab ashg webinar 160224

SNP/Indel Integration Method Update

• Implementing refined integration methods
– Developed so others can readily reproduce results
– Consistent results for all GIAB genomes
– Simpler process taking advantage of best practices for each technology

• Validating with released NA12878 RM data
– Preliminary comparisons show minor changes

• Application to PGP trios
– Plan to analyze AJ trio by Q2 2016
– Release of NIST RMs in Q2 2016
– Develop calls for GRCh38

Page 12: Giab ashg webinar 160224

Proposed approach to form high-confidence SV (and non-SV) calls

• Generate candidate calls (Aug/Dec 2015)

• Compare/evaluate calls using Parliament/MetaSV/svclassify/others?; manual inspection (Aug 2015-Jan 2016)

• Integrate new and revised calls; manual inspection (planning in Jan-Feb 2016)

• Combine integrated calls; manual inspection; targeted experimental validation? (Feb 2016 and beyond)

Page 13: Giab ashg webinar 160224

Preliminary comparisons of 17 Deletion Callsets

Sensitivity to calls in 2 technologies

NOTE: These are preliminary comparisons of data under active development and likely different from true sensitivity of callers

Page 14: Giab ashg webinar 160224

Preliminary comparisons of 17 Deletion Callsets

Difference between predicted size and median predicted size

NOTE: These are preliminary comparisons of data under active development and likely different from true size accuracy

Page 15: Giab ashg webinar 160224

Preliminary comparisons of 17 Deletion Callsets

Number of unique calls

NOTE: These are preliminary comparisons of data under active development without filtering, and unique calls may be correct

Page 16: Giab ashg webinar 160224

GeT-RM Browser from NCBI and CDC

• http://www.ncbi.nlm.nih.gov/variation/tools/get-rm/

• Allows visualization of the data underlying each call

Page 17: Giab ashg webinar 160224

Global Alliance for Genomics and Health Benchmarking Task Team

Progress:

• Initial version of standardized definitions for performance metrics like TP, FP, and FN.

• Continued development of sophisticated benchmarking tools
– vcfeval (Len Trigg)
– hap.py (Peter Krusche)
– vgraph (Kevin Jacobs)

• Standardized intermediate and final file formats

• Standardized bed files with difficult genome contexts for stratification

• github.com/ga4gh/benchmarking-tools
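
Stratification with bed files can be sketched as follows. This is a minimal illustration, not the GA4GH tooling: it assumes merged, non-overlapping BED intervals and variants reduced to (chromosome, 1-based position) pairs. The function names `load_bed`, `in_regions`, and `stratify`, and the stratum names in the example, are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def load_bed(lines):
    """Parse BED lines into {chrom: sorted list of (start, end)} (0-based, half-open)."""
    regions = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions[chrom].append((int(start), int(end)))
    for chrom in regions:
        regions[chrom].sort()
    return dict(regions)


def in_regions(regions, chrom, pos):
    """Return True if a 1-based position falls inside any interval.

    Assumption: intervals on each chromosome do not overlap, so only the
    last interval starting at or before the position needs to be checked.
    """
    intervals = regions.get(chrom, [])
    pos0 = pos - 1  # convert 1-based VCF position to 0-based BED coordinate
    i = bisect_right(intervals, (pos0, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos0 < intervals[i][1]


def stratify(variants, strata):
    """Count variants per stratum; a variant may fall in several strata."""
    counts = {name: 0 for name in strata}
    counts["other"] = 0
    for chrom, pos in variants:
        hit = False
        for name, regions in strata.items():
            if in_regions(regions, chrom, pos):
                counts[name] += 1
                hit = True
        if not hit:
            counts["other"] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical strata and false-negative positions, for illustration only.
    strata = {
        "low_complexity": load_bed(["chr1\t100\t200"]),
        "segdup": load_bed(["chr1\t150\t400"]),
    }
    false_negatives = [("chr1", 101), ("chr1", 160), ("chr1", 500), ("chr2", 10)]
    print(stratify(false_negatives, strata))
```

Counting TPs, FPs, and FNs separately per stratum in this way lets performance metrics be reported for each genome context instead of as one genome-wide number.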

Page 18: Giab ashg webinar 160224

Proposed Performance Metrics Definitions

• Define TP/FP/FN/TN in 4 ways depending on required stringency of match:

• Loose match: TP if within x-bp of a true variant

• Allele match: TP if ALT allele matches

• Genotype match: TP if genotype and ALT allele match

• Phasing match: TP if genotype, ALT allele, and phasing with nearby variants all match

• True negatives are difficult to define because an infinite number of potential alleles exist
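
The first three stringency levels can be illustrated with a small sketch. It is a deliberately naive comparison: it assumes both call sets use identical variant representations, which real tools such as vcfeval and hap.py do not require. It leaves out phasing match because that needs haplotype information. The `Call` class, the function names, the 10-bp window, and the example calls are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    chrom: str
    pos: int          # 1-based position
    ref: str
    alt: str
    genotype: tuple   # allele indices, e.g. (0, 1) for a heterozygous call


def loose_match(query, truth, window=10):
    """Loose match: the query call lies within `window` bp of the true variant."""
    return query.chrom == truth.chrom and abs(query.pos - truth.pos) <= window


def allele_match(query, truth):
    """Allele match: same site and same ALT allele (naive exact comparison)."""
    return (query.chrom, query.pos, query.ref, query.alt) == (
        truth.chrom, truth.pos, truth.ref, truth.alt)


def genotype_match(query, truth):
    """Genotype match: allele match plus the same unphased genotype."""
    return allele_match(query, truth) and sorted(query.genotype) == sorted(truth.genotype)


def count_tp_fp_fn(query_calls, truth_calls, match):
    """Count TP and FN over truth calls and FP over query calls for one match rule."""
    tp = sum(1 for t in truth_calls if any(match(q, t) for q in query_calls))
    fn = len(truth_calls) - tp
    fp = sum(1 for q in query_calls if not any(match(q, t) for t in truth_calls))
    return tp, fp, fn


if __name__ == "__main__":
    truth = [Call("1", 100, "A", "G", (0, 1)), Call("1", 200, "C", "T", (1, 1))]
    query = [Call("1", 100, "A", "G", (1, 1)),
             Call("1", 203, "C", "A", (0, 1)),
             Call("1", 500, "G", "C", (0, 1))]
    for rule in (loose_match, allele_match, genotype_match):
        print(rule.__name__, count_tp_fp_fn(query, truth, rule))
```

The same query set scores differently under each rule, which is why a reported TP, FP, or FN count is only interpretable together with the match stringency used.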

Page 19: Giab ashg webinar 160224

Approaches to Benchmarking Variant Calling

• Well-characterized whole genome Reference Materials

• Many samples characterized in clinically relevant regions

• Synthetic DNA spike-ins

• Cell lines with engineered mutations

• Simulated reads

• Modified real reads

• Modified reference genomes

• Confirming results found in real samples over time

Page 20: Giab ashg webinar 160224

Challenges in Benchmarking Small Variant Calling

• It is difficult to do robust benchmarking of tests designed to detect many analytes (e.g., many variants)

• Easiest to benchmark only within high-confidence bed file, but…

• Benchmark calls/regions tend to be biased towards easier variants and regions
– Some clinical tests are enriched for difficult sites

• Challenges with benchmarking complex variants near boundaries of high-confidence regions

• Always manually inspect a subset of FPs/FNs

• Stratification by variant type and region is important

• Always calculate confidence intervals on performance metrics
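
As an illustration of the last point, the sketch below computes sensitivity and precision with confidence intervals. It assumes the Wilson score interval, one common choice for a binomial proportion; the slides do not specify which interval to use. The function names and the example counts are hypothetical.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def sensitivity_with_ci(tp, fn):
    """Sensitivity = TP / (TP + FN), with its Wilson interval."""
    return tp / (tp + fn), wilson_interval(tp, tp + fn)


def precision_with_ci(tp, fp):
    """Precision = TP / (TP + FP), with its Wilson interval."""
    return tp / (tp + fp), wilson_interval(tp, tp + fp)


if __name__ == "__main__":
    # Hypothetical counts from a small targeted panel.
    print(sensitivity_with_ci(tp=99, fn=1))
    print(precision_with_ci(tp=99, fp=3))
```

With only about 100 benchmark variants, an observed sensitivity of 99% is still compatible with a noticeably lower true value, which is why the interval matters most for small panels and for small strata.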

Page 21: Giab ashg webinar 160224

Benchmarking on PrecisionFDA

Page 22: Giab ashg webinar 160224

Acknowledgments

• FDA

• Many members of Genome in a Bottle
– New members welcome!
– Sign up on website for email newsletters

GIAB Steering Committee
– Marc Salit
– Justin Zook
– David Mittelman
– Andrew Grupe
– Michael Eberle
– Steve Sherry
– Deanna Church
– Francisco De La Vega
– Christian Olsen
– Monica Basehore
– Lisa Kalman
– Christopher Mason
– Elizabeth Mansfield
– Liz Kerrigan
– Leming Shi
– Melvin Limson
– Alexander Wait Zaranek
– Nils Homer
– Fiona Hyland
– Steve Lincoln
– Don Baldwin
– Robyn Temple-Smolkin
– Chunlin Xiao
– Kara Norman
– Luke Hickey

Page 23: Giab ashg webinar 160224

For More Information

www.genomeinabottle.org - sign up for general GIAB and Analysis Team Google group emails

github.com/genome-in-a-bottle – Guide to GIAB data & ftp

www.slideshare.net/genomeinabottle

www.ncbi.nlm.nih.gov/variation/tools/get-rm/ - GeT-RM Browser

Data: http://biorxiv.org/content/early/2015/09/15/026468

Global Alliance Benchmarking Team – https://github.com/ga4gh/benchmarking-tools

Twice yearly public workshops
– Winter at Stanford University, California, USA
– Summer at NIST, Maryland, USA

Justin Zook: [email protected]

Marc Salit: [email protected]