giab ashg webinar 160224
Post on 17-Jan-2017
Genome in a Bottle Consortium August 2015 NIST, Gaithersburg, MD
Genome in a Bottle Consortium February 24, 2016
Reference Materials for Human Genome Sequencing
Justin Zook, Ph.D. and Marc Salit, Ph.D. National Institute of Standards and Technology
genomeinabottle.org
Outline
  Genome in a Bottle (GIAB) products
  Current and future work
  Best practices for using GIAB products to benchmark variant calls
Genome in a Bottle
  Open consortium to develop well-characterized genomes for benchmarking
  100-150 public, private, and academic participants at workshops
GIAB Scope
  The Genome in a Bottle Consortium is developing the reference materials, reference methods, and reference data needed to assess confidence in human whole genome variant calls. Priority is authoritative characterization of human genomes.
  GIAB steering committee, Aug 2015
  NOT ctDNA, analytical controls for panels
Well-characterized, stable RMs
  Obtain metrics for validation, QC, QA, PT
  Determine sources and types of bias/error
  Learn to resolve difficult structural variants
  Improve reference genome assembly
  Optimization
  Enable regulated applications
Analytical Performance
  Use well-characterized genomic DNA reference materials to benchmark performance
  Tools to facilitate their use
    With the Global Alliance Data Working Group Benchmarking Team
generic measurement process
High-confidence SNP/indel calls
  Methods to develop SNP/indel call set described in manuscript
  Broad and quick adoption of call set for benchmarking
    struck nerve
  Zook et al., Nature Biotechnology, 2014.
Candidate NIST Reference Materials
Genome | PGP ID | Coriell ID | NIST ID | NIST RM #
CEPH Mother/Daughter | N/A | GM12878 | HG001 | RM8398
AJ Son | huAA53E0 | GM24385 | HG002 | RM8391 (son)/RM8392 (trio)
AJ Father | hu6E4515 | GM24149 | HG003 | RM8392 (trio)
AJ Mother | hu8E87A9 | GM24143 | HG004 | RM8392 (trio)
Asian Son | hu91BD69 | GM24631 | HG005 | RM8393
Asian Father | huCA017E | GM24694 | N/A | N/A
Asian Mother | hu38168C | GM24695 | N/A | N/A
Note: RMs 8391 to 8393 are planned for release by end of Q2 2016
Dataset (columns: AJ Son, AJ Parents, Chinese son, Chinese parents, NA12878)
Illumina Paired-end: XXXXX
Illumina Long Mate pair: XXXXX
Illumina moleculo: XXXXX
Complete Genomics: XXXXX
Complete Genomics LFR: XXX
Ion exome: XXXX
BioNano: XXXX
10X: XXXX
PacBio: XXX
SOLiD single end: XXX
Illumina exome: XXXX
Oxford Nanopore: X
Paper describing the data
Data Release: Real-time, Open, Public Release
  Individual Datasets
    Uploaded to GIAB FTP site as data are collected
    Includes raw reads, aligned reads, and variant/reference calls
    12 datasets described in bioRxiv paper
  Integrated High-confidence Calls
    Develop SNP, indel, and homozygous reference calls similar to NA12878
    Developing methods to form high-confidence calls for difficult variant types and regions
    Released calls are versioned
    Preliminary call-sets will be made available to be critiqued
SNP/Indel Integration Method Update
  Implementing refined integration methods
    Developed so others can readily reproduce results
    Consistent results for all GIAB genomes
    Simpler process taking advantage of best practices for each technology
  Validating with released NA12878 RM data
    Preliminary comparisons show minor changes
  Application to PGP trios
    Plan to analyze AJ trio by Q2 2016
    Release of NIST RMs in Q2 2016
  Develop calls for GRCh38
Proposed approach to form high-confidence SV (and non-SV) calls
  Aug/Dec 2015
  Aug 2015-Jan 2016
  Planning in Jan-Feb 2016
  Feb 2016 and beyond
Preliminary comparisons of 17 Deletion Callsets
Sensitivity to calls in 2 technologies
NOTE: These are preliminary comparisons of data under active development and likely different from true sensitivity of callers
Preliminary comparisons of 17 Deletion Callsets
Difference between predicted size and median predicted size
NOTE: These are preliminary comparisons of data under active development and likely different from true size accuracy
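The size comparison named on this slide can be illustrated for a single deletion event. This is a minimal sketch, assuming each callset reports one predicted size in bp for the same event; the function name, callset names, and sizes are hypothetical, and the slide does not say how the differences were aggregated across events.

```python
from statistics import median


def size_differences(predicted_sizes):
    """Map each callset to (its predicted size - median predicted size), in bp.

    predicted_sizes: dict of callset name -> predicted deletion size in bp,
    all for the same event (a hypothetical input layout).
    """
    med = median(predicted_sizes.values())
    return {name: size - med for name, size in predicted_sizes.items()}


# Made-up sizes for one deletion seen by three hypothetical callsets.
example = {"callset_A": 1000, "callset_B": 1010, "callset_C": 990}
print(size_differences(example))
```

Using the median rather than the mean keeps one callset with a badly wrong size from shifting the reference point for all the others.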
Preliminary comparisons of 17 Deletion Callsets
Number of unique calls
NOTE: These are preliminary comparisons of data under active development without filtering and unique calls may be correct
GeT-RM Browser from NCBI and CDC
http://www.ncbi.nlm.nih.gov/variation/tools/get-rm/
Allows visualization of the data underlying each call
Global Alliance for Genomics and Health Benchmarking Task Team
Progress:
  Initial version of standardized definitions for performance metrics like TP, FP, and FN.
  Continued development of sophisticated benchmarking tools
    vcfeval - Len Trigg
    hap.py - Peter Krusche
    vgraph - Kevin Jacobs
  Standardized intermediate and final file formats
  Standardized bed files with difficult genome contexts for stratification
github.com/ga4gh/benchmarking-tools
Proposed Performance Metrics Definitions
Define TP/FP/FN/TN in 4 ways depending on required stringency of match:
  Loose match: TP if within x-bp of a true variant
  Allele match: TP if ALT allele matches
  Genotype match: TP if genotype and ALT allele match
  Phasing match: TP if genotype, ALT allele, and phasing with nearby variants all match
True negatives are difficult to define because an infinite number of potential alleles exist
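The first three match levels can be illustrated with a small sketch. It assumes both call sets use identical, normalized variant representations, which real comparison tools such as vcfeval and hap.py do not require. The `Variant` class, `is_true_positive` function, and the 10-bp default window are hypothetical choices for illustration, and the phasing level is omitted.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Variant:
    chrom: str
    pos: int                   # 1-based position, as in VCF
    ref: str
    alt: str
    genotype: Tuple[int, int]  # allele indices, e.g. (0, 1) for heterozygous


def is_true_positive(truth: Variant, query: Variant, level: str, window: int = 10) -> bool:
    """Decide whether `query` counts as a TP against `truth` at the given stringency."""
    if truth.chrom != query.chrom:
        return False
    if level == "loose":
        # Loose match: the call lies within `window` bp of the true variant.
        return abs(truth.pos - query.pos) <= window
    same_allele = (truth.pos, truth.ref, truth.alt) == (query.pos, query.ref, query.alt)
    if level == "allele":
        return same_allele
    if level == "genotype":
        # Unphased comparison: (0, 1) and (1, 0) are treated as the same genotype.
        return same_allele and sorted(truth.genotype) == sorted(query.genotype)
    raise ValueError(f"unknown match level: {level}")


# A heterozygous true variant called as homozygous ALT.
truth = Variant("chr1", 1000, "A", "G", (0, 1))
query = Variant("chr1", 1000, "A", "G", (1, 1))
for level in ("loose", "allele", "genotype"):
    print(level, is_true_positive(truth, query, level))
```

In this example the call is a TP under loose and allele matching but not under genotype matching, so the same call set can produce different TP/FP/FN counts depending on the stringency chosen.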
Approaches to Benchmarking Variant Calling
  Well-characterized whole genome Reference Materials
  Many samples characterized in clinically relevant regions
  Synthetic DNA spike-ins
  Cell lines with engineered mutations
  Simulated reads
  Modified real reads
  Modified reference genomes
  Confirming results found in real samples over time
Challenges in Benchmarking Small Variant Calling
  It is difficult to do robust benchmarking of tests designed to detect many analytes (e.g., many variants)
  Easiest to benchmark only within high-confidence bed file, but
    Benchmark calls/regions tend to be biased towards easier variants and regions
    Some clinical tests are enriched for difficult sites
    Challenges with benchmarking complex variants near boundaries of high-confidence regions
  Always manually inspect a subset of FPs/FNs
  Stratification by variant type and region is important
  Always calculate confidence intervals on performance metrics
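The last two points can be sketched together by computing a confidence interval on sensitivity for each stratum. The Wilson score interval is used here as one common choice for a binomial proportion; the slides do not name a specific method. The strata names and TP/FN counts below are made up for illustration.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical (TP, FN) counts per stratum; not GIAB results.
strata = {"SNPs": (980, 20), "indels": (90, 10)}
for name, (tp, fn) in strata.items():
    sensitivity = tp / (tp + fn)
    low, high = wilson_interval(tp, tp + fn)
    print(f"{name}: sensitivity={sensitivity:.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

A stratum with few benchmark variants gets a wide interval, which makes it visible when an apparently high sensitivity rests on too few sites to be trusted.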
Benchmarking on PrecisionFDA
Acknowledgments
  FDA
  Many members of Genome in a Bottle
  New members welcome!
  Sign up on website for email newsletters
GIAB Steering Committee
  Marc Salit, Justin Zook, David Mittelman, Andrew Grupe, Michael Eberle, Steve Sherry, Deanna Church, Francisco De La Vega, Christian Olsen, Monica Basehore, Lisa Kalman, Christopher Mason, Elizabeth Mansfield, Liz Kerrigan, Leming Shi, Melvin Limson, Alexander Wait Zaranek, Nils Homer, Fiona Hyland, Steve Lincoln, Don Baldwin, Robyn Temple-Smolkin, Chunlin Xiao, Kara Norman, Luke Hickey
For More Information
www.genomeinabottle.org - sign up for general GIAB and Analysis Team Google Group emails
github.com/genome-in-a-bottle Guide to GIAB data & ftp
www.slideshare.net/genomeinabottle
www.ncbi.nlm.nih.gov/variation/tools/get-rm/ - Get-RM Browser
Data: http://biorxiv.org/content/early/2015/09/15/026468
Global Alliance Benchmarking Team: https://github.com/ga4gh/benchmarking-tools
Twice yearly public workshops
  Winter at Stanford University, California, USA
  Summer at NIST, Maryland, USA
Justin Zook: jzook@nist.gov
Marc Salit: salit@nist.gov
tumor-normal-somatic-copy-numbe
Chr | Pos | Len | Qual | Copy Number | LOH
chr1 | 16901000 | 70000 | 3.01 | 3 | FALSE
chr1 | 142569000 | 96000 | 3.03 | 3 | FALSE
chr1 | 144839000 | 409000 | 3.02 | 3 | FALSE
chr3 | 195409000 | 66000 | 3.02 | 3 | FALSE
chr4 | 190538000 | 287000 | 3.01 | 3 | FALSE
chr6 | 57350000 | 217000 | 3.04 | 3 | FALSE
chr10 | 42371000 | 34000 | 3.02 | 3 | FALSE
chr10 | 46947000 | 206000 | 3.05 | 3 | FALSE
chr11 | 54907000 | 20000 | 3.01 | 3 | FALSE
chr14 | 18998000 | 96000 | 3.02 | 3 | FALSE
chr14 | 19435000 | 35000 | 3.02 | 3 | FALSE
chr14 | 20192000 | 224000 | 3.53 | 3 | FALSE
chr15 | 20392000 | 123000 | 3.01 | 3 | FALSE
chr16 | 32377000 | 220000 | 3.01 | 3 | FALSE
chr16 | 33878000 | 144000 | 3.01 | 3 | FALSE
chr19 | 24509000 | 92000 | 3.02 | 3 | FALSE
chr21 | 9406000 | 250000 | 3.02 | 3 | FALSE
chr21 | 10366000 | 231000 | 3.03 | 3 | FALSE
chr21 | 10732000 | 126000 | 3.03 | 3 | FALSE

Sensitivity to calls in 2 technologies, by deletion size
callset | sens2techlt100 | sens2tech100to1k | sens2tech1kto3k | sens2techgt3k | sens2tech
CGCNV | 0.00 | 0.00 | 0.00 | 0.26 | 0.03
CGSV | 0.00 | 0.42 | 0.73 | 0.57 | 0.41
CSHLassembly | 0.56 | 0.59 | 0.56 | 0.43 | 0.56
sniffles | 0.00 | 0.02 | 0.04 | 0.03 | 0.02
BioNano | 0.00 | 0.01 | 0.20 | 0.46 | 0.08
Spiral | 0.91 | 0.54 | 0.43 | 0.39 | 0.57
Cortex | 0.29 | 0.06 | 0.02 | 0.00 | 0.09
CommonLaw | 0.00 | 0.00 | 0.00 | 0.00 | 0.00
PBHoneySpots | 0.50 | 0.62 | 0.11 | 0.01 | 0.47
PBHoneyTails | 0.00 | 0.00 | 0.35 | 0.65 | 0.11
MetaSV | 0.02 | 0.79 | 0.85 | 0.73 | 0.67
ParliamentPacBio | 0.01 | 0.76 | 0.86 | 0.46 | 0.62
ParliamentAssembly | 0.01 | 0.67 | 0.51 | 0.03 | 0.48
10X | 0.00 | 0.00 | 0.00 | 0.01 | 0.00
MultibreakSV | 0.74 | 0.77 | 0.74 | 0.50 | 0.73
CNVnator | 0.00 | 0.35 | 0.77 | 0.90 | 0.40
ParliamentPacBioForce | 0.58 | 0.71 | 0.36 | 0.19 | 0.59
ParliamentAssemblyForce | 0.53 | 0.61 | 0.12 | 0.01 | 0.48
BionanoHaplo | 0.00 | 0.00 | 0.27 | 0.55 | 0.09

bp difference
Callsets: Spiral, ParliamentAssemblyForce, MultibreakSV, Cortex, PBHoneySpots, CSHLassembly, CGSV, MetaSV, ParliamentPacBio, ParliamentPacBioForce, ParliamentAssembly, PBHoneyTails, CNVnator, BionanoHaplo, BioNano, sniffles, CGCNV, 10X
Values: 03345712131717204312318920743217484401
proportion different