By applying array-based sequence capture and pyrosequencing, D’Ascenzo and co-workers successfully identified one known (positive control) and four previously unknown mutations in the mouse Kit gene. These data represent the first documentation and validation of the fact that the new technologies can be used to efficiently discover causative mutations.
The Kit series is an archetypal allelic series in the laboratory mouse that has greatly contributed to the current understanding of KIT receptor function and its multifaceted role in stem cell proliferation, migration, and development. In addition, allelic series often provide better models of human disease, where phenotypes are rarely the result of null or amorphic alleles.
Forward genetic, or phenotype-driven, approaches remain the primary source of allelic variants in the mouse. However, the gap between observable phenotype and causative genotype has limited the widespread use of spontaneous and induced mouse mutants. As alternatives to traditional positional cloning and mutation detection approaches, sequence capture and next-generation sequencing technologies can be used to rapidly sequence subsets of the genome. Application of these technologies to mutation detection efforts in the mouse has the potential to significantly reduce the time and resources required for mutation identification by abrogating the need for high-resolution genetic mapping, long-range PCR, and sequencing of individual PCR amplimers.
As proof of principle that array enrichment and next-generation sequencing technology can be used to rapidly identify mutations, D’Ascenzo and co-workers have chosen to apply these new technologies to identify novel alleles in the archetypal Kit allelic series.
Materials and Methods
The five Kit alleles selected for targeted resequencing were spontaneous mutations that all arose at The Jackson Laboratory between 1967 and 1988 (Table 1). One of the alleles, KitW-41J (W-41J), was a known Kit mutation and, therefore, its V831M mutation served as a positive control. The other four Kit strains selected were previously shown to carry Kit mutations by noncomplementation testing, but the specific molecular lesion was unknown. The Kit allele KitW-73J (W-73J) was included because this mutation arose on a nonreference strain background (DBA/2J). The other three alleles, KitW-39J (W-39J), KitW-40J (W-40J), and KitW-20J (W-20J), were selected based on the severity of phenotypes reported, where the “mildest” allele, W-39J, is a viable allele and the most “severe” allele, W-40J, causes early lethality and detrimental effects on gametogenesis in heterozygotes.
Sequence capture array and sequencing
A custom tiling 385K sequence capture array targeting the c-Kit locus was designed and manufactured by Roche NimbleGen. The array was designed using NimbleGen’s standard 15-mer frequency masking to minimize repeat content within capture probes. The probe spacing, tiling overlap, and probe length were determined using proprietary algorithms.
DNA samples from each of the five strains were processed and assigned unique, nonidentifying sample numbers prior to shipping to Roche NimbleGen. A commercially acquired mouse genomic DNA (msgDNA) was included because the DNA samples from the five Kit strains had been in storage for as long as 23 years and the extent of degradation and potential loss of sequence quality was unclear in advance of the experiment.
Sequence capture libraries were constructed for all samples and enriched for the target locus using an array capture-mediated approach (Figure 1). The regions of interest (color-coded DNA segments) were nominated, and an array tiling across the Kit locus was manufactured. Each capture library was hybridized to arrays, washed stringently, eluted under high temperature, and amplified via the added adapters. Enriched libraries were then subjected to downstream library construction and sequencing steps.
To assess the degree of the enrichment, mapping data for each sample were evaluated by quantifying the number of reads that mapped within the capture target region (i.e., the Kit locus) as a fraction of the total reads mapping to mm9 (i.e., % on-target).
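The on-target calculation described above can be illustrated with a minimal Python sketch. It assumes that read alignments are already available as simple (chromosome, start, end) records and that a read counts as on-target if it overlaps the capture interval by at least one base; the record type, the function name, and the overlap rule are illustrative assumptions, not the authors' actual pipeline.

```python
from typing import Iterable, NamedTuple


class Interval(NamedTuple):
    """A genomic interval; coordinates are 0-based, end-exclusive (an assumption)."""
    chrom: str
    start: int
    end: int


def percent_on_target(alignments: Iterable[Interval], target: Interval) -> float:
    """Return the percentage of mapped reads that overlap the capture target.

    A read is counted as on-target when it shares at least one base with
    the target interval on the same chromosome.
    """
    total = 0
    on_target = 0
    for aln in alignments:
        total += 1
        same_chrom = aln.chrom == target.chrom
        overlaps = aln.start < target.end and aln.end > target.start
        if same_chrom and overlaps:
            on_target += 1
    if total == 0:
        return 0.0  # no mapped reads, so no enrichment can be reported
    return 100.0 * on_target / total
```

In practice the alignments would come from the mapping of each sample's reads to mm9, and the target would be the tiled Kit interval; both are represented here only by placeholder records.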
454 Sequencing System
Following evaluation by agarose electrophoresis, the amplified capture libraries were processed into sequencing libraries for the 454 Genome Sequencer FLX using the Shotgun DNA Library Construction Kit and low-molecular-weight DNA (no nebulization) protocols (454 Life Sciences, Branford, CT) according to the manufacturer’s recommended conditions in the 454 Life Sciences Sequencing Center. Each captured sample library was sequenced either as a full 2-region PicoTiterPlate Device run on the Genome Sequencer FLX platform or, in the case of W-39J, as eight lanes from a 16-region PicoTiterPlate Device across two runs. All resulting sequence data were transferred to the bioinformatics groups at NimbleGen and 454 Life Sciences for assembly and analysis.
Sequence and data analysis was performed independently by bioinformatics groups at both Roche NimbleGen (RNG) and Roche 454 Life Sciences (454) as part of an overall collaborative analysis effort.
Results and Discussion
D’Ascenzo and co-workers successfully identified one known (positive control) and four previously unknown mutations in the Kit gene, as well as one new SNP and seven of eight known DBA/2J SNPs by using the array capture-mediated resequencing approach. For each sample, one coding SNP or small deletion was detected and independently validated. Importantly, the positive control mutation was also confirmed correctly. Furthermore, the nature of the mutations, their positions within the KIT protein, and a wealth of genetic data, including noncomplementation with previously characterized Kit alleles, support the conclusion that the coding variants found in W-20J, W-39J, W-40J, and W-73J were the causative mutations.
W-39J arose spontaneously in C57BL/6J; it was therefore interesting to find and confirm, in addition to the W-39J mutation, a second variant relative to the C57BL/6J reference. Pedigree data showed that the W-39J mutation arose prior to 1970 and was then maintained in a reproductively isolated research colony until DNA was archived in 1985. This additional novel variant was not present in the other W alleles that were used in this study, all of which arose spontaneously in the same C57BL/6J breeding colony at The Jackson Laboratory. The researchers confirmed that the SNP is carried neither by the alleles that arose before (W-20J) or after (W-40J and W-41J) W-39J nor by the strain upon which the W-39J allele is carried. Therefore, it is likely that this variant arose after the W-39J mutation and that, given its proximity to the causative W-39J mutation, tight linkage maintained the SNP in subsequent generations (~62 generations).
Close inspection of the position of the validated nonsynonymous SNPs and the deletion within Kit revealed that four of the variants, including the positive control W-41J, were in the critical domains for kinase function (Figure 2). Three of the five variants were in the ATP-binding loop, which is important for coordinating the phosphate donor within the active site. The known mutation W-41J falls into the activation loop, which is important for activating the kinase’s phosphate transfer. The strain bearing the W-20J allele had a mutation in a universally conserved position [the 5′ anchor G in the GXGXXGK(N20)K motif of the ATP-binding loop], which replaces the glycine with a glutamate residue (G595E). The W-73J allele contains an isoleucine instead of a methionine at position 623. The W-39J mutation lies adjacent to and just outside the GXGXXGK(N20)K ATP-binding loop. The 5-bp deletion identified in W-40J results in a frameshift yielding two nonsynonymous substitutions (R186C and A188S) and an in-frame, premature STOP at position 190. This is predicted to result in nonsense-mediated decay of the resulting transcript and/or translation to a polypeptide truncated to less than one third the normal length with no fully formed domains.
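How a 5-bp deletion can produce a few altered residues followed by a premature stop can be shown with a toy Python sketch. The DNA sequence below is invented purely for illustration and is not the Kit coding sequence; the helper names are likewise hypothetical. Only the standard genetic code is taken as given.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate from the first base, stopping at the first stop codon.

    A trailing partial codon (1 or 2 bases) is ignored.
    """
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def delete_bases(dna: str, start: int, length: int) -> str:
    """Remove `length` bases beginning at 0-based index `start`."""
    return dna[:start] + dna[start + length:]


# Hypothetical 30-bp open reading frame (NOT the real Kit sequence).
wild_type = "ATGGCTCGTGATGCAAGTGACCTGAAAGGC"
mutant = delete_bases(wild_type, start=6, length=5)

wild_type_protein = translate(wild_type)
mutant_protein = translate(mutant)
```

Because five is not a multiple of three, every codon downstream of the deletion is read in a shifted frame; in this toy case the shifted frame changes two residues and then reaches a stop codon, which mirrors the pattern reported for W-40J in outline only.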
According to the authors, the results of the study showed that sequence capture and next-generation sequencing technology can be used to rapidly sequence select regions from the mouse genome, and that heterozygous variations within these regions can be predicted with high accuracy. The positive-prediction accuracy was 86% and the negative-prediction accuracy was close to perfect. A variant allele frequency cutoff of 25% appeared sufficient to capture all true variants. In addition, validation of candidate variants can be further prioritized by coverage and flanking sequence complexity. The researchers concluded that an allele ratio threshold of greater than 25% and read depth of greater than 20 are sufficient to sensitively detect true heterozygous variants while maintaining adequate specificity to yield a manageably small collection of SNPs for validation.
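The two thresholds reported above (allele ratio greater than 25% and read depth greater than 20) can be expressed as a simple candidate filter. The following Python sketch assumes each candidate is summarized by its position, total read depth, and number of variant-supporting reads; the class, field, and function names are hypothetical, and the example values are placeholders rather than data from the study.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Candidate:
    """A candidate variant site (all fields are illustrative assumptions)."""
    position: int
    depth: int          # total reads covering the site
    variant_reads: int  # reads supporting the non-reference allele


def passes_thresholds(
    candidate: Candidate,
    min_allele_ratio: float = 0.25,
    min_depth: int = 20,
) -> bool:
    """Keep sites with depth > min_depth and allele ratio > min_allele_ratio.

    Both comparisons are strict, matching the "greater than" wording in the text.
    The depth check runs first, so the ratio is never computed for depth 0.
    """
    if candidate.depth <= min_depth:
        return False
    return candidate.variant_reads / candidate.depth > min_allele_ratio


def filter_candidates(candidates: Iterable[Candidate]) -> List[Candidate]:
    """Return only the candidates worth sending on for validation."""
    return [c for c in candidates if passes_thresholds(c)]


# Placeholder candidates, not values from the study.
example = [
    Candidate(position=100, depth=215, variant_reads=100),
    Candidate(position=200, depth=50, variant_reads=10),
    Candidate(position=300, depth=15, variant_reads=9),
]
kept = filter_candidates(example)
```

Sites that pass could then be ranked by coverage and flanking sequence complexity, as the authors suggest, before validation.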
Most important from a practical perspective were the potential savings in terms of time and therefore cost for mutation discovery in the mouse genome. In as little as 2 weeks, an approximately 160-kb genomic region from five strains was captured and sequenced with a maximum median depth of coverage of 215 reads, comprising a total of over 340 Mb from the genetically identified interval.
By using array-based sequence capture and pyrosequencing to sequence an allelic series from the classically defined Kit locus (~200 kb) from each of five noncomplementing Kit mutants (one known allele and four unknown alleles), D’Ascenzo and co-workers discovered and validated a nonsynonymous coding mutation for each allele. These data represent the first documentation and confirmation of the fact that the new technologies can be used to efficiently discover causative mutations. Importantly, these results also provide a framework for the efficient detection of causative mutations in mutant mouse strains moving forward. The standards presented here, coupled with the latest sequence capture methods (e.g., whole-exome arrays or solution-based methods) and next-generation sequencing technologies, promise to significantly close the gap between phenotype and genotype in the mouse.
D’Ascenzo M et al. (2009) Mamm Genome 20:424–436