Sequencing Technology Development and Applications
With the advent of affordable DNA-sequencing technologies, methods have been developed for examining nuclear organization, chromatin state/histone post-translational modifications, chromatin accessibility and methylation state. But these methods do not directly interrogate the DNA strand, and the reads are typically too short to provide critical correlative information. We propose the development of a novel epigenetic characterization methodology, built on the practical implementation of modified-base sequencing on a nanopore sequencing platform. Nanopore sequencing directly probes the chemical structure of the molecule in the pore with exquisite sensitivity. Its long reads enable correlation of epigenetic state over large (>10 kb) stretches of the genome; each of these reads originates from a single cell, probing the epigenetic heterogeneity of the sample. This work on developing methylation calling on the Oxford Nanopore platform (with Jared Simpson from OICR) was published in Nature Methods here. (Image credit: Schatz, M. Nature Methods News & Views)
We have taken this a step further: using exogenous labeling methods, we have adapted NOMe-seq (PMID: 22960375) to nanopore sequencing. Essentially, we label accessible chromatin with non-native methylation, in this case GpC methylation, to simultaneously measure chromatin accessibility and methylation state using long-read nanopore sequencing. We call this method nanoNOMe; see our paper led by Isac Lee here.
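To make the dual readout concrete, here is a minimal Python sketch (not the nanoNOMe pipeline itself) of how cytosines on one strand could be sorted by sequence context: CpG sites report endogenous methylation, while GpC sites report the exogenous accessibility label. The function name `classify_cytosines` is hypothetical. The sketch also assumes that a cytosine in a GCG context, which belongs to both motifs, should be set aside as ambiguous.

```python
def classify_cytosines(seq):
    """Sort forward-strand cytosines by context: CpG, GpC, or ambiguous (GCG)."""
    seq = seq.upper()
    sites = {"CpG": [], "GpC": [], "ambiguous": []}
    for i, base in enumerate(seq):
        if base != "C":
            continue
        in_gpc = i > 0 and seq[i - 1] == "G"             # G immediately 5' of the C
        in_cpg = i + 1 < len(seq) and seq[i + 1] == "G"  # G immediately 3' of the C
        if in_gpc and in_cpg:
            sites["ambiguous"].append(i)   # GCG: cannot be assigned to one signal
        elif in_cpg:
            sites["CpG"].append(i)         # endogenous methylation context
        elif in_gpc:
            sites["GpC"].append(i)         # exogenous accessibility label context
    return sites


if __name__ == "__main__":
    print(classify_cytosines("ACGTGCAGCGT"))
```

Positions are 0-based indices into the input string; only the forward strand is examined, so a real analysis would also need to handle the reverse complement.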
In conjunction with Oxford Nanopore, we have developed a targeted enrichment strategy tailored to long reads, leveraging Cas9. Essentially, genomic DNA is dephosphorylated, then cut with Cas9 flanking the region of interest (ROI). These fresh cut sites are phosphorylated, so sequencing adaptors ligate preferentially at these locations. Using this method, nanopore Cas9 Targeted Sequencing or nCATS, we have shown that multiple regions ~18 kb long can be captured and run on either a MinION or Flongle flowcell to generate high coverage of specific regions of interest. We can measure methylation, structural variations and even single nucleotide variations with these data. For more information, please see our paper led by Tim Gilpatrick here.
Working with Mike Schatz’s group (led by Sam Kovaka), we developed an adaptive sequencing methodology known as UNCALLED. This method assesses the molecule as it is being sequenced, allowing us to decide whether or not to continue sequencing it as it passes through the pore. Though this gives less fold enrichment than the Cas9 approach, it is also more dynamic, allowing a user to simply program in the regions of interest without the need for guide RNA design.
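The decision step can be illustrated with a toy Python sketch. This is not UNCALLED's code or API; the function `decide`, the three outcome labels, and the idea that an upstream mapper hands us a chromosome and position (or nothing yet) are all assumptions made for illustration.

```python
def decide(regions, chrom, pos):
    """Toy keep/eject decision for a molecule that is partway through the pore.

    regions: list of (chrom, start, end) targets, half-open [start, end).
    chrom, pos: where the first part of the read mapped, or None if unmapped so far.
    """
    if chrom is None or pos is None:
        return "wait"  # assumption: keep sequencing until a mapping is available
    on_target = any(c == chrom and start <= pos < end for c, start, end in regions)
    return "keep" if on_target else "eject"


if __name__ == "__main__":
    targets = [("chr8", 1000, 2000)]  # hypothetical region of interest
    for hit in [("chr8", 1500), ("chrX", 1500), (None, None)]:
        print(hit, decide(targets, *hit))
```

Changing the enrichment target here only means editing the `targets` list, which mirrors the point above that regions are programmed in rather than designed into guide RNAs.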
After two decades of improvements, the current human reference genome (GRCh38) is the most accurate and complete vertebrate genome ever produced. However, no single chromosome has been finished end to end, and hundreds of unresolved gaps persist. With the Telomere-to-Telomere (T2T) consortium we used ultra-long (100 kb+) nanopore sequencing reads to assemble a human genome that surpasses the continuity of GRCh38, thus enabling exploration of the full epigenome. Ultra-long nanopore reads not only allowed for complete assemblies of the human centromeres, a notoriously difficult genomic region to probe, but simultaneously allowed for epigenetic analyses within these regions. By analyzing CpG methylation in human centromeres, we discovered a drop in methylation in the otherwise hypermethylated centromere of chromosome X (Miga et al. 2020). This drop in methylation was later confirmed in another human centromere, that of chromosome 8 (Logsdon et al. 2021). We have since named this region the centromeric dip region (CDR) and have associated it with localization of kinetochore-associated proteins (CENP-A and CENP-B) in all chromosomes.
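As a rough illustration of what a "drop in methylation" looks like computationally, the Python sketch below slides a fixed window over per-site methylation fractions and reports windows whose mean falls below a cutoff. It is a simplified stand-in, not the analysis used in the cited papers; the function name, window size, and cutoff are hypothetical choices.

```python
def low_methylation_windows(fractions, window, cutoff):
    """Return start indices of windows whose mean methylation fraction is below cutoff.

    fractions: per-CpG methylation fractions (0.0 to 1.0), ordered along the region.
    window: number of consecutive sites per window.
    cutoff: mean fraction below which a window is flagged.
    """
    hits = []
    for start in range(len(fractions) - window + 1):
        mean = sum(fractions[start:start + window]) / window
        if mean < cutoff:
            hits.append(start)
    return hits


if __name__ == "__main__":
    # Hypothetical values: a hypermethylated array with a dip in the middle.
    example = [0.9, 0.9, 0.8, 0.2, 0.1, 0.2, 0.9, 0.9]
    print(low_methylation_windows(example, window=3, cutoff=0.5))
```

Consecutive flagged windows can then be merged into a single candidate region; on real data the window would span many more sites and the cutoff would need to be chosen relative to the surrounding methylation level.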
Nanopore Direct RNA Sequencing
Nanopore sequencing doesn’t just measure DNA; it can characterize any polymer placed in a pore. Long RNA sequencing reads provide exon connectivity, accurate measurement of gene fusion events, and in the case of direct RNA sequencing, an estimate of poly-A tail length and the ability to directly detect RNA modifications. As part of a larger consortium (UCSC, UBC, OICR, JHU, UNottingham, UBirmingham), we have generated a comprehensive dataset composed of 13M direct RNA and 24M cDNA sequences based on poly-A RNA isolated from the human GM12878 reference cell line. We have made this dataset publicly available here. With median RNA read identity around 86%, excellent correlation (R=0.875) between dRNA and cDNA datasets, 73% of annotated human reference transcripts captured, and aligned read lengths up to 22 kb (116 exons), our dataset is a novel resource for benchmarking of single-molecule technology and RNA sequencing. Using this dataset, we developed new tools to characterize poly-A tail length (nanopolish polyA) and examined RNA modifications using the electrical data. We have published our work characterizing the human poly(A) transcriptome (Workman et al. Nature Methods) and the C. elegans poly(A) transcriptome (Roach et al. Genome Research). We have even started to apply this method to detecting clinical splicing variations with Becca Riggins (Tiek et al. bioRxiv).
Continuing work in this vein, we have been interested in probing nanopore sequencing dynamics with Stirling Churchman, as in her recent work with nano-COP. There she used direct RNA sequencing to profile pre-mRNA molecules to understand their splicing dynamics; we are working with her to develop this further to directly detect metabolic labels on RNA and correlate them with other variations we observe in splicing. We have also been developing RNA structural profiling approaches with the NYGC Technology Innovation Lab. We applied SHAPE reagents to study how nanopore sequencing could read out the structure of different RNA molecules; this work is preprinted here.
Genome Assembly: Conifers
The California redwoods Sequoia sempervirens and Sequoiadendron giganteum are well-known for their size, long lifespan, economic importance, and their role in the American conservation movement. In addition to these superlatives, the coast redwood (Sequoia sempervirens) boasts a ~38 Gb genome, made even more complex by its hexaploid status. Giant sequoia, as a diploid, rests at a modest 9 Gb. These genome assemblies will provide the foundation for the development of genomic tools to aid in the conservation and restoration of California’s remaining redwood forests. Our lab is leveraging long reads from nanopore sequencing to help assemble the genomes of these iconic California endemics for the first time. We are working with both David Neale’s group at UC Davis and Steven Salzberg at Johns Hopkins to generate these assemblies. To date, we have completed preliminary nanopore sequencing on the giant sequoia, and are beginning work on the coast redwood next. One of the problems we encountered in this project was extraction of large amounts of high-quality, high molecular weight (HMW) DNA from adult trees. Although many extraction methodologies exist for recalcitrant plant species, most yield either DNA of a quality “fit for PCR” but not for sensitive nanopore sequencing applications, or DNA too fragmented to obtain sequencing reads of sufficient length to improve assembly contiguity. Obtaining 60 kb+ and “nanopore clean” DNA places higher demands on sample extraction and preparation than existing methodology can meet in adult trees. Working with Circulomics and their Nanobind technology, we have developed a protocol that generates high-quality and high-yield data. A recent presentation on this work is available here. Our genome assembly of the giant sequoia is published (Scott et al. G3), with our work on coast redwood forthcoming. We have since moved on to work on the whitebark pine (Pinus albicaulis).
Genome/Transcriptome Assembly: Hummingbird
Hummingbirds have evolved to be elite athletes and possess the highest mass-specific metabolic rate known among vertebrates. To sustain hovering flight, a hummingbird needs to maintain a wingbeat frequency of 60-80 beats per second and a heart rate of up to 1200 beats per minute. The flight muscles of hovering hummingbirds oxidize circulating glucose at rates as much as 55× greater than the muscles of non-flying vertebrates. Just as impressive, migrating hummingbirds that weigh just 2-3 grams can build lipid stores at phenomenal rates, doubling their body weight in a matter of days, and then oxidize these stores at rates sufficient to sustain a non-stop flight of ~500-600 miles across the Gulf of Mexico. In achieving these metabolic feats, some metabolic enzymes, such as hexokinase, a key regulator of glycolytic flux, operate at 75% of their maximal capacity and others, such as pyruvate carboxylase, operate near the theoretical upper limit of catalytic efficiency. But we don’t know the sequences of their genes. With PacBio’s IsoSeq, we are attempting to decipher a de novo transcriptome of the ruby-throated hummingbird (Archilochus colubris). Our paper on this work has been published in Gigascience (Workman et al). Next, we have begun characterizing the epigenome and transcriptome of two different tissue types (liver and muscle) under two different conditions (fasted or fed), to examine changes in gene expression that occur in the switch from fat burning to carbohydrate burning.
In the United States, infectious disease is primarily a health threat to only the very old or the very young. In the 20th century, childhood mortality due to infectious disease declined from 466 to 0.7 deaths per 100,000. Though vaccines and antibiotics granted this fantastic decline, we are now facing a new, post-antibiotic era in which this threat is re-emerging. The problem is often one of both diagnosis and treatment. Pediatricians are often faced with children unable to accurately describe their symptoms, and physical exams only tell so much. Current laboratory tests are either narrowly targeted towards a specific infection type, or so time-consuming as to be ineffective. Many different infections show the same common symptoms, making medical diagnosis and hence treatment difficult; it is even hard to tell a viral infection from a bacterial one, causing clinicians to err on the side of caution and give antibiotics when they are not required. An unambiguous, quantifiable method for diagnosis and identification of infections would be a great aid in treatment. We are applying sequencing to these problems, using Illumina, PacBio and nanopore technologies for shotgun metagenomic sequencing and clinical isolate genome characterization. These assays can provide faster and more accurate information about pathogens and therapeutic resistance. Specifically, we are using metagenomic sequencing to identify infection types, and sequencing cultured isolates to generate clear and complete genome assemblies that characterize the mechanism and type of antimicrobial resistance. We are performing this work primarily with Trish Simner, a clinical bacteriologist, on remnant samples from Johns Hopkins Hospital, funded by an R21 (NIAID) and resulting in several publications (PMID: 30373801, 296478629, 29588357). We are continuing this work to develop methods for metagenomic assembly and integration of sequencing into the clinical workflow.
Most recently we have been working on viral sequencing as well. With the COVID-19 pandemic, genomic epidemiology and the response to novel viral variants depend on clear surveillance of the virus and its mutations worldwide. Leveraging work by the ARTIC network, we implemented a SARS-CoV-2 sequencing approach at Johns Hopkins, with our initial work and characterization published in Thielen, Wohl et al.
We intend to directly measure the effect of methylation and nucleosome distance on protein-DNA interactions using a solid-state nanopore employed as a force sensor. Nanopores, nanometer-sized holes in a membrane, can be used to balance electric field forces against molecular interaction forces; this approach can measure DNA unzipping, DNA hairpins, biotin/streptavidin interactions, and protein unfolding. Our novelty lies in using this tool to quantify protein-DNA interactions in the context of genetic mutation and epigenetic modification. Though this has been done in some contexts using restriction enzymes (51, 52), those studies did not consider epigenetic modifications, nor attempt to assay transcription factors or other nucleic acid binding proteins of critical interest. This idea is innovative because it offers high-throughput probing of DNA-protein interactions based not just on the kinetics of interaction, but directly on the strength of the binding force. This will give a clearer idea of the true specificity of interaction, providing insight into the general mechanism of interaction. The current state-of-the-art tools are limiting development in this area, as they apply primarily in a qualitative manner on the whole genome or quantitatively at the single-molecule level.