Sequencing Technology Development and Applications
The California redwoods Sequoia sempervirens and Sequoiadendron giganteum are well known for their size, long lifespan, economic importance, and role in the American conservation movement. In addition to these superlatives, the coast redwood (Sequoia sempervirens) boasts a ~38 Gb genome, made even more complex by its hexaploid status. Giant sequoia, as a diploid, rests at a modest 9 Gb. Our lab is leveraging long reads from nanopore sequencing to help assemble the genomes of these iconic California endemics for the first time. These genome assemblies will provide the foundation for the development of genomic tools to aid in the conservation and restoration of California’s remaining redwood forests. We are working with both David Neale’s group at UC Davis and Steven Salzberg at Johns Hopkins to generate these assemblies. To date, we have completed preliminary nanopore sequencing on the giant sequoia and are beginning work on the coast redwood next. One of the problems we encountered in this project was the extraction of large amounts of high-quality, high molecular weight (HMW) DNA from adult trees. Although many extraction methodologies exist for recalcitrant plant species, most yield either DNA of quality “fit for PCR” but not for sensitive nanopore sequencing applications, or DNA too fragmented to obtain sequencing reads of sufficient length to improve assembly contiguity. Obtaining 60 kb+, “nanopore clean” DNA places higher demands on sample extraction and preparation than existing methodology can meet in adult trees. Working with Circulomics and their Nanobind technology, we have developed a protocol that generates high-quality, high-yield data. A recent presentation on this work is available here.
Hummingbirds have evolved to be the most elite athletes and possess the highest mass-specific metabolic rate known among vertebrates. To sustain hovering flight, a hummingbird needs to maintain a wing beat of up to 60-80 beats per second and a heart rate of up to 1200 beats per minute. The flight muscles of hovering hummingbirds oxidize circulating glucose at rates as much as 55× greater than the muscles of non-flying vertebrates. Just as impressive, migrating hummingbirds that weigh just 2-3 grams can build lipid stores at phenomenal rates, doubling their body weight in a matter of days, and then oxidize these stores at rates sufficient to sustain a non-stop flight of ~500-600 miles across the Gulf of Mexico. In achieving these metabolic feats, some metabolic enzymes, such as hexokinase, a key regulator of glycolytic flux, operate at 75% of their maximal capacity, and others, such as pyruvate carboxylase, operate near the theoretical upper limit of catalytic efficiency. But we don’t know the sequences of their genes. With PacBio’s IsoSeq, we are attempting to decipher a de novo transcriptome of the ruby-throated hummingbird (Archilochus colubris). Rachael’s recent presentation on this work is here (AGBT2017_Timp_ruby_wta_small), and a preprint is available here.
Nanopore Methylation Calling
With the advent of affordable DNA-sequencing technologies, methods have been developed for examining nuclear organization, chromatin state/histone post-translational modifications, chromatin accessibility, and methylation state. But these methods do not directly interrogate the DNA strand, and the reads are typically too short to provide critical correlative information. We propose the development of a novel epigenetic characterization methodology, based on the practical implementation of modified-base sequencing on a nanopore sequencing platform. Nanopore sequencing directly probes the chemical structure of the molecule in the pore with exquisite sensitivity. Its long reads enable correlation of epigenetic state over large (>10 kb) stretches of the genome; each of these reads originates from a single cell, probing the epigenetic heterogeneity of the sample. This work on developing methylation calling on the Oxford Nanopore platform (with Jared Simpson from OICR) was published in Nature Methods here. (Image credit: Schatz, M. Nature Methods News & Views)
Nanopore Direct RNA Sequencing
Nanopore sequencing doesn’t just measure DNA; it can characterize any polymer placed in a pore. Long RNA sequencing reads provide exon connectivity, accurate measurement of gene fusion events, and, in the case of direct RNA sequencing, an estimate of poly-A tail length and the ability to directly detect RNA modifications. As part of a larger consortium (UCSC, UBC, OICR, JHU, UNottingham, UBirmingham), we have generated a comprehensive dataset composed of 13M direct RNA and 24M cDNA sequences based on poly-A RNA isolated from the human GM12878 reference cell line. We have made this dataset publicly available here. With median RNA read identity around 86%, excellent correlation (R=0.875) between dRNA and cDNA datasets, 73% of annotated human reference transcripts captured, and aligned read lengths up to 22 kb (116 exons), our dataset is a novel resource for benchmarking single-molecule technology and RNA sequencing.
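As an illustration of how a correlation like the one quoted above might be computed, here is a minimal Python sketch. It assumes the comparison is made on per-transcript read counts after a log transform; the source does not say which quantity was correlated, and the transcript names and counts below are invented for the example.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-transcript read counts (not real data).
drna_counts = {"tx1": 120, "tx2": 15, "tx3": 900, "tx4": 3}
cdna_counts = {"tx1": 200, "tx2": 40, "tx3": 1500, "tx4": 2}

# Compare only transcripts seen in both datasets, on a log10(count + 1) scale
# (the +1 pseudocount and the log scale are assumptions of this sketch).
shared = sorted(set(drna_counts) & set(cdna_counts))
log_drna = [math.log10(drna_counts[t] + 1) for t in shared]
log_cdna = [math.log10(cdna_counts[t] + 1) for t in shared]

r = pearson_r(log_drna, log_cdna)
print(f"Pearson R over {len(shared)} transcripts: {r:.3f}")
```

The log transform keeps a handful of very highly expressed transcripts from dominating the coefficient; a rank-based measure such as Spearman correlation would be another reasonable choice.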
In the United States, infectious disease is primarily a health threat only to the very old or the very young. In the 20th century, childhood mortality due to infectious disease declined from 466 to 0.7 deaths per 100,000. Though vaccines and antibiotics granted this fantastic decline, we are now facing a new, post-antibiotic era where this threat is re-emerging. The problem is often one of both diagnosis and treatment. Pediatricians are often faced with children unable to accurately describe their symptoms, and physical exams only tell so much. Current laboratory tests are either narrowly targeted towards a specific infection type or so time-consuming as to be ineffective. Many different infections show the same common symptoms, making medical diagnosis and hence treatment difficult; it is even hard to tell a viral infection from a bacterial one, causing clinicians to err on the side of caution and give antibiotics when they are not required. An unambiguous, quantifiable method for diagnosis and identification of infections would be a great aid in treatment. We propose DNA sequencing using a new nanotechnology-based sequencer to read the genetic information of the infection. This will not only uniquely identify what is causing the child to be ill, but also identify the most effective treatment specific to the infection. Specifically, we intend to use a new DNA sequencing device, a nanopore sequencer, which was recently commercialized by Oxford Nanopore. This instrument is the size of a USB stick and requires low capital investment ($1k), meaning that such a test could be distributed to doctors’ offices and hospitals across the country, as opposed to current sequencing technology, which is relegated to major research hospitals. We will develop a new sample preparation method to ensure that the genetic material sequenced is only from the infection, not from the individual, side-stepping privacy concerns and making the test more efficient and affordable.
Winston’s recent presentation on this work is here.
Structural variants (SVs), including large deletions, translocations, amplifications, and inversions, are the hallmark of genomic instability associated with cancer. In cancer cells, oncogenes are activated by amplification and translocation, in addition to point mutations, and tumor suppressor genes (TSGs) are inactivated by large deletions and inversions. CDKN2A/p16 and SMAD4/DPC4 are two of the most commonly deleted TSGs in human cancer, and complex SVs have been found to underlie approximately half of these deletions in pancreatic ductal adenocarcinoma (PDAC). Sensitive detection of SVs in malignant cells is critical for applications in molecular relapse, early detection, and therapeutic monitoring of cancer patients. But the logistics of detection are complicated, as cancer samples are often mixed with samples of normal genotype. The importance of being able to study structural variations in cancer easily and quickly cannot be overstated. Second-generation, or short-read (<300 bp), sequencing approaches have trouble detecting these structural variations due to repetitive regions in the vicinity. Even when they can be detected, the depth of coverage required to accurately call SVs is logistically challenging for affordable diagnostic use. We propose a combination of a brand-new sequencing technology, Oxford Nanopore’s MinION, with hybridization-based targeting of structural variant hotspot DNA. Our recent paper on this work is here. We have subsequently been working with Agilent Technologies to develop a targeted capture method to look at “hotspots” for structural variations, working around the (relatively) low yield of current long-read sequencing methods. We have generated an application note with Agilent Technologies, available here.
Nanopore Force Spectroscopy
We intend to directly measure the effect of methylation and nucleosome distance on protein-DNA interactions using a solid-state nanopore employed as a force sensor. Nanopores, nanometer-sized holes in a membrane, can be used to balance electric field against molecular interaction forces; this can measure DNA unzipping, DNA hairpins, biotin/streptavidin interactions, and protein unfolding. Our novelty lies in using this tool to quantify protein-DNA interactions in the context of genetic mutation/epigenetic modification. Though this has been done in some contexts using restriction enzymes (51, 52), those studies did not consider epigenetic modifications, nor did they attempt to assay transcription factors or other nucleic acid binding proteins of critical interest. This idea is innovative because it is high-throughput probing of DNA-protein interactions based not just on the kinetics of interaction, but directly on the strength of the binding force. This will give a clearer idea of the true specificity of interaction, providing insight into the general mechanism of interaction. The current state-of-the-art tools are limiting development in this area, as they apply primarily in a qualitative manner on the whole genome or quantitatively at the single-molecule level.
Sequencing-based Transcription Factor Binding Quantification for Synthetic Biology
The fields of synthetic biology and next-generation sequencing have experienced rapid growth in recent years, yet a tremendous untapped potential lies at the interface between these two disciplines. Synthetic biology seeks to apply engineering principles to the assembly of complex, novel biological systems for a wide range of applications, including therapeutics, bioremediation, and biochemical production. The first requirement is a set of well-characterized genetic ‘parts’ that can be assembled to achieve a given design goal. Typically, new parts are harvested from one organism and transferred to a different target host. This process is fraught with challenges, and the diversity of genetic parts used in synthetic systems to date represents only a small fraction of the vast array of biological behaviors observed in nature. A particular challenge is de novo design of genetic components. Natural constructs, which are optimal for the source organism’s function, are not necessarily ideal for engineering goals in the target organism. Thus, it is typically desirable to tune the ‘device physics’ of genetic components, e.g., the response strengths and sensitivities of biosensor components. However, without a thorough understanding of the mapping between genetic changes and biological performance, de novo design is impossible. As a result, the development and optimization of genetic parts for synthetic biology often proceeds in an ad hoc fashion. There is a need for methods to develop and characterize genetic components, offering a major opportunity for next-gen sequencing approaches. Our aim is to harness next-generation sequencing techniques for the development of new synthetic biology tools. As a demonstration, we will focus on genetic components that respond to intercellular communication signals, e.g. quorum sensing (QS) signals. 
We will use affinity-based enrichment and massively parallel DNA sequencing (SELEX) to probe the DNA sequence determinants of response strength and specificity, and we will harness this information to develop and optimize genetic parts that function in a target host cell.
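One simple way to read sequence determinants out of affinity-enriched sequencing data is to compare k-mer frequencies between the bound fraction and the input library. The Python sketch below illustrates that idea only; it is not the lab's analysis pipeline. The function names, the pseudocount, the choice of k, and the toy sequences are all assumptions made for the example.

```python
from collections import Counter
import math

def kmer_counts(sequences, k):
    """Count every overlapping k-mer across a list of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def log2_enrichment(bound, input_library, k, pseudocount=1.0):
    """log2 ratio of k-mer frequency in the bound pool vs the input pool.

    A pseudocount is added to every k-mer so that k-mers missing from
    one pool do not cause division by zero.
    """
    bound_counts = kmer_counts(bound, k)
    input_counts = kmer_counts(input_library, k)
    kmers = set(bound_counts) | set(input_counts)
    bound_total = sum(bound_counts.values()) + pseudocount * len(kmers)
    input_total = sum(input_counts.values()) + pseudocount * len(kmers)
    return {
        kmer: math.log2(
            ((bound_counts[kmer] + pseudocount) / bound_total)
            / ((input_counts[kmer] + pseudocount) / input_total)
        )
        for kmer in kmers
    }

# Toy sequences, invented for illustration.
input_library = ["ACGTAC", "TTGACA", "GGCATC", "CATGGA"]
bound_pool = ["TTGACA", "ATTGAC", "TTGACG"]

scores = log2_enrichment(bound_pool, input_library, k=4)
for kmer, score in sorted(scores.items(), key=lambda kv: -kv[1])[:3]:
    print(kmer, round(score, 2))
```

Positive scores mark k-mers over-represented after selection, negative scores mark depleted ones. Real SELEX analyses typically fit binding models across multiple selection rounds rather than relying on a single ratio.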
Chromatin immunoprecipitation sequencing (ChIP-seq) is one of the primary tools used to determine, genome-wide, the locations of bound proteins in cells. The technique typically relies on breaking up the chromatin, the nuclear material, into small pieces. The pieces that carry the protein of interest (in our case CTCF, Sp1/Sp3, MBD family proteins, or nucleosomes) are then captured via affinity interactions. After washing away unbound material, the proteins are removed via digestion and the resulting DNA is sequenced. When the sequenced DNA is aligned, we can determine where on the genome the bound fraction was localized via pileups of the sequenced fragment, or read, locations. But this does not take into account the methylation patterns present on the bound DNA. To profile methylation in DNA samples, the current gold-standard method is bisulfite sequencing. Essentially, sodium bisulfite treatment of DNA converts unmodified cytosines to uracil while leaving methylated cytosines unchanged. Sequencing this bisulfite-treated material allows measurement of the methylated versus unmethylated percentage at a given site by treating the cytosine as a potential C/T SNP. The conversion of the genome from, essentially, a 4-base to a 3-base genome does create problems in both library preparation and bioinformatics analysis, but we have experience in solving these issues.
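The C/T readout described above can be sketched in a few lines of Python. This is a toy model under stated assumptions: reads cover the whole reference with no gaps or sequencing errors, only one strand is considered, and unmethylated cytosines appear as T after conversion and amplification. The function names and example molecules are hypothetical.

```python
def bisulfite_convert(seq, methylated_positions):
    """Simulate bisulfite treatment of one molecule.

    Unmethylated C is read as T after conversion and amplification;
    methylated C (positions listed in methylated_positions) stays C.
    """
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(seq)
    )

def methylation_fraction(reference, reads):
    """For each reference C, return the fraction of reads that still show C."""
    fractions = {}
    for pos, ref_base in enumerate(reference):
        if ref_base != "C":
            continue
        c_count = sum(1 for read in reads if read[pos] == "C")
        t_count = sum(1 for read in reads if read[pos] == "T")
        if c_count + t_count:
            fractions[pos] = c_count / (c_count + t_count)
    return fractions

reference = "ACGTCGAC"  # cytosines at positions 1, 4, and 7

# Four hypothetical molecules with different methylation states.
molecules = [{1, 4}, {1}, {1, 4}, {1}]
reads = [bisulfite_convert(reference, m) for m in molecules]

fractions = methylation_fraction(reference, reads)
for pos, frac in fractions.items():
    print(f"position {pos}: {frac:.0%} methylated")
```

Because every unmethylated C becomes T, converted reads no longer match the original reference directly, which is the 3-base alignment problem the paragraph mentions; dedicated bisulfite aligners handle this by aligning against converted copies of the genome.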