Abstract: This chapter explores the use of Bayesian networks in the study of genome-scale deoxyribonucleic acid (DNA) methylation. It begins by describing different experimental methods for the genome-scale annotation of DNA methylation. The Methyl-seq protocol is detailed, along with the biases induced by this technique, each of which poses a challenge for further analysis. These challenges are addressed by introducing a Bayesian network framework for the analysis of Methyl-seq data. This model is then extended to incorporate more information from the genomic sequence, with genomic structure used as a prior on methylation status. A recurring theme is the interplay between the model used to glean information from the technology and the view of methylation that drives the model specification. Finally, a study is described in which such models were used, leading both to interesting biological conclusions and to insights about the nature of methylation.

ID: CaltechAUTHORS:20170303-140414225

Abstract: Knowledge of RNA structure is critical to understanding both the important functional roles of RNA in biology and the engineering of RNA to control biological systems. This article contains a protocol for selective 2′-hydroxyl acylation analyzed by primer extension and sequencing (SHAPE-Seq) that, through a combination of structure-dependent chemical probing and next-generation sequencing technologies, achieves structural characterization of hundreds of RNAs in a single experiment. This protocol is applicable in a variety of conditions and represents an important tool for understanding RNA biology. The protocol includes methods for the design and synthesis of RNA mixtures for study, and for the construction and analysis of structure-dependent sequencing libraries that reveal structural information about the RNAs in the mixtures. The methods are generally applicable to studying RNA structure and interactions in vitro in a variety of conditions, and allow for the rapid characterization of RNA structures in a high-throughput manner.

ID: CaltechAUTHORS:20170303-160835568

Abstract: Despite great interest in solving RNA secondary structures due to their impact on function, it remains an open problem to determine structure from sequence. Among experimental approaches, a promising candidate is the "chemical modification strategy", which involves application of chemicals to RNA that are sensitive to structure and that result in modifications that can be assayed via sequencing technologies. One approach that can reveal paired nucleotides via chemical modification followed by sequencing is SHAPE, and it has been used in conjunction with capillary electrophoresis (SHAPE-CE) and high-throughput sequencing (SHAPE-Seq). The solution of mathematical inverse problems is needed to relate the sequence data to the modified sites, and a number of approaches have been previously suggested for SHAPE-CE, and separately for SHAPE-Seq analysis. Here we introduce a new model for inference of chemical modification experiments, whose formulation results in closed-form maximum likelihood estimates that can be easily applied to data. The model can be specialized to both SHAPE-CE and SHAPE-Seq, and therefore allows for a direct comparison of the two technologies. We then show that the extra information obtained with SHAPE-Seq but not with SHAPE-CE is valuable with respect to ML estimation.

ID: CaltechAUTHORS:20170306-092934189

Abstract: Recent advances in high-throughput genomics technologies have resulted in the sequencing of large numbers of (near) complete genomes. These genome sequences are being mined for important functional elements, such as genes. They are also being compared and contrasted in order to identify other functional sequences, such as those involved in the regulation of genes. In cases where DNA sequences from different organisms can be determined to have originated from a common ancestor, it is natural to try to infer the ancestral sequences. The reconstruction of ancestral genomes can lead to insights about genome evolution, and the origins and diversity of function. There are a number of interesting foundational questions associated with reconstructing ancestral genomes: Which statistical models for evolution should be used for making inferences about ancestral sequences? How should extant genomes be compared in order to facilitate ancestral reconstruction? Which portions of ancestral genomes can be reconstructed reliably, and what are the limits of ancestral reconstruction? We discuss recent progress on some of these questions, offer some of our own opinions, and highlight interesting mathematics, statistics, and computer science problems.

No.: 64
ID: CaltechAUTHORS:20170307-135106127

Abstract: We study partitions of the symmetric group which have desirable geometric properties. The statistical tests defined by such partitions involve counting all permutations in the equivalence classes. These permutations are the linear extensions of partially ordered sets specified by the data. Our methods refine rank tests of non-parametric statistics, such as the sign test and the runs test, and are useful for the exploratory analysis of ordinal data. Convex rank tests correspond to probabilistic conditional independence structures known as semi-graphoids. Submodular rank tests are classified by the faces of the cone of submodular functions, or by Minkowski summands of the permutohedron. We enumerate all small instances of such rank tests. Graphical tests correspond to both graphical models and to graph associahedra, and they have excellent statistical and algorithmic properties.
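The counting step mentioned above, enumerating the linear extensions of a partially ordered set, can be illustrated with a short sketch. This is a generic dynamic program over subsets of elements, not the authors' method; the function name `count_linear_extensions` and the encoding of the order as `(smaller, larger)` pairs are assumptions made for this example.

```python
from functools import lru_cache


def count_linear_extensions(n, relations):
    """Count total orders of elements 0..n-1 compatible with a partial order.

    `relations` is a list of (a, b) pairs meaning a must come before b
    (hypothetical encoding chosen for this sketch).
    """
    # preds[x] is a bitmask of the elements that must precede x.
    preds = [0] * n
    for a, b in relations:
        preds[b] |= 1 << a

    full = (1 << n) - 1

    @lru_cache(maxsize=None)
    def count(placed):
        # `placed` is a bitmask of the elements already put in the order.
        if placed == full:
            return 1
        total = 0
        for x in range(n):
            not_placed = ((placed >> x) & 1) == 0
            preds_done = (preds[x] & ~placed) == 0
            if not_placed and preds_done:
                total += count(placed | (1 << x))
        return total

    return count(0)


# An antichain on 3 elements admits all 3! orderings; a chain admits one.
print(count_linear_extensions(3, []))
print(count_linear_extensions(3, [(0, 1), (1, 2)]))
```

The subset recursion visits at most 2^n states, so it is only practical for small posets, which is in keeping with enumerating small instances.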

ID: CaltechAUTHORS:20170307-095347077

Abstract: Micro-indels are small insertion or deletion events (indels) that occur during genome evolution. The study of micro-indels is important, both in order to better understand the underlying biological mechanisms, and also for improving the evolutionary models used in sequence alignment and phylogenetic analysis. The inference of micro-indels from multiple sequence alignments of related genomes poses a difficult computational problem, and is far more complicated than the related task of inferring the history of point mutations. We introduce a tree alignment based approach that is suitable for working with multiple genomes and that emphasizes the concept of indel history. By working with an appropriately restricted alignment model, we are able to propose an algorithm for inferring the optimal indel history of homologous sequences that is efficient for practical problems. Using data from the ENCODE project as well as related sequences from multiple primates, we are able to compare and contrast indel events in both coding and non-coding regions. The ability to work with multiple sequences allows us to refute a previous claim that indel rates are approximately fixed even when the mutation rate changes, and allows us to show that indel events are not neutral. In particular, we identify indel hotspots in the human genome.

No.: 3909 ISSN: 0302-9743

ID: CaltechAUTHORS:20170307-163715632

Abstract: We describe pair hidden Markov models, with an emphasis on their relationship to evolutionary models and hidden Markov models. We then explain the statistical interpretation of alignment with pair hidden Markov models, and highlight connections to the Needleman–Wunsch algorithm and other dynamic programming–based alignment algorithms.
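As a point of reference for the Needleman–Wunsch connection, the following is a minimal sketch of the global alignment recursion. The scoring values (`match=1`, `mismatch=-1`, `gap=-1`) and the function name are illustrative assumptions, not parameters taken from the work described; under the pair HMM view, such scores would play the role of log-probability terms.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of strings a and b."""
    n, m = len(a), len(b)
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [i * gap] + [0] * m
        for j in range(1, m + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(
                prev[j - 1] + step,  # align a[i-1] with b[j-1]
                prev[j] + gap,       # a[i-1] against a gap
                cur[j - 1] + gap,    # b[j-1] against a gap
            )
        prev = cur
    return prev[m]


print(needleman_wunsch_score("ACGT", "AGT"))
```

Only two rows of the table are kept because the sketch reports the score alone; recovering the alignment itself would require storing the full table or traceback pointers.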

ID: CaltechAUTHORS:20170308-113640795

Abstract: The Gibbs sampling method has been widely used for sequence analysis after it was successfully applied to the problem of identifying regulatory motif sequences upstream of genes. Since then, numerous variants of the original idea have emerged; however, in all cases the application has been to finding short motifs in collections of short sequences (typically less than 100 nucleotides long). In this paper we introduce a Gibbs sampling approach for identifying genes in multiple large genomic sequences up to hundreds of kilobases long. This approach leverages the evolutionary relationships between the sequences to improve the gene predictions, without explicitly aligning the sequences. We have applied our method to the analysis of genomic sequence from 14 genomic regions, totaling roughly 1.8 Mb of sequence in each organism. We show that our approach compares favorably with existing ab initio approaches to gene finding, including pairwise comparison-based gene prediction methods that make explicit use of alignments. Furthermore, excellent performance can be obtained with as few as 4 organisms, and the method overcomes a number of difficulties of previous comparison-based gene finding approaches: it is robust with respect to genomic rearrangements, can work with draft sequence, and is fast (linear in the number and length of the sequences). It can also be seamlessly integrated with Gibbs sampling motif detection methods.

ID: CaltechAUTHORS:20170308-141248581

Abstract: The application of Needleman-Wunsch alignment techniques to biological sequences is complicated by two serious problems when the sequences are long: the running time, which scales as the product of the lengths of sequences, and the difficulty in obtaining suitable parameters that produce meaningful alignments. The running time problem is often corrected by reducing the search space, using techniques such as banding, or chaining of high scoring pairs. The parameter problem is more difficult to fix, partly because the probabilistic model, which Needleman-Wunsch is equivalent to, does not capture a key feature of biological sequence alignments, namely the alternation of conserved blocks and seemingly unrelated non-conserved segments. We present a solution to the problem of designing efficient search spaces for pair hidden Markov models that align biological sequences by taking advantage of their associated features. Our approach leads to an optimization problem, for which we obtain a 2-approximation algorithm, and that is based on the construction of Manhattan networks, which are close relatives of Steiner trees. We describe the underlying theory and show how our methods can be applied to alignment of DNA sequences in practice, successfully reducing the Viterbi algorithm search space of alignment PHMMs by three orders of magnitude.
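Banding, one of the search-space reductions named above, can be sketched as follows. This illustrates only the simple fixed-width band, not the Manhattan-network construction of the paper; the function name, the `band` parameter, and the scoring values are assumptions for the example.

```python
def banded_alignment_score(a, b, band, match=1, mismatch=-1, gap=-1):
    """Global alignment score using only cells (i, j) with |i - j| <= band."""
    n, m = len(a), len(b)
    if abs(n - m) > band:
        raise ValueError("band too narrow to reach the final cell")
    neg_inf = float("-inf")
    score = {(0, 0): 0}
    for i in range(n + 1):
        # Visit only the columns of row i that lie inside the band.
        for j in range(max(0, i - band), min(m, i + band) + 1):
            if i == 0 and j == 0:
                continue
            best = neg_inf
            if i > 0 and j > 0:
                step = match if a[i - 1] == b[j - 1] else mismatch
                best = max(best, score.get((i - 1, j - 1), neg_inf) + step)
            if i > 0:
                best = max(best, score.get((i - 1, j), neg_inf) + gap)
            if j > 0:
                best = max(best, score.get((i, j - 1), neg_inf) + gap)
            score[(i, j)] = best
    return score[(n, m)]


# A wider band can only improve (or keep) the score found.
print(banded_alignment_score("GATTACA", "GCATGCG", band=0))
print(banded_alignment_score("GATTACA", "GCATGCG", band=1))
```

The number of cells visited is roughly the sequence length times the band width rather than the product of the two lengths, at the cost of missing alignments that stray outside the band.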

ID: CaltechAUTHORS:20170309-094423019

Abstract: We consider the problem of navigating between points in the plane so as to minimize the exposure to a radiating source. Specifically, given two points z_1, z_2 in the complex plane, we solve the problem of finding the path C(t) (0 ≤ t ≤ 1) such that C(0)=z_1, C(1)=z_2 and ∫^1_0 |C'(t)|/|C(t)|^k dt is minimized. The parameter k specializes to a number of interesting cases: in particular k=2 pertains to the passive sensor avoidance problem and k=4 entails the active radar avoidance problem. The avoidance paths which minimize exposure may have infinite arc-length. To overcome this problem we introduce a weighted exposure and path length optimization problem whose solution requires a variational approach. The optimal trajectory results we obtain are surprisingly intuitive in the cases of interest.
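To make the exposure functional concrete, here is a small numerical sketch that evaluates ∫ |C'(t)|/|C(t)|^k dt for a given path by a midpoint rule on short chords. It only evaluates the integral for candidate paths and does not compute the optimal trajectory; the function names and the step count are assumptions for the example, and the source is taken to sit at the origin as the integrand implies.

```python
import cmath
import math


def exposure(path, k, steps=20000):
    """Approximate the exposure integral of `path` (a map [0, 1] -> complex)."""
    total = 0.0
    prev = path(0.0)
    for i in range(1, steps + 1):
        cur = path(i / steps)
        mid = (prev + cur) / 2
        # Chord length divided by (distance to the origin) ** k.
        total += abs(cur - prev) / abs(mid) ** k
        prev = cur
    return total


def straight_line(z1, z2):
    """Straight segment from z1 to z2."""
    return lambda t: z1 + t * (z2 - z1)


def quarter_circle(t):
    """Quarter of the unit circle from 1 to i."""
    return cmath.exp(1j * math.pi / 2 * t)


# Compare two paths from z_1 = 1 to z_2 = i for k = 2.
print(exposure(straight_line(1, 1j), 2))
print(exposure(quarter_circle, 2))
```

For k = 2 the straight segment evaluates to about √2·π/2 ≈ 2.22, while the circular arc evaluates to about π/2 ≈ 1.57, so the shortest path between two points is not the one with the least exposure.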

Vol.: 4
ID: CaltechAUTHORS:20170309-101645949

Abstract: Hidden Markov models (HMMs) have been successfully applied to a variety of problems in molecular biology, ranging from alignment problems to gene finding and annotation. Alignment problems can be solved with pair HMMs, while gene finding programs rely on generalized HMMs in order to model exon lengths. In this paper we introduce the generalized pair HMM (GPHMM), which is an extension of both pair and generalized HMMs. We show how GPHMMs, in conjunction with approximate alignments, can be used for cross-species gene finding, and describe applications to DNA-cDNA and DNA-protein alignment. GPHMMs provide a unifying and probabilistically sound theory for modeling these problems.

ID: CaltechAUTHORS:20170309-084730577

Abstract: We describe a novel analytical approach to gene recognition based on cross-species comparison. We first undertook a comparison of orthologous genomic loci from human and mouse, studying the extent of similarity in the number, size and sequence of exons and introns. We then developed an approach for recognizing genes within such orthologous regions, by first aligning the regions using an iterative global alignment system and then identifying genes based on conservation of exonic features at aligned positions in both species. The alignment and gene recognition are performed by new programs called GLASS and ROSETTA, respectively. ROSETTA performed well at exact identification of coding exons in 117 orthologous pairs tested.

ID: CaltechAUTHORS:20170309-111631818

Abstract: This paper describes a fast and fully automated dictionary-based approach to gene annotation and exon prediction. Two dictionaries are constructed, one from the nonredundant protein OWL database and the other from the dbEST database. These dictionaries are used to obtain O(1) time lookups of tuples in the dictionaries (4-tuples for the OWL database and 11-tuples for the dbEST database). These tuples can be used to rapidly find the longest matches at every position in an input sequence to the database sequences. Such matches provide very useful information pertaining to locating common segments between exons, alternative splice sites, and frequency data of long tuples for statistical purposes. These dictionaries also provide the basis for both homology determination and statistical approaches to exon prediction. For instance, using the OWL protein database on a benchmark test set of 130 genes, and after removing sequences from the database with exact amino acid homology to genes in our test set, we find 88% of coding nucleotides, and 99% of our predictions of coding nucleotides are correct. Also, 81% of coding exons are predicted exactly, while 82% of our predictions of exons agree exactly with the published annotation of their genes.
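The tuple-dictionary idea can be sketched in a few lines: index every length-k substring of the database sequences in a hash table, then use constant-time seed lookups to find and extend matches at each position of a query. This is a simplified illustration under assumed names (`build_dictionary`, `longest_matches`) and toy data, not the implementation described in the paper.

```python
from collections import defaultdict


def build_dictionary(sequences, k):
    """Map each k-tuple to the (sequence index, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in enumerate(sequences):
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((seq_id, i))
    return index


def longest_matches(query, sequences, index, k):
    """For each query position, length of the longest database match (0 if < k)."""
    result = []
    for pos in range(len(query)):
        best = 0
        # Expected O(1) hash lookup of the k-tuple starting at this position.
        for seq_id, start in index.get(query[pos:pos + k], ()):
            seq = sequences[seq_id]
            length = k
            # Extend the seed match to the right as far as it goes.
            while (pos + length < len(query)
                   and start + length < len(seq)
                   and query[pos + length] == seq[start + length]):
                length += 1
            best = max(best, length)
        result.append(best)
    return result


database = ["ACGTACGT"]  # toy database for illustration only
index = build_dictionary(database, k=4)
print(longest_matches("TTACGTAA", database, index, k=4))
```

Matches shorter than k are reported as 0 because they never produce a seed hit; choosing k trades sensitivity to short matches against the number of spurious seeds.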

ID: CaltechAUTHORS:20170309-113403506

Abstract: The pebbling number of a graph G, f(G), is the least m such that, however m pebbles are placed on the vertices of G, we can move a pebble to any vertex by a sequence of moves, each move taking two pebbles off one vertex and placing one on an adjacent vertex. We give another proof that f(Q^n) = 2^n (Chung) and show that for most graphs f(G) = |V(G)| or |V(G)| + 1. We also find f(G) explicitly for certain classes of graphs (i.e. for odd cycles and squares of paths), characterize efficient graphs, show that most graphs have the 2-pebbling property, and obtain some results on optimal pebbling.
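The definition of f(G) can be checked directly on very small graphs by brute force: for each m, try every placement of m pebbles and every target vertex, and search over sequences of pebbling moves. The sketch below is only an illustration of the definition, with assumed helper names and an adjacency-list encoding; it is exponential and not a method from the paper.

```python
from itertools import combinations_with_replacement


def distributions(m, n):
    """Yield every placement of m indistinguishable pebbles on n vertices."""
    for combo in combinations_with_replacement(range(n), m):
        config = [0] * n
        for v in combo:
            config[v] += 1
        yield tuple(config)


def can_reach(adj, config, target):
    """Can some sequence of pebbling moves put a pebble on `target`?"""
    seen = {config}
    stack = [config]
    while stack:
        state = stack.pop()
        if state[target] >= 1:
            return True
        for u, count in enumerate(state):
            if count < 2:
                continue
            for v in adj[u]:
                # One move: remove two pebbles from u, add one to neighbor v.
                nxt = list(state)
                nxt[u] -= 2
                nxt[v] += 1
                nxt = tuple(nxt)
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return False


def pebbling_number(adj):
    """Least m such that every placement of m pebbles reaches every vertex."""
    n = len(adj)
    m = 1
    while True:
        if all(can_reach(adj, config, target)
               for config in distributions(m, n)
               for target in range(n)):
            return m
        m += 1


def hypercube(dim):
    """Adjacency lists of the hypercube Q^dim on vertices 0..2^dim - 1."""
    return [[v ^ (1 << b) for b in range(dim)] for v in range(1 << dim)]


print(pebbling_number(hypercube(2)))
```

For Q^2 the search returns 4, in agreement with f(Q^n) = 2^n. The search terminates because every move reduces the total number of pebbles, but the number of placements grows quickly, so this is feasible only for graphs with a handful of vertices.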

No.: 107
ID: CaltechAUTHORS:20170309-150137723
