Advisor Feed
https://feeds.library.caltech.edu/people/Bruck-J/advisor.rss
A Caltech Library Repository Feed
http://www.rssboard.org/rss-specification
python-feedgen
en
Thu, 30 Nov 2023 19:05:21 +0000

Neural logic : theory and implementation
https://resolver.caltech.edu/CaltechETD:etd-05132005-143440
Authors: Bohossian, Vasken Z.
Year: 1998
DOI: 10.7907/N0T5-7J92
NOTE: Text or symbols not renderable in plain ASCII are indicated by [...]. Abstract is included in .pdf document.
Human brains are by far superior to computers in solving hard problems like combinatorial optimization and image and speech recognition, although their basic building blocks are several orders of magnitude slower. This observation has boosted interest in the field of artificial neural networks [20], [37]. The latter are built by interconnecting artificial neurons whose behavior is inspired by that of biological neurons. In this thesis we consider the Boolean version of an artificial neuron, namely, a Linear Threshold (LT) element, which computes a neural-like Boolean function of n binary inputs [32]. An LT element outputs the sign of a weighted sum of its Boolean inputs. The main issues in the study of networks (circuits) of LT elements, called LT circuits, include the estimation of their computational capabilities and limitations and the comparison of their properties with those of traditional Boolean logic circuits based on AND, OR, and NOT gates (called AON circuits). For example, there is strong evidence that LT circuits are more efficient than AON circuits in implementing a number of important functions, including integer addition, multiplication, and division [44], [45].
It is easy to see that an LT element is more powerful than an AON gate, simply because of the freedom one has in selecting the weights. Indeed, different choices of weights produce different Boolean functions; in fact, the number of n-input Boolean functions that can be implemented by a single LT element is of the order of [...] [42], [22]. That additional power comes at the cost of added complexity. Some LT functions require weights that are very different in magnitude, which can make hardware or software implementations of the corresponding LT elements difficult. For that reason, theoretical research in the field of LT circuits has focused on the weights, in particular on the power of LT elements with restricted weights. As early as 1971, Muroga [32] proved that any linear threshold element can be implemented with integer weights; that is, by restricting the magnitudes of the weights to natural numbers, one does not lose any of the power of the original LT element. We generalize this result to arbitrary subsets of the real numbers. For example, we show that one can restrict the weights to be squares of integers and still realize all LT functions. We ask the following question: what are the conditions on the subset [...] that guarantee that all LT functions can be implemented with weights drawn from it?
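An LT element is simple to state concretely. The sketch below shows the sign-of-weighted-sum computation; the function names and the majority example are illustrative, not drawn from the thesis.

```python
def lt_element(weights, threshold, x):
    """Return 1 if the weighted sum of the binary inputs x meets the
    threshold, else 0 (the 'sign' of the shifted weighted sum)."""
    s = sum(w * xi for w, xi in zip(weights, x))
    return 1 if s >= threshold else 0

# The 3-input majority function as an LT element with integer weights:
# weight 1 on each input, threshold 2.
def majority(x):
    return lt_element([1, 1, 1], 2, x)
```

Different weight and threshold choices yield different Boolean functions: the same element with threshold 3 computes the 3-input AND, and with threshold 1 the 3-input OR.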
Another aspect of the complexity of the weights is their growth as the number of inputs increases. It has been shown [17], [33], [38], [43] that there exist linear threshold functions that can be implemented by a single threshold element with exponentially growing weights but cannot be implemented by a threshold element with smaller, polynomially growing weights. In light of that result, this question has been studied by defining a class, called [...], within the set of linear threshold functions: the class of functions with "small" (i.e., polynomially growing) weights [43]. We focus on a single LT element. Our contribution consists of two novel methods for constructing threshold functions with minimal weights, which allow us to fill the gap between polynomial and exponential weight growth by further refining the separation. Namely, we prove that the class of linear threshold functions with polynomial-size weights can be divided into subclasses [...] according to the degree, d, of the polynomial. In fact, we prove a more general result: there exists a linear threshold function for any number of inputs and any weight size.
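A standard concrete example of a threshold function whose natural weights grow exponentially is the comparison of two n-bit numbers (the COMP function mentioned later in this abstract). The sketch below is illustrative, not the thesis's construction: weights +2^i on the bits of x and -2^i on the bits of y realize COMP in a single element, but the weights double with each additional bit.

```python
def comp(x_bits, y_bits):
    """COMP(x, y) = 1 iff binary number x >= binary number y.
    Inputs are tuples of 0/1 bits, most significant bit first.
    Realized as a single threshold element with weights +/- 2^i."""
    n = len(x_bits)
    s = sum(2 ** (n - 1 - i) * (x_bits[i] - y_bits[i]) for i in range(n))
    return 1 if s >= 0 else 0
```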
Even though some LT functions require weights that grow exponentially with the number of input variables, it has been shown recently, in [13], [18], that such functions can be replaced by a two-layer circuit composed of LT gates with polynomially growing, i.e., small, weights. We improve the best known bound on the size of that circuit, presented in [18], by focusing on a particular function with large coefficients. We also derive explicit two-layer circuits. Two-layer LT circuits are in general composed of different linear threshold elements, but for some useful Boolean functions, such as parity, addition, and product, the gates of the first layer are almost identical. To take advantage of this fact we introduce a new Boolean computing element. Instead of the sign function, it computes an arbitrary Boolean function (with polynomially many transitions) of the weighted sum of its inputs. We call the new computing element an LTM element, which stands for Linear Threshold with Multiple transitions. The advantages of LTM become apparent in the context of VLSI implementation. Indeed, this new model reduces the layout area of the corresponding symmetric function from [...] to O(n). We present VLSI implementations of both LT and LTM elements. Two kinds of elements were fabricated, programmable and hardwired. The programmable elements use the charge on a floating gate to store the values of the weights.
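The LTM idea can be sketched minimally, with an assumed transition-list representation that is not necessarily the thesis's formalism: the element outputs a Boolean function of the weighted sum that flips at each of polynomially many transition points. Parity, for example, becomes a single LTM element.

```python
def ltm_element(weights, transitions, x):
    """LTM sketch: the output starts at 0 and flips once for every
    transition point t with weighted sum >= t; equivalently, it is the
    parity of the number of crossed transitions."""
    s = sum(w * xi for w, xi in zip(weights, x))
    return sum(1 for t in transitions if s >= t) % 2

# n-input parity as a single LTM element: unit weights and transitions
# at 1, 2, ..., n -- i.e., polynomially many transitions.
def parity(x):
    n = len(x)
    return ltm_element([1] * n, list(range(1, n + 1)), x)
```

With a single transition the element degenerates to an ordinary LT element, which is why LTM strictly generalizes LT.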
For many years, the topic of linear threshold logic has been approached in two different ways: theory, i.e., computational circuit complexity [38], [56], and hardware implementation [48], [40]. Surprisingly, there has been very little interaction between those two approaches. As a whole, the present thesis is one step towards establishing a connection between the theory and implementation of threshold circuits. Its contributions are at three levels. At the theoretical level, new classes of functions such as [...] and LTM are defined and their computational power is estimated. At the algorithmic level, we show how to convert real weights to weights drawn from an arbitrary subset of the real numbers, e.g., integer weights; we also show how to construct LT functions with minimal weights; and finally we present an algorithm that produces an [...] circuit (a circuit composed of gates with small weights) that computes the comparison function, COMP. We also present LTM circuits computing useful functions, such as XOR, ADD, and PRODUCT. At the implementation level, we show the design, layout, and testing of VLSI implementations of LT and LTM. Establishing a connection between the theoretical and practical aspects of threshold logic will benefit both domains by providing solutions for practical problems and by defining new theoretical questions inspired by implementation issues.
https://thesis.library.caltech.edu/id/eprint/1775

Highly available distributed storage systems
https://resolver.caltech.edu/CaltechETD:etd-05162005-084223
Authors: Xu, Lihao
Year: 1999
DOI: 10.7907/EQK9-8C84
As the need for data explodes with the passage of time and the increase of computing power, data storage becomes more and more important. Distributed storage, as distributed computing before it, is coming of age as a good solution to make systems highly available, i.e., highly scalable, reliable and efficient. The focus of this thesis is how to achieve data reliability and efficiency in distributed storage systems. This thesis consists of two parts. The first part deals with the reliability of distributed storage systems. Reliability is achieved by computationally efficient MDS array codes that eliminate single points of failure in the systems, thus providing more reliability and flexibility to the systems. Such codes can be used as general MDS error-correcting codes. They are particularly suitable for use in distributed storage systems. The second part deals with the efficiency of distributed storage systems. Methods are proposed to improve the performance of data server and storage systems significantly through the proper use of data redundancy. These methods are based on error-correcting codes, particularly the MDS array codes developed in the first part.
Two new classes of MDS array codes are presented: the X-Code and the B-Code. The encoding operations of both codes are optimal, i.e., their update complexity achieves the theoretical lower bound. They distribute parity bits over all columns rather than concentrating them in a few parity columns. As with other array codes, the error model for both codes treats whole columns as errors or erasures: if at least one bit of a column is in error or erased, the whole column is considered an error or erasure. Both codes have distance 3, i.e., they can correct two erasures, detect two errors, or correct one error. In addition to encoding algorithms, efficient decoding algorithms are proposed, both for erasure correction and for error correction. In fact, the erasure-correcting algorithms are also optimal in terms of computational complexity.
The X-Code has a very simple geometrical structure: the parity bits are constructed along two groups of parallel parity lines of slopes 1 and -1. This is the origin of the name X-Code. This simple geometrical structure allows simple erasure-decoding and error-decoding algorithms, using only XORs and vector cyclic-shift operations.
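The encoding can be sketched as follows; the exact diagonal index convention below is an assumption for illustration, not necessarily the thesis's convention.

```python
def xcode_encode(data, n):
    """Sketch of X-Code encoding for a prime column count n: the array
    has n columns, rows 0..n-3 hold information bits, and two appended
    parity rows are computed along diagonals of slope +1 and slope -1."""
    p1 = [0] * n  # parity along slope +1 diagonals
    p2 = [0] * n  # parity along slope -1 diagonals
    for i in range(n):
        for k in range(n - 2):
            p1[i] ^= data[k][(i + k + 2) % n]
            p2[i] ^= data[k][(i - k - 2) % n]
    return data + [p1, p2]
```

Note the optimal update complexity: flipping a single information bit flips exactly one bit in each parity row, since each information bit lies on exactly one diagonal of each slope.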
The significance of the B-Code lies not only in its optimality properties (MDS, optimal encoding, and optimal decoding) but also in its relation to a three-decade-old problem in graph theory. It is proven in this thesis that constructing a B-Code of odd length is exactly equivalent to constructing a perfect one-factorization (P1F) of a complete graph. Whether an arbitrary complete graph admits a P1F has remained a conjecture since the early 1960s. Though the P1F conjecture remains unsolved, the B-Code, as the first real application of the P1F problem, will hopefully spur more research on it. It is also conjectured in this thesis that constructing a B-Code of any length, even or odd, is equivalent to constructing a P1F of a complete graph. An efficient error-correcting algorithm for the B-Code is also presented, based on the relations between the B-Code and its dual. The algorithm might hint at how to develop efficient decoding algorithms for other codes.
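For intuition about one-factorizations, the classical "circle method" starter construction partitions the edges of the complete graph K_{2n} into 2n - 1 perfect matchings; for 2n - 1 prime this construction is known to be perfect (the union of any two distinct factors is a single Hamiltonian cycle), which is the structure the B-Code equivalence concerns. The sketch below is this textbook construction, not the thesis's B-Code construction.

```python
def one_factorization(two_n):
    """Circle-method one-factorization of K_{2n}: vertex 2n - 1 acts as a
    fixed 'infinity' vertex, and a starter matching is rotated 2n - 1
    times. Returns a list of 2n - 1 perfect matchings (edge lists)."""
    m = two_n - 1
    factors = []
    for r in range(m):
        matching = [(m, r)]  # infinity paired with the rotating vertex r
        for i in range(1, two_n // 2):
            matching.append(((r + i) % m, (r - i) % m))
        factors.append(matching)
    return factors
```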
While it is intuitive that redundancy can bring reliability to a system, this thesis also pursues another direction: using redundancy actively to improve the performance (efficiency) of distributed data systems. The results in this direction are both theoretical and experimental. System models are extracted from experiments on real, practical systems; analytical results are derived from these models and then fed back to experiments for verification.
In this thesis, a novel deterministic voting scheme that uses error-correcting codes is proposed. The voting scheme generalizes all known simple deterministic voting algorithms. It can be tuned to various application environments with different error rates to drastically reduce average communication complexity, i.e., the amount of information that must be transmitted in order to get correct voting results.
Two problems are identified to improve the performance of general data server systems, namely the data distribution problem and the data acquisition problem. Solutions to these are proposed, as are general analytical results on performance of (n, k) systems. A simple service time model of a practical disk-based distributed server system is given. This model, which is based on experimental results, is a starting point for data distribution and data acquisition schemes. These results, both experimental and analytical, can be further used for more sophisticated scheduling schemes to optimize or improve the performance of data server systems that serve multiple clients simultaneously.
Finally, some research problems related to storage systems are proposed as future directions.

https://thesis.library.caltech.edu/id/eprint/1834

Computational methods for stochastic biological systems
https://resolver.caltech.edu/CaltechETD:etd-05132005-154222
Authors: Gibson, Michael Andrew
Year: 2000
DOI: 10.7907/G4X0-D149
The virtual completion of the genome project and prodigious amounts of work by biologists throughout the world have elucidated many of the components of biological systems. The genes (and hence proteins) are largely known, and the tools of molecular biology allow one to manufacture and express them so as to understand their function. Given this increased understanding of components, the next step in understanding complex biology will be understanding systems, which will almost certainly involve formal, detailed, and quantitative models.
One of the great challenges of modeling biological systems is that they tend to "break the math." Biological systems have small numbers of molecules, operate far from equilibrium, change shape and size, etc. This thesis develops mathematical and computational tools for biological systems with few molecules. Such systems are particularly problematic because the usual macroscopic view of chemistry, in which concentrations of molecules vary continuously, continually, and deterministically, does not work. Rather, one needs to use the mesoscopic view of chemistry: molecules undergo discrete reaction events, and the timing of these events is probabilistic. There are many standard numerical computational techniques for the macroscopic view, but far fewer for the mesoscopic view.
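The mesoscopic view can be made concrete with a minimal exact stochastic simulation in the style of Gillespie's direct method, sketched here for a toy decay reaction A → B; the thesis's contribution is a more efficient algorithm of this kind, not this naive loop.

```python
import math
import random

def simulate_decay(n_a, c, t_end, seed=0):
    """Exact stochastic simulation of A -> B with rate constant c per A
    molecule, starting from n_a molecules, run until time t_end.
    Returns the number of A molecules remaining."""
    rng = random.Random(seed)
    t = 0.0
    while n_a > 0:
        propensity = c * n_a  # total rate of the next reaction event
        # exponentially distributed waiting time until the next event
        dt = -math.log(1.0 - rng.random()) / propensity
        if t + dt > t_end:
            break
        t += dt
        n_a -= 1  # one discrete reaction event fires
    return n_a
```

After every event the propensity is recomputed, so the timing of each discrete reaction is exact rather than a fixed-step approximation of a continuous concentration.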
This thesis develops (1) an efficient, exact stochastic simulation algorithm, to generate trajectories of mesoscopic biological systems, (2) a sensitivity analysis algorithm, to quantify how a model's predictions depend on the exact values of parameters (e.g., rate constants) used, and (3) a parameter estimation algorithm, to estimate the values of model parameters from observed trajectories.

https://thesis.library.caltech.edu/id/eprint/1778

Fault-tolerant cluster of networking elements
https://resolver.caltech.edu/CaltechETD:etd-08152001-144501
Authors: Fan, Chenggong Charles
Year: 2001
DOI: 10.7907/R15B-VD58
The explosive growth of the Internet demands higher reliability and performance than the current networking infrastructure can provide. This dissertation explores novel architectures and protocols that provide a methodology for grouping multiple networking elements, such as routers, gateways, and switches, into a more reliable and higher-performance distributed networking system. Clustering of networking elements is a novel concept that requires the invention of distributed computing protocols that efficiently and robustly support networking protocols. We introduce the Raincore protocol architecture, which achieves these goals by bridging the fields of computer networks and distributed systems.
In designing Raincore, we paid special attention to the unique requirements of the networking environment. First, networking clusters need to scale up networking throughput in addition to computing power. Second, task switching between the different services supported by a networking element has a major negative impact on performance. Third, fast fail-over is critical for maintaining network connections in the event of failures. We discuss in depth the design of the Raincore Group Communication Manager, which addresses the foregoing requirements and provides group membership management and reliable multicast transport. It is based on a novel token-ring protocol. We prove that this protocol is formally correct, namely, that it satisfies the set of formal specifications that defines the Group Membership problem.
The creation of Raincore has already made a substantial impact at Caltech, in the academic community, and in industry. The first application is SNOW, a scalable web-server cluster that is part of RAIN, a collaborative project between Caltech and JPL/NASA. The second application is RainWall, a commercial solution created by Rainfinity, a Caltech spin-off company, that provides the first fault-tolerant and scalable firewall cluster. These applications exhibit the fast fail-over response, low overhead, and near-linear scalability of the Raincore protocols.
In addition, we studied fault-tolerant networking architectures. In particular, we considered efficient constructions of extra-stage fault-tolerant Multistage Interconnection Networks, which provide a way to construct a larger switching network from smaller switching elements. We discovered a family of constructions that is optimal in the sense that it requires the fewest extra components to tolerate multiple switching-element failures, and we prove that it is the only family of constructions with this optimal fault-tolerance property.
https://thesis.library.caltech.edu/id/eprint/3121

Periodic Broadcast Scheduling for Data Distribution
https://resolver.caltech.edu/CaltechETD:etd-05132005-151145
Authors: Foltz, Kevin E.
Year: 2002
DOI: 10.7907/980Q-RQ20
As wireless computer networks grow in size and complexity, we are faced with the problem of providing scalable, high-bandwidth service to their users. Wired networks typically use "data pull," where users send requests to a server and the server responds with the desired information. In the wireless domain, "data push" promises to provide better performance for many applications [1]. The broadcast domain that is typical of wireless communication is very effective in distributing information to large audiences.
The idea of broadcast disks has been around since the Teletext system [3]. There is now interest in applying these ideas to wireless computer networks, which raises interesting research questions about scheduling for data distribution. Computing optimal schedules has been shown to be difficult [18]. The optimal schedules themselves, however, seem to be less complex, and are often periodic [4]. Xu [24] looks at the scheduling of streaming data, which involves splitting the data into smaller pieces. Error correction is also important for wireless transmission due to the noisy nature of the channel [6].
We look at scheduling data for broadcast. We compare time-division scheduling and frequency-division scheduling for data items of equal length. We show that time-division is better for sending dynamic data. We then find optimal time-division schedules for two items. We show how the freedom to split items into smaller pieces can give improvements in performance. With a single split, where each of two items is split in half, we find the optimal schedules for items of equal length.
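For a cyclic schedule of equal-length items, the usual figure of merit is the expected wait of a client arriving at a uniformly random time: the client waits for the start of its item's next transmission, so the mean wait can be computed from the gaps between consecutive occurrences. The sketch below is illustrative; the schedules and names are hypothetical, not the thesis's optimal schedules.

```python
def mean_wait(schedule, item):
    """Expected wait (in slots) for `item` by a client arriving at a
    uniformly random time, for a cyclic schedule of equal-length items.
    For each gap g between starts, arrivals in that gap wait g/2 on
    average, so the mean wait is sum(g^2) / (2 * T)."""
    T = len(schedule)
    starts = [i for i, s in enumerate(schedule) if s == item]
    # gaps between consecutive starts, wrapping around the cycle
    gaps = [(starts[(j + 1) % len(starts)] - starts[j]) % T or T
            for j in range(len(starts))]
    return sum(g * g for g in gaps) / (2 * T)
```

Skewing the schedule toward one item (e.g., "AAB" rather than "AB") lowers that item's wait at the expense of the other; balancing this trade-off is what optimal broadcast schedules do.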
We continue with the idea of splitting items, and show what happens when the number of splits is very large. Then, we examine what happens when we add streaming data to our broadcast. We compare time-division and frequency-division as before, and now also look at a mix of the two. We prove bounds on where the mix is the best broadcast method.

https://thesis.library.caltech.edu/id/eprint/1777

Wireless Networks, from Collective Behavior to the Physics of Propagation
https://resolver.caltech.edu/CaltechETD:etd-05202003-154451
Authors: Franceschetti, Massimo
Year: 2003
DOI: 10.7907/SCTG-FN57
This thesis addresses some of the key challenges in the emerging wireless scenario. It focuses on the problems of connectivity, coverage, and wave propagation, following a mathematically rigorous approach. The questions addressed are very basic and extremely easy to state. Their solution, however, can be difficult and leads to the development of a new kind of percolation theory, to a new theorem in geometry, and to a new model of wave propagation in urban environments. The problems are connected together to provide guidelines in the design of wireless networks.
https://thesis.library.caltech.edu/id/eprint/1887

Optimized Network Data Storage and Topology Control
https://resolver.caltech.edu/CaltechETD:etd-05272004-163315
Authors: Jiang, Anxiao (Andrew)
Year: 2004
DOI: 10.7907/91R7-MH71
<p>This thesis addresses two key challenges for network data-storage systems: optimizing data placement for highly efficient and robust data access, and constructing network topologies that make data transmission scalable in both network size and network dynamics. It focuses on two new topics: data placement using erasure-correcting codes, and topology control for nodes in normed spaces. The first topic generalizes traditional file-assignment problems and has the distinct feature of interleaved data placement in networks. The second topic emphasizes the construction of network topologies that achieve excellent global performance, by a comprehensive set of measures, through purely local decisions on connectivity. The results of the thesis deepen the current understanding of these important and intriguing topics and follow a mathematically rigorous approach.</p>
https://thesis.library.caltech.edu/id/eprint/2137

Cyclic Combinational Circuits
https://resolver.caltech.edu/CaltechETD:etd-05032004-153842
Authors: Riedel, Marcus D.
Year: 2004
DOI: 10.7907/410B-XR25
<p>A collection of logic gates forms a combinational circuit if the outputs can be described as Boolean functions of the current input values only. Optimizing combinational circuitry, for instance, by reducing the number of gates (the area) or by reducing the length of the signal paths (the delay), is an overriding concern in the design of digital integrated circuits.</p>
<p>The accepted wisdom is that combinational circuits must have acyclic (i.e., loop-free or feed-forward) topologies. In fact, the idea that "combinational" and "acyclic" are synonymous terms is so thoroughly ingrained that many textbooks provide the latter as a definition of the former. And yet simple examples suggest that this is incorrect. In this dissertation, we advocate the design of cyclic combinational circuits (i.e., circuits with loops or feedback paths). We demonstrate that circuits can be optimized effectively for area and for delay by introducing cycles.</p>
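<p>One standard way to see that a circuit with feedback can still be combinational is ternary (three-valued) fixed-point evaluation: start every wire at "unknown" and iterate the gates, letting a gate resolve whenever its known inputs already determine its output. The toy two-multiplexer cycle below is an illustrative example of this analysis, not one of the dissertation's circuits.</p>

```python
def mux(s, t, f):
    """Three-valued MUX: if the select is known, the chosen branch is
    returned even if the other branch is still unknown (None)."""
    if s == 1:
        return t
    if s == 0:
        return f
    return None  # select unknown => output unknown

def eval_cycle(x, a, b, iterations=4):
    """Two MUXes wired in a cycle: out = MUX(x, a, w); w = MUX(x, out, b).
    Despite the feedback loop, ternary iteration reaches a fixed point
    with out = (a if x else b), so the circuit is combinational."""
    out = w = None  # all wires start unknown
    for _ in range(iterations):
        out = mux(x, a, w)
        w = mux(x, out, b)
    return out
```

<p>The select input x breaks the cycle in either setting, which is exactly why the loop never oscillates: for every input assignment, all wires resolve to definite values.</p>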
<p>On the theoretical front, we discuss lower bounds and we show that certain cyclic circuits are one-half the size of the best possible equivalent acyclic implementations. On the practical front, we describe an efficient approach for analyzing cyclic circuits, and we provide a general framework for synthesizing such circuits. On trials with industry-accepted benchmark circuits, we obtained significant improvements in area and delay in nearly all cases. Based on these results, we suggest that it is time to re-write the definition: combinational might well mean cyclic.</p>

https://thesis.library.caltech.edu/id/eprint/1591

Networks of Relations
https://resolver.caltech.edu/CaltechETD:etd-06032005-140944
Authors: Cook, Matthew M.
Year: 2005
DOI: 10.7907/CVKM-D684
<p>Relations are everywhere. In particular, we think and reason in terms of mathematical and English sentences that state relations. However, we teach our students much more about how to manipulate functions than about how to manipulate relations. Consider functions. We know how to combine functions to make new functions, how to evaluate functions efficiently, and how to think about compositions of functions. Especially in the area of boolean functions, we have become experts in the theory and art of designing combinations of functions to yield what we want, and this expertise has led to techniques that enable us to implement mind-bogglingly large yet efficient networks of such functions in hardware to help us with calculations. If we are to make progress in getting machines to be able to reason as well as they can calculate, we need to similarly develop our understanding of relations, especially their composition, so we can develop techniques to help us bridge between the large and small scales. There has been some important work in this area, ranging from practical applications such as relational databases to extremely theoretical work in universal algebra, and sometimes theory and practice manage to meet, such as in the programming language Prolog, or in the probabilistic reasoning methods of artificial intelligence. However, the real adventure is yet to come, as we learn to develop a better understanding of how relations can efficiently and reliably be composed to get from a low level representation to a high level representation, as this understanding will then allow the development of automated techniques to do this on a grand scale, finally enabling us to build machines that can reason as amazingly as our contemporary machines can calculate.</p>
<p>This thesis explores new ground regarding the composition of relations into larger relational structures. First of all a foundation is laid by examining how networks of relations might be used for automated reasoning. We define exclusion networks, which have close connections with the areas of constraint satisfaction problems, belief propagation, and even boolean circuits. The foundation is laid somewhat deeper than usual, taking us inside the relations and inside the variables to see what is the simplest underlying structure that can satisfactorily represent the relationships contained in a relational network. This leads us to define zipper networks, an extremely low-level view in which the names of variables or even their values are no longer necessary, and relations and variables share a common substrate that does not distinguish between the two. A set of simple equivalence operations is found that allows one to transform a zipper network while retaining its solution structure, enabling a relation-variable duality as well as a canonical form on linear segments. Similarly simple operations allow automated deduction to take place, and these operations are simple and uniform enough that they are easy to imagine being implemented by biological neural structures.</p>
<p>The canonical form for linear segments can be represented as a matrix, leading us to matrix networks. We study the question of how we can perform a change of basis in matrix networks, which brings us to a new understanding of Valiant's recent holographic algorithms, a new source of polynomial time algorithms for counting problems on graphs that would otherwise appear to take exponential time. We show how the holographic transformation can be understood as a collection of changes of basis on individual edges of the graph, thus providing a new level of freedom to the method, as each edge may now independently choose a basis so as to transform the matrices into the required form.</p>
<p>Consideration of zipper networks makes it clear that "fan-out," i.e., the ability to duplicate information (for example allowing a variable to be used in many places), is most naturally itself represented as a relation along with everything else. This is a notable departure from the traditional lack of representation for this ability. This deconstruction of fan-out provides a more general model for combining relations than was provided by previous models, since we can examine both the traditional case where fan-out (the equality relation on three variables) is available and the more interesting case where its availability is subject to the same limitations as the availability of other relations. As we investigate the composition of relations in this model where fan-out is explicit, what we find is very different from what has been found in the past.</p>
<p>First of all we examine the relative expressive power of small relations: For each relation on three boolean variables, we examine which others can be implemented by networks built solely from that relation. (We also find, in each of these cases, the complexity of deciding whether such a network has a solution. We find that solutions can be found in polynomial time for all but one case, which is NP-complete.) For the question of which relations are able to implement which others, we provide an extensive and complete answer in the form of a hierarchy of relative expressive power for these relations. The hierarchy for relations is more complex than Post's well-known comparable hierarchy for functions, and parts of it are particularly difficult to prove. We find an explanation for this phenomenon by showing that in fact, the question of whether one relation can implement another (and thus should be located above it in the hierarchy) is undecidable. We show this by means of a complicated reduction from the halting problem for register machines. The hierarchy itself has a lot of structure, as it is rarely the case that two ternary boolean relations are equivalent. Often they are comparable, and often they are incomparable; the hierarchy has quite a bit of width as well as depth. Notably, the fan-out relation is particularly difficult to implement; only a very few relations are capable of implementing it. This provides an additional ex post facto justification for considering the case where fan-out is absent: If you are not explicitly provided with fan-out, you are unlikely to be able to implement it.</p>
<p>The undecidability of the hierarchy contrasts strongly with the traditional case, where the ubiquitous availability of fan-out causes all implementability questions to collapse into a finite decidable form. Thus we see that for implementability among relations, fan-out leads to undecidability. We then go on to examine whether this result might be taken back to the world of functions to find a similar difference there. As we study the implementability question among functions without fan-out, we are led directly to questions that are independently compelling, as our functional implementability question turns out to be equivalent to asking what can be computed by sets of chemical reactions acting on a finite number of species. In addition to these chemical reaction networks, several other nondeterministic systems are also found to be equivalent in this way to the implementability question, namely, Petri nets, unordered Fractran, vector addition systems, and "broken" register machines (whose decrement instruction may fail even on positive registers). We prove equivalences between these systems.</p>
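<p>The systems shown equivalent above share a common skeleton: a state is a vector of nonnegative counts (molecule counts, Petri-net tokens, or register contents), and a transition fires by consuming and producing counts. A bounded reachability search over such a system can be sketched as follows; the toy reaction A + B → C and the bound are illustrative assumptions, not constructions from the thesis.</p>

```python
from collections import deque

def reachable(start, transitions, bound):
    """Breadth-first search of all states reachable from `start` with
    every coordinate below `bound`. Each transition is a pair
    (consume, produce) of count vectors; it is enabled when the state
    has at least the consumed counts."""
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for consume, produce in transitions:
            if all(s >= c for s, c in zip(state, consume)):
                nxt = tuple(s - c + p for s, c, p in
                            zip(state, consume, produce))
                if nxt not in seen and all(v < bound for v in nxt):
                    seen.add(nxt)
                    queue.append(nxt)
    return seen

# Toy chemical reaction network: A + B -> C, i.e., consume (1, 1, 0)
# and produce (0, 0, 1) on the count vector (A, B, C).
```

<p>Without the bound, the search need not terminate, which is one face of why reachability questions for these systems are subtle even when decidable.</p>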
<p>We find several interesting results in particular for chemical reaction networks, where the standard model has reaction rates that depend on concentration. In this setting, we analyze questions of possibility as well as questions of probability. The question of the possibility of reaching a target state turns out to be equivalent to the reachability question for Petri nets and vector addition systems, which has been well studied. We provide a new proof that a form of this reachability question can be decided by primitive recursive functions. Ours is the first direct proof of this relationship, avoiding the traditional excursion to Diophantine equations, and thus providing a crisper picture of the relationship between Karp's coverability tree and primitive recursive functions.</p>
<p>In contrast, the question of finding the probability (according to standard chemical kinetics) of reaching a given target state turns out to be undecidable. Another way of saying this is that if we wish to distinguish states with zero probability of occurring from states with positive probability of occurring, we can do so, but if we wish to distinguish low probability states from high probability states, there is no general way to do so. Thus, if we wish to use a chemical reaction network to perform a computation, then if we insist that the network must always get the right answer, we will only be able to use networks with limited computational power, but if we allow just the slightest probability of error, then we can use networks with Turing-universal computational ability. This power of probability is quite surprising, especially when contrasted with the conventional computational complexity belief that BPP = P.</p>
<p>Exploring the source of this probabilistic power, we find that the probabilities guiding the network need to depend on the concentrations (or perhaps on time)—fixed probabilities aren’t enough on their own to achieve this power. In the language of Petri nets, if one first picks a transition at random, and then fires it if it is enabled, then the probability of reaching a particular target state can be calculated to arbitrary precision, but if one first picks a token at random, and then fires an enabled transition that will absorb that token, then the probability of reaching a particular target state cannot in general be calculated to any precision whatsoever.</p>
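The two firing semantics contrasted above can be made concrete in a toy simulator. The sketch below is ours, not the thesis's: states are maps from species names to counts, a reaction is a (consumed, produced) pair of such maps, and the two `step_*` functions implement "pick a transition uniformly and fire it if enabled" versus "pick proportionally to concentration-dependent propensities" (standard stochastic kinetics).

```python
import random

# Toy chemical reaction network. A state maps species names to counts;
# a reaction is a pair (consumed, produced) of such count maps.

def enabled(state, reaction):
    consumed, _ = reaction
    return all(state.get(s, 0) >= k for s, k in consumed.items())

def fire(state, reaction):
    consumed, produced = reaction
    new = dict(state)
    for s, k in consumed.items():
        new[s] = new.get(s, 0) - k
    for s, k in produced.items():
        new[s] = new.get(s, 0) + k
    return new

def propensity(state, reaction):
    # Mass-action propensity: number of ways to pick the reactant molecules.
    consumed, _ = reaction
    ways = 1
    for s, k in consumed.items():
        for j in range(k):
            ways *= max(state.get(s, 0) - j, 0)
    return ways

def step_uniform(state, reactions, rng=random):
    # Semantics 1: pick a transition uniformly; fire it only if enabled.
    r = rng.choice(reactions)
    return fire(state, r) if enabled(state, r) else state

def step_kinetic(state, reactions, rng=random):
    # Semantics 2: pick a reaction with probability proportional to its
    # concentration-dependent propensity (standard stochastic kinetics).
    weights = [propensity(state, r) for r in reactions]
    if sum(weights) == 0:
        return state  # no reaction is applicable
    r, = rng.choices(reactions, weights=weights)
    return fire(state, r)
```

Estimating the probability of reaching a target state by Monte Carlo over `step_kinetic` is exactly the kind of quantity the undecidability result says cannot, in general, be computed to any precision.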
<p>In short, what started as a simple thorough exploration of the power of composition of relations has led to many decidability and complexity questions that at first appear completely unrelated, but turn out to combine to paint a coherent picture of the relationship between relations and functions, implementability and reachability, possibility and probability, and decidability and undecidability.</p>https://thesis.library.caltech.edu/id/eprint/2424Interval Modulation: A New Paradigm for the Design of High Speed Communication Systems
https://resolver.caltech.edu/CaltechETD:etd-07072004-154316
Authors: Mukhtar, Saleem
Year: 2005
DOI: 10.7907/6F03-MP11
In this thesis we propose a new, biologically inspired paradigm for the design of high-speed communication systems. The paradigm consists of a new modulation format referred to as Interval Modulation (IM). In order to transmit data efficiently in this format, new coding techniques are needed. We propose a coding technique based on variable-length to variable-length prefix trees, and we outline code construction algorithms. These codes are referred to as Interval Modulation Codes (IMC). Furthermore, data encoded in this modulation format cannot be transmitted or received using conventional synchronous receivers based on clock and data recovery (CDR). We outline a new asynchronous circuit architecture for both the transmitter and the receiver. The architecture is based on active delay lines and eliminates the need for clock recovery.https://thesis.library.caltech.edu/id/eprint/2818Coding Techniques for Data-Storage Systems

https://resolver.caltech.edu/CaltechETD:etd-01242008-134459
Authors: Cassuto, Yuval
Year: 2008
DOI: 10.7907/38ZT-QT95
As information-bearing objects, data-storage systems are natural consumers of information-theoretic ideas. For many issues in data-storage systems, the best trade-off between cost, performance, and reliability passes through the application of error-correcting codes. Error-correcting codes specialized for data-storage systems are the subject of this thesis. On the practical side, central challenges of storage systems are addressed, both at the individual-device level and at the enterprise level for disk arrays. The results for individual devices include a new coding paradigm for Multi-Level Flash storage that benefits storage density and access speed, and also a higher-throughput algorithm for decoding Reed-Solomon codes with large decoding radii. The results for storage arrays address models and constructions to combat correlated device failures, and also introduce new highly regular array-code constructions with optimal redundancy and updates. On the theoretical side, the research stretches across multiple layers of coding-theory innovation: new codes for new error models, new codes for existing error models, and new decoding techniques for known codes. To bridge the properties and constraints of practical systems with the mathematical language of coding theory, new well-motivated models and abstractions are proposed. Among them are the models of t asymmetric limited-magnitude errors and of clustered erasures. Finally, after maximizing the theory's power in addressing these abstractions, the performance of storage systems that employ the new schemes is analytically validated.https://thesis.library.caltech.edu/id/eprint/328Randomness and Noise in Information Systems
https://resolver.caltech.edu/CaltechTHESIS:07122012-141803264
Authors: Zhou, Hongchao
Year: 2013
DOI: 10.7907/82KV-2H11
This dissertation is devoted to the study of randomness and noise in a number of information systems, including computation systems, storage systems, and natural paradigms like molecular systems, where randomness plays important and distinct roles. Motivated by applications in engineering and science, we address a number of theoretical research questions.
<ul><li>In a computation system, randomness enables tasks to be performed faster, more simply, or with less space. Hence, randomness is a useful computational resource, and the research question we address is: How to efficiently extract randomness from natural sources?</li><br/>
<li>In a molecular system such as a chemical reaction network or a gene regulatory network, randomness is inherent and serves as the key mechanism for producing the desired quantities of molecular species. A chemical reaction can be abstractly described as a probabilistic switch. Hence, given a set of probabilistic switches (with some fixed switching probabilities), the research question we address is: How to synthesize a stochastic network consisting of those switches that computes a pre-specified probability distribution?</li><br/>
<li>In an information storage system, like flash memories where information is represented by a relatively small number of electrons, randomness is a threat to data reliability. Hence, the research question we address is: How to represent, write and read information in the presence of randomness (noise)?</li></ul>
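For the first question, a classical baseline is worth recalling: the von Neumann extractor turns independent flips of a coin with unknown fixed bias into perfectly unbiased bits. This is textbook material (von Neumann, 1951), not a construction from this dissertation:

```python
def von_neumann_extract(bits):
    """Turn i.i.d. flips of a coin with unknown fixed bias into unbiased
    bits (von Neumann, 1951): read non-overlapping pairs, map
    (0, 1) -> 0 and (1, 0) -> 1, and discard (0, 0) and (1, 1)."""
    out = []
    for i in range(0, len(bits) - 1, 2):
        a, b = bits[i], bits[i + 1]
        if a != b:
            out.append(a)  # (0, 1) gives 0; (1, 0) gives 1
    return out
```

Because a biased coin produces (0, 1) and (1, 0) with equal probability, each output bit is exactly unbiased; the efficiency question, how many output bits per input bit an optimal extractor can achieve, is the kind of question the dissertation pursues.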
This dissertation focuses on the foregoing key questions and describes novel contributions related to randomness generation and extraction, stochastic system synthesis, and coding for information storage.
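For the second question, a standard baseline construction shows how fixed-probability switches compose into a target distribution: a chain of fair (probability-1/2) switches realizes any dyadic probability via its binary expansion. The sketch below is this classical idea, not the dissertation's synthesis method:

```python
import random
from fractions import Fraction

def sample(target_bits, flip=lambda: random.random() < 0.5):
    """One draw from a chain of fair switches realizing the dyadic
    probability 0.b1b2...bk (binary digits in `target_bits`)."""
    for b in target_bits:
        u = 1 if flip() else 0
        if u < b:
            return 1   # settled below the target digit: output 1
        if u > b:
            return 0   # settled above the target digit: output 0
    return 0           # every digit matched: exactly on the boundary

def prob_one(target_bits):
    """Exact probability that `sample` outputs 1, computed by tracking
    the probability of reaching each successive switch."""
    p, reach = Fraction(0), Fraction(1)
    for b in target_bits:
        if b == 1:
            p += reach / 2    # this switch resolves to output 1
        reach /= 2            # this switch defers to the next one
    return p
```

For example, `target_bits = [1, 0, 1]` realizes probability 5/8 from three fair switches; synthesizing arbitrary distributions from switches with arbitrary fixed biases is the harder question the dissertation addresses.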
https://thesis.library.caltech.edu/id/eprint/7176Coding for Information Storage
https://resolver.caltech.edu/CaltechTHESIS:05312013-123819501
Authors: Wang, Zhiying
Year: 2013
DOI: 10.7907/TFHZ-RW88
<p>Storage systems are widely used and have played a crucial role in both consumer and industrial products, for example, personal computers, data centers, and embedded systems. However, such systems face issues of cost, restricted lifetime, and reliability with the emergence of new systems and devices, such as distributed storage and flash memory. Information theory, on the other hand, provides fundamental bounds and solutions to fully utilize resources such as data density, information I/O, and network bandwidth. This thesis bridges these two topics, and proposes to solve challenges in data storage using a variety of coding techniques, so that storage becomes faster, more affordable, and more reliable.</p>
<p>We consider the system level and study the integration of RAID schemes and distributed storage. Erasure-correcting codes are the basis of the ubiquitous RAID schemes for storage systems, where disks correspond to symbols in the code and are located in a (distributed) network. Specifically, RAID schemes are based on MDS (maximum distance separable) array codes that enable optimal storage and efficient encoding and decoding algorithms. With r redundancy symbols, an MDS code can sustain r erasures. For example, consider an MDS code that can correct two erasures. When two symbols are erased, one clearly needs to access and transmit all the remaining information to rebuild the erasures. However, an interesting and practical question is: What is the smallest fraction of information that one needs to access and transmit in order to correct a single erasure? In Part I we show that the lower bound of 1/2 is achievable and that the result can be generalized to codes with an arbitrary number of parities and optimal rebuilding.</p>
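To see what the 1/2 fraction improves on, consider the naive baseline: with plain RAID-style XOR parity, rebuilding one lost disk requires reading every surviving disk. A toy sketch (disk contents and sizes are purely illustrative; this is the baseline, not the thesis's construction):

```python
def xor_blocks(blocks):
    """Bytewise XOR of equal-length blocks."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, byte in enumerate(blk):
            out[i] ^= byte
    return bytes(out)

# Four data disks plus one XOR parity disk (RAID-5-like layout).
data = [bytes([d] * 4) for d in (1, 2, 3, 4)]
parity = xor_blocks(data)

# Rebuilding lost disk 2 means XORing *every* surviving disk:
# all remaining data is read, a rebuilding fraction of 1.
lost = 2
survivors = [blk for i, blk in enumerate(data) if i != lost] + [parity]
rebuilt = xor_blocks(survivors)
assert rebuilt == data[lost]
```

The array codes of Part I arrange parities so that a single-node rebuild touches only half of the surviving information.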
<p>We consider the device level and study coding and modulation techniques for emerging non-volatile memories such as flash memory. In particular, rank modulation is a novel data representation scheme proposed by Jiang et al. for multi-level flash memory cells, in which a set of n cells stores information in the permutation induced by the different charge levels of the individual cells. It eliminates the need for discrete cell levels, as well as overshoot errors, when programming cells. In order to decrease the decoding complexity, we propose two variations of this scheme in Part II: bounded rank modulation, where only small sliding windows of cells are sorted to generate permutations, and partial rank modulation, where only some of the n cells are used to represent data. We study limits on the capacity of bounded rank modulation and propose encoding and decoding algorithms. We show that overlaps between windows increase capacity. We present Gray codes spanning all possible partial-rank states and using only “push-to-the-top” operations. These Gray codes turn out to solve an open combinatorial problem called the universal cycle problem: finding a sequence of integers generating all possible partial permutations.</p>
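The basic objects of rank modulation are easy to state in code: the stored permutation is simply the ordering of cells by charge, and a “push-to-the-top” operation raises one cell above all others by only increasing its charge. A minimal sketch (distinct charge levels assumed; the values are illustrative):

```python
def rank_permutation(charges):
    """Permutation stored by the cells: cell indices ordered from the
    highest charge level to the lowest (distinct levels assumed)."""
    return sorted(range(len(charges)), key=lambda i: -charges[i])

def push_to_top(charges, i):
    """Program cell i above all others by only *increasing* its charge,
    so no erasure operation is ever needed."""
    charges = list(charges)
    charges[i] = max(charges) + 1
    return charges

levels = [0.3, 0.9, 0.5]            # cell 1 highest, then 2, then 0
assert rank_permutation(levels) == [1, 2, 0]
assert rank_permutation(push_to_top(levels, 0)) == [0, 1, 2]
```

Bounded and partial rank modulation, as studied in Part II, restrict which cells are sorted together, trading capacity for decoding complexity.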
https://thesis.library.caltech.edu/id/eprint/7792Rewriting Schemes for Flash Memory
https://resolver.caltech.edu/CaltechTHESIS:04082015-142940694
Authors: En Gad, Eyal
Year: 2015
DOI: 10.7907/Z9R49NQ3
<p>Flash memory is a leading storage media with excellent features such as random access and
high storage density. However, it also faces significant reliability and endurance challenges.
In flash memory, the charge level in the cells can be easily increased, but removing charge
requires an expensive erasure operation. In this thesis we study rewriting schemes that
enable the data stored in a set of cells to be rewritten by only increasing the charge level
in the cells. We consider two types of modulation schemes: a conventional modulation based
on the absolute levels of the cells, and a recently-proposed scheme based on the relative cell
levels, called rank modulation. The contributions of this thesis to the study of rewriting
schemes for rank modulation include the following: we</p>
<p>• propose a new method of rewriting in rank modulation, beyond the previously proposed
method of “push-to-the-top”;</p>
<p>• study the limits of rewriting with the newly proposed method, and derive a tight upper
bound of 1 bit per cell;</p>
<p>• extend the rank-modulation scheme to support rankings with repetitions, in order to
improve the storage density;</p>
<p>• derive a tight upper bound of 2 bits per cell for rewriting in rank modulation with
repetitions;</p>
<p>• construct an efficient rewriting scheme that asymptotically approaches the upper bound
of 2 bits per cell.</p>
<p>The next part of this thesis studies rewriting schemes for a conventional absolute-levels
modulation. The considered model is called “write-once memory” (WOM). We focus on
WOM schemes that achieve the capacity of the model. In recent years several capacity-achieving
WOM schemes were proposed, based on polar codes and randomness extractors.
The contributions of this thesis to the study of WOM schemes include the following: we</p>
<p>• propose a new capacity-achieving WOM scheme based on sparse-graph codes, and show
its attractive properties for practical implementation;</p>
<p>• improve the design of polar WOM schemes to remove the reliance on shared randomness
and include an error-correction capability.</p>
<p>The last part of the thesis studies the local rank-modulation (LRM) scheme, in which a
sliding window going over a sequence of real-valued variables induces a sequence of permutations.
The LRM scheme is used to simulate a single conventional multi-level flash cell.
The simulated cell is realized by a Gray code traversing all the relative-value states where,
physically, the transition between two adjacent states in the Gray code is achieved by using
a single “push-to-the-top” operation. The main results of the last part of the thesis are two
constructions of Gray codes with asymptotically-optimal rate.</p>https://thesis.library.caltech.edu/id/eprint/8814Coding for Security and Reliability in Distributed Systems
https://resolver.caltech.edu/CaltechTHESIS:06042017-212503971
Authors: Huang, Wentao
Year: 2017
DOI: 10.7907/Z9P26W5C
<p>This dissertation studies the use of coding techniques to improve the reliability and security of distributed systems. The first three parts focus on distributed storage systems, and study schemes that encode a message into <i>n</i> shares, assigned to <i>n</i> nodes, such that any <i>n</i> - <i>r</i> nodes can decode the message (reliability) and any colluding <i>z</i> nodes cannot infer any information about the message (security). The objective is to optimize the computational, implementation, communication and access complexity of the schemes during the process of encoding, decoding and repair. These are the key metrics of the schemes so that when they are applied in practical distributed storage systems, the systems are not only reliable and secure, but also fast and cost-effective.</p>
<p>Schemes with highly efficient computation and implementation are studied in Part I. For the practical high rate case of <i>r</i> ≤ 3 and <i>z</i> ≤ 3, we construct schemes that require only <i>r</i> + <i>z</i> XORs to encode and <i>z</i> XORs to decode each message bit, based on practical erasure codes including the B, EVENODD and STAR codes. This encoding and decoding complexity is shown to be optimal. For general <i>r</i> and <i>z</i>, we design schemes over a special ring from Cauchy matrices and Vandermonde matrices. Both schemes can be efficiently encoded and decoded due to the structure of the ring. We also discuss methods to shorten the proposed schemes.</p>
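As a much simplified illustration of XOR-based secure coding, additive one-time-pad sharing splits a message into n shares such that any n − 1 shares reveal nothing while all n decode it. This is the textbook degenerate case (no erasure tolerance, i.e. r = 0), far simpler than the B, EVENODD, and STAR-based constructions of Part I:

```python
import secrets

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def share(message, n):
    """Additive XOR sharing: n - 1 uniformly random shares, plus the XOR
    of the message with all of them.  Any n - 1 shares are jointly
    uniform (so they reveal nothing about the message); all n shares
    XOR back to the message."""
    shares = [secrets.token_bytes(len(message)) for _ in range(n - 1)]
    last = message
    for s in shares:
        last = xor_bytes(last, s)
    return shares + [last]

def reconstruct(shares):
    out = shares[0]
    for s in shares[1:]:
        out = xor_bytes(out, s)
    return out

assert reconstruct(share(b"secret", 4)) == b"secret"
```

The schemes in Part I achieve both security against z colluders and reliability against r erasures simultaneously, with only r + z XORs to encode each message bit.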
<p>Part II studies schemes that are efficient in terms of communication and access complexity. We derive a lower bound on the decoding bandwidth, and design schemes achieving the optimal decoding bandwidth and access. We then design schemes that achieve the optimal bandwidth and access not only for decoding, but also for repair. Furthermore, we present a family of Shamir's schemes with asymptotically optimal decoding bandwidth.</p>
<p>Part III studies the problem of secure repair, i.e., reconstructing the share of a (failed) node without leaking any information about the message. We present generic secure repair protocols that can securely repair any linear schemes. We derive a lower bound on the secure repair bandwidth and show that the proposed protocols are essentially optimal in terms of bandwidth.</p>
<p>In the final part of the dissertation, we study the use of coding techniques to improve the reliability and security of network communication.</p>
<p>Specifically, in Part IV we draw connections between several important problems in network coding. We present reductions that map an arbitrary multiple-unicast network coding instance to a unicast secure network coding instance in which at most one link is eavesdropped, or a unicast network error correction instance in which at most one link is erroneous, such that a rate tuple is achievable in the multiple-unicast network coding instance if and only if a corresponding rate is achievable in the unicast secure network coding instance, or in the unicast network error correction instance. Conversely, we show that an arbitrary unicast secure network coding instance in which at most one link is eavesdropped can be reduced back to a multiple-unicast network coding instance. Additionally, we show that the capacity of a unicast network error correction instance in general is not (exactly) achievable. We derive upper bounds on the secrecy capacity for the secure network coding problem, based on cut-sets and the connectivity of links. Finally, we study optimal coding schemes for the network error correction problem, in the setting that the network and adversary parameters are not known a priori.</p>https://thesis.library.caltech.edu/id/eprint/10269Decoding the Past
https://resolver.caltech.edu/CaltechTHESIS:04032019-102853075
Authors: Jain, Siddharth
Year: 2019
DOI: 10.7907/K286-5N63
<p>The human genome is continuously evolving; hence, a sequenced genome is a snapshot in time of this evolving entity. Over time, the genome accumulates mutations that can be associated with different phenotypes, like physical traits and diseases. Underlying mutation accumulation is an <i>evolution channel</i> (the term <i>channel</i> is motivated by the notion of a communication channel, introduced by Shannon [1] in 1948, which started the area of <i>Information Theory</i>), which is controlled by hereditary, environmental, and stochastic factors. The premise of this thesis is to understand the human genome within an information-theoretic framework. In particular, it focuses on (i) the analysis and characterization of the evolution channel using measures of <i>capacity</i>, <i>expressiveness</i>, <i>evolution distance</i>, and <i>uniqueness</i> of ancestry, and uses these insights for (ii) the design of error-correcting codes for DNA storage, (iii) explaining inversion symmetry in the genome, and (iv) cancer classification.</p>
<p>The mutational events characterizing this evolution channel can be divided into two categories, namely point mutations and duplications. While evolution through point mutations is <i>unconstrained</i>, giving rise to combinatorially many possibilities of what could have happened in the past, evolution through duplications adds constraints limiting the number of those possibilities. Further, more than 50% of the genome has been observed to consist of repeated sequences. We focus on a highly constrained form of duplication known as tandem duplication in order to understand the limits of evolution by duplication. Our sequence evolution model consists of a starting sequence called a <i>seed</i> and a set of tandem duplication rules. We find limits on the diversity of sequences that can be generated by tandem duplications using measures of capacity and expressiveness. Additionally, we calculate bounds on the duplication distance, which is used to measure the timing of generation by these duplications. We also ask questions about the uniqueness of the seed for a given sequence and completely characterize the duplication length sets where the seed is unique or non-unique. These insights also led us to design error-correcting codes for any number of tandem duplication errors that are useful for DNA-storage-based applications. For uniform duplication length and duplication length bounded by 2, our designed codes achieve channel capacity. We also define and measure <i>uncertainty</i> in decoding when the duplication channel is misinformed. Moreover, we add substitutions to our tandem duplication model and calculate sequence generation diversity for a given budget of substitutions.</p>
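The generation model can be sketched directly: one tandem duplication of length k replaces a substring s by ss, and the set of sequences reachable from a seed follows by breadth-first search. An illustrative sketch with our own alphabet and parameters (the capacity and expressiveness measures in the thesis quantify how this reachable set grows):

```python
def tandem_duplications(seq, k):
    """All sequences obtained from `seq` by one tandem duplication of
    length k: some substring s of length k is replaced by s s."""
    out = set()
    for i in range(len(seq) - k + 1):
        block = seq[i:i + k]
        out.add(seq[:i + k] + block + seq[i + k:])
    return out

def reachable(seed, lengths, max_len):
    """Sequences derivable from `seed` by tandem duplications with
    lengths drawn from `lengths`, up to length `max_len` (BFS)."""
    seen, frontier = {seed}, [seed]
    while frontier:
        nxt = []
        for s in frontier:
            for k in lengths:
                for t in tandem_duplications(s, k):
                    if len(t) <= max_len and t not in seen:
                        seen.add(t)
                        nxt.append(t)
        frontier = nxt
    return seen

assert tandem_duplications("ACG", 1) == {"AACG", "ACCG", "ACGG"}
```

For example, from the seed "AC" with duplication length 1, the sequences of length at most 3 are "AC", "AAC", and "ACC"; whether a given sequence has a unique seed is one of the uniqueness questions studied above.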
<p>We also use our duplication model to explain the inversion symmetry observed in the genome of many species. The inversion symmetry is popularly known as the 2nd Chargaff Rule, according to which in a <i>single</i> strand DNA, the frequency of a <i>k</i>-mer is almost the same as the frequency of its reverse complement. The insights gained by these problems led us to investigate the tandem repeat regions in the genome. Tandem repeat regions in the genome can be traced back in time algorithmically to make inference about the effect of the hereditary, environmental and stochastic factors on the mutation rate of the genome. By inferring the evolutionary history of the tandem repeat regions, we show how this knowledge can be used to make predictions about the risk of incurring a mutation based disease, specifically cancer. More precisely, we introduce the concept of mutation profiles that are computed without any comparative analysis, but instead by analyzing the short tandem repeat regions in a single <i>healthy</i> genome and capturing information about the individual's evolution channel. Using gradient boosting on data from more than 5,000 TCGA (The Cancer Genome Atlas) cancer patients, we demonstrate that these mutation profiles can accurately distinguish between patients with various types of cancer. For example, the pairwise validation accuracy of the classifier between PAAD (pancreas) patients and GBM (brain) patients is 93%. Our results show that healthy unaffected cells still contain a cancer-specific signal, which opens the possibility of cancer prediction from a healthy genome.</p>https://thesis.library.caltech.edu/id/eprint/11436Correcting Errors in DNA Storage
https://resolver.caltech.edu/CaltechTHESIS:06062022-185003284
Authors: Sima, Jin
Year: 2022
DOI: 10.7907/kdph-6z71
<p>DNA-based storage has potentially unprecedented advantages in information density and durability, and is one of the most promising techniques for meeting the ever-growing demand for data storage. As noise and errors are present in almost every procedure of reading, writing, and storing information in DNA storage systems, error correction is indispensable for guaranteeing reliable data storage in DNA. Moreover, error correction must often be done efficiently to reduce the cost and time needed for reading and writing data. Due to technology constraints and physical limitations, error correction in DNA-based storage poses the following challenges, which differ from those in traditional digital data transmission and storage systems.</p>
<p>1. A combination of deletion, insertion, and substitution errors is present. The goal is to construct efficient codes correcting these errors. While substitution errors are special cases of deletion and insertion errors and are well studied under current theory and practice, deletion and insertion errors are much more difficult to deal with and far less well understood.</p>
<p>2. Error correction is over an unordered set of strings, rather than over a single string (which can be regarded as an ordered set of strings). The latter case, which includes the above deletion/insertion coding problem, is the one commonly studied for current digital communication and storage systems. Our goal is to extend the deletion/insertion correction capability from a single string to a set of unordered strings.</p>
<p>3. The decoder observes multiple noisy copies of every coded string. The problem is to deduce a set of strings (or a single string) from a collection of their noisy samples, studied as the population recovery problem (or the trace reconstruction problem for a single string). The problem is well understood with substitution errors only, and becomes elusive with the introduction of deletion and insertion errors.</p>
<p>This thesis addresses the above challenges. For the first challenge, we proposed binary codes that correct any constant number of deletions and/or insertions with order-wise optimal redundancy, which made a step toward solving a longstanding open problem introduced by Levenshtein in the 1960s. We also extended the construction to different settings, in particular to non-binary deletion/insertion-correcting codes suitable for DNA storage applications.</p>
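The classical starting point for single-deletion correction, due to Varshamov, Tenengolts, and Levenshtein, is worth seeing concretely: binary codewords whose weighted checksum vanishes modulo n + 1 correct any single deletion. The sketch below is this textbook construction with a brute-force decoder; the codes in this thesis handle any constant number of deletions/insertions, far beyond this case:

```python
def vt_syndrome(x):
    """Varshamov-Tenengolts checksum: sum of i * x_i (1-indexed),
    taken modulo n + 1 where n = len(x)."""
    return sum(i * b for i, b in enumerate(x, 1)) % (len(x) + 1)

def vt_decode(received, n):
    """Recover the unique codeword of VT_0(n) from which `received`
    (length n - 1) arose by one deletion, by trying every insertion."""
    candidates = {
        received[:pos] + (bit,) + received[pos:]
        for pos in range(n)
        for bit in (0, 1)
    }
    matches = {c for c in candidates if vt_syndrome(c) == 0}
    assert len(matches) == 1  # single-deletion correction guarantees this
    return matches.pop()

codeword = (1, 0, 0, 0, 1)              # syndrome 1*1 + 5*1 = 6 = 0 (mod 6)
received = codeword[:1] + codeword[2:]  # delete the second bit
assert vt_decode(received, 5) == codeword
```

VT codes use only about log(n + 1) redundant bits for one deletion; achieving order-wise optimal redundancy for a constant number of deletions is the harder problem addressed here.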
<p>For the second challenge, we established lower and upper bounds on the optimal redundancy of codes correcting any number of substitution, deletion, and insertion errors, and found that the redundancy needed for coding over an unordered set of strings is order-wise the same as that needed for coding over an ordered set of strings. Using our results for the first challenge, we proposed codes correcting any constant number of deletion/insertion errors with order-wise optimal redundancy under some parameter settings.</p>
<p>For the third challenge, we studied the problem of trace reconstruction, which asks how many noisy samples are needed to reconstruct a single string. While there is an exponential gap between the upper and lower bounds on the sample complexity in general, we showed that a polynomial number of samples suffices, given a reference string that is within constant edit distance of the target string.</p>
<p>Apart from these challenges, we investigated error correction for multi-head racetrack memory applications. The problem can be viewed as correcting any constant number of deletions/insertions in a single string given multiple noisy copies, with the help of coding. Unlike the trace reconstruction setting above, where the noisy copies are independent given the target string, in racetrack memory the noisy copies are correlated, and the number of errors is small compared to the trace reconstruction problem. We derived a lower bound on the redundancy and proposed a code correcting any constant number of deletions/insertions with order-wise optimal redundancy.</p>https://thesis.library.caltech.edu/id/eprint/14951