Advisor Feed
https://feeds.library.caltech.edu/people/Abu-Mostafa-Y-S/advisor.rss
A Caltech Library Repository Feed
http://www.rssboard.org/rss-specification
Generator: python-feedgen | Language: en | Tue, 16 Apr 2024 14:51:44 +0000

Some Results on Kolmogorov-Chaitin Complexity
https://resolver.caltech.edu/CaltechTHESIS:04132012-090947715
Authors: David Lawrence Schweizer
Year: 1986
DOI: 10.7907/50qm-c858
No abstract.
https://thesis.library.caltech.edu/id/eprint/6930

Resource-Bounded Category and Measure in Exponential Complexity Classes
https://resolver.caltech.edu/CaltechTHESIS:03192012-101000956
Authors: Jack Harold Lutz
Year: 1987
DOI: 10.7907/dccw-3z56
<p>This thesis presents <i>resource-bounded category</i> and <i>resource-bounded measure</i> - two new tools for computational complexity theory - and some applications of these tools to the structure theory of exponential complexity classes.</p>
<p>Resource-bounded category, a complexity-theoretic version of the classical Baire category method, identifies certain subsets of PSPACE, E, ESPACE, and other complexity classes as <i>meager</i>. These meager sets are shown to form a nontrivial ideal of "small" subsets of the complexity class. The meager sets are also (almost) characterized in terms of certain two-person infinite games called <i>resource-bounded Banach-Mazur games</i>.</p>
<p>Similarly, resource-bounded measure, a complexity-theoretic version of Lebesgue measure theory, identifies the <i>measure</i> 0 subsets of E, ESPACE, and other complexity classes, and these too are shown to form nontrivial ideals of "small" subsets. A resource-bounded extension of the classical Kolmogorov zero-one law is also proven. This shows that measurable sets of complexity-theoretic interest either have measure 0 or are the complements of sets of measure 0.</p>
<p>Resource-bounded category and measure are then applied to the investigation of uniform versus nonuniform complexity. In particular, Kannan's theorem that ESPACE ⊄ P/Poly is extended by showing that P/Poly ∩ ESPACE is only a meager, measure 0 subset of ESPACE. A theorem of Huynh is extended similarly by showing that all but a meager, measure 0 subset of the languages in ESPACE have high space-bounded Kolmogorov complexity.</p>
<p>These tools are also combined with a new hierarchy of exponential time complexity classes to refine known relationships between nonuniform complexity and time complexity.</p>
<p>In the last part of the thesis, known properties of hard languages are extended. In particular, recent results of Schöning and Huynh state that any language L which is ≤<sup>P</sup><sub>m</sub>-hard for E or ≤<sup>P</sup><sub>T</sub>-hard for ESPACE cannot be feasibly approximated (i.e., its symmetric difference with any feasible language has exponential density). It is proven here that this conclusion in fact holds unless only a meager subset of E is ≤<sup>P</sup><sub>m</sub>-reducible to L and only a meager, measure 0 subset of ESPACE is ≤<sup>PSPACE</sup><sub>m</sub>-reducible to L. (It is conjectured, but not proven, that this result is actually stronger than those of Schöning and Huynh.) This suggests a new lower bound method which may be useful in interesting cases.</p>
https://thesis.library.caltech.edu/id/eprint/6854

Soft-decision decoding of a family of nonlinear codes using a neural network
https://resolver.caltech.edu/CaltechETD:etd-06252007-080630
Authors: Ruth A. Erlanson
Year: 1991
DOI: 10.7907/c855-aj24
We demonstrate the use of a continuous Hopfield neural network as a K-Winner-Take-All (KWTA) network. We prove that, given an input of N real numbers, such a network will converge to a vector of K positive one components and (N-K) negative one components, with the positive positions indicating the K largest input components. In addition, we show that the (N choose K) such vectors are the only stable states of the system.
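The stable states described above can be written down directly; a minimal NumPy sketch of the input-output map the KWTA network computes (the continuous Hopfield dynamics themselves are not simulated here):

```python
import numpy as np

def kwta(u, k):
    """Stable state of a K-Winner-Take-All network on input u:
    +1 at the positions of the k largest components, -1 elsewhere."""
    v = -np.ones(len(u))
    v[np.argsort(u)[-k:]] = 1.0
    return v

print(kwta(np.array([0.3, -1.2, 2.5, 0.9]), 2))  # winners at indices 2 and 3
```

Running the network on N inputs thus selects the K largest, which is what makes it usable as an analog decoder in the sequel.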
One application of the KWTA network is the analog decoding of error-correcting codes. We prove that the KWTA network performs optimal decoding.
We consider decoders that are networks with nodes in overlapping, randomly placed KWTA constraints and discuss characteristics of the resulting codes.
We present two families of decoders constructed by overlapping KWTA constraints in a structured fashion on the nodes of a neural network. We analyze the performance of these decoders in terms of error rate, and discuss code minimum distance and information rate. We observe that these decoders perform near-optimal, soft-decision decoding on a class of nonlinear codes. We present a gain schedule that results in improved decoder performance in terms of error rate.
We present a general algorithm for determining the minimum distance of codes defined by the stable states of neural networks with nodes in overlapping KWTA constraints.
We consider the feasibility of embedding these neural network decoders in VLSI technologies and show that decoders of reasonable size could be implemented on a single integrated circuit. We also analyze the scaling of such implementations with decoder size and complexity.
Finally, we present an algorithm, based on the random coding theorem, to communicate an array of bits over a distributed communication network of simple processors connected by a common noisy bus.
https://thesis.library.caltech.edu/id/eprint/2731

Learning algorithms for neural networks
https://resolver.caltech.edu/CaltechETD:etd-09232005-083502
Authors: Amir Atiya (amir@alumni.caltech.edu)
Year: 1991
DOI: 10.7907/F46C-3V67
This thesis deals mainly with the development of new learning algorithms and the study of the dynamics of neural networks. We develop a method for training feedback neural networks. Appropriate stability conditions are derived, and learning is performed by the gradient descent technique. We develop a new associative memory model using Hopfield's continuous feedback network. We demonstrate some of the storage limitations of the Hopfield network, and develop alternative architectures and an algorithm for designing the associative memory. We propose a new unsupervised learning method for neural networks. The method is based on repeatedly applying the gradient ascent technique to a defined criterion function. We study some of the dynamical aspects of Hopfield networks. New stability results are derived. Oscillations and synchronizations in several architectures are studied, and related to recent findings in biology. The problem of recording the outputs of real neural networks is considered. A new method for the detection and the recognition of the recorded neural signals is proposed.
https://thesis.library.caltech.edu/id/eprint/3725

Combinatorial design of fault-tolerant communication structures, with applications to non-blocking switches
https://resolver.caltech.edu/CaltechETD:etd-07122007-092015
Authors: David Lawrence Schweizer
Year: 1991
DOI: 10.7907/92k2-w943
This thesis is an investigation into structures and strategies for fault-tolerant communication. We assume the existence of some set of nodes--people, telephones, processors--with a need to pass messages--telephone calls, signals on a wire, data packets--amongst themselves.
In Part I, our goal is to create a structure, that is, a pattern of interconnection, in which a designated source node can broadcast a message to (and through) a group of recipient nodes. We seek a structure in which every node has tightly limited fan-out, but which is nonetheless able to function reliably even when challenged with significant numbers of node failures. The structures are described only in terms of their connectivity, and we therefore use the language of graph theory.
Part II is based on the observation that certain transformations of the graphs in Part I produce graphs that look like previously studied structures called non-blocking switches. We show that these transformations, when applied to other graphs, yield new, easier approaches to, and proofs of, some known theorems.
Part III is an independent body of work describing some investigations into possible extensions of the theory of Kolmogorov-Chaitin complexity into the foundations of pattern recognition. We prove the existence of an information theoretic metric on strings in which the distance between two strings is a measure of the amount of specification required for a universal computer to interconvert the strings. We also prove two topological theorems about this metric.
https://thesis.library.caltech.edu/id/eprint/2859

Invariance hints and the VC dimension
https://resolver.caltech.edu/CaltechETD:etd-07202007-075240
Authors: William John Andrew Fyfe
Year: 1992
DOI: 10.7907/ft2z-te28
We are interested in having a neural network learn an unknown function f. If the function satisfies an invariant of some sort, such as f being an odd function, then we want to take advantage of this information rather than have the network deduce the invariant from examples of f.
The invariant might be defined in terms of an explicit transformation of the input space under which f is constant. In this case it is possible to build a network that necessarily satisfies the invariant.
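For the concrete case of an odd target (f(-x) = -f(x)), one such construction antisymmetrizes an arbitrary base network, so the invariant holds by construction; the base model below is a hypothetical stand-in for any trained network:

```python
import numpy as np

def base_model(x):
    # Hypothetical unconstrained network output; any function works here.
    return np.tanh(1.3 * x + 0.7)

def odd_model(x):
    # Antisymmetrized model: odd by construction, so the invariant
    # odd_model(-x) == -odd_model(x) holds exactly, whatever base_model is.
    return 0.5 * (base_model(x) - base_model(-x))
```

The same trick applies to any invariant given by an explicit input transformation: average (or antisymmetrize) the base model over the transformation's orbit.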
In general, we define the invariant in terms of a partition of the input space such that if x, x' are in the same partition element then f(x) = f(x'). An example of the invariant would be a pair (x, x') taken from a single partition element. We can combine examples of the invariant with examples of the function in the learning process. The goal is to substitute examples of the invariant for examples of the function; the extent to which we can actually do this depends on the appropriate VC dimensions. Simulations verify, at least in simple cases, that examples of the invariant do aid the learning process.
https://thesis.library.caltech.edu/id/eprint/2950

Sequence specific effects on the incorporation of dideoxynucleotides by a modified T7 polymerase
https://resolver.caltech.edu/CaltechTHESIS:11122012-105407501
Authors: Alan-Philippe Blanchard
Year: 1993
DOI: 10.7907/9vz4-0e68
<p>While incorporating nucleotides onto the end of a DNA molecule, DNA polymerases selectively discriminate against dideoxynucleotides in favor of incorporating deoxynucleotides. The magnitude of this discrimination is modulated by the template DNA sequence near the incorporation site. This effect has been characterized by analyzing the raw data from a large number of DNA sequencing experiments. It is shown that, for bacteriophage T7 polymerase, the 5 contiguous bases extending from 3 bases 3' (on the template strand) from the incorporation site to 1 base 5' of the incorporation site are the most important in modulating dideoxynucleotide discrimination. A table of discrimination ratios for 1007 different 5-mer contexts is presented.</p>
https://thesis.library.caltech.edu/id/eprint/7265

The Scheduling Problem in Learning from Hints
https://resolver.caltech.edu/CaltechTHESIS:03272012-100501462
Authors: Zehra Kök Çataltepe (cataltepe@itu.edu.tr, ORCID: 0000-0002-9742-5907)
Year: 1994
DOI: 10.7907/3zvq-w228
<p>Any information about the function to be learned is called a hint. Learning from hints is a generalization of learning from examples. In this paradigm, hints are expressed by their examples and then taught to a learning-from-examples system. In general, using other hints in addition to the examples of the function improves the generalization performance.</p>
<p>The scheduling problem in learning from hints is deciding which hint to teach at which time during training. Over- or under-emphasizing a hint may render it useless, making scheduling very important. Two types of schedules, fixed and adaptive, are discussed.</p>
<p>Adaptive minimization is a general adaptive schedule that uses an estimate of the generalization error in terms of the errors on hints. When such an estimate is available, it can also be minimized directly by descending on it. Such an estimate may also be used to decide when to stop training.</p>
<p>A method to find an estimate incorporating the errors on invariance hints, and simulation results on this estimate, are presented. Two computer programs that provide a learning-from-hints environment, and improvements on them, are discussed.</p>
https://thesis.library.caltech.edu/id/eprint/6874

Monotonicity and connectedness in learning systems
https://resolver.caltech.edu/CaltechETD:etd-09222005-110351
Authors: Joseph Sill (joe_sill@yahoo.com)
Year: 1998
DOI: 10.7907/GQWN-1H71
This thesis studies two properties, monotonicity and connectedness, in the context of machine learning. The first part of the thesis examines the role of monotonicity constraints in machine learning from both practical and theoretical perspectives. Two techniques for enforcing monotonicity in machine learning models are proposed. The first method adds to the objective function a penalty term measuring the degree to which the model violates monotonicity. The penalty term can be interpreted as a Bayesian prior favoring functions which obey monotonicity. This method has the potential to enforce monotonicity only approximately, making it appropriate for situations where strict monotonicity may not hold. The second approach consists of a model which is monotonic by virtue of its functional form. This model is shown to have universal approximation capabilities with respect to the class M of monotonic functions. A variety of theoretical results are also presented regarding M. The generalization behavior of this class is shown to depend heavily on the probability distribution over the input space. Although the VC dimension of M is infinite, the VC entropy (i.e., the expected number of dichotomies) is modest for many distributions, allowing us to obtain bounds on the generalization error. Monte Carlo techniques for estimating the capacity and VC entropy of M are presented.
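A minimal sketch of the first technique, for a one-dimensional model: the penalty sums squared decreases of the model over an increasing grid of inputs. The grid and squared-violation form are illustrative choices, not the thesis's exact formulation:

```python
import numpy as np

def monotonicity_penalty(model, grid, lam=1.0):
    # Sum of squared decreases of the model over an increasing input grid;
    # adding lam * penalty to the training objective penalizes monotonicity
    # violations softly, without forbidding them outright.
    ys = np.array([model(x) for x in grid])
    drops = np.maximum(0.0, ys[:-1] - ys[1:])  # positive where the model decreases
    return lam * np.sum(drops ** 2)

grid = np.linspace(-2, 2, 50)
print(monotonicity_penalty(np.tanh, grid))     # 0.0: tanh is monotonic
print(monotonicity_penalty(np.cos, grid) > 0)  # True: cos is not
```

Because the penalty is zero only for models that never decrease on the grid, its weight lam controls how strictly monotonicity is enforced, matching the "approximate enforcement" reading above.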
The second part of the thesis considers broader issues in learning theory. Generalization error bounds based on the VC dimension describe a function class by counting the number of dichotomies it induces. In this thesis, a more detailed characterization is presented which takes into account the diversity of a set of dichotomies in addition to its cardinality. Many function classes in common usage are shown to possess a property called connectedness. Models with this property induce dichotomy sets which are highly clustered and have little diversity. We derive an improvement to the VC bound which applies to function classes with the connectedness property.
https://thesis.library.caltech.edu/id/eprint/3689

Supervised learning in probabilistic environments
https://resolver.caltech.edu/CaltechETD:etd-09232005-143548
Authors: Malik Magdon-Ismail
Year: 1998
DOI: 10.7907/6Y8S-4442
NOTE: Text or symbols not renderable in plain ASCII are indicated by [...]. Abstract is included in .pdf document.
For a wide class of learning systems and different noise models, we bound the test performance in terms of the noise level and number of data points. We obtain O(1/N) convergence to the best hypothesis, the rate of convergence depending on the noise level and target complexity with respect to the learning model. Our results can be applied to estimate the model limitation, which we illustrate in the financial markets. Changes in model limitation can be used to track changes in volatility.
We analyze regularization in generalized linear models, focusing on weight decay. For a well-specified linear model, the optimal regularization parameter decreases as [...]. When the data is noiseless, regularization is harmful. For a misspecified linear model, the "degree" of misspecification has an effect analogous to noise. For more general learning systems, we develop EXPLOVA (explanation of variance), which also enables us to derive a condition on the learning model for regularization to help. We emphasize the necessity of prior information for effective regularization.
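The noiseless claim is easy to check numerically; a sketch with weight decay (ridge) on a synthetic linear target, where the data sizes, seed, and constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                      # noiseless targets

def weight_decay_fit(X, y, lam):
    # Minimizer of ||Xw - y||^2 + lam * ||w||^2 (weight decay / ridge).
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

err_unreg = np.linalg.norm(weight_decay_fit(X, y, 0.0) - w_true)
err_reg = np.linalg.norm(weight_decay_fit(X, y, 1.0) - w_true)
# With noiseless data the unregularized fit recovers w_true exactly,
# while any positive weight decay biases the solution away from it.
```

With noise added to y, the comparison flips for suitable lam, consistent with the optimal regularization parameter growing with the noise level.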
By counting functions on a discretized grid, we develop a framework for incorporating prior knowledge about the target function into the learning process. Using this framework, we derive a direct connection between smoothness priors and Tikhonov regularization, in addition to the regularization terms implied by other priors.
We prove a No Free Lunch result for noise prediction: when the prior over target functions is uniform, the data set conveys no information about the noise distribution. We then consider using maximum likelihood to predict non-stationary noise variance in time series. Maximum likelihood leads to systematic errors that favor lower variance. We discuss the systematic correction of these errors.
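The low-variance bias is the familiar bias of the maximum likelihood variance estimator, which divides by N rather than N-1; a quick Monte Carlo check of this stationary analogue of the effect described above (sample sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 5, 20000
samples = rng.normal(0.0, 1.0, size=(trials, n))    # true variance = 1

ml_var = samples.var(axis=1, ddof=0).mean()      # ML estimate: divide by n
bessel_var = samples.var(axis=1, ddof=1).mean()  # corrected: divide by n - 1
# E[ML estimate] = (n-1)/n * true variance, i.e. systematically low;
# the Bessel-corrected estimator is unbiased.
```

For n = 5 the ML estimate averages about 0.8, illustrating the systematic preference for lower variance that the thesis corrects in the time-series setting.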
We develop stochastic and deterministic techniques for density estimation based on approximating the distribution function, thus placing density estimation within the supervised learning framework. We prove consistency of the estimators and obtain convergence rates in L1 and L2. We also develop approaches to random variate generation based on "inverting" the density estimation procedure and based on a control formulation.
In general, we use multilayer neural networks to illustrate our methods.
https://thesis.library.caltech.edu/id/eprint/3728

Incorporating Input Information into Learning and Augmented Objective Functions
https://resolver.caltech.edu/CaltechETD:etd-10042005-104636
Authors: Zehra Kök Çataltepe (cataltepe@itu.edu.tr, ORCID: 0000-0002-9742-5907)
Year: 1998
DOI: 10.7907/82JV-3D67
<p>In many applications, some form of input information, such as test inputs or extra inputs, is available. We incorporate input information into learning by an augmented error function, which is an estimator of the out-of-sample error. The augmented error consists of the training error plus an additional term scaled by the augmentation parameter. For general linear models, we analytically show that the augmented solution has smaller out-of-sample error than the least squares solution. For nonlinear models, we devise an algorithm to minimize the augmented error by gradient descent, determining the augmentation parameter using cross validation.</p>
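The abstract specifies the shape of the augmented error (training error plus a term scaled by the augmentation parameter) but not the extra term itself; a hedged sketch for a linear model, where the extra term used is a hypothetical placeholder computable from the unlabeled extra inputs alone, not the thesis's actual estimator:

```python
import numpy as np

def augmented_error(w, X_train, y_train, X_extra, alpha):
    # Training error plus an additional term scaled by the augmentation
    # parameter alpha, matching the structure described in the abstract.
    # The extra term here (mean squared output on the extra inputs) is
    # a hypothetical stand-in for the true input-information term.
    train_err = np.mean((X_train @ w - y_train) ** 2)
    extra_term = np.mean((X_extra @ w) ** 2)
    return train_err + alpha * extra_term
```

In the thesis, the augmented error is minimized by gradient descent and alpha is chosen by cross validation; alpha = 0 recovers plain least squares.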
<p>Augmented objective functions also arise when hints are incorporated into learning. We first show that using the invariance hints to estimate the test error, and early stopping on this estimator, results in better solutions than the minimization of the training error. We also extend our algorithm for incorporating input information to the case of learning from hints.</p>
<p>Input information or hints are additional information about the target function. When the only available information is the training set, all models with the same training error are equally likely to be the target. In that case, we show that early stopping of training at any training error level above the minimum cannot decrease the out-of-sample error. Our results are nonasymptotic for general linear models and the bin model, and asymptotic for nonlinear models. When additional information is available, early stopping can help.</p>
https://thesis.library.caltech.edu/id/eprint/3913

Data driven production models for speech processing
https://resolver.caltech.edu/CaltechETD:etd-02272008-093303
Authors: Sam T. Roweis
Year: 1999
DOI: 10.7907/DP55-8897
When difficult computations are to be performed on sensory data it is often advantageous to employ a model of the underlying process which produced the observations. Because such generative models capture information about the set of possible observations, they can help to explain complex variability naturally present in the data and are useful in separating signal from noise. In the case of neural and artificial sensory processing systems, generative models are learned directly from environmental input, although they are often rooted in the underlying physics of the modality involved. One effective use of learned models is made by performing model inversion or state inference on incoming observation sequences to discover the underlying state or control parameter trajectories which could have produced them. These inferred states can then be used as inputs to a pattern recognition or pattern completion module.
In the case of human speech perception and production, the models in question are called articulatory models and relate the movements of a talker's mouth to the sequence of sounds produced. Linguistic theories and substantial psychophysical evidence argue strongly that articulatory model inversion plays an important role in speech perception and recognition in the brain. Unfortunately, despite potential engineering advantages and evidence for being part of the human strategy, such inversion of speech production models is absent in almost all artificial speech processing systems.
This dissertation presents a series of experiments which investigate articulatory speech processing using real speech production data from a database containing simultaneous audio and mouth movement recordings. I show that it is possible to learn simple, low-dimensional models which accurately capture the structure observed in such real production data. I discuss how these models can be used to learn a forward synthesis system which generates spectral sequences from articulatory movements. I also describe an inversion algorithm which estimates movements from an acoustic signal. Finally, I demonstrate the use of articulatory movements, both true and recovered, in a simple speech recognition task, showing the possibility of doing true articulatory speech recognition in artificial systems.
https://thesis.library.caltech.edu/id/eprint/786

Contextual pattern recognition with applications to biomedical image identification
https://resolver.caltech.edu/CaltechETD:etd-09222005-111015
Authors: Xubo Song (xubosong@csee.ogi.edu)
Year: 1999
DOI: 10.7907/F5YK-HM52
This thesis studies two rather distinct topics: one is the incorporation of contextual information in pattern recognition, with applications to biomedical image identification; and the other is the theoretical modeling of learning and generalization in the regime of machine learning.
In Part I of the thesis, we propose techniques to incorporate contextual information into object classification. In the real world there are cases where the identity of an object is ambiguous due to noise in the measurements on which the classification is based. It is helpful to reduce the ambiguity by utilizing extra information, referred to as context, which in our case is the identities of the accompanying objects. We investigate the incorporation of both full and partial context. The error probabilities, in terms of both set-by-set error and element-by-element error, are established and compared to the context-free approach. The computational cost is studied in detail for the full-context, partial-context, and context-free cases. The techniques are applied to toy problems as well as real-world problems such as white blood cell image classification and microscopic urinalysis. It is demonstrated that superior classification performance is achieved by using context. In our particular application, it reduces the overall classification error, as well as the false positive and false negative diagnosis rates.
In Part II of the thesis, we propose a novel theoretical framework, called the Bin Model, for learning and generalization. Using the Bin Model, a closed form is derived for generalization that estimates the out-of-sample performance in terms of the in-sample performance. We address the problems of overfitting, and characterize conditions under which it does not appear. The effect of noise on generalization is studied, and the generalization of the Bin Model framework from classification problems to regression problems is discussed.
https://thesis.library.caltech.edu/id/eprint/3690

Generalization Error Estimates and Training Data Valuation
https://resolver.caltech.edu/CaltechETD:etd-09062005-083717
Authors: Alexander Marshall Nicholson (zander@fantastivision.com)
Year: 2002
DOI: 10.7907/1H16-VX81
This thesis addresses several problems related to generalization in machine learning systems. We introduce a theoretical framework for studying learning and generalization. Within this framework, a closed form is derived for the expected generalization error that estimates the out-of-sample performance in terms of the in-sample performance. We consider the problem of overfitting and show that, using a simple exhaustive learning algorithm, overfitting does not occur. These results do not assume a particular form of the target function, input distribution or learning model, and hold even with noisy data sets. We apply our analysis to practical learning systems, illustrate how it may be used to estimate out-of-sample errors in practice, and demonstrate that the resulting estimates improve upon errors estimated with a validation set for real world problems.
Based on this study of generalization, we develop a technique for quantitative valuation of training data. We demonstrate that this valuation may be used to select training sets that improve generalization performance. With a reasonable prior over target functions, it further allows us to estimate the level of noise in a data set and provides for detection and correction of noise in individual examples. Finally, this data valuation can be used to classify new examples, yielding a new learning algorithm that is shown to be relatively robust to noise.
https://thesis.library.caltech.edu/id/eprint/3347

Maximum Drawdown of a Brownian Motion and AlphaBoost: a Boosting Algorithm
https://resolver.caltech.edu/CaltechETD:etd-05272004-115820
Authors: Amrit Pratap
Year: 2004
DOI: 10.7907/J2H0-XV66
<p>We study two problems, one in the field of computational finance and the other in machine learning.</p>
<p>First, we study the maximum drawdown statistic of the Brownian random walk. We give an infinite series representation of its distribution and consider its expected value. When the drift is zero, we give an exact expression for the expected value; for the other cases, we give an infinite series representation. In all cases, we compute the limiting behavior of the expected value.</p>
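Maximum drawdown is the largest drop from a running maximum to any later point of the path; a Monte Carlo sketch for the zero-drift case (step counts, path counts, and seed are arbitrary choices):

```python
import numpy as np

def max_drawdown(path):
    # Largest drop from the running maximum to any later point.
    running_max = np.maximum.accumulate(path)
    return np.max(running_max - path)

# Simulate driftless Brownian paths on [0, 1] with sigma = 1 and
# estimate the expected maximum drawdown numerically.
rng = np.random.default_rng(0)
n_steps, n_paths = 2000, 1000
dW = rng.normal(0.0, np.sqrt(1.0 / n_steps), size=(n_paths, n_steps))
paths = np.cumsum(dW, axis=1)
mdd_estimate = np.mean([max_drawdown(p) for p in paths])
```

The Monte Carlo estimate can be compared against the exact zero-drift expression derived in the thesis; discretization biases the simulated value slightly low.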
<p>Secondly, we propose a new algorithm for boosting, AlphaBoost, which does better than AdaBoost in reducing the cost function. We study its generalization properties and compare it to AdaBoost. However, this algorithm does not always give better out-of-sample performance.</p>
https://thesis.library.caltech.edu/id/eprint/2132

Infinite Ensemble Learning with Support Vector Machines
https://resolver.caltech.edu/CaltechETD:etd-05262005-030549
Authors: Hsuan-Tien Lin (htlin@csie.ntu.edu.tw, ORCID: 0000-0003-2968-0671)
Year: 2005
DOI: 10.7907/E03R-EN93
<p>Ensemble learning algorithms such as boosting can achieve better performance by averaging over the predictions of base learners. However, existing algorithms are limited to combining only a finite number of base learners, and the generated ensemble is usually sparse. It is not clear whether we should construct an ensemble classifier with a larger or even an infinite number of base learners.</p>
<p>In addition, constructing an infinite ensemble itself is a challenging task. In this work, we formulate an infinite ensemble learning framework based on the support vector machine (SVM). The framework can output an infinite and nonsparse ensemble, and can be applied to construct new kernels for SVM as well as to interpret existing ones. We demonstrate the framework with a concrete application, the stump kernel, which embodies infinitely many decision stumps. The stump kernel is simple, yet powerful.</p>
<p>Experimental results show that SVM with the stump kernel usually achieves better performance than boosting, even with noisy data.</p>
https://thesis.library.caltech.edu/id/eprint/2087

Data Complexity in Machine Learning and Novel Classification Algorithms
https://resolver.caltech.edu/CaltechETD:etd-04122006-114210
Authors: Ling Li
Year: 2006
DOI: 10.7907/EW2G-9986
<p>This thesis summarizes four of my research projects in machine learning. One of them is on a theoretical challenge of defining and exploring complexity measures for data sets; the others are about new and improved classification algorithms.</p>
<p>We first investigate the role of data complexity in the context of binary classification problems. The universal data complexity is defined for a data set as the Kolmogorov complexity of the mapping enforced by that data set. It is closely related to several existing principles used in machine learning such as Occam's razor, the minimum description length, and the Bayesian approach. We demonstrate the application of the data complexity in two learning problems, data decomposition and data pruning. In data decomposition, we illustrate that a data set is best approximated by its principal subsets which are Pareto optimal with respect to the complexity and the set size. In data pruning, we show that outliers usually have high complexity contributions, and propose methods for estimating the complexity contribution. Experiments were carried out with a practical complexity measure on several toy problems.</p>
<p>We then propose a family of novel learning algorithms to directly minimize the 0/1 loss for perceptrons. A perceptron is a linear threshold classifier that separates examples with a hyperplane. Unlike most perceptron learning algorithms, which require smooth cost functions, our algorithms directly minimize the 0/1 loss, and usually achieve the lowest training error compared with other algorithms. The algorithms are also computationally efficient. Such advantages make them favorable for both standalone use and ensemble learning, on problems that are not linearly separable. Experiments show that our algorithms work very well with AdaBoost.</p>
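The 0/1 loss being minimized is simply the misclassification rate of the linear threshold classifier; a minimal sketch (the thesis's actual minimization algorithms are not reproduced here):

```python
import numpy as np

def zero_one_loss(w, X, y):
    # Fraction of examples the linear threshold classifier sign(Xw)
    # gets wrong; piecewise constant in w, hence non-smooth.
    return float(np.mean(np.sign(X @ w) != y))
```

Because this loss is piecewise constant in w, its gradient is zero almost everywhere, which is why smooth-cost surrogates are the norm and direct minimization is the nontrivial contribution described above.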
<p>We also study ensemble methods that aggregate many base hypotheses in order to achieve better performance. AdaBoost is one such method for binary classification problems. The superior out-of-sample performance of AdaBoost has been attributed to the fact that it minimizes a cost function based on the margin, in that it can be viewed as a special case of AnyBoost, an abstract gradient descent algorithm. We provide a more sophisticated abstract boosting algorithm, CGBoost, based on conjugate gradient in function space. When the AdaBoost exponential cost function is optimized, CGBoost generally yields much lower cost and training error but higher test error, which implies that the exponential cost is vulnerable to overfitting. With the optimization power of CGBoost, we can adopt more "regularized" cost functions that have better out-of-sample performance but are difficult to optimize. Our experiments demonstrate that CGBoost generally outperforms AnyBoost in cost reduction. With suitable cost functions, CGBoost can have better out-of-sample performance.</p>
<p>A multiclass classification problem can be reduced to a collection of binary problems with the aid of a coding matrix. The quality of the final solution, which is an ensemble of base classifiers learned on the binary problems, is affected by both the performance of the base learner and the error-correcting ability of the coding matrix. A coding matrix with strong error-correcting ability may not be optimal overall if the binary problems are too hard for the base learner. Thus a trade-off between error-correcting ability and base-learner performance should be sought. We propose a new multiclass boosting algorithm that modifies the coding matrix according to the learning ability of the base learner. We show experimentally that our algorithm is very efficient in optimizing the multiclass margin cost, and outperforms existing multiclass algorithms such as AdaBoost.ECC and one-vs-one. The improvement is especially significant when the base learner is not very powerful.</p>https://thesis.library.caltech.edu/id/eprint/1361Adaptive Learning Algorithms and Data Cloning
https://resolver.caltech.edu/CaltechETD:etd-05292008-231048
Authors: {'items': [{'id': 'Pratap-Amrit', 'name': {'family': 'Pratap', 'given': 'Amrit'}, 'show_email': 'NO'}]}
Year: 2008
DOI: 10.7907/GV3D-AB69
<p>This thesis is in the field of machine learning: the use of data to automatically learn a hypothesis to predict the future behavior of a system. It summarizes three of my research projects.</p>
<p>We first investigate the role of margins in the phenomenal success of boosting algorithms. AdaBoost (Adaptive Boosting) is an algorithm for generating an ensemble of hypotheses for classification. The superior out-of-sample performance of AdaBoost has been attributed to the fact that it can generate a classifier which classifies the points with a large margin of confidence. This led to the development of many new algorithms focusing on optimizing the margin of confidence. It was observed, however, that directly optimizing the margins leads to poor performance. This apparent contradiction has been the topic of a long unresolved debate in the machine-learning community. We introduce new algorithms which are expressly designed to test the margin hypothesis, and provide concrete evidence that refutes the margin argument.</p>
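The "margin of confidence" of a voted ensemble is the normalized weighted vote on the correct label; a minimal sketch of that standard definition:

```python
def ensemble_margins(hypotheses, alphas, X, y):
    """Normalized margin of each example under a voted ensemble:
    y_i * sum_t alpha_t h_t(x_i) / sum_t alpha_t.
    Positive means correctly classified; the magnitude is the
    'margin of confidence' the margin debate is about."""
    total = sum(alphas)
    margins = []
    for x, t in zip(X, y):
        vote = sum(a * h(x) for a, h in zip(alphas, hypotheses))
        margins.append(t * vote / total)
    return margins
```

The margin explanation predicts that pushing these values up should improve out-of-sample error; the thesis's experiments test exactly that prediction.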
<p>We then propose a novel algorithm for adaptive sampling under a monotonicity constraint. The typical learning problem takes examples of the target function as input information and produces a hypothesis that approximates the target as an output. We consider a generalization of this paradigm that takes different types of information as input and produces only specific properties of the target as output. This setup occurs in many real-life settings where samples are expensive to obtain. We show experimentally that our algorithm achieves better performance than existing methods such as the Staircase procedure and PEST.</p>
<p>One of the major pitfalls in machine learning research is selection bias. It is mostly introduced unconsciously through choices made during the learning process, which often lead to over-optimistic estimates of performance. In the third project, we introduce a new methodology for systematically reducing selection bias. Experiments show that using cloned data sets for model selection can lead to better performance and reduced selection bias.</p>https://thesis.library.caltech.edu/id/eprint/2267From Ordinal Ranking to Binary Classification
https://resolver.caltech.edu/CaltechETD:etd-05302008-143505
Authors: {'items': [{'email': 'htlin@csie.ntu.edu.tw', 'id': 'Lin-Hsuan-Tien', 'name': {'family': 'Lin', 'given': 'Hsuan-Tien'}, 'orcid': '0000-0003-2968-0671', 'show_email': 'YES'}]}
Year: 2008
DOI: 10.7907/7B0F-E145
<p>We study the ordinal ranking problem in machine learning. The problem can be viewed as a classification problem with additional ordinal information or as a regression problem without actual numerical information. From the classification perspective, we formalize the concept of ordinal information by a cost-sensitive setup, and propose some novel cost-sensitive classification algorithms. The algorithms are derived from a systematic cost-transformation technique, which carries a strong theoretical guarantee. Experimental results show that the novel algorithms perform well both in a general cost-sensitive setup and in the specific ordinal ranking setup.</p>
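One standard way to encode ordinal information as costs is the absolute cost, paired with the minimum-expected-cost prediction rule. This illustrates the cost-sensitive setup in general; the exact cost matrices used in the thesis may differ.

```python
def absolute_cost_matrix(K):
    """Cost matrix for K ordinal ranks (0-based) under absolute cost:
    C[y][k] = |y - k|. Predicting rank k when the true rank is y costs
    their distance, which is what makes the setup ordinal rather than
    plain multiclass (where C[y][k] would just be 0/1)."""
    return [[abs(y - k) for k in range(K)] for y in range(K)]

def cost_sensitive_predict(probs, C):
    """Minimum-expected-cost prediction given class probabilities:
    argmin_k sum_y probs[y] * C[y][k]."""
    K = len(C)
    expected = [sum(probs[y] * C[y][k] for y in range(K)) for k in range(K)]
    return min(range(K), key=expected.__getitem__)
```

Note how ordinal costs change the answer: with probabilities (0.4, 0.2, 0.4) over three ranks, 0/1 cost would pick an extreme rank, while absolute cost prefers the middle rank because large mistakes are penalized more.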
<p>From the regression perspective, we propose the threshold ensemble model for ordinal ranking, which allows the machines to estimate a real-valued score (as in regression) before quantizing it to an ordinal rank. We study the generalization ability of threshold ensembles and derive novel large-margin bounds on their expected test performance. In addition, we improve an existing algorithm and propose a novel algorithm for constructing large-margin threshold ensembles. Our proposed algorithms are efficient in training and achieve decent out-of-sample performance when compared with the state-of-the-art algorithm on benchmark data sets.</p>
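The quantization step of a threshold model can be sketched in one line: a rank is the number of ordered thresholds the score exceeds. This sketches the model only, not the thesis's training algorithms.

```python
def rank_from_score(score, thresholds):
    """Threshold model: quantize a real-valued score into an ordinal rank
    (0-based) by counting how many of the sorted thresholds it exceeds.
    K ranks require K-1 thresholds."""
    return sum(1 for th in thresholds if score > th)
```

In a threshold ensemble, the score itself is a weighted vote of base hypotheses; the margin bounds then involve both the score's distance to each threshold and the ensemble weights.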
<p>We then study how ordinal ranking can be reduced to weighted binary classification. The reduction framework is simpler than the cost-sensitive classification approach and includes the threshold ensemble model as a special case. The framework allows us to derive strong theoretical results that tightly connect ordinal ranking with binary classification. We demonstrate the algorithmic and theoretical use of the reduction framework by extending SVM and AdaBoost, two of the most popular binary classification algorithms, to the area of ordinal ranking. Coupling SVM with the reduction framework results in a novel and faster algorithm for ordinal ranking with superior performance on real-world data sets, as well as a new bound on the expected test performance for generalized linear ordinal rankers. Coupling AdaBoost with the reduction framework leads to a novel algorithm that boosts the training accuracy of any cost-sensitive ordinal ranking algorithm theoretically, and in turn improves its test performance empirically.</p>
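A common form of this reduction asks, for each ordinal example, the K-1 binary questions "is the rank greater than k?", with example weights given by cost differences. The sketch below follows that general recipe; the 0-based indexing and cost-matrix shape are illustrative assumptions.

```python
def reduce_to_binary(y, C):
    """Reduce one ordinal example with true rank y (0-based) under cost
    matrix C (K ranks) to K-1 weighted binary examples. The k-th binary
    example asks 'is the rank greater than k?': label +1/-1, with weight
    equal to the cost difference |C[y][k] - C[y][k+1]|."""
    K = len(C)
    out = []
    for k in range(K - 1):
        label = 1 if y > k else -1
        weight = abs(C[y][k] - C[y][k + 1])
        out.append((k, label, weight))
    return out
```

A weighted binary classifier trained on the union of these examples then yields an ordinal ranker, and its binary error translates into a bound on the ordinal cost.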
<p>The studies above show that the key to improving ordinal ranking is improving binary classification. In the final part of the thesis, we include two projects that aim at understanding binary classification better in the context of ensemble learning. First, we discuss how AdaBoost is restricted to combining only a finite number of hypotheses, and remove the restriction by formulating a framework of infinite ensemble learning based on SVM. The framework can output an infinite ensemble by embedding infinitely many hypotheses into an SVM kernel. Using the framework, we show that binary classification (and hence ordinal ranking) can be improved by going from a finite ensemble to an infinite one. Second, we discuss how AdaBoost carries the property of being resistant to overfitting, and propose the SeedBoost algorithm, which uses this property as machinery to prevent other learning algorithms from overfitting. Empirical results demonstrate that SeedBoost can indeed improve an overfitting algorithm on some data sets.</p>https://thesis.library.caltech.edu/id/eprint/2321Optimal Data Distributions in Machine Learning
https://resolver.caltech.edu/CaltechTHESIS:05262015-094933189
Authors: {'items': [{'email': 'carlos.r.gonzalez@gmail.com', 'id': 'González-Palacios-Carlos-Roberto', 'name': {'family': 'González Palacios', 'given': 'Carlos Roberto'}, 'show_email': 'YES'}]}
Year: 2015
DOI: 10.7907/Z9DR2SD5
<p>In the first part of the thesis we explore three fundamental questions that arise naturally in a machine learning scenario where the training and test distributions can differ. Contrary to conventional wisdom, we show that a mismatch between the training and test distributions can in fact yield better out-of-sample performance. This optimal performance can be obtained by training with the dual distribution, an optimal training distribution that depends on the test distribution set by the problem, but not on the target function that we want to learn. We show how to obtain this distribution in both discrete and continuous input spaces, as well as how to approximate it in a practical scenario. The benefits of using this distribution are exemplified on both synthetic and real data sets.</p>
<p>In order to apply the dual distribution in the supervised learning scenario where the training data set is fixed, it is necessary to use weights to make the sample appear as if it came from the dual distribution. We explore the negative effect that weighting a sample can have. The theoretical decomposition of the effect of weights on the out-of-sample error is easy to understand but not actionable in practice, as the quantities involved cannot be computed. Hence, we propose the Targeted Weighting algorithm, which determines, in a practical setting, whether the out-of-sample performance will improve for a given set of weights. This is necessary because the setting assumes there are no labeled points distributed according to the test distribution, only unlabeled samples.</p>
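The weighting mechanism itself is standard importance weighting; a minimal sketch is given below. The densities `p_train` and `p_target` are assumed known here, which in the thesis's setting they generally are not (hence the need for a practical test such as Targeted Weighting), and `p_target` stands in for whatever distribution one wants the sample to mimic, such as the dual distribution.

```python
def importance_weights(p_train, p_target, X):
    """Standard importance weights w(x) = p_target(x) / p_train(x):
    a sample drawn from p_train, reweighted this way, behaves in
    expectation like a sample drawn from p_target."""
    return [p_target(x) / p_train(x) for x in X]

def weighted_mean(values, weights):
    """Self-normalized weighted average of a statistic under the weights."""
    z = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / z
```

The downside the paragraph alludes to is that highly uneven weights shrink the effective sample size, which can hurt out-of-sample performance even when the weights are correct.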
<p>Finally, we propose a new class of matching algorithms that can be used to match the training set to a desired distribution, such as the dual distribution (or the test distribution). These algorithms can be applied to very large datasets, and we show how they lead to improved performance in a large real dataset such as the Netflix dataset. Their computational complexity is the main reason for their advantage over previous algorithms proposed in the covariate shift literature.</p>
<p>In the second part of the thesis we apply machine learning to the problem of behavior recognition. We develop a specific behavior classifier to study fly aggression, and we develop a system, which we call CUBA (Caltech Unsupervised Behavior Analysis), that analyzes behavior in videos of animals with minimal supervision. CUBA detects movemes, actions, and stories from time series describing the positions of animals in videos. The method summarizes the data and provides biologists with a mathematical tool to test new hypotheses. Other benefits of CUBA include finding classifiers for specific behaviors without the need for annotation, and providing a means to discriminate between groups of animals, for example according to their genetic line.</p>https://thesis.library.caltech.edu/id/eprint/8888Advancements in Hemodynamic Measurement: Arterial Resonance, Ultrasound, and Machine Learning
https://resolver.caltech.edu/CaltechTHESIS:06022023-215651797
Authors: {'items': [{'email': 'dominicyurk@gmail.com', 'id': 'Yurk-Dominic-Jeffrey', 'name': {'family': 'Yurk', 'given': 'Dominic Jeffrey'}, 'orcid': '0000-0002-2276-4189', 'show_email': 'YES'}]}
Year: 2023
DOI: 10.7907/q7j4-vj19
<p>This thesis covers two separate projects, both of which use ultrasound to measure a form of blood pressure in very different ways. The first project focuses on the noninvasive measurement of continuous arterial blood pressure via the previously unstudied phenomenon of arterial resonance. While prior research efforts have attempted many methods of noninvasive blood pressure measurement, none has been able to generate continuous, calibration-free measurements based on a first-principles physical model. This work describes the derivation of this resonance-based model, its <i>in vitro</i> validation, and its <i>in vivo</i> testing on 60 subjects. The testing yielded robust resonance detection and accurate calculation of blood pressure (BP) in the large majority of evaluated subjects, a very promising result for the first test of a new biomedical technology. The second project shifts focus to the measurement of blood pressure in the right atrium of the heart, an important clinical indicator in heart disease patients. Rather than developing a new physical approach, this project used machine learning to model the existing assessments made by cardiologists. Comparison to gold-standard invasive catheter measurements showed that model predictions were statistically indistinguishable from cardiologist measurements. Both projects represent significant advances in extending precise blood pressure measurement beyond critical care units and to a much broader population.</p>https://thesis.library.caltech.edu/id/eprint/16066