Our first contribution is a novel fast cross validation procedure suitable for any mergeable algorithm. This fast cross validation procedure has a constant runtime independent of the number of folds and can be implemented on distributed systems. This procedure is also widely applicable. We show that 32 recently proposed learning algorithms are mergeable and therefore fit our cross validation framework. These learning algorithms come from many subfields of machine learning, including density estimation, regularized loss minimization, dimensionality reduction, submodular optimization, variational inference, and markov chain monte carlo.
We also provide two new mergeable learning algorithms. In the context of regularized loss minimization, existing merge procedures either have high bias or slow runtimes. We introduce the optimal weighted average (OWA) merge procedure, which achieves both a low bias and fast runtime. We also improve the cover tree data structure for fast nearest neighbor queries by providing a merge procedure. In doing so, we improve both the theoretical guarantees of the cover tree and its practical runtime. For example, the original cover tree was able to find nearest neighbors in time O(c12exp log n), and we improve this bound to O(c4hole log n) for i.i.d. data. Here, cexp and chole are measures of the “intrinsic dimensionality” of the data, and on typical datasets cexp > chole. Experiments on large scale ad-click, genomics, and image classification tasks empirically validate these algorithms.
Michael John Izbicki (2017). Divide and Conquer Algorithms for Machine Learning. Doctoral dissertation, University of California at Riverside. |
@phdthesis{Izb16, author = "Michael John Izbicki", title = "Divide and Conquer Algorithms for Machine Learning", school = "University of California at Riverside", schoolabbr = "UC Riverside", year = 2017, month = Sep, }