{"title": "Union of Intersections (UoI) for Interpretable Data Driven Discovery and Prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 1078, "page_last": 1086, "abstract": "The increasing size and complexity of scientific data could dramatically enhance discovery and prediction for basic scientific applications, e.g., neuroscience, genetics, systems biology, etc. Realizing this potential, however, requires novel statistical analysis methods that are both interpretable and predictive. We introduce the Union of Intersections (UoI) method, a flexible, modular, and scalable framework for enhanced model selection and estimation. The method performs model selection and model estimation through intersection and union operations, respectively. We show that UoI can satisfy the bi-criteria of low-variance and nearly unbiased estimation of a small number of interpretable features, while maintaining high-quality prediction accuracy. We perform extensive numerical investigation to evaluate a UoI algorithm ($UoI_{Lasso}$) on synthetic and real data. In doing so, we demonstrate the extraction of interpretable functional networks from human electrophysiology recordings as well as the accurate prediction of phenotypes from genotype-phenotype data with reduced features. We also show (with the $UoI_{L1Logistic}$ and $UoI_{CUR}$ variants of the basic framework) improved prediction parsimony for classification and matrix factorization on several benchmark biomedical data sets. These results suggest that methods based on UoI framework could improve interpretation and prediction in data-driven discovery across scientific fields.", "full_text": "Union of Intersections (UoI) for Interpretable Data\n\nDriven Discovery and Prediction\n\nKristofer E. Bouchard\u2217\n\nAlejandro F. Bujan\u2020\n\nFarbod Roosta-Khorasani\u2021\n\nShashanka Ubaru\u00a7\n\nPrabhat\u00b6\n\nAntoine M. Snijders(cid:107)\n\nJian-Hua Mao(cid:107)\n\nEdward F. 
Chang\u2217\u2217\n\nMichael W. Mahoney\u2021\n\nSharmodeep Bhattacharyya\u2020\u2020\n\nAbstract\n\nThe increasing size and complexity of scienti\ufb01c data could dramatically enhance\ndiscovery and prediction for basic scienti\ufb01c applications. Realizing this potential,\nhowever, requires novel statistical analysis methods that are both interpretable\nand predictive. We introduce Union of Intersections (UoI), a \ufb02exible, modular,\nand scalable framework for enhanced model selection and estimation. Methods\nbased on UoI perform model selection and model estimation through intersection\nand union operations, respectively. We show that UoI-based methods achieve\nlow-variance and nearly unbiased estimation of a small number of interpretable\nfeatures, while maintaining high-quality prediction accuracy. We perform extensive\nnumerical investigation to evaluate a UoI algorithm (U oILasso) on synthetic and\nreal data. In doing so, we demonstrate the extraction of interpretable functional\nnetworks from human electrophysiology recordings as well as accurate prediction\nof phenotypes from genotype-phenotype data with reduced features. We also show\n(with the U oIL1Logistic and U oICU R variants of the basic framework) improved\nprediction parsimony for classi\ufb01cation and matrix factorization on several bench-\nmark biomedical data sets. These results suggest that methods based on the UoI\nframework could improve interpretation and prediction in data-driven discovery\nacross scienti\ufb01c \ufb01elds.\n\nIntroduction\n\n1\nA central goal of data-driven science is to identify a small number of features (i.e., predictor variables;\nX in Fig. 1(a)) that generate a response variable of interest (y in Fig. 1(a)) and then to estimate\nthe relative contributions of these features as the parameters in the generative process relating the\npredictor variables to the response variable (Fig. 1(a)). 
A common characteristic of many modern massive data sets is that they have a large number of features (i.e., high-dimensional data), while

∗Biological Systems and Engineering Division, LBNL. kebouchard@lbl.gov
†Redwood Center, UC Berkeley. afbujan@gmail.com
‡ICSI and Department of Statistics, UC Berkeley. {farbod,mmahoney}@icsi.berkeley.edu
§Department of Computer Science and Engineering, University of Minnesota. ubaru001@umn.edu
¶NERSC, LBNL. prabhat@lbl.gov
‖Biological Systems and Engineering Division, LBNL. {AMSnijders,jhmao}@lbl.gov
∗∗Department of Neurological Surgery, UC San Francisco. Edward.Chang@ucsf.edu
††Department of Statistics, Oregon State University. bhattash@science.oregonstate.edu

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Figure 1: The basic UoI framework. (a) Schematic of regularization and ensemble methods for regression. (b) Schematic of the Union of Intersections (UoI) framework. (c) A data-distributed version of the UoILasso algorithm. (d) Dependence of false positives, false negatives, and estimation variability on the number of bootstraps in the selection (B1) and estimation (B2) modules.

also exhibiting a high degree of sparsity and/or redundancy [2, 19, 11]. That is, while formally high-dimensional, most of the useful information in the data features for tasks such as reconstruction, regression, and classification can be restricted or compressed into a much smaller number of important features. In regression and classification, it is common to employ sparsity-inducing regularization to attempt to achieve simultaneously two related but quite different goals: to identify the features important for prediction (i.e., model selection) and to estimate the associated model parameters (i.e., model estimation) [2, 19]. 
For example, the Lasso algorithm in linear regression uses L1-\nregularization to penalize the total magnitude of model parameters, and this often results in feature\ncompression by setting some parameters exactly to zero [18] (See Fig. 1(a), pure white elements\nin right-hand vectors, emphasized by \u00d7). It is well known that this type of regularization implies a\nprior assumption about the distribution of the parameter (e.g., L1-regularization implicitly assumes\na Laplacian prior distribution) [12]. However, strong sparsity-inducing regularization, which is\ncommon when there are many more potential features than data samples (i.e., the so-called small\nn/p regime) can severely hinder the interpretation of model parameters (Fig. 1(a), indicated by less\nsaturated colors between top and bottom vectors on right hand side). For example, while sparsity may\nbe achieved, incorrect features may be chosen and parameters estimates may be biased. In addition,\nit can impede model selection and estimation when the true model distribution deviates from the\nassumed distribution [2, 10]. This may not matter for prediction quality, but it clearly has negative\nconsequences for interpretability, an admittedly not completely-well-de\ufb01ned property of algorithms\nthat is crucial in many scienti\ufb01c applications [9]. In this context, interpretability re\ufb02ects the degree to\nwhich an algorithm returns a small number of physically meaningful features with unbiased and low\nvariance estimates of their contributions.\n\nOn the other hand, another common characteristic of many state of the art methods is to combine\nseveral related models for a given task. In statistical data analysis, this is often formalized by so-called\n\n2\n\n\fensemble methods, which improve prediction accuracy by combining parameter estimates [12]. 
In\nparticular, by combining several different models, ensemble methods often include more features\nto predict the response variables, and thus the number of data features is expanded relative to the\nindividuals in the ensemble. For example, estimating an ensemble of model parameters by randomly\nresampling the data many times (e.g., bootstrapping) and then averaging the parameter estimates\n(e.g., bagging) can yield improved prediction accuracy by reducing estimation variability [8, 12] (See\nFig. 1(a), bottom). However, by averaging estimates from a large ensemble, this process often results\nin many non-zero parameters, which can hinder interpretability and the identi\ufb01cation of the true\nmodel support (compare top and bottom vectors on right hand side of Fig. 1(a)). Taken together, these\nobservations suggest that explicit and more precise control of feature compression and expansion\nmay result in an algorithm with improved interpretative and predictive properties.\n\nIn this paper, we introduce Union of Intersections (UoI), a \ufb02exible, modular, and scalable framework\nto enhance both the identi\ufb01cation of features (model selection) as well as the estimation of the\ncontributions of these features (model estimation). We have found that the UoI framework permits us\nto explore the interpretability-predictivity trade-off space, without imposing an explicit prior on the\nmodel distribution, and without formulating a non-convex problem, thereby often leading to improved\ninterpretability and prediction. 
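The feature-expansion effect of averaging that is described above can be seen in a small stand-in simulation (the ensemble members here are hypothetical sparse vectors with random supports, not estimates from any data set in the paper): each member is sparse, but because their supports differ, the bagged average is dense.

```python
import numpy as np

# Stand-in ensemble: each member keeps only 3 of 10 features (sparse),
# but the selected features differ from member to member.
rng = np.random.default_rng(0)
p, n_models = 10, 50
members = np.zeros((n_models, p))
for m in members:
    m[rng.choice(p, size=3, replace=False)] = rng.normal(size=3)

# Bagging averages the members; since supports differ, nearly every
# feature picks up a non-zero weight in the average.
bagged = members.mean(axis=0)
```

With 50 members each using 3 of 10 features, the probability that a feature appears in no member at all is (0.7)^50 ≈ 2e-8, so the average is effectively fully dense.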
Ideally, data analysis methods in many scientific applications should be selective (only features that influence the response variable are selected), accurate (estimated parameters in the model are as close to the true value as possible), predictive (allowing prediction of the response variable), stable (e.g., the variability of the estimated parameters is small), and scalable (able to return an answer in a reasonable amount of time on very large data sets) [17, 2, 15, 10]. We show empirically that UoI-based methods can simultaneously achieve these goals, results supported by preliminary theory. We primarily demonstrate the power of UoI-based methods in the context of sparse linear regression (UoILasso), as it is the canonical statistical/machine learning problem, it is theoretically tractable, and it is widely used in virtually every field of scientific inquiry. However, our framework is very general, and we demonstrate this by extending UoI to classification (UoIL1Logistic) and matrix factorization (UoICUR) problems. While our main focus is on neuroscience (broadly speaking) applications, our results also highlight the power of UoI across a broad range of synthetic and real scientific data sets.1

1 More details, including both empirical and theoretical results, are in the associated technical report [4].

2 Union of Intersections (UoI)

For concreteness, we consider an application of UoI in the context of linear regression. Specifically, we consider the problem of estimating the parameters β ∈ R^p that map a p-dimensional vector of predictor variables x ∈ R^p to the observation variable y ∈ R, when there are n paired samples of x and y corrupted by i.i.d. Gaussian noise:

y = β^T x + ε,    (1)

where ε iid∼ N(0, σ²) for each sample. When the true β is thought to be sparse (i.e., in the L0-norm sense), then an estimate of β (call it β̂) can be found by solving a constrained optimization problem of the form:

β̂ ∈ argmin_{β ∈ R^p} Σ_{i=1}^{n} (y_i − β^T x_i)² + λR(β).    (2)

Here, R(β) is a regularization term that typically penalizes the overall magnitude of the parameter vector β (e.g., R(β) = ‖β‖₁ is the target of the Lasso algorithm).

The Basic UoI Framework. The key mathematical idea underlying UoI is to perform model selection through intersection (compressive) operations and model estimation through union (expansive) operations, in that order. This is schematized in Fig. 1(b), which plots a hypothetical range of selected features (x1 : xp, abscissa) for different values of the regularization parameter (λ, ordinate). See [4] for a more detailed description. In particular, UoI first performs feature compression (Fig. 1(b), Step 1) through intersection operations (intersection of supports across bootstrap samples) to construct a family (S) of candidate model supports (Fig. 1(b), e.g., Sj−1, opaque red region is intersection of abutting pink regions). UoI then performs feature expansion (Fig. 1(b), Step 2) through a union of (potentially) different model supports: for each bootstrap sample, the best model estimate (across different supports) is chosen, and then a new model is generated by averaging the estimates (i.e., taking the union) across bootstrap samples (Fig. 1(b), dashed vertical black line indicates the union of features from Sj and Sj+1). Both feature compression and expansion are performed across all regularization strengths. 
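The two steps just described can be sketched on stand-in boolean selection data (randomly generated here; in UoILasso the selections would come from Lasso fits across bootstrap samples and regularization strengths, and the "best" pick per bootstrap would be chosen by held-out error rather than at random):

```python
import numpy as np

rng = np.random.default_rng(0)
n_lambda, n_boot, p = 4, 8, 10
# selected[j, b, k]: was feature k selected at regularization strength j in
# bootstrap sample b? (Stand-in random draws; UoI would use Lasso fits.)
selected = rng.random((n_lambda, n_boot, p)) < 0.6

# Step 1 (compression): candidate support S_j keeps a feature only if it was
# selected in EVERY bootstrap at strength j (logical AND = intersection).
supports = selected.all(axis=1)                        # shape (n_lambda, p)

# Step 2 (expansion): each bootstrap contributes its best candidate model
# (a stand-in random pick here); averaging the per-bootstrap estimates yields
# a final model living on the UNION of the winners' supports.
winners = supports[rng.integers(0, n_lambda, size=n_boot)]
beta_boot = winners * rng.normal(size=(n_boot, p))     # stand-in estimates
beta_final = beta_boot.mean(axis=0)
union_support = winners.any(axis=0)
```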
In UoI, feature compression via intersections and feature expansion via\nunions are balanced to maximize prediction accuracy of the sparsely estimated model parameters for\nthe response variable y.\nInnovations in Union of Intersections. UoI has three central innovations: (1) calculate model\nsupports (Sj) using an intersection operation for a range of regularization parameters (increases\nin \u03bb shrink all values \u02c6\u03b2 towards 0), ef\ufb01ciently constructing a family of potential model supports\n{S : Sj \u2208 Sj\u2212k, for k suf\ufb01ciently large}; (2) use a novel form of model averaging in the union\nstep to directly optimize prediction accuracy (this can be thought of as a hybrid of bagging [8]\nand boosting [16]); and (3) combine pure model selection using an intersection operation with\nmodel selection/estimation using a union operation in that order (which controls both false negatives\nand false positives in model selection). Together, these innovations often lead to better selection,\nestimation and prediction accuracy. Importantly, this is done without explicitly imposing a prior on\nthe distribution of parameter values, and without formulating a non-convex optimization problem.\nThe U oILasso Algorithm. Since the basic UoI framework, as described in Fig. 1(c), has two main\ncomputational modules\u2014one for model selection, and one for model estimation\u2014UoI is a framework\ninto which many existing algorithms can be inserted. Here, for simplicity, we primarily demonstrate\nUoI in the context of linear regression in the U oILasso algorithm, although we also apply it to\nclassi\ufb01cation with the U oIL1Logistic algorithm as well as matrix factorization with the U oICU R\nalgorithm. U oILasso expands on the BoLasso method for the model selection module [1], and it\nperforms a novel model averaging in the estimation module based on averaging ordinary least squares\n(OLS) estimates with potentially different model supports. 
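A compact, serial sketch of this two-module structure follows. This is a minimal illustration, not the authors' Python-MPI implementation: a small ISTA-based solver stands in for the Lasso step, and function names and defaults are assumptions.

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=500):
    """Minimal ISTA solver for the Lasso (stand-in for any L1 solver)."""
    n, p = X.shape
    beta = np.zeros(p)
    L = np.linalg.norm(X, 2) ** 2 / n        # Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n
        z = beta - grad / L
        beta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return beta

def uoi_lasso(X, y, lambdas, B1=20, B2=10, rng=None):
    rng = np.random.default_rng(rng)
    n, p = X.shape
    # Selection module: one candidate support per regularization strength,
    # formed by intersecting Lasso supports across B1 bootstrap samples.
    supports = []
    for lam in lambdas:
        S = np.ones(p, dtype=bool)
        for _ in range(B1):
            idx = rng.integers(0, n, n)
            S &= lasso_ista(X[idx], y[idx], lam) != 0
        supports.append(S)
    # Estimation module: for each of B2 bootstraps, fit OLS on every candidate
    # support, keep the best on out-of-bag data, then average (the union step).
    betas = np.zeros((B2, p))
    for b in range(B2):
        idx = rng.integers(0, n, n)
        hold = np.setdiff1d(np.arange(n), idx)
        best, best_err = np.zeros(p), np.inf
        for S in supports:
            if not S.any():
                continue
            beta = np.zeros(p)
            beta[S] = np.linalg.lstsq(X[idx][:, S], y[idx], rcond=None)[0]
            err = np.mean((y[hold] - X[hold] @ beta) ** 2)
            if err < best_err:
                best, best_err = beta, err
        betas[b] = best
    return betas.mean(axis=0)
```

Because the final estimate averages OLS fits rather than penalized fits, the non-zero coefficients are nearly unbiased, while the intersection step keeps the support small.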
U oILasso (and UoI in general) has a\nhigh degree of natural algorithmic parallelism that we have exploited in a distributed Python-MPI\nimplementation. (Fig. 1(c) schematizes a simpli\ufb01ed distributed implementation of the algorithm;\nsee [4] for more details.) This parallelized U oILasso algorithm uses distribution of bootstrap data\nsamples and regularization parameters (in Map) for independent computations involving convex\noptimizations (Lasso and OLS, in Solve), and it then combines results (in Reduce) with intersection\noperations (model selection module) and union operations (model estimation module). By solving\nindependent convex optimization problems (e.g., Lasso, OLS) with distributed data resampling, our\nU oILasso algorithm ef\ufb01ciently constructs a family of model supports, and it then averages nearly\nunbiased model estimates, potentially with different supports, to maximize prediction accuracy while\nminimizing the number of features to aid interpretability.\n3 Results\n3.1 Methods\nAll numerical results used 100 random sub-samplings with replacement of 80-10-10 cross-validation\nto estimate model parameters (80%), choose optimal meta-parameters (e.g., \u03bb, 10%), and determine\nprediction quality (10%). Below, \u03b2 denotes the values of the true model parameters, \u02c6\u03b2 denotes the\nestimated values of the model parameters from some algorithm (e.g., U oILasso), S\u03b2 is the support of\n\n4\n\n\fthe true model (i.e., the set of non-zero parameter indices), and S \u02c6\u03b2 is the support of the estimated\nmodel. 
We calculated several metrics of model selection, model estimation, and prediction accuracy. (1) Selection accuracy (set overlap): 1 − |S_β̂ Δ S_β| / (|S_β̂|₀ + |S_β|₀), where Δ is the symmetric set difference operator. This metric ranges in [0, 1], taking a value of 0 if S_β and S_β̂ have no elements in common, and taking a value of 1 if and only if they are identical. (2) Estimation error (r.m.s.): √((1/p) Σ_i (β_i − β̂_i)²). (3) Estimation variability (parameter variance): E[β̂²] − (E[β̂])². (4) Prediction accuracy (R²): 1 − Σ_i (y_i − ŷ_i)² / Σ_i (y_i − E[y])². (5) Prediction parsimony (BIC): n log((1/n) Σ_{i=1}^{n} (y_i − ŷ_i)²) + ‖β̂‖₀ log(n). For the experimental data, as the true model size is unknown, the selection ratio (‖β̂‖₀/p) is a measure of the overall size of the estimated model relative to the total number of parameters. For the classification task using UoIL1Logistic, BIC was calculated as −2 log ℓ + |S_β̂| log N, where ℓ is the log-likelihood on the validation set. For the matrix factorization task using UoICUR, reconstruction accuracy was the Frobenius norm of the difference between the data matrix A and the low-rank approximation matrix A′ constructed from A(:, c), the reduced column matrix of A: ‖A − A′‖_F, where c is the set of k selected columns.

3.2 Model Selection and Stability: Explicit Control of False Positives, False Negatives, and Estimate Stability

Due to the form of the basic UoI framework, we can control both false negative and false positive discoveries, as well as the stability of the estimates. 
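The selection-accuracy, R², and BIC metrics defined in Sec. 3.1 can be sketched as follows (a minimal illustration, not the authors' evaluation code):

```python
import numpy as np

def selection_accuracy(beta_hat, beta_true):
    """1 - |S_hat symdiff S_true| / (|S_hat| + |S_true|); equals 1 iff supports match."""
    s_hat = set(np.flatnonzero(beta_hat))
    s_true = set(np.flatnonzero(beta_true))
    return 1.0 - len(s_hat ^ s_true) / (len(s_hat) + len(s_true))

def r_squared(y, y_hat):
    """Coefficient of determination of predictions on held-out data."""
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)

def bic(y, y_hat, beta_hat):
    """n log(mean squared residual) + |support| log(n); lower is more parsimonious."""
    n = len(y)
    mse = np.mean((y - y_hat) ** 2)
    return n * np.log(mse) + np.count_nonzero(beta_hat) * np.log(n)
```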
For any regularized regression method like that in (2), a decrease in the penalization parameter (λ) tends to increase the number of false positives, and an increase in λ tends to increase false negatives. Preliminary analysis of the UoI framework shows that, for false positives, a large number of bootstrap resamples in the intersection step (B1) produces an increase in the probability of getting no false positive discoveries, while an increase in the number of bootstraps in the union step (B2) leads to a decrease in the probability of getting no false positives. Conversely, for false negatives, a large number of bootstrap resamples in the union step (B2) produces an increase in the probability of no false negative discoveries, while an increase in the number of bootstraps in the intersection step (B1) leads to a decrease in the probability of no false negatives. Also, a large number of bootstrap samples in the union step (B2) gives a more stable estimate. These properties were confirmed numerically for UoILasso and are displayed in Fig. 1(d), which plots the average normalized false negatives, false positives, and standard deviation of model estimates from running UoILasso with ranges of B1 and B2 on four different models. These results are supported by preliminary theoretical analysis of a variant of UoILasso (see [4]). Thus, the relative values of B1 and B2 express the fundamental balance between the two basic operations of intersection (which compresses the feature space) and union (which expands the feature space). Model selection through intersection often excludes true parameters (i.e., false negatives), and, conversely, model estimation using unions often includes erroneous parameters (i.e., false positives). 
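The false-positive side of this tradeoff can be illustrated with a tiny stand-in Monte Carlo. The per-bootstrap selection probability q of a noise feature is an assumption for illustration, not a quantity from the paper: a noise feature survives an intersection of B1 independent bootstraps with probability q^B1, so the chance of zero false positives grows with B1.

```python
import numpy as np

rng = np.random.default_rng(0)
q = 0.3        # assumed chance a noise feature is selected in ONE bootstrap fit
p_noise = 20   # number of pure-noise features

def prob_no_false_positive(B1, trials=2000):
    # Simulate B1 independent bootstrap selections of p_noise noise features;
    # a feature is a false positive only if selected in ALL B1 bootstraps.
    sel = rng.random((trials, B1, p_noise)) < q
    survived = sel.all(axis=1)             # intersection across bootstraps
    return (~survived.any(axis=1)).mean()  # no noise feature survives at all

probs = [prob_no_false_positive(B1) for B1 in (1, 2, 5)]
```

Under these assumptions the probability of no false positives rises from essentially 0 at B1 = 1 toward (1 − q^B1)^p_noise ≈ 0.95 at B1 = 5, mirroring the B1 dependence in Fig. 1(d).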
By using stochastic\nresampling, combined with model selection through intersections, followed by model estimation\nthrough unions, UoI permits us to mitigate the feature inclusion/exclusion inherent in either operation.\nEssentially, the limitations of selection by intersection are counteracted by the union of estimates,\nand vice versa.\n\n3.3 U oILasso has Superior Performance on Simulated Data Sets\nTo explore the performance of the U oILasso algorithm, we have performed extensive numerical\ninvestigations on simulated data sets, where we can control key properties of the data. There are a\nlarge number of algorithms available for linear regression, and we picked some of the most popular\nalgorithms (e.g., Lasso), as well as more uncommon, but more powerful algorithms (e.g., SCAD, a\nnon-convex method). Speci\ufb01cally, we compared U oILasso to \ufb01ve other model selection/estimation\n\n5\n\n\fFigure 2: Range of observed results, in comparison with existing algorithms. (a) True \u03b2 distribu-\ntion (grey histograms) and estimated values (colored lines). (b) Scatter plot of true and estimated\nvalues of observation variable on held-out samples. (c) Metrics of algorithm performance.\n\nmethods: Ridge, Lasso, SCAD, BoATS, and debiased Lasso [12, 18, 10, 5, 3, 13]. Note that BoATS\nand debiased Lasso are both two-stage methods. We examined performance of these algorithms\nacross a variety of underlying distributions of model parameters, degrees of sparsity, and noise\nlevels. Across all algorithms examined, we found that U oILasso (Fig. 2, black) generally resulted\nin very high selection accuracy (Fig. 2(c), right) with parameter estimates with low error (Fig. 2(c),\ncenter-right), leading to the best prediction accuracy (Fig. 2(c), center-left) and prediction parsimony\n(Fig. 2(c), left). In addition, it was very robust to differences in underlying parameter distribution,\ndegree of sparsity, and magnitude of noise. 
(See [4] for more details.)\n3.4 U oILasso in Neuroscience: Sparse Functional Networks from Human Neural\nRecordings and Parsimonious Prediction from Genetic and Phenotypic Data\n\nWe sought to determine if the enhanced selection and estimation properties of U oILasso also improved\nits utility as a tool for data-driven discovery in complex, diverse neuroscience data sets. Neurobiology\nseeks to understand the brain across multiple spatio-temporal scales, from molecules-to-minds.\nWe \ufb01rst tackled the problem of graph formation from multi-electrode (p = 86 electrodes) neural\nrecordings taken directly from the surface of the human brain during speech production (n = 45 trials\neach). See [7] for details. That is, the goal was to construct sparse neuroscienti\ufb01cally-meaningful\ngraphs for further downstream analysis. To estimate functional connectivity, we calculated partial\ncorrelation graphs. The model was estimated independently for each electrode, and we compared\nthe results of graphs estimated by U oILasso to the graphs estimated by SCAD. In Fig. 3(a)-(b), we\ndisplay the networks derived from recordings during the production of /b/ while speaking /ba/. We\nfound that the U oILasso network (Fig. 3(a)) was much sparser than the SCAD network (Fig. 3(b)).\nFurthermore, the network extracted by U oILasso contained electrodes in the lip (dorsal vSMC), jaw\n(central vSMC), and larynx (ventral vSMC) regions, accurately re\ufb02ecting the articulators engaged\nin the production of /b/ (Fig. 3(c)) [7]. The SCAD network (Fig. 3(d)) did not have any of these\nproperties. This highlights the improved power of U oILasso to extract sparse graphs with functionally\nmeaningful features relative to even some non-convex methods.\n\nWe calculated connectivity graphs during the production of 9 consonant-vowel syllables. Fig. 
3(e) displays a summary of prediction accuracy for UoILasso networks (red) and SCAD networks (black)

Figure 3: Application of UoI to neuroscience and genetics data. (a)-(f): Functional connectivity networks from ECoG recordings during speech production. (g)-(h): Parsimonious prediction of complex phenotypes from genotype and phenotype data.

as a function of time. The average relative prediction accuracy (compared to baseline times) for the UoILasso network was generally greater during the time of peak phoneme encoding [T = -100:200] compared to the SCAD network. Fig. 3(f) plots the time course of the parameter selection ratio for the UoILasso network (red) and SCAD network (black). The UoILasso network was consistently ∼ 5× sparser than the SCAD network. These results demonstrate that UoILasso extracts sparser graphs from noisy neural signals with a modest increase in prediction accuracy compared to SCAD.

We next investigated whether UoILasso would improve the identification of a small number of highly predictive features from genotype-phenotype data. To do so, we analyzed data from n = 365 mice (173 female, 192 male) that are part of the genetically diverse Collaborative Cross cohort. We analyzed single-nucleotide polymorphisms (SNPs) from across the entire genome of each mouse (p = 11,563 SNPs). For each animal, we measured two continuous, quantitative phenotypes: weight and behavioral performance on the rotorod task (see [14] for details). We focused on predicting these phenotypes from a small number of genotype-phenotype features. We found that UoILasso identified and estimated a small number of features that were sufficient to explain large amounts of variability in these complex behavioral and physiological phenotypes. Fig. 3(g) displays the non-zero values estimated for the different features (e.g., location of loci on the genome) contributing to the weight (black) and speed (red) phenotype. 
Here, non-opaque points correspond to the mean ± s.d. across cross-validation samples, while the opaque points are the medians. Importantly, for both speed and weight phenotypes, we confirmed that several identified predictor features had been reported in the literature, though by different studies, e.g., genes coding for Kif1b, Rrm2b/Ubr5, and Dloc2. (See [4] for more details.) Accurate prediction of phenotypic variability with a small number of factors was a unique property of models found by UoILasso. For both weight and rotorod performance, models fit by UoILasso had marginally increased prediction accuracy compared to other methods (+1%), but they did so with far fewer parameters (lower selection ratios). This resulted in prediction parsimony (BIC) that was several orders of magnitude better (Fig. 3(h)). Together, these results demonstrate that UoILasso can identify a small number of genetic/physiological factors that are highly predictive of complex physiological and behavioral phenotypes.

Figure 4: Extension of UoI to classification and matrix decomposition. (a) UoI for classification (UoIL1Logistic). (b) UoI for matrix decomposition (UoICUR); solid and dashed lines are for the PAH and SORCH data sets, respectively.

3.5 UoIL1Logistic and UoICUR: Application of UoI to Classification and Matrix Decomposition

As noted, UoI is a framework into which other methods can be inserted. While we have primarily demonstrated UoI in the context of linear regression, it is much more general than that. To illustrate this, we implemented a classification algorithm (UoIL1Logistic) and a matrix decomposition algorithm (UoICUR), and we compared them to the base methods on several data sets (see [4] for details). In classification, UoI resulted in either equal or improved prediction accuracy with 2x-10x fewer parameters for a variety of biomedical classification tasks (Fig. 4(a)). 
For matrix decomposition (in this case, column subset selection), for a given dimensionality, UoI resulted in reconstruction errors that were consistently lower than the base method (BasicCUR), and quickly approached an unscalable greedy algorithm (GreedyCUR) for two genetics data sets (Fig. 4(b)). In both cases, UoI improved the prediction parsimony relative to the base (classification or decomposition) method.

4 Discussion

UoI-based methods leverage stochastic data resampling and a range of sparsity-inducing regularization parameters/dimensions to build families of potential features, and they then average nearly unbiased parameter estimates of selected features to maximize predictive accuracy. Thus, UoI separates model selection with intersection operations from model estimation with union operations: the limitations of selection by intersection are counteracted by the union of estimates, and vice versa. Stochastic data resampling can be viewed as a perturbation of the data, and UoI efficiently identifies and robustly estimates features that are stable to these perturbations. A unique property of UoI-based methods is the ability to control both false positives and false negatives. Initial theoretical work (see [4]) shows that increasing the number of bootstraps in the selection module (B1) increases the amount of feature compression (primary controller of false positives), while increasing the number of bootstraps in the estimation module (B2) increases feature expansion (primary controller of false negatives), and we observe this empirically. Thus, neither should be too large, and their relative values express the balance between feature compression and expansion. This tension is seen in many places in machine learning and data analysis: local nearest neighbor methods vs. global latent factor models; local spectral methods that tend to expand due to their diffusion-based properties vs. 
flow-based methods that tend to contract; and sparse L1 vs. dense L2 penalties/priors more generally. Interestingly, an analogous balance of compressive and expansive forces contributes to neural learning algorithms based on Hebbian synaptic plasticity [6]. Our results highlight how revisiting popular methods in light of new data science demands can lead to still further-improved methods, and they suggest several directions for theoretical and empirical work.

References

[1] F. R. Bach. Bolasso: model consistent Lasso estimation through the bootstrap. In Proceedings of the 25th International Conference on Machine Learning, pages 33-40, 2008.

[2] P. Bickel and B. Li. Regularization in statistics. TEST, 15(2):271-344, 2006.

[3] K. E. Bouchard. Bootstrapped adaptive threshold selection for statistical model selection and estimation. Technical report, 2015. Preprint: arXiv:1505.03511.

[4] K. E. Bouchard, A. F. Bujan, F. Roosta-Khorasani, S. Ubaru, Prabhat, A. M. Snijders, J.-H. Mao, E. F. Chang, M. W. Mahoney, and S. Bhattacharyya. Union of Intersections (UoI) for interpretable data driven discovery and prediction. Technical report, 2017. Preprint: arXiv:1705.07585 (also available as Supplementary Material).

[5] K. E. Bouchard and E. F. Chang. Control of spoken vowel acoustics and the influence of phonetic context in human speech sensorimotor cortex. Journal of Neuroscience, 34(38):12662-12677, 2014.

[6] K. E. Bouchard, S. Ganguli, and M. S. Brainard. Role of the site of synaptic competition and the balance of learning forces for Hebbian encoding of probabilistic Markov sequences. Frontiers in Computational Neuroscience, 9(92), 2015.

[7] K. E. Bouchard, N. Mesgarani, K. Johnson, and E. F. Chang. Functional organization of human sensorimotor cortex for speech articulation. Nature, 495(7441):327-332, 2013.

[8] L. Breiman. Bagging predictors. 
Machine Learning, 24(2):123-140, 1996.

[9] National Research Council. Frontiers in Massive Data Analysis. The National Academies Press, Washington, D.C., 2013.

[10] J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348-1360, 2001.

[11] S. Ganguli and H. Sompolinsky. Compressed sensing, sparsity, and dimensionality in neuronal information processing and data analysis. Annual Review of Neuroscience, 35(1):485-508, 2012.

[12] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer-Verlag, New York, 2003.

[13] A. Javanmard and A. Montanari. Confidence intervals and hypothesis testing for high-dimensional regression. Journal of Machine Learning Research, 15:2869-2909, 2014.

[14] J.-H. Mao, S. A. Langley, Y. Huang, M. Hang, K. E. Bouchard, S. E. Celniker, J. B. Brown, J. K. Jansson, G. H. Karpen, and A. M. Snijders. Identification of genetic factors that modify motor performance and body weight using Collaborative Cross mice. Scientific Reports, 5:16247, 2015.

[15] V. Marx. Biology: The big challenges of big data. Nature, 498(7453):255-260, 2013.

[16] R. E. Schapire and Y. Freund. Boosting: Foundations and Algorithms. MIT Press, Cambridge, MA, 2012.

[17] T. J. Sejnowski, P. S. Churchland, and J. A. Movshon. Putting big data to good use in neuroscience. Nature Neuroscience, 17(11):1440-1441, 2014.

[18] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, 58(1):267-288, 1996.

[19] M. J. Wainwright. Structured regularizers for high-dimensional problems: Statistical and computational issues. 
Annual Review of Statistics and Its Application, 1:233\u2013253, 2014.\n\n9\n\n\f", "award": [], "sourceid": 717, "authors": [{"given_name": "Kristofer", "family_name": "Bouchard", "institution": "Lawrence Berkeley National Laboratory"}, {"given_name": "Alejandro", "family_name": "Bujan", "institution": "UC Berkeley"}, {"given_name": "Fred", "family_name": "Roosta", "institution": "University of California Berkeley"}, {"given_name": "Shashanka", "family_name": "Ubaru", "institution": "University of Minnesota"}, {"given_name": "Mr.", "family_name": "Prabhat", "institution": "LBL/NERSC"}, {"given_name": "Antoine", "family_name": "Snijders", "institution": null}, {"given_name": "Jian-Hua", "family_name": "Mao", "institution": null}, {"given_name": "Edward", "family_name": "Chang", "institution": null}, {"given_name": "Michael", "family_name": "Mahoney", "institution": "UC Berkeley"}, {"given_name": "Sharmodeep", "family_name": "Bhattacharya", "institution": "Oregon State University"}]}