{"title": "Multi-Label Prediction via Compressed Sensing", "book": "Advances in Neural Information Processing Systems", "page_first": 772, "page_last": 780, "abstract": "", "full_text": "Multi-Label Prediction via Compressed Sensing\n\nDaniel Hsu\nUC San Diego\n\ndjhsu@cs.ucsd.edu\n\nSham M. Kakade\n\nTTI-Chicago\n\nsham@tti-c.org\n\nJohn Langford\nYahoo! Research\njl@hunch.net\n\nTong Zhang\n\nRutgers University\n\ntongz@rci.rutgers.edu\n\nAbstract\n\nWe consider multi-label prediction problems with large output spaces under the\nassumption of output sparsity \u2013 that the target (label) vectors have small support.\nWe develop a general theory for a variant of the popular error correcting output\ncode scheme, using ideas from compressed sensing for exploiting this sparsity.\nThe method can be regarded as a simple reduction from multi-label regression\nproblems to binary regression problems. We show that the number of subprob-\nlems need only be logarithmic in the total number of possible labels, making this\napproach radically more ef\ufb01cient than others. We also state and prove robustness\nguarantees for this method in the form of regret transform bounds (in general),\nand also provide a more detailed analysis for the linear prediction setting.\n\n1 Introduction\n\nSuppose we have a large database of images, and we want to learn to predict who or what is in any\ngiven one. A standard approach to this task is to collect a sample of these images x along with\ncorresponding labels y = (y1, . . . , yd) \u2208 {0, 1}d, where yi = 1 if and only if person or object i\nis depicted in image x, and then feed the labeled sample to a multi-label learning algorithm. Here,\nd is the total number of entities depicted in the entire database. When d is very large (e.g. 
10^3, 10^4), the simple one-against-all approach of learning a single predictor for each entity can become prohibitively expensive, both at training and testing time.

Our motivation for the present work comes from the observation that although the output (label) space may be very high-dimensional, the actual labels are often sparse. That is, in each image, only a small number of entities may be present, and there may be only a small amount of ambiguity in who or what they are. In this work, we consider how this sparsity in the output space, or output sparsity, eases the burden of large-scale multi-label learning.

Exploiting output sparsity. A subtle but critical point that distinguishes output sparsity from more common notions of sparsity (say, in feature or weight vectors) is that we are interested in the sparsity of E[y|x] rather than y. In general, E[y|x] may be sparse while the actual outcome y may not be (e.g. if there is much unbiased noise); and, vice versa, y may be sparse with probability one while E[y|x] has large support (e.g. if there is little distinction between several labels).

Conventional linear algebra suggests that we must predict d parameters in order to find the value of the d-dimensional vector E[y|x] for each x. A crucial observation – central to the area of compressed sensing [1] – is that methods exist to recover E[y|x] from just O(k log d) measurements when E[y|x] is k-sparse. This is the basis of our approach.

Our contributions. We show how to apply algorithms for compressed sensing to the output coding approach [2]. At a high level, the output coding approach creates a collection of subproblems of the form "Is the label in this subset or its complement?", solves these problems, and then uses their solutions to predict the final label.

The role of compressed sensing in our application is distinct from its more conventional uses in data compression. 
Although we do employ a sensing matrix to compress training data, we ultimately\nare not interested in recovering data explicitly compressed this way. Rather, we learn to predict\ncompressed label vectors, and then use sparse reconstruction algorithms to recover uncompressed\nlabels from these predictions. Thus we are interested in reconstruction accuracy of predictions,\naveraged over the data distribution.\n\nThe main contributions of this work are:\n\n1. A formal application of compressed sensing to prediction problems with output sparsity.\n\n2. An ef\ufb01cient output coding method, in which the number of required predictions is only\n\nlogarithmic in the number of labels d, making it applicable to very large-scale problems.\n\n3. Robustness guarantees, in the form of regret transform bounds (in general) and a further\n\ndetailed analysis for the linear prediction setting.\n\nPrior work. The ubiquity of multi-label prediction problems in domains ranging from multiple ob-\nject recognition in computer vision to automatic keyword tagging for content databases has spurred\nthe development of numerous general methods for the task. Perhaps the most straightforward ap-\nproach is the well-known one-against-all reduction [3], but this can be too expensive when the num-\nber of possible labels is large (especially if applied to the power set of the label space [4]). When\nstructure can be imposed on the label space (e.g. class hierarchy), ef\ufb01cient learning and prediction\nmethods are often possible [5, 6, 7, 8, 9]. Here, we focus on a different type of structure, namely\noutput sparsity, which is not addressed in previous work. Moreover, our method is general enough to\ntake advantage of structured notions of sparsity (e.g. group sparsity) when available [10]. 
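The reduction described above – compress the labels with a sensing matrix, learn regressors on the compressed targets, then recover a sparse label vector from the predicted measurements – can be sketched in a few lines of numpy. This is an illustrative sketch only, not the authors' code: it uses a Gaussian sensing matrix, a single least-squares solve standing in for the regression learner, and a plain orthogonal matching pursuit (OMP) as the reconstruction algorithm; all names are ours.

```python
import numpy as np

def train(X, Y, m, rng):
    """Compress d-dim labels to m dims, then fit linear regressors for the
    compressed targets (one least-squares solve covers all m coordinates)."""
    d = Y.shape[1]
    A = rng.standard_normal((m, d)) / np.sqrt(m)  # i.i.d. Gaussian sensing matrix
    Z = Y @ A.T                                   # compressed labels A y_i
    W, *_ = np.linalg.lstsq(X, Z, rcond=None)     # regressors for E[Ay|x]
    return A, W

def omp(A, h, k):
    """Orthogonal matching pursuit: greedily pick k columns of A to explain h."""
    residual, support = h.copy(), []
    for _ in range(k):
        support.append(int(np.argmax(np.abs(A.T @ residual))))
        coef, *_ = np.linalg.lstsq(A[:, support], h, rcond=None)
        residual = h - A[:, support] @ coef
    y = np.zeros(A.shape[1])
    y[support] = coef
    return y

def predict(A, W, x, k):
    h = W.T @ x          # predicted compressed label H(x)
    return omp(A, h, k)  # sparse reconstruction of the label vector
```

Note that only m regressors are trained, even though the recovered label vector lives in d dimensions; with m = O(k log d) this is the source of the efficiency gain.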
Recently, heuristics have been proposed for discovering structure in large output spaces that empirically offer some degree of efficiency [11].

As previously mentioned, our work is most closely related to the class of output coding methods for multi-class prediction, which were first introduced and shown to be useful experimentally in [2]. Relative to this work, we expand the scope of the approach to multi-label prediction and provide bounds on regret and error which guide the design of codes. The loss-based decoding approach [12] suggests decoding so as to minimize loss. However, it does not provide significant guidance in the choice of encoding method, or the feedback between encoding and decoding, which we analyze here.

The output coding approach is inconsistent when classifiers are used and the underlying problems being encoded are noisy. This is proved and analyzed in [13], where it is also shown that using a Hadamard code creates a robust consistent predictor when reduced to binary regression. Compared to this method, our approach achieves the same robustness guarantees up to a constant factor, but requires training and evaluating exponentially (in d) fewer predictors.

Our algorithms rely on several methods from compressed sensing, which we detail where used.

2 Preliminaries

Let X be an arbitrary input space and Y ⊂ R^d be a d-dimensional output (label) space. We assume the data source is defined by a fixed but unknown distribution over X × Y. Our goal is to learn a predictor F : X → Y with low expected ℓ2^2-error E_x ‖F(x) − E[y|x]‖_2^2 (the sum of mean-squared-errors over all labels) using a set of n training data {(x_i, y_i)}_{i=1}^n.

We focus on the regime in which the output space is very high-dimensional (d very large), but for any given x ∈ X, the expected value E[y|x] of the corresponding label y ∈ Y has only a few non-zero entries. A vector is k-sparse if it has at most k non-zero entries.

3 Learning and Prediction

3.1 Learning to Predict Compressed Labels

Let A : R^d → R^m be a linear compression function, where m ≤ d (but hopefully m ≪ d). We use A to compress (i.e. reduce the dimension of) the labels Y, and learn a predictor H : X → A(Y) of these compressed labels. Since A is linear, we simply represent A ∈ R^{m×d} as a matrix.

Specifically, given a sample {(x_i, y_i)}_{i=1}^n, we form a compressed sample {(x_i, Ay_i)}_{i=1}^n and then learn a predictor H of E[Ay|x] with the objective of minimizing the ℓ2^2-error E_x ‖H(x) − E[Ay|x]‖_2^2.

3.2 Predicting Sparse Labels

To obtain a predictor F of E[y|x], we compose the predictor H of E[Ay|x] (learned using the compressed sample) with a reconstruction algorithm R : R^m → R^d. The algorithm R maps predictions of compressed labels h ∈ R^m to predictions of labels y ∈ Y in the original output space. These algorithms typically aim to find a sparse vector y such that Ay closely approximates h.

Recent developments in the area of compressed sensing have produced a spate of reconstruction algorithms with strong performance guarantees when the compression function A satisfies certain properties. We abstract out the relevant aspects of these guarantees in the following definition.

Definition. An algorithm R is a valid reconstruction algorithm for a family of compression functions (Ak ⊂ ⋃_{m≥1} R^{m×d} : k ∈ N) and sparsity error sperr : N × R^d → R, if there exists a function f : N → N and constants C1, C2 ∈ R such that: on input k ∈ N, A ∈ Ak with m rows, and h ∈ R^m, the algorithm R(k, A, h) returns an f(k)-sparse vector ŷ satisfying

‖ŷ − y‖_2^2 ≤ C1 · ‖h − Ay‖_2^2 + C2 · sperr(k, y)

for all y ∈ R^d. 
The function f is the output sparsity of R and the constants C1 and C2 are the regret factors.

Informally, if the predicted compressed label H(x) is close to E[Ay|x] = A E[y|x], then the sparse vector ŷ returned by the reconstruction algorithm should be close to E[y|x]; this latter distance ‖ŷ − E[y|x]‖_2^2 should degrade gracefully in terms of the accuracy of H(x) and the sparsity of E[y|x]. Moreover, the algorithm should be agnostic about the sparsity of E[y|x] (and thus the sparsity error sperr(k, E[y|x])), as well as the "measurement noise" (the prediction error ‖H(x) − E[Ay|x]‖_2). This is a subtle condition that precludes certain reconstruction algorithms (e.g. Basis Pursuit [14]) which require the user to supply a bound on the measurement noise. However, the condition is needed in our application, as such bounds on the prediction error (for each x) are not generally known beforehand.

We make a few additional remarks on the definition.

1. The minimum number of rows of matrices A ∈ Ak may in general depend on k (as well as the ambient dimension d). In the next section, we show how to construct such A with close to the optimal number of rows.

2. The sparsity error sperr(k, y) should measure how poorly y ∈ R^d is approximated by a k-sparse vector.

3. A reasonable output sparsity f(k) for sparsity level k should not be much more than k, e.g. f(k) = O(k).

Concrete examples of valid reconstruction algorithms (along with the associated Ak, sperr, etc.) are given in the next section.

4 Algorithms

Our prescribed recipe is summarized in Algorithms 1 and 2. We give some examples of compression functions and reconstruction algorithms in the following subsections.

Algorithm 1 Training algorithm
parameters sparsity level k, compression function A ∈ Ak with m rows, regression learning algorithm L
input training data S ⊂ X × R^d
  for i = 1, . . . , m do
    hi ← L({(x, (Ay)i) : (x, y) ∈ S})
  end for
output regressors H = [h1, . . . , hm]

Algorithm 2 Prediction algorithm
parameters sparsity level k, compression function A ∈ Ak with m rows, valid reconstruction algorithm R for Ak
input regressors H = [h1, . . . , hm], test point x ∈ X
output ŷ = R(k, A, [h1(x), . . . , hm(x)])

Figure 1: Training and prediction algorithms.

4.1 Compression Functions

Several valid reconstruction algorithms are known for compression matrices that satisfy a restricted isometry property.

Definition. A matrix A ∈ R^{m×d} satisfies the (k, δ)-restricted isometry property ((k, δ)-RIP), δ ∈ (0, 1), if (1 − δ)‖x‖_2^2 ≤ ‖Ax‖_2^2 ≤ (1 + δ)‖x‖_2^2 for all k-sparse x ∈ R^d.

While some explicit constructions of (k, δ)-RIP matrices are known (e.g. [15]), the best guarantees are obtained when the matrix is chosen randomly from an appropriate distribution, such as one of the following [16, 17].

• All entries i.i.d. Gaussian N(0, 1/m), with m = O(k log(d/k)).
• All entries i.i.d. Bernoulli B(1/2) over {±1/√m}, with m = O(k log(d/k)).
• m randomly chosen rows of the d × d Hadamard matrix over {±1/√m}, with m = O(k log^5 d).

The hidden constants in the big-O notation depend inversely on δ and the probability of failure.

A striking feature of these constructions is the very mild dependence of m on the ambient dimension d. This translates to a significant savings in the number of learning problems one has to solve after employing our reduction.

Some reconstruction algorithms require a stronger guarantee of bounded coherence µ(A) ≤ O(1/k), where µ(A) is defined as

µ(A) = max_{1≤i<j≤d} |⟨Ai, Aj⟩|

for columns Ai of A.

[...] K columns, it is better to choose correlated columns to avoid over-fitting. 
Both OMP and FoBa explicitly avoid this and thus do not fare well; but CoSaMP, Lasso, and CD do allow selecting correlated columns and thus perform better in this regime.

The results for precision-at-k are similar to those for mean-squared-error, except that choosing correlated columns does not necessarily help in the small-m regime. This is because the extra correlated columns need not correspond to accurate label coordinates.

In summary, the experiments demonstrate the feasibility and robustness of our reduction method for two natural multi-label prediction tasks. They show that predictions of relatively few compressed labels are sufficient to recover an accurate sparse label vector, and, as our theory suggests, the robustness of the reconstruction algorithms is a key factor in their success.

Acknowledgments

We thank Andy Cotter for help processing the image features for the ESP Game data. This work was completed while the first author was an intern at TTI-C in 2008.

References

[1] David Donoho. Compressed sensing. IEEE Trans. Info. Theory, 52(4):1289–1306, 2006.
[2] T. Dietterich and G. Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2:263–286, 1995.
[3] R. Rifkin and A. Klautau. In defense of one-vs-all classification. Journal of Machine Learning Research, 5:101–141, 2004.
[4] M. Boutell, J. Luo, X. Shen, and C. Brown. Learning multi-label scene classification. Pattern Recognition, 37(9):1757–1771, 2004.
[5] A. Clare and R. D. King. Knowledge discovery in multi-label phenotype data. In European Conference on Principles of Data Mining and Knowledge Discovery, 2001.
[6] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In NIPS, 2003.
[7] N. Cesa-Bianchi, C. Gentile, and L. 
Zaniboni. Incremental algorithms for hierarchical classification. Journal of Machine Learning Research, 7:31–54, 2006.
[8] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector machine learning for interdependent and structured output spaces. In ICML, 2004.
[9] J. Rousu, C. Saunders, S. Szedmak, and J. Shawe-Taylor. Kernel-based learning of hierarchical multilabel classification models. Journal of Machine Learning Research, 7:1601–1626, 2006.
[10] J. Huang, T. Zhang, and D. Metaxas. Learning with structured sparsity. In ICML, 2009.
[11] G. Tsoumakas, I. Katakis, and I. Vlahavas. Effective and efficient multilabel classification in domains with large number of labels. In Proc. ECML/PKDD 2008 Workshop on Mining Multidimensional Data, 2008.
[12] Erin Allwein, Robert Schapire, and Yoram Singer. Reducing multiclass to binary: A unifying approach for margin classifiers. Journal of Machine Learning Research, 1:113–141, 2000.
[13] J. Langford and A. Beygelzimer. Sensitive error correcting output codes. In Proc. Conference on Learning Theory, 2005.
[14] Emmanuel Candès, Justin Romberg, and Terence Tao. Stable signal recovery from incomplete and inaccurate measurements. Comm. Pure Appl. Math., 59:1207–1223, 2006.
[15] R. DeVore. Deterministic constructions of compressed sensing matrices. J. of Complexity, 23:918–925, 2007.
[16] Shahar Mendelson, Alain Pajor, and Nicole Tomczak-Jaegermann. Uniform uncertainty principle for Bernoulli and subgaussian ensembles. Constructive Approximation, 28(3):277–289, 2008.
[17] M. Rudelson and R. Vershynin. Sparse reconstruction by convex relaxation: Fourier and Gaussian measurements. In Proc. Conference on Information Sciences and Systems, 2006.
[18] S. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. 
IEEE Transactions on Signal Processing, 41(12):3397–3415, 1993.
[19] Tong Zhang. Adaptive forward-backward greedy algorithm for sparse learning with linear models. In Proc. Neural Information Processing Systems, 2008.
[20] D. Needell and J. A. Tropp. CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis, 2007.
[21] Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle regression. Annals of Statistics, 32(2):407–499, 2004.
[22] Sham M. Kakade, Karthik Sridharan, and Ambuj Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Proc. Neural Information Processing Systems, 2008.
[23] Andrew Ng. Feature selection, L1 vs. L2 regularization, and rotational invariance. In ICML, 2004.
[24] Luis von Ahn and Laura Dabbish. Labeling images with a computer game. In Proc. ACM Conference on Human Factors in Computing Systems, 2004.
[25] Marcin Marszałek, Cordelia Schmid, Hedi Harzallah, and Joost van de Weijer. Learning object representations for visual object class recognition. In Visual Recognition Challenge Workshop, in conjunction with ICCV, 2007.
[26] Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc Van Gool. SURF: Speeded up robust features. Computer Vision and Image Understanding, 110(3):346–359, 2008.
[27] David Donoho, Michael Elad, and Vladimir Temlyakov. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Trans. Info. Theory, 52(1):6–18, 2006.
[28] Sanjoy Dasgupta. Learning Probability Distributions. 
PhD thesis, University of California, 2000.", "award": [], "sourceid": 3824, "authors": [{"given_name": "Daniel", "family_name": "Hsu", "institution": ""}, {"given_name": "Sham", "family_name": "Kakade", "institution": ""}, {"given_name": "John", "family_name": "Langford", "institution": ""}, {"given_name": "Tong", "family_name": "Zhang", "institution": ""}]}