{"title": "Learning as MAP Inference in Discrete Graphical Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1970, "page_last": 1978, "abstract": "We present a new formulation for attacking binary classification problems. Instead of relying on convex losses and regularisers such as in SVMs, logistic regression and boosting, or instead non-convex but continuous formulations such as those encountered in neural networks and deep belief networks, our framework entails a non-convex but \\emph{discrete} formulation, where estimation amounts to finding a MAP configuration in a graphical model whose potential functions are low-dimensional discrete surrogates for the misclassification loss. We argue that such a discrete formulation can naturally account for a number of issues that are typically encountered in either the convex or the continuous non-convex paradigms, or both. By reducing the learning problem to a MAP inference problem, we can immediately translate the guarantees available for many inference settings to the learning problem itself. We empirically demonstrate in a number of experiments that this approach is promising in dealing with issues such as severe label noise, while still having global optimality guarantees. Due to the discrete nature of the formulation, it also allows for \\emph{direct} regularisation through cardinality-based penalties, such as the $\\ell_0$ pseudo-norm, thus providing the ability to perform feature selection and trade-off interpretability and predictability in a principled manner. We also outline a number of open problems arising from the formulation.", "full_text": "Learning as MAP Inference in Discrete\n\nGraphical Models\n\nXianghang Liu\nNICTA/UNSW\nSydney, Australia\n\nJames Petterson\n\nNICTA/ANU\n\nCanberra, Australia\n\nxianghang.liu@nicta.com.au\n\njames.petterson@nicta.com.au\n\nTiberio S. 
Caetano\n\nNICTA/ANU/University of Sydney\n\nCanberra and Sydney, Australia\ntiberio.caetano@nicta.com.au\n\nAbstract\n\nWe present a new formulation for binary classification. Instead of relying on convex losses and regularizers such as in SVMs, logistic regression and boosting, or on non-convex but continuous formulations such as those encountered in neural networks and deep belief networks, our framework entails a non-convex but discrete formulation, where estimation amounts to finding a MAP configuration in a graphical model whose potential functions are low-dimensional discrete surrogates for the misclassification loss. We argue that such a discrete formulation can naturally account for a number of issues that are typically encountered in either the convex or the continuous non-convex approaches, or both. By reducing the learning problem to a MAP inference problem, we can immediately translate the guarantees available for many inference settings to the learning problem itself. We demonstrate empirically in a number of experiments that this approach is promising in dealing with issues such as severe label noise, while still having global optimality guarantees. Due to the discrete nature of the formulation, it also allows for direct regularization through cardinality-based penalties, such as the ℓ0 pseudo-norm, thus providing the ability to perform feature selection and trade off interpretability and predictability in a principled manner. We also outline a number of open problems arising from the formulation.\n\n1 Introduction\n\nA large fraction of the machine learning community concerns itself with the formulation of a learning problem as a single, well-defined optimization problem. This is the case for many popular techniques, including those associated with margin- or likelihood-based estimators, such as SVMs, logistic regression, boosting, CRFs and deep belief networks. 
Among these optimization-based frameworks for learning, two paradigms stand out: one based on convex formulations (such as SVMs) and one based on non-convex formulations (such as deep belief networks). The main argument in favor of convex formulations is that we can effectively decouple modeling from optimization, which has substantial theoretical and practical benefits. In particular, it is of great value in terms of reproducibility, modularity and ease of use. Coming from the other end, the main argument for non-convexity is that a convex formulation very often fails to capture fundamental properties of a real problem (e.g. see [1, 2] for examples of some fundamental limitations of convex loss functions).\n\nThe motivation for this paper starts from the observation that the above tension is not really between convexity and non-convexity, but between convexity and continuous non-convexity. Historically, the optimization-based approach to machine learning has been virtually a synonym of continuous optimization. Estimation in continuous parameter spaces in some cases allows for closed-form solutions (such as in least-squares regression); if not, we can resort to computing gradients (for smooth continuous functions) or subgradients (for non-smooth continuous functions), which give us a generic tool for finding a local optimum of an arbitrary continuous function (a global optimum if the function is convex). By contrast, unless P = NP, there is no general tool to efficiently optimize discrete functions. We suspect this is one of the reasons why machine learning has traditionally been formulated in terms of continuous optimization: it is indeed convenient to compute gradients or subgradients and delegate optimization to some off-the-shelf gradient-based algorithm.\n\nThe formulation we introduce in this paper is non-convex, but discrete rather than continuous. 
By being non-convex we attempt to capture some of the expressive power of continuous non-convex formulations (such as robustness to labeling noise), and by being discrete we retain the ability of convex formulations to provide theoretical guarantees in optimization. There are highly non-trivial classes of non-convex discrete functions, defined over exponentially large discrete spaces, which can be optimized efficiently. This is, after all, the main topic of combinatorial optimization. Discrete functions factored over cliques of low-treewidth graphs can be optimized efficiently via dynamic programming [3]. Arbitrary submodular functions can be minimized in polynomial time [4]. Particular submodular functions can be optimized very efficiently using max-flow algorithms [5]. Discrete functions defined over other particular classes of graphs also have polynomial-time algorithms (planar graphs [6], perfect graphs [7]). And although many discrete optimization problems are NP-hard, several have efficient constant-factor approximations [8]. In addition, much progress has been made recently on developing tight LP relaxations for hard combinatorial problems [9]. Although all these discrete approaches have been widely used for solving inference problems in machine learning settings, we argue in this paper that they should also be used to solve estimation problems, i.e., learning per se.\n\nThe discrete approach does pose several new questions, though, which we list at the end. Our contribution is to outline the overall framework in terms of a few key ideas and assumptions, as well as to empirically evaluate, on real-world datasets, particular model instances within the framework. 
Although these instances are very simple, they already display important desirable behavior that is missing in state-of-the-art estimators such as SVMs.\n\n2 Desiderata\n\nWe want to rethink the problem of learning a linear binary classifier. In this section we list the features that we would like a general-purpose learning machine for this problem to possess. These features essentially guide the assumptions behind our framework.\n\nOption to decouple modeling from optimization: As discussed in the introduction, this is the great appeal of convex formulations, and we would like to retain it. Note, however, that we want the option of decoupling modeling from optimization, not a mandate to always do so. We want to be able to please the user who is not an optimization expert, or who does not have the time or resources to refine the optimizer, by offering the option of requesting that the learning machine configure itself in a mode in which global optimization is guaranteed and the runtime of optimization is precisely predictable. However, we also want to please the user who is an expert and is willing to spend a lot of time refining the optimizer to achieve the best possible results regardless of training time considerations. In our framework, we have the option to explore the spectrum between simpler models, for which we can generate precise estimates of the runtime of the whole algorithm, and more complex models, for which we can focus on boosted performance at the expense of runtime predictability or of demanding expert fine-tuning skills.\n\nOption of simplicity: This point is related to the previous one, but it is more general. The complexity of a learning algorithm is a great barrier to its dissemination, even if it promises exceptional results once properly implemented. 
Most users of machine learning are not machine learning experts themselves, and for them in particular the cost of getting a complex algorithm to work often outweighs the accuracy gains, especially if a reasonably good solution can be obtained with a very simple algorithm. For instance, in our framework the user has the option of reducing the learning algorithm to a series of matrix multiplications and lookup operations, while having a precise estimate of the total runtime of the algorithm and retaining good performance.\n\nRobustness to label noise: SVMs are considered state-of-the-art estimators for binary classifiers, as are boosting and logistic regression. All of these optimize convex loss functions. However, when label noise is present, convex loss functions inflict arbitrarily large penalties on misclassifications because they are unbounded. In other words, in high label noise settings these convex loss functions become poor proxies for the 0/1 loss (the loss we really care about). This fundamental limitation of convex loss functions is well understood theoretically [1]. The fact that the loss function of interest is itself discrete is indeed a hint that maybe we should investigate discrete rather than continuous surrogates for the 0/1 loss: optimizing discrete functions over continuous spaces is hard, but not necessarily over discrete spaces. In our framework we directly address this issue.\n\nAbility to achieve sparsity: Often we need to estimate sparse models. This can be for several reasons, including interpretability (being able to tell which are the ‘most important’ features), efficiency (at prediction time we can only afford to use a limited number of features) or, importantly, purely statistical reasons (constraining the solution to low-dimensional subspaces has a regularization effect). 
The standard convex approach uses ℓ1 regularization. However, the assumptions required for ℓ1-regularized models to actually be good proxies for the support cardinality function (the ℓ0 pseudo-norm) are very strong and in practice rarely met [10]. In fact, this has motivated an entire new line of work on structured sparsity, which tries to further regularize the solution so as to obtain better statistical properties in high dimensions [11, 12, 13]. This, however, comes at the price of more expensive optimization algorithms. Ideally we would like to regularize with ℓ0 directly; maybe this suggests the possibility of exploring an inherently discrete formulation? In our approach we have the ability to perform direct regularization via the ℓ0 pseudo-norm, or other scale-invariant regularizers.\n\nLeverage the power of low-dimensional approximations: Machine learning folklore has it that the Naive Bayes assumption (features conditionally independent given the class label) often produces remarkably good classifiers. So a natural question is: is it really necessary to work directly in the original high-dimensional space, as SVMs do? A key aspect of our framework is that we explicitly exploit the concept of composing a high-dimensional model from low-dimensional pieces. However, we go beyond the Naive Bayes assumption by constructing graphs that model dependencies between variables. By varying the properties of these graphs we can trade off model complexity and optimization efficiency in a straightforward manner.\n\n3 Basic Setting\n\nMuch of current machine learning research studies estimators of the type\n\nargmin_{θ∈Θ} Σ_n ℓ(y_n, f(x_n; θ)) + λ Ω(θ)    (1)\n\nwhere {x_n, y_n} is a training set of inputs x ∈ X and outputs y ∈ Y, assumed sampled independently from an unknown probability measure P on X × Y. 
f : X → Y is a member of a given class of predictors parameterized by θ, Θ is a continuous space such as a Hilbert space, and ℓ as well as Ω are continuous and convex functions of θ. ℓ is a loss function which enforces a penalty whenever f(x_n) ≠ y_n, and therefore the first term in (1) measures the total loss incurred by predictor f on the training sample {x_n, y_n} under parameterization θ. Ω controls the complexity of θ so as to avoid overfitting, and λ trades off the importance of a good fit to the training set versus model parsimony, so that good generalization is hopefully achieved.\n\nProblem (1) is often called regularized empirical risk minimization, since the first term is the risk (expected loss) under the empirical distribution of the training data and the second is a regularizer. This formulation is used for regression (Y continuous) as well as classification and structured prediction (Y discrete). Logistic regression, regularized least-squares regression, SVMs, CRFs, structured SVMs, the Lasso, the Group Lasso and a variety of other estimators are all instances of (1) for particular choices of ℓ, f, Θ and Ω. The formulation in (1) is a very general formulation for machine learning under the i.i.d. assumption.\n\nIn this paper we study problem (1) under the assumption that the parameter space Θ is discrete and finite, focusing on binary classification, where Y = {−1, 1}.\n\n4 Formulation\n\nOur formulation departs from the one in (1) in two ways. 
The first assumption is that both the loss ℓ and the regularizer Ω are additive over low-dimensional functions defined by a graph G = (V, E), i.e.,\n\nℓ(y, f(x; θ)) = Σ_{c∈C} ℓ_c(y, f_c(x; θ_c))    (2)\n\nΩ(θ) = Σ_{c∈C′} Ω_c(θ_c)    (3)\n\nwhere C ∪ C′ is the set of maximal cliques in G. Note that (3) is standard: the ℓ1 and ℓ2 norms, for example, are both additive over singletons (in which case C′ = V). The arguably strong assumption here is (2). C is the set of parts, where each part c is, in principle, an arbitrary subset of {1, . . . , D}, where D is the dimensionality of the parameterization, i.e., θ = (θ_1, . . . , θ_D). ℓ_c is a low-dimensional discrete surrogate for ℓ, and f_c is a low-dimensional predictor, both to be defined below. Note that in general two parameter subvectors θ_{c_i} and θ_{c_j} are not independent, since the cliques c_i and c_j can overlap. Indeed, one of the key reasons sustaining the power of this formulation is that all the θ_c are coupled, either directly or indirectly, through the connected graph G = (V, E).\n\nThe second assumption is that Θ is discrete, and therefore the vector θ = (θ_1, . . . , θ_D) is discrete in the sense that each θ_i is only allowed to take on finitely many values, including the value 0 (this will be important when we discuss regularization). For simplicity of exposition, let us assume that the number of discrete values (bins) for each θ_i is the same: B. B can potentially be quite large; for example, it can be in the hundreds.\n\nRandom Projections. 
An instance x above is in reality not the raw feature vector but a random projection of it into a space of the same or higher dimension; i.e., we effectively apply X = RX′, where X′ is the original data matrix, R is a random matrix with entries drawn from N(0, 1), and X is the new data matrix. This often improves the performance of our model due to the spreading of higher-order dependencies over lower-order cliques (when mapping to a higher-dimensional space), and it is also motivated by a theoretical argument (section 6). In what follows, x is the feature vector after the projection.\n\nLow-Dimensional Predictor. We assume a standard linear predictor of the kind\n\nf_c(x; θ) = argmax_{y∈{−1,1}} y ⟨x_c, θ_c⟩ = sign ⟨x_c, θ_c⟩    (4)\n\nIn other words, we have a linear classifier that only considers the features in clique c.¹\n\nLow-Dimensional Discrete Surrogates for the 0/1 Loss. The low-dimensional discrete surrogate for the 0/1 loss is simply defined as the 0/1 loss incurred by predictor f_c:\n\nℓ_c(y; f_c(x; θ)) = (1 − y f_c(x; θ))/2    (5)\n\nA key observation now is that f_c, and therefore ℓ_c, can be computed in O(B^k) time by full enumeration over the B^k instantiations of θ_c, where k is the size of clique c. In other words, the 0/1 loss constrained to the discretized subspace defined by clique c can be computed exactly and efficiently (for small cliques).\n\nRegularization. One critical technical issue is that linear predictors of the kind argmax_y ⟨φ(x, y), θ⟩ are insensitive to scalings of θ [14]. Therefore, the loss ℓ will be such that ℓ(y, f(x; αθ)) = ℓ(y, f(x; θ)) for α ≠ 0. 
This means that any regularizer that depends on scale (such as the ℓ1 and ℓ2 norms) is effectively meaningless, since the minimization in (1) will drive Ω(θ) to 0 (as this does not affect the loss). In other words, in such a discrete setting we need a scale-invariant regularizer, such as the ℓ0 pseudo-norm. Note that ℓ0 is trivial to implement in this formulation, as we have enforced that the zero value must be included in the set of B values attainable by each θ_i:\n\nΩ(θ) = ℓ0(θ) = Σ_i 1[θ_i ≠ 0]    (6)\n\nIn addition, since this regularizer is additive over singletons θ_i, we get for free the fact that it does not contribute to the complexity of inference in the graphical model (i.e., it is a unary potential), which is a convenient property. Nothing prevents us, however, from having group regularizers, for example of the form Σ_{c∈C′} λ_c 1[θ_c ≠ 0]. Again, we can trade off model simplicity and optimization efficiency by controlling the size of the maximal clique in C′.\n\n¹ For notational simplicity we assume an offset parameter is already included in θ_c and a corresponding entry of 1 is appended to the vector x_c.\n\nFinal Optimization Problem. After compiling the low-dimensional discrete proxies for the 0/1 loss (the functions ℓ_c) and incorporating our regularizer, we can assemble the following optimization problem:\n\nargmin_{θ∈Θ} Σ_{c∈C} Σ_{n=1}^{N} ℓ_c(y_n, f_c(x_n; θ_c)) + Σ_{i=1}^{D} λ 1[θ_i ≠ 0]    (7)\n\nwhere we define the clique potentials ψ_c and unary potentials φ_i via Σ_{n=1}^{N} ℓ_c(y_n, f_c(x_n; θ_c)) =: −N ψ_c(θ_c) and λ 1[θ_i ≠ 0] =: −λ φ_i(θ_i). Problem (7) is a relaxation of (1) under all the above assumptions. 
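For concreteness, the tables entering (7) can be precomputed exactly as described: one brute-force sweep over the B^k discrete assignments of each θ_c for the clique losses, plus a unary ℓ0 table per variable. Below is a minimal Python/NumPy sketch; the array layout and function names are our own illustration, not from the paper:

```python
import itertools
import numpy as np

def clique_loss_table(X, y, clique, bins):
    """Tabulate sum_n l_c(y_n, f_c(x_n; theta_c)) of eqs. (4)-(5) by full
    enumeration over the B^k discrete assignments of theta_c (O(B^k) per clique).
    `bins` is the shared grid of B parameter values, which must contain 0."""
    k = len(clique)
    bins = np.asarray(bins, dtype=float)
    B = len(bins)
    Xc = X[:, clique]                              # features restricted to clique c
    table = np.empty([B] * k)
    for assign in itertools.product(range(B), repeat=k):
        theta_c = bins[list(assign)]
        pred = np.where(Xc @ theta_c >= 0, 1, -1)  # f_c(x; theta_c) = sign<x_c, theta_c>
        table[assign] = np.sum(pred != y)          # summed 0/1 loss of eq. (5)
    return table

def l0_unary(bins, lam):
    """Unary l0 penalty lam * 1[theta_i != 0] of eq. (6), one value per bin."""
    return lam * (np.asarray(bins) != 0).astype(float)
```

A MAP solver for the chosen graph then simply minimizes the sum of these clique and unary tables over the joint discrete assignment of θ.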
The critical observation now is that (7) is a MAP inference problem in a discrete graphical model with clique set C, high-order clique potentials ψ_c(θ_c) and unary potentials φ_i(θ_i) [15]. Therefore we can resort to the vast literature on inference in graphical models to find exact or approximate solutions to (7). For example, if G = (V, E) is a tree, then (7) can be solved exactly and efficiently using a dynamic programming algorithm that only requires matrix-vector multiplications in the (min, +) semiring, in addition to elementary lookup operations [3]. For more general graphs, problem (7) can become NP-hard, but even in that case there are several principled approaches that often find excellent solutions, such as those based on linear programming relaxations [9] for tightly outer-bounding the marginal polytope [16]. In the experimental section we explore several options for constructing G, from simply generating a random chain (where MAP inference can be solved efficiently by dynamic programming) to generating dense random graphs (where MAP inference requires a more sophisticated approach, such as an LP relaxation).\n\n5 Related Work\n\nThe most closely related work we found is a recent paper by Potetz [17]. In a similar spirit to our approach, it also addresses the problem of estimating linear binary classifiers in a discrete formulation. However, instead of composing low-dimensional discrete surrogates of the 0/1 loss as we do, it uses a fully connected factor graph and performs inference by estimating the mean of the max-marginals rather than the MAP. Inference is approached using message passing, which for the fully connected graph reduces to an intractable knapsack problem. In order to obtain a tractable model, the problem is then relaxed to a linear multiple choice knapsack problem, which can be solved efficiently. 
All the experiments, though, are performed on very low-dimensional datasets² and it is unclear how this approach would scale to high dimensionality while keeping a fully connected graph.\n\n6 Analysis\n\nHere we sketch arguments supporting the assumptions driving our formulation. Obtaining a rigorous theoretical analysis is left as an open problem for future research. Our assumptions involve three approximations of the problem of 0/1 loss minimization. First, the discretization of the parameter space. Second, the computation of low-dimensional proxies for the 0/1 loss rather than attacking the 0/1 loss directly in the resulting discrete space. Finally, the use of a graph G = (V, E) which in general will be sparse, i.e., not fully connected. We now discuss each of these assumptions.\n\n² Seven datasets with dimensionalities 7, 9, 10, 11, 14, 15 and 61. See [17].\n\n6.1 Discretization of the parameter space\n\nThe explicit enforcement of a finite number of possible values for each parameter may at first seem a strong assumption. However, a key observation here is that we are restricting ourselves to linear predictors, which basically means that, for any sample, small perturbations of a random hyperplane will with high probability induce at most small changes in the 0/1 loss. Therefore there are good reasons to believe that, for linear predictors, increasing the number of bins has diminishing returns, and after only a moderate number of bins not much improvement can be obtained. This assumption is also used in [17].\n\n6.2 Low-dimensional proxies for the 0/1 loss\n\nThis assumption can be justified using recent results stating that the margin is well preserved under random projections to low-dimensional subspaces [18, 19]. 
For instance, Theorem 6 in [19] shows that the margin is preserved with high probability for embeddings whose dimension is only logarithmic in the sample size (a result similar in spirit to the Johnson-Lindenstrauss Lemma [20]). Since the (soft) margin upper bounds the 0/1 loss, the latter should also be preserved with at least equivalent guarantees.\n\n6.3 Graph sparsity\n\nThis is apparently the strongest assumption. In our formulation, we impose conditional independence assumptions on the set of random variables used as features. There are two main observations. The first is that in real high-dimensional data the existence of (approximate) conditional independences is more the rule than the exception. This is directly related to the fact that high-dimensional data usually inhabit low-dimensional manifolds or subspaces. In our case, we have a graph whose nodes represent different features, and this can be seen as a patching of low-dimensional subspaces, where each subspace is defined by one of the cliques in the graph. We do not address in this work how to optimally determine a subgraph, leaving that as an open problem in this framework. Rather, we show that even with random subgraphs, and in particular with subgraphs as simple as chains, we can obtain models that have high accuracy and remarkable robustness to high degrees of label noise. The second observation is that nothing prevents us from using quite dense graphs and seeking approximate rather than exact MAP inference, say through LP relaxations [9]. Indeed, we illustrate this possibility in the experimental section below.\n\n7 Experiments\n\nSettings. To evaluate our method (DISCRETE) for binary classification problems, we apply it to real-world datasets and compare it to linear Support Vector Machines (SVM), a state-of-the-art estimator for linear classifiers. 
We note that although both use linear predictors, the model classes are not identical: since we use discretization, the set of hyperplanes our estimator optimizes over is strictly smaller. We run these algorithms on publicly available datasets from the UCI machine learning repository [21]. See Table 1 for the details of these datasets. For both algorithms, the only hyperparameter is the trade-off between the loss and the regularization term. We run 5-fold cross-validation for both methods to select the optimal hyperparameters. The number of bins used for discretization may affect the accuracy of DISCRETE. For the experiments, we fix it to 11, since for larger values there was negligible improvement (which supports our argument from section 6.1).\n\nRobustness to Label Noise. In the first experiment, we test the robustness of the different methods to increasing label noise. We first flip the labels of the training data with probability increasing from 0 to 0.4 and then run the algorithms on the noisy training data. The plots of the classification accuracy at each noise level are shown in Figure 1. For DISCRETE, we used as the graph G a random chain, i.e., the simplest possible option for a connected graph. In this case, optimization is straightforward via a Viterbi algorithm: a sequence of matrix-vector multiplications in the (min, +) semiring with trivial bookkeeping and subsequent lookup, which runs in O(B²D) since we have B states per variable and D variables. To assess the effect of randomization, we run on 20 random chains and plot both the average and the standard error obtained. The impact of randomization seems negligible. 
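The chain optimization just described (min-sum messages over the (min, +) semiring, followed by backtracking) is short enough to sketch. The following is a minimal illustration that assumes the unary and pairwise loss tables have already been built as arrays; the layout and names are our own, not from the paper:

```python
import numpy as np

def map_chain(unary, pairwise):
    """Min-sum (Viterbi) dynamic programming on a chain of D nodes with B
    states each. Each forward message is one O(B^2) matrix-vector step in
    the (min, +) semiring, giving O(B^2 D) overall.
    unary: list of D arrays of shape (B,); pairwise: list of D-1 arrays (B, B)."""
    D = len(unary)
    msg = np.zeros_like(unary[0])
    back = []
    for i in range(D - 1):
        # (min, +) "product": add incoming costs, then minimize over the sender state
        scores = (unary[i] + msg)[:, None] + pairwise[i]
        back.append(np.argmin(scores, axis=0))    # bookkeeping for backtracking
        msg = np.min(scores, axis=0)
    # pick the best final state, then follow the backpointers
    states = [int(np.argmin(unary[-1] + msg))]
    for i in range(D - 2, -1, -1):
        states.append(int(back[i][states[-1]]))
    return states[::-1]                           # MAP bin index per variable
```

The returned indices select one bin per parameter θ_i, i.e., a joint MAP assignment, with a fixed and predictable number of operations.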
From Figure 1, DISCRETE demonstrates classification accuracy only slightly inferior to SVM in the noiseless regime (i.e., when the hinge loss is a good proxy for the 0/1 loss). However, as soon as a significant amount of label noise is present, SVM degrades substantially while DISCRETE remains remarkably stable, delivering high accuracy even after flipping labels with 40% probability. We believe these are significant results given the truly elementary nature of the optimization procedure: the method is simple, fast, and the runtime can be predicted with high accuracy since there is a fixed number of operations; 2(D − 1) messages are passed, each with a worst-case runtime of O(B²) determined by the matrix-vector multiplication. Note in particular how this differs from continuous optimization settings, in which the analysis is in terms of rate of convergence rather than the precise number of discrete operations performed. It is also interesting to observe that for different values of the cross-validation parameter our algorithm runs in precisely the same amount of time, while for SVMs convergence will be much slower for small scalings of the regularizer, since the relative importance of the non-differentiable hinge loss over the strongly convex quadratic term increases. This experiment shows that even with the simplest setting of our formulation (random chains, which come with very fast and exact MAP inference) we can still obtain results that are close or similar to those obtained by the state-of-the-art linear SVM classifier in the noiseless case, and superior for high levels of label noise.\n\nFigure 1: Comparison of the Discrete Method and Linear SVM. Panels: (a) GISETTE, (b) MNIST 5 vs 6, (c) A2A, (d) USPS 8 vs 9, (e) ISOLET, (f) ACOUSTIC.\n\nEvaluation without Noise. As seen in Figure 1, in the noiseless (or small-noise) regime SVM is often slightly superior to our random chain model. 
A natural question to ask is therefore how more complex graph topologies would perform. Here we run experiments on two other types of graphs: a random 2-chain (i.e., a random junction tree with cliques {i, i + 1, i + 2}) and a random k-regular graph, where k is set so that the resulting graph has 10% of the possible edges. For the 2-chain, the optimization algorithm is exact inference via (min, +) message passing, just as in the Viterbi algorithm, but now applied to a larger clique, which increases the memory and runtime cost by a factor of O(B). For the random graph, we obtain a more complex topology in which exact inference is intractable. In our experiments we used the approximate inference algorithm from [22], which solves optimally and efficiently an LP relaxation via the alternating direction method of multipliers (ADMM) [23].\n\nTable 1: Datasets used for the experiments in Figure 1\n\n            GISETTE  MNIST  A2A    USPS  ISOLET  ACOUSTIC\n# Train     6000     10205  2265   950   480     19705\n# Test      1000     1134   30296  237   120     78823\n# Features  5000     784    123    256   617     50\n\nTable 2: Accuracies (%) of different methods for binary classification, without label noise. In this setting, the hinge loss used by SVM is an excellent proxy for the 0/1 loss. Yet, the proposed variants (top 3 rows) are still competitive on most datasets.\n\n                GISETTE  MNIST  A2A    USPS   ISOLET  ACOUSTIC\nrandom chain    89.23    93.79  76.01  82.55  97.51   100\nrandom 2-chain  89       94.47  76.55  82.65  97.78   100\nrandom graph    88.6     94.89  74.80  83.17  97.44   100\nSVM             97.7     96.47  76.01  83.88  98.4    100\n\n8 Extensions and Open Problems\n\nClearly, the results in this paper are only a first step in the direction proposed. Several questions arise from this formulation.\n\nTheory. In section 6 we only sketched the reasons why we pursued the assumptions laid out in this paper. 
We did not present any rigorous quantitative arguments analyzing the limitations of our formulation. This is left as an open problem. However, we believe section 6 does point to the key ideas that will ultimately underlie a quantitative theory.\n\nExtension to multi-class and structured prediction. In this work we only study binary classification problems. The extension to multi-class and structured prediction, as well as to other learning settings, is an open problem.\n\nAdaptive binning. When discretizing the parameters, we used a fixed number of bins. This could be made more elaborate through the use of adaptive binning techniques that depend on the information content of each variable.\n\nInformative graph construction. We only explored randomly generated graphs. The problem of selecting a graph topology in an informative way is highly relevant and is left open. For example, B-matching can be used to generate an informative regular graph [24]. This problem is essentially a manifold learning problem, and there are several ways it could be approached. Existing work on supervised manifold learning is very relevant here.\n\nNonparametric extension. We considered only linear parametric models. It would be interesting to consider nonparametric models, where the discretization occurs at the level of parameters associated with each training instance (as in the dual formulation of SVMs).\n\n9 Conclusion\n\nWe presented a discrete formulation for learning linear binary classifiers. Parameters associated with features of the linear model are discretized into bins, and low-dimensional discrete surrogates of the 0/1 loss restricted to small groups of features are constructed. This results in a data structure that can be seen as a graphical model, where regularized risk minimization can be performed via MAP inference. 
We sketched theoretical arguments supporting the assumptions underlying our proposal, and presented empirical evidence that very simple, easily and quickly trainable models estimated with such a procedure can deliver results that are often comparable to those obtained by linear SVMs in noiseless scenarios, and superior under moderate to severe label noise.

Acknowledgements

We thank E. Bonilla, A. Defazio, D. García-García, S. Gould, J. McAuley, S. Nowozin, M. Reid, S. Sanner and B. Williamson for discussions. NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council through the ICT Centre of Excellence program.

References

[1] P. M. Long and R. A. Servedio, “Random classification noise defeats all convex potential boosters,” Machine Learning, vol. 78, no. 3, pp. 287–304, 2010.

[2] P. M. Long and R. A. Servedio, “Learning large-margin halfspaces with more malicious noise,” in NIPS, 2011.

[3] S. M. Aji and R. J. McEliece, “The generalized distributive law,” IEEE Trans. Inform. Theory, vol. 46, no. 2, pp. 325–343, 2000.

[4] B. Korte and J. Vygen, Combinatorial Optimization: Theory and Algorithms. Springer Publishing Company, Incorporated, 4th ed., 2007.

[5] V. Kolmogorov and R. Zabih, “What energy functions can be minimized via graph cuts?,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, pp. 147–159, 2004.

[6] A. Globerson and T. S. Jaakkola, “Approximate inference using planar graph decomposition,” in Advances in Neural Information Processing Systems 19 (B. Schölkopf, J. Platt, and T. Hoffman, eds.), pp. 473–480, Cambridge, MA: MIT Press, 2007.

[7] T.
Jebara, “Perfect graphs and graphical modeling.” To appear in Tractability, Cambridge University Press, 2012.

[8] V. V. Vazirani, Approximation Algorithms. Springer, 2004.

[9] D. Sontag, Approximate Inference in Graphical Models using LP Relaxations. PhD thesis, Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2010.

[10] P. Zhao and B. Yu, “On model selection consistency of Lasso,” J. Mach. Learn. Res., vol. 7, pp. 2541–2563, Dec. 2006.

[11] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski, “Structured sparsity through convex optimization.” Technical report, HAL 00621245-v2, to appear in Statistical Science, 2012.

[12] J. Huang, T. Zhang, and D. Metaxas, “Learning with structured sparsity,” in Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, (New York, NY, USA), pp. 417–424, ACM, 2009.

[13] F. R. Bach, “Structured sparsity-inducing norms through submodular functions,” in NIPS, pp. 118–126, 2010.

[14] D. McAllester and J. Keshet, “Generalization bounds and consistency for latent structural probit and ramp loss,” in Advances in Neural Information Processing Systems 24 (J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger, eds.), pp. 2205–2212, 2011.

[15] D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

[16] M. J. Wainwright and M. I. Jordan, Graphical Models, Exponential Families, and Variational Inference. Hanover, MA, USA: Now Publishers Inc., 2008.

[17] B. Potetz, “Estimating the Bayes point using linear knapsack problems,” in ICML, pp. 257–264, 2011.

[18] M.-F. Balcan, A. Blum, and S. Vempala, “Kernels as features: On kernels, margins, and low-dimensional mappings,” Machine Learning, vol. 65, no. 1, pp. 79–94, 2006.

[19] Q.
Shi, C. Chen, R. Hill, and A. van den Hengel, “Is margin preserved after random projection?,” in ICML, 2012.

[20] S. Dasgupta and A. Gupta, “An elementary proof of a theorem of Johnson and Lindenstrauss,” Random Struct. Algorithms, vol. 22, pp. 60–65, Jan. 2003.

[21] A. Frank and A. Asuncion, “UCI machine learning repository,” 2010.

[22] O. Meshi and A. Globerson, “An alternating direction method for dual MAP LP relaxation,” in Proceedings of the 2011 European Conference on Machine Learning and Knowledge Discovery in Databases - Volume Part II, ECML PKDD ’11, (Berlin, Heidelberg), pp. 470–483, Springer-Verlag, 2011.

[23] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends in Machine Learning, vol. 3, 2011.

[24] T. Jebara, J. Wang, and S. Chang, “Graph construction and b-matching for semi-supervised learning,” in ICML, 2009.
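To make the inference routine used in the experiments concrete, the sketch below implements exact MAP inference on a chain-structured model via (min, +) message passing, the additive-cost analogue of the Viterbi algorithm referenced in section 7. This is a minimal illustrative sketch, not the authors' code; the function name and the convention that potentials are stored as cost tables (lower is better) over B discretization bins per node are our own assumptions.

```python
import itertools

import numpy as np


def chain_map_min_sum(unary, pairwise):
    """Exact MAP inference on a chain MRF by (min, +) message passing.

    unary: list of n length-B arrays; unary[i][s] is the cost of node i
           taking state (bin) s.
    pairwise: list of n-1 BxB matrices; pairwise[i][s, t] is the cost of
              the edge between node i in state s and node i+1 in state t.
    Returns (minimum total cost, list of argmin states), i.e. a Viterbi
    decoding with additive costs in place of multiplicative probabilities.
    """
    n = len(unary)
    # Forward pass: msg[t] = min cost of any prefix 0..i ending in state t.
    msg = np.asarray(unary[0], dtype=float)
    backptr = []
    for i in range(1, n):
        # cand[s, t] = msg[s] + pairwise[i-1][s, t] + unary[i][t]
        cand = msg[:, None] + pairwise[i - 1] + np.asarray(unary[i])[None, :]
        backptr.append(cand.argmin(axis=0))  # best predecessor per state t
        msg = cand.min(axis=0)
    # Backward pass: follow the back-pointers to recover the configuration.
    states = [int(msg.argmin())]
    for i in range(n - 2, -1, -1):
        states.append(int(backptr[i][states[-1]]))
    states.reverse()
    return float(msg.min()), states
```

The forward pass costs O(nB^2) time and O(nB) memory; moving from pairwise cliques to the 2-chain's triples {i, i+1, i+2} amounts to messages indexed by pairs of states, which is the O(B) blow-up in memory and runtime mentioned in section 7.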