{"title": "Sparse Logistic Regression Learns All Discrete Pairwise Graphical Models", "book": "Advances in Neural Information Processing Systems", "page_first": 8071, "page_last": 8081, "abstract": "We characterize the effectiveness of a classical algorithm for recovering the Markov graph of a general discrete pairwise graphical model from i.i.d. samples. The algorithm is (appropriately regularized) maximum conditional log-likelihood, which involves solving a convex program for each node; for Ising models this is $\\ell_1$-constrained logistic regression, while for more general alphabets an $\\ell_{2,1}$ group-norm constraint needs to be used. We show that this algorithm can recover any arbitrary discrete pairwise graphical model, and also characterize its sample complexity as a function of model width, alphabet size, edge parameter accuracy, and the number of variables. We show that along every one of these axes, it matches or improves on all existing results and algorithms for this problem. Our analysis applies a sharp generalization error bound for logistic regression when the weight vector has an $\\ell_1$ (or $\\ell_{2,1}$) constraint and the sample vector has an $\\ell_{\\infty}$ (or $\\ell_{2, \\infty}$) constraint. We also show that the proposed convex programs can be efficiently solved in $\\tilde{O}(n^2)$ running time (where $n$ is the number of variables) under the same statistical guarantees. We provide experimental results to support our analysis.", "full_text": "Sparse Logistic Regression Learns\n\nAll Discrete Pairwise Graphical Models\n\nShanshan Wu, Sujay Sanghavi, Alexandros G. Dimakis\n\nDepartment of Electrical and Computer Engineering\n\nshanshan@utexas.edu, sanghavi@mail.utexas.edu, dimakis@austin.utexas.edu\n\nUniversity of Texas at Austin\n\nAbstract\n\nWe characterize the effectiveness of a classical algorithm for recovering the Markov\ngraph of a general discrete pairwise graphical model from i.i.d. samples. 
The algorithm is (appropriately regularized) maximum conditional log-likelihood, which involves solving a convex program for each node; for Ising models this is ℓ1-constrained logistic regression, while for more general alphabets an ℓ2,1 group-norm constraint needs to be used. We show that this algorithm can recover any arbitrary discrete pairwise graphical model, and also characterize its sample complexity as a function of model width, alphabet size, edge parameter accuracy, and the number of variables. We show that along every one of these axes, it matches or improves on all existing results and algorithms for this problem. Our analysis applies a sharp generalization error bound for logistic regression when the weight vector has an ℓ1 (or ℓ2,1) constraint and the sample vector has an ℓ∞ (or ℓ2,∞) constraint. We also show that the proposed convex programs can be efficiently solved in Õ(n^2) running time (where n is the number of variables) under the same statistical guarantees. We provide experimental results to support our analysis.\n\n1 Introduction\n\nUndirected graphical models provide a framework for modeling high-dimensional distributions with dependent variables and have many applications, including in computer vision (Choi et al., 2010), bioinformatics (Marbach et al., 2012), and sociology (Eagle et al., 2009). In this paper we characterize the effectiveness of a natural, and already popular, algorithm for the structure learning problem. Structure learning is the task of finding the dependency graph of a Markov random field (MRF) given i.i.d. samples; typically one is also interested in finding estimates for the edge weights as well. We consider the structure learning problem in general (non-binary) discrete pairwise graphical models. These are MRFs where the variables take values in a discrete alphabet, but all interactions are pairwise. 
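To make this model class concrete, here is a minimal brute-force sketch (our own illustration, not from the paper; the model instance is hypothetical, and enumeration is feasible only for tiny n and k) that computes the probabilities of a discrete pairwise graphical model by summing over all configurations:

```python
# Brute-force probabilities of a discrete pairwise graphical model:
# P[Z = z] is proportional to exp(sum_{i<j} W[(i,j)][z_i][z_j] + sum_i theta[i][z_i]).
import itertools
import math

def unnormalized_logp(z, W, theta):
    """Log of the unnormalized probability of configuration z."""
    n = len(z)
    s = sum(theta[i][z[i]] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if (i, j) in W:
                s += W[(i, j)][z[i]][z[j]]
    return s

def pairwise_mrf_probs(n, k, W, theta):
    """Exact probabilities by enumerating all k^n configurations."""
    logps = {z: unnormalized_logp(z, W, theta)
             for z in itertools.product(range(k), repeat=n)}
    Z = sum(math.exp(v) for v in logps.values())  # partition function
    return {z: math.exp(v) / Z for z, v in logps.items()}
```

For example, with two binary variables and the single weight matrix W01 = [[1, -1], [-1, 1]], the agreeing configurations (0,0) and (1,1) receive equal and larger probability than the disagreeing ones.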
This includes the Ising model as a special case (which corresponds to a binary alphabet).\n\nThe natural and popular algorithm we consider is (appropriately regularized) maximum conditional log-likelihood for finding the neighborhood set of any given node. For the Ising model, this becomes ℓ1-constrained logistic regression; more generally, for non-binary graphical models, the regularizer becomes an ℓ2,1 norm. We show that this algorithm can recover all discrete pairwise graphical models, and characterize its sample complexity as a function of the parameters of interest: model width, alphabet size, edge parameter accuracy, and the number of variables. We match or improve the dependence on each of these parameters over all existing results for the general alphabet case when no additional assumptions are made on the model (see Table 1). For the specific case of Ising models, some recent work has better dependence on some parameters (see Table 2 in Appendix A).\n\nWe now describe the related work, and then outline our contributions.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nPaper: Greedy algorithm (Hamilton et al., 2017). Sample complexity (N): O(exp(k^O(d) exp(O(d^2 λ)) / η^O(1)) ln(nk/ρ)). Assumptions: 1. alphabet size k ≥ 2; 2. model width ≤ λ; 3. degree ≤ d; 4. minimum edge weight ≥ η > 0; 5. probability of success ≥ 1 − ρ.\n\nPaper: Sparsitron (Klivans and Meka, 2017). Sample complexity (N): O(λ^2 k^5 exp(14λ) ln(nk/(ρη)) / η^4). Assumptions: 1. alphabet size k ≥ 2; 2. model width ≤ λ; 3. minimum edge weight ≥ η > 0; 4. probability of success ≥ 1 − ρ.\n\nPaper: ℓ2,1-constrained logistic regression (this paper). Sample complexity (N): O(λ^2 k^4 exp(14λ) ln(nk/ρ) / η^4). Assumptions: same as for Sparsitron.\n\nTable 1: Sample complexity comparison for different graph recovery algorithms. The pairwise graphical model has alphabet size k. 
For k = 2 (i.e., Ising models), our algorithm reduces to ℓ1-constrained logistic regression (see Table 2 in Appendix A for related work on learning Ising models). Our sample complexity has a better dependency on the alphabet size (Õ(k^4) versus Õ(k^5)) than that in (Klivans and Meka, 2017)².\n\nRelated Work\n\nIn a classic paper, Ravikumar et al. (2010) considered the structure learning problem for Ising models. They showed that ℓ1-regularized logistic regression provably recovers the correct dependency graph from a very small number of samples by solving a convex program for each variable. This algorithm was later generalized to multi-class logistic regression with ℓ2,1 group-sparse regularization, for learning MRFs with higher-order interactions and non-binary variables (Jalali et al., 2011). A well-known limitation of (Ravikumar et al., 2010; Jalali et al., 2011) is that their theoretical guarantees hold only for a restricted class of models. Specifically, they require that the underlying model satisfies technical incoherence assumptions that are difficult to validate or check.\n\nA large amount of recent work has since proposed various algorithms that obtain provable learning results for general graphical models without requiring the incoherence assumptions. We now describe the (most relevant part of the extensive) related work, followed by our results and comparisons (see Table 1). For a discrete pairwise graphical model, let n be the number of variables and k be the alphabet size; define the model width λ as the maximum neighborhood weight (see Definitions 1 and 2 for the precise definitions). For structure learning algorithms, a popular approach is to focus on the sub-problem of finding the neighborhood of a single node. 
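This node-wise approach rests on the fact that, in an Ising model, each node's conditional distribution given the remaining variables is a logistic (sigmoid) function of the neighboring spins: P[z_i = 1 | z_-i] = σ(2(Σ_{j≠i} A_ij z_j + θ_i)). A brute-force numerical check of this identity on a tiny model (a sketch with hypothetical weights, not the paper's code):

```python
# Check that an Ising node's conditional law is a sigmoid of its neighbors.
import itertools
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def ising_conditional(A, theta, i, z_rest):
    """Exact P[z_i = 1 | z_-i], computed by summing over z_i in {-1, +1}."""
    n = len(theta)
    def logp(z):  # log of the unnormalized Ising probability
        s = sum(theta[u] * z[u] for u in range(n))
        s += sum(A[u][v] * z[u] * z[v]
                 for u in range(n) for v in range(u + 1, n))
        return s
    z_plus = list(z_rest); z_plus.insert(i, 1)
    z_minus = list(z_rest); z_minus.insert(i, -1)
    wp, wm = math.exp(logp(z_plus)), math.exp(logp(z_minus))
    return wp / (wp + wm)

# Tiny 3-node example with hypothetical symmetric weights A and field theta.
A = [[0.0, 0.5, 0.0], [0.5, 0.0, -0.3], [0.0, -0.3, 0.0]]
theta = [0.1, 0.0, -0.2]
for z_rest in itertools.product([-1, 1], repeat=2):
    a = A[1][0] * z_rest[0] + A[1][2] * z_rest[1] + theta[1]
    assert abs(ising_conditional(A, theta, 1, z_rest) - sigmoid(2 * a)) < 1e-12
```

This identity is exactly what reduces neighborhood estimation to a logistic regression on the remaining spins.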
Once the neighborhood of every node is correctly learned, the overall graph follows by a union bound over the nodes. Indeed, all the papers we now discuss are of this type. As shown in Table 1, Hamilton et al. (2017) proposed a greedy algorithm to learn pairwise (and higher-order) MRFs with general alphabet. Their algorithm generalizes the approach of Bresler (2015) for learning Ising models. The sample complexity in (Hamilton et al., 2017) grows logarithmically in n, but doubly exponentially in the width λ (although only a single-exponential dependence is necessary (Santhanam and Wainwright, 2012)). Klivans and Meka (2017) provided a different algorithmic and theoretical approach by setting this up as an online learning problem and leveraging results for the Hedge algorithm. Their algorithm, Sparsitron, achieves a single-exponential dependence on λ.\n\nOur Contributions\n\n• Our main result: We show that a classical algorithm, ℓ2,1-constrained³ logistic regression, can recover the edge weights of a discrete pairwise graphical model from i.i.d. samples (see Theorem 2). For the special case of Ising models (see Theorem 1), this reduces to an ℓ1-constrained logistic regression. For the general setting with non-binary alphabet, since each edge has a group of parameters, it is natural to use an ℓ2,1 group-sparse constraint to enforce sparsity at the level of groups. We make no incoherence assumption on the graphical models. As shown in Table 1, our sample complexity scales as Õ(k^4), which improves⁴ on the previous best Õ(k^5) dependency⁵. Our analysis applies a sharp generalization error bound for logistic regression when the weight vector has an ℓ2,1 (or ℓ1) constraint and the sample vector has an ℓ2,∞ (or ℓ∞) constraint (see Lemmas 8 and 11 in Appendix E). Our key insight is that a generalization bound can be used to control the squared distance between the predicted and true logistic functions (see Lemmas 1 and 2 in Appendix B), which then implies an ℓ∞ norm bound between the weight vectors (see Lemmas 5 and 6 in Appendix B).\n\n²Theorem 8.4 in (Klivans and Meka, 2017) has a typo: the correct dependence should be k^5 instead of k^3. In Section 8 of (Klivans and Meka, 2017), after rewriting the conditional distribution as a sigmoid function, the weight vector w is a vector of length (n − 1)k + 1. Their derivation uses the incorrect bound ‖w‖1 ≤ 2λ, whereas it should be ‖w‖1 ≤ 2kλ. This gives rise to an additional k^2 factor in the final sample complexity.\n\n³It may be possible to prove a similar result for the regularized version of the optimization problem using techniques from (Negahban et al., 2012). One would need to prove that the objective function satisfies restricted strong convexity (RSC) when the samples come from a graphical model distribution (Vuffray et al., 2016; Lokhov et al., 2018). It would be interesting to see whether the proof presented in this paper is related to the RSC condition.\n\n• We show that the proposed algorithms can run in Õ(n^2) time without affecting the statistical guarantees (see Section 2.3). Note that Õ(n^2) is an efficient runtime for graph recovery over n nodes. Previous algorithms in (Hamilton et al., 2017; Klivans and Meka, 2017) also require Õ(n^2) runtime for structure learning of pairwise graphical models.\n\n• We construct examples that violate the incoherence condition proposed in (Ravikumar et al., 2010) (see Figure 1). We then run ℓ1-constrained logistic regression and show that it recovers the graph structure when given enough samples. 
This verifies our analysis and shows that our conditions for graph recovery are weaker than those in (Ravikumar et al., 2010).\n\n• We empirically compare the proposed algorithm with the Sparsitron algorithm in (Klivans and Meka, 2017) over different alphabet sizes, and show that our algorithm needs fewer samples for graph recovery (see Figure 2).\n\nNotation. We use [n] to denote the set {1, 2, ..., n}. For a vector x ∈ R^n, we use xi or x(i) to denote its i-th coordinate. The ℓp norm of a vector is defined as ‖x‖p = (Σ_i |xi|^p)^(1/p). We use x−i ∈ R^(n−1) to denote the vector after deleting the i-th coordinate. For a matrix A ∈ R^(n×k), we use Aij or A(i, j) to denote its (i, j)-th entry. We use A(i, :) ∈ R^k and A(:, j) ∈ R^n to denote the i-th row vector and the j-th column vector. The ℓp,q norm of a matrix A ∈ R^(n×k) is defined as ‖A‖p,q = ‖[‖A(1, :)‖p, ..., ‖A(n, :)‖p]‖q. We define ‖A‖∞ = max_{i,j} |A(i, j)|. We use ⟨·,·⟩ to denote the dot product between two vectors, ⟨x, y⟩ = Σ_i xi yi, or two matrices, ⟨A, B⟩ = Σ_{i,j} A(i, j)B(i, j).\n\n2 Main results\n\nWe start with the special case of binary variables (i.e., Ising models), and then move to the general case with non-binary variables.\n\n2.1 Learning Ising models\n\nWe first give a definition of an Ising model distribution.\n\nDefinition 1. Let A ∈ R^(n×n) be a symmetric weight matrix with Aii = 0 for i ∈ [n]. Let θ ∈ R^n be a mean-field vector. The n-variable Ising model is a distribution D(A, θ) on {−1, 1}^n that satisfies\n\nP_{Z∼D(A,θ)}[Z = z] ∝ exp( Σ_{1≤i<j≤n} Aij zi zj + Σ_{i∈[n]} θi zi ).   (1)\n\nThe dependency graph of D(A, θ) is an undirected graph G = (V, E), with vertices V = [n] and edges E = {(i, j) : Aij ≠ 0}. Define the width of D(A, θ) as\n\nλ(A, θ) = max_{i∈[n]} ( Σ_{j∈[n]} |Aij| + |θi| ).   (2)\n\nDefine the minimum edge weight of D(A, θ) as η(A, θ) = min_{(i,j)∈E} |Aij| > 0.\n\nAlgorithm 1: Learning an Ising model via ℓ1-constrained logistic regression\n\nInput: N i.i.d. samples {z^1, ..., z^N}, where z^m ∈ {−1, 1}^n for m ∈ [N]; an upper bound λ on the model width; a lower bound η > 0 on the minimum edge weight.\nOutput: Âij for all i ≠ j ∈ [n]; an undirected graph Ĝ on n nodes.\n1 for i ← 1 to n do\n2   ∀m ∈ [N], x^m ← [z^m_{−i}, 1], y^m ← z^m_i\n3   ŵ ← arg min_{w∈R^n} (1/N) Σ_{m=1}^N ln(1 + e^{−y^m ⟨w, x^m⟩}) s.t. ‖w‖1 ≤ 2λ   (4)\n4   ∀j ∈ [n]\\{i}, Âij ← ŵ_j̃ / 2, where j̃ = j if j < i and j̃ = j − 1 if j > i\n5 end\n6 Form an undirected graph Ĝ on n nodes with edges {(i, j) : |Âij| ≥ η/2, i < j}.\n\nTheorem 1. Let D(A, θ) be an unknown n-variable Ising model distribution with dependency graph G. Suppose that D(A, θ) has width λ(A, θ) ≤ λ. Given ρ ∈ (0, 1) and ε > 0, if the number of i.i.d. samples satisfies N = O(λ^2 exp(12λ) ln(n/ρ)/ε^4), then with probability at least 1 − ρ, Algorithm 1 produces Â that satisfies\n\nmax_{i,j∈[n]} |Aij − Âij| ≤ ε.   (6)\n\nCorollary 1. In the setup of Theorem 1, suppose that the Ising model distribution D(A, θ) has minimum edge weight η(A, θ) ≥ η > 0. If we set ε < η/2 in (6), which corresponds to sample complexity N = O(λ^2 exp(12λ) ln(n/ρ)/η^4), then with probability at least 1 − ρ, Algorithm 1 recovers the dependency graph, i.e., Ĝ = G.\n\n2.2 Learning pairwise graphical models over general alphabet\n\nDefinition 2. Let k be the alphabet size. Let W = {Wij ∈ R^(k×k) : i ≠ j ∈ [n]} be a set of weight matrices satisfying Wij = W^T_ji. 
Without loss of generality, we assume that every row (and column) vector of Wij has zero mean. Let Θ = {θi ∈ R^k : i ∈ [n]} be a set of external field vectors. Then the n-variable pairwise graphical model D(W, Θ) is a distribution over [k]^n where\n\nP_{Z∼D(W,Θ)}[Z = z] ∝ exp( Σ_{1≤i<j≤n} Wij(zi, zj) + Σ_{i∈[n]} θi(zi) ).   (7)\n\nThe dependency graph of D(W, Θ) is an undirected graph G = (V, E), with vertices V = [n] and edges E = {(i, j) : Wij ≠ 0}. The width of D(W, Θ) is defined as\n\nλ(W, Θ) = max_{i∈[n], a∈[k]} ( Σ_{j≠i} max_{b∈[k]} |Wij(a, b)| + |θi(a)| ).   (8)\n\nDefine the minimum edge weight of D(W, Θ) as η(W, Θ) = min_{(i,j)∈E} max_{a,b∈[k]} |Wij(a, b)| > 0.\n\nAlgorithm 2: Learning a pairwise graphical model via ℓ2,1-constrained logistic regression\n\nInput: alphabet size k; N i.i.d. samples {z^1, ..., z^N}, where z^m ∈ [k]^n for m ∈ [N]; an upper bound λ on the model width; a lower bound η > 0 on the minimum edge weight.\nOutput: Ŵij ∈ R^(k×k) for all i ≠ j ∈ [n]; an undirected graph Ĝ on n nodes.\n1 for i ← 1 to n do\n2   for each pair α ≠ β ∈ [k] do\n3     S ← {z^m, m ∈ [N] : z^m_i ∈ {α, β}}\n4     ∀z^t ∈ S, x^t ← OneHotEncode([z^t_{−i}, 1]); y^t ← 1 if z^t_i = α, y^t ← −1 if z^t_i = β\n5     w^{α,β} ← arg min_{w∈R^(n×k)} (1/|S|) Σ_{t=1}^{|S|} ln(1 + e^{−y^t ⟨w, x^t⟩}) s.t. ‖w‖2,1 ≤ 2λ√k   (11)\n6     Define U^{α,β} ∈ R^(n×k) by centering the first n − 1 rows of w^{α,β} (see (12)).\n7   end\n8   for j ∈ [n]\\{i} and α ∈ [k] do\n9     Ŵij(α, :) ← (1/k) Σ_{β∈[k]} U^{α,β}(j̃, :), where j̃ = j if j < i and j̃ = j − 1 if j > i.\n10   end\n11 end\n12 Form graph Ĝ on n nodes with edges {(i, j) : max_{a,b} |Ŵij(a, b)| ≥ η/2, i < j}.\n\nTheorem 2. Let D(W, Θ) be an n-variable pairwise graphical model distribution with width λ(W, Θ) ≤ λ. Given ρ ∈ (0, 1) and ε > 0, if the number of i.i.d. samples satisfies N = O(λ^2 k^4 exp(14λ) ln(nk/ρ)/ε^4), then with probability at least 1 − ρ, Algorithm 2 produces matrices Ŵij ∈ R^(k×k) that satisfy\n\n|Wij(a, b) − Ŵij(a, b)| ≤ ε, ∀i ≠ j ∈ [n], ∀a, b ∈ [k].   (15)\n\nCorollary 2. In the setup of Theorem 2, suppose that the pairwise graphical model distribution D(W, Θ) satisfies η(W, Θ) ≥ η > 0. If we set ε < η/2 in (15), which corresponds to sample complexity N = O(λ^2 k^4 exp(14λ) ln(nk/ρ)/η^4), then with probability at least 1 − ρ, Algorithm 2 recovers the dependency graph, i.e., Ĝ = G.\n\nRemark (ℓ2,1 versus ℓ1 constraint). The matrix w* ∈ R^(n×k) defined in (10) satisfies ‖w*‖2,1 ≤ 2λ√k and ‖w*‖1 ≤ 2λk. Instead of solving the ℓ2,1-constrained logistic regression defined in (11), we could solve an ℓ1-constrained logistic regression with ‖w‖1 ≤ 2λk. This additional √k dependence in the constraint (i.e., 2λk versus 2λ√k) would lead to a worse sample complexity of Õ(k^5).\n\nRemark (dependence on the alphabet size). A simple lower bound on the sample complexity is Ω(k^2). To see why, consider a graph with two nodes (i.e., n = 2). Let W be a k-by-k weight matrix between the two nodes, defined as follows: W(1, 1) = W(2, 2) = 1, W(1, 2) = W(2, 1) = −1, and W(i, j) = 0 otherwise. This definition satisfies the condition that every row and column is centered (Definition 2). 
Besides, we have λ = 1 and η = 1, so these two quantities do not scale with k. To distinguish W from the zero matrix, we need to observe samples in the set {(1, 1), (2, 2), (1, 2), (2, 1)}. This requires Ω(k^2) samples, because any specific sample (a, b) (where a ∈ [k] and b ∈ [k]) shows up with probability approximately 1/k^2.\n\n2.3 Learning pairwise graphical models in Õ(n^2) time\n\nOur results so far assume that the ℓ1-constrained logistic regression (in Algorithm 1) and the ℓ2,1-constrained logistic regression (in Algorithm 2) are solved exactly. This would require Õ(n^4) complexity if an interior-point based method is used (Koh et al., 2007). Our key result in this section is Theorem 3, which says that the statistical guarantees in Theorems 1 and 2 still hold if the constrained logistic regression is only approximately solved.\n\nTheorem 3 (Informal). Suppose that the constrained logistic regression in Algorithms 1 and 2 is optimized by the mirror descent method given in Appendix J. Given ρ ∈ (0, 1) and ε > 0, if the number of mirror descent iterations satisfies T = O(λ^2 k^3 exp(O(λ)) ln(n)/ε^4), then (6) and (15) still hold with probability at least 1 − ρ. The time and space complexity of Algorithm 1 is O(T N n^2) and O(T N + n^2), respectively. The time and space complexity of Algorithm 2 is O(T N n^2 k^2) and O(T N + n^2 k^2), respectively.\n\nThe proof of Theorem 3 requires bounding ‖w̄ − w*‖∞ (where w̄ is the iterate after T mirror descent iterations). This is non-trivial because we are in the high-dimensional regime: since the number of samples is only N = O(ln(n)), the empirical loss functions in (4) and (11) are not strongly convex. Due to the space limit, more details can be found in Appendices I, J and K.\n\nNote that Õ(n^2) is an efficient time complexity for graph recovery over n nodes. 
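As an illustration of the kind of solver Theorem 3 refers to, here is a minimal sketch of entropic mirror descent for the ℓ1-constrained logistic regression, using the standard lift of the ℓ1 ball onto a probability simplex (w = R(u⁺ − u⁻)). This is our own sketch with ad hoc step size and iteration count, not the tuned scheme of Appendix J:

```python
# Entropic mirror descent (exponentiated gradient) for
#   min_w (1/N) sum_m log(1 + exp(-y_m <w, x_m>))  s.t.  ||w||_1 <= R.
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def l1_logreg_mirror_descent(X, y, R, steps=500, eta=0.1):
    N, d = len(X), len(X[0])
    u = [1.0 / (2 * d)] * (2 * d)  # uniform start on the 2d-simplex
    for _ in range(steps):
        w = [R * (u[j] - u[d + j]) for j in range(d)]
        # gradient of the empirical logistic loss with respect to w
        g = [0.0] * d
        for m in range(N):
            margin = y[m] * sum(w[j] * X[m][j] for j in range(d))
            c = -y[m] * sigmoid(-margin) / N
            for j in range(d):
                g[j] += c * X[m][j]
        # multiplicative update; the lifted gradient is R * [g; -g]
        u = [u[j] * math.exp(-eta * R * g[j]) for j in range(d)] + \
            [u[d + j] * math.exp(eta * R * g[j]) for j in range(d)]
        s = sum(u)
        u = [v / s for v in u]  # re-project onto the simplex
    return [R * (u[j] - u[d + j]) for j in range(d)]
```

Each iteration costs O(Nd), so running this for every node gives the overall Õ(n^2)-type scaling discussed above (for fixed accuracy and width).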
Previous structure learning algorithms for Ising models require either Õ(n^2) complexity (Bresler, 2015; Klivans and Meka, 2017) or a worse time complexity (Ravikumar et al., 2010; Vuffray et al., 2016).\n\nIt is possible to improve the time complexity given in Theorem 3 (especially the dependence on ε and λ) by using stochastic or accelerated versions of mirror descent algorithms (instead of the batch version given in Appendix J). In fact, the Sparsitron algorithm proposed by Klivans and Meka (2017) can be seen as an online mirror descent algorithm for optimizing the ℓ1-constrained logistic regression (see Algorithm 3 in Appendix J). Furthermore, Algorithms 1 and 2 can be parallelized, as every node solves an independent regression problem.\n\nWe remark that our goal here is not to give the fastest first-order optimization algorithm. Instead, our goal is to show provably that Algorithms 1 and 2 can run in Õ(n^2) time without affecting the original statistical guarantees.\n\n3 Proof outline\n\nWe give a proof outline for Theorem 1; the proof of Theorem 2 follows a similar outline. Let D be a distribution over {−1, 1}^n × {−1, 1}, where (x, y) ∼ D satisfies P[y = 1|x] = σ(⟨w*, x⟩). Let L(w) = E_{(x,y)∼D} ln(1 + e^{−y⟨w,x⟩}) and L̂(w) = (1/N) Σ_{i=1}^N ln(1 + e^{−y^i⟨w,x^i⟩}) be the expected and empirical logistic losses. Suppose ‖w*‖1 ≤ 2λ, and let ŵ ∈ arg min_w L̂(w) s.t. ‖w‖1 ≤ 2λ. Our goal is to prove that ‖ŵ − w*‖∞ is small when the samples are constructed from an Ising model distribution. Our proof can be summarized in three steps:\n\n1. If the number of samples satisfies N = O(λ^2 ln(n/ρ)/γ^2), then L(ŵ) − L(w*) ≤ O(γ). This is obtained using a sharp generalization bound for the setting ‖w‖1 ≤ 2λ and ‖x‖∞ ≤ 1 (see Lemma 8 in Appendix E).\n\n2. For any w, we show that L(w) − L(w*) ≥ E_x[σ(⟨w, x⟩) − σ(⟨w*, x⟩)]^2 (see Lemmas 10 and 9 in Appendix E). Hence, Step 1 implies that E_x[σ(⟨ŵ, x⟩) − σ(⟨w*, x⟩)]^2 ≤ O(γ) (see Lemma 1 in Appendix B).\n\n3. We then use a result from (Klivans and Meka, 2017) (see Lemma 5 in Appendix B), which says that if the samples are from an Ising model and if γ = O(ε^2 exp(−6λ)), then E_x[σ(⟨ŵ, x⟩) − σ(⟨w*, x⟩)]^2 ≤ O(γ) implies ‖ŵ − w*‖∞ ≤ ε. The required number of samples is N = O(λ^2 ln(n/ρ)/γ^2) = O(λ^2 exp(12λ) ln(n/ρ)/ε^4).\n\nFor the general setting with non-binary alphabet (i.e., Theorem 2), the proof is similar to that of Theorem 1. The main difference is that we need a sharp generalization bound for the setting ‖w‖2,1 ≤ 2λ√k and ‖x‖2,∞ ≤ 1 (see Lemma 11 in Appendix E). This gives us Lemma 2 in Appendix B, which bounds the squared distance between the two sigmoid functions. The last step is to use Lemma 6 to bound the infinity norm between the two weight matrices.\n\n4 Experiments\n\nIn both of the simulations below, the external field is set to zero. Sampling is done by exactly computing the distribution. We implement the algorithm in Matlab. All experiments are done on a personal desktop. Source code can be found at https://github.com/wushanshan/GraphLearn.\n\nLearning Ising models. 
In Figure 1 we construct a diamond-shaped graph and plot the incoherence value at Node 1. This value becomes larger than 1 (and hence violates the incoherence condition in (Ravikumar et al., 2010)) as we increase the graph size n and the edge weight a. We then run Algorithm 1 for 100 runs and plot the fraction of runs that exactly recover the underlying graph structure; each run uses a different set of samples. The result shown in Figure 1 is consistent with our analysis and also indicates that our conditions for graph recovery are weaker than those in (Ravikumar et al., 2010).\n\nFigure 1: Left: The graph structure used in this simulation. It has n nodes and 2(n − 2) edges. Every edge has the same weight a > 0. Middle: Incoherence value at Node 1. The incoherence condition required by (Ravikumar et al., 2010) is violated for n ≥ 10 and a ≥ 0.2. Right: We simulate 100 runs of Algorithm 1 for edge weight a = 0.2 across different n values.\n\nLearning general pairwise graphical models. We compare our algorithm (Algorithm 2) with the Sparsitron algorithm in (Klivans and Meka, 2017) on a two-dimensional 3-by-3 grid (shown in Figure 2). We test two alphabet sizes in the experiments: k = 4, 6. For each value of k, we simulate both algorithms for 100 runs, and in each run we generate random Wij matrices with entries ±0.2. To ensure that each row (as well as each column) of Wij is centered (Definition 2), we randomly choose Wij between two options: for example, if k = 2, then Wij = [0.2, −0.2; −0.2, 0.2] or Wij = [−0.2, 0.2; 0.2, −0.2]. As shown in Figure 2, our algorithm requires fewer samples for successfully recovering the graphs. 
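The exact-recovery criterion used in these plots can be sketched as follows (hypothetical helper functions of our own, not the authors' released Matlab code): threshold the estimated weights at η/2, as in the last step of Algorithm 1, and compare the resulting edge set against the ground truth.

```python
# Edge recovery by thresholding: an edge (i, j) is declared present
# when the estimated weight exceeds eta / 2 in absolute value.
def recovered_edges(A_hat, eta):
    n = len(A_hat)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if abs(A_hat[i][j]) >= eta / 2}

def exact_recovery(A_true, A_hat, eta):
    """True iff the thresholded estimate matches the true edge set exactly."""
    n = len(A_true)
    true_edges = {(i, j) for i in range(n) for j in range(i + 1, n)
                  if A_true[i][j] != 0}
    return recovered_edges(A_hat, eta) == true_edges
```

Averaging `exact_recovery` over independent runs gives the "probability of success" curves reported in the figures.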
More details about this experiment can be found in Appendix L.\n\nFigure 2: Left: A two-dimensional 3-by-3 grid graph used in the simulation. Middle and right: Our algorithm needs fewer samples than the Sparsitron algorithm (Klivans and Meka, 2017) for graph recovery.\n\n5 Conclusion\n\nThe main contribution of this paper is to show that an existing and popular algorithm (i.e., group-sparse regularized logistic regression) actually gives state-of-the-art performance, in a setting where alternative algorithms are being proposed. Specifically, we have shown that ℓ2,1-constrained logistic regression can recover the Markov graph of any discrete pairwise graphical model from i.i.d. samples. For the special case of Ising models, the algorithm reduces to ℓ1-constrained logistic regression. This algorithm has better sample complexity than the previous state-of-the-art result (k^4 versus k^5), and can be efficiently optimized in Õ(n^2) time. One interesting direction for future work is to see whether the 1/η^4 dependency in the sample complexity can be improved. It would also be interesting to see a thorough empirical evaluation of different structure learning algorithms.\n\nAnother interesting direction is to consider MRFs with higher-order interactions. Intuitively, it should not be difficult to prove that ℓ1-constrained logistic regression can recover the structure of binary t-wise MRFs. One can prove this by combining results from Section 7 of (Klivans and Meka, 2017) with the following fact: the Sparsitron algorithm can be viewed as an online mirror descent algorithm that approximately solves an ℓ1-constrained logistic regression. This observation is actually the starting point of our paper. 
For higher-order MRFs with non-binary alphabet, we conjecture that a similar result can be proved for group-sparse regularized logistic regression. Extending the current proof and method to higher-order MRFs is an interesting direction for future research.\n\n6 Acknowledgements\n\nThis research has been supported by NSF Grants 1302435, 1564000, and 1618689, DMS 1723052, CCF 1763702, AF 1901292 and research gifts by Google, Western Digital and NVIDIA.\n\nReferences\n\nAgarwal, A., Negahban, S., and Wainwright, M. J. (2010). Fast global convergence rates of gradient methods for high-dimensional statistical recovery. In Advances in Neural Information Processing Systems, pages 37–45.\n\nAurell, E. and Ekeberg, M. (2012). Inverse Ising inference using all the data. Physical Review Letters, 108(9):090201.\n\nBanerjee, O., Ghaoui, L. E., and d'Aspremont, A. (2008). Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9(Mar):485–516.\n\nBartlett, P. L. and Mendelson, S. (2002). Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482.\n\nBen-Tal, A. and Nemirovski, A. (Fall 2013). Lectures on modern convex optimization. https://www2.isye.gatech.edu/~nemirovs/Lect_ModConvOpt.pdf.\n\nBento, J. and Montanari, A. (2009). Which graphical models are difficult to learn? In Advances in Neural Information Processing Systems, pages 1303–1311.\n\nBresler, G. (2015). Efficiently learning Ising models on arbitrary graphs. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing (STOC), pages 771–782. ACM.\n\nBubeck, S. (2015). Convex optimization: Algorithms and complexity. 
Foundations and Trends® in Machine Learning, 8(3-4):231–357.\n\nChoi, M. J., Lim, J. J., Torralba, A., and Willsky, A. S. (2010). Exploiting hierarchical context on a large database of object categories. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 129–136. IEEE.\n\nEagle, N., Pentland, A. S., and Lazer, D. (2009). Inferring friendship network structure by using mobile phone data. Proceedings of the National Academy of Sciences, 106(36):15274–15278.\n\nHamilton, L., Koehler, F., and Moitra, A. (2017). Information theoretic properties of Markov random fields, and their algorithmic applications. In Advances in Neural Information Processing Systems, pages 2463–2472.\n\nJalali, A., Ravikumar, P., Vasuki, V., and Sanghavi, S. (2011). On learning discrete graphical models using group-sparse regularization. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 378–387.\n\nKakade, S. M., Shalev-Shwartz, S., and Tewari, A. (2012). Regularization techniques for learning with matrices. Journal of Machine Learning Research, 13(Jun):1865–1890.\n\nKakade, S. M., Sridharan, K., and Tewari, A. (2009). On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in Neural Information Processing Systems, pages 793–800.\n\nKlivans, A. R. and Meka, R. (2017). Learning graphical models using multiplicative weights. In Proceedings of the 58th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 343–354. IEEE.\n\nKoh, K., Kim, S.-J., and Boyd, S. (2007). An interior-point method for large-scale ℓ1-regularized logistic regression. Journal of Machine Learning Research, 8(Jul):1519–1555.\n\nLee, S.-I., Ganapathi, V., and Koller, D. (2007). Efficient structure learning of Markov networks using L1-regularization. 
In Advances in Neural Information Processing Systems, pages 817–824.\n\nLokhov, A. Y., Vuffray, M., Misra, S., and Chertkov, M. (2018). Optimal structure and parameter learning of Ising models. Science Advances, 4(3):e1700791.\n\nMarbach, D., Costello, J. C., Küffner, R., Vega, N. M., Prill, R. J., Camacho, D. M., Allison, K. R., Consortium, T. D., Kellis, M., Collins, J. J., and Stolovitzky, G. (2012). Wisdom of crowds for robust gene network inference. Nature Methods, 9(8):796.\n\nNegahban, S. N., Ravikumar, P., Wainwright, M. J., and Yu, B. (2012). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4):538–557.\n\nRavikumar, P., Wainwright, M. J., and Lafferty, J. D. (2010). High-dimensional Ising model selection using ℓ1-regularized logistic regression. The Annals of Statistics, 38(3):1287–1319.\n\nRigollet, P. and Hütter, J.-C. (Spring 2017). Lecture notes on high dimensional statistics. http://www-math.mit.edu/~rigollet/PDFs/RigNotes17.pdf.\n\nSanthanam, N. P. and Wainwright, M. J. (2012). Information-theoretic limits of selecting binary graphical models in high dimensions. IEEE Transactions on Information Theory, 58(7):4117–4134.\n\nShalev-Shwartz, S. and Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.\n\nVon Neumann, J. (1949). On rings of operators. Reduction theory. Annals of Mathematics, pages 401–485.\n\nVuffray, M., Misra, S., Lokhov, A., and Chertkov, M. (2016). Interaction screening: Efficient and sample-optimal learning of Ising models. In Advances in Neural Information Processing Systems, pages 2595–2603.\n\nVuffray, M., Misra, S., and Lokhov, A. Y. (2019). Efficient learning of discrete graphical models. arXiv preprint arXiv:1902.00600.\n\nYang, E., Allen, G., Liu, Z., and Ravikumar, P. K. (2012). 
Graphical models via generalized linear models. In Advances in Neural Information Processing Systems, pages 1358–1366.\n\nYuan, M. and Lin, Y. (2007). Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19–35.\n", "award": [], "sourceid": 4413, "authors": [{"given_name": "Shanshan", "family_name": "Wu", "institution": "University of Texas at Austin"}, {"given_name": "Sujay", "family_name": "Sanghavi", "institution": "UT-Austin"}, {"given_name": "Alexandros", "family_name": "Dimakis", "institution": "University of Texas, Austin"}]}