{"title": "Entropic Graph Regularization in Non-Parametric Semi-Supervised Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 1803, "page_last": 1811, "abstract": "We prove certain theoretical properties of a graph-regularized transductive learning objective that is based on minimizing a Kullback-Leibler divergence based loss. These include showing that the iterative alternating minimization procedure used to minimize the objective converges to the correct solution and deriving a test for convergence. We also propose a graph node ordering algorithm that is cache cognizant and leads to a linear speedup in parallel computations. This ensures that the algorithm scales to large data sets. By making use of empirical evaluation on the TIMIT and Switchboard I corpora, we show this approach is able to out-perform other state-of-the-art SSL approaches. In one instance, we solve a problem on a 120 million node graph.", "full_text": "Entropic Graph Regularization in Non-Parametric\n\nSemi-Supervised Classi\ufb01cation\n\nAmarnag Subramanya & Jeff Bilmes\n\nDepartment of Electrical Engineering, University of Washington, Seattle.\n\n{asubram,bilmes}@ee.washington.edu\n\nAbstract\n\nWe prove certain theoretical properties of a graph-regularized transductive learn-\ning objective that is based on minimizing a Kullback-Leibler divergence based\nloss. These include showing that the iterative alternating minimization procedure\nused to minimize the objective converges to the correct solution and deriving a test\nfor convergence. We also propose a graph node ordering algorithm that is cache\ncognizant and leads to a linear speedup in parallel computations. This ensures that\nthe algorithm scales to large data sets. By making use of empirical evaluation on\nthe TIMIT and Switchboard I corpora, we show this approach is able to outper-\nform other state-of-the-art SSL approaches. 
In one instance, we solve a problem\non a 120 million node graph.\n\nIntroduction\n\n1\nThe process of training classi\ufb01ers with small amounts of labeled data and relatively large amounts\nof unlabeled data is known as semi-supervised learning (SSL). In many applications, such as speech\nrecognition, annotating training data is time-consuming, tedious and error-prone. SSL lends itself as\na useful technique in such situations as one only needs to annotate small amounts of data for training\nmodels. For a survey of SSL algorithms, see [1, 2]. In this paper we focus on graph-based SSL [1].\nHere one assumes that the labeled and unlabeled samples are embedded within a low-dimensional\nmanifold expressed by a graph \u2014 each data sample is represented by a vertex within a weighted\ngraph with the weights providing a measure of similarity between vertices. Some graph-based SSL\napproaches perform random walks on the graph for inference [3, 4] while others optimize a loss\nfunction based on smoothness constraints derived from the graph [5, 6, 7, 8]. Graph-based SSL\nalgorithms are inherently non-parametric, transductive and discriminative [2]. The results of the\nbenchmark SSL evaluations in chapter 21 of [1] show that graph-based algorithms are in general\nbetter than other SSL algorithms.\nMost of the current graph-based SSL algorithms have a number of shortcomings \u2013 (a) in many cases,\nsuch as [6, 9], a two class problem is assumed; this necessitates the use of sub-optimal extensions\nlike one vs. rest to solve multi-class problems, (b) most graph-based SSL algorithms (exceptions in-\nclude [7, 8]) attempt to minimize squared error which is not optimal for classi\ufb01cation problems [10],\nand (c) there is a lack of principled approaches to integrating class prior information into graph-based\nSSL algorithms. 
Approaches such as class mass normalization and label bidding are used as a post-processing step rather than being tightly integrated with the inference. To address some of the above issues, we proposed a new graph-based SSL algorithm based on minimizing a Kullback-Leibler divergence (KLD) based loss in [11]. The advantages of this approach include a straightforward extension to multi-class problems, the ability to handle label uncertainty, and the ability to integrate priors. We also showed that this objective can be minimized using alternating minimization (AM), and can outperform other state-of-the-art SSL algorithms for document classification.
Another criticism of previous work in graph-based SSL (and SSL in general) is the lack of algorithms that scale to very large data sets. SSL is based on the premise that unlabeled data is easily obtained, and that adding large quantities of unlabeled data leads to improved performance. Thus practical scalability (e.g., parallelization) is very important in SSL algorithms. [12, 13] discuss the application of TSVMs to large-scale problems. [14] suggests an algorithm for improving the induction speed in the case of graph-based algorithms. [15] solves a graph transduction problem with 650,000 samples. To the best of our knowledge, the largest graph-based problem solved to date had about 900,000 samples (including both labeled and unlabeled data) [16]. Clearly, this is a fraction of the amount of unlabeled data at our disposal. For example, on the Internet alone we create 1.6 billion blog posts, 60 billion emails, 2 million photos and 200,000 videos every day [17].
The goal of this paper is to provide theoretical analysis of the algorithm proposed in [11] and also to show how it can be scaled to very large problems. We first prove that AM on our KLD-based objective converges to the true optimum.
We also provide a test for convergence and discuss some theoretical connections between the two SSL objectives proposed in [11]. In addition, we propose a graph node ordering algorithm that is cache cognizant and makes obtaining a linear speedup with a parallel implementation more likely. As a result, the algorithms are able to scale to very large datasets. The node ordering algorithm is quite general and can be applied to graph-based SSL algorithms such as [5, 11]. In one instance, we solve an SSL problem over a graph with 120 million vertices. We use the phone classification problem to demonstrate the scalability of the algorithm. We believe that speech recognition is an ideal application for SSL, and in particular graph-based SSL, for several reasons: (a) human speech is produced by a small number of articulators and is thus amenable to representation in a low-dimensional manifold [18]; (b) annotating speech data is time-consuming and costly; and (c) the data sets tend to be very large.
2 Graph-based SSL
Let D_l = {(x_i, r_i)}_{i=1}^{l} be the set of labeled samples, D_u = {x_i}_{i=l+1}^{l+u} the set of unlabeled samples, and D ≜ {D_l, D_u}. Here r_i is an encoding of the labeled data and will be explained shortly. We are interested in solving the transductive learning problem, i.e., given D, the task is to predict the labels of the samples in D_u. The first step in most graph-based SSL algorithms is the construction of an undirected weighted graph G = (V, E), where the vertices (nodes) V = {1, . . . , m}, m = l + u, are the data points in D and the edges E ⊆ V × V. Let V_l and V_u be the sets of labeled and unlabeled vertices respectively. G may be represented via a symmetric matrix W, which is referred to as the weight or affinity matrix. There are many ways of constructing the graph (see section 6.2 in [2]).
In this paper, we use symmetric k-nearest neighbor (NN) graphs: we first form w_ij ≜ [W]_ij = sim(x_i, x_j) and then make this graph sparse by setting w_ij = 0 unless i is one of j's k nearest neighbors or j is one of i's k nearest neighbors. It is assumed that sim(x, y) = sim(y, x). Let N(i) be the set of neighbors of vertex i. Choosing the correct similarity measure and |N(i)| are crucial steps in the success of any graph-based SSL algorithm, as together they determine the graph [2].
For each i ∈ V and j ∈ V_l, define probability measures p_i and r_j respectively over the measurable space (Y, 𝒴). Here 𝒴 is the σ-field of measurable subsets of Y, and Y ⊂ N (the set of natural numbers) is the space of classifier outputs. Thus |Y| = 2 yields binary classification while |Y| > 2 implies multi-class. As we only consider classification problems here, p_i and r_i are multinomial distributions; p_i(y) is the probability that x_i belongs to class y, and the classification result is given by argmax_y p_i(y). {r_j}, j ∈ V_l, encodes the labels of the supervised portion of the training data. If the labels are known with certainty, then r_j is a "one-hot" vector (with the single 1 at the appropriate position in the vector). r_j is also capable of representing cases where the label is uncertain, for example when the labels are in the form of a distribution (possibly derived from normalizing scores representing confidence). It is important to distinguish between the classical multi-label problem and the use of uncertainty in r_j. If r_j(ȳ_1), r_j(ȳ_2) > 0 with ȳ_1 ≠ ȳ_2, it does not imply that the input x_j possesses two output labels ȳ_1 and ȳ_2. Rather, r_j represents our belief in the various values of the output.
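As a concrete illustration, the symmetric k-NN construction and the label encodings r_j just described can be sketched in a few lines. The toy sim() and the brute-force neighbor search below are our own illustrative choices; for large graphs the paper uses approximate NN search (Section 5):

```python
import math

def build_symmetric_knn(points, k, sim):
    """Sparse symmetric k-NN affinity (as a dict of edges): w_ij is kept
    iff i is among j's k nearest neighbors or j is among i's, so vertex
    degrees may exceed k. Brute force, for illustration only."""
    m = len(points)
    knn = [set(sorted((j for j in range(m) if j != i),
                      key=lambda j: -sim(points[i], points[j]))[:k])
           for i in range(m)]
    return {(i, j): sim(points[i], points[j])
            for i in range(m) for j in range(m)
            if j != i and (j in knn[i] or i in knn[j])}

def one_hot(label, num_classes):
    """r_j for a label known with certainty; a soft distribution may be
    substituted when labels are uncertain."""
    r = [0.0] * num_classes
    r[label] = 1.0
    return r

# toy RBF similarity on 1-d points (illustrative only)
sim = lambda x, y: math.exp(-(x - y) ** 2)
W = build_symmetric_knn([0.0, 0.1, 0.2, 5.0], k=1, sim=sim)
```

Note that the "or" in the sparsification rule is what makes the resulting affinity matrix symmetric even though k-NN itself is not a symmetric relation.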
As p_i and r_i are probability measures, they lie within a |Y|-dimensional probability simplex, which we represent using △_{|Y|}, and so p_i, r_i ∈ △_{|Y|} (henceforth denoted △). Also let p ≜ (p_1, . . . , p_m) ∈ △^m ≜ △ × . . . × △ (m times) and r ≜ (r_1, . . . , r_l) ∈ △^l.
Consider the optimization problem proposed in [11], where p* = argmin_{p ∈ △^m} C1(p) and

C_1(p) = \sum_{i=1}^{l} D_{KL}(r_i \,\|\, p_i) + \mu \sum_{i=1}^{m} \sum_{j \in N(i)} w_{ij} D_{KL}(p_i \,\|\, p_j) - \nu \sum_{i=1}^{m} H(p_i).

Here H(p) = -\sum_y p(y) \log p(y) is the Shannon entropy of p, and D_{KL}(p \,\|\, q) = \sum_y p(y) \log \frac{p(y)}{q(y)} is the KLD between measures p and q. If μ, ν, w_ij ≥ 0 for all i, j, then C1(p) is convex [19]. (μ, ν) are hyper-parameters whose choice we discuss in Section 5. The first term in C1 penalizes the solution p_i, i ∈ {1, . . . , l}, when it is far away from the labeled training data D_l, but it does not insist that p_i = r_i, as allowing for deviations from r_i can help, especially with noisy labels [20] or when the graph is extremely dense in certain regions. The second term of C1 penalizes a lack of consistency with the geometry of the data, i.e., it is a graph regularizer: if w_ij is large, we prefer a solution in which p_i and p_j are close in the KLD sense. The last term encourages each p_i to be close to the uniform distribution if not preferred to the contrary by the first two terms. This acts as a guard against degenerate solutions commonly encountered in graph-based SSL [6], e.g., in cases where a sub-graph is not connected to any labeled vertex. We conjecture that by maximizing the entropy of each p_i, the classifier has a better chance of producing high-entropy results in graph regions of low confidence (e.g.,
close to the decision boundary and/or low-density regions). To recap, C1 makes use of the manifold assumption, is naturally multi-class, and is able to encode label uncertainty.
As C1 is convex in p with linear constraints, we have a convex programming problem. However, a closed-form solution does not exist, and so standard numerical optimization approaches such as interior point methods (IPM) or the method of multipliers (MOM) can be used to solve the problem. But each of these approaches has its own shortcomings and is rather cumbersome to implement (e.g., an implementation of MOM to solve this problem would have 7 extraneous parameters). Thus, in [11], we proposed the use of AM for minimizing C1. We will address the question of whether AM is superior to IPMs or MOMs for minimizing C1 shortly.
Consider the problem of minimizing d(p, q) over p ∈ P, q ∈ Q. Sometimes solving this problem directly is hard, and in such cases AM lends itself as a valuable tool for efficient optimization. It is an iterative process in which p^{(n)} = argmin_{p ∈ P} d(p, q^{(n-1)}) and q^{(n)} = argmin_{q ∈ Q} d(p^{(n)}, q). The Expectation-Maximization (EM) algorithm [21] is an example of AM. C1 is not amenable to optimization using AM, and so we have proposed a modified version of the objective where (p*, q*) = argmin_{p, q ∈ △^m} C2(p, q) and

C_2(p, q) = \sum_{i=1}^{l} D_{KL}(r_i \,\|\, q_i) + \mu \sum_{i=1}^{m} \sum_{j \in N'(i)} w'_{ij} D_{KL}(p_i \,\|\, q_j) - \nu \sum_{i=1}^{m} H(p_i).

In the above, a third measure q_i, for every i ∈ V, is defined over the measurable space (Y, 𝒴), W' = W + αI_m, N'(i) = {i} ∪ N(i), and α ≥ 0.
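For concreteness, both objectives can be evaluated directly from these definitions. The following is a minimal sketch; the dictionary-of-edges weights and callable neighbor lists are our own illustrative encodings, not the paper's implementation:

```python
import math

def kld(p, q):
    # D_KL(p || q) = sum_y p(y) log(p(y) / q(y)), with 0 log 0 := 0
    return sum(py * math.log(py / qy) for py, qy in zip(p, q) if py > 0)

def entropy(p):
    # Shannon entropy H(p) = -sum_y p(y) log p(y)
    return -sum(py * math.log(py) for py in p if py > 0)

def C1(p, r, W, neighbors, mu, nu):
    """Label loss + KLD graph regularizer - entropy regularizer."""
    labeled = sum(kld(r[i], p[i]) for i in range(len(r)))
    smooth = sum(W[(i, j)] * kld(p[i], p[j])
                 for i in range(len(p)) for j in neighbors(i))
    return labeled + mu * smooth - nu * sum(entropy(pi) for pi in p)

def C2(p, q, r, Wp, neighbors_p, mu, nu):
    """Reformulated objective over two sets of measures; Wp = W + alpha*I
    and neighbors_p(i) includes i itself, gluing p_i to q_i."""
    labeled = sum(kld(r[i], q[i]) for i in range(len(r)))
    smooth = sum(Wp[(i, j)] * kld(p[i], q[j])
                 for i in range(len(p)) for j in neighbors_p(i))
    return labeled + mu * smooth - nu * sum(entropy(pi) for pi in p)
```

As a sanity check, for two vertices both holding uniform distributions, with vertex 1 labeled, the only surviving term of C1 is D_KL(r_1 || p_1) = log 2.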
Here the q_i's play a similar role as the p_i's and can potentially be used to obtain a final classification result (argmax_y q_i(y)), but α, which is a hyper-parameter, plays an important role in ensuring that p_i and q_i are close for every i. It should be at least intuitively clear that as α gets large, the reformulated objective (C2) approaches the original objective (C1). Our results from [11] suggest that setting α = 2 ensures that p* = q* (more on this in the next section). It is important to highlight that C2(p, q) is itself still a valid SSL criterion. While the first term encourages q_i for the labeled vertices to be close to the labels r_i, the last term encourages higher-entropy p_i's. The second term, in addition to acting as a graph regularizer, also acts as glue between the p's and q's. The update equations for solving C2(p, q) are given by

p_i^{(n)}(y) = \frac{\exp\left\{ \frac{\mu}{\gamma_i} \sum_j w'_{ij} \log q_j^{(n-1)}(y) \right\}}{\sum_{y'} \exp\left\{ \frac{\mu}{\gamma_i} \sum_j w'_{ij} \log q_j^{(n-1)}(y') \right\}} \quad \text{and} \quad q_i^{(n)}(y) = \frac{r_i(y)\,\delta(i \le l) + \mu \sum_j w'_{ji}\, p_j^{(n)}(y)}{\delta(i \le l) + \mu \sum_j w'_{ji}},

where γ_i = ν + μ Σ_j w'_ij. Intuitively, discrete probability measures are being propagated between vertices along edges, so we refer to this algorithm as measure propagation (MP).
When AM is used to solve an optimization problem, a closed-form solution to each of the steps of the AM is desired but not always guaranteed [7]. It can be seen that solving C2 using AM has a single additional hyper-parameter, while other approaches such as MOM can have as many as 7. Further, as we show in section 4, the AM update equations can be easily parallelized.
We briefly comment on the relationship to previous work. As noted in section 1, a majority of the previous graph-based SSL algorithms are based on minimizing squared error [6, 5]. While these objectives are convex and in some cases admit closed-form (i.e., non-iterative) solutions, they require inverting a matrix of size m × m. Thus in the case of very large data sets (e.g., like the one we consider in section 5), it might not be feasible to use this approach; therefore, an iterative update is employed in practice. Also, squared error is only optimal under a Gaussian loss model and is thus more suitable for regression than classification problems: squared loss penalizes absolute error, while KLD, on the other hand, penalizes relative error (pages 226 and 235 of [10]). Henceforth, we refer to a multi-class extension of algorithm 11.2 in [20] as SQ-Loss.
The Information Regularization (IR) [7] approach and subsequently the algorithm of Tsuda [8] use KLD-based objectives and utilize AM to solve the problem. However, these algorithms are motivated from a different perspective. In fact, as stated above, one of the steps of the AM procedure in the case of IR does not admit a closed-form solution. In addition, neither IR nor the work of Tsuda uses an entropy regularizer, which, as our results will show, leads to improved performance. While the two steps of the AM procedure in the case of Tsuda's work have closed-form solutions and the approach is applicable to hyper-graphs, one of the updates (equation 13 in [8]) is a special case of the update for p_i^{(n)}. For more connections to previous approaches, see Section 4 in [11].
3 Optimal Convergence of AM on C2
We show that AM on C2 converges to the minimum of C2, and that there exists a finite α such that the optimal solutions of C1 and C2 are identical. Therefore, C2 is a perfect tractable surrogate for C1. In general, AM is not always guaranteed to converge to the correct solution. For example, consider minimizing f(x, y) = x^2 + 3xy + y^2 over x, y ∈ R, where f(x, y) is unbounded below (consider y = −x). But AM converges to (x*, y*) = (0, 0), which is incorrect (see [22] for more examples). For AM to converge to the correct solution, certain conditions must be satisfied. These might include topological properties of the optimization problem [23, 24] or certain geometrical properties [25]. The latter is referred to as the Information Geometry approach, where the 5-points property (5-pp) [25] plays an important role in determining convergence, and is the method of choice here.
Theorem 3.1 (Convergence of AM on C2). If p^{(n)} = argmin_{p ∈ △^m} C2(p, q^{(n-1)}), q^{(n)} = argmin_{q ∈ △^m} C2(p^{(n)}, q), and q_i^{(0)}(y) > 0 for all y ∈ Y and all i, then
(a) C2(p, q) + C2(p, p^{(0)}) ≥ C2(p, q^{(1)}) + C2(p^{(1)}, q^{(1)}) for all p, q ∈ △^m, and
(b) lim_{n→∞} C2(p^{(n)}, q^{(n)}) = inf_{p,q ∈ △^m} C2(p, q).
Proof Sketch: (a) is the 5-pp for C2(p, q). The 5-pp holds if the 3-points (3-pp) and 4-points (4-pp) properties hold. In order to show the 3-pp, let f(t) ≜ C2(p(t), q^{(0)}) where p(t) = (1 − t)p + t p^{(1)}, 0 < t ≤ 1. Next we use the fact that the first-order Taylor approximation underestimates a convex function to upper bound the gradient of f(t) w.r.t. t. We then pass to the limit as t → 1 and use the monotone convergence theorem to exchange the limit and the summation. This gives the 3-pp. The proof for the 4-pp follows in a similar manner. (b) follows as a result of Theorem 3 in [25]. □
Theorem 3.2 (Test for Convergence). If {(p^{(n)}, q^{(n)})}_{n=1}^{∞} is generated by AM of C2(p, q) and C2(p*, q*) ≜ inf_{p,q ∈ △^m} C2(p, q), then

C_2(p^{(n)}, q^{(n)}) - C_2(p^*, q^*) \le \sum_{i=1}^{m} \left( \delta(i \le l) + d_i \right) \beta_i, \quad \text{where} \quad \beta_i \triangleq \log \sup_{y} \frac{q_i^{(n)}(y)}{q_i^{(n-1)}(y)}, \quad d_j = \sum_{i} w'_{ij}.

Proof Sketch: As the 5-pp holds for all p, q ∈ △^m, it also holds for p = p* and q = q*. We use the fact that E[f(Z)] ≤ sup_z f(z), where Z is a random variable and f(·) is an arbitrary function. □
The above means that AM on C2 converges to its optimal value. We also have the following theorems, which show the existence of a finite lower bound on α such that the optima of C1 and C2 are the same.
Lemma 3.3. If C2(p, q; w'_ii = 0) denotes C2 when the diagonal elements of the affinity matrix are all zero, then

\min_{p, q \in \triangle^m} C_2(p, q; w'_{ii} = 0) \le \min_{p \in \triangle^m} C_1(p).

Theorem 3.4. Given any A, B, S ∈ △^m (i.e., A = [a_1, . . . , a_m], B = [b_1, . . . , b_m], S = [s_1, . . . , s_m]) such that a_i(y), b_i(y), s_i(y) > 0 for all i, y, and A ≠ B (i.e., not all a_i(y) = b_i(y)), there exists a finite α such that

C_2(A, B) \ge C_2(S, S) = C_1(S).

Theorem 3.5 (Equality of Solutions of C1 and C2). Let p̂ = argmin_{p ∈ △^m} C1(p) and (p*_{α̃}, q*_{α̃}) = argmin_{p,q ∈ △^m} C2(p, q; α̃) for an arbitrary α = α̃ > 0, where p*_{α̃} = (p*_{1;α̃}, · · · , p*_{m;α̃}) and q*_{α̃} = (q*_{1;α̃}, · · · , q*_{m;α̃}). Then there exists a finite α̂ such that, at convergence of AM, we have p̂ = p*_{α̂} = q*_{α̂}. Further, if p*_{α̃} ≠ q*_{α̃}, then

\hat{\alpha} \ge \frac{C_1(\hat{p}) - C_2(p^*, q^*; \alpha = 0)}{\mu \sum_{i=1}^{m} D_{KL}(p^*_{i;\tilde{\alpha}} \,\|\, q^*_{i;\tilde{\alpha}})},

and if p*_{α̃} = q*_{α̃}, then α̂ ≥ α̃.
4 Parallelism and Scalability to Large Datasets
One big advantage of AM on C2 over optimizing C1 directly is that it is naturally amenable to a parallel implementation, and is also amenable to further optimizations (see below) that yield a near-linear speedup. Consider the update equations of Section 2. We see that one set of measures is held fixed while the other set is updated without any required communication amongst set members, so there is no write contention. This immediately yields a T ≥ 1-threaded implementation where the graph is evenly T-partitioned and each thread operates over only a size m/T = (l + u)/T subset of the graph nodes.
We constructed a 10-NN graph using the standard TIMIT training and development sets (see section 5). The graph had 1.4 million vertices. We ran a timing test on a 16-core symmetric multiprocessor with 128GB of RAM, each core operating at 1.6GHz. We varied the number T of threads from 1 (single-threaded) up to 16, in each case running 3 iterations of AM (i.e., 3 each of the p and q updates). Each experiment was repeated 10 times, and we measured the minimum CPU time over these 10 runs (only total CPU time was taken into account). The speedup for T threads is typically defined as the ratio of the time taken by a single thread to the time taken by T threads. The solid (black) line in figure 1(a) represents the ideal case (a linear speedup), i.e., when using T threads results in a speedup of T.
The pointed (green) line shows the actual speedup of the above procedure, typically less than ideal due to inter-process communication and poor shared L1 and/or L2 microprocessor cache interaction. When T ≤ 4, the speedup (green) is close to ideal, but for increasing T the performance diminishes away from the ideal case.
Our contention is that the sub-linear speedup is due to the poor cache cognizance of the algorithm. At a given point in time, suppose thread t ∈ {1, . . . , T} is operating on node i_t. The collective set of neighbors being used by these T threads is ∪_{t=1}^{T} N(i_t), and this, along with the nodes ∪_{t=1}^{T} {i_t} (and all memory for the associated measures), constitutes the current working set. The working set should be made as small as possible to increase the chance that it will fit in the microprocessor caches, but this becomes decreasingly likely as T increases, since the working set is monotonically increasing with T. Our goal, therefore, is for the nodes that are being simultaneously operated on to have a large amount of neighbor overlap, thus minimizing the working set size. Viewed as an optimization problem, we must find a partition (V_1, V_2, . . . , V_{m/T}) of V that minimizes max_{j ∈ {1,...,m/T}} |∪_{v ∈ V_j} N(v)|. With such a partition, we may also order the subsets so that the neighbors of V_i have maximal overlap with the neighbors of V_{i+1}. We then schedule the T nodes in V_j to run simultaneously, and schedule the V_j sets successively.
Of course, the time to produce such a partition cannot dominate the time to run the algorithm itself. Therefore, we propose a simple, fast node ordering procedure (Algorithm 1) that can be run once before the parallelization begins.
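The working-set quantity being minimized, max_j |∪_{v ∈ V_j} N(v)|, is cheap to evaluate for any candidate partition. A small sketch (the toy chain graph is our own example, not from the paper):

```python
def working_set_size(partition, neighbors):
    """Size of the largest collective neighbor set over the blocks of a
    partition of V: max_j |union of N(v) for v in V_j|, the quantity the
    node ordering heuristic tries to keep small."""
    return max(len(set().union(*(neighbors[v] for v in block)))
               for block in partition)

# chain graph 0-1-2-3: scheduling adjacent nodes together shares neighbors
neighbors = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
good = [[0, 1], [2, 3]]  # high neighbor overlap within each block
bad = [[0, 3], [1, 2]]   # low overlap: larger peak working set
```

On this toy graph, the overlapping schedule yields a smaller peak working set than the non-overlapping one, which is exactly the effect the ordering exploits.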
The algorithm orders the nodes such that successive nodes are likely to have a high amount of neighbor overlap with each other and, by transitivity, with nearby nodes in the ordering. It does this by, given a node v, choosing another node v' from amongst v's neighbors' neighbors (meaning the neighbors of v's neighbors) that has the highest neighbor overlap. We need not search all of V for this, since any node other than v's neighbors' neighbors has no overlap with the neighbors of v.

Algorithm 1 Graph Ordering Algorithm
Select an arbitrary node v.
while there are any unselected nodes remaining do
  Let N(v) be the set of neighbors, and N^2(v) the set of neighbors' neighbors, of v.
  Select a currently unselected v' ∈ N^2(v) such that |N(v) ∩ N(v')| is maximized. If the intersection is empty, select an arbitrary unselected v'.
  v ← v'.
end while

Figure 1: (a) Speedup vs. number of threads for the TIMIT graph (see section 5). The process was run on a 128GB, 16-core machine with each core at 1.6GHz. (b) The actual CPU times in seconds on a log scale vs. number of threads, for the with- and without-ordering cases.

Given such an ordering, the t-th thread operates on nodes {t, t + m/T, t + 2m/T, . . .}. If the threads proceed synchronously (which we do not enforce), the set of nodes being processed at any time instant is {1 + jm/T, 2 + jm/T, . . . , T + jm/T}. This assignment is beneficial not only for maximizing the set of neighbors being simultaneously used, but also for successive chunks of T nodes: once a chunk of T nodes has been processed, it is likely that many of the neighbors of the next chunk of T nodes will already have been pre-fetched into the caches.
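A minimal sketch of Algorithm 1 follows; the data structures and tie-breaking are our own choices (the paper's implementation keeps sorted neighbor lists so intersections cost O(k)):

```python
def order_nodes(neighbors):
    """Greedy sketch of Algorithm 1: from the current node v, step to the
    unselected node among v's neighbors' neighbors sharing the most
    neighbors with v; if none exists, jump to an arbitrary unselected node."""
    unselected = set(neighbors)
    v = next(iter(unselected))  # arbitrary start node
    order = [v]
    unselected.remove(v)
    while unselected:
        # only neighbors' neighbors can overlap with N(v)
        n2 = {u for w in neighbors[v] for u in neighbors[w]} & unselected
        if n2:
            v = max(n2, key=lambda u: len(neighbors[v] & neighbors[u]))
        else:
            v = next(iter(unselected))  # empty intersection: restart anywhere
        order.append(v)
        unselected.remove(v)
    return order

order = order_nodes({0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}})
```

The restriction of the search to N^2(v) is what keeps the whole pass linear in m for bounded degree, since every other node has provably zero neighbor overlap with v.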
With the graph represented as an adjacency list, and the sets of neighbor indices sorted, our algorithm is O(mk^3) in time and linear in memory, since the intersection between two sorted lists may be computed in O(k) time. This is sometimes better than O(m log m), namely for cases where k^3 < log m, which is true for very large m.
We ordered the TIMIT graph nodes and ran timing tests as explained above. To be fair, the time required for node ordering is charged against every run. The results are shown in figure 1(a) (pointed red line), where the results are much closer to ideal, and there are no obvious diminishing returns as in the unordered case. Running times are given in figure 1(b). Moreover, the ordered case showed better performance even for a single thread T = 1 (CPU time of 539s vs. 565s for ordered vs. unordered respectively, on 3 iterations of AM).
We conclude this section by noting that (a) re-ordering may be considered a pre-processing (offline) step, (b) the SQ-Loss algorithm may also be implemented in a multi-threaded manner, and this is supported by our implementation, (c) our re-ordering algorithm is general and fast and can be used for any graph-based algorithm where the iterative updates for a given node are a function of its neighbors (i.e., the updates are harmonic w.r.t. the graph [5]), and (d) while the focus here was on parallelization across different processors on a symmetric multiprocessor, this would also apply to distributed processing across a network with a shared network disk.
5 Results
In this section we present results on two popular phone classification tasks.
We use SQ-Loss as the competing graph-based algorithm and compare its performance against that of MP because (a) SQ-Loss has been shown to outperform its other variants, such as label propagation [4] and the harmonic function algorithm [5], (b) SQ-Loss scales easily to very large data sets, unlike approaches such as spectral graph transduction [6], and (c) SQ-Loss gives similar performance to other algorithms that minimize squared error, such as manifold regularization [20].

Figure 2: Phone accuracy on the TIMIT test set (a, left) and phone accuracy vs. amount of SWB training data (b, right). With all SWB data added, the graph has 120 million nodes.

TIMIT Phone Classification: TIMIT is a corpus of read speech that includes time-aligned phonetic transcriptions. As a result, it has been popular in the speech community for evaluating supervised phone classification algorithms [26]. Here, we use it to evaluate SSL algorithms by using fractions of the standard TIMIT training set, i.e., simulating the case when only small amounts of data are labeled. We constructed a symmetrized 10-NN graph (Gtimit) over the TIMIT training and development sets (minimum graph degree is 10). The graph had about 1.4 million vertices. We used sim(x_i, x_j) = exp{−(x_i − x_j)^T Σ^{−1} (x_i − x_j)}, where Σ is the covariance matrix computed over the entire training set. To obtain the features x_i, we first extracted mel-frequency cepstral coefficients (MFCCs) along with deltas in the manner described in [27]. As phone classification performance is improved with context information, each x_i was constructed using a 7-frame context window.
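The 7-frame context stacking just described can be sketched as follows. The edge handling (repeating the boundary frames) is our assumption, as the paper does not specify it:

```python
def context_windows(frames, context=7):
    """Stack each frame with its (context-1)/2 neighbors on either side.
    Edge frames are padded here by repeating the first/last frame; this
    is one common convention, assumed rather than taken from the paper."""
    half = context // 2
    padded = [frames[0]] * half + list(frames) + [frames[-1]] * half
    return [sum((padded[i + j] for j in range(context)), [])
            for i in range(len(frames))]

feats = [[float(t)] for t in range(5)]  # toy 1-d "MFCC" frames
X = context_windows(feats, context=7)   # 5 windows, each 7 x 1 = 7-dim
```

Each output vector is centered on its frame, so x_i carries 3 frames of left context and 3 of right context.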
We follow the standard practice of building models to classify 48 phones (|Y| = 48) and then mapping down to 39 phones for scoring [26].
We compare the performance of MP against MP with no entropy regularization (ν = 0), SQ-Loss, and a supervised state-of-the-art L2-regularized multi-layered perceptron (MLP) [10]. The hyper-parameters in each case, i.e., the number of hidden units and the regularization weight in the case of the MLP, and μ and ν in the case of MP and SQ-Loss, were tuned on the development set. For MP and SQ-Loss, the hyper-parameters were tuned over the sets μ ∈ {1e-8, 1e-4, 0.01, 0.1} and ν ∈ {1e-8, 1e-6, 1e-4, 0.01, 0.1}. We found that setting α = 1 in the case of MP ensured that p = q at convergence. As both MP and SQ-Loss are transductive, in order to measure performance on an independent test set, we induce the labels using the Nadaraya-Watson estimator (see section 6.4 in [2]) with 50 NNs, using the similarity measure defined above.
Figure 2(a) shows the phone classification results on the NIST Core test set (independent of the development set). We varied the number of labeled examples by sampling a fraction f of the TIMIT training set. We show results for f ∈ {0.005, 0.05, 0.1, 0.25, 0.3}. In all cases, for MP and SQ-Loss, we use the same graph Gtimit, but the set of labeled vertices changes based on f. In all cases the MLP was trained fully supervised. We only show results on the test set, but the results on the development set showed similar trends. It can be seen that (i) using an entropy regularizer leads to much improved results in MP, (ii) as expected, the MLP, being fully supervised, performs poorly compared to the semi-supervised approaches, and, most importantly, (iii) MP significantly outperforms all other approaches. We believe that MP outperforms SQ-Loss because the loss function in the case of MP is better suited for classification.
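The Nadaraya-Watson induction step used above (a similarity-weighted average of the distributions at the k nearest training points) can be sketched as follows; the toy similarity below is illustrative, not the Mahalanobis-scaled kernel used in our experiments:

```python
import math

def nadaraya_watson(x, train_points, train_dists, sim, k=50):
    """Induce a label distribution at an unseen point x as the
    similarity-weighted average of the distributions at its k nearest
    training points (a sketch of the estimator in [2], section 6.4)."""
    nbrs = sorted(range(len(train_points)),
                  key=lambda i: -sim(x, train_points[i]))[:k]
    weights = [sim(x, train_points[i]) for i in nbrs]
    z = sum(weights)
    num_classes = len(train_dists[0])
    return [sum(w * train_dists[i][y] for w, i in zip(weights, nbrs)) / z
            for y in range(num_classes)]

# toy 1-d example with two training points and two classes
sim = lambda x, y: math.exp(-(x - y) ** 2)
dist = nadaraya_watson(0.25, [0.0, 1.0], [[1.0, 0.0], [0.0, 1.0]], sim, k=2)
```

The output is itself a distribution over Y, so the induced hard label is again argmax_y.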
We also found that for larger values of f (e.g., at f = 1), the performances of the MLP and MP did not differ significantly; but those settings are more representative of the supervised training scenario, which is not the focus here.
Switchboard-I Phone Classification: Switchboard-I (SWB) is a collection of about 2,400 two-sided telephone conversations among 543 speakers [28]. SWB is often used for the training of large-vocabulary speech recognizers. The corpus is annotated at the word level. In addition, less reliable phone-level annotations, generated automatically by a speech recognizer with a non-zero error rate, are also available [29]. The Switchboard Transcription Project (STP) [30] was undertaken to accurately annotate SWB at the phonetic and syllable levels. As a result of the arduous and costly nature of this transcription task, only 75 minutes (out of 320 hours) of speech segments selected from different SWB conversations were annotated at the phone level, and about 150 minutes were annotated at the syllable level. Having such annotations for all of SWB could be useful for speech processing in general, so this is an ideal real-world task for SSL.
We make use of only the phonetic labels, ignoring the syllable annotations. Our goal is to phonetically annotate SWB in STP style while treating STP as labeled data, and in the process to show that our aforementioned parallelism efforts scale to extremely large datasets. We extracted features x_i from the conversations by first windowing them using a Hamming window of size 25ms at 100Hz. We then extracted 13 perceptual linear prediction (PLP) coefficients from these windowed signals and appended both deltas and double-deltas, resulting in a 39-dimensional feature vector.
As with TIMIT, we are interested in phone classification, and we use a 7-frame context window to generate xi, stepping successive context windows by 10ms as is standard in speech recognition.
We randomly split the 75-minute phonetically annotated part of STP into three sets, one each for training, development and testing, containing 70%, 10% and 20% of the data respectively (the development set is considerably smaller than the training set). This procedure was repeated 10 times (i.e., we generated 10 different training, development and test sets by random sampling). In each case, we trained a phone classifier using the training set, tuned the hyper-parameters on the development set and evaluated the performance on the test set. In the following, we refer to the part of SWB that is not in STP as SWB-STP. We added the unlabeled SWB-STP data in stages: the percentage s included was 0%, 2%, 5%, 10%, 25%, 40%, 60%, and 100% of SWB-STP. We ran both MP and SQ-Loss in each case. When s = 100%, there were about 120 million nodes in the graph!
Due to the large size (m = 120M) of the dataset, it was not possible to generate the graph using a conventional brute-force search, which is O(m^2). Nearest neighbor search is a well-researched problem with many approximate solutions [31]. Here we make use of the Approximate Nearest Neighbor (ANN) library (see http://www.cs.umd.edu/~mount/ANN/) [32]. It constructs a modified version of a kd-tree which is then used to query the NNs. The query process requires that one specify an error term ε, and guarantees that d(xi, Nε(xi)) ≤ (1 + ε) d(xi, N(xi)), where N(xi) is a function that returns the actual NN of xi while Nε(xi) returns the approximate NN.
We constructed graphs using the STP data and s% of the (unlabeled) SWB-STP data. For all the experiments here we used a symmetrized 10-NN graph and ε = 2.0.
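The graph construction above can be sketched with SciPy's `cKDTree` standing in for the ANN library (its `eps` parameter gives the same kind of (1 + ε) approximation guarantee). Symmetrization by union, i.e., connecting i and j whenever either lists the other among its k nearest neighbors, is our assumption; the paper does not specify the symmetrization rule.

```python
import numpy as np
from scipy.spatial import cKDTree

def symmetrized_knn_graph(X, k=10, eps=2.0):
    """Build a symmetrized k-NN graph via an approximate kd-tree
    query.  `eps` plays the role of the error term: each returned
    neighbor distance is within a (1 + eps) factor of the true one.
    Returns a set of undirected edges (i, j) with i < j."""
    tree = cKDTree(X)
    # Query k + 1 neighbors because each point's nearest neighbor
    # is (at distance zero) the point itself.
    _, idx = tree.query(X, k=k + 1, eps=eps)
    edges = set()
    for i, nbrs in enumerate(idx):
        for j in nbrs:
            if int(j) != i:                       # drop self-loops
                edges.add((min(i, int(j)), max(i, int(j))))
    return edges

X = np.random.randn(200, 39)   # e.g. 200 points of 39-dim features
E = symmetrized_knn_graph(X, k=10, eps=2.0)
```

The paper's 120M-node graphs of course require the disk-backed, parallel construction described earlier; this sketch only illustrates the neighborhood and symmetrization logic on a toy scale.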
The labeled and unlabeled points in the graph changed based on the training, development and test sets used. In each case, we ran both the MP and SQ-Loss objectives. For each set, we ran a search over µ ∈ {1e-8, 1e-4, 0.01, 0.1} and ν ∈ {1e-8, 1e-6, 1e-4, 0.01, 0.1} for both approaches. The best values of the hyper-parameters were chosen based on performance on the development set, and the same values were used to measure accuracy on the test set. The mean phone accuracies over the different test sets (and the standard deviations) are shown in figure 2(b) for the different values of s. It can be seen that MP outperforms SQ-Loss in all cases. Equally importantly, we see that the performance on the STP data improves with the addition of increasing amounts of unlabeled data.

References
[1] O. Chapelle, B. Scholkopf, and A. Zien, Semi-Supervised Learning. MIT Press, 2007.
[2] X. Zhu, "Semi-supervised learning literature survey," tech. rep., Computer Sciences, University of Wisconsin-Madison, 2005.
[3] M. Szummer and T. Jaakkola, "Partially labeled classification with Markov random walks," in Advances in Neural Information Processing Systems, vol. 14, 2001.
[4] X. Zhu and Z. Ghahramani, "Learning from labeled and unlabeled data with label propagation," tech. rep., Carnegie Mellon University, 2002.
[5] X. Zhu, Z. Ghahramani, and J. Lafferty, "Semi-supervised learning using Gaussian fields and harmonic functions," in Proc. of the International Conference on Machine Learning (ICML), 2003.
[6] T. Joachims, "Transductive learning via spectral graph partitioning," in Proc. of the International Conference on Machine Learning (ICML), 2003.
[7] A. Corduneanu and T. Jaakkola, "On information regularization," in Uncertainty in Artificial Intelligence, 2003.
[8] K. Tsuda, "Propagating distributions on a hypergraph by dual information regularization," in Proceedings of the 22nd International Conference on Machine Learning, 2005.
[9] M. Belkin, P. Niyogi, and V. Sindhwani, "On manifold regularization," in Proc. of the Conference on Artificial Intelligence and Statistics (AISTATS), 2005.
[10] C. Bishop, ed., Neural Networks for Pattern Recognition. Oxford University Press, 1995.
[11] A. Subramanya and J. Bilmes, "Soft-supervised text classification," in EMNLP, 2008.
[12] R. Collobert, F. Sinz, J. Weston, L. Bottou, and T. Joachims, "Large scale transductive SVMs," Journal of Machine Learning Research, 2006.
[13] V. Sindhwani and S. S. Keerthi, "Large scale semi-supervised linear SVMs," in SIGIR '06: Proceedings of the 29th Annual International ACM SIGIR, 2006.
[14] O. Delalleau, Y. Bengio, and N. Le Roux, "Efficient non-parametric function induction in semi-supervised learning," in Proc. of the Conference on Artificial Intelligence and Statistics (AISTATS), 2005.
[15] M. Karlen, J. Weston, A. Erkan, and R. Collobert, "Large scale manifold transduction," in International Conference on Machine Learning (ICML), 2008.
[16] I. W. Tsang and J. T. Kwok, "Large-scale sparsified manifold regularization," in Advances in Neural Information Processing Systems (NIPS) 19, 2006.
[17] A. Tomkins, "Keynote speech," CIKM Workshop on Search and Social Media, 2008.
[18] A. Jansen and P. Niyogi, "Semi-supervised learning of speech sounds," in Interspeech, 2007.
[19] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: Wiley, 1991.
[20] Y. Bengio, O. Delalleau, and N. Le Roux, Semi-Supervised Learning, ch. Label Propagation and Quadratic Criterion. MIT Press, 2007.
[21] A. Dempster, N. Laird, and D. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B, vol. 39, no. 1, pp. 1-38, 1977.
[22] T. Abatzoglou and B. O'Donnell, "Minimization by coordinate descent," Journal of Optimization Theory and Applications, 1982.
[23] W. Zangwill, Nonlinear Programming: A Unified Approach. Englewood Cliffs, N.J.: Prentice-Hall, 1969.
[24] C. F. J. Wu, "On the convergence properties of the EM algorithm," The Annals of Statistics, vol. 11, no. 1, pp. 95-103, 1983.
[25] I. Csiszar and G. Tusnady, "Information geometry and alternating minimization procedures," Statistics and Decisions, 1984.
[26] A. K. Halberstadt and J. R. Glass, "Heterogeneous acoustic measurements for phonetic classification," in Proc. Eurospeech '97, Rhodes, Greece, pp. 401-404, 1997.
[27] K. F. Lee and H. Hon, "Speaker-independent phone recognition using hidden Markov models," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 37, no. 11, 1989.
[28] J. Godfrey, E. Holliman, and J. McDaniel, "Switchboard: Telephone speech corpus for research and development," in Proceedings of ICASSP, vol. 1, San Francisco, California, pp. 517-520, March 1992.
[29] N. Deshmukh, A. Ganapathiraju, A. Gleeson, J. Hamaker, and J. Picone, "Resegmentation of Switchboard," in Proceedings of ICSLP, Sydney, Australia, pp. 1543-1546, November 1998.
[30] S. Greenberg, "The Switchboard transcription project," tech. rep., The Johns Hopkins University (CLSP) Summer Research Workshop, 1995.
[31] J. Friedman, J. Bentley, and R. Finkel, "An algorithm for finding best matches in logarithmic expected time," ACM Transactions on Mathematical Software, vol. 3, 1977.
[32] S. Arya and D. M. Mount, "Approximate nearest neighbor queries in fixed dimensions," in ACM-SIAM Symp. on Discrete Algorithms (SODA), 1993.