{"title": "Zeta Hull Pursuits: Learning Nonconvex Data Hulls", "book": "Advances in Neural Information Processing Systems", "page_first": 46, "page_last": 54, "abstract": "Selecting a small informative subset from a given dataset, also called column sampling, has drawn much attention in machine learning. For incorporating structured data information into column sampling, research efforts were devoted to the cases where data points are fitted with clusters, simplices, or general convex hulls. This paper aims to study nonconvex hull learning which has rarely been investigated in the literature. In order to learn data-adaptive nonconvex hulls, a novel approach is proposed based on a graph-theoretic measure that leverages graph cycles to characterize the structural complexities of input data points. Employing this measure, we present a greedy algorithmic framework, dubbed Zeta Hulls, to perform structured column sampling. The process of pursuing a Zeta hull involves the computation of matrix inverse. To accelerate the matrix inversion computation and reduce its space complexity as well, we exploit a low-rank approximation to the graph adjacency matrix by using an efficient anchor graph technique. Extensive experimental results show that data representation learned by Zeta Hulls can achieve state-of-the-art accuracy in text and image classification tasks.", "full_text": "Zeta Hull Pursuits:\n\nLearning Nonconvex Data Hulls\n\nYuanjun Xiong\u2020 Wei Liu\u2021 Deli Zhao(cid:2) Xiaoou Tang\u2020\n\n\u2020Information Engineering Department, The Chinese University of Hong Kong, Hong Kong\n\n\u2021IBM T. J. Watson Research Center, Yorktown Heights, New York, USA\n\n{yjxiong,xtang}@ie.cuhk.edu.hk weiliu@us.ibm.com deli zhao@htc.com\n\n(cid:2)Advanced Algorithm Research Group, HTC, Beijing, China\n\nAbstract\n\nSelecting a small informative subset from a given dataset, also called column sam-\npling, has drawn much attention in machine learning. 
For incorporating structured data information into column sampling, research efforts were devoted to the cases where data points are fitted with clusters, simplices, or general convex hulls. This paper aims to study nonconvex hull learning, which has rarely been investigated in the literature. In order to learn data-adaptive nonconvex hulls, a novel approach is proposed based on a graph-theoretic measure that leverages graph cycles to characterize the structural complexities of input data points. Employing this measure, we present a greedy algorithmic framework, dubbed Zeta Hulls, to perform structured column sampling. The process of pursuing a Zeta hull involves the computation of a matrix inverse. To accelerate the matrix inversion and reduce its space complexity as well, we exploit a low-rank approximation to the graph adjacency matrix by using an efficient anchor graph technique. Extensive experimental results show that data representations learned by Zeta Hulls can achieve state-of-the-art accuracy in text and image classification tasks.

1 Introduction

In the era of big data, a natural idea is to select a small subset of m samples C_e = {x_{e_1}, ..., x_{e_m}} from a whole set of n data points X = {x_1, ..., x_n} such that the selected points C_e can capture the underlying properties or structures of X. Then machine learning and data mining algorithms can be carried out with C_e instead of X, thereby leading to significant reductions in computational and space complexities. Let us write the matrix forms of C_e and X as C = [x_{e_1}, ..., x_{e_m}] ∈ R^{d×m} and X = [x_1, ..., x_n] ∈ R^{d×n}, respectively. Here d is the dimension of the input data points. In other words, C is a column subset selection of X.
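As a minimal sketch of this notation (NumPy; the uniform random index set is only a placeholder for the structured selection schemes discussed below), column subset selection simply gathers m columns of X:

```python
import numpy as np

# Toy illustration of column subset selection: X holds n points as columns
# (d x n); choosing an index set {e_1, ..., e_m} yields the d x m matrix C.
rng = np.random.default_rng(0)
d, n, m = 5, 100, 10
X = rng.standard_normal((d, n))

idx = rng.choice(n, size=m, replace=False)  # placeholder selection rule
C = X[:, idx]                               # the sampled columns

assert C.shape == (d, m)
```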
The task of selecting C from X is also called column sampling in the literature, and maintains importance in a variety of fields besides machine learning, such as signal processing, geoscience and remote sensing, and applied mathematics. This paper concentrates on solving the column sampling problem by means of graph-theoretic methods.

Existing methods in column sampling fall into two main categories according to their objectives: 1) approximating the data matrix X, and 2) discovering the underlying data structures. For machine learning methods using kernel or similar "N-Body" techniques, the Nyström matrix approximation is usually applied to approximate large matrices. Such circumstances include fast training of nonlinear kernel support vector machines (SVMs) in the dual form [30], spectral clustering [8], manifold learning [25], etc. Minimizing a relative approximation error is typically harnessed as the objective of column sampling, under which the most intuitive solution is to perform uniform sampling [30]. Other non-uniform sampling schemes choose columns via various criteria, such as probabilistic sampling according to the diagonal elements of a kernel matrix [7], reconstruction errors [15], determinant measurements [1], cluster centroids [33], and statistical leverage scores [21]. On the other hand, column sampling may be cast into a combinatorial optimization problem, which can be tackled by using greedy strategies in polynomial time [4] and boosted by using advanced sampling strategies to further reduce the relative approximation error [14].

From another perspective, we are aware that data points may form some interesting structures.
Understanding these structures has been proven beneficial to approximating or representing data inputs [11]. One of the most famous algorithms for dimensionality reduction, Non-negative Matrix Factorization (NMF) [16], learns a low-dimensional convex hull from data points through a convex relaxation [3]. This idea was extended to signal separation by pursuing a convex hull with a maximized volume [27] to enclose input data points. Assuming that vertices are equally distant, the problem of fitting a simplex with a maximized volume to data reduces to a simple greedy column selection procedure [26]. The simplex fitting approach demonstrated its success in face recognition tasks [32]. Parallel research in geoscience and remote sensing is also active, where the vertices of a convex hull are coined as endmembers or extreme points, leading to a classic "N-Finder" algorithm [31]. The above approaches try to learn data structures that are usually characterized by convexity. Hence, they may fail to reveal the intrinsic data structures when the distributions of data points are diverse, e.g., data lying on manifolds or concave structures. Probabilistic models like the Determinantal Point Process (DPP) [13] measure data densities, so they are likely to overcome the convexity issue. However, few previous works have accessed structural information of possibly nonconvex data for column sampling/subset selection tasks.

This paper aims to address the issue of learning nonconvex structures of data in the case where the data distributions can be arbitrary. More specifically, we learn a nonconvex hull to encapsulate the data structure. The on-hull points tightly enclose the dataset but do not need to form a convex set. Thus, nonconvex hulls can be more adaptive in capturing practically complex data structures. Akin to convex hull learning, our proposed approach also extracts extreme points from an input dataset.
To complete this task, we start with exploring the property of graph cycles in a neighborhood graph built over the input data points. Using cycle-based measures to characterize data structures has been proven successful in clustering data of multiple types of distributions [34]. To induce a measure of structural complexity stemming from graph cycles, we introduce the Zeta function, which applies the integration of graph cycles to model the linkage properties of the neighborhood graph. The key advantage of the Zeta function is uniting both global and local connection properties of the graph. As such, we are able to learn a hull which encompasses almost all input data points but need not be convex. With structural complexities captured in the form of the Zeta function, we present a leave-one-out strategy to find the extreme points. The basic idea is that removing the on-hull points has only a weak impact on the structural complexity of the graph. The decision of removal is based on the extremeness of a data point. Our model, dubbed Zeta Hulls, is derived by computing and analyzing the extremeness of data points. The greedy pursuit of the Zeta Hull model requires computing the inverse of a matrix obtained from the graph affinity matrix, which is computationally prohibitive for massive-scale data. To accelerate this matrix manipulation, we employ the Anchor Graph [18] technique, in the sense that the original graph can be approximated with respect to anchors originating from a randomly sampled data subset.
Our model is validated through extensive experiments on toy data and real-world text and image datasets. Experimental results show that in terms of unsupervised data representation learning, the Zeta Hull based methods outperform state-of-the-art methods used in convex hull learning, clustering, matrix factorization, and dictionary learning.

2 Nonconvex Hull Learning

To elaborate on our approach, we first introduce and define the point extremeness. It measures the degree to which a data point tends to lie on or near a nonconvex hull, by virtue of a neighborhood graph drawn from an input dataset. As an intuitive criterion, a data point with strong connections in the graph should have low point extremeness. To obtain the extremeness measure, we need to explore the underlying structure of the graph, where graph cycles are employed.

2.1 Zeta Function and Structural Complexity

We model graph cycles by means of a sum-product rule and then integrate them using a Zeta function. There are many variants of the original Riemann zeta function, one of which is specialized in weighted adjacency graphs. Applying the theoretical results of Zeta functions provides us a powerful tool for characterizing structural complexities of graphs.

Figure 1: An illustration of pursuing on-hull points using the graph measure. (a) Original Graph: a point set with a k-nearest neighbor graph; points in red are ones lying on the hull of the point set, e.g., the points we tend to select by the Zeta Hull Pursuit algorithm. (b) Remaining Graph: the remaining point set and graph after removing the on-hull points together with their corresponding edges. We observe that the removal of the on-hull (i.e., "extreme") points yields little impact on the structural complexity of the graph.
The numerical description of graph structures will play a critical role in column sampling/subset selection tasks.

Formally, given a graph G(X, E) with n nodes being the data points in X = {x_i}_{i=1}^n, let the n × n matrix W denote the weighted adjacency (or affinity) matrix of the graph G built over the dataset X. Usually the graph affinities are calculated with a proper distance metric. To be generic, we assume that G is directed. Then an edge leaving from x_i to x_j is denoted as e_ij. A path of length ℓ from x_i to x_j is defined as P(i, j, ℓ) = {e_{h_k t_k}}_{k=1}^ℓ with h_1 = i and t_ℓ = j. Note that the nodes in this path can be duplicate. A graph cycle, as a special case of paths of length ℓ, is defined as γ_ℓ = P(i, i, ℓ) (i = 1, ..., n). The sum-product path affinity ν_ℓ for all ℓ-length cycles can then be computed by

ν_ℓ = Σ_{γ_ℓ ∈ κ_ℓ} ν_{γ_ℓ} = Σ_{γ_ℓ ∈ κ_ℓ} ( Π_{k=1}^{ℓ−1} w_{h_k t_k} ) w_{t_{ℓ−1} h_1},

where κ_ℓ denotes the set of all possible cycles of length ℓ and w_{h_k t_k} denotes the (h_k, t_k)-entry of W, i.e., the affinity from node x_{h_k} to node x_{t_k}. The edge e_{t_{ℓ−1} h_1} is the last edge that closes the cycle. The computed compound affinity ν_ℓ provides a measure for all cyclic connections of length ℓ. Then we integrate such affinities for cycles of lengths from one to infinity to derive the graph Zeta function as follows,

ζ_z(G) = exp( Σ_{ℓ=1}^∞ (ν_ℓ / ℓ) z^ℓ ),   (1)

where z is a constant. We only consider the situation where z is real-valued. The Zeta function in Eq. (1) has been proven to enjoy a closed form. Its convergence is also guaranteed when z < 1/ρ(W), where ρ(W) is the spectral radius of W.
These lead to Theorem 1 [23].

Theorem 1. Let I be the identity matrix and ρ(W) be the spectral radius of the matrix W, respectively. If 0 < z < 1/ρ(W), then ζ_z(G) = 1/det(I − zW).

Note that W can be asymmetric, implying that its eigenvalues λ_i can be complex. In this case, Theorem 1 still holds. Theorem 1 indicates that the graph Zeta function we formulate in Eq. (1) provides a closed-form expression for describing the structural complexity of a graph. The next subsection will give the definition of the point extremeness by analyzing the structural complexity.

2.2 Zeta Hull Pursuits

From now on, for simplicity we use ζ_G = ζ_z(G) to represent the structural complexity of the original graph G. To measure the point extremeness numerically, we perform a leave-one-out strategy in the sense that each point is successively left out and the variation of ζ_G is investigated. This is a natural way to pursue extreme points, because if a point x_j lies on the hull it has few communications with the other points. After removing this point and its corresponding edges, the reduced structural complexity of the remaining graph G/x_j, which we denote as ζ_{G/x_j}, will still be close to ζ_G. Hence, the point extremeness ε_{x_j} is modeled as the relative change of the structural complexity, that is ε_{x_j} = ζ_G / ζ_{G/x_j}. Now we have the following theorem.

Theorem 2. Given ζ_G and ζ_{G/x_j} as in Theorem 1, the point extremeness measure ε_{x_j} of point x_j satisfies ε_{x_j} = (I − zW)^{−1}_{(jj)}, i.e., the point extremeness measure of point x_j is equal to the j-th diagonal entry of the matrix (I − zW)^{−1}.

Algorithm 1 Zeta Hull Pursuits
Input: A dataset X, the number m of data points to be selected, and free parameters z, λ and k.
Output: The hull of sampled columns C_e := C_{m+1}.
Initialize: construct W; C_1 ← ∅, X_1 = X, c_1 = 0, and W_1 = W.
for i = 1 to m do
  ε_{x_j} := (I − zW_i)^{−1}_{(jj)}, for x_j ∈ X_i
  x_{e_i} := argmin_{x_j ∈ X_i} ( ε_{x_j} + (λ/i) e_j^⊤ W c_i )
  C_{i+1} := C_i ∪ {x_{e_i}}
  c_{i+1} := c_i + e_{e_i}
  X_{i+1} := X_i / x_{e_i}
  W_{i+1} := W_i with the e_i-th row and column removed
end for

According to the previous analysis, a data point with a small ε_{x_j} tends to be on the hull and therefore has a strong extremeness. To seek the on-hull points, we need to select a subset of m points C_e = {x_{e_1}, ..., x_{e_m}} from X such that they have the strongest point extremeness. We formulate this goal into the following optimization problem:

C_e = arg min_{C ⊂ X} g(C) + λ c^⊤ W c,   (2)

where c is a selection vector with m nonzero elements c_{e_i} = 1 (i = 1, ..., m), and g(C) is the function which measures the impact on the structural complexity after removing the extracted points. In our case, g(C) = Σ_{i=1}^m ε_{x_{c_i}}. The second term in Eq. (2) is a regularization term enforcing that the selected data points do not intersect with each other. It enables the selection process to have a better representative capability. The parameter λ controls the extent of the regularization.

Naively solving the combinatorial optimization problem in Eq. (2) requires exponential time. By adopting a greedy strategy, we can solve this optimization problem in an iterative manner and with a feasible time complexity. Specifically, in each iteration we extract one point from the current data set and add it to the subset of the selected points. Sticking to this greedy strategy, we will attain the desired m on-hull points after m iterations.
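The closed form of Theorem 1 and the greedy leave-one-out selection can be checked numerically. The sketch below (NumPy; the Gaussian k-NN graph, λ = 1, and z = 0.5/ρ(W) are illustrative assumptions rather than the paper's tuned settings) reads the sum-product affinity over all ℓ-cycles as ν_ℓ = tr(W^ℓ), verifies that the truncated series in Eq. (1) matches 1/det(I − zW), and then runs the greedy pursuit with the extremeness of Theorem 2:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 60
pts = rng.standard_normal((n, 2))

# Gaussian k-NN affinity matrix (symmetrized) -- an illustrative graph
# construction; the paper leaves the distance metric generic.
k, sigma = 8, 1.0
d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
W = np.zeros((n, n))
for i in range(n):
    nbrs = np.argsort(d2[i])[1:k + 1]
    W[i, nbrs] = np.exp(-d2[i, nbrs] / sigma ** 2)
W = np.maximum(W, W.T)

rho = np.max(np.abs(np.linalg.eigvalsh(W)))
z = 0.5 / rho  # ensures z < 1/rho(W), so the cycle series converges

# Theorem 1: zeta_z(G) = 1/det(I - zW).  Compare log of the truncated
# series sum_l nu_l z^l / l, with nu_l = trace(W^l), to the closed form.
log_zeta_series = sum(np.trace(np.linalg.matrix_power(W, l)) * z ** l / l
                      for l in range(1, 60))
log_zeta_closed = -np.linalg.slogdet(np.eye(n) - z * W)[1]
assert abs(log_zeta_series - log_zeta_closed) < 1e-6

# Theorem 2 + Algorithm 1 sketch: repeatedly pick the point with the
# smallest extremeness eps_j = [(I - zW_i)^(-1)]_{jj}, regularized by
# affinity to the already-selected set.
def zeta_hull_pursuit(W, m, z, lam=1.0):
    idx = list(range(W.shape[0]))       # indices still in the graph
    Wi = W.copy()
    c = np.zeros(W.shape[0])            # selection vector
    selected = []
    for i in range(1, m + 1):
        eps = np.diag(np.linalg.inv(np.eye(len(idx)) - z * Wi))
        score = eps + (lam / i) * (W[idx, :] @ c)   # eps + (lam/i) e_j^T W c
        j = int(np.argmin(score))
        selected.append(idx[j])
        c[idx[j]] = 1.0
        idx.pop(j)
        Wi = np.delete(np.delete(Wi, j, axis=0), j, axis=1)
    return selected

hull = zeta_hull_pursuit(W, m=10, z=z)
assert len(set(hull)) == 10
```

On a point cloud like this, the selected indices tend to lie on the boundary of the cloud, mirroring the leave-one-out intuition that removing on-hull points barely changes ζ_G.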
In the i-th iteration, we extract the point x_{e_i} according to the criterion

x_{e_i} = arg min_{x_j ∈ X_{i−1}} ε_{x_j} + (λ/i) e_j^⊤ W c_{i−1},   (3)

where e_j is the j-th standard basis vector, and c_{i−1} is the selection vector according to the i−1 selected points before the i-th iteration.

We name our algorithm Zeta Hull Pursuits in order to emphasize that we use the Zeta function to pursue the nonconvex data hull. Algorithm 1 summarizes the Zeta Hull Pursuits algorithm.

3 Zeta Hull Pursuits via Anchors

Algorithm 1 is applicable to small- to medium-scale data X due to its cubic time complexity and quadratic space complexity with respect to the data size |X|. Here we propose a scalable algorithm facilitated by a reasonable prior to tackle the nonconvex hull learning problem efficiently. The idea is to build a low-rank approximation to the graph adjacency matrix W with a small number of sampled data points, namely anchor points. We resort to the Anchor Graph technique [18], which has been successfully applied to handle large-scale hashing [20] and semi-supervised learning problems.

3.1 Anchor Graphs

The anchor graph framework is an elegant way to approximate neighborhood graphs. It first chooses a subset of l anchor points U = {u_j}_{j=1}^l from X. Then for each data point in X, its s nearest anchors in U are sought, thereby forming an s-nearest anchor graph. The anchor graph theory assumes that the original graph affinity matrix W can be reconstructed from the anchor graph with a small number of anchors (l ≪ n). Anchor points can be selected by random sampling or a rough clustering process. Many algorithms are available to embed a data point to its s nearest anchor points, as suggested in [18].
Here we adopt the simplest approach to build the anchor embedding matrix Ĥ, namely

ĥ_ij = exp(−d_ij²/σ²) if u_j ∈ {s nearest anchors of x_i}, and ĥ_ij = 0 otherwise,

where d_ij is the distance from data point x_i to anchor u_j, and σ is a parameter controlling the bandwidth of the exponential function. The matrix Ĥ is then normalized so that its every row sums to one. In doing so, we can approximate the affinity matrix of the original graph as Ŵ = ĤΛ^{−1}Ĥ^⊤, where Λ is a diagonal matrix whose i-th diagonal element is equal to the sum of the i-th column of Ĥ. As a result, all matrix manipulations upon the original graph affinity matrix W can be approximated by substituting the anchor graph affinity matrix Ŵ for W.

Algorithm 2 Anchor-based Zeta Hull Pursuits
Input: A dataset X, the number m of data points to be sampled, the number l of anchors, the number s of nearest anchors, and a free parameter z.
Output: The hull of sampled columns C_e := C_{m+1}.
Initialize: construct H; X_1 = X, C_1 = ∅, and H_1 = H.
for i = 1 to m do
  perform SVD to obtain H_i := UΣV^⊤
  ε_{x_j} := z Σ_{k=1}^l (λ_k² / (1 − zλ_k²)) (U_jk)², for x_j ∈ X_i
  x_{e_i} := argmin_{x_j ∈ X_i} ( ε_{x_j} + (λ/i) Σ_{x_t ∈ C_i} h_j h_t^⊤ )
  C_{i+1} := C_i ∪ {x_{e_i}}
  X_{i+1} := X_i / x_{e_i}
  H_{i+1} := H_i with the e_i-th row removed
end for

3.2 Extremeness Computation via Anchors

Note that the computation of the point extremeness ε_{x_j} depends on the diagonal elements of (I − zW)^{−1}. Using the anchor graph technique, we can write (I − zW)^{−1} = (I − zHH^⊤)^{−1}, where H = ĤΛ^{−1/2}. Thus we have the following theorem that enables an efficient computation of ε_{x_j}. The proof is detailed in the supplementary material.

Theorem 3.
Let the singular value decomposition of H be H = UΣV^⊤, where Σ = diag(λ_1, ..., λ_l). If H^⊤H is not singular, then

ε_{x_j} = 1 + z Σ_{k=1}^l (λ_k² / (1 − zλ_k²)) (U_jk)²,

where U = HVΣ^{−1} and U_jk denotes the (j, k)-th entry of U.

Theorem 3 reveals that the major computation of ε_{x_j} reduces to the eigendecomposition of a much smaller matrix H^⊤H, which results in a direct acceleration of the Zeta hull pursuit process. At the same time, the second term of Eq. (3) encountered in the i-th iteration can be estimated by e_j^⊤Wc_{i−1} = (1/i) Σ_{x_t ∈ C_{i−1}} h_j h_t^⊤, where h_j denotes the j-th row of H and c_{i−1} is the selection vector of the extracted point set before the i-th iteration. These lead to the Anchor-based Zeta Hull Pursuits algorithm shown in Algorithm 2.

3.3 Downdating SVD

In Algorithm 2, the singular value decomposition dominates the total time cost. We notice that reusing information from previous iterations can save computation time. The removal of one row from H is equivalent to a rank-one modification of the original matrix. Downdating SVD [10] was proposed to handle this operation. Given the diagonal singular value matrix Σ_i and the point x_{e_i} chosen in the i-th iteration, the singular value matrix Σ_{i+1} for the next iteration can be calculated by the eigendecomposition of an l × l matrix D derived from Σ_i, where D = (I − (1/(1+μ)) h_{e_i} h_{e_i}^⊤) Σ_i and μ² + ‖h_{e_i}‖² = 1. The decomposition of D can be efficiently performed in O(l²) time [10]. Then the computation of U_{i+1} is achieved by multiplying U_i with an l × l matrix produced by the decomposition of D, which permits a natural parallelism.
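A small numerical check of Sections 3.1 and 3.2 (NumPy; the random anchors, s = 3, and σ = 1 are illustrative choices, not the paper's settings): build the anchor embedding Ĥ, form H = ĤΛ^{−1/2}, and confirm that the SVD-based expression 1 + z Σ_k (λ_k²/(1 − zλ_k²)) U_jk², the quantity underlying Theorem 3, matches the diagonal of (I − zHH^⊤)^{−1} computed directly:

```python
import numpy as np

rng = np.random.default_rng(2)
n, l, s, sigma, z = 200, 20, 3, 1.0, 0.05

X = rng.standard_normal((n, 2))
anchors = X[rng.choice(n, l, replace=False)]  # random anchors, as in Sec. 3.1

# Anchor embedding H_hat: exponential weights to the s nearest anchors,
# rows normalized to sum to one.
d2 = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)
H_hat = np.zeros((n, l))
for i in range(n):
    nbrs = np.argsort(d2[i])[:s]
    H_hat[i, nbrs] = np.exp(-d2[i, nbrs] / sigma ** 2)
H_hat /= H_hat.sum(axis=1, keepdims=True)

col_sums = H_hat.sum(axis=0)                  # diagonal of Lambda
W_approx = (H_hat / col_sums) @ H_hat.T       # W_hat = H_hat Lambda^-1 H_hat^T
H = H_hat / np.sqrt(col_sums)                 # H = H_hat Lambda^(-1/2)

# SVD-based extremeness vs. the diagonal of (I - z H H^T)^(-1).
U, svals, Vt = np.linalg.svd(H, full_matrices=False)
lam2 = svals ** 2
eps_svd = 1.0 + z * ((U ** 2) @ (lam2 / (1.0 - z * lam2)))
eps_direct = np.diag(np.linalg.inv(np.eye(n) - z * (H @ H.T)))
assert np.allclose(eps_svd, eps_direct, atol=1e-8)
```

Since Ĥ is row-stochastic, Ŵ = HH^⊤ has spectral radius 1, so any z < 1 keeps the denominators 1 − zλ_k² positive; the paper's fixed z = 0.05 satisfies this comfortably.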
Consequently, we can further accelerate Algorithm 2 by using a parallel computing scheme.

3.4 Complexity Analysis

We now analyze the complexities of Algorithms 1 and 2. For Algorithm 1, the most time-consuming step is to compute the inverse of an n × n matrix, which costs O(n³) time. The overall time complexity is thus O(mn³) for extracting m points. In the implementation we can use sparse matrix computation to reduce the constant factor [5]. For Algorithm 2, the most time-consuming step is to perform the SVD of H, so the overall time complexity is O(mnl²). Leveraging downdating SVD, we only need to calculate the full SVD of H once in O(nl²) time and iteratively update the decomposition in O(l²) time per iteration. The matrix multiplication operation then dominates the total time cost. Also, it can be parallelized using a multi-core CPU or a modern GPU, resulting in a very small constant factor in the time complexity. Since l is usually less than 10% of n, Algorithm 2 is orders of magnitude faster than Algorithm 1.

Figure 2: Zeta hull pursuits on the two-moon toy dataset. We select m data points from the dataset with various methods. In the sub-figures, blue dots are data points and the selected samples are surrounded with red circles. The caption of each sub-figure describes the number of selected points m and the method used. [Panels: (a)-(d) ZHP with m = 20, 40, 80, 200; (e)-(h) A-ZHP with m = 20, 40, 80, 200; (i)-(l) Leverage Score, Simplex, CUR, and K-medoids with m = 40.] The first two rows show the results of our algorithms with different m. The third row illustrates the comparisons with other methods when m = 40. For the leverage score approach, we follow the steps in [21].
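The saving exploited by downdating can be illustrated without the full machinery of [10]: deleting row h_e is a rank-one change to the l × l Gram matrix H^⊤H, so refreshed singular values are available from an l × l eigenproblem instead of a new SVD of the tall (n−1) × l matrix. A hedged sketch (NumPy; a random H stands in for the anchor embedding):

```python
import numpy as np

rng = np.random.default_rng(3)
n, l, e = 500, 12, 7          # e: index of the row removed this iteration
H = rng.standard_normal((n, l)) / np.sqrt(n)

# Removing row h_e is a rank-one downdate of the l x l Gram matrix:
#   H_next^T H_next = H^T H - h_e h_e^T,
# so the new singular values come from an l x l eigenproblem.
h = H[e]
G_down = H.T @ H - np.outer(h, h)
H_next = np.delete(H, e, axis=0)

sv_fast = np.sqrt(np.maximum(np.linalg.eigvalsh(G_down), 0.0))[::-1]
sv_full = np.linalg.svd(H_next, compute_uv=False)
assert np.allclose(sv_fast, sv_full, atol=1e-8)
```

The eigvalsh route costs O(l³) on an l × l matrix per iteration (and the specialized routine of [10] reduces this further to O(l²)), versus O(nl²) for recomputing the SVD from scratch.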
For cases where l needs to be relatively large (20% of n, for example), the computational cost will not increase considerably, since H is usually a very sparse matrix.

4 Experiments

The Zeta Hull model aims at learning the structure of a dataset. We evaluate how well our model achieves this goal by performing classification experiments. For simplicity, we abbreviate our algorithms as follows: the original Zeta Hull Pursuit algorithm (Algorithm 1) as ZHP, and its anchor-based version (Algorithm 2) as A-ZHP. To compare with the state of the art, we choose several renowned methods: K-medoids, CUR matrix factorization (CUR) [29], simplex volume maximization (Simplex) [26], sparse dictionary learning (DictLearn) [22], and convex non-negative matrix factorization (C-NMF) [6]. Basically, we use the extracted data points to learn a representation for each data point in an unsupervised manner. Classification is done by feeding the representation into a classifier. The representation is built in two ways: 1) sparse coding [22] and 2) locality simplex coding [26]. To differentiate our algorithms from the original anchor graph framework, we conduct a set of experiments using the left singular vectors of the anchor embedding matrix H as the representation. In these experiments, anchors used in the anchor graph technique are randomly selected from the training set. To compare with existing low-dimensional embedding approaches, we run the Large-Scale Manifold method [24] using the same number of landmarks as that of extracted points.

4.1 Toy Dataset

First we illustrate our algorithms on a toy dataset. The dataset, commonly known as "the two moons", consists of 2000 data points on the 2D plane which are manifold-structured and comprise nonconvex distributions. This experiment on the two moons provides illustrative results of our algorithms in the presence of nonconvexity.
We select different numbers of columns, m = {20, 40, 80, 200}, and compare with various other methods. A visualization of the results is shown in Figure 2. We can see that our algorithms extract the nonconvex hull of the data cloud more accurately.

4.2 Text and Image Datasets

For the classification experiments in this section, we derive the two types of data representations (the sparse coding and the local simplex coding) from the points/columns extracted by compared methods. By measuring the performance of applying these representations to the classification tasks, we can evaluate the representative power of the compared point/column selection methods.

Table 1: Classification error rates in percentage (%) on text (TDT2 and Newsgroups) and hand-written digit (MNIST) datasets. The numbers in bold font highlight the best results under the settings. "SC" refers to results using sparse coding to form the representation, while "LSC" refers to results using local simplex coding. Cells with "-" indicate that the ZHP method is too expensive to be performed under the associated settings. "Anchor Graph" refers to the additional experiments using the original anchor graph framework [18], which uses a single representation (neither SC nor LSC).

Methods           | TDT2 m=500   | TDT2 m=1000  | Newsgroups m=500 | Newsgroups m=1000 | MNIST m=500   | MNIST m=2000
                  | SC / LSC     | SC / LSC     | SC / LSC         | SC / LSC          | SC / LSC      | SC / LSC
ZHP               | 2.31 / 1.97  | 1.53 / 0.48  | - / -            | - / -             | - / -         | - / -
A-ZHP             | 2.52 / 2.68  | 2.08 / 0.96  | 11.79 / 10.77    | 7.1 / 6.58        | 3.45 / 3.07   | 1.43 / 1.19
Simplex [26]      | 3.79 / 1.73  | 1.77 / 1.51  | 13.55 / 10.41    | 8.16 / 8.04       | 5.79 / 5.79   | 2.27 / 1.51
DictLearn [22]    | 3.73 / 5.62  | 2.57 / 1.18  | 9.51 / 10.76     | 6.72 / 9.63       | 3.16 / 3.16   | 1.36 / 2.11
C-NMF [6]         | 4.83 / 3.46  | 2.07 / 2.31  | 11.68 / 11.83    | 7.72 / 7.42       | 5.07 / 5.27   | 3.01 / 3.04
CUR [29]          | 6.82 / 3.73  | 2.37 / 1.52  | 15.32 / 11.44    | 12.38 / 9.47      | 10.13 / 10.13 | 3.79 / 5.27
K-medoids [12]    | 9.14 / 7.87  | 4.69 / 3.73  | 19.73 / 12.02    | 19.67 / 10.04     | 9.28 / 9.28   | 2.72 / 2.31
Anchor Graph [18] | 5.81         | 2.68         | 12.32            | 8.76              | 3.17          | 2.33

Table 2: Recognition error rates in percentage (%) on object and face datasets. We select L samples for each class in the training set for training or forming the gallery. The numbers in bold font highlight the best results under the settings. "SC" refers to results using sparse coding to form the representation, while "LSC" refers to results using local simplex coding. "Raw Feature" refers to the experiments conducted on the raw feature vectors. The face recognition process is described in Sec. 4.2.

Methods             | Caltech101 d=21504, L=30 | Caltech101 d=5120, L=30 | MultiPIE d=2000, L=30
                    | m=500 SC/LSC | m=1000 SC/LSC | m=500 SC/LSC | m=1000 SC/LSC | m=500 SC/LSC | m=2000 SC/LSC
A-ZHP               | 25.77 / 26.82 | 25.81 / 23.13 | 29.61 / 28.95 | 25.62 / 26.59 | 20.8 / 14.2 | 19.6 / 11.3
Simplex [26]        | 29.83 / 26.16 | 26.83 / 25.18 | 32.43 / 29.66 | 30.62 / 27.47 | 19.9 / 15.8 | 13.7 / 17.7
DictLearn [22]      | 26.95 / 29.73 | 29.51 / 26.73 | 29.15 / 31.83 | 29.67 / 28.93 | 19.6 / 20.8 | 19.7 / 18.5
C-NMF [6]           | 30.66 / 27.83 | 28.72 / 27.62 | 32.57 / 31.13 | 31.15 / 28.73 | 20.4 / 17.5 | 19.9 / 14.8
CUR [21]            | 29.74 / 28.77 | 26.16 / 26.81 | 31.69 / 32.57 | 30.72 / 31.13 | 21.3 / 21.9 | 20.7 / 21.6
K-medoids [12]      | 27.82 / 27.64 | 25.73 / 26.09 | 29.85 / 29.63 | 28.28 / 28.97 | 29.7 / 19.8 | 17.7 / 25.4
Anchor Graph [18]   | 26.32         | 25.15         | 30.53         | 28.14         | 17.6        | 14.4
Large Manifold [24] | 28.71         | 27.92         | 32.67         | 30.19         | 31.4        | 30.1
Raw Feature [28]    | 26.7 (per dataset)          | 31.18                         | 27.6

The sparse coding is widely used for obtaining representations for classification. Here a standard ℓ1-regularized projection algorithm (LASSO) [22] is adopted to learn the sparse representation from the extracted data points. LASSO delivers a sparse coefficient vector, which is applied as the representation of the data point. We use "SC" to indicate the related results in Table 1 and Table 2. The local simplex coding reconstructs one data point as a convex combination of a set of nearest exemplar points, which form local simplexes [26]. Imposing this convex reconstruction constraint leads to non-negative combination coefficients. The sparse coefficient vector is used as the data representation.
"LSC" indicates the related results in Table 1 and Table 2.

The classification pipeline is as follows. After extracting m points/columns from the training set, all data points are represented with these selected points using the two approaches above. Then we feed the representations into a linear SVM for training and testing. A better classification accuracy reveals a stronger representative power of the column selection algorithm. In all experiments, the parameter z is fixed at 0.05 to guarantee the convergence of the Zeta function. We find that final results are robust to z once the convergence is guaranteed. For the A-ZHP algorithm, the parameter s is fixed at 10 and the number of anchor points l is set as 10% of the training set size. The bandwidth parameter σ of the exponential function is tuned on the training set to obtain a reasonable anchor embedding.

The classification of text contents relies on an informative representation of the plain words or sentences. Two text datasets are adopted for classification, i.e., the TDT2 dataset and the Newsgroups dataset [2]. In the experiments, a subset of TDT2 is used (TDT2-30). It has 9394 samples from 30 classes. Each feature vector is of 36771 dimensions and normalized into unit length. The training set contains 6000 samples randomly selected from the dataset, and the rest of the samples are used for testing. The parameter m is set to be 500 and 1000 on this dataset. The Newsgroups dataset contains 18846 samples from 20 classes. The training set contains 11314 samples, while the testing set has 7532. The two sets are separated in advance [2] and ordered in time sequence to be more challenging for classifiers. The parameter m is set to be 500 and 1000 on this dataset.
The classification results are reported in Table 1.
For object and face recognition, we conduct experiments under three classic scenarios: hand-written digit classification, image recognition, and human face recognition. Related experimental results are reported in Table 1 and Table 2.
The MNIST dataset serves as a standard benchmark for machine learning algorithms. It contains 10 classes of images corresponding to hand-written digits from 0 to 9. The training set has 60000 images and the testing set has 10000 images; each sample is a 784-dimensional vector.
The Caltech101 dataset [17] is a widely used benchmark for object recognition systems. It consists of images from 102 classes (101 object classes and one background class). We randomly select 30 labeled images from every class for training the classifier and 3000 images for testing. The recognition rates averaged over all classes are reported. Every image is processed into a feature vector of 21504 dimensions by the method in [28]. We also conduct experiments on a feature subset of the top 5000 dimensions (Caltech101-5k). In this experiment, m is set to 500 and 1000. On-hull points are extracted on the training set.
The MultiPIE human face dataset is a widely applied benchmark for face recognition [9]. We follow a standard gallery-probe protocol: the testing set is divided into a gallery set and a probe set, and the identity prediction for a probe image comes from its nearest neighbor in Euclidean distance within the gallery. We randomly select 30,000 images of 200 subjects as the training set for learning the data representation. Then we pick out 3000 images of the other 100 subjects (L = 30) to form the gallery set and 6000 images as the probes. The head poses of all these faces are within ±15 degrees. Each face image is processed into a vector of 5000 dimensions using the local binary pattern descriptor and PCA.
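The gallery-probe protocol amounts to nearest-neighbor matching in Euclidean distance; the following is a minimal sketch with hypothetical toy data (the subject IDs and feature dimensions are illustrative).

```python
import numpy as np

def identify(probe, gallery, labels):
    """Predict a probe's identity as the label of its Euclidean
    nearest neighbor among the gallery columns."""
    d = np.linalg.norm(gallery - probe[:, None], axis=0)
    return labels[int(np.argmin(d))]

rng = np.random.default_rng(1)
gallery = rng.standard_normal((16, 3))              # 3 gallery faces (columns)
labels = np.array([101, 102, 103])                  # hypothetical subject IDs
probe = gallery[:, 1] + 0.01 * rng.standard_normal(16)  # noisy view of subject 102
pred = identify(probe, gallery, labels)
```

In the experiments this matching would be performed on the learned data representations rather than raw pixels.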
We vary the parameter m from 500 to 2000 to evaluate the influence of the number of sampled points.
Discussion. On these high-dimensional datasets, the methods based on the Zeta Hull model outperform most compared methods and also show promising improvements over the raw data representation. As the number of extracted points grows, the resulting classification accuracy increases. This corroborates that the Zeta Hull model can effectively capture intrinsic structures of given datasets; more importantly, the discriminative information is preserved through learning these Zeta hulls. The representation yielded by the Zeta Hull model is sparse and of manageable dimensionality (500-2000), which substantially eases the workload of classifier training. This property is also favorable for tackling other large-scale learning problems. Owing to the graph-theoretic measure that unifies the local and global connection properties of a graph, the Zeta Hull model leads to better data representations than existing graph-based embedding and manifold learning methods. In the comparison with the Large-Scale Manifold method [24] on the MultiPIE dataset, we find that even with 10K landmarks its accuracy is still inferior to that of our methods relying on the Zeta Hull model. We also notice that noise may affect the quality of Zeta hulls; this difficulty can be circumvented by running well-established outlier removal methods such as [19].

5 Conclusion
In this paper, we proposed a geometric model, dubbed Zeta Hulls, for column sampling through learning nonconvex hulls of input data. The Zeta Hull model was built upon a novel graph-theoretic measure which quantifies point extremeness to unify the local and global connection properties of an individual data point in an adjacency graph.
By means of the Zeta function defined on the graph, the point extremeness measure amounts to the diagonal elements of a matrix related to the graph adjacency matrix. We also reduced the time and space complexities of computing a Zeta hull by incorporating an efficient anchor graph technique. A synthetic experiment first showed that the Zeta Hull model can detect appropriate hulls for non-convexly distributed data. Extensive real-world experiments conducted on benchmark text and image datasets further demonstrated the superiority of the Zeta Hull model over competing methods including convex hull learning, clustering, matrix factorization, and dictionary learning.
Acknowledgement This research is partially supported by project #MMT-8115038 of the Shun Hing Institute of Advanced Engineering, The Chinese University of Hong Kong.

References
[1] M.-A. Belabbas and P. J. Wolfe. Spectral methods in machine learning and new strategies for very large datasets. PNAS, 106(2):369–374, 2009.
[2] D. Cai, X. Wang, and X. He. Probabilistic dyadic data analysis with local and global consistency. In Proc. ICML, 2009.
[3] M. Chu and M. Lin. Low dimensional polytope approximation and its application to nonnegative matrix factorization. SIAM Journal of Computing, pages 1131–1155, 2008.
[4] A. Das and D. Kempe. Submodular meets spectral: greedy algorithms for subset selection, sparse approximation and dictionary selection. In Proc. ICML, 2011.
[5] T. Davis. SPARSEINV: a MATLAB toolbox for computing the sparse inverse subset using the Takahashi equations, 2011.
[6] C. Ding, T. Li, and M. Jordan. Convex and semi-nonnegative matrix factorizations. TPAMI, 32(1):45–55, 2010.
[7] P. Drineas and M. Mahoney. On the Nyström method for approximating a gram matrix for improved kernel-based learning. JMLR, 6:2153–2175, 2005.
[8] C. Fowlkes, S. Belongie, F. Chung, and J. Malik.
Spectral grouping using the Nyström method. TPAMI, 26:214–225, 2004.
[9] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker. Multi-PIE. In Proc. Automatic Face Gesture Recognition, pages 1–8, 2008.
[10] M. Gu and S. C. Eisenstat. Downdating the singular value decomposition. SIAM Journal on Matrix Analysis and Applications, 16(3):793–810, 1995.
[11] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning, volume 1. Springer New York, 2001.
[12] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis, volume 344. John Wiley & Sons, 2009.
[13] A. Kulesza and B. Taskar. Determinantal point processes for machine learning. Foundations and Trends in Machine Learning, 5(2–3), 2012.
[14] S. Kumar, M. Mohri, and A. Talwalkar. Ensemble Nyström method. In NIPS 23, 2009.
[15] S. Kumar, M. Mohri, and A. Talwalkar. Sampling methods for the Nyström method. JMLR, 13(1):981–1006, 2012.
[16] D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999.
[17] F. Li, B. Fergus, and P. Perona. Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. CVIU, 106(1):59–70, 2007.
[18] W. Liu, J. He, and S.-F. Chang. Large graph construction for scalable semi-supervised learning. In Proc. ICML, 2010.
[19] W. Liu, G. Hua, and J. Smith. Unsupervised one-class learning for automatic outlier removal. In Proc. CVPR, 2014.
[20] W. Liu, J. Wang, S. Kumar, and S.-F. Chang. Hashing with graphs. In Proc. ICML, 2011.
[21] M. W. Mahoney and P. Drineas. CUR matrix decompositions for improved data analysis. PNAS, 106(3):697–702, 2009.
[22] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. JMLR, 11:19–60, 2010.
[23] S. Savchenko. The Zeta-function and Gibbs measures. Russian Mathematical Surveys, 48(1):189–190, 1993.
[24] A. Talwalkar, S. Kumar, M. Mohri, and H. Rowley. Large-scale SVD and manifold learning. JMLR, 14(1):3129–3152, 2013.
[25] A. Talwalkar, S. Kumar, and H. Rowley. Large-scale manifold learning. In Proc. CVPR, 2008.
[26] C. Thurau, K. Kersting, and C. Bauckhage. Yes we can: simplex volume maximization for descriptive web-scale matrix factorization. In Proc. CIKM, 2010.
[27] F. Wang, C. Chi, T. Chan, and Y. Wang. Nonnegative least correlated component analysis for separation of dependent sources by volume maximization. TPAMI, 32:875–888, 2010.
[28] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality-constrained linear coding for image classification. In Proc. CVPR, 2010.
[29] S. Wang and Z. Zhang. A scalable CUR matrix decomposition algorithm: lower time complexity and tighter bound. In NIPS 26, 2012.
[30] C. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In NIPS 14, 2000.
[31] M. E. Winter. N-finder: an algorithm for fast autonomous spectral end-member determination in hyperspectral data. In SPIE's International Symposium on Optical Science, Engineering, and Instrumentation, 1999.
[32] Y. Xiong, W. Liu, D. Zhao, and X. Tang. Face recognition via archetype hull ranking. In Proc. ICCV, 2013.
[33] K. Zhang and J. Kwok. Density weighted Nyström method for computing large kernel eigensystems. Neural Computation, 21:121–146, 2009.
[34] D. Zhao and X. Tang. Cyclizing clusters via Zeta function of a graph. In NIPS 22, 2008.