{"title": "Online Convex Matrix Factorization with Representative Regions", "book": "Advances in Neural Information Processing Systems", "page_first": 13242, "page_last": 13252, "abstract": "Matrix factorization (MF) is a versatile learning method that has found wide applications in various data-driven disciplines. Still, many MF algorithms do not adequately scale with the size of available datasets and/or lack interpretability. To improve the computational efficiency of the method, an online (streaming) MF algorithm was proposed in Mairal et al., 2010. To enable data interpretability, a constrained version of MF, termed convex MF, was introduced in Ding et al., 2010. In the latter work, the basis vectors are required to lie in the convex hull of the data samples, thereby ensuring that every basis can be interpreted as a weighted combination of data samples. No current algorithmic solutions for online convex MF are known, as it is challenging to find adequate convex bases without having access to the complete dataset. We address both problems by proposing the first online convex MF algorithm that maintains a collection of constant-size sets of representative data samples needed for interpreting each of the bases (Ding et al., 2010) and has the same almost sure convergence guarantees as the online learning algorithm of Mairal et al., 2010. Our proof techniques combine random coordinate descent algorithms with specialized quasi-martingale convergence analysis. Experiments on synthetic and real-world datasets show significant computational savings of the proposed online convex MF method compared to classical convex MF. Since the proposed method maintains small representative sets of data samples needed for convex interpretations, it is related to a body of work in theoretical computer science, pertaining to generating point sets (Blum et al., 2016), and in computer vision, pertaining to archetypal analysis (Mei et al., 2018).
Nevertheless, it differs from these lines of work both in terms of the objective and algorithmic implementations.", "full_text": "Online Convex Matrix Factorization with Representative Regions

Abhishek Agarwal* (Electrical and Computer Engineering, University of Illinois Urbana-Champaign, abhiag@illinois.edu)
Jianhao Peng* (Electrical and Computer Engineering, University of Illinois Urbana-Champaign, jianhao2@illinois.edu)
Olgica Milenkovic (Electrical and Computer Engineering, University of Illinois Urbana-Champaign, milenkov@illinois.edu)

Abstract

Matrix factorization (MF) is a versatile learning method that has found wide applications in various data-driven disciplines. Still, many MF algorithms do not adequately scale with the size of available datasets and/or lack interpretability. To improve the computational efficiency of the method, an online (streaming) MF algorithm was proposed in [1]. To enable data interpretability, a constrained version of MF, termed convex MF, was introduced in [2]. In the latter work, the basis vectors are required to lie in the convex hull of the data samples, thereby ensuring that every basis can be interpreted as a weighted combination of data samples. No current algorithmic solutions for online convex MF are known, as it is challenging to find adequate convex bases without having access to the complete dataset. We address both problems by proposing the first online convex MF algorithm that maintains a collection of constant-size sets of representative data samples needed for interpreting each of the bases [2] and has the same almost sure convergence guarantees as the online learning algorithm of [1]. Our proof techniques combine random coordinate descent algorithms with specialized quasi-martingale convergence analysis.
Experiments on synthetic and real-world datasets show significant computational savings of the proposed online convex MF method compared to classical convex MF. Since the proposed method maintains small representative sets of data samples needed for convex interpretations, it is related to a body of work in theoretical computer science, pertaining to generating point sets [3], and in computer vision, pertaining to archetypal analysis [4]. Nevertheless, it differs from these lines of work both in terms of the objective and algorithmic implementations.

1 Introduction

Matrix Factorization (MF) is a widely used dimensionality reduction technique [5, 6] whose goal is to find a basis that allows for a sparse representation of the underlying data [7, 8]. Compared to other dimensionality reduction techniques based on eigendecompositions [9], MF enforces fewer restrictions on the choice of the basis and hence ensures larger representation flexibility for complex datasets. At the same time, it provides a natural, application-specific interpretation for the bases.

* Abhishek Agarwal and Jianhao Peng contributed equally to this work.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

MF methods have been studied under various modeling constraints [2, 10, 11, 12, 13, 14, 15, 16]. The most frequently used constraints are non-negativity, constraints that accelerate convergence rates, semi-non-negativity, orthogonality and convexity [11, 2, 17]. Convex MF (cvxMF) [2] is of special interest as it requires the basis vectors to be convex combinations of the observed data samples [18, 19].
This constraint allows one to interpret the basis vectors as probabilistic sums of (small) representative subsets of data samples.

Unfortunately, most of the aforementioned constrained MF problems are non-convex and NP-hard [20, 21, 22], but they can often be suboptimally solved using alternating optimization approaches for finding local optima [13]. Alternating optimization approaches have scalability issues, since the number of matrix multiplications and convex optimization steps in each iteration depends both on the dataset size and its dimensionality. To address the scalability issue [23, 24, 25], Mairal, Bach, Ponce and Sapiro [1] introduced an online MF algorithm that minimizes a surrogate function amenable to sequential optimization. The online algorithm comes with strong performance guarantees, asserting that its solution converges almost surely to a local optimum of the generalization loss.

Currently, no online/streaming solutions for convex MF are known, as it appears hard to satisfy the convexity constraint without having access to the whole dataset. We propose the first online MF method accounting for convexity constraints on multi-cluster datasets, termed online convex Matrix Factorization (online cvxMF). The proposed method solves the cvxMF problem of Ding, Li and Jordan [2] in an online/streaming fashion, and allows for selecting a collection of \u201ctypical\u201d representative sets of individual clusters (see Figure 1). The method sequentially processes a single data sample at a time and updates a running version of a collection of constant-size sets of representative samples of the clusters, needed for convex interpretations of each basis element. In this case, each basis element also plays the role of a cluster centroid, which further increases interpretability. The method also allows for both sparse data and sparse basis representations.
In the latter context, sparsity refers to restricting each basis element to be a convex combination of data samples in a small representative region. The online cvxMF algorithm has the same theoretical convergence guarantees as [1].

We also consider a more restricted version of the cvxMF problem, in which the representative samples are required to be strictly contained within their corresponding clusters. The algorithm is semi-heuristic, as it has provable convergence guarantees only when sample classification is error-free, as is the case for non-trivial supervised MF [26] (note that applying [1] to each cluster individually is clearly suboptimal, as one needs to jointly optimize both the basis and the embedding). The restricted cvxMF method nevertheless offers excellent empirical performance when properly initialized.

It is worth pointing out that our results complement a large body of work that generalizes the method of [1] to different loss functions [27, 28, 29] but does not impose convexity constraints. Furthermore, the proposed online cvxMF exhibits certain similarities with online generating point set methods [3] and online archetypal analysis [4]. The goal of these two lines of work is to find a small set of representative samples whose convex hull contains the majority of observed samples. In contrast, we only seek a small set of representative samples needed for accurately describing a basis of the data.

The paper is organized as follows. Section 2 introduces the problem and relevant notation, and outlines our approach towards an online algorithm for the cvxMF problem. Section 3 describes the proposed online algorithm, and Section 4 establishes that the learned bases almost surely converge to a stationary point of the approximation-error function.
The theoretical guarantees hold under mild assumptions on the data distribution reminiscent of those used in [1], while the proof techniques combine random coordinate descent algorithms with specialized quasi-martingale convergence analysis. The performance of the algorithm is tested on both synthetic and real-world datasets, as outlined in Section 5. The real-world datasets are taken from the UCI Machine Learning repository [30] and the 10X Genomics repository [31]. The experiments reveal that our online cvxMF runs four times faster than its non-online counterpart on datasets with 10^4 samples, while for larger sample sets cvxMF becomes exponentially harder to execute. The online cvxMF also produces high-accuracy clustering results.

2 Notation and Problem Formulation

We denote sets by [n] = {1, ..., n}. Capital letters are reserved for matrices (bold font) and random variables (RVs) (regular font). Random vectors are described by capital underlined letters, while deterministic vectors are denoted by lower-case underlined letters. We use M[i] to denote the ith column of the matrix M, M[i, j] to denote the element in row i and column j, and v[i] to denote the ith coordinate of a vector v.

Figure 1: A multi-cluster dataset: Stars represent the learned bases, while circles denote representative samples for the basis of the same color. Left: The representative sets for the individual basis elements are unrestricted. Right: The representative sets are restricted to lie within their corresponding clusters.
Furthermore, col(M) stands for the set of columns of M, while cvx(M) stands for the convex hull of col(M).

Let X ∈ R^{d×n} denote a matrix of n data samples of constant dimension d arranged (summarized) column-wise, let D ∈ R^{d×K} denote the K basis vectors used to represent the data, and let A ∈ R^{K×n} stand for the low-dimensional embedding matrix. The classical MF problem reads as:

min_{D, A} ||X - D A||_2^2 + λ ||A||_1,   (1)

where ||v||_2 ≜ sqrt(Σ_i v[i]^2) and ||v||_1 ≜ Σ_i |v[i]| denote the ℓ2-norm and ℓ1-norm of the vector v, respectively.

In practice, X is inherently random, and in the stochastic setting it is more adequate to minimize the above objective in expectation. In this case, the data approximation error f(D) for a fixed D equals:

f(D) ≜ E_X[ min_{α ∈ R^K} (1/2) ||X - D α||_2^2 + λ ||α||_1 ],   (2)

where X is a random vector of dimension d and the parameter λ controls the sparsity of the coefficient vector α. For analytical tractability, we assume that X is drawn from the union of K disjoint, convex compact regions (clusters), C^(i) ⊂ R^d, i ∈ [K]. Each cluster is independently selected based on a given distribution, and the vector X is sampled from the chosen cluster. Both the cluster and intra-cluster sample distributions are mildly constrained, as described in the next section.

The approximation error of a single data sample x ∈ R^d with respect to D equals

ℓ(x, D) ≜ min_{α ∈ R^K} (1/2) ||x - D α||_2^2 + λ ||α||_1.   (3)

Consequently, the approximation-error function f(D) in Equation (2) may be written as f(D) = E_X[ℓ(X, D)]. The function f(D) is non-convex, and optimizing it is NP-hard and requires prior knowledge of the distribution. To mitigate the latter problem, one can revert to an empirical estimate of f(D) involving the data samples x_i, i ∈ [n]:

f_n(D) = (1/n) Σ_{i=1}^n ℓ(x_i, D).

Maintaining a running estimate D_t of an optimizer of f_t(D) involves updating the coefficient vectors for all the data samples observed up to time t. Hence, it is desirable to use surrogate functions that simplify the updates. The surrogate function ĝ_t(D) proposed in [1] reads as

ĝ_t(D) ≜ (1/t) Σ_{i=1}^t [ (1/2) ||x_i - D α_i||_2^2 + λ ||α_i||_1 ],   (4)

where α_i is an approximation of the optimal value of α at step i, computed by solving Equation (3) with D fixed to D_{i-1}, an optimizer of ĝ_{i-1}(D).

The above approach lends itself to an implementation of an online MF algorithm, as the sum in Equation (4) may be efficiently optimized whenever adding a new sample. However, in order to satisfy the convexity constraint of [2], all previously seen samples are needed to update D. To mitigate this problem, we introduce for each cluster C^(i) a representative set R̄^(i)_t ∈ R^{d×n_i} and its convex hull (representative region) cvx(R̄^(i)_t). The values of n_i are kept constant, and we require D_t[i] ∈ cvx(R̄^(i)_t).
As illustrated in Figure 1, we may further restrict the representative regions as follows.

P1 (Figure 1, Left): We only require that R̄^(i)_t ⊂ cvx(∪_j C^(j)), i ∈ [K]. This unrestricted case leads to an online solution for the cvxMF problem [2], as one may use ∪_i R̄^(i)_t as a single representative region. The underlying online algorithm has provable performance guarantees.

P2 (Figure 1, Right): We require that R̄^(i)_t ⊂ C^(i), which is a new cvxMF constraint for both the classical and online setting. Theoretical guarantees for the underlying algorithm follow from small and fairly obvious modifications in the proof for the P1 case, assuming error-free sample classification.

3 Online Algorithm

The proposed online cvxMF method for solving P1 consists of two procedures, described in Algorithms 1 and 2. Algorithm 1 describes the initialization of the main procedure in Algorithm 2. Algorithm 1 generates an initial estimate for the basis D_0 and for the representative regions {cvx(R̄^(i)_0)}_{i ∈ [K]}. A similar initialization was used in classical cvxMF, with the basis vectors obtained either through clustering (on a potentially subsampled dataset) or through random selection and additional processing [2]. During initialization, one first collects a fixed prescribed number m of data samples, summarized in X̄. Subsequently, one runs the K-means algorithm on the collected samples to obtain a clustering, described by the cluster indicator matrix S ∈ {0, 1}^{m×K}, in which S[i, j] = 1 if the i-th sample lies in cluster j. The sizes of the generated clusters {n_j}_{j ∈ [K]} are used as fixed cardinalities of the representative sets of the online methods.
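A minimal sketch of this initialization stage might look as follows; the helper name init_cvxmf and the plain NumPy K-means loop are our own illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def init_cvxmf(X, k, n_iter=50, seed=0):
    """Initialization sketch: K-means on the first m samples (columns of X),
    returning the initial basis D0 (cluster means) and the per-cluster
    representative sets whose sizes fix the later representative-set sizes."""
    rng = np.random.default_rng(seed)
    d, m = X.shape
    centers = X[:, rng.choice(m, size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assign each sample to its nearest center.
        dist = ((X[:, None, :] - centers[:, :, None]) ** 2).sum(axis=0)
        labels = dist.argmin(axis=0)
        for j in range(k):
            if (labels == j).any():
                centers[:, j] = X[:, labels == j].mean(axis=1)
    reps = [X[:, labels == j] for j in range(k)]  # initial representative sets
    return centers, reps
```

On two well-separated point clouds, the returned centers recover the cluster means and the representative sets partition the initial samples.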
The initial estimate of the basis D_0[j] equals the average of the samples inside cluster j, i.e., D_0 ≜ X̄ S diag(1/n_1, ..., 1/n_K).

Note again that initialization is performed using only a constant number of samples. Hence, K-means clustering does not significantly contribute to the complexity of the online algorithm. Second, to ensure that the restricted online cvxMF algorithm instantiates each cluster with at least one data sample, one needs to take into account the size of the smallest cluster (discussed in the Supplement).

Algorithm 1 Initialization
1: Input: i.i.d. samples x_1, x_2, ..., x_m of a random vector X ∈ R^d, summarized in X̄.
2: Run K-means on X̄ to generate the cluster indicator matrix S ∈ {0, 1}^{m×K} and determine the initial cluster sizes (subsequent representative-set sizes) n_j, j ∈ [K].
3: Compute D_0 = X̄ S diag(1/n_1, ..., 1/n_K) and, for all j ∈ [K], summarize the initial representative sets of the clusters into matrices R̄^(j)_0 ∈ R^{d×n_j}.
4: Return: D_0, {R̄^(j)_0}_{j ∈ [K]}.

Figure 2: Illustration of one step of the online cvxMF algorithm with multiple representative regions.

Following initialization, Algorithm 2 sequentially selects one sample x_t at a time and then updates the current representative sets R̄^(j)_t, j ∈ [K], and basis D_t. More precisely, after computing the coefficient vector α_t in Step 5, one places the sample x_t into the appropriate cluster, indexed by j_t. The n_{j_t}-subsets of {col(R̄^(j_t)_{t-1}) ∪ x_t} (referred to as the augmented representative sets R̄^{l}_t, l ∈ [n_{j_t}] ∪ {0}) are used in Steps 8 and 9 to determine the new representative region for cluster j_t. To find the optimal index l ∈ [n_{j_t}] ∪ {0} and the corresponding updated basis D[j_t], in Step 9 we solve n_{j_t} + 1 convex problems. The minimum of the optimal solutions of these optimization problems determines the new basis D_t and the representative regions R̄^(j)_t (see Figure 2 for clarifications). Note that the combinatorial search step is executed on a constant-sized set of samples and is hence computationally efficient.

Algorithm 2 Online cvxMF
1: Input: Data samples x_t, a parameter λ ∈ R, and the maximum number of iterations T.
2: Initialization: Compute D_0, {R̄^(j)_0}_{j ∈ [K]} using Algorithm 1. Set A_0 = 0, B_0 = 0.
3: for t = 1 to T do
4:   Sample x_t from X.
5:   Update α_t according to: α_t = argmin_{α ∈ R^K} (1/2) ||x_t - D_{t-1} α||_2^2 + λ ||α||_1.   (5)
6:   Set A_t = (1/t) ((t-1) A_{t-1} + α_t α_t^T) and B_t = (1/t) ((t-1) B_{t-1} + x_t α_t^T).
7:   Choose the index j_t of the basis to be updated according to j_t = Uniform([K]).
8:   Generate the augmented representative sets {R̄^{l}_t : l ∈ [n_{j_t}] ∪ {0}}: R̄^{0}_t = R̄^(j_t)_{t-1}, and, for l ∈ [n_{j_t}], R̄^{l}_t[i] = R̄^(j_t)_{t-1}[i] for i ∈ [n_{j_t}] \ l, and R̄^{l}_t[l] = x_t.
9:   Update {R̄^(j)_t}_{j ∈ [K]} and D_t by executing the following two steps:
     a. Compute l*, D* by solving the optimization problems
        l*, D* = argmin (1/t) Σ_{i=1}^t [ (1/2) ||x_i - D α_i||_2^2 + λ ||α_i||_1 ]   (6)
               = argmin (1/2) Tr(D^T D A_t) - Tr(D^T B_t),   (7)
        where both minimizations are over l and D such that D[j_t] ∈ cvx(R̄^{l}_t) and D[j] ∈ cvx(R̄^(j)_{t-1}) for j ≠ j_t.
     b. Set R̄^(j_t)_t = R̄^{l*}_t, R̄^(j)_t = R̄^(j)_{t-1} for j ∈ [K] \ j_t, and D_t = D*.
10: end for
11: return D_T, the learned convex dictionary.

In Step 7, the new sample may be assigned to a cluster in two different ways. For the case P1, we use a random assignment. For the case P2, we need to perform the correct sample assignment in order to establish theoretical guarantees for the algorithm.
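As a toy illustration of Steps 8-9, the sketch below scores every candidate representative set obtained by swapping one stored sample for the new sample (or keeping the set unchanged) against the quadratic form of Equation (7), for a single basis column constrained to the candidate's convex hull. The function names (best_swap, project_simplex) and the small projected-gradient solver over simplex weights are our own illustrative assumptions:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u > css / (np.arange(len(v)) + 1))[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def best_swap(R, x, A, B, n_iter=300):
    """For one basis column (K = 1): try keeping R or swapping any one of its
    columns for x, minimize 0.5*A*||d||^2 - <d, B> over d in cvx(candidate)
    (d = candidate @ w, w on the simplex), and return the best candidate."""
    d_dim, m = R.shape
    candidates = [R] + [np.column_stack([R[:, :l], x, R[:, l + 1:]])
                        for l in range(m)]
    best = (np.inf, None, None)
    for C in candidates:
        w = np.full(m, 1.0 / m)
        step = 1.0 / (A * np.linalg.norm(C.T @ C, 2) + 1e-12)
        for _ in range(n_iter):
            grad = A * (C.T @ (C @ w)) - C.T @ B   # gradient in the weights w
            w = project_simplex(w - step * grad)
        d = C @ w
        obj = 0.5 * A * (d @ d) - d @ B
        if obj < best[0]:
            best = (obj, C, d)
    return best[1], best[2]
```

In a scalar example with stored samples {0.2, 0.4}, a new sample 1.0, and an unconstrained optimum at 0.9, the swap that admits 1.0 into the representative set wins, since only then does 0.9 lie in the candidate's convex hull.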
Extensive simulations show that using j_t = argmax_j α_t[j] works very well in practice. Note that in either case, in order to minimize f(D), one does not necessarily require an error-free classification process.

4 Convergence Analysis

In what follows, we show that the sequence of dictionaries {D_t}_t converges almost surely to a stationary point of f(D) under assumptions similar to those used in [1], listed below.

(A.1) The data distribution on a compact support set C has bounded \u201cskewness\u201d. The compact support assumption naturally arises in many practical applications. The bounded skewness assumption for the distribution of X reads as

P(||X - x||_2 ≤ r | X ∈ C) ≥ θ vol(B(x, r)) / vol(C),   (8)

where C ≜ cvx(∪_i C^(i)), θ is a positive constant, and B(x, r) = {y : ||y - x||_2 ≤ r} stands for the ball of radius r around x ∈ C. This assumption is satisfied for appropriate values of θ and distributions of X that are \u201cclose\u201d to uniform.

(A.2) The quadratic surrogate functions ĝ_t are strictly convex, and have Hessians that are lower-bounded by a positive constant κ_1 > 0. It is straightforward to enforce this assumption by adding a term (κ_1/2) ||D||_2^2 to the surrogate or original objective function; this leads to replacing the positive semi-definite matrix A_t in Equation (7) by A_t + κ_1 I.

(A.3) The approximation-error function ℓ(x, D) is \u201cwell-behaved\u201d. We assume that the function ℓ(x, D) defined in Equation (3) is continuously differentiable, and that its expectation f(D) = E_X[ℓ(X, D)] is continuously differentiable and Lipschitz on the compact set C.
This assumption parallels the one made in [1, Proposition 2], and it holds if the solution to Equation (3) is unique. The uniqueness condition can be enforced by adding a regularization term (κ_2/2) ||α||_2^2 (κ_2 > 0) to ℓ(x, D) in Equation (3). This term makes the (LARS) optimization problem in Equation (5) strictly convex and hence ensures that it has a unique solution.

In addition, recall the definition of D_t, and define D*_t as the global optimum of the surrogate ĝ_t(D) restricted to the representative regions, and D°_t as its unrestricted optimum over C:

D*_t = argmin_{D[i] ∈ cvx(R̄^(i)_t), i ∈ [K]} ĝ_t(D),   D°_t = argmin_{D[i] ∈ C, i ∈ [K]} ĝ_t(D).

4.1 Main Results

Theorem 1. Under assumptions (A.1) to (A.3), the sequence {D_t}_t converges almost surely to a stationary point of f(D).

Lemma 2 bounds the difference of the surrogates for two different dictionary arguments. Lemma 3 establishes that restricting the optima of the surrogate function ĝ_t(D) to the representative regions cvx(R̄^(i)_t) does not affect convergence to the asymptotic global optimum. Lemma 4 establishes that Algorithm 2 converges almost surely and that the limit is such an optimum. Based on the results in Lemma 4, Theorem 1 establishes that the generated sequence of dictionaries D_t converges to a stationary point of f(D). Let Δ_t ≜ |ĝ_t(D_t) - ĝ_t(D*_t)| denote the difference between the surrogate function values at D_t and at the restricted optimum D*_t.
The proofs are relegated to the Supplement, but are sketched below. One can first bound Δ_t in terms of Δ_{t-1} and of the difference |ĝ_{t-1}(D_t) - ĝ_{t-1}(D*_{t-1})|. Based on an upper bound on the error of random coordinate descent used for minimizing the surrogate function, and on assumption (A.1), one can then derive a recurrence relation for E[Δ_t], described in the lemma below. This recurrence establishes the rate of decrease of E[Δ_t].

Lemma 2. Let Δ_t ≜ ĝ_t(D_t) - ĝ_t(D*_t). Then E[Δ_t] satisfies a recurrence of the form

E[Δ_t] ≤ (1 - β) E[Δ_{t-1}] + O(t^{-2/(d+2)}),

where β > 0 is an explicit constant depending on the skewness constant θ of Equation (8) in assumption (A.1), on a ≜ max_{t,i} |α_t[i]|, on a uniform bound κ̄ on the condition numbers of the matrices A_t, and on p_j, the probability of choosing j_t = j in Step 7 of Algorithm 2.

Lemma 3 establishes that the optimum D*_t confined to the representative regions and the iterates D_t are close.
From the Lipschitz continuity of ĝ_t(D) asserted in assumptions (A.1) and (A.2), we can relate Δ_t to ||D_t - D*_t||_2; Lemma 3 then follows from Lemma 2 by applying the quasi-martingale convergence theorem stated in the Supplement.

Lemma 3. The series Σ_t ||D_t - D*_t||_2 / (t + 1) converges almost surely.

Lemma 4. The following claims hold true:
P1) ĝ_t(D_t) and ĝ_t(D*_t) converge almost surely;
P2) ĝ_t(D_t) - ĝ_t(D*_t) converges almost surely to 0;
P3) ĝ_t(D*_t) - f(D*_t) converges almost surely to 0;
P4) f(D*_t) converges almost surely.

The proofs of P1) and P2) involve completely new analytic approaches described in the Supplement.

5 Experimental Validation

We compare the approximation error and running time of our proposed online cvxMF algorithm with non-negative MF (NMF), cvxMF [2] and online MF [1]. For datasets with a ground truth, we also report the clustering accuracy. The datasets used include a) clusters of synthetic data samples; b) MNIST handwritten digits [32]; c) single-cell RNA sequencing datasets [31]; and d) four other real-world datasets from the UCI Machine Learning repository [30]. The largest sample size scales as 10^6.

(a) Well-separated clusters. (b) Overlapping clusters.

Figure 3: Results for Gaussian mixtures with color-coded clusters. Here, tSNE stands for the t-distributed stochastic neighbor embedding [33], in which the x-axis represents the first and the y-axis the second element of the embedding. Color-coded circles represent samples, diamonds represent basis vectors learned by the different algorithms, while crosses describe samples in the representative regions.
The \u201cinterpretability property\u201d can be easily observed visually.

Synthetic Datasets. The synthetic datasets were generated by sampling from a 3σ-truncated Gaussian mixture model with 5 components, with sample sizes in [10^3, 10^6]. Each component Gaussian has an expected value drawn uniformly at random from [0, 20], while the mixture covariance matrix equals the identity matrix I (\u201cwell-separated clusters\u201d) or 2.5 I (\u201coverlapping clusters\u201d). We ran the online cvxMF algorithm with both unconstrained (P1) and restricted (P2) representative regions, and used the normalization factor λ ∝ 1/sqrt(d) suggested in [34]. After performing cross-validation on an evaluation set of size 1000, we selected λ = 0.2. Figure 3 shows the results for two synthetic datasets, each of size 2,500 and of dimension 150. The sample size was restricted for ease of visualization and to accommodate the cvxMF method, which cannot run on larger sets. The number of iterations was limited to 1,200. Both the cvxMF and online cvxMF algorithms generate bases that provide excellent representations of the data clusters. The MF and online MF methods produce bases that are hard to interpret and fail to cover all clusters. Note that for the unrestricted version of cvxMF, samples of one representative set may belong to multiple clusters.

For the same Gaussian mixture model but larger datasets, we present running times and times to convergence (or, if convergence is slow, the maximum number of iterations) in Figure 4 (a) and (b), respectively. For well-separated synthetic datasets, we let n increase from 10^2 to 10^6 and plot the results in (a). The non-online cvxMF algorithm becomes intractable beyond 10^4 samples, while the online cvxMF and online MF easily scale to 10^6 and more samples.
To illustrate the convergence, we used a synthetic dataset with $n = 5{,}000$ in order to ensure that all four algorithms converge within 100 s. Figure 4 (b) plots the approximation error $\epsilon^2 = \frac{1}{n}\|X - D\alpha\|^2$ with respect to the running time. We chose a small value of $n$ so as to be able to run all algorithms; for this case, the online algorithms may have larger errors. But, as already pointed out, as $n$ increases, non-online algorithms become intractable while their online counterparts operate efficiently (and with provable guarantees).

Figure 4: (a) Running times (s) vs. the log of the dataset sizes; (b) running times (s) vs. the $\epsilon^2$ error.

Figure 5: MNIST results for (a) MF, (b) cvxMF, (c) online MF, and (d) online cvxMF (as the eigenimage set is overcomplete, clustering accuracy is omitted).

The MNIST Dataset. The MNIST dataset was subsampled to a smaller set of 10,000 images of resolution $28 \times 28$ to illustrate the performance of both the cvxMF and online cvxMF methods on image datasets. All algorithms ran 3,000 iterations with 150 basis vectors and $\lambda = 0.1$ to generate "eigenimages" capturing the characteristic features used as bases [35]. Figure 5 plots the first 9 eigenimages. The results for the unconstrained-region algorithm are similar to those of the non-online cvxMF algorithm and are omitted. CvxMF produces blurry images, since one averages all samples. The results are significantly better for the restricted-region case, as one only averages a small subset of representative samples.

Single-Cell (sc) RNA Data. scRNA datasets contain expressions (activities) of all genes in individual cells, and each cell represents one data sample.
Cells from the same tissue under the same cellular conditions tend to cluster and, since the sampled tissues are known, the cell labels are known a priori. This setting allows us to investigate the restricted-region version of the online cvxMF algorithm to identify "typical" samples. For our dataset, described in more detail in the Supplement, the two non-online methods failed to converge and required significantly larger memory. Hence, we only present results for the online methods. Results pertaining to real-world datasets from the UCI Machine Learning repository [30], also used for testing cvxMF [2], are presented in the Supplement.

Figure 6: Results for the online methods executed on a blood-cell scRNA dataset.

6 Acknowledgements

The authors are grateful to Prof. Bruce Hajek for valuable discussions. This work was funded by the DB2K NIH 3U01CA198943-02S1, NSF/IUCR CCBGM Center, and the SVCF CZI 2018-182799 2018-182797.

References

[1] Julien Mairal, Francis Bach, Jean Ponce, and Guillermo Sapiro. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11(Jan):19–60, 2010.

[2] Chris HQ Ding, Tao Li, and Michael I Jordan. Convex and semi-nonnegative matrix factorizations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(1):45–55, 2010.

[3] Avrim Blum, Sariel Har-Peled, and Benjamin Raichel. Sparse approximation via generating point sets. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, pages 548–557. Society for Industrial and Applied Mathematics, 2016.

[4] Jieru Mei, Chunyu Wang, and Wenjun Zeng. Online dictionary learning for approximate archetypal analysis. In Proceedings of the European Conference on Computer Vision (ECCV), pages 486–501, 2018.

[5] Ivana Tosic and Pascal Frossard. Dictionary learning.
IEEE Signal Processing Magazine, 28(2):27–38, 2011.

[6] Julien Mairal, Jean Ponce, Guillermo Sapiro, Andrew Zisserman, and Francis R Bach. Supervised dictionary learning. In Advances in Neural Information Processing Systems, pages 1033–1040, 2009.

[7] Ron Rubinstein, Michael Zibulevsky, and Michael Elad. Double sparsity: Learning sparse dictionaries for sparse signal approximation. IEEE Transactions on Signal Processing, 58(3):1553–1564, 2010.

[8] Wei Dai, Tao Xu, and Wenwu Wang. Simultaneous codeword optimization (SimCO) for dictionary update and learning. IEEE Transactions on Signal Processing, 60(12):6340–6353, 2012.

[9] Laurens Van Der Maaten, Eric Postma, and Jaap Van den Herik. Dimensionality reduction: A comparative review. Journal of Machine Learning Research, 10:66–71, 2009.

[10] Nathan Srebro, Jason Rennie, and Tommi S Jaakkola. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems, pages 1329–1336, 2005.

[11] Haifeng Liu, Zhaohui Wu, Xuelong Li, Deng Cai, and Thomas S Huang. Constrained nonnegative matrix factorization for image representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(7):1299–1311, 2012.

[12] Francis Bach, Julien Mairal, and Jean Ponce. Convex sparse matrix factorizations. arXiv preprint arXiv:0812.1869, 2008.

[13] Daniel D Lee and H Sebastian Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems, pages 556–562, 2001.

[14] Pentti Paatero and Unto Tapper. Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics, 5(2):111–126, 1994.

[15] Xiao Wang, Peng Cui, Jing Wang, Jian Pei, Wenwu Zhu, and Shiqiang Yang. Community preserving network embedding. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

[16] Cédric Févotte and Nicolas Dobigeon.
Nonlinear hyperspectral unmixing with robust non-negative matrix factorization. IEEE Transactions on Image Processing, 24(12):4810–4819, 2015.

[17] Nicolas Gillis and François Glineur. Accelerated multiplicative updates and hierarchical ALS algorithms for nonnegative matrix factorization. Neural Computation, 24(4):1085–1105, 2012.

[18] Ernie Esser, Michael Moller, Stanley Osher, Guillermo Sapiro, and Jack Xin. A convex model for nonnegative matrix factorization and dimensionality reduction on physical space. IEEE Transactions on Image Processing, 21(7):3239–3252, 2012.

[19] Guillaume Bouchard, Dawei Yin, and Shengbo Guo. Convex collective matrix factorization. In Artificial Intelligence and Statistics, pages 144–152, 2013.

[20] Xingguo Li, Zhaoran Wang, Junwei Lu, Raman Arora, Jarvis Haupt, Han Liu, and Tuo Zhao. Symmetry, saddle points, and global geometry of nonconvex matrix factorization. arXiv preprint arXiv:1612.09296, 2016.

[21] Chih-Jen Lin. Projected gradient methods for nonnegative matrix factorization. Neural Computation, 19(10):2756–2779, 2007.

[22] Stephen A Vavasis. On the complexity of nonnegative matrix factorization. SIAM Journal on Optimization, 20(3):1364–1377, 2009.

[23] Rémi Gribonval, Rodolphe Jenatton, Francis Bach, Martin Kleinsteuber, and Matthias Seibert. Sample complexity of dictionary learning and other matrix factorizations. IEEE Transactions on Information Theory, 61(6):3469–3486, 2015.

[24] Honglak Lee, Alexis Battle, Rajat Raina, and Andrew Y Ng. Efficient sparse coding algorithms. In Advances in Neural Information Processing Systems, pages 801–808, 2007.

[25] Shiva P Kasiviswanathan, Huahua Wang, Arindam Banerjee, and Prem Melville. Online l1-dictionary learning with application to novel document detection.
In Advances in Neural Information Processing Systems, pages 2258–2266, 2012.

[26] Jun Tang, Ke Wang, and Ling Shao. Supervised matrix factorization hashing for cross-modal retrieval. IEEE Transactions on Image Processing, 25(7):3157–3166, 2016.

[27] Renbo Zhao, Vincent YF Tan, and Huan Xu. Online nonnegative matrix factorization with general divergences. arXiv preprint arXiv:1608.00075, 2016.

[28] Renbo Zhao and Vincent YF Tan. Online nonnegative matrix factorization with outliers. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2662–2666. IEEE, 2016.

[29] Rui Xia, Vincent YF Tan, Louis Filstroff, and Cédric Févotte. A ranking model motivated by nonnegative matrix factorization with applications to tennis tournaments. arXiv preprint arXiv:1903.06500, 2019.

[30] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.

[31] Grace XY Zheng, Jessica M Terry, Phillip Belgrader, Paul Ryvkin, Zachary W Bent, Ryan Wilson, Solongo B Ziraldo, Tobias D Wheeler, Geoff P McDermott, Junjie Zhu, et al. Massively parallel digital transcriptional profiling of single cells. Nature Communications, 8:14049, 2017.

[32] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[33] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

[34] Peter J Bickel, Ya'acov Ritov, Alexandre B Tsybakov, et al. Simultaneous analysis of lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.

[35] Aleš Leonardis and Horst Bischof. Robust recognition using eigenimages.
Computer Vision and Image Understanding, 78(1):99–118, 2000.