{"title": "Randomized PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension", "book": "Advances in Neural Information Processing Systems", "page_first": 1481, "page_last": 1488, "abstract": null, "full_text": "Randomized PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension\n\nManfred K. Warmuth Computer Science Department University of California - Santa Cruz manfred@cse.ucsc.edu\n\nDima Kuzmin Computer Science Department University of California - Santa Cruz dima@cse.ucsc.edu\n\nAbstract\nWe design an on-line algorithm for Principal Component Analysis. In each trial the current instance is projected onto a probabilistically chosen low dimensional subspace. The total expected quadratic approximation error equals the total quadratic approximation error of the best subspace chosen in hindsight plus some additional term that grows linearly in dimension of the subspace but logarithmically in the dimension of the instances.\n\n1 Introduction\nIn Principal Component Analysis the n-dimensional data instances are projected into a k dimensional subspace (k < n) so that the total quadratic approximation error is minimized. After centering the data, the problem is equivalent to finding the eigenvectors of the k largest eigenvalues of the data covariance matrix. We develop a probabilistic on-line version of PCA: in each trial the algorithm chooses a k dimensional projection matrix P t based on some internal parameter; then an instance xt is received and the algorithm incurs loss xt - P t xt 2 ; finally the internal parameter is updated. The goal is to 2 obtain algorithms whose total loss in all trials is close to the smallest total loss of any k -dimensional subspace P chosen in hindsight. We first develop our algorithms in the expert setting of on-line learning. The algorithm maintains a mixture vector over the n experts. At the beginning of trial t the algorithm chooses a subset P t of k experts based on the current mixture vector wt . 
It then receives a loss vector ℓ^t ∈ [0,1]^n and incurs loss equal to the sum of the remaining n - k components of the loss vector, i.e. Σ_{i ∈ {1,...,n} - P^t} ℓ^t_i. Finally it updates its mixture vector to w^{t+1}. Note that the subset P^t corresponds to the subspace onto which we \"project\", i.e. we incur no loss on the k components of P^t and are charged only for the remaining n - k components. The trick is to maintain the mixture vector w^t as a parameter under the additional constraint that w^t_i ≤ 1/(n-k). We will show that these constrained mixture vectors represent an implicit mixture over subsets of experts of size n - k, and that given w^t we can efficiently sample from the implicit mixture and use it to predict. This gives an on-line algorithm whose total loss is close to the smallest n - k components of Σ_t ℓ^t, and the algorithm generalizes to an on-line PCA algorithm when the mixture vectors are replaced by density matrices whose eigenvalues are bounded by 1/(n-k). Now the constrained density matrices represent implicit mixtures of the (n - k)-dimensional subspaces. The complementary k-dimensional space is used to project the current instance.

2 Standard PCA and On-line PCA
Given a sequence of data vectors x^1, ..., x^T, the goal is to find a low-dimensional approximation of this data that minimizes the 2-norm approximation error. Specifically, we want to find a rank k projection matrix P and a bias vector b ∈ R^n such that the following cost function is minimized:

loss(P, b) = Σ_{t=1}^T ||x^t - (P x^t + b)||_2^2.

Differentiating and solving for b gives b = (I - P) x̄, where x̄ is the data mean. Substituting this bias into the loss we obtain

loss(P) = Σ_{t=1}^T ||(I - P)(x^t - x̄)||_2^2 = Σ_{t=1}^T (x^t - x̄)^T (I - P)^2 (x^t - x̄).

Since I - P is a projection matrix, (I - P)^2 = I - P, and we get:

loss(P) = tr((I - P) Σ_{t=1}^T (x^t - x̄)(x^t - x̄)^T) = tr((I - P) C) = tr(C) - tr(P C),

where C is the data covariance matrix; here I - P has rank n - k and P has rank k.
Therefore minimizing the loss amounts to minimizing tr((I - P) C) over (n - k)-dimensional subspaces, which is equivalent to maximizing tr(P C) over k-dimensional subspaces. In the on-line setting, learning proceeds in trials. (For the sake of simplicity we are not using a bias term at this point.) At trial t, the algorithm chooses a rank k projection matrix P^t. It then receives an instance x^t and incurs loss ||x^t - P^t x^t||_2^2 = tr((I - P^t) x^t (x^t)^T). Our goal is to obtain an algorithm whose total loss over a sequence of trials, Σ_{t=1}^T tr((I - P^t) x^t (x^t)^T), is close to the total loss of the best rank k projection matrix P, i.e. inf_P tr((I - P) Σ_{t=1}^T x^t (x^t)^T). Note that the latter loss is equal to the loss of standard PCA on the data sequence x^1, ..., x^T (assuming the data is centered).

3 Choosing a Subset of Experts
Recall that projection matrices are symmetric positive semi-definite matrices with eigenvalues in {0, 1}. Thus a rank k projection matrix can be written as P = Σ_{i=1}^k p_i p_i^T, where the p_i are the k orthonormal vectors forming the basis of the subspace. Assume for the moment that the eigenvectors are restricted to be standard basis vectors. Now projection matrices become diagonal matrices with entries in {0, 1}, where the number of ones is the rank. Also, the trace of a product of such a diagonal projection matrix and any symmetric matrix becomes a dot product between the diagonals of both matrices, and the whole problem reduces to working with vectors: the rank k projection matrices reduce to vectors with k ones and n - k zeros, and the diagonal of the symmetric matrix may be seen as a loss vector ℓ^t. Our goal now is to develop on-line algorithms for finding the lowest n - k components of the loss vectors ℓ^t, so that the total loss is close to the lowest n - k components of Σ_{t=1}^T ℓ^t. Equivalently, we want to find the highest k components of the ℓ^t. We begin by developing some methods for dealing with subsets of components.
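The batch derivation above can be checked numerically. The following is a small numpy sketch (our illustration, not from the paper; the function name pca_loss is made up) that computes the loss of the best rank k subspace both directly and via the identity loss(P) = tr(C) - tr(P C):

```python
import numpy as np

def pca_loss(X, k):
    """Total squared error of the best rank-k projection of the data X
    (rows are instances), computed two ways: directly per instance,
    and via the trace identity loss(P) = tr(C) - tr(P C)."""
    Xc = X - X.mean(axis=0)                    # center the data
    C = Xc.T @ Xc                              # (unnormalized) covariance matrix
    vals, vecs = np.linalg.eigh(C)             # eigenvalues in ascending order
    U = vecs[:, -k:]                           # eigenvectors of the k largest eigenvalues
    P = U @ U.T                                # rank-k projection matrix
    direct = sum(np.linalg.norm(x - P @ x)**2 for x in Xc)
    via_trace = np.trace(C) - np.trace(P @ C)  # the trace form of the same loss
    return direct, via_trace
```

Note that the two quantities agree for any projection matrix P, not only the optimal one; the eigendecomposition merely picks the minimizer.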
For convenience we encode such subsets as probability vectors: we call r ∈ [0, 1]^n an m-corner if it has m components set to 1/m and the remaining n - m components set to zero. At trial t the algorithm chooses an (n - k)-corner r^t. It then receives a loss vector ℓ^t and incurs loss (n - k) r^t · ℓ^t. Let A^n_m consist of all convex combinations of m-corners. In other words, A^n_m is the convex hull of the m-corners. Clearly any component w_i of a vector w in A^n_m is at most 1/m because it is a convex combination of numbers in [0, 1/m]. Therefore A^n_m ⊆ B^n_m, where B^n_m is the set of n-dimensional vectors w for which |w| = Σ_i w_i = 1 and 0 ≤ w_i ≤ 1/m for all i. The following theorem implies that A^n_m = B^n_m:

Theorem 1. Algorithm 1 produces a convex combination¹ of at most n m-corners for any vector in B^n_m.

Algorithm 1 Mixture Construction
input: 1 ≤ m < n and w ∈ B^n_m
repeat
  Let r be a corner whose m components correspond to nonzero components of w and contain all the components of w that are equal to |w|/m
  Let s be the smallest of the m chosen components of w and l the largest value of the remaining n - m components
  p := min(m s, |w| - m l)
  w := w - p r and output "p r"
until w = 0

Proof. Let b(w) be the number of boundary components of w, i.e. b(w) := |{i : w_i is 0 or |w|/m}|. Let B̂^n_m be the set of all vectors w such that 0 ≤ w_i ≤ |w|/m for all i. If b(w) = n, then w is either a corner or 0. The loop stops when w = 0. If w is a corner then it takes one iteration to arrive at 0. We show that if w ∈ B̂^n_m and w is neither a corner nor 0, then the successor w' lies in B̂^n_m and b(w') > b(w). Clearly w' ≥ 0, because the amount p/m that is subtracted from each of the m components of the corner is at most as large as the corresponding components of w. We next show that w'_i ≤ |w'|/m. If i belongs to the corner then w'_i = w_i - p/m ≤ (|w| - p)/m = |w'|/m. Otherwise w'_i = w_i ≤ l, and l ≤ |w'|/m follows from the fact that p ≤ |w| - m l.
This proves that w' ∈ B̂^n_m.

To show that b(w') > b(w), first observe that all boundary components of w remain boundary components of w': zeros stay zeros, and if w_i = |w|/m then i is included in the corner and w'_i = (|w| - p)/m = |w'|/m. Moreover, the number of boundary components increases by at least one, because the components corresponding to s and l are both non-boundary components of w and at least one of them becomes a boundary component of w': if p = m s then the component corresponding to s becomes s - p/m = 0 in w', and if p = |w| - m l then the component corresponding to l satisfies l = (|w| - p)/m = |w'|/m. It follows that it may take up to n iterations to arrive at a corner, which has n boundary components, and one more iteration to arrive at 0. Finally note that there is no weight vector w ∈ B̂^n_m with b(w) = n - 1, and therefore the size of the produced linear combination is at most n. More precisely, the size is at most n - b(w) if n - b(w) ≤ n - 2, and one if w is a corner. The algorithm produces a linear combination of corners, i.e. w = Σ_j p_j r^j. Since p_j ≥ 0 and all |r^j| = 1, Σ_j p_j = 1 and we actually have a convex combination.

Fact 1. For any loss vector ℓ, the following corner has the smallest loss of any convex combination of corners in A^n_m = B^n_m: greedily pick the component of minimum loss (m times).

How can we use the above construction and fact? It seems too hard to maintain information about all C(n, n-k) corners of size n - k. However, the best corner is also the best convex combination of corners, i.e. the best member of the set A^n_{n-k}, where each member of this set is specified by C(n, n-k) coefficients. Luckily, this set of convex combinations equals B^n_{n-k}, and it takes only n coefficients to specify a member of that set. Therefore we can search for the best hypothesis in the set B^n_{n-k}, and for any such hypothesis we can always construct a convex combination (of size at most n) of (n - k)-corners which has the same expected loss for each loss vector.
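Algorithm 1 can be sketched in a few lines of numpy. This is our illustrative reimplementation (the name mixture_decomposition is made up, and we realize the corner choice by simply taking the m largest components, which in particular contains every component equal to |w|/m as the algorithm requires):

```python
import numpy as np

def mixture_decomposition(w, m, tol=1e-12):
    """Decompose w in B^n_m into a convex combination of m-corners
    (a sketch of Algorithm 1).  Returns a list of (p_j, r_j) pairs with
    sum_j p_j = 1 and each corner r_j having m entries equal to 1/m."""
    w = np.asarray(w, dtype=float).copy()
    n = len(w)
    out = []
    for _ in range(n + 1):                 # at most n corners suffice (Theorem 1)
        total = w.sum()                    # |w|
        if total <= tol:
            break
        chosen = np.argsort(-w)[:m]        # m largest components: all nonzero,
        r = np.zeros(n)                    # and they include every w_i = |w|/m
        r[chosen] = 1.0 / m
        s = w[chosen].min()                # smallest chosen component
        l = np.delete(w, chosen).max()     # largest of the remaining components
        p = min(m * s, total - m * l)      # coefficient of this corner
        out.append((p, r))
        w = np.where(w - p * r < tol, 0.0, w - p * r)   # subtract and clamp
    return out
```

The `tol` clamping is only there to absorb floating-point residue; in exact arithmetic the loop terminates by the boundary-counting argument of the proof.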
This means that any algorithm predicting with a hypothesis vector in B^n_{n-k} can be converted to an algorithm that probabilistically chooses an (n - k)-corner. Finally, the set P^t of the k components missed by the chosen (n - k)-corner corresponds to the subspace we project onto. Algorithm 2 spells out the details of this approach. The algorithm chooses a corner probabilistically, and (n - k) w^t · ℓ^t is the expected loss in one trial. The projection of ŵ^t onto B^n_{n-k} can be achieved as follows: find the smallest l such that capping the largest l components to 1/(n-k) and rescaling the remaining n - l weights to total weight 1 - l/(n-k) makes none of the rescaled weights go above 1/(n-k). The simplest algorithm starts with sorting the weights and then searches for l with a binary search. However, a linear-time algorithm that recursively uses the median is given in [HW01].

¹ The existence of a convex combination of at most n corners is implied by Caratheodory's theorem [Roc70], but the algorithm gives an effective construction.

Algorithm 2 Capped Weighted Majority Algorithm
input: 1 ≤ k < n and an initial probability vector w^1 ∈ B^n_{n-k}
for t = 1 to T do
  Decompose w^t as Σ_j p_j r^j with Algorithm 1, where m = n - k
  Draw a corner r = r^j with probability p_j
  Let P^t be the k components outside the drawn corner
  Receive loss vector ℓ^t
  Incur loss (n - k) r · ℓ^t = Σ_{i ∈ {1,...,n} - P^t} ℓ^t_i
  ŵ^t_i := w^t_i exp(-η ℓ^t_i) / Z, where Z normalizes the weights to one
  w^{t+1} := argmin_{w ∈ B^n_{n-k}} d(w, ŵ^t)
end for

When k = n - 1, we have n - k = 1 and B^n_1 is the entire probability simplex. In this case the call to Algorithm 1 and the projection onto B^n_1 are vacuous, and we get the standard Randomized Weighted Majority algorithm [LW94]² with loss vector ℓ^t. Let d(u, w) denote the relative entropy between two probability vectors: d(u, w) = Σ_i u_i log(u_i / w_i).

Theorem 2. On an arbitrary sequence of loss vectors ℓ^1, . . .
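The capping projection described above might be sketched as follows (our illustration; this is the simple sort-based variant rather than the linear-time median recursion of [HW01], and the function name is made up):

```python
import numpy as np

def project_to_capped_simplex(w, m, tol=1e-12):
    """Project a probability vector w onto B^n_m = {v : sum(v) = 1,
    0 <= v_i <= 1/m}: cap the largest l components at 1/m and rescale
    the rest, for the smallest l that leaves no rescaled weight above
    the cap."""
    w = np.asarray(w, dtype=float)
    cap = 1.0 / m
    idx = np.argsort(-w)                   # indices in descending weight order
    for l in range(len(w)):
        v = w.copy()
        v[idx[:l]] = cap                   # cap the l largest components
        rest = idx[l:]
        remaining = 1.0 - l * cap          # probability mass left for the rest
        if w[rest].sum() > tol:
            v[rest] = w[rest] * remaining / w[rest].sum()
        else:
            v[rest] = remaining / len(rest)  # degenerate fallback: spread mass
        if v.max() <= cap + tol:           # smallest feasible l found
            return v
    return np.full(len(w), cap)            # only reachable when m = n
```

A linear scan over l is used here for clarity; since the capped prefix grows monotonically in feasibility, a binary search over l (as the text suggests) works as well.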
, ℓ^T ∈ [0, 1]^n, the total expected loss of Algorithm 2 is bounded as follows:

(n - k) Σ_{t=1}^T w^t · ℓ^t ≤ (n - k) (η Σ_{t=1}^T u · ℓ^t + d(u, w^1) - d(u, w^{T+1})) / (1 - exp(-η)),

for any learning rate η > 0 and comparison vector u ∈ B^n_{n-k}.

Proof. The update for ŵ^t in Algorithm 2 is the update of the Continuous Weighted Majority, for which the following basic inequality is known (essentially [LW94], Lemma 5.3):

d(u, w^t) - d(u, ŵ^t) ≤ -η u · ℓ^t + w^t · ℓ^t (1 - exp(-η)).   (1)

The weight vector w^{t+1} is a Bregman projection of the vector ŵ^t onto the convex set B^n_{n-k}. For such projections the Generalized Pythagorean Theorem holds (see e.g. [HW01] for details):

d(u, ŵ^t) ≥ d(u, w^{t+1}) + d(w^{t+1}, ŵ^t).

Since Bregman divergences are non-negative, we can drop the d(w^{t+1}, ŵ^t) term and get the following inequality: d(u, ŵ^t) - d(u, w^{t+1}) ≥ 0, for u ∈ B^n_{n-k}. Adding this to the previous inequality we get:

d(u, w^t) - d(u, w^{t+1}) ≤ -η u · ℓ^t + w^t · ℓ^t (1 - exp(-η)).

By summing over t, multiplying by n - k, and dividing by 1 - exp(-η), the bound follows.

² The original Weighted Majority algorithms were described for the absolute loss. The idea of using loss vectors instead was introduced in [FS97].

4 On-line PCA
In this context (matrix) corners are density matrices with m eigenvalues equal to 1/m and the rest 0. Again the set A^n_m consists of all convex combinations of such corners. The maximum eigenvalue of a convex combination of symmetric matrices is at most as large as the maximum eigenvalue of any of the matrices ([Bha97], Corollary III.2.2). Therefore each convex combination of corners is a density matrix whose eigenvalues are bounded by 1/m, and A^n_m ⊆ B^n_m, where B^n_m now consists of all density matrices whose maximum eigenvalue is at most 1/m. Assume we have some density matrix W ∈ B^n_m with eigendecomposition W = W̃ diag(ω) W̃^T. Algorithm 1 can be applied to the vector ω of eigenvalues of this density matrix.
The output convex combination of up to n diagonal corners, ω = Σ_j p_j r^j, can be turned into a convex combination of matrix corners that expresses the density matrix: W = Σ_j p_j W̃ diag(r^j) W̃^T. It follows that A^n_m = B^n_m as in the diagonal case.

Theorem 3. For any symmetric matrix S, min_{W ∈ B^n_m} tr(W S) attains its minimum at the following matrix corner: greedily choose orthonormal eigenvectors of S of minimum eigenvalue (m times).

Proof. Let λ(W) denote the vector of eigenvalues of W in descending order and let λ↑(S) be the vector of eigenvalues of S in ascending order. Since both matrices are symmetric, tr(W S) ≥ λ(W) · λ↑(S) ([MO79], Fact H.1.h of Chapter 9). Since λ(W) ∈ B^n_m, the dot product is minimized and the inequality is tight when W is an m-corner corresponding to the m smallest eigenvalues of S. Also the greedy algorithm finds this solution (see Fact 1 of this paper).

Algorithm 2 generalizes to the matrix setting. The Weighted Majority update is replaced by the corresponding matrix version, which employs the matrix exponential and matrix logarithm [WK06]. (The update can be seen as a special case of the Matrix Exponentiated Gradient update [TRW05].) The following theorem shows that for the projection we can keep the eigensystem fixed. Here Δ(U, W) denotes the quantum relative entropy tr(U (log U - log W)).

Theorem 4. Projecting a density matrix onto B^n_m w.r.t. the quantum relative entropy is equivalent to projecting the vector of eigenvalues w.r.t. the \"normal\" relative entropy: If W has the eigendecomposition W = W̃ diag(ω) W̃^T, then

argmin_{U ∈ B^n_m} Δ(U, W) = W̃ diag(u*) W̃^T, where u* = argmin_{u ∈ B^n_m} d(u, ω).

Proof. If λ(S) denotes the vector of eigenvalues of a symmetric matrix S arranged in descending order, then tr(S T) ≤ λ(S) · λ(T) ([MO79], Fact H.1.g of Chapter 9). This implies that tr(U log W) ≤ λ(U) · λ(log W) and Δ(U, W) ≥ d(λ(U), λ(W)). Therefore min_{U ∈ B^n_m} Δ(U, W) ≥ min_{u ∈ B^n_m} d(u, ω), and if u* minimizes the r.h.s. then W̃ diag(u*) W̃^T minimizes the l.h.s.
because Δ(W̃ diag(u*) W̃^T, W) = d(u*, ω).

Algorithm 3 On-line PCA algorithm
input: 1 ≤ k < n and an initial density matrix W^1 ∈ B^n_{n-k}
for t = 1 to T do
  Perform the eigendecomposition W^t = W̃ diag(ω) W̃^T
  Decompose ω as Σ_j p_j r^j with Algorithm 1, where m = n - k
  Draw a corner r = r^j with probability p_j
  Form a matrix corner R = W̃ diag(r) W̃^T
  Form a rank k projection matrix P^t = I - (n - k) R
  Receive data instance vector x^t
  Incur loss ||x^t - P^t x^t||_2^2 = tr((I - P^t) x^t (x^t)^T)
  Ŵ^t := exp(log W^t - η x^t (x^t)^T) / Z, where Z normalizes the trace to 1
  W^{t+1} := argmin_{W ∈ B^n_{n-k}} Δ(W, Ŵ^t)
end for

The expected loss in trial t of this algorithm is (n - k) tr(W^t x^t (x^t)^T).

Theorem 5. For an arbitrary sequence of data instances x^1, ..., x^T of 2-norm at most one, the total expected loss of the algorithm is bounded as follows:

(n - k) Σ_{t=1}^T tr(W^t x^t (x^t)^T) ≤ (n - k) (η Σ_{t=1}^T tr(U x^t (x^t)^T) + Δ(U, W^1) - Δ(U, W^{T+1})) / (1 - exp(-η)),

for any learning rate η > 0 and comparator density matrix U ∈ B^n_{n-k}.³

Proof. The update for Ŵ^t is a density matrix version of the standard Weighted Majority update, which was used for variance minimization along a single direction (i.e. k = n - 1) in [WK06]. The basic inequality (1) for that update becomes:

Δ(U, W^t) - Δ(U, Ŵ^t) ≤ -η tr(U x^t (x^t)^T) + tr(W^t x^t (x^t)^T) (1 - exp(-η)).

As in the proof of Theorem 2 of this paper, the Generalized Pythagorean Theorem applies, and dropping one term we get the following inequality: Δ(U, Ŵ^t) - Δ(U, W^{t+1}) ≥ 0, for U ∈ B^n_{n-k}. Adding this to the previous inequality we get:

Δ(U, W^t) - Δ(U, W^{t+1}) ≤ -η tr(U x^t (x^t)^T) + tr(W^t x^t (x^t)^T) (1 - exp(-η)).

By summing over t, multiplying by n - k, and dividing by 1 - exp(-η), the bound follows.

It is easy to see that (n - k) Δ(U, W^1) ≤ (n - k) log(n/(n-k)) for the uniform initial density matrix W^1 = I/n. If k ≤ n/2, then this is further bounded by k log(n/k). Thus, the r.h.s.
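The multiplicative update step of Algorithm 3 can be sketched via eigendecompositions of the symmetric matrices involved (our illustration; the helper names are made up, W^t is assumed to have full support so that log W^t exists, and the subsequent projection of the eigenvalues back onto B^n_{n-k}, which Theorem 4 reduces to the vector case, is omitted):

```python
import numpy as np

def matrix_exp_update(W, x, eta):
    """One density-matrix Weighted Majority step (before capping):
    W_hat = exp(log W - eta * x x^T) / Z, with matrix log/exp computed
    through eigendecompositions of the symmetric arguments."""
    def sym_logm(A):                      # matrix log of a symmetric PD matrix
        vals, vecs = np.linalg.eigh(A)
        return vecs @ np.diag(np.log(vals)) @ vecs.T
    def sym_expm(A):                      # matrix exp of a symmetric matrix
        vals, vecs = np.linalg.eigh(A)
        return vecs @ np.diag(np.exp(vals)) @ vecs.T
    E = sym_expm(sym_logm(W) - eta * np.outer(x, x))
    return E / np.trace(E)                # normalize the trace to 1
```

Starting from the uniform density matrix W^1 = I/n, one such step shrinks the eigenvalue in the direction of the instance x, i.e. the direction that just incurred loss.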
is essentially linear in k, but logarithmic in the dimension n.

By tuning η [CBFH+97, FS97], we can get regret bounds of the form:

(expected total loss of alg.) - (total loss of best k-space) = O( sqrt( (total loss of best k-space) · k log(n/k) ) + k log(n/k) ).   (2)

Using standard but significantly simplified conversion techniques from [CBFH+97] based on the leave-one-out loss, we also obtain algorithms with good regret bounds in the following model: the algorithm is given T - 1 instances drawn from a fixed but unknown distribution and produces a k-space based on those instances; it then receives a new instance from the same distribution. We can bound the expected loss on the last instance:

(expected loss of alg.) - (expected loss of best k-space) = O( sqrt( (expected loss of best k-space) · k log(n/k) / T ) + k log(n/k) / T ).   (3)

5 Lower Bound
The simplest competitor to our on-line PCA algorithm is the algorithm that does standard (uncentered) PCA on all the data points seen so far. In the expert setting this algorithm corresponds to \"projecting\" onto the n - k experts that have minimum loss so far (where ties are broken arbitrarily). When k = n - 1, this becomes the follow-the-leader algorithm. It is easy to construct an adversary strategy for this type of deterministic algorithm (for any k) that forces the on-line algorithm to incur n times as much loss as the off-line algorithm. In contrast, our algorithm is guaranteed to have expected additional loss (regret) of the order of the square root of k ln n times the total loss of the best off-line algorithm. When the instances are diagonal matrices, our algorithm specializes to the standard expert setting, and in that setting there are probabilistic lower bounds that show that our tuned bounds (2,3) are tight [CBFH+97].

6 Simple Experiments
The above lower bounds do not justify our complicated algorithms for on-line PCA, because natural data might be more benign.
However, natural data often shifts, and we constructed a simple dataset of this type (Figure 1). The first 333 20-dimensional points were drawn from a Gaussian distribution with a rank 2 covariance matrix. This is repeated twice more for different covariance matrices of rank 2.

³ The x^t (x^t)^T can be replaced by symmetric matrices S^t whose eigenvalues have range at most one.

Figure 1: The data set used for the experiments. Different colors/symbols denote the data points that came from three different Gaussians with rank 2 covariance matrices. The data vectors are 20-dimensional but we plot only the first 3 dimensions.

Figure 2: The blue curves plot the total loss of the on-line algorithm up to trial t for 50 different runs (with k = 2 and η fixed to one). Note that the variance of the losses is small. The single red curve plots the total loss of the best subspace of dimension 2 for the first t points.

Figure 3: Behavior of the algorithm around a transition point between two distributions. Each ellipse depicts the projection matrix with the largest coefficient in the decomposition of W^t. The transition sequence starts with the algorithm focused on the projection matrix for the first subset of data and ends with essentially the optimal matrix for the second subset. The depicted transition takes about 60 trials.

We compare the total loss of our on-line algorithm with the total loss of the best subspace for the first t data points. During the first 333 data points the latter loss is zero, since the first dataset is 2-dimensional, but after the third dataset is completed, the loss of any fixed off-line comparator is large. Figure 3 depicts how our algorithm transitions between datasets and exploits the on-line nature of the data. Randomly permuting the dataset removes this structure and results in a plot where the total loss of the algorithm is somewhat above that of the off-line comparator (not shown).
Any simple \"windowing algorithm\" would also be able to detect the switches. Such algorithms are often unwieldy and we don't know any strong regret bounds for them. In the expert setting there is however a long line of research on shifting (see e.g. [BW02, HW98]). An algorithm that mixes a little bit of the uniform distribution into the current mixture vector is able to restart when the data switches. More importantly, an algorithm that mixes in a little bit of the past average density matrix is able to switch quickly to previously seen subspaces and to our knowledge windowing techniques cannot exploit this type of switching. Preliminary experiments on face image data indicate that the algorithms that accommodate switching work as expected, but more comprehensive experiments still need to be done.\n\n\f\n7 Conclusions\nWe developed a new set of techniques for low dimensional approximation with provable bounds. Following [TRW05, WK06], we essentially lifted the algorithms and bounds developed for diagonal case to the matrix case. Are there general reductions? The on-line PCA problem was also addressed in [Cra06]. However, that paper does not fully capture the PCA problem because their algorithm predicts with a full-rank matrix in each trial, whereas we predict with a probabilistically chosen projection matrix of the desired rank k . Furthermore, that paper proves bounds on the filtering loss, which are typically easier to prove, and it is not clear how this loss relates to the more standard regret bounds proven in this paper. For the expert setting there are alternate techniques for designing on-line algorithms that do as well as the best subsj t of n - k experts: set {i1 , . . . , in-k } receives weight proportional to e jt exp(- <) = exp(-