{"title": "Unbiased estimates for linear regression via volume sampling", "book": "Advances in Neural Information Processing Systems", "page_first": 3084, "page_last": 3093, "abstract": "Given a full rank matrix X with more columns than rows consider the task of estimating the pseudo inverse $X^+$ based  on the pseudo inverse of a sampled subset of columns (of size at least the number of rows). We show that this is possible if the subset of columns is chosen proportional to the squared volume spanned by the rows of the chosen submatrix (ie, volume sampling). The resulting estimator is unbiased and surprisingly the covariance of the estimator also has a closed form: It equals a specific factor times $X^+X^{+\\top}$.  Pseudo inverse plays an important part in solving the linear least squares problem, where we try to predict a label for each column of $X$. We assume labels are expensive and we are only given the labels for the small subset of columns we sample from $X$. Using our methods we show that the weight vector of the solution for the sub problem is an unbiased estimator of the optimal solution for the whole problem based on all  column labels.   We believe that these new formulas establish a fundamental connection between linear least squares and volume sampling. We use our methods to obtain an algorithm for volume sampling that is faster than state-of-the-art and for obtaining bounds for the total loss of the estimated least-squares solution on all labeled columns.", "full_text": "Unbiased estimates for linear regression\n\nvia volume sampling\n\nMicha\u0142 Derezi\u00b4nski\n\nDepartment of Computer Science\nUniversity of California Santa Cruz\n\nmderezin@ucsc.edu\n\nManfred K. 
Warmuth\n\nDepartment of Computer Science\nUniversity of California Santa Cruz\n\nmanfred@ucsc.edu\n\nAbstract\n\nGiven a full rank matrix X with more columns than rows, consider the task of\nestimating the pseudo inverse $X^+$ based on the pseudo inverse of a sampled subset\nof columns (of size at least the number of rows). We show that this is possible if\nthe subset of columns is chosen proportional to the squared volume spanned by the\nrows of the chosen submatrix (i.e., volume sampling). The resulting estimator is\nunbiased and, surprisingly, the covariance of the estimator also has a closed form: it\nequals a specific factor times $X^{+\\top}X^+$.\nThe pseudo inverse plays an important part in solving the linear least squares problem,\nwhere we try to predict a label for each column of X. We assume labels are\nexpensive and we are only given the labels for the small subset of columns we\nsample from X. Using our methods we show that the weight vector of the solution\nfor the subproblem is an unbiased estimator of the optimal solution for the whole\nproblem based on all column labels.\nWe believe that these new formulas establish a fundamental connection between\nlinear least squares and volume sampling. We use our methods to obtain an\nalgorithm for volume sampling that is faster than the state of the art and to obtain\nbounds for the total loss of the estimated least-squares solution on all labeled\ncolumns.\n\n1 Introduction\n\nLet X be a wide full rank matrix with d rows and n columns where n \u2265 d. Our\ngoal is to estimate the pseudo inverse $X^+$ of X based on the pseudo inverse\nof a subset of columns. More precisely, we sample a subset S \u2286 {1..n} of s\ncolumn indices (where s \u2265 d). We let $X_S$ be the submatrix of the s columns\nindexed by S (see Figure 1). Consider a version of X in which all but the\ncolumns of S are zero. 
This matrix equals $XI_S$ where $I_S$ is an n-dimensional\ndiagonal matrix with $(I_S)_{ii} = 1$ if i \u2208 S and 0 otherwise.\nWe assume that the set of s column indices of X is selected proportional to\nthe squared volume spanned by the rows of submatrix $X_S$, i.e. proportional\nto $\\det(X_S X_S^\\top)$, and prove a number of new surprising expectation formulas\nfor this type of volume sampling, such as\n\n$$E[(XI_S)^+] = X^+ \\quad\\text{and}\\quad E[(X_S X_S^\\top)^{-1}] = \\frac{n-d+1}{s-d+1}\\, X^{+\\top}X^+, \\quad\\text{where } (X_S X_S^\\top)^{-1} = (XI_S)^{+\\top}(XI_S)^+.$$\n\nFigure 1: Set S may not be consecutive. (Diagram of X, $X_S$, $I_S$, $XI_S$, $X^{+\\top}$, $(XI_S)^{+\\top}$ and $(X_S)^{+\\top}$.)\n\nNote that $(XI_S)^+$ has the n \u00d7 d shape of $X^+$ where the s rows indexed by S contain $(X_S)^+$ and\nthe remaining n \u2212 s rows are zero. The expectation of this matrix is $X^+$ even though $(X_S)^+$ is\nclearly not a submatrix of $X^+$. In addition to the expectation formulas, our new techniques lead\nto an efficient volume sampling procedure which beats the state of the art by a factor of $n^2$ in time\ncomplexity.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nVolume sampling is useful in numerous applications, from clustering to matrix approximation, but we\nfocus on the task of solving linear least squares problems: For an n-dimensional label vector y, let\n$w^* = \\mathrm{argmin}_w \\|X^\\top w - y\\|^2 = X^{+\\top}y$. Assume the entire design matrix X is known to the learner\nbut labels are expensive and we want to observe as few of them as possible. Let $w^*(S) = (X_S)^{+\\top}y_S$\nbe the solution to the subproblem based on labels $y_S$. What is the smallest number of labels s\nfor which there is a sampling procedure on sets S of size s such that the expected loss of $w^*(S)$\nis at most a constant factor larger than the loss of $w^*$ that uses all n labels (where the constant is\nindependent of n)? 
More precisely, using the shorthand $L(w) = \\|X^\\top w - y\\|^2$ for the loss on all n\nlabels, what is the smallest size s such that $E[L(w^*(S))] \\le c\\,L(w^*)$ for some constant c? This question is a version\nof the \u201cminimal coresets\u201d open problem posed in [3].\nThe size has to be at least d and one can show that randomization is necessary in that any deterministic\nalgorithm for choosing a set of d columns can suffer loss larger by a factor of n. Also any i.i.d. sampling\nof S (such as the commonly used leverage scores [8]) requires at least \u2126(d log d) examples to achieve\na finite factor. In this paper, however, we show that with a size d volume sample, $E[L(w^*(S))] =\n(d + 1)L(w^*)$ if X is in general position. Note again that we have equality and not just an upper\nbound. Also we can show that the multiplicative factor d + 1 is optimal. We further improve this\nfactor to 1 + \u03b5 via repeated volume sampling. Moreover, our expectation formulas imply that when S\nis a volume sampled set of size s \u2265 d, then $w^*(S)$ is an unbiased estimator for $w^*$, i.e. $E[w^*(S)] = w^*$.\n\n2 Related work\n\nVolume sampling is an extension of a determinantal point process [15], which has been given a lot of\nattention in the literature with many applications to machine learning, including recommendation\nsystems [10] and clustering [13]. Many exact and approximate methods for efficiently generating\nsamples from this distribution have been proposed [6, 14], making it a useful tool in the design of\nrandomized algorithms. Most of those methods focus on sampling s \u2264 d elements. In this paper,\nwe study volume sampling sets of size s \u2265 d, which has been proposed in [1] and motivated with\napplications in graph theory, linear regression, matrix approximation and more. The only known\npolynomial time algorithm for size s > d volume sampling was recently proposed in [16] with time\ncomplexity $O(n^4 s)$. 
We offer a new algorithm with runtime O((n \u2212 s + d)nd), which is faster by a\nfactor of at least $n^2$.\nThe problem of selecting a subset of input vectors for solving a linear regression task has been\nextensively studied in statistics literature under the terms optimal design [9] and pool-based active\nlearning [19]. Various criteria for subset selection have been proposed, like A-optimality and\nD-optimality. For example, A-optimality seeks to minimize $\\mathrm{tr}((X_S X_S^\\top)^{-1})$, which is combinatorially\nhard to optimize exactly. We show that for size s volume sampling (for s \u2265 d),\n$E[(X_S X_S^\\top)^{-1}] = \\frac{n-d+1}{s-d+1}\\, X^{+\\top}X^+$, which provides an approximate randomized solution for this task.\nA related task has been explored in the field of computational geometry, where efficient algorithms\nare sought for approximately solving linear regression and matrix approximation [17, 5, 3]. Here,\nmultiplicative bounds on the loss of the approximate solution can be achieved via two approaches:\nsubsampling the vectors of the design matrix, and sketching the design matrix X and the label vector\ny by multiplying both by the same suitably chosen random matrix. Algorithms which use sketching\nto generate a smaller design matrix for a given linear regression problem are computationally efficient\n[18, 5], but unlike vector subsampling, they require all of the labels from the original problem to\ngenerate the sketch, so they do not apply directly to our setting of using as few labels as possible.\nThe main competitor to volume sampling for linear regression is i.i.d. sampling using the statistical\nleverage scores [8]. However, we show in this paper that any i.i.d. sampling method requires sample\nsize \u2126(d log d) to achieve multiplicative loss bounds. On the other hand, the input vectors obtained\nfrom volume sampling are selected jointly and this makes the chosen subset more informative. 
We\nshow that just d volume sampled columns are sufficient to achieve a multiplicative bound. Volume\nsampling of size s \u2264 d has also been used in this line of work by [7, 11] for matrix approximation.\n\n3 Unbiased estimators\nLet n be an integer dimension. For each subset S \u2286 {1..n} of size s we are given a matrix formula\nF(S). Our goal is to sample set S of size s using some sampling process and then develop concise\nexpressions for $E_{S:|S|=s}[F(S)]$. Examples of formula classes F(S) will be given below.\nWe represent the sampling by a directed acyclic graph (dag), with a single root node corresponding\nto the full set {1..n}. Starting from the root, we proceed along the edges of the graph, iteratively\nremoving elements from the set S. Concretely, consider a dag with levels s = n, n \u2212 1, ..., d. Level\ns contains $\\binom{n}{s}$ nodes for sets S \u2286 {1..n} of size s. Every node S at level s > d has s directed\nedges to the nodes S \u2212 {i} at the next lower level. These edges are labeled with a conditional\nprobability vector $P(S_{-i}|S)$. The probability of a (directed) path is the product of the probabilities\nalong its edges. The outflow of probability from each node on all but the bottom level is 1. We let the\nprobability P(S) of node S be the probability of all paths from the top node {1..n} to S and set the\nprobability P({1..n}) of the top node to 1. We associate a formula F(S) with each set node S in the\ndag. 
The following key equality lets us compute expectations.\nLemma 1 If for all S \u2286 {1..n} of size greater than d we have\n\n$$F(S) = \\sum_{i \\in S} P(S_{-i}|S)\\, F(S_{-i}),$$\n\nthen for any s \u2208 {d..n}: $E_{S:|S|=s}[F(S)] = \\sum_{S:|S|=s} P(S)F(S) = F(\\{1..n\\})$.\n\nProof It suffices to show that expectations at successive layers are equal:\n\n$$\\sum_{S:|S|=s} P(S)\\, F(S) = \\sum_{S:|S|=s} P(S) \\sum_{i \\in S} P(S_{-i}|S)\\, F(S_{-i}) = \\sum_{T:|T|=s-1} \\underbrace{\\sum_{j \\notin T} P(T_{+j})\\, P(T|T_{+j})}_{P(T)} F(T).$$\n\n3.1 Volume sampling\nGiven a wide full-rank matrix $X \\in \\mathbb{R}^{d \\times n}$ and a sample size s \u2208 {d..n}, volume sampling chooses\nsubset S \u2286 {1..n} of size s with probability proportional to the squared volume spanned by the rows of submatrix\n$X_S$, i.e. proportional to $\\det(X_S X_S^\\top)$. The following corollary uses the above dag setup to compute the\nnormalization constant for this distribution. When s = d, the corollary provides a novel minimalist\nproof for the Cauchy-Binet formula: $\\sum_{S:|S|=s} \\det(X_S X_S^\\top) = \\det(XX^\\top)$.\n\nCorollary 2 Let $X \\in \\mathbb{R}^{d \\times n}$ and S \u2286 {1..n} of size n \u2265 s \u2265 d such that $\\det(X_S X_S^\\top) > 0$. Then for any\nset S of size larger than d and i \u2208 S, define the probability of the edge from S to $S_{-i}$ as:\n\n$$P(S_{-i}|S) := \\frac{\\det(X_{S_{-i}} X_{S_{-i}}^\\top)}{(s-d)\\det(X_S X_S^\\top)} = \\frac{1 - x_i^\\top (X_S X_S^\\top)^{-1} x_i}{s-d} \\qquad \\text{(reverse iterative volume sampling)}$$\n\nwhere $x_i$ is the ith column of X and $X_S$ is the submatrix of columns indexed by S. Then $P(S_{-i}|S)$\nis a proper probability distribution and thus $\\sum_{S:|S|=s} P(S) = 1$ for all s \u2208 {d..n}. 
Furthermore\n\n$$P(S) = \\frac{\\det(X_S X_S^\\top)}{\\binom{n-d}{s-d}\\det(XX^\\top)}. \\qquad \\text{(volume sampling)}$$\n\nProof First, for any node S such that s > d and $\\det(X_S X_S^\\top) > 0$, the probabilities out of S sum to 1:\n\n$$\\sum_{i \\in S} P(S_{-i}|S) = \\sum_{i \\in S} \\frac{1 - \\mathrm{tr}((X_S X_S^\\top)^{-1} x_i x_i^\\top)}{s-d} = \\frac{s - \\mathrm{tr}((X_S X_S^\\top)^{-1} X_S X_S^\\top)}{s-d} = \\frac{s-d}{s-d} = 1.$$\n\nIt remains to show the formula for the probability P(S) of all paths ending at node S. Consider\nany path from the root {1..n} to S. There are (n \u2212 s)! such paths. The fractions of determinants in\nprobabilities along each path telescope$^1$ and the additional factors accumulate to the same product:\nan edge leaving a set of size t contributes the factor 1/(t \u2212 d), for t = n, n \u2212 1, ..., s + 1.\nSo every path from the root to S has the same probability and the total probability into S is\n\n$$\\frac{(n-s)!}{(n-d)(n-d-1) \\cdots (s-d+1)} \\frac{\\det(X_S X_S^\\top)}{\\det(XX^\\top)} = \\frac{1}{\\binom{n-d}{s-d}} \\frac{\\det(X_S X_S^\\top)}{\\det(XX^\\top)}.$$\n\n3.2 Expectation formulas for volume sampling\n\nAll expectations in the remainder of the paper are with respect to volume sampling. We use the shorthand\nE[F(S)] for expectation with volume sampling where the size of the sampled set is fixed to s. The\nexpectation formulas for two choices of F(S) are proven in the next two theorems. By Lemma 1 it\nsuffices to show $F(S) = \\sum_{i \\in S} P(S_{-i}|S)F(S_{-i})$ for volume sampling.\n\nWe introduce a bit more notation first. Recall that $X_S$ is the submatrix of columns indexed by\nS \u2286 {1..n} (see Figure 1). Consider a version of X in which all but the columns of S are zero. This\nmatrix equals $XI_S$ where $I_S$ is an n-dimensional diagonal matrix with $(I_S)_{ii} = 1$ if i \u2208 S and 0\notherwise.\nTheorem 3 Let $X \\in \\mathbb{R}^{d \\times n}$ be a wide full rank matrix (i.e. n \u2265 d). 
For s \u2208 {d..n}, let S \u2286 {1..n} be a\nsize s volume sampled set over X. Then\n\n$$E[(XI_S)^+] = X^+.$$\n\nWe believe that this fundamental formula lies at the core of why volume sampling is important in\nmany applications. In this work, we focus on its application to linear regression. However, [1]\ndiscuss many problems where controlling the pseudo-inverse of a submatrix is essential. For those\napplications, it is important to establish variance bounds for the estimator offered by Theorem 3. In\nthis case, volume sampling once again offers very concrete guarantees. We obtain them by showing\nthe following formula, which can be viewed as a second moment for this estimator.\nTheorem 4 Let $X \\in \\mathbb{R}^{d \\times n}$ be a full-rank matrix and s \u2208 {d..n}. If size s volume sampling over X\nhas full support, then\n\n$$E[\\underbrace{(X_S X_S^\\top)^{-1}}_{(XI_S)^{+\\top}(XI_S)^+}] = \\frac{n-d+1}{s-d+1}\\, \\underbrace{(XX^\\top)^{-1}}_{X^{+\\top}X^+}.$$\n\nIf volume sampling does not have full support then the matrix equality \u201c=\u201d is replaced by the\npositive-definite inequality \u201c$\\preceq$\u201d.\nThe condition that size s volume sampling over X has full support is equivalent to $\\det(X_S X_S^\\top) > 0$\nfor all S \u2286 {1..n} of size s. Note that if size s volume sampling has full support, then size t > s also\nhas full support. 
So full support for the smallest size d (often phrased as X being in general position)\nimplies that volume sampling with respect to any size s \u2265 d has full support.\nSurprisingly, by combining Theorems 3 and 4, we can obtain a \u201ccovariance type formula\u201d for the\npseudo-inverse matrix estimator:\n\n$$E[((XI_S)^+ - E[(XI_S)^+])^\\top ((XI_S)^+ - E[(XI_S)^+])] = E[(XI_S)^{+\\top}(XI_S)^+] - E[(XI_S)^+]^\\top E[(XI_S)^+] = \\frac{n-d+1}{s-d+1} X^{+\\top}X^+ - X^{+\\top}X^+ = \\frac{n-s}{s-d+1} X^{+\\top}X^+. \\qquad (1)$$\n\nTheorem 4 can also be used to obtain an expectation formula for the Frobenius norm $\\|(XI_S)^+\\|_F$ of\nthe estimator:\n\n$$E\\|(XI_S)^+\\|_F^2 = E[\\mathrm{tr}((XI_S)^{+\\top}(XI_S)^+)] = \\frac{n-d+1}{s-d+1} \\|X^+\\|_F^2. \\qquad (2)$$\n\nThis norm formula has been shown in [1], with numerous applications. Theorem 4 can be viewed as\na much stronger pre-trace version of the norm formula. Also our proof techniques are quite different\nand much simpler. Note that if size s volume sampling for X does not have full support then (1)\nbecomes a semi-definite inequality $\\preceq$ between matrices and (2) an inequality between numbers.\n\n$^1$Note that 0/0 determinant ratios are avoided along the path because paths with such ratios always lead to sets\nof probability 0 and in the corollary we only consider paths to nodes S for which $\\det(X_S X_S^\\top) > 0$.\n\nProof of Theorem 3 We apply Lemma 1 with $F(S) = (XI_S)^+$. It suffices to show\n$F(S) = \\sum_{i \\in S} P(S_{-i}|S)F(S_{-i})$ for $P(S_{-i}|S) := \\frac{1 - x_i^\\top (X_S X_S^\\top)^{-1} x_i}{s-d}$, i.e.:\n\n$$(XI_S)^+ = \\sum_{i \\in S} \\frac{1 - x_i^\\top (X_S X_S^\\top)^{-1} x_i}{s-d}\\, \\underbrace{(XI_{S_{-i}})^+}_{(XI_{S_{-i}})^\\top (X_{S_{-i}} X_{S_{-i}}^\\top)^{-1}}.$$\n\nThis is proven by applying Sherman-Morrison to $(X_{S_{-i}} X_{S_{-i}}^\\top)^{-1} = (X_S X_S^\\top - x_i x_i^\\top)^{-1}$ on the rhs:\n\n$$\\sum_{i \\in S} \\frac{1 - x_i^\\top (X_S X_S^\\top)^{-1} x_i}{s-d}\\, ((XI_S)^\\top - e_i x_i^\\top) \\left( (X_S X_S^\\top)^{-1} + \\frac{(X_S X_S^\\top)^{-1} x_i x_i^\\top (X_S X_S^\\top)^{-1}}{1 - x_i^\\top (X_S X_S^\\top)^{-1} x_i} \\right).$$\n\nWe now expand the last two factors into 4 terms. The expectation of the first, $(XI_S)^\\top (X_S X_S^\\top)^{-1}$, is\n$(XI_S)^+$ (which is the lhs) and the expectations of the remaining three terms times s \u2212 d sum to 0\n(using $\\sum_{i \\in S} x_i x_i^\\top = X_S X_S^\\top$ in the middle term):\n\n$$-\\sum_{i \\in S} (1 - x_i^\\top (X_S X_S^\\top)^{-1} x_i)\\, e_i x_i^\\top (X_S X_S^\\top)^{-1} + (XI_S)^\\top (X_S X_S^\\top)^{-1} - \\sum_{i \\in S} e_i (x_i^\\top (X_S X_S^\\top)^{-1} x_i)\\, x_i^\\top (X_S X_S^\\top)^{-1} = 0.$$\n\nProof of Theorem 4 Choose $F(S) = \\frac{s-d+1}{n-d+1} (X_S X_S^\\top)^{-1}$. By Lemma 1 it suffices to show\n$F(S) = \\sum_{i \\in S} P(S_{-i}|S)F(S_{-i})$ for volume sampling:\n\n$$\\frac{s-d+1}{n-d+1} (X_S X_S^\\top)^{-1} = \\sum_{i \\in S} \\frac{1 - x_i^\\top (X_S X_S^\\top)^{-1} x_i}{s-d} \\cdot \\frac{s-d}{n-d+1} (X_{S_{-i}} X_{S_{-i}}^\\top)^{-1}.$$\n\nAfter cancelling the common factors s \u2212 d and n \u2212 d + 1, we show this by applying Sherman-Morrison to $(X_{S_{-i}} X_{S_{-i}}^\\top)^{-1}$ on the rhs:\n\n$$\\sum_{i \\in S} (1 - x_i^\\top (X_S X_S^\\top)^{-1} x_i) \\left( (X_S X_S^\\top)^{-1} + \\frac{(X_S X_S^\\top)^{-1} x_i x_i^\\top (X_S X_S^\\top)^{-1}}{1 - x_i^\\top (X_S X_S^\\top)^{-1} x_i} \\right) = (s-d)(X_S X_S^\\top)^{-1} + (X_S X_S^\\top)^{-1} \\Big( \\sum_{i \\in S} x_i x_i^\\top \\Big) (X_S X_S^\\top)^{-1} = (s-d+1)(X_S X_S^\\top)^{-1}.$$\n\nIf some denominators $1 - x_i^\\top (X_S X_S^\\top)^{-1} x_i$ are zero, then only sum over i for which the denominators\nare positive. In this case the above matrix equality becomes a positive-definite inequality $\\preceq$.\n\n4 Linear regression with few labels\n\nOur main motivation for studying volume sampling came\nfrom asking the following simple question. Suppose we\nwant to solve a d-dimensional linear regression problem\nwith a matrix $X \\in \\mathbb{R}^{d \\times n}$ of input column vectors and a\nlabel vector $y \\in \\mathbb{R}^n$, i.e. find $w \\in \\mathbb{R}^d$ that minimizes the\nleast squares loss $L(w) = \\|X^\\top w - y\\|^2$:\n\n$$w^* \\stackrel{\\text{def}}{=} \\mathrm{argmin}_{w \\in \\mathbb{R}^d} L(w) = X^{+\\top}y,$$\n\nbut the access to label vector y is restricted. We are allowed to pick a subset S \u2286 {1..n} for which the labels $y_i$\n(where i \u2208 S) are revealed to us, and then solve the subproblem $(X_S, y_S)$, obtaining $w^*(S)$. What is\nthe smallest number of labels such that for any X, we can find $w^*(S)$ for which $L(w^*(S))$ is only a\nmultiplicative factor away from $L(w^*)$ (independent of the number of input vectors n)? This question\nwas posed as an open problem by [3]. It is easy to show that we need at least d labels (when X is\nfull-rank), so as to guarantee the uniqueness of solution $w^*(S)$. We use volume sampling to show\nthat d labels are in fact sufficient (proof in Section 4.1).\n\nFigure 2: Unbiased estimator $w^*(S)$ in expectation suffers loss $(d+1)\\,L(w^*)$. (Plot of the loss $L(\\cdot)$ marking $w^*(S_i)$, $w^*(S_j)$ and $w^* = E(w^*(S))$, with loss levels $L(w^*(S_i))$, $L(w^*(S_j))$, $E[L(w^*(S))]$, $d\\,L(w^*)$ and $L(w^*)$.)\n\nTheorem 5 If the input matrix $X \\in \\mathbb{R}^{d \\times n}$ is in general position, then for any label vector $y \\in \\mathbb{R}^n$,\nthe expected square loss (on all n labeled vectors) of the optimal solution $w^*(S)$ for the subproblem\n$(X_S, y_S)$, with the d-element set S obtained from volume sampling, is given by\n\n$$E[L(w^*(S))] = (d+1)\\,L(w^*).$$\n\nIf X is not in general position, then the expected loss is upper-bounded by $(d+1)\\,L(w^*)$.\n\nThe factor d + 1 cannot be improved when selecting only d labels (we omit the proof):\nProposition 6 For any d, there exists a least squares problem (X, y) with d + 1 vectors in $\\mathbb{R}^d$ such\nthat for every d-element index set S \u2286 {1, ..., d + 1}, we have\n\n$$L(w^*(S)) = (d+1)\\,L(w^*).$$\n\nNote that the multiplicative factor in Theorem 5 does not depend on n. It is easy to see that this\ncannot be achieved by any deterministic algorithm (without access to the labels). Namely, suppose\nthat d = 1 and X is a vector of all ones, whereas the label vector y is a vector of all ones except for a\nsingle zero. 
No matter which column index we choose deterministically, if that index corresponds to\nthe label 0, the solution to the subproblem will incur loss $L(w^*(S)) = n\\,L(w^*)$. The fact that volume\nsampling is a joint distribution also plays an essential role in proving Theorem 5. Consider a matrix\nX with exactly d unique linearly independent columns (and an arbitrary number of duplicates). Any\ni.i.d. column sampling distribution (for example leverage score sampling) will require \u2126(d log d)\nsamples to retrieve all d unique columns (i.e. the coupon collector problem), which is necessary to get any\nmultiplicative loss bound.\nThe exact expectation formula for the least squares loss under volume sampling suggests a deep\nconnection between linear regression and this distribution. We can use Theorem 3 to further strengthen\nthat connection. Note that the least squares estimator obtained through volume sampling can be\nwritten as $w^*(S) = (XI_S)^{+\\top}y$. Applying the formula for the expectation of the pseudo-inverse, we conclude\nthat $w^*(S)$ is an unbiased estimator of $w^*$.\nProposition 7 Let $X \\in \\mathbb{R}^{d \\times n}$ be a full-rank matrix and n \u2265 s \u2265 d. Let S \u2286 {1..n} be a size s volume\nsampled set over X. Then, for arbitrary label vector $y \\in \\mathbb{R}^n$, we have\n\n$$E[w^*(S)] = E[(XI_S)^{+\\top}y] = X^{+\\top}y = w^*.$$\n\nFor size s = d volume sampling, the fact that $E[w^*(S)]$ equals $w^*$ can be found in an early paper [2].\nThey give a direct proof based on Cramer\u2019s rule. For us the above proposition is a direct consequence\nof the matrix expectation formula given in Theorem 3 that holds for volume sampling of any size\ns \u2265 d. In contrast, the loss expectation formula of Theorem 5 is limited to sampling of size s = d.\nBounding the loss expectation for s > d remains an open problem. However, we consider a different\nstrategy for extending volume sampling in linear regression. 
Combining Proposition 7 with Theorem\n5 we can compute the variance of predictions generated by volume sampling, and obtain tighter\nmultiplicative loss bounds by sampling multiple d-element subsets $S_1, ..., S_k$ independently.\n\nTheorem 8 Let (X, y) be as in Theorem 5. For k independent size d volume samples $S_1, ..., S_k$,\n\n$$E\\Big[L\\Big(\\frac{1}{k}\\sum_{j=1}^{k} w^*(S_j)\\Big)\\Big] = \\Big(1 + \\frac{d}{k}\\Big) L(w^*).$$\n\nProof Denote $\\hat{y} \\stackrel{\\text{def}}{=} X^\\top w^*$ and $\\hat{y}(S) \\stackrel{\\text{def}}{=} X^\\top w^*(S)$ as the predictions generated by $w^*$ and $w^*(S)$\nrespectively. We perform bias-variance decomposition of the loss of $w^*(S)$ (for size d volume\nsampling):\n\n$$E[L(w^*(S))] = E[\\|\\hat{y}(S) - y\\|^2] = E[\\|\\hat{y}(S) - \\hat{y} + \\hat{y} - y\\|^2] = E[\\|\\hat{y}(S) - \\hat{y}\\|^2] + E[2(\\hat{y}(S) - \\hat{y})^\\top(\\hat{y} - y)] + \\|\\hat{y} - y\\|^2 \\stackrel{(*)}{=} \\sum_{i=1}^{n} E\\big[(\\hat{y}(S)_i - E[\\hat{y}(S)_i])^2\\big] + L(w^*) = \\sum_{i=1}^{n} \\mathrm{Var}[\\hat{y}(S)_i] + L(w^*),$$\n\nwhere (\u2217) follows from Theorem 3. Now, we use Theorem 5 to obtain the total variance of predictions:\n\n$$\\sum_{i=1}^{n} \\mathrm{Var}[\\hat{y}(S)_i] = E[L(w^*(S))] - L(w^*) = d\\,L(w^*).$$\n\nNow the expected loss of the average weight vector with respect to sampling k independent sets $S_1, ..., S_k$ is:\n\n$$E\\Big[L\\Big(\\frac{1}{k}\\sum_{j=1}^{k} w^*(S_j)\\Big)\\Big] = \\sum_{i=1}^{n} \\mathrm{Var}\\Big[\\frac{1}{k}\\sum_{j=1}^{k} \\hat{y}(S_j)_i\\Big] + L(w^*) = \\frac{1}{k^2}\\Big(\\sum_{j=1}^{k} d\\,L(w^*)\\Big) + L(w^*) = \\Big(1 + \\frac{d}{k}\\Big) L(w^*).$$\n\nIt is worth noting that the average weight vector used in Theorem 8 is not expected to perform better\nthan taking the solution to the joint subproblem, $w^*(S_{1:k})$, where $S_{1:k} = S_1 \\cup ... \\cup S_k$. However,\ntheoretical guarantees for that case are not yet available.\n\n4.1 Proof of Theorem 5\n\nWe use the following lemma regarding the leave-one-out loss for linear regression [4]:\nLemma 9 Let $w^*(-i)$ denote the least squares solution for problem $(X_{-i}, y_{-i})$. Then, we have\n\n$$L(w^*) = L(w^*(-i)) - x_i^\\top (XX^\\top)^{-1} x_i\\, \\ell_i(w^*(-i)), \\quad\\text{where}\\quad \\ell_i(w) \\stackrel{\\text{def}}{=} (x_i^\\top w - y_i)^2.$$\n\nWhen X has d + 1 columns and $X_{-i}$ is a full-rank d \u00d7 d matrix, then $L(w^*(-i)) = \\ell_i(w^*(-i))$ and\nLemma 9 leads to the following:\n\n$$\\det(\\tilde{X}\\tilde{X}^\\top) \\stackrel{(1)}{=} \\det(XX^\\top) \\overbrace{\\|\\hat{y} - y\\|^2}^{L(w^*)} \\stackrel{(2)}{=} \\det(XX^\\top)(1 - x_i^\\top (XX^\\top)^{-1} x_i)\\, \\ell_i(w^*(-i)) \\stackrel{(3)}{=} \\det(X_{-i} X_{-i}^\\top)\\, \\ell_i(w^*(-i)), \\quad\\text{where } \\tilde{X} = \\begin{pmatrix} X \\\\ y^\\top \\end{pmatrix}, \\qquad (3)$$\n\nwhere (1) is the \u201cbase \u00d7 height\u201d formula for volume, (2) follows from Lemma 9 and (3) follows\nfrom a standard determinant formula. Returning to the proof, our goal is to find the expected loss\n$E[L(w^*(S))]$, where S is a size d volume sampled set. First, we rewrite the expectation as follows:\n\n$$E[L(w^*(S))] = \\sum_{S,|S|=d} P(S)\\, L(w^*(S)) = \\sum_{S,|S|=d} P(S) \\sum_{j=1}^{n} \\ell_j(w^*(S)) = \\sum_{S,|S|=d} \\sum_{j \\notin S} P(S)\\, \\ell_j(w^*(S)) = \\sum_{T,|T|=d+1} \\sum_{j \\in T} P(T_{-j})\\, \\ell_j(w^*(T_{-j})). \\qquad (4)$$\n\nWe now use (3) on the matrix $X_T$ and test instance $x_j$ (assuming $\\mathrm{rank}(X_{T_{-j}}) = d$):\n\n$$P(T_{-j})\\, \\ell_j(w^*(T_{-j})) = \\frac{\\det(X_{T_{-j}} X_{T_{-j}}^\\top)}{\\det(XX^\\top)}\\, \\ell_j(w^*(T_{-j})) = \\frac{\\det(\\tilde{X}_T \\tilde{X}_T^\\top)}{\\det(XX^\\top)}. \\qquad (5)$$\n\nSince the summand does not depend on the index j \u2208 T, the inner summation in (4) becomes a\nmultiplication by d + 1. This lets us write the expected loss as:\n\n$$E[L(w^*(S))] = \\frac{d+1}{\\det(XX^\\top)} \\sum_{T,|T|=d+1} \\det(\\tilde{X}_T \\tilde{X}_T^\\top) \\stackrel{(1)}{=} (d+1) \\frac{\\det(\\tilde{X}\\tilde{X}^\\top)}{\\det(XX^\\top)} \\stackrel{(2)}{=} (d+1)\\, L(w^*), \\qquad (6)$$\n\nwhere (1) follows from the Cauchy-Binet formula and (2) is an application of the \u201cbase \u00d7 height\u201d\nformula. If X is not in general position, then for some summands in (5), $\\mathrm{rank}(X_{T_{-j}}) < d$ and\n$P(T_{-j}) = 0$. Thus the left-hand side of (5) is 0, while the right-hand side is non-negative, so (6)\nbecomes an inequality, completing the proof of Theorem 5.\n\n5 Efficient algorithm for volume sampling\n\nIn this section we propose an algorithm for efficiently performing exact volume sampling for any\ns \u2265 d. This addresses the question posed by [1], asking for a polynomial-time algorithm for the\ncase when s > d. [6, 11] gave an algorithm for the case when s = d, which runs in time $O(nd^3)$.\nRecently, [16] offered an algorithm for arbitrary s, which has complexity $O(n^4 s)$. We propose a\nnew method, which uses our techniques to achieve the time complexity O((n \u2212 s + d)nd), a direct\nimprovement over [16] by a factor of at least $n^2$. Our algorithm also offers an improvement for s = d\nin certain regimes. Namely, when $n = o(d^2)$, then our algorithm runs in time $O(n^2 d) = o(nd^3)$,\nfaster than the method proposed by [6].\nOur algorithm implements reverse iterative sampling from Corollary 2. After removing q columns,\nwe are left with an index set of size n \u2212 q that is distributed according to volume sampling for column\nset size n \u2212 q.\nTheorem 10 The sampling algorithm runs in time O((n \u2212 s + d)nd), using $O(d^2 + n)$ additional\nmemory, and returns set S which is distributed according to size s volume sampling over X.\n\nProof For correctness we show the following invariants that hold at the beginning of the while loop:\n\n$$p_i = 1 - x_i^\\top (X_S X_S^\\top)^{-1} x_i = (|S| - d)\\, P(S_{-i}|S) \\quad\\text{and}\\quad Z = (X_S X_S^\\top)^{-1}.$$\n\nAt the first iteration the invariants trivially hold. 
When updating the $p_j$ we use Z and the $p_i$ from the\nprevious iteration, so we can rewrite the update as\n\n$$p_j \\leftarrow p_j - (x_j^\\top v)^2 = 1 - x_j^\\top (X_S X_S^\\top)^{-1} x_j - \\frac{(x_j^\\top Z x_i)^2}{1 - x_i^\\top (X_S X_S^\\top)^{-1} x_i}$$\n$$= 1 - x_j^\\top (X_S X_S^\\top)^{-1} x_j - \\frac{x_j^\\top (X_S X_S^\\top)^{-1} x_i x_i^\\top (X_S X_S^\\top)^{-1} x_j}{1 - x_i^\\top (X_S X_S^\\top)^{-1} x_i}$$\n$$= 1 - x_j^\\top \\left( (X_S X_S^\\top)^{-1} + \\frac{(X_S X_S^\\top)^{-1} x_i x_i^\\top (X_S X_S^\\top)^{-1}}{1 - x_i^\\top (X_S X_S^\\top)^{-1} x_i} \\right) x_j$$\n$$\\stackrel{(*)}{=} 1 - x_j^\\top (X_{S_{-i}} X_{S_{-i}}^\\top)^{-1} x_j = (|S| - 1 - d)\\, P(S_{-i,j}|S_{-i}),$$\n\nwhere (\u2217) follows from the Sherman-Morrison formula. The update of Z is also an application of\nSherman-Morrison and this concludes the proof of correctness.\nRuntime: Computing the initial $Z = (XX^\\top)^{-1}$ takes $O(nd^2)$, as does computing the initial values\nof the $p_j$. Inside the while loop, updating the $p_j$ takes O(|S|d) = O(nd) and updating Z takes $O(d^2)$.\nThe overall runtime becomes $O(nd^2 + (n - s)nd) = O((n - s + d)nd)$. The space usage (in\naddition to the input data) is dominated by the $p_i$ values and matrix Z.\n\nReverse iterative volume sampling\n\n  Input: $X \\in \\mathbb{R}^{d \\times n}$, s \u2208 {d..n}\n  $Z \\leftarrow (XX^\\top)^{-1}$\n  \u2200i \u2208 {1..n}: $p_i \\leftarrow 1 - x_i^\\top Z x_i$\n  S \u2190 {1, .., n}\n  while |S| > s\n    Sample i \u221d $p_i$ out of S\n    S \u2190 S \u2212 {i}\n    $v \\leftarrow Z x_i / \\sqrt{p_i}$\n    \u2200j \u2208 S: $p_j \\leftarrow p_j - (x_j^\\top v)^2$\n    $Z \\leftarrow Z + vv^\\top$\n  end\n  return S\n\n6 Conclusions\nWe developed exact formulas for $E[(XI_S)^+]$ and $E[((XI_S)^+)^2]$ when the subset S of s column\nindices is sampled proportionally to the volume $\\det(X_S X_S^\\top)$. The formulas hold for any fixed size\ns \u2208 {d..n}. 
These new expectation formulas imply that the solution $w^*(S)$ for a volume sampled\nsubproblem of a linear regression problem is unbiased. We also gave a formula relating the loss of the\nsubproblem to the optimal loss (i.e., $E(L(w^*(S))) = (d + 1) L(w^*)$). However, this result only holds\nfor sample size $s = d$. It is an open problem to obtain such an exact expectation formula for $s > d$.\nA natural algorithm is to draw $k$ samples $S_i$ of size $d$ and return $w^*(S_{1:k})$, where $S_{1:k} = \\bigcup_i S_i$. We\nwere able to get exact expressions for the loss $L(\\frac{1}{k} \\sum_i w^*(S_i))$ of the average predictor, but it is an\nopen problem to get nontrivial bounds for the loss of the best predictor $w^*(S_{1:k})$.\nWe were able to show that for small sample sizes, volume sampling a set jointly has the advantage: it\nachieves a multiplicative bound for the smallest sample size $d$, whereas any independent sampling\nroutine requires sample size at least $\\Omega(d \\log d)$.\nWe believe that our results demonstrate a fundamental connection between volume sampling and\nlinear regression, which demands further exploration. Our loss expectation formula has already been\napplied by [12] to the task of linear regression without correspondence.\n\nAcknowledgements Thanks to Daniel Hsu and Wojciech Kot\u0142owski for many valuable discussions.\nThis research was supported by NSF grant IIS-1619271.\n\nReferences\n[1] Haim Avron and Christos Boutsidis. Faster subset selection for matrices and applications. SIAM Journal on Matrix Analysis and Applications, 34(4):1464\u20131499, 2013.\n\n[2] Aharon Ben-Tal and Marc Teboulle. A geometric property of the least squares solution of linear equations. Linear Algebra and its Applications, 139:165\u2013170, 1990.\n\n[3] Christos Boutsidis, Petros Drineas, and Malik Magdon-Ismail. Rich coresets for constrained linear regression. CoRR, abs/1202.3505, 2012.\n\n[4] Nicolo Cesa-Bianchi and Gabor Lugosi. 
Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA, 2006.\n\n[5] Kenneth L. Clarkson and David P. Woodruff. Low rank approximation and regression in input sparsity time. In Proceedings of the Forty-fifth Annual ACM Symposium on Theory of Computing, STOC \u201913, pages 81\u201390, New York, NY, USA, 2013. ACM.\n\n[6] Amit Deshpande and Luis Rademacher. Efficient volume sampling for row/column subset selection. In Proceedings of the 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, FOCS \u201910, pages 329\u2013338, Washington, DC, USA, 2010. IEEE Computer Society.\n\n[7] Amit Deshpande, Luis Rademacher, Santosh Vempala, and Grant Wang. Matrix approximation and projective clustering via volume sampling. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithm, SODA \u201906, pages 1117\u20131126, Philadelphia, PA, USA, 2006. Society for Industrial and Applied Mathematics.\n\n[8] Petros Drineas, Malik Magdon-Ismail, Michael W. Mahoney, and David P. Woodruff. Fast approximation of matrix coherence and statistical leverage. J. Mach. Learn. Res., 13(1):3475\u20133506, December 2012.\n\n[9] Valeri Vadimovich Fedorov, W.J. Studden, and E.M. Klimko, editors. Theory of optimal experiments. Probability and mathematical statistics. Academic Press, New York, 1972.\n\n[10] Mike Gartrell, Ulrich Paquet, and Noam Koenigstein. Bayesian low-rank determinantal point processes. In Proceedings of the 10th ACM Conference on Recommender Systems, RecSys \u201916, pages 349\u2013356, New York, NY, USA, 2016. ACM.\n\n[11] Venkatesan Guruswami and Ali Kemal Sinop. Optimal column-based low-rank matrix reconstruction. In Proceedings of the Twenty-third Annual ACM-SIAM Symposium on Discrete Algorithms, SODA \u201912, pages 1207\u20131214, Philadelphia, PA, USA, 2012. Society for Industrial and Applied Mathematics.\n\n[12] Daniel Hsu, Kevin Shi, and Xiaorui Sun. 
Linear regression without correspondence. CoRR, abs/1705.07048, 2017.\n\n[13] Byungkon Kang. Fast determinantal point process sampling with application to clustering. In Proceedings of the 26th International Conference on Neural Information Processing Systems, NIPS\u201913, pages 2319\u20132327, USA, 2013. Curran Associates Inc.\n\n[14] Alex Kulesza and Ben Taskar. k-DPPs: Fixed-Size Determinantal Point Processes. In Proceedings of the 28th International Conference on Machine Learning, pages 1193\u20131200. Omnipress, 2011.\n\n[15] Alex Kulesza and Ben Taskar. Determinantal Point Processes for Machine Learning. Now Publishers Inc., Hanover, MA, USA, 2012.\n\n[16] C. Li, S. Jegelka, and S. Sra. Column Subset Selection via Polynomial Time Dual Volume Sampling. ArXiv e-prints, March 2017.\n\n[17] Michael W. Mahoney. Randomized algorithms for matrices and data. Found. Trends Mach. Learn., 3(2):123\u2013224, February 2011.\n\n[18] Tamas Sarlos. Improved approximation algorithms for large matrices via random projections. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science, FOCS \u201906, pages 143\u2013152, Washington, DC, USA, 2006. IEEE Computer Society.\n\n[19] Masashi Sugiyama and Shinichi Nakajima. Pool-based active learning in approximate linear regression. Mach. Learn., 75(3):249\u2013274, June 2009.\n", "award": [], "sourceid": 1748, "authors": [{"given_name": "Michal", "family_name": "Derezinski", "institution": "UC Santa Cruz"}, {"given_name": "Manfred K.", "family_name": "Warmuth", "institution": "Univ. of Calif. at Santa Cruz"}]}