{"title": "One-shot learning and big data with n=2", "book": "Advances in Neural Information Processing Systems", "page_first": 270, "page_last": 278, "abstract": "We model a \"one-shot learning\" situation, where very few (scalar) observations $y_1,...,y_n$ are available. Associated with each observation $y_i$ is a very high-dimensional vector $x_i$, which provides context for $y_i$ and enables us to predict subsequent observations, given their own context. One of the salient features of our analysis is that the problems studied here are easier when the dimension of $x_i$ is large; in other words, prediction becomes easier when more context is provided. The proposed methodology is a variant of principal component regression (PCR). Our rigorous analysis sheds new light on PCR. For instance, we show that classical PCR estimators may be inconsistent in the specified setting, unless they are multiplied by a scalar $c > 1$; that is, unless the classical estimator is expanded. This expansion phenomenon appears to be somewhat novel and contrasts with shrinkage methods ($c < 1$), which are far more common in big data analyses.", "full_text": "One-shot learning and big data with n = 2\n\nLee H. Dicker\nRutgers University\nPiscataway, NJ\nldicker@stat.rutgers.edu\n\nDean P. Foster\nUniversity of Pennsylvania\nPhiladelphia, PA\ndean@foster.net\n\nAbstract\n\nWe model a “one-shot learning” situation, where very few observations y1, ..., yn ∈ R are available. Associated with each observation yi is a very high-dimensional vector xi ∈ R^d, which provides context for yi and enables us to predict subsequent observations, given their own context. One of the salient features of our analysis is that the problems studied here are easier when the dimension of xi is large; in other words, prediction becomes easier when more context is provided. The proposed methodology is a variant of principal component regression (PCR). 
Our rigorous analysis sheds new light on PCR. For instance, we show\nthat classical PCR estimators may be inconsistent in the speci\ufb01ed setting, unless\nthey are multiplied by a scalar c > 1; that is, unless the classical estimator is ex-\npanded. This expansion phenomenon appears to be somewhat novel and contrasts\nwith shrinkage methods (c < 1), which are far more common in big data analyses.\n\n1\n\nIntroduction\n\nThe phrase \u201cone-shot learning\u201d has been used to describe our ability \u2013 as humans \u2013 to correctly\nrecognize and understand objects (e.g. images, words) based on very few training examples [1, 2].\nSuccessful one-shot learning requires the learner to incorporate strong contextual information into\nthe learning algorithm (e.g. information on object categories for image classi\ufb01cation [1] or \u201cfunction\nwords\u201d used in conjunction with a novel word and referent in word-learning [3]). Variants of one-\nshot learning have been widely studied in literature on cognitive science [4, 5], language acquisition\n(where a great deal of relevant work has been conducted on \u201cfast-mapping\u201d) [3, 6\u20138], and computer\nvision [1, 9]. Many recent statistical approaches to one-shot learning, which have been shown to\nperform effectively in a variety of examples, rely on hierarchical Bayesian models, e.g. [1\u20135, 8].\nIn this article, we propose a simple latent factor model for one-shot learning with continuous out-\ncomes. We propose effective methods for one-shot learning in this setting, and derive risk approx-\nimations that are informative in an asymptotic regime where the number of training examples n\nis \ufb01xed (e.g. n = 2) and the number of contextual features for each example d diverges. These\napproximations provide insight into the signi\ufb01cance of various parameters that are relevant for one-\nshot learning. 
One important feature of the proposed one-shot setting is that prediction becomes\n\u201ceasier\u201d when d is large \u2013 in other words, prediction becomes easier when more context is provided.\nBinary classi\ufb01cation problems that are \u201ceasier\u201d when d is large have been previously studied in\nthe literature, e.g. [10\u201312]; this article may contain the \ufb01rst analysis of this kind with continuous\noutcomes.\nThe methods considered in this paper are variants of principal component regression (PCR) [13].\nPrincipal component analysis (PCA) is the cornerstone of PCR. High-dimensional PCA (i.e. large\nd) has been studied extensively in recent literature, e.g. [14\u201322]. Existing work that is especially\nrelevant for this paper includes that of Lee et al. [19], who studied principal component scores in\nhigh dimensions, and work by Hall, Jung, Marron and co-authors [10, 11, 18, 21], who have studied\n\u201chigh dimension, low sample size\u201d data, with \ufb01xed n and d \u2192 \u221e, in a variety of contexts, including\n\n1\n\n\fPCA. While many of these results address issues that are clearly relevant for PCR (e.g. consis-\ntency or inconsistency of sample eigenvalues and eigenvectors in high dimensions), their precise\nimplications for high-dimensional PCR are unclear.\nIn addition to addressing questions about one-shot learning, which motivate the present analysis,\nthe results in this paper provide new insights into PCR in high dimensions. We show that the clas-\nsical PCR estimator is generally inconsistent in the one-shot learning regime, where n is \ufb01xed and\nd \u2192 \u221e. To remedy this, we propose a bias-corrected PCR estimator, which is obtained by expand-\ning the classical PCR estimator (i.e. multiplying it by a scalar c > 1). Risk approximations obtained\nin Section 5 imply that the bias-corrected estimator is consistent when n is \ufb01xed and d \u2192 \u221e. 
These results are supported by a simulation study described in Section 7, where we also consider an “oracle” PCR estimator for comparative purposes. It is noteworthy that the bias-corrected estimator is an expanded version of the classical estimator. Shrinkage, which would correspond to multiplying the classical estimator by a scalar 0 ≤ c < 1, is a far more common phenomenon in high-dimensional data analysis, e.g. [23–25] (however, expansion is not unprecedented; Lee et al. [19] argued for bias-correction via expansion in the analysis of principal component scores).\n\n2 Statistical setting\n\nSuppose that the observed data consists of (y1, x1), ..., (yn, xn), where yi ∈ R is a scalar outcome and xi ∈ R^d is an associated d-dimensional “context” vector, for i = 1, ..., n. Suppose that yi and xi are related via\n\nyi = hi θ + ξi ∈ R,  hi ~ N(0, η²),  ξi ~ N(0, σ²),   (1)\nxi = hi γ √d u + εi ∈ R^d,  εi ~ N(0, τ²I),  i = 1, ..., n.   (2)\n\nThe random variables hi, ξi and the random vectors εi = (εi1, ..., εid)^T, 1 ≤ i ≤ n, are all assumed to be independent; hi is a latent factor linking the outcome yi and the vector xi; ξi and εi are random noise. The unit vector u = (u1, ..., ud)^T ∈ R^d and real numbers θ, γ ∈ R are taken to be non-random. 
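As a concrete reference point, the model (1)-(2) is straightforward to simulate. The sketch below is our illustrative code (the function name, seed, and use of NumPy are our choices); the parameter values match those used in the simulations of Section 7:

```python
# Sketch: simulate one dataset from the single-factor model (1)-(2).
import numpy as np

def simulate(n, d, theta=4.0, eta2=4.0, sigma2=0.1, gamma2=0.25, tau2=1.0, seed=0):
    rng = np.random.default_rng(seed)
    u = np.zeros(d); u[0] = 1.0                      # non-random unit vector u
    h = rng.normal(0.0, np.sqrt(eta2), n)            # latent factors h_i
    y = theta * h + rng.normal(0.0, np.sqrt(sigma2), n)              # (1)
    X = np.outer(h, np.sqrt(gamma2 * d) * u) \
        + rng.normal(0.0, np.sqrt(tau2), (n, d))                     # (2)
    return y, X, u

y, X, u = simulate(n=2, d=500)
print(y.shape, X.shape)  # (2,) (2, 500)
```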
It is implicit in our normalization that the “x-signal” ‖hi γ √d u‖² ≍ d is quite strong. Observe that (yi, xi) ~ N(0, V) are jointly normal with\n\nV = ( θ²η² + σ²      θη²γ√d u^T\n      θη²γ√d u      τ²I + η²γ²d uu^T ).   (3)\n\nTo further simplify notation in what follows, let y = (y1, ..., yn)^T = hθ + ξ ∈ R^n, where h = (h1, ..., hn)^T, ξ = (ξ1, ..., ξn)^T ∈ R^n, and let X = (x1, ..., xn)^T = γ√d hu^T + E, where E = (εij), 1 ≤ i ≤ n, 1 ≤ j ≤ d.\nGiven the observed data (y, X), our objective is to devise prediction rules ŷ : R^d → R so that the risk\n\nRV(ŷ) = EV{ŷ(xnew) − ynew}² = EV{ŷ(xnew) − hnew θ}² + σ²   (4)\n\nis small, where (ynew, xnew) = (hnew θ + ξnew, hnew γ√d u + εnew) has the same distribution as (yi, xi) and is independent of (y, X). The subscript “V” in RV and EV indicates that the parameters θ, η, σ, τ, γ, u are specified by V, as in (3); similarly, we will write PV(·) to denote probabilities with the parameters specified by V.\nWe are primarily interested in identifying methods ŷ that perform well (i.e. RV(ŷ) is small) in an asymptotic regime whose key features are (i) n is fixed, (ii) d → ∞, (iii) σ² → 0, and (iv) inf η²γ²/τ² > 0. We suggest that this regime reflects a one-shot learning setting, where n is small and d is large (captured by (i)-(ii) from the previous sentence), and there is abundant contextual information for predicting future outcomes (which is ensured by (iii)-(iv)). In a specified asymptotic regime (not necessarily the one-shot regime), we say that a prediction method ŷ is consistent if RV(ŷ) → 0. 
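The block structure of (3) can be checked numerically. The following sketch is our code (the small dimension d = 3 and the Section 7 parameter values are illustrative choices): it assembles V and compares it with the empirical covariance of many simulated draws of (yi, xi):

```python
# Sketch: assemble V as in (3) and compare with a Monte Carlo covariance.
import numpy as np

theta, eta2, sigma2, gamma2, tau2, d = 4.0, 4.0, 0.1, 0.25, 1.0, 3
u = np.zeros(d); u[0] = 1.0
top = np.concatenate(([theta**2 * eta2 + sigma2],
                      theta * eta2 * np.sqrt(gamma2 * d) * u))
bottom = np.column_stack((theta * eta2 * np.sqrt(gamma2 * d) * u,
                          tau2 * np.eye(d) + eta2 * gamma2 * d * np.outer(u, u)))
V = np.vstack((top, bottom))                     # (d+1) x (d+1), as in (3)

rng = np.random.default_rng(1)
N = 200_000
h = rng.normal(0.0, np.sqrt(eta2), N)
y = theta * h + rng.normal(0.0, np.sqrt(sigma2), N)
X = np.outer(h, np.sqrt(gamma2 * d) * u) + rng.normal(0.0, np.sqrt(tau2), (N, d))
Z = np.column_stack((y, X))                      # rows are draws of (y_i, x_i)
err = np.max(np.abs(np.cov(Z.T) - V))
print(err)  # small relative to the entries of V (which range up to ~64)
```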
Weak consistency is another type of consistency that is considered below. We say that ŷ is weakly consistent if |ŷ − ynew| → 0 in probability. Clearly, if ŷ is consistent, then it is also weakly consistent.\n\n3 Principal component regression\n\nBy assumption, the data (yi, xi) are multivariate normal. Thus, EV(yi | xi) = xi^T β, where β = θγη²√d u/(τ² + η²γ²d). This suggests studying linear prediction rules of the form ŷ(xnew) = xnew^T β̂, for some estimator β̂ of β. In this paper, we restrict our attention to linear prediction rules, focusing on estimators related to principal component regression (PCR).\nLet l1 ≥ ··· ≥ l(n∧d) ≥ 0 denote the ordered n ∧ d largest eigenvalues of X^T X and let û1, ..., û(n∧d) denote corresponding eigenvectors with unit length; û1, ..., û(n∧d) are also referred to as the “principal components” of X. Let Uk = (û1 ··· ûk) be the d × k matrix with columns given by û1, ..., ûk, for 1 ≤ k ≤ n ∧ d. In its most basic form, principal component regression involves regressing y on XUk for some (typically small) k, and taking β̂ = Uk(Uk^T X^T X Uk)^{-1} Uk^T X^T y. In the problem considered here the predictor covariance matrix Cov(xi) = τ²I + η²γ²d uu^T has a single eigenvalue larger than τ² and the corresponding eigenvector is parallel to β. 
Thus, it is natural to restrict our attention to PCR with k = 1; more explicitly, consider\n\nβ̂pcr = (û1^T X^T y / û1^T X^T X û1) û1 = (1/l1) û1^T X^T y û1.   (5)\n\nIn the following sections, we study consistency and risk properties of β̂pcr and related estimators.\n\n4 Weak consistency and big data with n = 2\n\nBefore turning our attention to risk approximations for PCR in Section 5 below (which contains the paper’s main technical contributions), we discuss weak consistency in the one-shot asymptotic regime, devoting special attention to the case where n = 2. This serves at least two purposes. First, it provides an illustrative warm-up for the more complex risk bounds obtained in Section 5. Second, it will become apparent below that the risk of the consistent PCR methods studied in this paper depends on inverse moments of χ² random variables. For very small n, these inverse moments do not exist and, consequently, the risk of the associated prediction methods may be infinite. The main implication of this is that the risk bounds in Section 5 require n ≥ 9 to ensure their validity. On the other hand, the weak consistency results obtained in this section are valid for all n ≥ 2.\n\n4.1 Heuristic analysis for n = 2\n\nRecall the PCR estimator (5) and let ŷpcr(x) = x^T β̂pcr be the associated linear prediction rule. For n = 2, the largest eigenvalue of X^T X and the corresponding eigenvector are given by simple explicit formulas:\n\nl1 = (1/2){‖x1‖² + ‖x2‖² + √[(‖x1‖² − ‖x2‖²)² + 4(x1^T x2)²]}\n\nand û1 = v̂1/‖v̂1‖2, where\n\nv̂1 = [{‖x1‖² − ‖x2‖² + √[(‖x1‖² − ‖x2‖²)² + 4(x1^T x2)²]}/(2 x1^T x2)] x1 + x2.\n\nThese expressions for l1 and û1 yield an explicit expression for β̂pcr when n = 2 and facilitate a simple heuristic analysis of PCR, which we undertake in this subsection. 
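The n = 2 closed forms are easy to verify against a direct eigendecomposition. The snippet below is our illustrative check (x1, x2 are arbitrary draws, not data from the model):

```python
# Sketch: check the explicit n = 2 formulas for l1 and u-hat_1 against eigh.
import numpy as np

rng = np.random.default_rng(2)
x1, x2 = rng.normal(size=50), rng.normal(size=50)
X = np.vstack((x1, x2))

gap = np.sqrt((x1 @ x1 - x2 @ x2) ** 2 + 4 * (x1 @ x2) ** 2)
l1 = 0.5 * (x1 @ x1 + x2 @ x2 + gap)                          # largest eigenvalue
v1 = (x1 @ x1 - x2 @ x2 + gap) / (2 * x1 @ x2) * x1 + x2
u1 = v1 / np.linalg.norm(v1)                                  # leading eigenvector

w, U = np.linalg.eigh(X.T @ X)   # eigenvalues in ascending order
print(np.isclose(l1, w[-1]), np.isclose(abs(u1 @ U[:, -1]), 1.0))
```

The absolute value in the final comparison accounts for the sign ambiguity of eigenvectors.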
This analysis suggests that ŷpcr is not consistent when σ² → 0 and d → ∞ (at least for n = 2). However, the analysis also suggests that consistency can be achieved by multiplying β̂pcr by a scalar c ≥ 1; that is, by expanding β̂pcr. This observation leads us to consider and rigorously analyze a bias-corrected PCR method, which we ultimately show is consistent in fixed n settings, if σ² → 0 and d → ∞. On the other hand, it will also be shown below that ŷpcr is inconsistent in one-shot asymptotic regimes.\nFor large d, the basic approximations ‖xi‖² ≈ γ²d hi² + τ²d and xi^T xj ≈ γ²d hi hj lead to the following approximation for ŷpcr(xnew):\n\nŷpcr(xnew) = xnew^T β̂pcr ≈ [γ²(h1² + h2²)/{γ²(h1² + h2²) + τ²}] hnew θ + epcr,   (6)\n\nwhere\n\nepcr ≈ [γ√d hnew/{γ²d(h1² + h2²) + τ²d}] û1^T X^T ξ.\n\nThus,\n\nŷpcr(xnew) − ynew ≈ −[τ²/{γ²(h1² + h2²) + τ²}] hnew θ + epcr − ξnew.   (7)\n\nThe second and third terms on the right-hand side in (7), epcr − ξnew, represent a random error that vanishes as d → ∞ and σ² → 0. On the other hand, the first term on the right-hand side in (7), −τ² hnew θ/{γ²(h1² + h2²) + τ²}, is a bias term that is, in general, non-zero when d → ∞ and σ² → 0; in other words ŷpcr is inconsistent. 
This bias is apparent in the expression for ŷpcr(xnew) given in (6); in particular, the first term on the right-hand side of (6) is typically smaller than hnew θ. One way to correct for the bias of ŷpcr is to multiply β̂pcr by\n\nl1/(l1 − l2) ≈ {γ²(h1² + h2²) + τ²}/{γ²(h1² + h2²)} ≥ 1,\n\nwhere\n\nl2 = (1/2){‖x1‖² + ‖x2‖² − √[(‖x1‖² − ‖x2‖²)² + 4(x1^T x2)²]} ≈ τ²d\n\nis the second-largest eigenvalue of X^T X. Define the bias-corrected principal component regression estimator\n\nβ̂bc = {l1/(l1 − l2)} β̂pcr = {1/(l1 − l2)} û1^T X^T y û1\n\nand let ŷbc(x) = x^T β̂bc be the associated linear prediction rule. Then ŷbc(xnew) = xnew^T β̂bc ≈ hnew θ + ebc, where\n\nebc ≈ [γ√d hnew/{γ²d(h1² + h2²)}] û1^T X^T ξ.\n\nOne can check that if d → ∞, σ² → 0 and θ, η², γ², τ² are well-behaved (e.g. contained in a compact subset of (0,∞)), then ŷbc(xnew) − ynew ≈ ebc → 0 in probability; in other words, ŷbc is weakly consistent. Indeed, weak consistency of ŷbc follows from Theorem 1 below. On the other hand, note that E|ebc| = ∞. This suggests that RV(ŷbc) = ∞, which in fact may be confirmed by direct calculation. Thus, when n = 2, ŷbc is weakly consistent, but not consistent.\n\n4.2 Weak consistency for bias-corrected PCR\n\nNow suppose that n ≥ 2 is arbitrary and that d ≥ n. Define the bias-corrected PCR estimator\n\nβ̂bc = {l1/(l1 − ln)} β̂pcr = {1/(l1 − ln)} û1^T X^T y û1   (8)\n\nand the associated linear prediction rule ŷbc(x) = x^T β̂bc. 
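A minimal implementation of β̂pcr and β̂bc follows; this is our sketch, not the authors' code (the function name `pcr_fit` and the one-shot parameter values are illustrative choices). It works with the n × n Gram matrix XX^T, which has the same nonzero spectrum as X^T X:

```python
# Sketch: classical PCR with k = 1, per (5), and the bias-corrected version (8).
import numpy as np

def pcr_fit(y, X):
    w, W = np.linalg.eigh(X @ X.T)       # nonzero eigenvalues of X^T X, ascending
    l1, ln = w[-1], w[0]                 # largest and n-th largest eigenvalues
    u1 = X.T @ W[:, -1] / np.sqrt(l1)    # first principal component of X
    beta_pcr = (u1 @ X.T @ y) / l1 * u1
    factor = l1 / (l1 - ln)              # expansion factor, always >= 1
    return beta_pcr, factor * beta_pcr, factor

# Illustrative one-shot regime: n fixed, d large, sigma^2 small (our choices;
# theta = eta = gamma = tau = 1 for simplicity).
rng = np.random.default_rng(3)
n, d, sigma, tau = 5, 20000, 1e-3, 1.0
u = np.zeros(d); u[0] = 1.0
h = rng.normal(size=n + 1)
y = h[:n] + rng.normal(0.0, sigma, n)
X = np.outer(h[:n], np.sqrt(d) * u) + rng.normal(0.0, tau, (n, d))
x_new = h[n] * np.sqrt(d) * u + rng.normal(0.0, tau, d)
y_new = h[n] + rng.normal(0.0, sigma)

b_pcr, b_bc, factor = pcr_fit(y, X)
print(factor, abs(x_new @ b_pcr - y_new), abs(x_new @ b_bc - y_new))
```

The sign ambiguity of the principal component is harmless here because û1 enters the estimator quadratically.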
The main weak consistency result of the paper is given below.\nTheorem 1. Suppose that n ≥ 2 is fixed and let C ⊆ (0,∞) be a compact set. Let r > 0 be an arbitrary but fixed positive real number. Then\n\nlim_{d→∞, σ²→0} sup_{θ,η,τ,γ∈C; u∈R^d} PV{|ŷbc(xnew) − ynew| > r} = 0.   (9)\n\nOn the other hand,\n\nliminf_{d→∞, σ²→0} inf_{θ,η,τ,γ∈C; u∈R^d} PV{|ŷpcr(xnew) − ynew| > r} > 0.   (10)\n\nA proof of Theorem 1 follows easily upon inspection of the proof of Theorem 2, which may be found in the Supplementary Material. Theorem 1 implies that in the specified fixed n asymptotic setting, bias-corrected PCR is weakly consistent (9) and that the more standard PCR method ŷpcr is inconsistent (10). Note that the condition θ, η, τ, γ ∈ C in (9) ensures that the x-data signal-to-noise ratio η²γ²/τ² is bounded away from 0. In (8), it is noteworthy that l1/(l1 − ln) ≥ 1: in order to achieve (weak) consistency, the bias-corrected estimator β̂bc is obtained by expanding β̂pcr. By contrast, shrinkage is a far more common method for obtaining improved estimators in many regression and prediction settings (the literature on shrinkage estimation is vast, perhaps beginning with [23]).\n\n5 Risk approximations and consistency\n\nIn this section, we present risk approximations for ŷpcr and ŷbc that are valid when n ≥ 9. A more careful analysis may yield approximations that are valid for smaller n; however, this is not pursued further here.\nTheorem 2. 
Let Wn \u223c \u03c72\n\nn be a chi-squared random variable with n degrees of freedom.\n\n(a) If n \u2265 9 and d \u2265 1, then\n\nRV (\u02c6ypcr) = \u03c32\n\n(cid:20)\n\n(b) If d \u2265 n \u2265 9, then\n\n(cid:40)\n\nRV (\u02c6ybc) = \u03c32\n\n1 + E\n\n1 + E\n\n+\u03b82\u03b72EV\n\n\u03b72\u03b32\n\n(cid:32)\n\u03b72\u03b32Wn + \u03c4 2(cid:112)n/d\n(cid:26) l1\n(cid:40)\n\n(uT \u02c6u1)2 \u2212 1\n\nl1 \u2212 ln\n\n(cid:27)(cid:21)\n\n+ O\n\n(cid:40)\n\n(\u03b72\u03b32Wn + \u03c4 2)2\n\n\u03b74\u03b34Wn\n\n(cid:26)\n(cid:8)(uT \u02c6u1)2 \u2212 1(cid:9)2\n(cid:33)(cid:41)\n(cid:27)2\n\n+ O\n\n(cid:19)\n\nn\n\n+ O\n\n(cid:114) n\n(cid:18) \u03c32\n(cid:19)\n(cid:18) \u03b82\u03b72\u03c4 2\n(cid:32)\n\n\u03b72\u03b32d + \u03c4 2\n\n.\n\nd + n\n\n\u03b72\u03b32n + \u03c4 2(cid:112)n/d\n\n\u03c4 2\n\n\u03c32\u221a\ndn\n\n(cid:32)\n(cid:40)\n\n\u03b72\u03b32 + \u03c4 2\n\n\u03b72\u03b32Wn + \u03c4 2(cid:112)n/d\n\u03b72\u03b32n + \u03c4 2(cid:112)n/d\n\n\u03c4 2\n\n+\n\n(cid:33)(cid:41)\n(\u03b72\u03b32n + \u03c4 2(cid:112)n/d)2\n\n\u03c4 4\n\n(11)\n\n(12)\n\n(cid:41)(cid:35)\n\n.\n\n(cid:33)(cid:41)\n\n+\u03b82\u03b72EV\n\n+\n\n\u03b82\u03b72\u03c4 2\n\n\u03b72\u03b32d + \u03c4 2\n\n(cid:34)\n\n+O\n\n\u03b82\u03b72\u03c4 2\n\n\u03b72\u03b32d + \u03c4 2\n\nd\n\n1 + E\n\n(cid:114) n\n\nA proof of Theorem 2 (along with intermediate lemmas and propositions) may be found in the\nSupplementary Material. 
The necessity of the more complex error term in Theorem 2 (b) (as opposed to that in part (a)) will become apparent below.\nWhen d is large, σ² is small, and θ, η, τ, γ ∈ C, for some compact subset C ⊆ (0,∞), Theorem 2 suggests that\n\nRV(ŷpcr) ≈ θ²η² EV[{(u^T û1)² − 1}²],\nRV(ŷbc) ≈ θ²η² EV[{l1/(l1 − ln) · (u^T û1)² − 1}²].\n\nThus, consistency of ŷpcr and ŷbc in the one-shot regime hinges on asymptotic properties of EV{(u^T û1)² − 1}² and EV{l1/(l1 − ln)(u^T û1)² − 1}². The following proposition is proved in the Supplementary Material.\nProposition 1. Let Wn ~ χ²_n be a chi-squared random variable with n degrees of freedom.\n\n(a) If n ≥ 9 and d ≥ 1, then\n\nEV[{(u^T û1)² − 1}²] = E{τ²/(η²γ²Wn + τ²)}² + O(√(n/(d + n))).\n\n(b) If d ≥ n ≥ 9, then\n\nEV[{l1/(l1 − ln) · (u^T û1)² − 1}²] = O{(η²γ²n + τ²√(n/d))²/τ⁴ · n/d}.\n\nProposition 1 (a) implies that in the one-shot regime, EV{(u^T û1)² − 1}² → E{τ²/(η²γ²Wn + τ²)}² ≠ 0; by Theorem 2 (a), it follows that ŷpcr is inconsistent. On the other hand, Proposition 1 (b) implies that EV{l1/(l1 − ln)(u^T û1)² − 1}² → 0 in the one-shot regime; thus, by Theorem 2 (b), ŷbc is consistent. These results are summarized in Corollary 1, which follows immediately from Theorem 2 and Proposition 1.\nCorollary 1. Suppose that n ≥ 9 is fixed and let C ⊆ (0,∞) be a compact set. Let Wn ~ χ²_n be a chi-squared random variable with n degrees of freedom. 
Then\n\nlim_{d→∞, σ²→0} sup_{θ,η,τ,γ∈C; u∈R^d} | RV(ŷpcr) − θ²η² E{τ²/(η²γ²Wn + τ²)}² | = 0\n\nand\n\nlim_{d→∞, σ²→0} sup_{θ,η,τ,γ∈C; u∈R^d} RV(ŷbc) = 0.\n\nFor fixed n, and inf η²γ²/τ² > 0, the bound in Proposition 1 (b) is of order 1/d. This suggests that both terms (11)-(12) in Theorem 2 (b) have similar magnitude and, consequently, are both necessary to obtain accurate approximations for RV(ŷbc). (It may be desirable to obtain more accurate approximations for EV{l1/(l1 − ln)(u^T û1)² − 1}²; this could potentially be leveraged to obtain better approximations for RV(ŷbc).) In Theorem 2 (a), the only non-vanishing term in the one-shot approximation for RV(ŷpcr) involves EV{(u^T û1)² − 1}²; this helps to explain the relative simplicity of this approximation, in comparison with Theorem 2 (b).\nTheorem 2 and Proposition 1 give risk approximations that are valid for all d and n ≥ 9. However, as illustrated by Corollary 1, these approximations are most effective in a one-shot asymptotic setting, where n is fixed and d is large. In the one-shot regime, standard concepts, such as sample complexity – roughly, the sample size n required to ensure a certain risk bound – may be of secondary importance. Alternatively, in a one-shot setting, one might be more interested in metrics like “feature complexity”: the number of features d required to ensure a given risk bound. 
Approximate feature complexity for ŷbc is easily computed using Theorem 2 and Proposition 1 (clearly, feature complexity depends heavily on model parameters, such as θ, the y-data noise level σ², and the x-data signal-to-noise ratio η²γ²/τ²).\n\n6 An oracle estimator\n\nIn this section, we discuss a third method related to ŷpcr and ŷbc, which relies on information that is typically not available in practice. Thus, this method is usually non-implementable; however, we believe it is useful for comparative purposes.\nRecall that both ŷbc and ŷpcr depend on the first principal component û1, which may be viewed as an estimate of u. If an oracle provides knowledge of u in advance, then it is natural to consider the oracle PCR estimator\n\nβ̂or = (u^T X^T y / u^T X^T X u) u\n\nand the associated linear prediction rule ŷor(x) = x^T β̂or. A basic calculation yields the following result.\nProposition 2. If n ≥ 3, then\n\nRV(ŷor) = {σ² + θ²η²τ²/(η²γ²d + τ²)}{1 + 1/(n − 2)}.\n\nClearly, ŷor is consistent in the one-shot regime: if C ⊆ (0,∞) is compact and n ≥ 3 is fixed, then\n\nlim_{d→∞, σ²→0} sup_{θ,η,τ,γ∈C; u∈R^d} RV(ŷor) = 0.\n\n7 Numerical results\n\nIn this section, we describe the results of a simulation study where we compared the performance of ŷpcr, ŷbc, and ŷor. We fixed θ = 4, σ² = 1/10, η² = 4, γ² = 1/4, τ² = 1, and u = (1, 0, ..., 0) ∈ R^d and simulated 1000 independent datasets with various d, n. Observe that η²γ²/τ² = 1. 
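A scaled-down version of this experiment can be sketched as follows (our code, not the authors': 200 replications instead of 1000, and only the n = 9, d = 500 configuration). The conditional prediction error in the code is the exact expression under model (1)-(2):

```python
# Sketch: compare PCR, bias-corrected PCR, and the oracle on simulated data.
import numpy as np

theta, sigma2, eta2, gamma2, tau2 = 4.0, 0.1, 4.0, 0.25, 1.0
n, d, reps = 9, 500, 200
rng = np.random.default_rng(0)
u = np.zeros(d); u[0] = 1.0
beta = theta * eta2 * np.sqrt(gamma2 * d) * u / (tau2 + eta2 * gamma2 * d)
Sigma_x = tau2 * np.eye(d) + eta2 * gamma2 * d * np.outer(u, u)
resid = sigma2 + theta**2 * eta2 * tau2 / (eta2 * gamma2 * d + tau2)

def cond_pe(b):
    # exact conditional prediction error E[{yhat(x_new) - y_new}^2 | y, X]
    diff = b - beta
    return diff @ Sigma_x @ diff + resid

pe = {"pcr": 0.0, "bc": 0.0, "oracle": 0.0}
for _ in range(reps):
    h = rng.normal(0.0, np.sqrt(eta2), n)
    y = theta * h + rng.normal(0.0, np.sqrt(sigma2), n)
    X = np.outer(h, np.sqrt(gamma2 * d) * u) + rng.normal(0.0, np.sqrt(tau2), (n, d))
    w, W = np.linalg.eigh(X @ X.T)          # nonzero spectrum of X^T X
    l1, ln = w[-1], w[0]
    u1 = X.T @ W[:, -1] / np.sqrt(l1)       # first principal component
    b_pcr = (u1 @ X.T @ y) / l1 * u1
    pe["pcr"] += cond_pe(b_pcr) / reps
    pe["bc"] += cond_pe(l1 / (l1 - ln) * b_pcr) / reps
    pe["oracle"] += cond_pe((u @ X.T @ y) / (u @ X.T @ X @ u) * u) / reps
print(pe)
```

With these settings the averages should land in the vicinity of the n = 9 row of Table 2, up to Monte Carlo error from the reduced replication count.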
For each simulated dataset, we computed β̂pcr, β̂bc, β̂or and the corresponding conditional prediction error\n\nRV(ŷ | y, X) = E[{ŷ(xnew) − ynew}² | y, X] = (β̂ − β)^T (τ²I + η²γ²d uu^T)(β̂ − β) + σ² + θ²η²τ²/(η²γ²d + τ²),\n\nfor ŷ = ŷpcr, ŷbc, ŷor. The empirical prediction error for each method ŷ was then computed by averaging RV(ŷ | y, X) over all 1000 simulated datasets. We also computed the “theoretical” prediction error for each method, using the results from Sections 5-6, where appropriate. More specifically, for ŷpcr and ŷbc, we used the leading terms of the approximations in Theorem 2 and Proposition 1 to obtain the theoretical prediction error; for ŷor, we used the formula given in Proposition 2 (see Table 1 for more details). Finally, we computed the relative error between the empirical prediction error and the theoretical prediction error for each method.\n\nTable 1: Formulas for theoretical prediction error used in simulations (derived from Theorem 2 and Propositions 1-2). 
Expectations in theoretical prediction error expressions for \u02c6ypcr and \u02c6ybc were\ncomputed empirically.\n\n(cid:26)\n\n\u03c32\n\n\u02c6ypcr\n\n\u02c6ybc\n\n\u02c6yor\n\n\u03c32(cid:104)\n\n(cid:18)\n\n1 + E\n\n1 + E\n\n\u03b72\u03b32\n\u03b72\u03b32Wn+\u03c4 2\n\n+ \u03b82\u03b72\u03c4 2\n\u03b72\u03b32d+\u03c4 2\n\nl1\u2212ln\n\n+ \u03b82\u03b72E\n\n(\u03b72\u03b32Wn+\u03c4 2)2\n\n+ \u03b82\u03b72EV\n\n\u03b74\u03b34Wn\n\u221a\n\nn/d\n\n(cid:18)\n\n(cid:111)(cid:105)\n(cid:19)(cid:27)\n\nTheoretical prediction error formula\n\n(cid:16)\n(cid:110)\n(cid:110) l1\n(cid:19)(cid:27)\n(cid:26)\n(cid:17)\n(cid:16)\n(cid:12)(cid:12)(cid:12)(cid:12) (Empirical PE) \u2212 (Theoretical PE)\n\n\u03c32 + \u03b82\u03b72\u03c4 2\n\u03b72\u03b32d+\u03c4 2\n\nn/d\n1 + 1\nn\u22122\n\nEmpirical PE\n\n(cid:17)(cid:16)\n\n\u03b72\u03b32+\u03c4 2\n\n\u221a\n\n\u03b72\u03b32Wn+\u03c4 2\n\n1 + E\n\n(cid:17)2\n\n(cid:111)2\n\n\u03c4 2\n\n\u03b72\u03b32Wn+\u03c4 2\n\n(uT \u02c6u1)2 \u2212 1\n\n(cid:12)(cid:12)(cid:12)(cid:12) \u00d7 100%.\n\nRelative Error =\n\nand the theoretical prediction error for each method,\n\nTable 2: d = 500. Prediction error for \u02c6ypcr (PCR), \u02c6ybc (Bias-corrected PCR), and \u02c6yor (oracle). Rel-\native error for comparing Empirical PE and Theoretical PE is given in parentheses. 
“NA” indicates that Theoretical PE values are unknown.\n\n                        PCR              Bias-corrected PCR   Oracle\nn = 2   Empirical PE    18.7967          4.8668               1.5836\n        Theoretical PE  NA               ∞ (∞)                ∞ (∞)\nn = 4   Empirical PE    6.4639           0.8023               0.3268\n        Theoretical PE  NA               NA                   0.3416 (4.53%)\nn = 9   Empirical PE    1.4187           0.3565               0.2587\n        Theoretical PE  1.2514 (11.79%)  0.2857 (19.86%)      0.2603 (0.62%)\nn = 20  Empirical PE    0.4513           0.2732               0.2398\n        Theoretical PE  0.2987 (33.81%)  0.2497 (8.60%)       0.2404 (0.25%)\n\nThe results of the simulation study are summarized in Tables 2-3. Observe that ŷbc has smaller empirical prediction error than ŷpcr in every setting considered in Tables 2-3, and ŷbc substantially outperforms ŷpcr in most settings. Indeed, the empirical prediction error for ŷbc when n = 9 is smaller than that of ŷpcr when n = 20 (for both d = 500 and d = 5000); in other words, ŷbc outperforms ŷpcr, even when ŷpcr has more than twice as much training data. Additionally, the empirical prediction error of ŷbc is quite close to that of the oracle method ŷor, especially when n is relatively large. These results highlight the effectiveness of the bias-corrected PCR method ŷbc in settings where σ² and n are small, η²γ²/τ² is substantially larger than 0, and d is large.\nFor n = 2, 4, theoretical prediction error is unavailable in some instances. Indeed, while Proposition 2 and the discussion in Section 4 imply that if n = 2, then RV(ŷbc) = RV(ŷor) = ∞, we have not\n\nTable 3: d = 5000. Prediction error for ŷpcr (PCR), ŷbc (Bias-corrected PCR), and ŷor (oracle). Relative error comparing Empirical PE and Theoretical PE is given in parentheses. 
“NA” indicates that Theoretical PE values are unknown.\n\n                        PCR             Bias-corrected PCR   Oracle\nn = 2   Empirical PE    17.9564         2.0192               1.0316\n        Theoretical PE  NA              ∞ (∞)                ∞ (∞)\nn = 4   Empirical PE    6.1220          0.2039               0.1637\n        Theoretical PE  NA              NA                   0.1692 (3.36%)\nn = 9   Empirical PE    1.2274          0.1378               0.1281\n        Theoretical PE  1.2485 (1.72%)  0.1314 (4.64%)       0.1289 (0.62%)\nn = 20  Empirical PE    0.3150          0.1226               0.1189\n        Theoretical PE  0.2997 (4.86%)  0.1200 (2.12%)       0.1191 (0.17%)\n\npursued an expression for RV(ŷpcr) when n = 2 (it appears that RV(ŷpcr) < ∞); furthermore, the approximations in Theorem 2 for RV(ŷpcr), RV(ŷbc) do not apply when n = 4. In instances where theoretical prediction error is available, is finite, and d = 500, the relative error between empirical and theoretical prediction error for ŷpcr and ŷbc ranges from 8.60%-33.81%; for d = 5000, it ranges from 1.72%-4.86%. Thus, the accuracy of the theoretical prediction error formulas tends to improve as d increases, as one would expect. Further improved measures of theoretical prediction error for ŷpcr and ŷbc could potentially be obtained by refining the approximations in Theorem 2 and Proposition 1.\n\n8 Discussion\n\nIn this article, we have proposed bias-corrected PCR for consistent one-shot learning in a simple latent factor model with continuous outcomes. Our analysis was motivated by problems in one-shot learning, as discussed in Section 1. However, the results in this paper may also be relevant for other applications and techniques related to high-dimensional data analysis, such as those involving reproducing kernel Hilbert spaces. Furthermore, our analysis sheds new light on PCR, a long-studied method for regression and prediction.\nMany open questions remain. 
For instance, consider the semi-supervised setting, where additional unlabeled data xn+1, ..., xN is available, but the corresponding yi’s are not provided. Then the additional x-data could be used to obtain a better estimate of the first principal component u and perhaps devise a method whose performance is closer to that of the oracle procedure ŷor (indeed, ŷor may be viewed as a semi-supervised procedure that utilizes an infinite amount of unlabeled data to exactly identify u). Is bias-correction via inflation necessary in this setting? Presumably, bias-correction is not needed if N is large enough, but can this be made more precise? The simulations described in the previous section indicate that ŷbc outperforms the uncorrected PCR method ŷpcr in settings where twice as much labeled data is available for ŷpcr. This suggests that the role of bias-correction will remain significant in the semi-supervised setting, where additional unlabeled data (which is less informative than labeled data) is available. Related questions involving transductive learning [26, 27] may also be of interest for future research.\nA potentially interesting extension of the present work involves multi-factor models. As opposed to the single-factor model (1)-(2), one could consider a more general k-factor model, where yi = hi^T θ + ξi and xi = S hi + εi; here hi = (hi1, ..., hik)^T ∈ R^k is a multivariate normal random vector (a k-dimensional factor linking yi and xi), θ = (θ1, ..., θk)^T ∈ R^k, and S = √d (γ1u1 ··· γkuk) is a d × k matrix, with γ1, ..., γk ∈ R and unit vectors u1, ..., uk ∈ R^d. It may also be of interest to work on relaxing the distributional (normality) assumptions made in this paper. 
Finally, we point out that the results in this paper could potentially be used to develop flexible probit (latent variable) models for one-shot classification problems.

References

[1] L. Fei-Fei, R. Fergus, and P. Perona. One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28:594–611, 2006.

[2] R. Salakhutdinov, J.B. Tenenbaum, and A. Torralba. One-shot learning with a hierarchical nonparametric Bayesian model. JMLR Workshop and Conference Proceedings Volume 27: Unsupervised and Transfer Learning Workshop, 27:195–206, 2012.

[3] M.C. Frank, N.D. Goodman, and J.B. Tenenbaum. A Bayesian framework for cross-situational word-learning. Advances in Neural Information Processing Systems, 20:20–29, 2007.

[4] J.B. Tenenbaum, T.L. Griffiths, and C. Kemp. Theory-based Bayesian models of inductive learning and reasoning. Trends in Cognitive Sciences, 10:309–318, 2006.

[5] C. Kemp, A. Perfors, and J.B. Tenenbaum. Learning overhypotheses with hierarchical Bayesian models. Developmental Science, 10:307–321, 2007.

[6] S. Carey and E. Bartlett. Acquiring a single new word. Proceedings of the Stanford Child Language Conference, 15:17–29, 1978.

[7] L.B. Smith, S.S. Jones, B. Landau, L. Gershkoff-Stowe, and L. Samuelson. Object name learning provides on-the-job training for attention. Psychological Science, 13:13–19, 2002.

[8] F. Xu and J.B. Tenenbaum. Word learning as Bayesian inference. Psychological Review, 114:245–272, 2007.

[9] M. Fink. Object classification from a single example utilizing class relevance metrics. Advances in Neural Information Processing Systems, 17:449–456, 2005.

[10] P. Hall, J.S. Marron, and A. Neeman. Geometric representation of high dimension, low sample size data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67:427–444, 2005.

[11] P. Hall, Y. Pittelkow, and M. Ghosh. Theoretical measures of relative performance of classifiers for high dimensional data with small sample sizes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70:159–173, 2008.

[12] Y.I. Ingster, C. Pouet, and A.B. Tsybakov. Classification of sparse high-dimensional vectors. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 367:4427–4448, 2009.

[13] W.F. Massy. Principal components regression in exploratory statistical research. Journal of the American Statistical Association, 60:234–256, 1965.

[14] I.M. Johnstone. On the distribution of the largest eigenvalue in principal components analysis. Annals of Statistics, 29:295–327, 2001.

[15] D. Paul. Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statistica Sinica, 17:1617–1642, 2007.

[16] B. Nadler. Finite sample approximation results for principal component analysis: A matrix perturbation approach. Annals of Statistics, 36:2791–2817, 2008.

[17] I.M. Johnstone and A.Y. Lu. On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association, 104:682–693, 2009.

[18] S. Jung and J.S. Marron. PCA consistency in high dimension, low sample size context. Annals of Statistics, 37:4104–4130, 2009.

[19] S. Lee, F. Zou, and F.A. Wright. Convergence and prediction of principal component scores in high-dimensional settings. Annals of Statistics, 38:3605–3629, 2010.

[20] Q. Berthet and P. Rigollet. Optimal detection of sparse principal components in high dimension. arXiv preprint arXiv:1202.5070, 2012.

[21] S. Jung, A. Sen, and J.S. Marron. Boundary behavior in high dimension, low sample size asymptotics of PCA. Journal of Multivariate Analysis, 109:190–203, 2012.

[22] Z. Ma. Sparse principal component analysis and iterative thresholding. Annals of Statistics, 41:772–801, 2013.

[23] C. Stein. Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 197–206, 1955.

[24] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58:267–288, 1996.

[25] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2nd edition, 2009.

[26] V.N. Vapnik. Statistical Learning Theory. Wiley, 1998.

[27] K.S. Azoury and M.K. Warmuth. Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43:211–246, 2001.