{"title": "Features as Sufficient Statistics", "book": "Advances in Neural Information Processing Systems", "page_first": 794, "page_last": 800, "abstract": "", "full_text": "Features as Sufficient Statistics\n\nD. Geiger*\nDepartment of Computer Science\nCourant Institute and Center for Neural Science\nNew York University\ngeiger@cs.nyu.edu\n\nA. Rudra\u2020\nDepartment of Computer Science\nCourant Institute\nNew York University\narchi@cs.nyu.edu\n\nL. Maloney\u2021\nDepartments of Psychology and Neural Science\nNew York University\nltm@cns.nyu.edu\n\nAbstract\n\nAn image is often represented by a set of detected features. We get an enormous compression by representing images in this way. Furthermore, we get a representation which is little affected by small amounts of noise in the image. However, features are typically chosen in an ad hoc manner. We show how a good set of features can be obtained using sufficient statistics. The idea of sparse data representation naturally arises. We treat the 1-dimensional and 2-dimensional signal reconstruction problem to make our ideas concrete.\n\n1 Introduction\n\nConsider an image, I, that is the result of a stochastic image-formation process. The process depends on the precise state, f, of an environment. The image, accordingly, contains information about the environmental state f, possibly corrupted by noise. We wish to choose feature vectors $\\phi(I)$ derived from the image that summarize this information concerning the environment. We are not otherwise interested in the contents of the image and wish to discard any information concerning the image that does not depend on the environmental state f. 
\n\nWe develop criteria for choosing sets of features (based on information theory and statistical estimation theory) that extract from the image precisely the information concerning the environmental state.\n\n*Supported by NSF grant 5274883 and AFOSR grants F 49620-96-1-0159 and F 49620-96-1-0028.\n\u2020Partially supported by AFOSR grants F 49620-96-1-0159 and F 49620-96-1-0028.\n\u2021Supported by NIH grant EY08266.\n\n2 Image Formation, Sufficient Statistics and Features\n\nAs above, the image I is the realization of a random process with distribution $P_{\\mathrm{Environment}}(f)$. We are interested in estimating the parameters f of the environmental model given the image (compare [4]). We assume in the sequel that f, the environmental parameters, are themselves a random vector with known prior distribution. Let $\\phi(I)$ denote a feature vector derived from the image I. Initially, we assume that $\\phi(I)$ is a deterministic function of I.\n\nFor any choice of random variables X, Y, define [2] the mutual information of X and Y to be\n\n$$M(X; Y) = \\sum_{X,Y} P(X, Y) \\log \\frac{P(X, Y)}{P(X)\\,P(Y)}.$$\n\nThe information about the environmental parameters contained in the image is then $M(f; I)$, while the information about the environmental parameters contained in the feature vector $\\phi(I)$ is then $M(f; \\phi(I))$. As a consequence of the data processing inequality [2], $M(f; \\phi(I)) \\le M(f; I)$.\n\nA vector $\\phi(I)$ of features is defined to be sufficient if the inequality above is an equality. We will use the terms feature and statistic interchangeably. The definition of a sufficient feature vector above is then just the usual definition of a set of jointly sufficient statistics [2].\n\nTo summarize, a feature vector $\\phi(I)$ captures all the information about the environmental state parameters f precisely when it is sufficient. 
\u00b9\n\nGraded Sufficiency: A feature vector either is or is not sufficient. For every possible feature vector $\\phi(I)$, we define a measure of its failure to be sufficient: $\\mathrm{Suff}(\\phi(I)) = M(f; I) - M(f; \\phi(I))$. This sufficiency measure is always non-negative and it is zero precisely when $\\phi$ is sufficient. We wish to find feature vectors $\\phi(I)$ where $\\mathrm{Suff}(\\phi(I))$ is close to 0. We define $\\phi(I)$ to be $\\epsilon$-sufficient if $\\mathrm{Suff}(\\phi(I)) \\le \\epsilon$. In what follows, we will ordinarily say sufficient when we mean $\\epsilon$-sufficient.\n\nThe above formulation of feature vectors as jointly sufficient statistics, maximizing the mutual information $M(f; \\phi(I))$, can be expressed in terms of the Kullback-Leibler distance between the conditional distributions $P(f \\mid I)$ and $P(f \\mid \\phi(I))$:\n\n$$E_I\\left[ D\\big(P(f \\mid I) \\,\\|\\, P(f \\mid \\phi(I))\\big) \\right] = M(f; I) - M(f; \\phi(I)), \\qquad (1)$$\n\nwhere the symbol $E_I$ denotes the expectation with respect to I, and D denotes the Kullback-Leibler (K-L) distance, defined by $D(f \\| g) = \\sum_x f(x) \\log(f(x)/g(x))$.\u00b2 Thus, we seek feature vectors $\\phi(I)$ such that the conditional distributions $P(f \\mid I)$ and $P(f \\mid \\phi(I))$ are close in the K-L sense, averaged across the set of images. However, this optimization for each image could lead to over-fitting.\n\n\u00b9 An information-theoretic framework has been adopted in neural networks by others; e.g., [5][9][6][1][8]. However, the connection between features and sufficiency is new.\n\n\u00b2 We won't prove the result here. The proof is simple and uses the Markov chain property to say that $P(f, I, \\phi(I)) = P(I, \\phi(I))\\,P(f \\mid I, \\phi(I)) = P(I)\\,P(f \\mid I)$.\n\n3 Sparse Data and Sufficient Statistics\n\nThe notion of sufficient statistics may be described by how much data can be removed without increasing the K-L distance between $P(f \\mid \\phi(I))$ and $P(f \\mid I)$. Let us formulate the approach more precisely, and apply two methods to solve it. 
\n\n3.1 Gaussian Noise Model and Sparse Data\n\nWe are required to construct $P(f \\mid I)$ and $P(f \\mid \\phi(I))$. Note that according to Bayes' rule $P(f \\mid \\phi(I)) = P(\\phi(I) \\mid f)\\,P(f) / P(\\phi(I))$. We will assume that the form of the model $P(f)$ is known. In order to obtain $P(\\phi(I) \\mid f)$ we write $P(\\phi(I) \\mid f) = \\sum_I P(\\phi(I) \\mid I)\\,P(I \\mid f)$.\n\nComputing $P(f \\mid \\phi(I))$: Let us first assume that the generative process of the image I, given the model f, is Gaussian i.i.d., i.e., $P(I \\mid f) = \\prod_i \\frac{1}{\\sqrt{2\\pi\\sigma_i^2}}\\, e^{-(f_i - I_i)^2 / 2\\sigma_i^2}$, where $i = 0, 1, \\ldots, N-1$ indexes the pixels of an image of size N. Further, $P(I_i \\mid f_i)$ is a function of $(f_i - I_i)$ and $I_i$ varies from $-\\infty$ to $+\\infty$, so that the normalization constant does not depend on $f_i$. Then, $P(f \\mid I)$ can be obtained by normalizing $P(f)\\,P(I \\mid f)$:\n\n$$P(f \\mid I) = \\frac{1}{Z} \\Big( \\prod_i e^{-(f_i - I_i)^2 / 2\\sigma_i^2} \\Big) P(f),$$\n\nwhere Z is the normalization constant.\n\nLet us introduce a binary decision variable $s_i \\in \\{0, 1\\}$, which at every image pixel i decides if that image pixel contains \"important\" information or not regarding the model f. Our statistic $\\phi$ is actually a (multivariate) random variable generated from I according to a distribution $P_s(\\phi \\mid I)$ that factorizes as a product over the pixels i. This distribution gives $\\phi_i = I_i$ with probability 1 (Dirac delta function) when $s_i = 0$ (data is kept) and gives $\\phi_i$ uniformly distributed otherwise ($s_i = 1$, data is removed). We then have\n\n$$P_s(\\phi \\mid f) = \\int P(\\phi, I \\mid f)\\, dI = \\int P(I \\mid f)\\, P_s(\\phi \\mid I)\\, dI,$$\n\nwhich is again a product over the pixels i, with a Gaussian factor $\\frac{1}{\\sqrt{2\\pi\\sigma_i^2}}\\, e^{-\\frac{1}{2\\sigma_i^2}(f_i - I_i)^2}$ at each kept pixel and a factor that does not depend on f at each removed pixel.\n\nThe conditional distribution of $\\phi$ on f satisfies the properties that we mentioned in connection with the posterior distribution of f on I. Thus,\n\n$$P_s(f \\mid \\phi) = \\frac{1}{Z_s}\\, P(f) \\Big( \\prod_i e^{-\\frac{1}{2\\sigma_i^2}(f_i - I_i)^2 (1 - s_i)} \\Big), \\qquad (2)$$\n\nwhere $Z_s$ is a normalization constant.\n\nIt is also plausible to extend this model to non-Gaussian ones, by simply modifying the quadratic term $(f_i - I_i)^2$ and keeping the sparse data coefficient $(1 - s_i)$. 
\n\n3.2 Two Methods\n\nWe can now formulate the problem of finding a feature set, or finding a sufficient statistic, in terms of the variables $s_i$ that can remove data. More precisely, we can find s by minimizing\n\n$$E(s, I) = D\\big(P(f \\mid I) \\,\\|\\, P_s(f \\mid \\phi(I))\\big) + \\lambda \\sum_i (1 - s_i). \\qquad (3)$$\n\nIt is clear that the K-L distance is minimized when $s_i = 0$ everywhere and all the data is kept. The second term is added to drive the solution towards a minimal sufficient statistic, where the parameter $\\lambda$ has to be estimated. Note that, for $\\lambda$ very large, all the data is removed ($s_i = 1$), while for $\\lambda = 0$ all the data is kept.\n\nWe can further write (3) as\n\n$$E(s, I) = \\sum_f P(f \\mid I) \\log\\frac{P(f \\mid I)}{P_s(f \\mid \\phi(I))} + \\lambda \\sum_i (1 - s_i)$$\n\n$$= \\sum_f P(f \\mid I) \\log\\Big( \\frac{Z_s}{Z} \\prod_i e^{-\\frac{1}{2\\sigma_i^2}(f_i - I_i)^2 (1 - (1 - s_i))} \\Big) + \\lambda \\sum_i (1 - s_i)$$\n\n$$= \\log\\frac{Z_s}{Z} - E_P\\Big[ \\sum_i \\frac{s_i}{2\\sigma_i^2} (f_i - I_i)^2 \\Big] + \\lambda \\sum_i (1 - s_i),$$\n\nwhere $E_P[\\cdot]$ denotes the expectation taken with respect to the distribution P.\n\nIf we let $s_i$ be a continuous variable, the minimum of $E(s, I)$ will occur when\n\n$$0 = \\frac{\\partial E}{\\partial s_i} = \\big( E_{P_s}[(f_i - I_i)^2] - E_P[(f_i - I_i)^2] \\big) - \\lambda. \\qquad (4)$$\n\nWe note that the Hessian matrix\n\n$$H_s[i, j] = \\frac{\\partial^2 E}{\\partial s_i\\, \\partial s_j} = E_{P_s}[(f_i - I_i)^2 (f_j - I_j)^2] - E_{P_s}[(f_i - I_i)^2]\\, E_{P_s}[(f_j - I_j)^2] \\qquad (5)$$\n\nis a correlation matrix, i.e., it is positive semi-definite. Consequently, E(s) is convex.\n\nContinuation Method on $\\lambda$:\n\nIn order to solve for the optimal vector s we consider the continuation method on the parameter $\\lambda$. We know that s = 0 for $\\lambda = 0$. Then, taking derivatives of (4) with respect to $\\lambda$, we obtain\n\n$$\\frac{\\partial s_j}{\\partial \\lambda} = \\sum_i H_s^{-1}[i, j].$$\n\nThe Hessian must be invertible for this step, i.e., the continuation method works because E is convex. The computations are expected to be mostly spent on estimating the Hessian matrix, i.e., on computing the averages $E_{P_s}[(f_i - I_i)^2 (f_j - I_j)^2]$, $E_{P_s}[(f_i - I_i)^2]$, and $E_{P_s}[(f_j - I_j)^2]$. 
Sometimes these averages can be exactly computed, for example for one-dimensional graph lattices. Otherwise these averages could be estimated via Gibbs sampling.\n\nThe above method can be very slow, since these computations for $H_s$ have to be repeated at each increment in $\\lambda$. We then investigate an alternative direct method.\n\nFigure 1: (a) Complete results for step edge showing the image, the effective variance and the computed s-value (using the continuation method). (b) Complete results for step edge with added noise.\n\nA Direct Method:\n\nOur approach seeks to find a \"large set\" of $s_i = 1$ and to maintain a distribution $P_s(f \\mid \\phi(I))$ close to $P(f \\mid I)$, i.e., to remove as many data points as possible. For this goal, we can investigate the marginal distribution\n\n$$P(f_i \\mid I) = \\int df_0 \\cdots df_{i-1}\\, df_{i+1} \\cdots df_{N-1}\\; P(f \\mid I)$$\n\n$$= \\frac{1}{Z}\\, e^{-\\frac{1}{2\\sigma_i^2}(f_i - I_i)^2} \\int \\Big( \\prod_{j \\ne i} df_j \\Big) P(f) \\Big( \\prod_{j \\ne i} e^{-\\frac{1}{2\\sigma_j^2}(f_j - I_j)^2} \\Big)$$\n\n$$= P_{I_i}(f_i)\\, P_{\\mathrm{eff}}(f_i)$$\n\n(after rearranging the normalization constants), where $P_{\\mathrm{eff}}(f_i)$ is an effective marginal distribution that depends on all the other values of I besides the one at pixel i.\n\nHow to decide if $s_i = 0$ or $s_i = 1$ directly from this marginal distribution $P(f_i \\mid I)$? The entropy of the first term, $H_{I_i}(f_i) = -\\int df_i\\, P_{I_i}(f_i) \\log P_{I_i}(f_i)$, indicates how much $f_i$ is conditioned by the data. The larger the entropy, the less the data constrain $f_i$; thus, there is less need to keep this data. The entropy of the second term, $H_{\\mathrm{eff}}(f_i) = -\\int df_i\\, P_{\\mathrm{eff}}(f_i) \\log P_{\\mathrm{eff}}(f_i)$, works in the opposite direction. The more $f_i$ is constrained by the neighbors, the lesser the entropy and the lesser the need to keep that data point. Thus, the decision to keep the data, $s_i = 0$, is driven by minimizing the \"data\" entropy $H_{I_i}(f_i)$ and maximizing the neighbor entropy $H_{\\mathrm{eff}}(f_i)$. The relevant quantity is $H_{\\mathrm{eff}}(f_i) - H_{I_i}(f_i)$. When this is large, the pixel is kept. 
Later, we will see a case where the second term is constant, and so the effective entropy is maximized. For Gaussian models, the entropy is the logarithm of the variance and the appropriate ratio of variances may be considered.\n\n4 Example: Surface Reconstruction\n\nTo make this approach concrete we apply it to the problem of surface reconstruction. First we consider the 1-dimensional case to conclude that edges are the important features. Then, we apply it to the 2-dimensional case to conclude that junctions followed by edges are the important features.\n\n4.1 1D Case: Edge Features\n\nVarious simplifications and manipulations can be applied for the case that the model f is described by a first-order Markov model, i.e., $P(f) = \\prod_i P_i(f_i, f_{i-1})$. Then the posterior distribution is\n\n$$P(f \\mid I) = \\frac{1}{Z} \\prod_i e^{-\\left[ \\frac{1}{2\\sigma^2}(f_i - I_i)^2 + \\mu_i (f_i - f_{i-1})^2 \\right]},$$\n\nwhere $\\mu_i$ are smoothing coefficients that may vary from pixel to pixel according to how much intensity change occurs at pixel i, e.g., $\\mu_i = \\mu / (1 + \\rho (I_i - I_{i-1})^2)$ with $\\mu$ and $\\rho$ to be estimated. We have assumed that the standard deviation of the noise is homogeneous, to simplify the calculations and analysis of the direct method. Let us now consider both methods, the continuation one and the direct one, to estimate the features.\n\nContinuation Method: Here we apply $\\partial s_j / \\partial \\lambda = \\sum_i H_s^{-1}[i, j]$ by computing $H_s[i, j]$, given by (5), straightforwardly. We use the Baum-Welch method [2] for Markov chains to exactly compute $E_{P_s}[(f_i - I_i)^2 (f_j - I_j)^2]$, $E_{P_s}[(f_i - I_i)^2]$, and $E_{P_s}[(f_j - I_j)^2]$. The final result of this algorithm, applied to step-edge data (and with noise added), is shown in Figure 1. Not surprisingly, the edge data, both pixels, as well as the data boundaries, were the most important data, i.e., the features.\n\nDirect Method: We derive the same result, that edges and boundaries are the most important data, via an analysis of this model. 
We use the result that\n\n$$P(f_i \\mid I) = \\int df_0 \\cdots df_{i-1}\\, df_{i+1} \\cdots df_{N-1}\\; P(f \\mid I) = \\frac{1}{Z_i}\\, e^{-\\frac{1}{2\\sigma^2}(f_i - I_i)^2}\\, e^{-\\lambda_i^N (f_i - r_i^N)^2},$$\n\nwhere $\\lambda_i^N$ is obtained recursively, in $\\log_2 N$ steps (for simplicity, we are assuming N to be an exact power of 2), as follows:\n\n$$\\lambda_i^{2K} = \\left( \\lambda_i^K + \\frac{\\lambda_{i+K}^K\\, \\mu_{i+K}^K}{\\lambda_i^K + \\mu_i^K + \\mu_{i+K}^K} + \\frac{\\lambda_{i-K}^K\\, \\mu_i^K}{\\lambda_i^K + \\mu_i^K + \\mu_{i-K}^K} \\right). \\qquad (6)$$\n\nThe effective variance is given by $\\mathrm{var}_{\\mathrm{eff}}(f_i) = 1/(2\\lambda_i^N)$ while the data variance is given by $\\mathrm{var}_I(f_i) = \\sigma^2$. Since $\\mathrm{var}_I(f_i)$ does not depend on the pixel i, maximizing the ratio $\\mathrm{var}_{\\mathrm{eff}} / \\mathrm{var}_I$ (as the direct method suggested) is equivalent to maximizing either the effective variance or the total variance (see Figure 1).\n\nThus, the lower $\\lambda_i^N$ is, the lower $s_i$ is. We note that $\\lambda_i^K$ increases with K, and $\\mu_i^K$ decreases with K. Consequently $\\lambda^K$ increases less and less as K increases. In a perturbative sense $\\lambda_i^{2}$ (superscript 2, the first step of the recursion) contributes most to $\\lambda_i^N$ and is defined by the two neighboring values $\\mu_i$ and $\\mu_{i+1}$, i.e., by the edge information. The larger the intensity edges are, the smaller the $\\mu_i$ are and, therefore, the smaller $\\lambda_i^{2}$ will be. Thus, we can argue that the pixels i with intensity edges will have smaller values for $\\lambda_i^N$ and therefore are likely to have the data kept as a feature ($s_i = 0$).\n\n4.2 2D Case: Junctions, Corners, and Edge Features\n\nLet us investigate the two-dimensional version of the 1D problem for surface reconstruction. Let us assume the posterior\n\n$$P(f \\mid I) = \\frac{1}{Z}\\, e^{-\\left[ \\frac{1}{2\\sigma^2}(f_{ij} - I_{ij})^2 + \\mu_{ij}^{v} (f_{ij} - f_{i-1,j})^2 + \\mu_{ij}^{h} (f_{ij} - f_{i,j-1})^2 \\right]},$$\n\nwhere $\\mu_{ij}^{v,h}$ are the smoothing coefficients along the vertical and horizontal directions, which vary inversely with $\\nabla I$ along these directions. 
We can then approximately compute (e.g., see [3])\n\n$$P(f_{ij} \\mid I) \\approx \\frac{1}{Z}\\, e^{-\\frac{1}{2\\sigma^2}(f_{ij} - I_{ij})^2}\\, e^{-\\lambda_{ij}^N (f_{ij} - r_{ij}^N)^2},$$\n\nwhere, analogously to the 1D case, we have\n\n$$\\lambda_{ij}^{2K} = \\lambda_{ij}^K + \\frac{\\lambda_{i,j-K}^K\\, \\mu_{ij}^{h,K}}{\\chi_{i,j-K}^K} + \\frac{\\lambda_{i,j+K}^K\\, \\mu_{i,j+K}^{h,K}}{\\chi_{i,j+K}^K} + \\frac{\\lambda_{i-K,j}^K\\, \\mu_{ij}^{v,K}}{\\chi_{i-K,j}^K} + \\frac{\\lambda_{i+K,j}^K\\, \\mu_{i+K,j}^{v,K}}{\\chi_{i+K,j}^K}, \\qquad (7)$$\n\nwhere $\\chi_{ij}^K = \\lambda_{ij}^K + \\mu_{ij}^{h,K} + \\mu_{ij}^{v,K} + \\mu_{i,j+K}^{h,K} + \\mu_{i+K,j}^{v,K}$, and $\\mu_{ij}^{h,2K} = \\mu_{ij}^{h,K}\\, \\mu_{i,j \\pm K}^{h,K} / \\chi_{i,j \\pm K}^K$.\n\nThe larger the effective variance at a site (i, j), the smaller $\\lambda_{ij}^N$ is and the more likely that image portion is to be a feature. The larger the intensity gradient along h, v at (i, j), the smaller $\\mu_{ij}^{h,v}$. The smaller $\\mu_{ij}^{h,v}$ is, the smaller its contribution to $\\lambda^{2}$ will be. In a perturbative sense ([3]) $\\lambda^{2}$ makes the largest contribution to $\\lambda^N$. Thus, the more intensity edges a site has, the larger its effective variance will be. T-junctions will therefore produce very large effective variances, followed by corners, followed by edges. These will be, in order of importance, the features selected to reconstruct 2D surfaces.\n\n5 Conclusion\n\nWe have proposed an approach to specify when a feature set carries sufficient information, so that we can represent the image using it. Thus, one can, in principle, tell what kind of feature is likely to be important in a given model. Two methods of computation have been proposed and a concrete analysis for a simple surface reconstruction was carried out.\n\nReferences\n\n[1] A. Berger, S. Della Pietra and V. Della Pietra. \"A Maximum Entropy Approach to Natural Language Processing\". Computational Linguistics, Vol. 22 (1), pp. 39-71, 1996.\n\n[2] T. Cover and J. Thomas. Elements of Information Theory. Wiley Interscience, New York, 1991.\n\n[3] D. 
Geiger and J. E. Kogler. Scaling Images and Image Feature via the Renormalization Group. In Proc. IEEE Conf. on Computer Vision & Pattern Recognition, New York, NY, 1993.\n\n[4] G. Hinton and Z. Ghahramani. Generative Models for Discovering Sparse Distributed Representations. To appear, Phil. Trans. of the Royal Society B, 1997.\n\n[5] R. Linsker. Self-Organization in a Perceptual Network. Computer, March 1988, 105-117.\n\n[6] J. Principe, U. of Florida at Gainesville. Personal communication.\n\n[7] T. Sejnowski. Computational Models and the Development of Topographic Projections. Trends Neurosci, 10, 304-305.\n\n[8] S.C. Zhu, Y.N. Wu, D. Mumford. Minimax entropy principle and its application to texture modeling. Neural Computation 1996 B.\n\n[9] P. Viola and W.M. Wells III. \"Alignment by Maximization of Mutual Information\". In Proceedings of the International Conference on Computer Vision. Boston, 1995.\n", "award": [], "sourceid": 1334, "authors": [{"given_name": "Davi", "family_name": "Geiger", "institution": null}, {"given_name": "Archisman", "family_name": "Rudra", "institution": null}, {"given_name": "Laurance", "family_name": "Maloney", "institution": null}]}