{"title": "Designed Measurements for Vector Count Data", "book": "Advances in Neural Information Processing Systems", "page_first": 1142, "page_last": 1150, "abstract": "We consider design of linear projection measurements for a vector Poisson signal model. The projections are performed on the vector Poisson rate, $X\\in\\mathbb{R}_+^n$, and the observed data are a vector of counts, $Y\\in\\mathbb{Z}_+^m$. The projection matrix is designed by maximizing mutual information between $Y$ and $X$, $I(Y;X)$. When there is a latent class label $C\\in\\{1,\\dots,L\\}$ associated with $X$, we consider the mutual information with respect to $Y$ and $C$, $I(Y;C)$. New analytic expressions for the gradient of $I(Y;X)$ and $I(Y;C)$ are presented, with gradient performed with respect to the measurement matrix. Connections are made to the more widely studied Gaussian measurement model. Example results are presented for compressive topic modeling of a document corpora (word counting), and hyperspectral compressive sensing for chemical classification (photon counting).", "full_text": "Designed Measurements for Vector Count Data\n\n1Liming Wang, 1David Carlson, 2Miguel Dias Rodrigues, 3David Wilcox,\n\n1Robert Calderbank and 1Lawrence Carin\n\n1Department of Electrical and Computer Engineering, Duke University\n\n2Department of Electronic and Electrical Engineering, University College London\n\n{liming.w, david.carlson, robert.calderbank, lcarin}@duke.edu\n\n3Department of Chemistry, Purdue University\n\nm.rodrigues@ucl.ac.uk\n\nwilcoxds@purdue.edu\n\nAbstract\n\nWe consider design of linear projection measurements for a vector Poisson signal\nmodel. The projections are performed on the vector Poisson rate, X \u2208 Rn\n+, and the\nobserved data are a vector of counts, Y \u2208 Zm\n+ . The projection matrix is designed\nby maximizing mutual information between Y and X, I(Y ; X). When there is\na latent class label C \u2208 {1, . . . 
, L} associated with X, we consider the mutual\ninformation with respect to Y and C, I(Y ; C). New analytic expressions for the\ngradient of I(Y ; X) and I(Y ; C) are presented, with gradient performed with re-\nspect to the measurement matrix. Connections are made to the more widely stud-\nied Gaussian measurement model. Example results are presented for compressive\ntopic modeling of a document corpora (word counting), and hyperspectral com-\npressive sensing for chemical classi\ufb01cation (photon counting).\n\nIntroduction\n\n1\nThere is increasing interest in exploring connections between information and estimation theory. For\nexample, mutual information and conditional mean estimation have been discovered to possess close\ninterrelationships. The derivative of mutual information in a scalar Gaussian channel [11] has been\nexpressed in terms of the minimum mean-squared error (MMSE). The connections have also been\nextended from the scalar Gaussian to the scalar Poisson channel model [12]. The gradient of mutual\ninformation in a vector Gaussian channel [17] has been expressed in terms of the MMSE matrix. It\nhas also been found that the relative entropy can be represented in terms of the mismatched MMSE\nestimates [23, 24]. Recently, parallel results for scalar binomial and negative binomial channels have\nbeen established [22, 10].\nInspired by the Lipster-Shiryaev formula [16], it has been demonstrated that for certain channels\n(or measurement models), investigation of the gradient of mutual information can often lead to a\nrelatively simple formulation, relative to computing mutual information itself. Further, it has been\nshown that the derivative of mutual information with respect to key system parameters also relates to\nthe conditional mean estimates in other channel settings beyond Gaussian and Poisson models [18].\nThis paper pursues this overarching theme for a vector Poisson measurement model. 
Results for scalar Poisson signal models have been developed recently [12, 1] for signal recovery; the vector results presented here are new, with known scalar results recovered as a special case. Further, we consider the gradient of mutual information for Poisson data in the context of classification, for which there are no previous results, even in the scalar case.

The results we present for optimizing mutual information in vector Poisson measurement models are general, and may be applied to optical communication systems [15, 13]. The specific applications that motivate this study are compressive measurements for vector Poisson data. Direct observation of long vectors of counts may be computationally or experimentally expensive, and therefore it is of interest to design compressive Poisson measurements. Almost all existing results for compressive sensing (CS) directly or implicitly assume a Gaussian measurement model [6], and the extension to Poisson measurements represents an important contribution of this paper. To the authors' knowledge, the only previous examination of CS with Poisson data is [20]; that paper considered a single special (random) measurement matrix, it did not consider design of measurement matrices, and the classification problem was not addressed. It has been demonstrated in the context of Gaussian measurements that designed measurement matrices, using information-theoretic metrics, may yield substantially improved performance relative to randomly constituted measurement matrices [7, 8, 21].
In this paper we extend these ideas to vector Poisson measurement systems, for both signal recovery and classification, and make connections to the Gaussian measurement model. The theory is demonstrated by considering compressive topic modeling of a document corpus, and chemical classification with a compressive photon-counting hyperspectral camera [25].

2 Mutual Information for Designed Compressive Measurements

2.1 Motivation

A source random variable X ∈ R^n, with probability density function P_X(X), is sent through a measurement channel, the output of which is characterized by random variable Y ∈ R^m, with conditional probability density function P_{Y|X}(Y|X); we are interested in the case m < n, relevant for compressive measurements, although the theory is general. Concerning P_{Y|X}(Y|X), in this paper we focus on Poisson measurement models, but we also make connections to the much more widely considered Gaussian case. For both the Poisson and Gaussian measurement models the mean of P_{Y|X}(Y|X) is ΦX, where Φ ∈ R^{m×n} is the measurement matrix. For the Poisson case the mean may be modified as ΦX + λ for "dark current" λ ∈ R^m_+, and positivity constraints are imposed on the elements of Φ and X.

Often the source statistics are characterized as a mixture model: P_X(X) = \sum_{c=1}^L π_c P_{X|C}(X|C = c), where π_c > 0 and \sum_{c=1}^L π_c = 1, and C may correspond to a latent class label. In this context, for each draw X there is a latent class random variable C ∈ {1, . . . , L}, where the probability of class c is π_c.

Our goal is to design Φ such that the observed Y is most informative about the underlying X or C. When the interest is in recovering X, we design Φ with the goal of maximizing the mutual information I(X;Y), while when interested in inferring C we design Φ with the goal of maximizing I(C;Y). To motivate use of the mutual information as the design metric, we note several results from the literature. For the case in which we are interested in recovering X from Y, it has been shown [19] that

MMSE ≥ (1/(2πe)) exp{2[h(X) − I(X;Y)]},    (1)

where h(X) is the differential entropy of X and MMSE = E{trace[(X − E(X|Y))(X − E(X|Y))^T]} is the minimum mean-square error.

For the classification problem, we define the Bayesian classification error as P_e = ∫ P_Y(y)[1 − max_c P_{C|Y}(c|y)]dy. It has been shown in [14] that

[H(C|Y) − H(P_e)]/log L ≤ P_e ≤ (1/2) H(C|Y),    (2)

where H(C|Y) = H(C) − I(C;Y), 0 ≤ H(P_e) ≤ 1, and H(·) denotes the entropy of a discrete random variable. By minimizing H(C|Y) we minimize the upper bound on P_e, and since H(C) is independent of Φ, to minimize the upper bound on P_e our goal is to design Φ such that I(C;Y) is maximized.

2.2 Existing results for Gaussian measurements

There are recent results for the gradient of mutual information for vector Gaussian measurements, which we summarize here. Consider the case C ∼ P_C(C), X|C ∼ P_{X|C}(X|C), and Y|X ∼ N(Y; ΦX, Λ^{−1}), where Λ ∈ R^{m×m} is a known precision matrix. Note that P_C and P_{X|C} are arbitrary, while P_{Y|X} = N(Y; ΦX, Λ^{−1}) corresponds to a Gaussian measurement with mean ΦX. It has been established that the gradient of mutual information between the input and the output of the vector Gaussian channel model obeys [17]

∇_Φ I(X;Y) = ΛΦE,    (3)

where E = E[(X − E(X|Y))(X − E(X|Y))^T] denotes the MMSE matrix.
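The Gaussian identity (3) can be verified numerically. The sketch below is our illustration, not code from the paper: for a zero-mean Gaussian source X ∼ N(0, Σ_x) the MMSE matrix E and I(X;Y) = (1/2) log det(I + ΛΦΣ_xΦ^T) both have closed forms, so ΛΦE can be compared against a finite-difference gradient; all names and dimensions below are our own choices.

```python
import numpy as np

def mmse_matrix(Phi, Sigma_x, Lambda):
    """Closed-form MMSE matrix E for X ~ N(0, Sigma_x), Y = Phi X + N(0, Lambda^{-1})."""
    S = Phi @ Sigma_x @ Phi.T + np.linalg.inv(Lambda)   # covariance of Y
    return Sigma_x - Sigma_x @ Phi.T @ np.linalg.solve(S, Phi @ Sigma_x)

def mutual_info(Phi, Sigma_x, Lambda):
    """I(X;Y) = 0.5 log det(I + Lambda Phi Sigma_x Phi^T) for the Gaussian model."""
    m = Phi.shape[0]
    return 0.5 * np.linalg.slogdet(np.eye(m) + Lambda @ Phi @ Sigma_x @ Phi.T)[1]

def grad_mi(Phi, Sigma_x, Lambda):
    """The identity in (3): grad_Phi I(X;Y) = Lambda Phi E."""
    return Lambda @ Phi @ mmse_matrix(Phi, Sigma_x, Lambda)
```

A finite-difference check of `grad_mi` against `mutual_info` (perturbing one entry of Φ at a time) agrees to numerical precision, which is a useful sanity test before moving to the Monte Carlo estimators needed in the Poisson case.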
The gradient of mutual information between the class label and the output for the vector Gaussian channel is [8]

∇_Φ I(C;Y) = ΛΦẼ,    (4)

where Ẽ = E[(E(X|Y,C) − E(X|Y))(E(X|Y,C) − E(X|Y))^T] denotes the equivalent MMSE matrix.

2.3 Conditional-mean estimation

Note from the above discussion that for a Gaussian measurement, ∇_Φ I(X;Y) = E[f(X, E(X|Y))] and ∇_Φ I(C;Y) = E[g(E(X|Y,C), E(X|Y))], where f(·) and g(·) are matrix-valued functions of the respective arguments. These results highlight the connection between the gradient of mutual information with respect to the measurement matrix Φ and conditional-mean estimation, constituted by E(X|Y) and E(X|Y,C). We will see below that these relationships hold as well for the vector Poisson case, with distinct functions f̃(·) and g̃(·).

3 Vector Poisson Data

3.1 Model

The vector Poisson channel model is defined as

Pois(Y; ΦX + λ) = P_{Y|X}(Y|X) = \prod_{i=1}^m P_{Y_i|X}(Y_i|X) = \prod_{i=1}^m Pois(Y_i; (ΦX)_i + λ_i),    (5)

where the random vector X = (X_1, X_2, . . . , X_n) ∈ R^n_+ represents the channel input, the random vector Y = (Y_1, Y_2, . . . , Y_m) ∈ Z^m_+ represents the channel output, Φ ∈ R^{m×n}_+ represents a measurement matrix, and the vector λ = (λ_1, λ_2, . . . , λ_m) ∈ R^m_+ represents the dark current.

The vector Poisson channel model associated with arbitrary m and n is a generalization of the scalar Poisson model, for which m = n = 1 [12, 1].
In the scalar case P_{Y|X}(Y|X) = Pois(Y; φX + λ), where the scalar random variables X ∈ R_+ and Y ∈ Z_+ are associated with the input and output of the scalar channel, respectively, φ ∈ R_+ is a scaling factor, and λ ∈ R_+ is associated with the dark current.

The goal is to design Φ to maximize the mutual information between X and Y. Toward that end, we consider the gradient of mutual information with respect to Φ: ∇_Φ I(X;Y) = [∇_Φ I(X;Y)_{ij}], where ∇_Φ I(X;Y)_{ij} represents the (i,j)-th entry of the matrix ∇_Φ I(X;Y). We also consider the gradient with respect to the vector dark current, ∇_λ I(X;Y) = [∇_λ I(X;Y)_i], where ∇_λ I(X;Y)_i represents the i-th entry of the vector ∇_λ I(X;Y). For a mixture-model source P_X(X) = \sum_{c=1}^L π_c P_{X|C}(X|C = c), for which there is more interest in recovering C than in recovering X, we seek ∇_Φ I(C;Y) and ∇_λ I(C;Y).

3.2 Gradient of Mutual Information for Signal Recovery

In order to take full generality of the input distribution into consideration, we utilize Radon-Nikodym derivatives to represent the probability measures of interest. Consider random variables X ∈ R^n and Y ∈ R^m. Let f^θ_{Y|X} be the Radon-Nikodym derivative of the probability measure P^θ_{Y|X} with respect to an arbitrary measure Q_Y, provided that P^θ_{Y|X} is absolutely continuous with respect to Q_Y, i.e., P^θ_{Y|X} ≪ Q_Y; θ ∈ R is a parameter. Similarly, f^θ_Y is the Radon-Nikodym derivative of the probability measure P^θ_Y with respect to Q_Y, provided that P^θ_Y ≪ Q_Y. Note that in the continuous or discrete case, f^θ_{Y|X} and f^θ_Y are simply probability density or mass functions, with Q_Y chosen to be the Lebesgue measure or the counting measure, respectively.
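Before stating the theorems, note that the channel model in (5) is straightforward to simulate, which is also how the expectations appearing in the results below are approximated in practice (cf. Section 4.3). A minimal sketch, with rate vector, measurement matrix, and dark current chosen arbitrarily for illustration:

```python
import numpy as np

def poisson_channel(X, Phi, lam, rng):
    """Draw Y ~ prod_i Pois(Y_i; (Phi X)_i + lam_i), the vector Poisson model in (5)."""
    rate = Phi @ X + lam          # elementwise non-negative Poisson rates
    return rng.poisson(rate)      # vector of counts Y in Z_+^m

rng = np.random.default_rng(0)
X = np.array([1.0, 2.0, 0.5])                 # channel input (rate vector), n = 3
Phi = np.array([[1.0, 0.0, 1.0],
                [0.0, 1.0, 1.0]])             # non-negative measurement matrix, m = 2
lam = np.array([0.1, 0.2])                    # dark current
Y = poisson_channel(X, Phi, lam, rng)         # one draw of the count vector
```

Averaging many such draws recovers the mean ΦX + λ, consistent with the model definition.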
We note that similar notation is also used for the signal classification case, except that we may also need to condition on both X and C. Some results of the paper require regularity conditions (RC), which are listed in the Supplementary Material. We will assume all four regularity conditions RC1–RC4 whenever necessary in the proofs and statements of the results. Recall [9] that for a function f(x,θ) : R^n × R → R with Lebesgue measure µ on R^n, we have ∂/∂θ ∫ f(x,θ) dµ(x) = ∫ ∂/∂θ f(x,θ) dµ(x) if |∂f(x,θ)/∂θ| ≤ g(x), where g ∈ L^1(µ). Hence, in light of this criterion, it is straightforward to verify that the RC are valid for many common distributions of X. Proofs of the theorems below are provided in the Supplementary Material.

Theorem 1. Consider the vector Poisson channel model in (5). The gradient of mutual information between the input and output of the channel, with respect to the matrix Φ, is given by

[∇_Φ I(X;Y)_{ij}] = [E[X_j log((ΦX)_i + λ_i)] − E[E[X_j|Y] log E[(ΦX)_i + λ_i|Y]]],    (6)

and with respect to the dark current is given by

[∇_λ I(X;Y)_i] = [E[log((ΦX)_i + λ_i)] − E[log E[(ΦX)_i + λ_i|Y]]],    (7)

irrespective of the input distribution P_X(X), provided that the regularity conditions hold.

3.3 Gradient of Mutual Information for Classification

Theorem 2. Consider the vector Poisson channel model in (5) and the mixture signal model.
The gradient with respect to Φ of the mutual information between the class label and the output of the channel is

[∇_Φ I(C;Y)_{ij}] = E[E[X_j|Y,C] log(E[(ΦX)_i + λ_i|Y,C] / E[(ΦX)_i + λ_i|Y])],    (8)

and with respect to the dark current is given by

(∇_λ I(C;Y))_i = E[log(E[(ΦX)_i + λ_i|Y,C] / E[(ΦX)_i + λ_i|Y])],    (9)

irrespective of the input distribution P_{X|C}(X|C), provided that the regularity conditions hold.

3.4 Relationship to known scalar results

It is clear that Theorem 1 represents a multi-dimensional generalization of Theorems 1 and 2 in [12]. The scalar result follows immediately from the vector counterpart by taking m = n = 1.

Corollary 1. For the scalar Poisson channel model P_{Y|X}(Y|X) = Pois(Y; φX + λ), we have

(∂/∂φ) I(X;Y) = E[X log(φX + λ)] − E[E[X|Y] log E[φX + λ|Y]],    (10)

(∂/∂λ) I(X;Y) = E[log(φX + λ)] − E[log E[φX + λ|Y]],    (11)

irrespective of the input distribution P_X(X), provided that the regularity conditions hold.

While the scalar result in [12] for signal recovery is obtained as a special case of our Theorem 1, for recovery of the class label C there are no previous results corresponding to our Theorem 2, even in the scalar case.

3.5 Conditional mean and generalized Bregman divergence

Considering the results in Theorem 1, and recognizing that E[(ΦX) + λ|Y] = ΦE(X|Y) + λ, it is clear that for the Poisson case ∇_Φ I(X;Y) = E[f̃(X, E(X|Y))]. Similarly, for the classification case, ∇_Φ I(C;Y) = E[g̃(E(X|Y,C), E(X|Y))].
The gradient with respect to the dark current λ has no analog in the Gaussian case, but similarly we have ∇_λ I(X;Y) = E[f̃_1(X, E(X|Y))] and ∇_λ I(C;Y) = E[g̃_1(E(X|Y,C), E(X|Y))].

For the scalar Poisson channel in Corollary 1, it has been shown in [1] that (∂/∂φ) I(X;Y) = E[ℓ(X, E(X|Y))], where ℓ(X, E(X|Y)) is defined by the right side of (10), and is related to the Bregman divergence [5, 2].

While beyond the scope of this paper, one may show that f̃(X, E(X|Y)) and g̃(E(X|Y,C), E(X|Y)) may be interpreted as generalized Bregman divergences, where here the generalization is manifested by the fact that these are matrix-valued measures, rather than the scalar one in [1]. Further, for the vector Gaussian case one may also show that f(X, E(X|Y)) and g(E(X|Y,C), E(X|Y)) are generalized Bregman divergences. These facts are primarily of theoretical interest, as they do not affect the way we perform computations. Nevertheless, these theoretical results, through the generalized Bregman divergence, underscore the primacy of the conditional-mean estimators E(X|Y) and E(X|Y,C) within the gradient of mutual information with respect to Φ, for both the Gaussian and Poisson vector measurement models.

Figure 1: Results on the 20 Newsgroups dataset. Random denotes a random binary matrix with 1% non-zero values. Rand-Ortho denotes a random binary matrix restricted to an orthogonal matrix with one non-zero entry per column. Optimized denotes the methods discussed in Section 4.3. Full denotes when each word is observed. The error estimates were obtained by running the algorithm over 10 different random splits of the corpus. (a) Per-word predictive log-likelihood estimate versus the number of projections.
(b) KL divergence versus the number of projections.

4 Applications

4.1 Topic Models

Consider the case for which the Poisson rate vector for document d may be represented X_d = ΨS_d, where X_d ∈ R^n_+, Ψ ∈ R^{n×T}_+ and S_d ∈ R^T_+. Here T represents the number of topics, and in the context of documents, n represents the total number of words in dictionary D. The count of the number of times each of the n words is manifested in document d may often be modeled as Y_d|S_d ∼ Pois(Y_d; ΨS_d); see [26] and the extensive set of references therein.

Rather than counting the number of times each of the n words is separately manifested, we may more efficiently count the number of times words in particular subsets of D are manifested. Specifically, consider a compressive measurement for document d, as Y_d|X_d ∼ Pois(Y_d; ΦX_d), where Φ ∈ {0,1}^{m×n}, with m ≪ n. Let φ_k ∈ {0,1}^n represent the kth row of Φ, with Y_dk the kth component of Y_d. Then Y_dk|X_d ∼ Pois(Y_dk; φ_k^T X_d) is equal in distribution to Y_dk = \sum_{i=1}^n Ỹ_dki, where Ỹ_dki|X_di ∼ Pois(φ_ki X_di), with φ_ki ∈ {0,1} the ith component of φ_k and X_di the ith component of X_d. Therefore, Y_dk represents the number of times words in the set defined by the non-zero elements of φ_k are manifested in document d; Y_d therefore represents the number of times words are manifested in a document in m distinct sets.

Our goal is to use the theory developed above to design the binary Φ such that the compressive Y_d|X_d ∼ Pois(Y_d; ΦX_d) is as informative as possible. In our experiments we assume that Ψ may be learned separately based upon a small subset of the corpus, and then with Ψ so fixed the statistics of X_d are driven by the statistics of S_d.
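The equality in distribution noted above (a sum of thinned per-word counts matches a single Poisson draw with rate φ_k^T X_d) can be seen empirically. The sketch below is our illustration, with sizes and the gamma rate prior chosen arbitrarily: it groups per-word counts through a binary Φ and checks the Poisson mean-variance signature of the grouped counts.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, N = 50, 5, 100000
X = rng.gamma(2.0, 1.0, size=n)                  # rate vector X_d for one document
Phi = (rng.random((m, n)) < 0.2).astype(float)   # binary word-set membership matrix

# Per-word counts Y~_di ~ Pois(X_di), then counts over the m word sets (rows of Phi)
word_counts = rng.poisson(X, size=(N, n))
Y = word_counts @ Phi.T                          # Y_k = total count of words in set k

# Each column of Y behaves as Pois((Phi X)_k): empirical mean and variance
# both approach (Phi X)_k as N grows
```

This is why the compressive measurement can be realized physically by counting words in sets, rather than forming linear combinations of already-observed counts.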
When performing learning of Ψ, each column of Ψ is assumed drawn from an n-dimensional Dirichlet distribution, and S_d is assumed drawn from a gamma process, as specified in [26]. We employ variational Bayesian (VB) inference on this model [26] to estimate Ψ (and retain the mean).

With Ψ so fixed, we then design Φ under two cases. For the case in which we are interested in inferring S_d from the compressive measurements, i.e., based on counts of words in sets, we employ a gamma process prior for p_S(S_d), as in [26]. The result in Theorem 1 is then used to compute gradients for the design of Φ. For the classification case, for each document class c ∈ {1, . . . , L} we learn a p(S_d|C) based on a training sub-corpus for class C. This is done for all document classes, and we design a compressive matrix Φ ∈ {0,1}^{m×n}, with the gradient computed using Theorem 2.

In the testing phase, using held-out documents, we employ the matrix Φ to group the counts of words in document d into counts on m sets of words, with the sets defined by the rows of Φ. Using these Y_d, which we assume are drawn Y_d|S_d ∼ Pois(Y_d; ΦΨS_d), for known Φ and Ψ, we then use VB computations for the model in [26] to infer a posterior distribution on S_d or class C, depending on the application. The VB inference for this model was not considered in [26], and the update equations are presented in the Supplementary Material.

4.2 Model for Chemical Sensing

The model employed for the chemical sensing [25] considered below is very similar in form to that used for topic modeling, so we reuse notation.
Assume that there are T fundamental (building-block) chemicals of interest, and that the hyperspectral sensor performs measurements at n wavelengths. Then the observed data for sample d may be represented Y_d|S_d ∼ Pois(Y_d; ΨS_d + λ), where Y_d ∈ Z^n_+ represents the count of photons at the n sensor wavelengths, λ ∈ R^n_+ represents the sensor dark current, and the tth column of Ψ ∈ R^{n×T}_+ reflects the mean Poisson rate for chemical t (the different chemicals play a role analogous to topics). The vector S_d ∈ R^T_+ reflects the amount of each fundamental chemical present in the sample under test.

Figure 2: Results on the NYTimes corpus. Optimized denotes the methods discussed in Section 4.3. Full denotes when each word is observed. The error estimates were obtained by running the algorithm over 10 different random subsets of 20,000 documents. (a) Predictive log-likelihood estimate versus the number of projections. (b) KL divergence versus the number of projections. (c) Predictive log-likelihood versus processing time.

For the compressive chemical-sensing system discussed in Section 4.5, the measurement matrix is again binary, Φ ∈ {0,1}^{m×n}.
Through calibrations and known properties of the chemicals and characteristics of the camera, one may readily constitute Ψ and λ, and a model similar to that employed for topic modeling is utilized to model S_d; here λ is a characteristic of the camera, and is not optimized. In the experiments reported below the analysis of the chemical-sensing data is performed analogously to how the documents were modeled (which we detail), and therefore, for brevity, no further modeling details are provided explicitly for the chemical-sensing application. For the chemical-sensing application, the goal is to classify the chemical sample under test, and therefore Φ is designed based on optimization using the Theorem 2 gradient.

4.3 Details on Designing Φ

We wish to use Theorems 1 and 2 to design a binary Φ, for the document-analysis and chemical-sensing applications. To do this, instead of directly optimizing Φ, we put a logistic link on each value, Φ_ij = σ(M_ij), where σ(·) denotes the logistic function. We can state the gradient with respect to M as

[∇_M I(X;Y)_{ij}] = [∇_Φ I(X;Y)_{ij}][∇_M Φ_{ij}].    (12)

Similar results hold for ∇_M I(C;Y)_{ij}. Φ was initialized at random, and we threshold the logistic at 0.5 to obtain the final binary Φ.

To estimate the expectations needed for the results in Theorems 1 and 2, we used Monte Carlo integration, where we simulated X and Y from the appropriate distributions. The number of samples in the Monte Carlo integration was set to n (the data dimension), and 1000 gradient steps were used for optimizing Φ.

The explicit forms for the gradients in Theorems 1 and 2 play an important role in making optimization of Φ tractable for the practical applications considered here. One could in principle take a brute-force gradient of I(Y;X) and I(Y;C) with respect to Φ, and evaluate all needed integrals via Monte Carlo sampling.
This leads to a cumbersome set of terms that need to be computed. The "clean" forms of the gradients in Theorems 1 and 2 significantly simplified the design implementation within the experiments below, with the added value of allowing connections to be made to the Gaussian measurement model.

4.4 Examples for Document Corpora

We demonstrate designed projections on the NYTimes and 20 Newsgroups data. The NYTimes data have n = 8000 unique words, and the 20 Newsgroups data have n = 8052 unique words. When learning Ψ, we placed the prior Dir(0.1, . . . , 0.1) on the columns of Ψ, and the components S_dk had a prior Gamma(0.1, 0.1). We tried many different settings for these priors, and as in [26], the learned Ψ was insensitive to "reasonable" settings. The number of topics (columns) in Ψ was set to T = 100.

In addition to designing Φ using the proposed theory, we also considered four comparative designs: (i) binary Φ constituted uniformly at random, with 1% of the entries non-zero; (ii) orthogonal binary rows of Φ, with one non-zero element in each column selected uniformly at random; (iii) performing non-negative matrix factorization (NNMF) [3] on Ψ, and projecting onto the principal vectors; and (iv) performing latent Dirichlet allocation (LDA) [4] on the documents, and projecting onto the topic-dependent probabilities of words.
For (iii) and (iv), the top (highest amplitude) 5% of the words in each vector on which we project (e.g., a topic) were set to have projection amplitude 1, and all the rest were set to zero. The settings on (i), (iii) and (iv), i.e., with regard to the fraction of words with non-zero values in Φ, were those that yielded the best results (other settings often performed much worse).

Figure 3: (a) Classification accuracy of projected measurements and the fully observed case. Random uses 10% non-zero values, Ortho is a random matrix limited to orthogonal projections, and Optimized uses designed projections. The error bars are the standard deviation of the algorithm run independently on 10 random splits of the dataset. (b) Subset of the confusion matrix for the fully observed counts. White numbers denote the percentage of documents classified in that manner. Only those classes in the "comp" subgroup are shown. The "comp" group is the least accurate subgroup. (c) The confusion matrix on the "comp" subgroup for 150 compressive measurements.

We show results using two metrics, Kullback-Leibler (KL) divergence and predictive log-likelihood. For the KL divergence, we compare the topic mixture learned from the projection measurements to the topic mixture learned from the case where each word is observed (no compressive measurement). We define the topic mixture S'_d as the normalized version of S_d. We calculate D_KL(S'_{d,p}||S'_{d,f}) = \sum_{k=1}^K S'_{dk,p} log(S'_{dk,p}/S'_{dk,f}), where S'_{dk,f} is the relative weight on document d, topic k for the full set of words, and S'_{dk,p} is the same for the compressive topic model. We also calculate per-word predictive log-likelihood. Because different projection metrics are in different dimensions, we use 75% of a document's words to obtain the projection measurements Y_d and use the remaining 25% as the original word tokens W_d. We then calculate the predictive log-likelihood (PLL) as log p(W_d|Ψ, Φ, Y_d).

We split the 20 Newsgroups corpus into 10 random splits of 60% training and 40% testing to obtain an estimate of uncertainty. The results are shown in Figure 1. Figure 1(a) shows the per-word predictive log-likelihood (PLL). At very low numbers of compressive measurements we obtain similar PLL between the designed matrix and the random methods. As we increase the number of measurements, we obtain dramatic improvements by optimizing the sensing matrix, and the optimized methods quickly approach the fully observed case. The same trends can be seen in the KL divergence shown in Figure 1(b).
Note that the relative quality of the NNMF- and LDA-based designs of Φ depends on the metric (KL or PLL), but for both metrics the proposed mutual-information-based design of Φ yields the best performance.

To test the NYTimes corpus, we split the corpus into 10 random subsets with 20,000 training documents and 20,000 testing documents. The results are shown in Figure 2. As in the 20 Newsgroups results, the predictive log-likelihood and KL divergence of the random and designed measurements are similar when the number of projections is low. As we increase the number of projections, the optimized projection matrix offers dramatic improvements over the random methods. We also consider predictive log-likelihood versus time in Figure 2(c). The compressive measurements give nearly the same performance with half the per-document processing time. Since the total processing time increases linearly with the total number of documents, a 50% decrease in processing time can make a significant difference in large corpora.

We also consider the classification problem over the 20 classes in the 20 Newsgroups dataset, split into 10 groups of 60% training and 40% testing. We learn a Ψ with T = 20 columns (topics) and with the prior on the columns as above. Within the prior, we draw S_{dc_d}|c_d ∼ Gamma(1,1) and S_{dc'}|c_d = 0 for all c' ≠ c_d. Separate topics are associated with each of the 20 classes, and we use the MAP estimate to obtain the class label c*_d = arg max_c p(c|Y_d). Classification versus the number of projections for random projections and designed projections is shown in Figure 3(a). It is also useful to look at the type of errors made by the classifier when we use the designed projections.
Figure 3(b) and Figure 3(c) show the newsgroups under the "comp" (computer) heading, which is the least accurate section. In the compressed case, many of the additional errors go into nearby topics with overlapping ideas. For example, most additional misclassifications in "comp.os.ms-windows.misc" go into "comp.sys.ibm.pc.hardware" and "comp.windows.x," which have many similar discussions. Additionally, 4% of the articles were originally posted in more than one topic, showing the intimate relationship between similar discussion groups, and so misclassifying into a related (and overlapping) class is less of a problem than misclassification into a completely disjoint class.

4.5 Poisson Compressive Sensing for Chemical Classification

We consider chemical sensing based on the wavelength-dependent signatures of chemicals at optical frequencies (here we consider an 850-1000 nm laser system). The measurement system is summarized in Figure 4(a); details of this system are described in [25]. In Part 1 of Figure 4(a), multi-wavelength photons are scattered off a chemical sample.
In Part 2 of this figure a volume holographic grating (VHG) is employed to diffract the photons in a wavelength-dependent manner, and therefore photons are distributed spatially across a digital micromirror device (DMD); distinct wavelengths are associated with each micromirror. The DMD consists of 1920 × 1080 aluminum mirrors. Each mirror is in a binary state, either reflecting light back to a detector or not. As a result of the VHG, each mirror approximately samples a single wavelength, and the photon counter counts all photons at wavelengths for which the mirrors direct light to the sensor. Hence, the sensor counts all photons at a subset of the wavelengths, those for which the mirror is at the appropriate angle.

The measurement may be represented Y | S_d ∼ Pois[Φ(Ψ S_d + λ0)], where λ0 ∈ R^n_+ is known from calibration. The elements of the rate vector λ0 vary from 0.07 to 1.5 per bin, and the cumulative dark current Φλ0 can provide in excess of 50% of the signal energy, depending on the measurement (very noisy measurements). The design of Φ was based on Theorem 2, with λ0 treated as the signature of an additional chemical (actually associated with measurement noise); finally, λ = Φλ0 is the measurement dark current.

The ten chemicals considered in this test were acetone, acetonitrile, benzene, dimethylacetamide, dioxane, ethanol, hexane, methylcyclohexane, octane, and toluene. We note from Figure 4 that after only five compressive measurements, excellent chemical classification is manifested based on designed CS measurements. There are n > 1000 wavelengths in a conventional measurement of these data; the system therefore achieves significant compression. In Figure 4(b) we show results of measured data and performance predictions based on our model, with good agreement manifested.
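The measurement model above lends itself to a quick simulation. The sketch below is illustrative only: the signature matrix Psi, the dark-current vector lambda0, and the number of wavelengths and measurements are synthetic stand-ins, and a random binary Φ with P(1) = 0.1 is used in place of the paper's information-optimized design (Theorem 2 is not reproduced here). It draws counts Y | S_d ∼ Pois[Φ(Ψ S_d + λ0)] and classifies by Poisson maximum likelihood:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (all values hypothetical):
# Psi: per-wavelength signatures of L chemicals; lambda0: calibrated dark rates.
n, L, m = 1000, 10, 5                             # wavelengths, chemicals, measurements
Psi = rng.gamma(shape=2.0, scale=5.0, size=(n, L))
lambda0 = rng.uniform(0.07, 1.5, size=n)          # per-bin dark-rate range quoted in the text
Phi = (rng.random((m, n)) < 0.10).astype(float)   # random binary design, P(1) = 0.10

def measure(c):
    """Draw a count vector Y | S_d ~ Pois[Phi (Psi e_c + lambda0)] for chemical c."""
    rate = Phi @ (Psi[:, c] + lambda0)
    return rng.poisson(rate)

def classify(y):
    """Maximum-likelihood chemical label under the Poisson measurement model."""
    rates = Phi @ (Psi + lambda0[:, None])        # m x L candidate rate vectors
    loglik = (y[:, None] * np.log(rates) - rates).sum(axis=0)
    return int(np.argmax(loglik))

# Monte Carlo accuracy of the random binary design
reps = 20
correct = sum(classify(measure(c)) == c for _ in range(reps) for c in range(L))
print(correct / (reps * L))
```

Replacing the random Φ with one obtained by gradient ascent on I(Y; C) is precisely what the designed measurements of this section provide, and is what separates the "Designed" and "Random" curves in Figure 4(b).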
Note that designed projection measurements perform markedly better than random; here the probability of a one in the random design was 10% (this yielded the best random results in simulations).

5 Conclusions
New results are presented for the gradient of mutual information with respect to the measurement matrix and a dark current, within the context of a Poisson model for vector count data. The mutual information is considered for signal recovery and classification. For the former we recover known scalar results as a special case, while the latter results for classification have not been addressed in any form previously. Fundamental connections between the gradient of mutual information and conditional-expectation estimates have been made for the Poisson model. Encouraging applications have been demonstrated for compressive topic modeling and for compressive hyperspectral chemical sensing (with demonstration on a real compressive camera).

Acknowledgments
The work reported here was supported in part by grants from ARO, DARPA, DOE, NGA and ONR.

Figure 4: (a) Measurement system. The VHG is a volume holographic grating that spatially spreads photons in a wavelength-dependent manner across the digital micromirror device (DMD), and the DMD is employed to implement binary coding. (b) Performance of the compressive-measurement classifier as a function of the number of compressive measurements; ten chemicals are considered. Experimental results are shown (Exp), as well as predictions from simulations (Sim).
References
[1] R. Atar and T. Weissman. Mutual information, relative entropy, and estimation in the Poisson channel. IEEE Transactions on Information Theory, 58(3):1302–1318, March 2012.
[2] A. Banerjee, S. Merugu, I.S. Dhillon, and J. Ghosh. Clustering with Bregman divergences. JMLR, 2005.
[3] M.W. Berry, M. Browne, A.N. Langville, V.P. Pauca, and R.J. Plemmons. Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics & Data Analysis, 2007.
[4] D.M. Blei, A.Y. Ng, and M.I. Jordan. Latent Dirichlet allocation. JMLR, 2003.
[5] L.M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 1967.
[6] E. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 2006.
[7] W.R. Carson, M. Chen, M.R.D. Rodrigues, R. Calderbank, and L. Carin. Communications-inspired projection design with application to compressive sensing. SIAM Journal on Imaging Sciences, 2013.
[8] M. Chen, W. Carson, M. Rodrigues, R. Calderbank, and L. Carin. Communications inspired linear discriminant analysis. In ICML, 2012.
[9] G.B. Folland. Real Analysis: Modern Techniques and Their Applications. Wiley, New York, 1999.
[10] D. Guo. Information and estimation over binomial and negative binomial models. arXiv preprint arXiv:1207.7144, 2012.
[11] D. Guo, S. Shamai, and S. Verdú. Mutual information and minimum mean-square error in Gaussian channels. IEEE Transactions on Information Theory, 51(4):1261–1282, April 2005.
[12] D. Guo, S. Shamai, and S. Verdú. Mutual information and conditional mean estimation in Poisson channels. IEEE Transactions on Information Theory, 54(5):1837–1849, May 2008.
[13] S.M. Haas and J.H. Shapiro. Capacity of wireless optical communications. IEEE Journal on Selected Areas in Communications, 21(8):1346–1357, August 2003.
[14] M. Hellman and J. Raviv. Probability of error, equivocation, and the Chernoff bound. IEEE Transactions on Information Theory, 1970.
[15] A. Lapidoth and S. Shamai. The Poisson multiple-access channel. IEEE Transactions on Information Theory, 44(2):488–501, February 1998.
[16] R.S. Liptser and A.N. Shiryaev. Statistics of Random Processes: II. Applications, volume 2. Springer, 2000.
[17] D.P. Palomar and S. Verdú. Gradient of mutual information in linear vector Gaussian channels. IEEE Transactions on Information Theory, 52(1):141–154, January 2006.
[18] D.P. Palomar and S. Verdú. Representation of mutual information via input estimates. IEEE Transactions on Information Theory, 53(2):453–470, February 2007.
[19] S. Prasad. Certain relations between mutual information and fidelity of statistical estimation. http://arxiv.org/pdf/1010.1508v1.pdf, 2012.
[20] M. Raginsky, R.M. Willett, Z.T. Harmany, and R.F. Marcia. Compressed sensing performance bounds under Poisson noise. IEEE Transactions on Signal Processing, 2010.
[21] M. Seeger, H. Nickisch, R. Pohmann, and B. Schölkopf. Optimization of k-space trajectories for compressed sensing by Bayesian experimental design. Magnetic Resonance in Medicine, 2010.
[22] C.G. Taborda and F. Perez-Cruz. Mutual information and relative entropy over the binomial and negative binomial channels. In IEEE International Symposium on Information Theory (ISIT), pages 696–700, 2012.
[23] S. Verdú. Mismatched estimation and relative entropy. IEEE Transactions on Information Theory, 56(8):3712–3720, August 2010.
[24] T. Weissman. The relationship between causal and noncausal mismatched estimation in continuous-time AWGN channels. IEEE Transactions on Information Theory, 2010.
[25] D.S. Wilcox, G.T. Buzzard, B.J. Lucier, P. Wang, and D. Ben-Amotz. Photon level chemical classification using digital compressive detection. Analytica Chimica Acta, 2012.
[26] M. Zhou, L. Hannah, D. Dunson, and L. Carin. Beta-negative binomial process and Poisson factor analysis. In AISTATS, 2012.