{"title": "Thinning Measurement Models and Questionnaire Design", "book": "Advances in Neural Information Processing Systems", "page_first": 307, "page_last": 315, "abstract": "Inferring key unobservable features of individuals is an important task in the applied sciences. In particular, an important source of data in fields such as marketing, social sciences and medicine is questionnaires: answers in such questionnaires are noisy measures of target unobserved features. While comprehensive surveys help to better estimate the latent variables of interest, aiming at a high number of questions comes at a price: refusal to participate in surveys can go up, as well as the rate of missing data; quality of answers can decline; costs associated with applying such questionnaires can also increase. In this paper, we cast the problem of refining existing models for questionnaire data as follows: solve a constrained optimization problem of preserving the maximum amount of information found in a latent variable model using only a subset of existing questions. The goal is to find an optimal subset of a given size. For that, we first define an information theoretical measure for quantifying the quality of a reduced questionnaire. Three different approximate inference methods are introduced to solve this problem. Comparisons against a simple but powerful heuristic are presented.", "full_text": "Thinning Measurement Models and\n\nQuestionnaire Design\n\nDepartment of Statistical Science\n\nUniversity College London\n\nRicardo Silva\n\nGower Street, London WC1E 6BT\nricardo@stats.ucl.ac.uk\n\nAbstract\n\nInferring key unobservable features of individuals is an important task in the ap-\nplied sciences. In particular, an important source of data in \ufb01elds such as mar-\nketing, social sciences and medicine is questionnaires: answers in such question-\nnaires are noisy measures of target unobserved features. 
While comprehensive\nsurveys help to better estimate the latent variables of interest, aiming at a high\nnumber of questions comes at a price: refusal to participate in surveys can go up,\nas well as the rate of missing data; quality of answers can decline; costs associ-\nated with applying such questionnaires can also increase. In this paper, we cast\nthe problem of re\ufb01ning existing models for questionnaire data as follows: solve\na constrained optimization problem of preserving the maximum amount of infor-\nmation found in a latent variable model using only a subset of existing questions.\nThe goal is to \ufb01nd an optimal subset of a given size. For that, we \ufb01rst de\ufb01ne an\ninformation theoretical measure for quantifying the quality of a reduced question-\nnaire. Three different approximate inference methods are introduced to solve this\nproblem. Comparisons against a simple but powerful heuristic are presented.\n\n1 Contribution\n\nA common goal in the applied sciences is to measure concepts of interest that are not directly ob-\nservable (Bartholomew et al., 2008). Such is the case in the social sciences, medicine, economics\nand other \ufb01elds, where quantifying key attributes such as \u201cconsumer satisfaction,\u201d \u201canxiety\u201d and \u201cre-\ncession\u201d requires the development of indicators: observable variables that are postulated to measure\nthe target latent variables up to some measurement error (Bollen, 1989; Carroll et al., 1995).\n\nIn a probabilistic framework, this often boils down to a latent variable model (Bishop, 1998). One\ncommon setup is to assume each observed indicator Yi as being generated independently given the\nset of latent variables X. Conditioning on any given observed data point Y gives information about\nthe distribution of the latent vector X, which can then be used for ranking, clustering, visualization\nor smoothing, among other tasks. 
Figure 1 provides an illustration.

Questionnaires from large surveys are sometimes used to provide such indicators, each Yi recording an answer that typically corresponds to a Bernoulli or ordinal variable. For instance, experts can be given questions concerning whether there is freedom of press in a particular nation, as a way of measuring its democratization level (Bollen, 1989; Palomo et al., 2007). Nations can then be clustered or ranked within an interpretable latent space. Long questionnaires nevertheless have drawbacks, as summarized by Stanton et al. (2002) in the context of psychometric studies:

Longer surveys take more time to complete, tend to have more missing data, and have higher refusal rates than short surveys. Arguably, then, techniques for reducing the length of scales while maintaining psychometric quality are worthwhile.

[Figure 1 here: (a) graphical model with latent variables X1 (Industrialization) and X2 (Democratization) and indicators Y1-Y5; (b) barplots of democratization factor scores (Dem1960, Dem1965) by country, ordered by the industrialization factor.]

Figure 1: (a) A graphical representation of a latent variable model. Notice that in general latent variables will be dependent. Here, the question is how to quantify democratization and industrialization levels of nations given observed indicators Y such as freedom of press and gross national product, among others (Bollen, 1989; Palomo et al., 2007). (b) An example of a result implied by the model (adapted from Palomo et al. (2007)): barplots of the conditional distribution of democratization levels given the observed indicators at two time points, ordered by the posterior mean industrialization level.
The distribution of the latent variables given the observations is the basis of the analysis.\n\nOur contribution is a methodology for choosing which indicators to preserve (e.g., which items to\nkeep in a questionnaire) given: i.) a latent variable model speci\ufb01cation of the domain of interest;\nii.) a target number of indicators that should be preserved. To accomplish this, we provide: i.) a\ntarget objective function that quanti\ufb01es the amount of information preserved by a choice of a subset\nof indicators, with respect to the full set; ii.) algorithms for optimizing this choice of subset with\nrespect to the objective function. The general idea is to start with a target posterior distribution of\nlatent variables, de\ufb01ned by some latent variable measurement model M (i.e., PM(X | Y)). We\nwant to choose a subset Yz \u2282 Y so that the resulting conditional distribution PM(X | Yz) is\nas close as possible to the original one according to some metric. Model M is provided either by\nexpertise or by numerous standard approaches that can be applied to learn it from data (e.g., methods\nin Bishop, 2009). We call this task measurement model thinning.\n\nNotice that the size of Yz is a domain-dependent choice. Assuming M is a good model for the data,\nchoosing a subset of indicators will incur some information loss. It is up to the analyst to choose a\ntrade-off between loss of information and the design of simpler, cheaper ways of measuring latent\nvariables. Even if a shorter questionnaire is not to be deployed, the outcome of measurement model\nthinning provides a formal sensitivity analysis of the target latent distribution with respect to the\navailable indicators. The result is useful to generate different insights into the domain.\n\nThis paper is organized as follows: Section 2 de\ufb01nes a formal criterion to quantify how appropriate a\nsubset Yz is. Section 3 describes different approaches in which this criterion can be optimized. 
Related work is briefly discussed in Section 4. Experiments with synthetic and real data are discussed in Section 5, followed by the conclusion.

2 An Information-Theoretical Criterion

Our focus is on domains where latent variables are not a by-product of a dimensionality reduction technique, but the target of the analysis as in the example of Figure 1. That is, measurement error problems where the variables to be recorded are designed specifically to obtain information concerning such unknowns (Carroll et al., 1995; Bartholomew et al., 2008). As such, we postulate that the outcome of any analysis should be a functional of PM(X | Y), the conditional distribution of unobservables X given observables Y within a model M. It is assumed that M specifies the joint PM(X, Y). We further assume that observed variables are conditionally independent given X, i.e. PM(X, Y) = PM(X) ∏_{i=1}^p PM(Yi | X), with p being the number of observed indicators.

If z ≡ (z1, z2, . . . , zp) is a binary vector of the same dimensionality as Y, and Yz is the subset of Y corresponding to the non-zero entries of z, we can assess z by the KL divergence

KL(PM(X | Y) || PM(X | Yz)) ≡ ∫ PM(X | Y) log [PM(X | Y) / PM(X | Yz)] dX

This is well-defined, since both distributions lie in the same sample space despite the difference of dimensionality between Y and Yz. Moreover, since Y is itself a random vector, our criterion becomes the expected KL divergence

⟨KL(PM(X | Y) || PM(X | Yz))⟩_PM(Y)

where ⟨·⟩ denotes expectation. Our goal is to minimize this function with respect to z.
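On a small, fully enumerable model, this criterion can be computed exactly. Below is a minimal sketch (all parameters and names are illustrative, not from the paper's implementation): one binary latent X with three conditionally independent binary indicators, where the expected KL loss of dropping each indicator is obtained by summing over all configurations of Y.

```python
# Toy check of the expected-KL criterion on a fully enumerable model:
# one binary latent X, three conditionally independent binary indicators.
# All parameters below are illustrative.
import math
from itertools import product

p_x = {0: 0.5, 1: 0.5}
# (P(Y_i = 1 | X = 0), P(Y_i = 1 | X = 1)): indicator 0 is strongly
# informative, indicator 2 is close to pure noise.
theta = [(0.1, 0.9), (0.2, 0.8), (0.45, 0.55)]

def p_y_given_x(y, x, keep):
    # Product of kept indicator likelihoods, using conditional independence.
    pr = 1.0
    for i in keep:
        pi = theta[i][x]
        pr *= pi if y[i] == 1 else 1 - pi
    return pr

def posterior(y, keep):
    # P(X | Y_z) for the indicators listed in `keep`.
    joint = [p_x[x] * p_y_given_x(y, x, keep) for x in (0, 1)]
    s = sum(joint)
    return [j / s for j in joint]

def expected_kl(keep):
    # <KL(P(X|Y) || P(X|Y_z))> under P(Y), by enumeration over all Y.
    total = 0.0
    full = [0, 1, 2]
    for y in product((0, 1), repeat=3):
        p_y = sum(p_x[x] * p_y_given_x(y, x, full) for x in (0, 1))
        post_full = posterior(y, full)
        post_sub = posterior(y, keep)
        kl = sum(pf * math.log(pf / ps)
                 for pf, ps in zip(post_full, post_sub) if pf > 0)
        total += p_y * kl
    return total

# Expected information loss from dropping each single indicator.
losses = {i: expected_kl([j for j in range(3) if j != i]) for i in range(3)}
```

As expected, keeping all indicators incurs zero loss, and dropping the near-noise indicator loses the least information.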
Rearranging this expression to drop all constants that do not depend on z, and multiplying it by −1 to get a maximization problem, we obtain the problem of finding z⋆ such that

z⋆ = argmax_z { ⟨log PM(Yz | X)⟩_PM(X,Yz) − ⟨log PM(Yz)⟩_PM(Yz) }
   = argmax_z { ∑_{i=1}^p zi ⟨log PM(Yi | X)⟩_PM(X,Yi) + HM(Yz) }
   ≡ argmax_z FM(z)

subject to ∑_{i=1}^p zi = K for a choice of K, and zi ∈ {0, 1}. HM(·) denotes here the entropy of a distribution parameterized by M. Notice we used the assumption that indicators are mutually independent given X. There is an intuitive appeal in having a joint entropy term to reward not only marginal relationships between indicators and latent variables, but also selections that are jointly diverse. Notice that optimizing this objective function turns out to be equivalent to minimizing the conditional entropy of latent variables given Yz. Motivating conditional entropy from a more fundamental principle illustrates that other functions can be obtained by changing the divergence.

3 Approaches for Approximate Optimization

The problem of optimizing FM(z) subject to the constraints ∑_{i=1}^p zi = K, zi ∈ {0, 1}, is hard not only for its combinatorial nature, but due to the entropy term. This needs to be approximated, and the nature of the approximation should depend on the form taken by M. We will assume that it is possible to efficiently compute any marginals of PM(Y) of modest dimensionality (say, 10 dimensions). This is the case, for instance, in the probit model for binary data:

X ∼ N(0, Σ),   Y⋆_i ∼ N(Λ_i^T X + λ_{i;0}, 1),   Yi = 1 if Y⋆_i > 0, and 0 otherwise

where N(m, S) is the multivariate Gaussian distribution with mean m and covariance matrix S. The probit model is one of the most common latent variable models for questionnaire data (Bartholomew et al., 2008), with a straightforward extension to ordinal data.
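The probit generative process, and the per-indicator term ⟨log PM(Yi | X)⟩, can be sketched as follows. This is a minimal illustration with made-up loadings and a two-dimensional latent space; the expectation over X is estimated by plain Monte Carlo rather than the more careful integration discussed below.

```python
# Sketch of the binary probit measurement model with illustrative parameters,
# and a Monte Carlo estimate of <log P_M(Y_i | X)> for one indicator.
import math
import random

random.seed(0)
SQRT2 = math.sqrt(2.0)
Phi = lambda t: 0.5 * (1.0 + math.erf(t / SQRT2))  # standard normal CDF

# Two latent variables with correlation 0.5; three illustrative loadings.
rho = 0.5
Lambda = [(1.0, 0.0), (0.0, 1.2), (0.7, 0.7)]
lam0 = [0.0, -0.3, 0.2]  # intercepts lambda_{i;0}

def sample_x():
    # X ~ N(0, Sigma) via a Cholesky-style construction.
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    return (z1, rho * z1 + math.sqrt(1 - rho * rho) * z2)

def sample_y(x):
    # Y*_i ~ N(Lambda_i^T x + lambda_{i;0}, 1); Y_i = 1 iff Y*_i > 0.
    out = []
    for (a, b), l0 in zip(Lambda, lam0):
        mean = a * x[0] + b * x[1] + l0
        out.append(1 if random.gauss(mean, 1.0) > 0 else 0)
    return out

def avg_loglik(i, n=20000):
    # <log P(Y_i | X)> = E_X[ Phi(m) log Phi(m) + (1 - Phi(m)) log(1 - Phi(m)) ]
    # with m = Lambda_i^T X + lambda_{i;0}, estimated by Monte Carlo over X.
    total = 0.0
    a, b = Lambda[i]
    for _ in range(n):
        x = sample_x()
        m = a * x[0] + b * x[1] + lam0[i]
        p = min(max(Phi(m), 1e-12), 1 - 1e-12)
        total += p * math.log(p) + (1 - p) * math.log(1 - p)
    return total / n

ll0 = avg_loglik(0)  # strictly negative; never below -log 2 for a binary Y_i
```

Note that only a univariate Gaussian and a single binary variable are involved per term, which is why this quantity is cheap, as the text explains next.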
In this model, marginals for a few dozen variables can be obtained efficiently since this corresponds to calculating multivariate Gaussian probabilities (Genz, 1992). Parameters can be fit by a variety of methods (Hahn et al., 2010). We also assume that M allows for the computation of ⟨log PM(Yi | X)⟩_PM(X,Yi) at little cost. Again, in the binary probit model this is simple, since this requires integrating away a single binary variable Yi and a univariate Gaussian Λ_i^T X.

3.1 Gaussian Entropy

One approximation to FM(z) is to replace its entropy term by the corresponding entropy from some Gaussian distribution PN(Yz). The entropy of a Gaussian distribution is proportional to the logarithm of the determinant of its covariance matrix, and hence can be computed in O(p^3) steps. This Gaussian can be chosen as the one closest to PM(Yz) in a KL(PM || PN) sense: that is, the one with the same first and second moments as PM(Yz). In our case, computing these moments can be done deterministically (up to numerical error) using standard bivariate quadrature methods. No expectation-propagation (Minka, 2001) is necessary. The corresponding objective function is

FM;N(z) ≡ ∑_{i=1}^p zi ⟨log PM(Yi | X)⟩_PM(X,Yi) + 0.5 log |Σz|

where Σz is the covariance matrix of Yz – which for binary and ordinal data has a sensible interpretation. This function is also an upper bound on the exact function, FM(z), since the Gaussian is the distribution with the largest entropy for a given mean vector and covariance matrix. The resulting function is non-linear in z.
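The Gaussian surrogate for the entropy term can be sketched as below. For brevity this estimates the first and second moments of the indicators by Monte Carlo instead of the bivariate quadrature used in the paper, and all model parameters are made up; only the 0.5 log |Σz| scoring is the point here.

```python
# Sketch of the Gaussian entropy surrogate: estimate the moments of the binary
# indicators (by Monte Carlo here, as a simplification), then score a subset z
# by 0.5 * log|Sigma_z|. Loadings are illustrative.
import numpy as np

rng = np.random.default_rng(0)
p, d, n = 6, 2, 20000
Lam = rng.normal(size=(p, d))          # illustrative loadings Lambda
X = rng.normal(size=(n, d))            # X ~ N(0, I) for simplicity
Y = (X @ Lam.T + rng.normal(size=(n, p)) > 0).astype(float)

Cov = np.cov(Y, rowvar=False)          # estimated covariance of P_M(Y)

def gaussian_entropy_term(z):
    # 0.5 log|Sigma_z| over the kept indicators (K-dependent constants dropped,
    # which is harmless since K is fixed by the cardinality constraint).
    idx = np.flatnonzero(z)
    sign, logdet = np.linalg.slogdet(Cov[np.ix_(idx, idx)])
    return 0.5 * logdet

z = np.array([1, 1, 0, 1, 0, 0])
val = gaussian_entropy_term(z)
```

Since each binary indicator has variance at most 0.25, the log-determinant term is negative here; what matters for the optimization is only its relative value across subsets of the same size K.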
In our experiments, we optimize for z using a greedy scheme: for all possible pairs (i, j) such that zi = 1 and zj = 0, we swap their values (so that ∑_i zi is always K). We choose the pair with the highest increase in FM;N(z) and repeat the process until convergence.

3.2 Entropy with Bounded Neighborhoods

An alternative bound can be derived from a standard fact in information theory: H(Y | S) ≤ H(Y | S′) for S′ ⊆ S, where H(· | ·) denotes conditional entropy. This was exploited by Globerson and Jaakkola (2007) to define an upper bound on the entropy of a distribution as follows: consider a permutation e of the set {1, 2, . . . , p}, with e(i) being the i-th element of e. Denote by e(1 : i) the first i elements of this permutation (an empty set if i < 1). Moreover, let N(e, i) be a subset of e(1 : i − 1). For a given set of variables Y = {Y1, Y2, . . . , Yp} the following bound holds:

H(Y1, Y2, . . . , Yp) = ∑_{i=1}^p H(Ye(i) | Ye(1:i−1)) ≤ ∑_{i=1}^p H(Ye(i) | YN(e,i))   (1)

If each set N(e, i) is no larger than some constant D, then this bound can be computed in O(p · 2^D) steps for binary probit models. The bound holds for any choice of e, but we want it to be as tight as possible so that it gets weighted in a reasonable way against the other terms in FM(·). Since the entropy function is decomposable as a sum of functions that depend on i and N(e, i) only, one can minimize this bound with respect to e by using permutation optimization methods such as (Jaakkola et al., 2010).
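Bound (1) is easy to verify exhaustively on a small joint distribution. The sketch below (with an arbitrary random joint over four binary variables) checks both the exact chain-rule decomposition and the truncated-neighborhood upper bound, using N(e, i) = the last D predecessors in e as one simple neighborhood choice.

```python
# Exhaustive check of bound (1) on a small binary distribution: the chain rule
# is exact, and truncating conditioning sets to N(e, i) can only raise entropy.
import itertools
import math
import random

random.seed(1)
p = 4
# A random strictly positive joint over {0,1}^4.
probs = {y: random.random() + 0.1 for y in itertools.product((0, 1), repeat=p)}
s = sum(probs.values())
probs = {y: v / s for y, v in probs.items()}

def H(subset):
    # Entropy (in nats) of the marginal over the given coordinate subset.
    marg = {}
    for y, v in probs.items():
        key = tuple(y[i] for i in subset)
        marg[key] = marg.get(key, 0.0) + v
    return -sum(v * math.log(v) for v in marg.values() if v > 0)

def H_cond(i, cond):
    # H(Y_i | Y_cond) = H(Y_i, Y_cond) - H(Y_cond).
    return H(sorted(set(cond) | {i})) - H(sorted(cond))

e = [0, 1, 2, 3]   # a permutation of the variables
D = 1              # neighborhood size bound
joint = H(e)
chain = sum(H_cond(e[i], e[:i]) for i in range(p))                # exact
bound = sum(H_cond(e[i], e[max(0, i - D):i]) for i in range(p))   # N(e,i)
```

Here `chain` matches `joint` up to rounding, while `bound` is never smaller, mirroring the two relations in (1).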
In our implementation, we use a method similar to Teyssier and Koller (2005) that shuffles neighboring entries of e to generate candidates, chooses the optimal N(e, i) for each i given the candidate e, and picks as the next permutation the candidate e with the greatest decrease in the bound.

Notice that a permutation choice e and neighborhood choices N(e, i) define a Bayesian network where N(e, i) are the parents of Ye(i). Therefore, if this Bayesian network model provides a good approximation to PM(Y), the bound will be reasonably tight.

Given e, we will further relax this bound with the goal of obtaining an integer programming formulation for the problem of optimizing an upper bound to FM(z). For any given z, we define the local term HL(z, i) as

HL(z, i) ≡ HM(Ye(i) | Yz∩N(e,i)) = ∑_{S∈P(N(e,i))} [∏_{j∈S} zj] [∏_{k∈N(e,i)\S} (1 − zk)] HM(Ye(i) | S)   (2)

where P(·) denotes the power set of a set. The new approximate objective function becomes

FM;D(z) ≡ ∑_{i=1}^p zi ⟨log PM(Yi | X)⟩_PM(X,Yi) + ∑_{i=1}^p ze(i) HL(z, i)   (3)

Notice that HL(z, i) is still an upper bound on HM(Ye(i) | Ye(1:i−1)). The intuition is that we are bounding HM(Yz) by the entropy of a Bayesian network where a vertex Ye(i) is included if ze(i) = 1, with corresponding parents given by Yz ∩ N(e, i). This is a well-defined Bayesian network for any choice of z. The shortcoming is that ideally we would like this Bayesian network to be the actual marginal of the model given by e and N(e, i). It is not: if the network implied by e and N(e, i) was, for instance, Y1 → Y2 → Y3, the choice of z = (1, 0, 1) would result in the entropy of the disconnected graph {Y1, Y3}, while the true marginal would correspond instead to the graph Y1 → Y3.
However, our simplified marginalization has the advantage of avoiding an intractable problem. Moreover, it allows us to redefine the problem as an integer linear program (ILP).

Each product ze(i) ∏_j zj ∏_k (1 − zk) appearing in (3) results in a sum of O(2^D) terms, each of which has (up to a sign) the form qM ≡ ∏_{m∈M} zm for some set M. It is still the case that qM ∈ {0, 1}. Therefore, objective function (3) can be interpreted as being linear on a set of binary variables {{z}, {q}}. We need further to enforce the constraints coming from

qM = 1 ⇒ {∀m ∈ M, zm = 1};  qM = 0 ⇒ {∃m ∈ M s.t. zm = 0}

It is well-known (Glover and Woolsey, 1974) that this corresponds to the linear constraints

qM = 1 ⇒ {∀m ∈ M, zm = 1}  ⇔  ∀m ∈ M, qM − zm ≤ 0
qM = 0 ⇒ {∃m ∈ M s.t. zm = 0}  ⇔  ∑_{m∈M} zm − qM ≤ |M| − 1

which combined with the linear constraint ∑_{i=1}^p zi = K implies that optimizing FM;D(z) is an ILP with O(p · 2^D) variables and O(p^2 · 2^D) constraints. In our experiments in Section 5, we were able to solve essentially all of such ILPs exactly using linear programming relaxations with branch-and-bound.

3.3 Entropy with Tree-Structured Bounds

The previous bound simplifies marginalization, which might badly overestimate entropies where the corresponding Yz are uniformly spread out in permutation e. We now propose a different type of bound which treats different marginalizations on an equal footing.
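As an aside, the Glover and Woolsey (1974) linearization used above can be checked exhaustively on a small example; the index set M below is arbitrary.

```python
# Exhaustive check of the Glover-Woolsey constraints: over all binary z, the
# inequalities q - z_m <= 0 (for all m in M) and sum_m z_m - q <= |M| - 1
# admit exactly q = prod_m z_m as their unique feasible 0/1 value.
import itertools

M = [0, 2, 3]  # an illustrative index set over four binary variables
ok = True
for z in itertools.product((0, 1), repeat=4):
    feasible = [q for q in (0, 1)
                if all(q - z[m] <= 0 for m in M)
                and sum(z[m] for m in M) - q <= len(M) - 1]
    ok &= feasible == [1 if all(z[m] for m in M) else 0]
```

Enumerating all 16 assignments confirms that the linear constraints pin q down to the value of the product, which is exactly what makes objective (3) expressible as an ILP.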
It comes from the following observation: since H(Ye(i) | Ye(1:i−1)) is less than or equal to any conditional entropy H(Ye(i) | Yj) for j ∈ e(1 : i − 1), we have that the tightest bound given by singleton conditioning sets is

H(Ye(i) | Ye(1:i−1)) ≤ min_{j∈e(1:i−1)} HM(Ye(i) | Yj),

resulting in the objective function

FM;tree(z) ≡ ∑_{i=1}^p zi ⟨log PM(Yi | X)⟩_PM(X,Yi) + ∑_{i=1}^p ze(i) · min_{Yj∈Ye(1:i−1)∩Yz} H(Ye(i) | Yj)   (4)

where min_{Yj∈Ye(1:i−1)∩Yz} H(Ye(i) | Yj) ≡ H(Ye(i)) if Ye(1:i−1) ∩ Yz = ∅. The intuition is that we are bounding the exact entropy using the entropy of a directed tree rooted at Yez(1), the first element of Yz according to e. That is, all variables are marginally dependent in the approximation regardless of what z is, and for a fixed z the tree is, by construction, the one obtained by the usual greedy algorithm of adding edges corresponding to the next legal pair of vertices with maximum mutual information (following an ordering, in this case).

It turns out we can also write (4) as a linear objective function of a polynomial number of 0/1 variables and constraints. Let z̄i ≡ 1 − zi. Let H_i^(1), H_i^(2), . . . , H_i^(i−1) be the values of the set {HM(Ye(i) | Ye(1)), . . . , HM(Ye(i) | Ye(i−1))} sorted in ascending order, with z_i^(1), . . . , z_i^(i−1) being the corresponding permutation of {ze(1), . . . , ze(i−1)}. We have

min_{Yj∈Ye(1:i−1)∩Yz} H(Ye(i) | Yj) = z_i^(1) H_i^(1) + z̄_i^(1) z_i^(2) H_i^(2) + z̄_i^(1) z̄_i^(2) z_i^(3) H_i^(3) + . . . + z̄_i^(1) · · · z̄_i^(i−2) z_i^(i−1) H_i^(i−1) + ∏_{j=1}^{i−1} z̄_i^(j) HM(Ye(i))
= ∑_{j=1}^{i−1} q_i^(j) H_i^(j) + q_i^(i) HM(Ye(i))

where q_i^(j) ≡ z_i^(j) ∏_{k=1}^{j−1} z̄_i^(k) (with q_i^(i) ≡ ∏_{k=1}^{i−1} z̄_i^(k)), each also a binary 0/1 variable. Plugging this expression into (4) gives a linear objective function in this extended variable space. The corresponding constraints are

q_i^(j) = 1 ⇒ {∀ zm ∈ {z̄_i^(1), . . . , z̄_i^(j−1), z_i^(j)}, zm = 1}
q_i^(j) = 0 ⇒ {∃ zm ∈ {z̄_i^(1), . . . , z̄_i^(j−1), z_i^(j)} s.t. zm = 0}

which, as shown in the previous section, can be written as linear constraints (substituting each z̄i by 1 − zi). The total number of constraints is however O(p^3), which can be expensive, and often a linear relaxation procedure with branch-and-bound fails to provide guarantees of optimality.

3.4 The Reliability Score

Finally, it is important to design cheap, effective criteria whose maxima correlate with the maxima of FM(·). Empirically, we have found high quality selections in binary probit models using the solution to the problem

maximize FM;R(z) = ∑_{i=1}^p wi zi,  subject to zi ∈ {0, 1}, ∑_{i=1}^p zi = K

where wi = Λ_i^T Σ Λ_i. This can be solved by picking the indicators corresponding to the highest K weights wi. Assuming a probit model where the measurement error for each Y⋆_i has the same variance of 1, this score is related to the “reliability” of an indicator. Simply put, the reliability Ri of an indicator is the proportion of its variance that is due to the latent variables (Bollen, 1989, Chapter 6): Ri = wi/(wi + 1) for each Y⋆_i. There is no current theory linking this solution to the problem of maximizing FM(·): since there is no entropy term, we can set up an adversarial problem to easily defeat this method.
For instance, this happens in a model where the K indicators of highest reliability all measure the same latent variable Xi and nothing else – much information about Xi would be preserved, but little about other variables. In any case, we found this criterion to be fairly competitive even if at times it produces extreme failures. An honest account of more sophisticated selection mechanisms cannot be performed without including it, as we do in Section 5.

4 Related Work

The literature on survey analysis, in the context of latent variable models, contains several examples of guidelines on how to simplify questionnaires (sometimes described as providing “shortened versions” of scales). Much of the literature, however, consists of describing general guidelines and rules-of-thumb to accomplish this task (e.g., Richins, 2004; Stanton et al., 2002). One possible exception is Leite et al. (2008), which uses different model fitness criteria with respect to a given dataset to score candidate solutions, along with an expensive combinatorial optimization method. This conflates model selection and questionnaire thinning, and there is no theory linking the score functions to the amount of information preserved. In the machine learning and statistics literature, there is a large body of research in active learning, which is related to our task. One of the closest approaches is the one by Liang et al. (2009), which casts the classical problem of measurement selection within a Bayesian graphical model perspective. In that work, one has to choose which measurements to add. This is done sequentially, partially motivated by problems where collecting new measurements can be done relatively quickly and cheaply (say, by paying graduate students to annotate text data), and so the choice of the next measurement can make use of fresh data.
In our case, it might not be realistic to expect we can perform a large number of iterations of data collection – and as such the task of reducing the number of measurements from a large initial collection might be more relevant in practice. Liang et al. also focus on (multivariate) supervised learning instead of purely unsupervised learning. In statistics there is also a considerable body of literature on sufficient dimension reduction and its sparse variants (e.g., Chen et al., 2010). Such techniques create a bottleneck between two sets of variables in a regression problem (say, the mapping from Y to X) while eliminating some of the input variables. In principle one might want to adapt such models to take a latent variable model M as the target mapping. Besides some loss of interpretability, the computational implications might be problematic, though. Moreover, this framework has another free parameter corresponding to the dimensionality of the bottleneck that has to be set. It is not clear how this parameter, along with a choice of sparsity level, would interact with a fixed choice K of indicators to be kept.

5 Experiments

In this section, we first describe some synthetic experiments to provide insights about the different methods, followed by a brief description of a case study. In all of the experiments, the target models M are binary probit. We set the neighborhood parameter for FM;D(·) to 9. The ordering e for the tree-structured method is obtained by the same greedy search of Section 3.2, where now the score is the average of all H(Yi | Yj) for all Yj preceding Yi. Finally, all ordering optimization methods were initialized by sorting indicators in a descending order according to their reliability scores, and the initial solution for all entropy-based optimization methods was given by the reliability score solution of Section 3.4.
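The reliability-score initialization amounts to a simple top-K selection. A minimal sketch, with illustrative loadings and latent covariance (not from the paper's experiments):

```python
# The reliability-score selection of Section 3.4: w_i = Lambda_i^T Sigma Lambda_i,
# keep the K indicators with the largest weights. Parameters are illustrative.
import numpy as np

Sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])          # latent covariance
Lam = np.array([[1.5, 0.0],             # rows are the loadings Lambda_i
                [0.0, 0.2],
                [0.8, 0.8],
                [0.1, 0.1]])

w = np.einsum('id,de,ie->i', Lam, Sigma, Lam)   # w_i = Lambda_i^T Sigma Lambda_i
R = w / (w + 1.0)                               # reliability, unit error variance
K = 2
keep = np.sort(np.argsort(w)[::-1][:K])         # indices of the K most reliable
```

With these loadings, indicators 0 and 2 carry the most latent variance and are the ones kept; since the score ignores joint diversity, it serves only as the cheap baseline and initializer described above.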
The integer program solver GUROBI 4.02 was used in all experiments.

5.1 Synthetic studies

We start with a batch of synthetic experiments. We generated 80 models with 40 indicators and 10 latent variables¹. We further preprocess such models into two groups: in 40 of them, we select a target reliability ri for each indicator Yi, uniformly at random from the interval [0.4 0.7]. We then rescale coefficients Λi such that the reliability (defined in Section 3.4) of the respective Y⋆_i becomes ri. For the remaining 40 models, we sample ri uniformly at random from the interval [0.2 0.4].

¹Details on the model generation: we generate 40 models by sampling the latent covariance matrix from an inverse Wishart distribution with 10 degrees of freedom and scale matrix 10I, I being the identity matrix.

We perform two choices of subsets: sets Yz of size 20 and 32 (50% and 80% of the total number of indicators). Our evaluation is as follows: since the expected value is perhaps the most common functional of the posterior distribution PM(X | Y), we calculate the expected value of the latent variables for a sample {y(1), y(2), . . . , y(1000)} of size 1000 taken from the respective synthetic models. This is done for the full set of 40 indicators, and for each set chosen by our four criteria: for each data point i and each objective function F, we evaluate the average distance d_F^(i) ≡ ∑_{j=1}^{10} |x̂_j^(i) − x̂_{j;F}^(i)| / 10. In this case, x̂_j^(i) is the expected value of Xj obtained by conditioning on all indicators, while x̂_{j;F}^(i) is the one obtained with the subset selected by optimizing F. We denote by mF the average of {d_F^(1), d_F^(2), . . . , d_F^(1000)}. Finally, we compare the three main methods with respect to the reliability score method using the improvement ratio statistic sF = 1 − mF/m_{FM;R}, the proportion of average error decrease with respect to the reliability score. In order to provide a sense of scale on the difficulty of each problem, we compute the same ratios with respect to a random selection, obtained by choosing K = 20 and K = 32 indicators uniformly at random.

[Figure 2 here: boxplots of improvement ratios (high and low signal) and scatterplots of mean errors for the synthetic models.]

Figure 2: (a) A comparison of the bounded neighborhood (N), tree-based (T) and Gaussian (G) methods with respect to a random solution (R) and the reliability score (S). (b) A similar comparison for models where indicators are more weakly correlated to the latent variables than in (a). (c) and (d) Scatterplots of the average absolute deviance for the tree-based method (horizontal axis) against the reliability method (vertical axis). The bottom-left clouds correspond to the K = 32 trials.

Figure 2 provides a summary of the results. In Figure 2(a), each boxplot shows the distribution over the 40 probit models where reliabilities were sampled between [0.4 0.7] (the “high signal” models). The first three boxplots show the scores sF of the bounded neighborhood, tree-structured and Gaussian methods, respectively, compared against random selections. The last three boxplots are comparisons against the reliability heuristic.
The tree-based method easily beats the Gaussian method, with about 75% of its outcomes being better than the median Gaussian outcome. The Gaussian approach is also less reliable, with results showing a long lower tail. Although the reliability score is on average a good approach, only in a handful of cases was it better than the tree-based method, and by considerably smaller magnitudes compared to the upper tails in the tree-based outcome distribution. A separate panel (Figure 2(b)) is shown for the 40 models with lower reliabilities. In this case, all methods show stronger improvements over the reliability score, although now there is a less clear difference between the tree method and the Gaussian one. Finally, in panels (c) and (d) we present scatterplots for the average deviances mF of the tree-based method against the reliability score. The two clouds correspond to the solutions with 20 and 32 indicators. Notice that in the vast majority of the cases the tree-based method does better.

(Footnote 1, continued.) We then rescale the matrix to make all variances equal to 1. We also generate 40 models using as the inverse Wishart scale matrix the correlation matrix with all off-diagonal entries set to 0.5. Coefficients linking indicators to latent variables were set to zero with probability 0.8, and sampled from a standard Gaussian otherwise. If some latent variable ends up with no child, or an indicator ends up with no parent, we uniformly choose one child/parent to be linked to it. Code to fully replicate the synthetic experiments is available at HTTP://WWW.HOMEPAGES.UCL.AC.UK/∼UCGTRBD/.

5.2 Case study

The National Health Service (NHS) is the public health system of the United Kingdom. In 2009, a major survey called the National Health Service National Staff Survey was deployed with the goal of “collect(ing) staff views about working in their local NHS trust” (Care Quality Commission and Aston University, 2010).
A questionnaire of 206 items was filled by 156,951 respondents. We designed a measurement model based on the structure of the questionnaire and fit it by the posterior expected value estimator. Gaussian and inverse Wishart priors were used, along with Gibbs sampling and a random subset of 50,000 respondents. See the Supplementary Material for more details. Several items in this survey asked the NHS staff member to provide degrees of agreement on a Likert scale (Bartholomew et al., 2008) to questions such as

• . . . have you ever come to work despite not feeling well enough to perform . . . ?
• Have you felt pressure from your manager to come to work?
• Have you felt pressure from colleagues to come to work?
• Have you put yourself under pressure to come to work?

as different probes into an unobservable self-assessed level of work pressure.

We preprocessed and binarized the data to first narrow it down to 63 questions. We compare selections of 32 (50%) and 50 (80%) items using the same statistics of the previous section.

         sF;tree   sF;D     sF;N     sF;random   mF;tree   mF;R
K = 32    7.8%     6.3%     0.01%    −16.0%      0.238     0.255
K = 50   10.5%    11.9%     7.6%     −0.05%      0.123     0.140

Although gains were relatively small (as measured by the difference between reconstruction errors mF;tree − mF;R and the good performance of a random selection), we showed that: i.) we do improve results over a popular measure of indicator quality; ii.) we do provide some guarantees about the diversity of the selected items via an information-theoretical measure with an entropy term, with theoretically sound approximations to such a measure.
For more details on the preprocessing, and for more insights into the different selections, please refer to the Supplementary Material.

6 Conclusion

There are problems where one posits that the relevant information is encoded in the posterior distribution of a set of latent variables. Questionnaires (and other instruments) can be used as evidence to generate this posterior, but there is a cost associated with complex questionnaires. One problem is how to simplify such instruments of measurement. To the best of our knowledge, we provide the first formal account of how to solve it. Nevertheless, we would like to stress that there is no substitute for common sense. While the tools we provide here can be used for a variety of analyses, from deploying simpler questionnaires to sensitivity analysis, the value and cost of keeping particular indicators can go much beyond the information contained in the latent posterior distribution. How to combine this criterion with other domain-dependent criteria is a matter for future research.

Another problem of importance is how to deal with model specification and transportability across studies. A measurement model built for a very specific population of respondents might transfer poorly to another group, and therefore taking model uncertainty into account will be important. The Bayesian setup discussed by Liang et al. (2009) might provide some directions on this issue. Also, there is further structure in real-world questionnaires that we are not exploiting in the current work. Namely, it is not uncommon to have questionnaires with branching questions and other dynamic behaviour more commonly associated with Web-based surveys and/or longitudinal studies. Finally, hybrid approaches combining the bounded neighborhood and the tree-structured methods, along with more sophisticated ordering optimization procedures and the use of other divergence measures and determinant-based criteria (e.g.
Kulesza and Taskar, 2011), will also be studied in the future.

Acknowledgments

The author would like to thank James Cussens and Simon Lacoste-Julien for helpful discussions, as well as the anonymous reviewers for further comments.

References

D. Bartholomew, F. Steele, I. Moustaki, and J. Galbraith. Analysis of Multivariate Social Science Data, 2nd edition. Chapman & Hall, 2008.

C. Bishop. Latent variable models. In M. Jordan (editor), Learning in Graphical Models, pages 371-403, 1998.

C. Bishop. Pattern Recognition and Machine Learning. Springer, 2009.

K. Bollen. Structural Equations with Latent Variables. John Wiley & Sons, 1989.

R. Carroll, D. Ruppert, and L. Stefanski. Measurement Error in Nonlinear Models. Chapman & Hall, 1995.

X. Chen, C. Zou, and R. Cook. Coordinate-independent sparse sufficient dimension reduction and variable selection. Annals of Statistics, 38:3696-3723, 2010.

Care Quality Commission and Aston University. Aston Business School, National Health Service National Staff Survey, 2009 [computer file]. Colchester, Essex: UK Data Archive [distributor], October 2010. Available at https://www.esds.ac.uk, SN: 6570, 2010.

A. Genz. Numerical computation of multivariate normal probabilities. Journal of Computational and Graphical Statistics, 1:141-149, 1992.

A. Globerson and T. Jaakkola. Approximate inference using conditional entropy decompositions. Proceedings of the 11th International Conference on Artificial Intelligence and Statistics (AISTATS 2007), pages 141-149, 2007.

F. Glover and E. Woolsey. Converting the 0-1 polynomial programming problem to a 0-1 linear program. Operations Research, 22:180-182, 1974.

P. Hahn, J. Scott, and C. Carvalho. A sparse factor-analytic probit model for congressional voting patterns. Duke University Department of Statistical Science, Discussion Paper 2009-22, 2010.

T. Jaakkola, D. Sontag, A.
Globerson, and M. Meila. Learning Bayesian network structure using LP relaxations. Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS 2010), pages 366-373, 2010.

A. Kulesza and B. Taskar. k-DPPs: fixed-size determinantal point processes. Proceedings of the 28th International Conference on Machine Learning (ICML), pages 1193-1200, 2011.

W. Leite, I-C. Huang, and G. Marcoulides. Item selection for the development of short forms of scales using an ant colony optimization algorithm. Multivariate Behavioral Research, 43:411-431, 2008.

P. Liang, M. Jordan, and D. Klein. Learning from measurements in exponential families. Proceedings of the 26th Annual International Conference on Machine Learning (ICML '09), 2009.

T. Minka. A family of algorithms for approximate Bayesian inference. PhD Thesis, Massachusetts Institute of Technology, 2001.

J. Palomo, D. Dunson, and K. Bollen. Bayesian structural equation modeling. In Sik-Yum Lee (ed.), Handbook of Latent Variable and Related Models, pages 163-188, 2007.

M. Richins. The material values scale: Measurement properties and development of a short form. The Journal of Consumer Research, 31:209-219, 2004.

J. Stanton, E. Sinar, W. Balzer, and P. Smith. Issues and strategies for reducing the length of self-reported scales. Personnel Psychology, 55:167-194, 2002.

M. Teyssier and D. Koller. Ordering-based search: A simple and effective algorithm for learning Bayesian networks. Proceedings of the Twenty-first Conference on Uncertainty in AI (UAI '05), pages 584-590, 2005.