{"title": "Feature Set Embedding for Incomplete Data", "book": "Advances in Neural Information Processing Systems", "page_first": 793, "page_last": 801, "abstract": "We present a new learning strategy for classification problems in which train and/or test data suffer from missing features. In previous work, instances are represented as vectors from some feature space and one is forced to impute missing values or to consider an instance-specific subspace. In contrast, our method considers instances as sets of (feature,value) pairs which naturally handle the missing value case. Building onto this framework, we propose a classification strategy for sets. Our proposal maps (feature,value) pairs into an embedding space and then non-linearly combines the set of embedded vectors. The embedding and the combination parameters are learned jointly on the final classification objective. This simple strategy allows great flexibility in encoding prior knowledge about the features in the embedding step and yields advantageous results compared to alternative solutions over several datasets.", "full_text": "Feature Set Embedding for Incomplete Data\n\nDavid Grangier\nNEC Labs America\n\nPrinceton, NJ\n\ndgrangier@nec-labs.com\n\nIain Melvin\n\nNEC Labs America\n\nPrinceton, NJ\n\niain@nec-labs.com\n\nAbstract\n\nWe present a new learning strategy for classi\ufb01cation problems in which train and/or\ntest data suffer from missing features. In previous work, instances are represented\nas vectors from some feature space and one is forced to impute missing values or\nto consider an instance-speci\ufb01c subspace. In contrast, our method considers in-\nstances as sets of (feature,value) pairs which naturally handle the missing value\ncase. Building onto this framework, we propose a classi\ufb01cation strategy for sets.\nOur proposal maps (feature,value) pairs into an embedding space and then non-\nlinearly combines the set of embedded vectors. 
The embedding and the combina-\ntion parameters are learned jointly on the \ufb01nal classi\ufb01cation objective. This simple\nstrategy allows great \ufb02exibility in encoding prior knowledge about the features in\nthe embedding step and yields advantageous results compared to alternative solu-\ntions over several datasets.\n\n1\n\nIntroduction\n\nMany applications require classi\ufb01cation techniques dealing with train and/or test instances with miss-\ning features: e.g. a churn predictor might deal with incomplete log features for new customers,\na spam \ufb01lter might be trained from data originating from servers storing different features, a face\ndetector might deal with images for which high resolution cues are corrupted.\nIn this work, we address a learning setting in which the missing features are either missing at ran-\ndom [6], i.e. deletion due to corruption or noise, or structurally missing [4], i.e. some features do not\nmake sense for some examples, e.g. activity history for new customers. We do not consider setups\nin which the features are maliciously deleted to fool the classi\ufb01er [5]. Techniques for dealing with\nincomplete data fall mainly into two categories: techniques which impute the missing features and\ntechniques considering an instance-speci\ufb01c subspace.\nImputation-based techniques are the most common. In this case, the data instances are viewed as\nfeature vectors in a high-dimensional space and the classi\ufb01er is a function from this space into the\ndiscrete set of classes. Prior to classi\ufb01cation, the missing vector components need to be imputed.\nEarly imputation approaches \ufb01ll any missing value with a constant, zero or the average of the feature\nover the observed cases [18]. This strategy neglects inter-feature correlation, and completion tech-\nniques based on k-nearest-neighbors (k-NN) have subsequently been proposed to circumvent this\nlimitation [1]. 
Along this line, more complex strategies based on generative models have been used to fill missing features according to the most likely value given the observed features. In this case, the Expectation-Maximization algorithm is typically adopted to estimate the data distribution over the incomplete training data [9]. Building upon this generative model strategy, several approaches have considered integrating out the missing values, either by integrating the loss [2] or the decision function [22]. Recently, [15] and [6] have proposed to avoid the initial maximum likelihood distribution estimation. Instead, they proposed to learn jointly the generative model and the decision function to optimize the final classification loss.\n\nFigure 1: Feature Set Embedding: An example is given as a set of (feature, value) pairs. Each pair is mapped into an embedding space, then the embedded vectors are combined into a single vector (either linearly with mean or non-linearly with max). Linear classification is then applied. Our learning procedure learns both the embedding space and the linear classifier jointly.\n\nAs an alternative to imputation-based approaches, [4] has proposed a different framework. In this case, each instance is viewed as a vector from a subspace of the feature space determined by its observed features. A decision function is learned for each specific subspace and parameter sharing between the functions allows the method to achieve tractability and generalization. Compared to imputation-based approaches, this strategy avoids choosing a generative model, i.e. making an assumption about the missing data. Other alternatives to imputation have been proposed in [10] and [5]. 
These approaches focus on linear classifiers and propose learning procedures which avoid concentrating the weights on a small subset of the features, which helps achieve better robustness with respect to feature deletion.\nIn this work, we propose a novel strategy called feature set embedding. Contrary to previous work, we do not consider instances as vectors from a given feature space. Instead, we consider instances as a set of (feature, value) pairs and propose to learn to classify sets directly. For that purpose, we introduce a model which maps each (feature, value) pair onto an embedding space and combines the embedded pairs into a single vector before applying a linear classifier, see Figure 1. The embedding space mapping and the linear classifier are jointly learned to maximize the conditional probability of the label given the observed input. Contrary to previous work, this set embedding framework naturally handles incomplete data without modeling the missing feature distribution, or considering an instance-specific decision function. Compared to other work on learning from sets, our approach is original as it proposes to learn to embed set elements and to classify sets as a single optimization problem, while prior strategies learn their decision function considering a fixed mapping from sets into a feature space [12, 3].\nThe rest of the paper is organized as follows: Section 2 presents the proposed approach, Section 3 describes our experiments and results. Section 4 concludes.\n\n2 Feature Set Embedding\n\nWe denote an example as (X, y) where X = {(f_i, v_i)}_{i=1}^{|X|} is a set of (feature, value) pairs and y is a class label in Y = {1, ..., k}. The set of features is discrete, i.e. \u2200i, f_i \u2208 {1, ..., d}, while the feature values are either continuous or discrete, i.e. \u2200i, v_i \u2208 V_{f_i} where V_{f_i} = R or V_{f_i} = {1, ..., c_{f_i}}.\nGiven a labeled training dataset D_train = {(X_i, y_i)}_{i=1}^{n}, we propose to learn a classifier g which predicts a class from an input set X.\nFor that purpose, we combine two levels of modeling. At the lower level, (feature, value) pairs are individually mapped into an embedding space of dimension m: given an example X = {(f_i, v_i)}_{i=1}^{|X|}, a function p predicts an embedding vector p_i = p(f_i, v_i) \u2208 R^m for each (feature, value) pair (f_i, v_i). At the upper level, the embedded vectors are combined to make the class prediction: a function h takes the set of embedded vectors {p_i}_{i=1}^{|X|} and predicts a vector of confidence values h({p_i}_{i=1}^{|X|}) \u2208 R^k in which the correct class should be assigned the highest value. Our classifier composes the two levels, i.e. g = h \u25e6 p. Intuitively, the first level extracts the information relevant to class prediction provided by each feature, while the second level combines this information over all observed features.\n\n[Figure 1 diagram: the input lists features A to G, of which A (0.15), E (0.28) and F (0.77) are observed and B, C, D and G are missing; the observed pairs are embedded as p(A, 0.15), p(E, 0.28) and p(F, 0.77), combined by the (non)linear combination \u03a6, and passed to the linear decision V, which scores classes 1 to 5.]\n\n2.1 Feature Embedding\n\nFeature embedding offers great flexibility. It can accommodate discrete and continuous data and allows encoding prior knowledge on characteristics shared between groups of features. For discrete features, the simplest embedding strategy learns a distinct parameter vector for each (f, v) pair, i.e.\n\np(f, v) = L_{f,v} where L_{f,v} \u2208 R^m.\n\nFor capacity control, rank regularization can be applied,\n\np(f, v) = W L_{f,v} where L_{f,v} \u2208 R^l and W \u2208 R^{m\u00d7l}.\n\nIn this case, l < m is a hyperparameter bounding the rank of W L, where L denotes the matrix concatenating all L_{f,v} vectors. 
One can further indicate that two pairs (f, v) and (f, v') originate from the same feature by parameterizing L_{f,v} as\n\nL_{f,v} = [L^{(a)}_f ; L^{(b)}_{f,v}] where L^{(a)}_f \u2208 R^{l^{(a)}}, L^{(b)}_{f,v} \u2208 R^{l^{(b)}} and l^{(a)} + l^{(b)} = l,   (1)\n\nwhere [x ; y] denotes the vertical concatenation of x and y. Similarly, one can indicate that two pairs (f, v) and (f', v) share the same value by parameterizing\n\nL_{f,v} = [L^{(a)}_{f,v} ; L^{(b)}_v] where L^{(a)}_{f,v} \u2208 R^{l^{(a)}}, L^{(b)}_v \u2208 R^{l^{(b)}} and l^{(a)} + l^{(b)} = l.   (2)\n\nThis is useful when feature values share a common physical meaning, like gray levels at different pixel locations or temperatures measured by different sensors. Of course, the parameter sharing strategies (1) and (2) can be combined.\nWhen the feature values are continuous, we adopt a similar strategy and define\n\np(f, v) = W [L^{(a)}_f ; v L^{(b)}_f] where L^{(a)}_f \u2208 R^{l^{(a)}}, L^{(b)}_f \u2208 R^{l^{(b)}} and l^{(a)} + l^{(b)} = l,   (3)\n\nwhere L^{(a)}_f informs about the presence of feature f, while v L^{(b)}_f informs about its value. If the model is thought not to need presence information, L^{(a)}_f can be omitted, i.e. l^{(a)} = 0.\nWhen the dataset contains a mix of continuous and discrete features, both embedding approaches can be used jointly. Feature embedding is hence a versatile strategy; the practitioner defines the model parameterization according to the nature of the features, and the learned parameters L and W encode the correlation between features.\n\n2.2 Classifying from an Embedded Feature Set\n\nThe second level of our architecture h considers the set of embedded features and predicts a vector of confidence values. 
Given an example X = {(f_i, v_i)}_{i=1}^{|X|}, the function h takes the set P = {p(f_i, v_i)}_{i=1}^{|X|} as input, and outputs h(P) \u2208 R^k according to\n\nh(P) = V \u03a6(P)\n\nwhere \u03a6 is a function which takes a set of vectors of R^m and outputs a single vector of R^m, while V is a k-by-m matrix. This second level is hence related to kernel methods for sets, which first apply a fixed mapping \u03a6 from sets to vectors, before learning a linear classifier in the feature space [12]. In our case, however, we make sure that \u03a6 is a generalized differentiable function [19], so that h and p can be optimized jointly. In the following, we consider two alternatives for \u03a6: a linear function, the mean, and a non-linear function, the component-wise max.\n\nLinear Model\n\nIn this case, one can remark that\n\nh(P) = V mean({p(f_i, v_i)}_{i=1}^{|X|})\n     = V mean({W L_{f_i,v_i}}_{i=1}^{|X|})\n     = V W mean({L_{f_i,v_i}}_{i=1}^{|X|})\n\nby linearity of the mean. Hence, in this case, the dimension of the embedding space m bounds the rank of the matrix V W. This also means that considering m > k is irrelevant in the linear case. In the specific case where features are continuous and no presence information is provided, i.e. L_{f,v} = v L^{(b)}_f, our model is equivalent to a classical linear classifier operating on feature vectors when all features are present, i.e. |X| = d,\n\ng(X) = V W mean({L_{f_i,v_i}}_{i=1}^{d}) = (1/d) V W \u2211_{i=1}^{d} v_i L^{(b)}_{f_i} = (1/d) (V W L) v\n\nwhere L denotes the matrix [L^{(b)}_{f_1}, ..., L^{(b)}_{f_d}] and v denotes the vector [v_1, ..., v_d]. Hence, in this case, our model corresponds to\n\ng(X) = M v where M \u2208 R^{k\u00d7d} s.t. rank(M) \u2264 min{k, l, m, d}.\n\nNon-linear Model\n\nIn this case, we rely on the component-wise max. This strategy can model more complex decision functions. In this case, selecting m > k, l is meaningful. 
Intuitively, each dimension in the embedding space provides a meta-feature describing each (feature, value) pair; the max operator then outputs the best meta-feature match over the set of (feature, value) pairs, performing a kind of soft-OR, i.e. checking whether there is at least one pair for which the meta-feature is high. The final classification decision is then taken as a linear combination of the m soft-ORs. One can relate our use of the max operator to its common use in fixed set mapping for computer vision [3].\n\n2.3 Model Training\n\nModel learning aims at selecting the parameter matrices L, W and V. For that purpose, we maximize the (log) posterior probability of the correct class over the training set D_train = {(X_i, y_i)}_{i=1}^{n}, i.e.\n\nC = \u2211_{i=1}^{n} log P(y_i | X_i)\n\nwhere model outputs are mapped to probabilities through a softmax function, i.e.\n\nP(y | X) = exp(g(X)_y) / \u2211_{y'=1}^{k} exp(g(X)_{y'}).\n\nCapacity control is achieved by selecting the hyperparameters l and m. For linear models, the criterion C is referred to as the multiclass logistic regression objective and [16] has studied the relation between C and margin maximization. In the binary case (k = 2), the criterion C is often referred to as the cross entropy objective.\nThe maximization of C is conducted through stochastic gradient ascent from random initial parameters. This algorithm scales to large training sets and has good properties for non-convex problems [14], which is of interest for our non-linear model and for the linear model when rank regularization is used. One can note that our non-linear model relies on the max function, which is not differentiable everywhere. 
However, [8] has shown that gradient ascent can also be applied to generalized differentiable functions, which is the case of our criterion.\n\n3 Experiments\n\nOur experiments consider different setups: features missing at train and test time, features missing only at train time, features missing only at test time. In each case, our model is compared to alternative solutions relying on experimental setups introduced in prior work. Finally, we study our model in various conditions over the larger MNIST dataset.\n\n3.1 Missing Features at Train and Test Time\n\nThe setup in which features are missing at train and test time is relevant to applications suffering sensor failure or communication errors. It is also relevant to applications in which some features are structurally missing, i.e. the measurements are absent because they do not make sense (e.g. see the car detection experiments).\n\nTable 1: Dataset Statistics\n\nDataset            Train set size  Test set size  # eval. splits  Total # feat.  Missing feat. (%)  Continuous (c) or discrete (d)\nUCI sick           2,530           633            5               25             90                 c\nUCI pima           614             154            5               8              90                 c\nUCI hepatitis      124             31             5               19             90                 c\nUCI echo           104             27             5               7              90                 c\nUCI hypo           2,530           633            5               25             90                 c\nMNIST-5-vs-6       1,000           200            2               784            25                 d\nCars               177             45             5               1,900          62                 d\nUSPS               1,000           6,291          100             256            85*                c\nPhysics            1,000           5,179          100             78             85*                c\nMine               500             213            100             41             26*                c\nMNIST-miss-test\u2020   12\u00d7100          12\u00d7300         20              784            0 to 99\u2020           d\nMNIST-full         60,000          10,000         1               784            0 to 87            d\n\n* Features missing only at training time for USPS, Physics and Mine.\n\u2020 Features missing only at test time for MNIST-miss-test. This set presents 12 binary problems, 4vs9, 3vs5, 7vs9, 5vs8, 3vs8, 2vs8, 2vs3, 8vs9, 5vs6, 2vs7, 4vs7 and 2vs6, each having 100 examples for training, 200 for validation and 300 for test.\n\nWe compare our model to alternative solutions over the experimental setup introduced in [4]. 
Three\nsets of experiments are considered. The \ufb01rst set relies on binary classi\ufb01cation problems from the\nUCI repository. For each dataset, 90% of the features are removed at random. The second set of\nexperiments considers the task of discriminating between handwritten characters of 5 and 6 from\nthe MNIST dataset. Contrary to UCI, the deleted features have some structure; for each example, a\nsquare area covering 25% of the image surface is removed at random. The third set of experiments\nconsiders detecting cars in images. This task presents a problem where some features are structurally\nmissing. For each example, regions of interests corresponding to potential car parts are detected, and\nfeatures are extracted for each region. For each image, 19 types of region are considered and between\n0 and 10 instances of each region have been extracted. Each region is then described by 10 features.\nThis region extraction process is described in [7]. Hence, at most 1900 = 19 \u00d7 10 \u00d7 10 features are\nprovided for each image. Data statistics are summarized in Table 1.\nOn these datasets, Feature Set Embedding (FSE) is compared to 7 baseline models. These baselines\nare all variants of Support Vector Machines (SVMs), suitable for the missing feature problem. Zero,\nMean, GMM and kNN are imputation-based strategies: Zero sets the missing values to zero, Mean\nsets the missing values to the average value of the features over the training set, GMM \ufb01nds the\nmost likely missing values given the observed ones relying on a Gaussian Mixture learned over the\ntraining set, kNN \ufb01lls the missing values of an instance based on its k-nearest-neighbors, relying on\nthe Euclidean distance in the subspace relevant to each pair of examples. Flag relies on the Zero\nimputation but complements the examples with binary features indicating whether each feature was\nobserved or imputed. 
Finally, geom is a subspace-based strategy [4]; for each example, a classi\ufb01er\nin the subspace corresponding to the observed features is considered. The instance-speci\ufb01c margin\nis maximized but the instance-speci\ufb01c classi\ufb01ers share common weights.\nFor each experiment, the hyperparameters of our model l, m and the number of training iterations are\nvalidated by \ufb01rst training the model on 4/5 of the training data and assessing it on the remainder of\nthe training data. A similar strategy has been used for selecting the baseline parameters. The SVM\nkernel has notably been validated between linear and polynomial up to order 3. Test performance is\nthen reported over the best validated parameters.\nTable 2 reports the results of our experiments. Overall, FSE performs at least as well as the best alter-\nnative for all experiments, except for hepatitis where all models yield almost the same performance.\nIn the case of structurally missing features, the car experiment shows a substantial advantage for FSE\nover the second best approach geom, which was speci\ufb01cally introduced for this kind of setup. During\nvalidation (no validation results are reported due to space constraints), we noted that non-linear mod-\n\n5\n\n\fTable 2: Error Rate (%) for Missing Features at Train & Test Time\n\nUCI\n\nsick\npima\nhepatitis\necho\nhypo\n\nMNIST-5-vs-6\nCars\n\nFSE geom zero mean\n37\n35\n22\n33\n35\n6\n39\n\n9\n34\n23\n33\n5\n5\n24\n\n10\n34\n22\n34\n5\n5\n28\n\n9\n34\n22\n37\n7\n5\n39\n\n\ufb02ag GMM kNN\n30\n16\n35\n41\n23\n22\n33\n36\n19\n6\n6\n7\n41\n48\n\n40\n35\n22\n33\n33\n5\n38\n\nTable 3: Error rate (%) for missing features at train time only\n\nUSPS\nPhysics\nMines\n\nFSE meanInput GMM meanFeat\n13.2\n11.7\n29.6\n23.8\n9.8\n10.6\n\n9.0\n31.2\n10.5\n\n13.6\n29.2\n11.7\n\nels, i.e. the baseline SVM with a polynomial kernel of order 2 and FSE with \u03c6 = max, outperformed\ntheir linear counterparts. 
We therefore solely validate non-linear FSE in the following: For feature\nembedding of continuous data, feature presence information has proven to be useful in all cases, i.e.\nl(a) > 0 in Eq. (3). For feature embedding of discrete data, sharing parameters across different\nvalues of the same feature, i.e. Eq. (1), was also helpful in all cases. We also relied on sharing\nparameters across different features with the same value, i.e. Eq. (2), for datasets where the feature\nvalues shared a common meaning, i.e. gray levels for MNIST and region features for cars. For the\nhyperparameters (l, m) of our model, we observed that the main control on our model capacity is\nthe embedding size m. Its selection is simple since varying this parameter consistently yields convex\nvalidation curves. The rank regularizer l needed little tuning, yielding stable validation performance\nfor a wide range of values.\n\n3.2 Missing Features at Train Time\n\nThe setup presenting missing features at training time is relevant to applications which rely on dif-\nferent sources for training. Each source might not collect the exact same set of features, or might\nhave introduced novel features during the data collection process. At test time however, the feature\ndetector can be designed to collect the complete feature set.\nIn this case, we compare our model to alternative solutions over the experimental setup introduced\nin [6]. Three datasets are considered. The \ufb01rst set USPS considers the task of discriminating between\nodd and even handwritten digits over the USPS dataset. The training set is degraded and 85% of the\nfeatures are missing. The second set considers the quantum physics data from the KDD Cup 2004 in\nwhich two types of particles generated in high energy collider experiments should be distinguished.\nAgain, the training set is degraded and 85% of the features are missing. 
The third set considers the problem of detecting land-mines from 4 types of sensors; each sensor provides a different set of features or views. In this case, for each instance, whole views are considered missing during training. Data statistics are summarized in Table 1 for the three sets.\nFor this set of experiments, we rely on infinite imputations as a baseline. Infinite imputation is a general technique proposed for the case where features are missing at train time. Instead of pretraining the distribution governing the missing values with a generative objective, infinite imputation proposes to train the imputation model and the final classifier in a joint optimization framework [6]. In this context, we consider an SVM with an RBF kernel as the classifier and three alternative imputation models: Mean, GMM, and MeanFeat, which corresponds to mean imputation in the feature space.\nFor each experiment, we follow the validation strategy defined in the previous section for FSE. The validation strategy for tuning the parameters of the other models is described in [6].\nTable 3 reports our results. FSE is the best model for the Physics and Mines datasets, and the second best model for the USPS dataset. For USPS, features are highly correlated and GMM imputation yields a challenging baseline. On the other hand, Physics presents a challenging problem with higher error rates for all models. In this case, feature correlation is low and GMM imputation yields the worst performance, while our model brings a strong improvement.\n\nFigure 2: Results for MNIST-miss-test (12 binary problems with features missing at test time only)\n\n3.3 Missing Features at Test Time\n\nThe setup presenting missing features at test time considers applications in which the training data have been produced with more care than the test data. 
For example, in a face identi\ufb01cation applica-\ntion, customers could provide clean photographs for training while, at test time, the system should\nbe required to work in the presence of occlusions or saturated pixels.\nIn this case, we compare our work to [10] and [5]. Both strategies propose to learn a classi\ufb01er\nwhich avoids assigning high weight to a small subset of features, hence limiting the impact of the\ndeletion of some features at test time.\n[10] formulates their strategy as a min-max problem, i.e.\nidentifying the best classi\ufb01er under the worst deletion, while [5] relies on an L\u221e regularizer to\navoid assigning high weights to few features. We compare our algorithm to these alternatives over\nbinary problems discriminating handwritten digits originating from MNIST. This experimental setup\nhas been introduced in [10] and Table 1 summarizes its statistics.\nIn this setup, the data is split\ninto training, validation and test sets. For a fair comparison, the validation set is used solely to\nselect hyperparameters, i.e. we do not retrain the model over both training and validation sets after\nhyperparameter selection.\nSince no features are missing at train time, we adapt our training procedure to take into account\nthe mismatched conditions between train and test. Each time an example is considered during our\nstochastic training procedure, we delete a random subset of its features. The size of this subset is\nsampled uniformly between 0 and the total number of features minus 1.\nFigure 2 plots the error rate as a function of the number of missing features. FSE has a clear advantage\nin most settings: it achieves a lower error rate than Globerson & Roweis [10] in all cases, while it\nis better than Dekel & Shamir [5], as soon as the number of missing features is above 50, i.e. less\nthan 6% missing features. 
In fact, we observe that FSE is very robust to feature deletion; its error rate remains below 20% for up to 700 missing features, i.e. 90% missing features. On the other hand, the alternative strategies report performance close to random when the number of missing features reaches 150, i.e. 20% missing features. Note that [10] and [5] further evaluate their models in an adversarial setting, i.e. features are intentionally deleted to fool the classifier, which is beyond the scope of this work.\n\n[Figure 2 plot: error rate (%) as a function of the number of missing features for FSE, Dekel & Shamir, and Globerson & Roweis.]\n\n3.4 MNIST-full experiments\n\nThe previous experiments compared our model to prior approaches relying on the experimental setups introduced to evaluate these approaches. These setups proposed small training sets motivated by the training cost of the compared alternatives (see Table 1). In this section, we stress the scalability of our learning procedure and study FSE on the whole MNIST dataset with 10 classes and 60,000 training instances. All conditions are considered: features missing at training time, at testing time, and at both times.\nWe train 4 models which have access to training sets with various numbers of available features, i.e. 100, 300, 500 and 784 features which approximately correspond to 90, 60, 35 and 0% missing features. We train a 5th model referred to as random with the algorithm introduced in Section 3.3, i.e. all training features are available but the training procedure randomly hides some features each time it examines an example. All models are evaluated with 100, 300, 500 and 784 available features.\n\nTable 4: Error Rate (%) 10-class MNIST-full Experiments\n\n            # test features\n# train f.  100     300     500     784\n100         19.8    8.9     7.5     6.9\n300         34.2    7.4     4.8     3.9\n500         55.6    12.3    4.8     2.9\n784         78.3    46.7    17.8    2.5\nrandom      10.7    2.9     2.1     1.8\n\nTable 4 reports the results of these experiments. 
Excluding the random model, the result matrix is strongly diagonal, e.g. when facing a test problem with 300 available features, the model trained with 300 features is better than the models trained with 100, 500 or 784 features. This is not surprising as the training distribution is closer to the testing distribution in that case. We also observe that models facing fewer features at test time than at train time yield poor performance, while the models trained with few features yield satisfying performance when facing more features. This seems to suggest that training with missing features yields more robust models as it prevents the decision function from relying solely on a few specific features that might be corrupted. In other words, training with missing features seems to achieve a goal similar to that of L\u221e regularization [5]. This observation is precisely what led us to introduce the random training procedure. In this case, the model performs better than all other models in all conditions, even when all features are present, confirming our regularization hypothesis. In fact, the results obtained with no missing features (1.8% error) are comparable to the best non-convolutional methods, including traditional neural networks (1.6% error) [20]. Only recent work on Deep Boltzmann Machines [17] achieved significantly better performance (0.95% error). The regularization effect of missing training features could be related to noise injection techniques for regularization [21, 11].\n\n4 Conclusions\n\nThis paper introduces Feature Set Embedding for the problem of classification with missing features. Our approach deviates from the standard classification paradigm: instead of considering examples as feature vectors, we consider examples as sets of (feature, value) pairs which handle the missing feature problem more naturally. In order to classify sets, we propose a new strategy relying on two levels of modeling. 
At the \ufb01rst level, each (feature, value) is mapped onto an embedding space. At\nthe second level, the set of embedded vectors is compressed onto a single embedded vector over\nwhich linear classi\ufb01cation is applied. Our training algorithm then relies on stochastic gradient ascent\nto jointly learn the embedding space and the \ufb01nal linear decision function.\nThis proposed strategy has several advantages compared to prior work. First, sets are conceptually\nbetter suited than vectors for dealing with missing values. Second, embedding (feature, value) pairs\noffers a \ufb02exible framework which easily allows encoding prior knowledge about the features. Third,\nour experiments demonstrate the effectiveness and the scalability of our approach.\nFrom a broader perspective, the \ufb02exible feature embedding framework could go beyond the missing\nfeature application. In particular, it allows using meta-features (attributes describing a feature) [13],\ne.g. the embedding vector of the temperature features in a weather prediction system could be com-\nputed from the locations of their sensors. It also enables the designing of a system in which new\nsensors are added without requiring full model re-training; in this case, the model could be quickly\nadapted by only updating embedding vectors corresponding to the new sensors. Also, our approach\nof relying on feature sets offers interesting opportunities for feature selection and adversarial feature\ndeletion. We plan to study these aspects in the future.\n\nAcknowledgments The authors are grateful to Gal Chechik and Uwe Dick for sharing their data\nand experimental setups.\n\n8\n\n\fReferences\n[1] G. Batista and M. Monard. A study of k-nearest neighbour as an imputation method. In Hybrid\n\nIntelligent Systems (HIS), pages 251\u2013260, 2002.\n\n[2] C. Bhattacharyya, P. K. Shivaswamy, and A. Smola. A second order cone programming formu-\nlation for classifying missing data. 
In Neural Information Processing Systems (NIPS), pages 153\u2013160, 2005.\n\n[3] S. Boughorbel, J-P. Tarel, and F. Fleuret. Non-Mercer kernels for SVM object recognition. In British Machine Vision Conference (BMVC), 2004.\n\n[4] G. Chechik, G. Heitz, G. Elidan, P. Abbeel, and D. Koller. Max margin classification of data with absent features. Journal of Machine Learning Research (JMLR), 9:1\u201321, 2008.\n\n[5] O. Dekel, O. Shamir, and L. Xiao. Learning to classify with missing and corrupted features. Machine Learning Journal, 2010 (to appear).\n\n[6] U. Dick, P. Haider, and T. Scheffer. Learning from incomplete data with infinite imputations. In International Conference on Machine Learning (ICML), 2008.\n\n[7] G. Elidan, G. Heitz, and D. Koller. Learning object shape: From drawings to images. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 2064\u20132071, 2006.\n\n[8] Y. M. Ermoliev and V. I. Norkin. Stochastic generalized gradient method with application to insurance risk management. Technical Report 21, International Institute for Applied Systems Analysis, 1997.\n\n[9] Z. Ghahramani and M. I. Jordan. Supervised learning from incomplete data via an EM approach. In Neural Information Processing Systems (NIPS), pages 120\u2013127, 1993.\n\n[10] A. Globerson and S. Roweis. Nightmare at test time: robust learning by feature deletion. In International Conference on Machine Learning (ICML), pages 353\u2013360, 2006.\n\n[11] Y. Grandvalet, S. Canu, and S. Boucheron. Noise injection: Theoretical prospects. Neural Computation, 9(5):1093\u20131108, 1997.\n\n[12] R. Kondor and T. Jebara. A kernel between sets of vectors. In International Conference on Machine Learning (ICML), 2003.\n\n[13] E. Krupka, A. Navot, and N. Tishby. Learning to select features using their properties. Journal of Machine Learning Research (JMLR), 9:2349\u20132376, 2008.\n\n[14] Y. LeCun, L. Bottou, G. B. Orr, and K. R. Mueller. 
Efficient backprop. In G. B. Orr and K. R. Mueller, editors, Neural Networks: Tricks of the Trade, chapter 1, pages 9\u201350. Springer, 1998.\n\n[15] X. Liao, H. Li, and L. Carin. Quadratically gated mixture of experts for incomplete data classification. In International Conference on Machine Learning (ICML), pages 553\u2013560, 2007.\n\n[16] S. Rosset, J. Zhu, and T. Hastie. Margin maximizing loss functions. In Neural Information Processing Systems (NIPS), 2003.\n\n[17] R. Salakhutdinov and H. Larochelle. Efficient learning of deep Boltzmann machines. In Artificial Intelligence and Statistics (AISTATS), 2010.\n\n[18] J. L. Schafer. Analysis of Incomplete Multivariate Data. Chapman & Hall, London, UK, 1998.\n\n[19] N. Z. Shor. Minimization Methods for Non-Differentiable Functions and Applications. Springer, Berlin, Germany, 1985.\n\n[20] P. Simard, D. Steinkraus, and J. C. Platt. Best practices for convolutional neural networks applied to visual document analysis. In International Conference on Document Analysis and Recognition (ICDAR), pages 958\u2013962, 2003.\n\n[21] P. Vincent, H. Larochelle, Y. Bengio, and P. A. Manzagol. Extracting and composing robust features with denoising autoencoders. In International Conference on Machine Learning (ICML), pages 1096\u20131103, 2008.\n\n[22] D. Williams, X. Liao, Y. Xue, and L. Carin. Incomplete-data classification using logistic regression. In International Conference on Machine Learning (ICML), pages 972\u2013979, 2005.\n", "award": [], "sourceid": 745, "authors": [{"given_name": "David", "family_name": "Grangier", "institution": null}, {"given_name": "Iain", "family_name": "Melvin", "institution": null}]}