{"title": "Max-margin classification of incomplete data", "book": "Advances in Neural Information Processing Systems", "page_first": 233, "page_last": 240, "abstract": null, "full_text": "Max-margin classification of incomplete data\n\nGal Chechik1, Geremy Heitz2, Gal Elidan1, Pieter Abbeel1, Daphne Koller1\n1 Department of Computer Science, Stanford University, Stanford CA, 94305\n2 Department of Electrical Engineering, Stanford University, Stanford CA, 94305\nEmail for correspondence: gal@ai.stanford.edu\n\nAbstract\nWe consider the problem of learning classifiers for structurally incomplete data, where some objects have a subset of features inherently absent due to complex relationships between the features. The common approach for handling missing features is to begin with a preprocessing phase that completes the missing features, and then use a standard classification procedure. In this paper we show how incomplete data can be classified directly without any completion of the missing features using a max-margin learning framework. We formulate this task using a geometrically-inspired objective function, and discuss two optimization approaches: the linearly separable case is written as a set of convex feasibility problems, and the non-separable case has a non-convex objective that we optimize iteratively. By avoiding the preprocessing phase in which the data is completed, these approaches offer considerable computational savings. More importantly, we show that by elegantly handling complex patterns of missing values, our approach is competitive with other methods when the values are missing at random and outperforms them when the missing values have non-trivial structure. 
We demonstrate our results on two real-world problems: edge prediction in metabolic pathways, and automobile detection in natural images.\n\n1 Introduction\n\nIn the traditional formulation of supervised learning, data instances are viewed as vectors of features in some high-dimensional space. However, in many real-world tasks, data instances have a complex pattern of missing features. While features may sometimes be missing due to measurement noise or corruption, different samples often have different sets of observable features due to inherent properties of the instances. For example, in the case of recognizing objects in natural images, an object is often classified using a set of image patches corresponding to parts of the object (like the license plate for cars); but some images may not contain all parts, either because a part was not captured in the image or because the specific instance does not have this part in the first place. In other scenarios, some features cannot even be defined for all instances. Such situations arise when the objects to be learned are organized based on a known graph structure, since their features may rely on local properties of the graph. For example, we might wish to classify the attributes of a web-page given the attributes of neighboring web-pages [8]. In analyzing genomic data, we may wish to predict the edges in networks of interacting proteins or chemical reactions [9, 15]. In these cases, the local neighborhood of an instance in the graph often varies drastically, and it has already been observed that this variation could introduce statistical biases [8]. In the web-page task, for instance, a useful feature is the most common topic of other sites that point to a given page. When a page has no such parents, however, this feature is meaningless and should be considered structurally absent. 
The common approach for classification with missing features is imputation, a two-phase procedure where the values of the missing attributes are first filled in during a preprocessing phase, after which a standard classifier is applied to the completed data [10]. Imputation techniques make the most sense when the features are missing due to noise, especially in the setting of missing at random (MAR, when the missingness pattern is conditionally independent of the unobserved features given the observations), or missing completely at random (MCAR, when it is independent of both observed and unobserved measurements). In common practice, missing attributes in continuous data are often filled with zeros, or with the average over all of the data instances, or using the k nearest neighbors (kNN) of each instance to find a plausible value of its missing features. A second family of imputation methods builds probabilistic generative models of the features using raw maximum likelihood or algorithms such as expectation maximization (EM) [4]. Such model-based methods allow the designer to introduce prior knowledge and are extremely useful when priors can be explicitly modeled. These methods work very well for MAR data settings, because they assume that the missing features are generated by the same model that generates the observed features. However, model-based approaches can be computationally expensive, and require significant prior knowledge about the data. More importantly, they will produce meaningless completions for features that are structurally absent. As an extreme example, consider two subpopulations of instances (e.g., animals and buildings) having no overlapping features (e.g., body parts and architectural aspects), in which filling missing values (e.g., the body parts of buildings) is clearly meaningless and may harm classification performance. 
As a result, for structurally absent features, it would be useful if we could avoid unnecessary prediction of hypothetical undefined values, and classify instances directly. We approach this problem directly from the geometric interpretation of the classification task as finding a separating hyperplane in the feature space. We view instances with different feature sets as lying in subspaces of the full feature space, and suggest a modified optimization objective within the framework of support vector machines (SVMs) that explicitly considers the subspace of each instance. We show how the linearly separable case can be efficiently solved using convex optimization (second order cone programming, SOCP). The objective of the non-separable case is non-convex, and we propose an iterative procedure that is found to converge in practice. These approaches may be viewed as model-free methods for handling missing data in the cases where the MAR assumption fails to hold. We evaluate the performance of our approach in two real-world applications: prediction of missing enzymes in a metabolic network, and automobile detection in natural images. In both tasks, features may be inherently absent due to the mechanisms described above, and our methods are found superior to other simple imputation methods.\n\n2 Max-Margin Formulation for Missing Features\n\nLet $x_1, \\ldots, x_n$ be a set of samples with binary labels $y_i \\in \\{-1, 1\\}$. Each sample $x_i$ is characterized by a subset of features $F(x_i)$, from a full set $F$ of size $d$. A sample that has all features, $F(x_i) = F$, is viewed as a vector in $\\mathbb{R}^d$, where the $k$th coordinate corresponds to the $k$th feature. A sample $x_i$ with partially valid features can be viewed as embedded in the relevant subspace $\\mathbb{R}^{|F(x_i)|} \\subseteq \\mathbb{R}^d$. For simplicity of notation, we treat each $x_i$ as if it were a vector in $\\mathbb{R}^d$ where only its $F(x_i)$ entries are valid, and define the inner product with another vector $w \\in \\mathbb{R}^d$ as $w x_i = \\sum_{k: f_k \\in F(x_i)} w_k x_{ik}$. 
Importantly, since instances share features, the learned classifier must be consistent across instances, assigning the same weight to a given feature in different samples, even if those instances do not lie in the same subspace. In the classical SVM approach [14, 13], a linear classifier $w$ is optimized to maximize the margin, defined as $\\min_i y_i (w x_i + b) / \\|w\\|$, and the learning problem is reduced to the quadratic constrained optimization problem\n\n$$\\min_{w,\\xi,b} \\frac{1}{2}\\|w\\|^2 + C \\sum_{i=1}^{n} \\xi_i \\quad \\text{s.t.} \\quad y_i (w x_i + b) \\ge 1 - \\xi_i, \\quad i = 1 \\ldots n \\qquad (1)$$\n\nwhere $b$ is a threshold, the $\\xi_i$'s are slack variables necessary for the case when the training instances are not linearly separable, and $C$ is the error penalty. Eq. (1) can be extended to nonlinear classifiers using kernels [13].\n\nFigure 1: The margin is incorrectly scaled when a sample that has missing features is treated as if the missing features have a value of zero. In this example, the margin of a sample that only has one feature (the x dimension) is measured both in the higher dimensional space ($\\rho_2$) and the lower one ($\\rho_1$). If all features are assumed to exist, and we give the missing feature (along the y axis) a value of zero, the margin $\\rho_2$ measured in the higher dimensional space is shorter than the margin $\\rho_1$ measured in the relevant subspace.\n\nConsider now learning such a classifier in the presence of missing data. At first glance, it may appear that since the $x$'s only affect the optimization through inner products with $w$, missing features can merely be skipped (or equivalently, replaced with zeros), thereby preserving the values of the inner product. However, this does not properly normalize the different entries in $w$, and damages classification accuracy. The reason is illustrated in Fig. 1, where a single sample in $\\mathbb{R}^2$ has one valid and one missing feature. Due to the missing feature, measuring the margin in the full space, $\\rho_2$, underestimates the correct geometric margin of the sample in the valid space, $\\rho_1$. 
This is different from the case where the feature exists but is unknown, in which the sample's margin could be either over- or under-estimated. In the next sections, we explore how Eq. (1) can be solved while properly taking this normalization into account. We start by reminding the reader about the geometric interpretation of SVM.\n\n3 Geometric interpretation\n\nThe derivation of the SVM classifier [14] is motivated by the goal of finding a hyperplane that maximally separates the positive examples from the negative, as measured by the geometric margin $\\rho(w) = \\min_i \\frac{y_i w x_i}{\\|w\\|}$. The task of maximizing the margin $\\rho(w)$,\n\n$$\\max_w \\rho(w) = \\max_w \\min_i \\frac{y_i w x_i}{\\|w\\|} \\qquad (2)$$\n\nis transformed into the quadratic programming problem of Eq. (1) in two steps. First, $\\|w\\|$ is taken out of the minimization, yielding $\\max_w \\frac{1}{\\|w\\|} (\\min_i y_i w x_i)$. Then, the following invariance is used: for every solution, there exists a solution that achieves the same target function value, but with a margin that equals 1. This allows us to write the SVM problem as a constrained optimization problem: $\\max_w \\|w\\|^{-1}$ s.t. $y_i (w x_i) \\ge 1$. This is equivalent to minimizing $\\|w\\|^2$ with the same constraints, which equals the SVM problem of Eq. (1). In the case of missing features, this derivation no longer optimizes the correct geometrical margin (Fig. 1). To address this problem, we treat the margin of each instance in its own subspace, by defining the instance margin for the $i$th instance as $\\rho_i(w) = \\frac{y_i w^{(i)} x_i}{\\|w^{(i)}\\|}$, where $\\|w^{(i)}\\| = \\sqrt{\\sum_{k: f_k \\in F(x_i)} w_k^2}$ is the norm of $w$ restricted to the valid features of $x_i$. The geometric margin is, as before, the minimum over all instance margins, yielding a new optimization problem\n\n$$\\max_w \\min_i \\frac{y_i w^{(i)} x_i}{\\|w^{(i)}\\|}. \\qquad (3)$$\n\nUnfortunately, since different margin terms are normalized by different norms $\\|w^{(i)}\\|$, we can no longer take the denominator out of the minimization as above. In addition, each of the terms $y_i w^{(i)} x_i / \\|w^{(i)}\\|$ is non-convex in $w$, which makes the problem difficult to solve directly in an efficient way. 
We now discuss two approaches for solving this problem. In the linearly separable case, the optimization problem of Eq. (3) is equivalent to\n\n$$\\max_{w,\\gamma} \\gamma \\quad \\text{s.t.} \\quad \\frac{y_i w^{(i)} x_i}{\\|w^{(i)}\\|} \\ge \\gamma, \\quad i = 1 \\ldots n. \\qquad (4)$$\n\nThis is a convex feasibility problem for any fixed value of $\\gamma$, which is a real scalar that corresponds to the margin. It can be solved efficiently using a bisection search over $\\gamma \\in \\mathbb{R}^+$, where in each iteration we solve a convex second order cone program (SOCP) [11]. Unfortunately, extending this formulation to the non-separable case while preserving the geometric margin interpretation makes the problem non-convex (this is discussed elsewhere). A second approach for solving Eq. (3) is to treat each instance margin individually. We represent each of the norms $\\|w^{(i)}\\|$ as a scaling of the full norm by defining scaling coefficients $s_i = \\|w^{(i)}\\| / \\|w\\|$, and rewriting Eq. (3) as\n\n$$\\max_w \\min_i \\frac{y_i w x_i}{\\|w^{(i)}\\|} = \\max_w \\min_i \\frac{1}{s_i} \\frac{y_i w x_i}{\\|w\\|}, \\qquad s_i = \\frac{\\|w^{(i)}\\|}{\\|w\\|}. \\qquad (5)$$\n\nThe $s_i$ factors are scalars, and had we known them, we could have solved a standard SVM problem. Unfortunately they depend on $w^{(i)}$ and are unknown. This formalism allows us to use again the invariance to the rescaling of $w$ and rewrite Eq. (5) as a constrained optimization problem over $s_i$ and $w$. In the non-separable case, Eq. (5) becomes\n\n$$\\min_{w,b,\\xi,s} \\frac{1}{2}\\|w\\|^2 + C \\sum_i \\xi_i \\quad \\text{s.t.} \\quad \\frac{1}{s_i} \\left( y_i (w x_i + b) \\right) \\ge 1 - \\xi_i, \\quad i = 1 \\ldots n; \\qquad s_i = \\|w^{(i)}\\| / \\|w\\|, \\quad i = 1 \\ldots n. \\qquad (6)$$\n\nThis constrained optimization problem is no longer a QP. In fact, due to the normalization constraint it is not even convex in $w$. One solution is a projected gradient approach, in which one iterates between steps in the direction of the gradient of the Lagrangian and projections to the constrained space, by calculating $s_i = \\|w^{(i)}\\| / \\|w\\|$. For the right choices of step sizes, such approaches are guaranteed to converge to local minima [2]. 
We instead use a faster iterative algorithm based on the fact that the problem is a QP for any given set of $s_i$'s, and iterate between (1) solving a QP for $w$ given $s_i$, and (2) using the resulting $w$ to calculate new $s_i$'s. This algorithm differs from a projected gradient approach in that rather than taking a series of small gradient steps, it takes bigger leaps, and projects back to the constrained space after each step. Since the convergence of this iterative algorithm is not guaranteed, we used cross validation to choose an early stopping point and found that the best solutions were obtained within 2-5 steps. Typically, the objective improved on the first 1-3 iterations, but then, in about 75% of the cases, the objective oscillated. In the remaining cases the algorithm converged to a fixed point. It is easy to see that a fixed point of this iterative procedure achieves an optimal solution for Eq. (6), since it achieves a minimal $\\|w\\|$ while obeying the $s_i$ constraints. As a result, when this algorithm converges, the solution is also guaranteed to be a locally optimal solution of the original problem Eq. (3). The power of the SVM approach can be largely attributed to the flexibility and efficiency of nonlinear classification allowed through the use of kernels. The dual of the above QP can be kernelized as in a standard SVM, yielding\n\n$$\\max_{\\alpha \\in \\mathbb{R}^n} \\sum_{i=1}^{n} \\alpha_i - \\frac{1}{2} \\sum_{i,j=1}^{n} \\alpha_i \\frac{y_i}{s_i} K(x_i, x_j) \\frac{y_j}{s_j} \\alpha_j \\quad \\text{s.t.} \\quad 0 \\le \\alpha_i \\le C; \\quad \\sum_{i=1}^{n} \\alpha_i y_i = 0 \\qquad (7)$$\n\nwhere $K(x_i, x_j)$ is the kernel function that simulates an inner product in the higher dimensional feature space. Classification of new samples is obtained as in standard SVM by calculating the margin $\\rho(x_{new}) = \\sum_j y_j \\alpha_j \\frac{1}{s_j} K(x_j, x_{new}) \\frac{1}{s_{new}}$.\n\nKernels in this formulation operate over vectors with missing features, hence we have to develop kernels that handle them correctly. Fortunately, many kernels only depend on their inputs through their inner product. In this case there is an easy procedure to construct a modified kernel that takes such missing values into account. For example, for a polynomial kernel $K(x_i, x_j) = (\\langle x_i, x_j \\rangle + 1)^d$, define $K'(x_i, x_j) = (\\langle x_i, x_j \\rangle_F + 1)^d$, with the inner product calculated over valid features, $\\langle x_i, x_j \\rangle_F = \\sum_{k: f_k \\in F(x_j) \\cap F(x_i)} x_{ik} x_{jk}$. This can be easily proved to be a kernel.\n\nFigure 2: Car classification results. (a) An easy instance where all local features are approximately in agreement. (b) A hard instance where local features are divided into two distinct groups. This instance was correctly classified by the `geometric margin' approach but misclassified by all other methods. (c) Classification error of the different methods (geom, zero, mean, flag-agg, kNN) for the task of object recognition in real images. Error bars are standard errors of the mean (SEM) over the five cross validation sets.\n\n4 Experiments\n\nWe evaluated our approaches in three different missingness scenarios. First, as a sanity check, we explored performance when features are missing at random, in a series of five standard UCI benchmarks, and also in a large digit recognition task using MNIST data. In this setting our methods performed as well as other approaches (or slightly better). The full details of these experiments are provided in a longer version of this work. Second, we study a visual object recognition application where some features are missing because they cannot be located in the image. Finally, we apply our methods to a problem of biological network completion, where the missingness pattern of features is determined by the known structure of the network. 
For all applications, we compare our iterative algorithm with five common approaches for completing missing features.\n1. Zero: Missing values were set to zero.\n2. Mean: Missing values were set to the average feature values.\n3. Aggregated Flags: Features were annotated with an explicit additional feature noting whether a feature is valid or missing. To reduce the number of added features, we added a single flag for each group of features that were valid or invalid together across all instances. For example, in the vision application, all features of a landmark candidate are grouped together since they are all invalid if the match is wrong (see below).\n4. kNN: Missing features were set to the mean value obtained from the K nearest neighbor instances; neighborhood was measured using a Euclidean distance in the subspace relevant to each pair of samples. The number of neighbors was varied as K = 3, 5, 10, 20, and the best result is the one reported.\n5. EM: Generative model in the spirit of [4]. A Gaussian mixture model (GMM) is learned by iterating between (1) learning a GMM of the filled data and (2) re-filling missing values using cluster means, weighted by the posterior probability that a cluster generated the sample. Covariances were assumed spherical. The number of clusters was varied as K = 3, 5, 10, 15, 20, and the best result is the one reported.\n6. Geometric margin: Our non-separable approach described in Sec. 3.\nIn all of the experiments, we used a 5-fold cross validation procedure and evaluated performance using a testing set that was not used during training. In addition, 20% of the training set was used for choosing optimization parameters, such as the kernel type, its parameters, and an early stopping point for the iterative algorithm.\n\n4.1 Visual object recognition\n\nWe now consider a visual object recognition task where instances have structurally missing features. 
In this task we attempt to determine if an object from a certain class (automobiles) is present in a given input image. The task of classifying images based on the object class that they contain has seen much work in recent years [1, 5], and discriminative approaches have typically produced very good results [5, 12]. Features in these methods are commonly constructed from regions of interest (patches) in the image. These patches typically cover \"landmarks\" of the object, like the trunk or a headlight for a car. A typical set of patches includes several candidates for any object part, and some images may have more candidates for a given part than others. For example, a trunk of a car may not be found in a picture of a hatch-back car, hence all its corresponding features are considered to be structurally missing from that image. Our object model contains a set of \"landmarks\", for which we find several matches in a given image (details are omitted due to lack of space). Fig. 2 shows examples of matches for the front windshield landmark. Because of the noisy matches, the highest scoring match often does not match the true landmark, and the number of high-quality matches (features) varies in practice. It is in precisely such a scenario that we expect our proposed algorithm to be effective. In some cases, landmark models could provide confidence levels for each match. These could in principle be used as additional features to help the classifiers give more weight to better matches, and are expected to improve classification when the confidence measure is reliable. While this is a potentially useful approach for the current application, this paper takes a different approach: it does not use any soft confidence values but rather treats the low-confidence matches as wrong, removing them from the data. 
Concretely, we located up to 10 candidate patches ($21 \\times 21$ pixels) that were promising (likelihood above a given threshold) for each of the 19 landmarks in the car model. For each candidate, we compute the first 10 principal component coefficients of the image patch and concatenate these coefficients to form the image feature vector. If the number of patches for a given landmark is less than ten, we consider the rest to be structurally absent. We evaluated performance for this task using two levels of a 5-fold cross validation procedure as explained above. We compared several kernels and report results using the kernel that fared best on the validation set, which was usually a second order polynomial kernel. Fig. 2c compares the accuracy of the different methods. We found the geometric approach to be significantly superior to all other methods. To further evaluate our method, we qualitatively examined the classification results for several images across the various methods. Fig. 2a shows the top 10 matches for the front windshield landmark for a representative \"easy\" test instance where all local features are approximately in agreement. This instance was correctly classified by all methods. In contrast, Fig. 2b shows a representative \"hard\" test instance where local features cluster into two different groups. In this case, the cluster of bad matches was automatically excluded, yielding missing features, and our geometric approach was the only method able to classify the instance correctly.\n\n4.2 Metabolic pathway reconstruction\n\nAs a final application, we consider the problem of predicting missing enzymes in metabolic pathways, a long-standing and important challenge in computational biology [15, 9]. Instances in this task have missing features due to the structure of the biochemical network. Cells use a complex network of chemical reactions to produce their chemical building blocks (Fig. 3). 
Each reaction transforms a set of molecular compounds (called substrates) into another set of molecules (products), and requires the presence of an enzyme to catalyze the reaction. It is often unknown which enzyme catalyzes a given reaction, and it is desirable to predict the identity of such missing enzymes computationally. Our approach for predicting missing enzymes is based on the observation that enzymes in local network neighborhoods usually participate in related functions. As a result, neighboring enzyme pairs have non-trivial correlations over their features that reflect their functional relations. Importantly, different types of neighborhood relations between enzyme pairs lead to different relations of their properties. For example, an enzyme in a linear chain depends on the preceding enzyme product as its substrate. Hence it is expected that the corresponding genes are co-expressed [9, 15]. On the other hand, enzymes in forking motifs (same substrate, different products) often have anti-correlated expression profiles [7]. To preserve the distinction between different neighbor relations, we defined a set of network motifs, including forks, funnels and linear chains. Each enzyme is represented as a vector of features that measure its relatedness to each of its neighbors. A feature vector has structurally missing entries if the enzyme does not have all types of neighbors. For example, the enzyme PHA2 in Fig. 3 does not have a neighbor of type fork, and therefore all features assigned to such a neighbor are absent in the representation of the reaction \"Prephenate $\\rightarrow$ Phenylpyruvate\".\n\n[Figure 3 plots: classification error per method (geom, zero, mean, flag-agg, kNN); ROC axes: True positives vs. 1 - False positives.]\n\nFigure 3: Left: A fragment of the full metabolic pathway network in S. cerevisiae. Chemical reactions (arrows) transform a set of molecular compounds into other compounds. 
Small molecules like $CO_2$ were omitted from this drawing for clarity. Reactions are catalyzed by enzymes (boxed names, e.g., ARO7), but in some cases these enzymes are unknown. The network imposes various neighborhood relations between enzymes assigned to reactions, like linear chains (ARO7, PHA2), forks (TRP2, ARO7) and funnels (ARO9, PHA2). Top Right: Classification accuracy for compared methods. The classification task is to identify if a candidate enzyme is in the right \"neighborhood\". Error bars are SEMs over 5 cross validation sets. Bottom right: ROC curves for the same task.\n\nWe used the metabolic network of S. cerevisiae, as reconstructed by Palsson and colleagues [3], after removing 14 metabolic currencies and reactions with unknown enzymes, leaving 1265 directed reactions. We used three data types: (1) A compendium of 645 gene expression experiments; each experimental condition $k$ contributed one feature, the point-wise Pearson correlation $\\frac{x_i(k) x_j(k)}{\\|x_i\\| \\|x_j\\|}$, where $x_i$ is the vector of expression levels across conditions. (2) The protein-domain content of each enzyme as found by the Prosite database. Each domain $k$ contributed one feature, the point-wise symmetric $D_{KL}$ measure $x_i(k) \\log\\left(\\frac{x_i(k)}{(x_j(k) + x_i(k))/2}\\right) + x_j(k) \\log\\left(\\frac{x_j(k)}{(x_j(k) + x_i(k))/2}\\right)$. (3) The cellular localization of the protein [6]; each cellular localization contributed one feature, the point-wise Hamming distance. In total, the feature vector length was about 3900. Pathway reconstruction requires that we rank candidate enzymes by their potential to match a reaction. As a first step towards this goal, we train a binary classifier to predict if an enzyme fits its neighborhood. We created a set of positive examples from the reactions with known enzymes (about 520 reactions), and also created negative examples by plugging impostor genes into `wrong' neighborhoods. We trained an SVM classifier using a 5-fold cross validation procedure as described above. 
Figure 3 shows the classification error of the different methods in the gene filling task. The geometric margin approach achieves significantly better performance in this task. kNN achieved very poor performance compared to all other methods. One reason could be that the Euclidean distance is inappropriate for the current task and that a more elaborate distance measure needs to be developed for this type of data. Learning metrics is a complicated task in general, and more so in the current problem since the feature vectors contain entries of several different types, making it unlikely that a naive distance measure would work well. Finally, the resulting classifier is used for predicting missing enzymes, by ranking all candidate enzymes according to their match to a given neighborhood. Evaluating the quality of ranking on known enzymes (cross validation) shows that it significantly outperforms previous approaches [9] (not shown here due to space limitations). We attribute this to the ability of the current approach to preserve different types of network-neighbors as separate features in spite of creating missing values.\n\n5 Discussion\n\nWe presented a novel method for max-margin training of classifiers in the presence of missing features, where the pattern of missing features is an inherent part of the domain. Instead of completing missing features as a preprocessing phase, we developed a max-margin learning objective based on a geometric interpretation of the margin when different instances essentially lie in different spaces. Using two challenging real-life problems we showed that our method is significantly superior when the pattern of missing features has structure. The standard treatment of missing features is based on the concept that missing features exist but are unobserved. This assumption underlies the approach of completing features before the data is used in classification. 
This paper focuses on a different scenario, in which features are inherently absent. In such cases, it is not clear why we should guess hypothetical values for undefined features, since the completed values are filled based on other observed values, and do not add information to our classifiers. In fact, by completing features that are not supposed to be part of an instance, we may be confusing the learning algorithm by presenting it with a problem that may be harder than the one we actually need to solve. Interestingly, the problem of classifying with missing features is related to another problem, where individual reliability measures are available for features at each instance separately. This is a common case in the analysis of scientific measurements, where the reliability of each experiment could be provided separately. For example, DNA micro-array experiments have inherent measures of experimental noise levels, and biological variability is often estimated using replicates. This problem can be viewed in the same framework described in this paper: the geometric margin must be defined separately for each instance since the different noise levels distort the relative scale of each coordinate of each instance separately. Relative to this setting, the completely missing and fully valid features discussed in this paper are extreme points on the spectrum of reliability. It will be interesting to see which aspects of the geometric formulation discussed in this paper can be extended to this new problem.\nAcknowledgement: This paper was supported by an NSF grant DBI-0345474.\n\nReferences\n[1] A. Berg, T. Berg, and J. Malik. Shape matching and object recognition using low distortion correspondence. In CVPR, 2005.\n[2] Paul H. Calamai and Jorge J. Moré. Projected gradient methods for linearly constrained problems. Math. Program., 39(1):93-116, 1987.\n[3] J. Forster, I. Famili, P. Fu, B.Ø. Palsson, and J. Nielsen. Genome-scale reconstruction of the Saccharomyces cerevisiae metabolic network. Genome Research, 13(2):244-253, February 2003.\n[4] Z. Ghahramani and M.I. Jordan. Supervised learning from incomplete data via an EM approach. In J.D. Cowan, G. Tesauro, and J. Alspector, editors, NIPS, volume 6, pages 120-127, 1994.\n[5] K. Grauman and T. Darrell. Pyramid match kernels: Discriminative classification with sets of image features. In ICCV, 2005.\n[6] W.K. Huh, J.V. Falvo, L.C. Gerke, A.S. Carroll, R.W. Howson, J.S. Weissman, and E.K. O'Shea. Global analysis of protein localization in budding yeast. Nature, 425:686-691, 2003.\n[7] J. Ihmels, R. Levy, and N. Barkai. Principles of transcriptional control in the metabolic network of Saccharomyces cerevisiae. Nature Biotechnology, 22:86-92, 2003.\n[8] D. Jensen and J. Neville. Linkage and autocorrelation cause feature selection bias in relational learning. In ICML, 2002.\n[9] P. Kharchenko, D. Vitkup, and G.M. Church. Filling gaps in a metabolic network using expression information. Bioinformatics, 20:I178-I185, 2003.\n[10] R.J.A. Little and D.B. Rubin. Statistical Analysis with Missing Data. Wiley, NY, 1987.\n[11] M.S. Lobo, L. Vandenberghe, S. Boyd, and H. Lebret. Applications of second-order cone programming. Linear Algebra and its Applications, 284:193-228, 1998.\n[12] A. Quattoni, M. Collins, and T. Darrell. Conditional random fields for object recognition. In L.K. Saul, Y. Weiss, and L. Bottou, editors, NIPS 17, pages 1097-1104, 2005.\n[13] B. Schölkopf and A.J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, 2002.\n[14] V.N. Vapnik. The nature of statistical learning theory. Springer-Verlag, 1995.\n[15] J.P. Vert and Y. Yamanishi. Supervised graph inference. In L.K. Saul, Y. Weiss, and L. Bottou, editors, NIPS 17, pages 1433-1440, 2004.\n", "award": [], "sourceid": 3024, "authors": [{"given_name": "Gal", "family_name": "Chechik", "institution": null}, {"given_name": "Geremy", "family_name": "Heitz", "institution": null}, {"given_name": "Gal", "family_name": "Elidan", "institution": null}, {"given_name": "Pieter", "family_name": "Abbeel", "institution": null}, {"given_name": "Daphne", "family_name": "Koller", "institution": null}]}