{"title": "CPM-Nets: Cross Partial Multi-View Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 559, "page_last": 569, "abstract": "Despite multi-view learning progressed fast in past decades, it is still challenging due to the difficulty in modeling complex correlation among different views, especially under the context of view missing. To address the challenge, we propose a novel framework termed Cross Partial Multi-View Networks (CPM-Nets). In this framework, we first give a formal definition of completeness and versatility for multi-view representation and then theoretically prove the versatility of the latent representation learned from our algorithm. To achieve the completeness, the task of learning latent multi-view representation is specifically translated to degradation process through mimicking data transmitting, such that the optimal tradeoff between consistence and complementarity across different views could be achieved. In contrast with methods that either complete missing views or group samples according to view-missing patterns, our model fully exploits all samples and all views to produce structured representation for interpretability. Extensive experimental results validate the effectiveness of our algorithm over existing state-of-the-arts.", "full_text": "CPM-Nets: Cross Partial Multi-View Networks\n\nChangqing Zhang1,2, Zongbo Han1, Yajie Cui1, Huazhu Fu3, Joey Tianyi Zhou4\u2217, Qinghua Hu1,2\n\n1College of Intelligence and Computing, Tianjin University, Tianjin, China\n\n2Tianjin Key Lab of Machine Learning, Tianjin, China\n\n3Inception Institute of Arti\ufb01cial Intelligence, Abu Dhabi, UAE\n4Institute of High Performance Computing, A*STAR, Singapore\n\nAbstract\n\nDespite multi-view learning progressed fast in past decades, it is still challenging\ndue to the dif\ufb01culty in modeling complex correlation among different views, espe-\ncially under the context of view missing. 
To address this challenge, we propose a novel framework termed Cross Partial Multi-View Networks (CPM-Nets). In this framework, we first give formal definitions of completeness and versatility for multi-view representation and then theoretically prove the versatility of the latent representation learned by our algorithm. To achieve completeness, the task of learning the latent multi-view representation is cast as a degradation process that mimics data transmission, so that an optimal tradeoff between consistency and complementarity across different views can be achieved. In contrast with methods that either complete missing views or group samples according to view-missing patterns, our model fully exploits all samples and all views to produce a structured representation for interpretability. Extensive experimental results validate the effectiveness of our algorithm over existing state-of-the-art methods.

1 Introduction

In real-world applications, data are usually represented in different views, including multiple modalities or multiple types of features. Many existing methods [1, 2, 3] empirically demonstrate that different views can complement each other, leading to ultimate performance improvements. Unfortunately, the unknown and complex correlations among different views often disrupt the integration of different modalities in the model. Moreover, data with missing views further aggravate the modeling difficulty. Conventional multi-view learning usually assumes that every sample is observed in a unified set of views, i.e., all views are available for each sample. However, in practical applications, multi-view data are often incomplete [4, 5, 6, 7, 8].
For example, in medical data, different types of examinations are usually conducted for different subjects, and in web analysis, some web pages may contain text, images and videos, while others may only contain one or two of these types, which produces data with missing views. The view-missing patterns (i.e., combinations of available views) become even more complex for data with more views.

Projecting different views into a common space (e.g., CCA: Canonical Correlation Analysis and its variants [9, 10, 11]) is impeded by the view-missing issue. Several methods have been proposed to continue exploiting the correlation of different views. One straightforward way is to complete the missing views, after which off-the-shelf multi-view learning algorithms can be adopted. However, the missing views are basically blockwise, and thus, as has been widely recognized [5, 14], low-rank based completion [12, 13] is not applicable. Missing-modality imputation methods [15, 5] usually require samples with two paired modalities to train networks that predict the missing modality from the observed one. To explore the complementarity among multiple views, another natural way is manually grouping samples according to the availability of data sources [16], and subsequently learning multiple models on these groups for late fusion.

*Corresponding author: J. T. Zhou.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Illustration of Cross Partial Multi-View Networks. Given multi-view data with missing views (black blocks), the encoding networks degrade the complete latent representation into the available views (white blocks). Learning a multi-view representation according to the distributions of observations and classes promises to encode complementary information, as well as to provide accurate prediction.
Although it is more effective than learning on each single view, the grouping strategy is not flexible, especially for data with a large number of views. Accordingly, a challenging problem arises: how to effectively and flexibly exploit samples with arbitrary view-missing patterns?

Our methodology is expected to provide the following merits: complete and structured representation, comprehensively encoding information from different views into a clustering-structured representation; and flexible integration, handling arbitrary view-missing patterns. To this end, we propose a novel algorithm, i.e., Cross Partial Multi-View Networks (CPM-Nets) for classification, as shown in Fig. 1. Benefiting from the common latent representation learned by the encoding networks, all samples and all views can be jointly exploited regardless of view-missing patterns. For the multi-view representation, CPM-Nets jointly considers multi-view complementarity and class distribution, making them mutually improve each other to obtain a representation reflecting the underlying patterns. Specifically, the latent representation encoded from observations is complete and versatile and thus promotes prediction performance, while the clustering-like classification scheme in turn enhances the separability of the latent representation. Theoretical analysis and empirical results validate the effectiveness of the proposed CPM-Nets in exploiting partial multi-view data.

1.1 Related Work

Multi-View Learning (MVL) aims to jointly utilize information from different views. Multi-view clustering algorithms [17, 18, 19, 20, 21] usually search for consistent clustering hypotheses across different views; representative methods include co-regularization based [17], co-training based [18] and high-order multi-view clustering [19].
Under the metric learning framework, multi-view classification methods [22, 23] jointly learn multiple metrics for multiple views. The representative multi-view representation learning methods are CCA based, including kernelized CCA [10], deep neural network based CCA [11, 24], and semi-paired and semi-supervised generalized correlation analysis (S2GCA) [25]. Cross-View Learning (CVL) basically searches for mappings between two views, and has been widely applied in real applications [26, 27, 28, 29, 30]. With adversarial training, the embedding spaces of two individual views can be learned and aligned simultaneously [27]. Cross-modal convolutional neural networks can be regularized to obtain a shared representation that is agnostic of the modality for cross-modal scene images [28]. Cross-view learning can also be utilized for missing-view imputation [31, 14]. For Partial Multi-View Learning (PMVL), existing strategies usually transform the incomplete case into a complete multi-view learning task. Imputation methods [5, 31] complete the missing views by leveraging the strength of deep neural networks. The grouping strategy [16] divides all samples according to the availability of data sources, and then multiple classifiers are learned for late fusion. Although effective, this strategy does not scale well to data with a large number of views or to the small-sample-size case. Though the KCCA based algorithm [8] can model incomplete data, it needs one complete (primary) view.

2 Cross Partial Multi-View Networks

Recently, there has been increasing interest in learning on data with multiple views, including multi-view learning and cross-view learning.
Differently, we focus on classification based on data with missing views, which is termed Partial Multi-View Classification (see Definition 2.1), where samples with different view-missing patterns are involved. The proposed cross partial multi-view networks enable comparability for samples with different combinations of views rather than for samples in two different views, which generalizes the concept of cross-view learning. There are three main challenges for partial multi-view classification: (1) how to project samples with arbitrary view-missing patterns (flexibility) into a common latent space (completeness) for comparability (Section 2.1)? (2) how to make the learned representation reflect the class distribution (structured representation) for separability (Section 2.2)? (3) how to reduce the gap between the representations obtained in the test stage and the training stage for consistency (Section 2.3)? For clarification, we first give the formal definition of partial multi-view classification as follows:

Definition 2.1 (Partial Multi-View Classification (PMVC)) Given the training set $\{\mathcal{S}_n, y_n\}_{n=1}^N$, where $\mathcal{S}_n$ is a subset of the complete observations $\mathcal{X}_n = \{x_n^{(v)}\}_{v=1}^V$ (i.e., $\mathcal{S} \subseteq \mathcal{X}$) and $y_n$ is the class label, with $N$ and $V$ being the number of samples and views, respectively, PMVC trains a classifier using training data containing view-missing samples, to classify a new instance $\mathcal{S}$ with an arbitrary view-missing pattern.

2.1 Multi-View Complete Representation

Considering the first challenge, we aim to design a flexible algorithm to project samples with arbitrary view-missing patterns into a common space, where the desired latent representation should encode the information from the observed views.
Inspired by the reconstruction point of view [32], we provide the definition of completeness for multi-view representation as follows:

Definition 2.2 (Completeness for Multi-View Representation) A multi-view representation $h$ is complete if each observation, i.e., $x^{(v)}$ from $\{x^{(1)}, ..., x^{(V)}\}$, can be reconstructed from a mapping $f_v(\cdot)$, i.e., $x^{(v)} = f_v(h)$.

Intuitively, we can reconstruct each view from a complete representation in a numerically stable way. Furthermore, we show that completeness is achieved under the assumption [33] that each view is conditionally independent given the shared multi-view representation. Similar to each view from $\mathcal{X}$, the class label $y$ can also be considered as one (semantic) view, then we have

$$p(y, \mathcal{S} \mid h) = p(y \mid h)\, p(\mathcal{S} \mid h), \qquad (1)$$

where $p(\mathcal{S} \mid h) = p(x^{(1)} \mid h)\, p(x^{(2)} \mid h) \cdots p(x^{(V)} \mid h)$. We can obtain the common representation by maximizing $p(y, \mathcal{S} \mid h)$.

Based on the different views in $\mathcal{S}$, we model the likelihood with respect to $h$ given the observations $\mathcal{S}$ as

$$p(\mathcal{S} \mid h) \propto e^{-\Delta(\mathcal{S}, f(h; \Theta_r))}, \qquad (2)$$

where $\Theta_r$ are parameters governing the reconstruction mapping $f(\cdot)$ from the common representation $h$ to the partial observations $\mathcal{S}$, with $\Delta(\mathcal{S}, f(h; \Theta_r))$ being the reconstruction loss. From the view of the class label, we model the likelihood with respect to $h$ given the class label $y$ as

$$p(y \mid h) \propto e^{-\Delta(y, g(h; \Theta_c))}, \qquad (3)$$

where $\Theta_c$ are parameters governing the classification function $g(\cdot)$ based on the common representation $h$, and $\Delta(y, g(h; \Theta_c))$ defines the classification loss. Accordingly, assuming the data are independent and identically distributed (IID), the log-likelihood function is induced as

$$\mathcal{L}(\{h_n\}_{n=1}^N, \Theta_r, \Theta_c) = \sum_{n=1}^N \ln p(y_n, \mathcal{S}_n \mid h_n) \propto -\Big( \sum_{n=1}^N \Delta(\mathcal{S}_n, f(h_n; \Theta_r)) + \Delta(y_n, g(h_n; \Theta_c)) \Big), \qquad (4)$$

where $\mathcal{S}_n$ denotes the available views for the $n$th sample.
On one hand, we encode the information from the available views into a latent representation $h_n$ and denote the encoding loss as $\Delta(\mathcal{S}_n, f(h_n; \Theta_r))$. On the other hand, the learned representation should be consistent with the class distribution, which is implemented by minimizing the loss $\Delta(y_n, g(h_n; \Theta_c))$ to penalize disagreement with the class label.

Effectively encoding information from different views is the key requirement for multi-view representation, thus we seek a common representation which can recover the partial (available) observations. Accordingly, the following loss is induced:

$$\Delta(\mathcal{S}_n, f(h_n; \Theta_r)) = \ell_r(\mathcal{S}_n, h_n) = \sum_{v=1}^V s_{nv} \| f_v(h_n; \Theta_r^{(v)}) - x_n^{(v)} \|^2, \qquad (5)$$

where $\Delta(\mathcal{S}_n, f(h_n; \Theta_r))$ is specialized with the reconstruction loss $\ell_r(\mathcal{S}_n, h_n)$. $s_{nv}$ is an indicator of the availability of the $n$th sample in the $v$th view, i.e., $s_{nv} = 1$ and $0$ indicate available and unavailable views, respectively. $f_v(\cdot; \Theta_r^{(v)})$ is the reconstruction network for the $v$th view, parameterized by $\Theta_r^{(v)}$. In this way, $h_n$ encodes comprehensive information from the different available views, and different samples (regardless of their missing patterns) are associated with representations in a common space, making them comparable.

Ideally, minimizing Eq. (5) will induce a complete representation. Since the complete representation encodes information from different views, it should be versatile compared with each single view.
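As a concrete illustration, the masked reconstruction loss of Eq. (5) for one sample can be sketched as follows (a minimal NumPy sketch; the function name, the per-sample signature, and the use of plain callables for the decoders $f_v$ are our own illustrative choices, not the authors' implementation):

```python
import numpy as np

def reconstruction_loss(h, views, mask, decoders):
    """Masked reconstruction loss of Eq. (5) for a single sample.

    h        : (K,) latent representation h_n
    views    : list of V arrays x_n^{(v)} (entries with mask 0 are ignored)
    mask     : length-V 0/1 availability indicators s_{nv}
    decoders : list of V callables f_v(h) -> reconstruction of view v
    """
    loss = 0.0
    for x, s, f_v in zip(views, mask, decoders):
        if s:  # only available views contribute to the loss
            loss += float(np.sum((f_v(h) - x) ** 2))
    return loss
```

Because missing views are masked out rather than imputed, every sample contributes to learning $h_n$ regardless of its view-missing pattern.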
We give the definition of versatility for multi-view representation as follows:

Definition 2.3 (Versatility for Multi-View Representation) Given the observations $x^{(1)}, ..., x^{(V)}$ from $V$ views, the multi-view representation $h$ is of versatility if $\forall v$ and $\forall$ mapping $\varphi(\cdot)$ with $y^{(v)} = \varphi(x^{(v)})$, there exists a mapping $\psi(\cdot)$ satisfying $y^{(v)} = \psi(h)$, where $h$ is the corresponding multi-view representation for sample $\mathcal{S} = \{x^{(1)}, ..., x^{(V)}\}$.

Accordingly, we have the following theoretical result:

Proposition 2.1 (Versatility for the Multi-View Representation from Eq. (5)) There exists a solution (with respect to the latent representation $h$) to Eq. (5) which holds the versatility.

Proof 2.1 The proof of Proposition 2.1 is as follows. Ideally, according to Eq. (5), there exists $x^{(v)} = f_v(h; \Theta_r^{(v)})$, where $f_v(\cdot)$ is the mapping from $h$ to $x^{(v)}$. Hence, $\forall \varphi(\cdot)$ with $y^{(v)} = \varphi(x^{(v)})$, there exists a mapping $\psi(\cdot)$ satisfying $y^{(v)} = \psi(h)$ by defining $\psi(\cdot) = \varphi(f_v(\cdot))$. This proves the versatility of the latent representation $h$ based on the multi-view observations $\{x^{(1)}, ..., x^{(V)}\}$.

In the practical case, it is usually difficult to guarantee exact versatility for the latent representation, so the goal is to minimize the error $e_y = \sum_{v=1}^V \| \psi(h) - \varphi(x^{(v)}) \|^2$ (i.e., $\sum_{v=1}^V \| \varphi(f_v(h; \Theta^{(v)})) - \varphi(x^{(v)}) \|^2$), which is inversely proportional to the degree of versatility. Fortunately, it is easy to show that $K^2 e_r$, with $e_r = \sum_{v=1}^V \| f_v(h; \Theta_r^{(v)}) - x^{(v)} \|^2$ from Eq. (5), is an upper bound of $e_y$ if $\varphi(\cdot)$ is Lipschitz continuous with $K$ being the Lipschitz constant. $\square$

Although the proof is derived under the condition that all views are available, it is straightforward to generalize the result to the view-missing case.

2.2 Classification on Structured Latent Representation

Multiclass classification remains challenging due to possibly confusing classes [34]. For the second challenge, we aim to ensure that the learned representation is structured for separability via a clustering-like loss. Specifically, we should minimize the following classification loss

$$\Delta(y_n, y) = \Delta(y_n, g(h_n; \Theta_c)), \qquad (6)$$

where $g(h_n; \Theta_c) = \arg\max_{y \in \mathcal{Y}} \mathbb{E}_{h \sim T(y)} F(h, h_n)$ and $F(h, h_n) = \phi(h; \Theta_c)^T \phi(h_n; \Theta_c)$, with $\phi(\cdot; \Theta_c)$ being the feature mapping function for $h$, and $T(y)$ being the set of latent representations from class $y$. In our implementation, we set $\phi(h; \Theta_c) = h$ for simplicity and effectiveness.
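With $\phi(h) = h$, $F$ is bilinear, so $\mathbb{E}_{h \sim T(y)} F(h, h_n)$ reduces to an inner product with the empirical class mean. The resulting nonparametric classifier can be sketched as follows (a minimal sketch; the function names are ours, and $T(y)$ is approximated by the training representations of class $y$):

```python
import numpy as np

def class_means(H, y):
    """E_{h ~ T(y)}[h]: empirical mean latent vector of each class.

    H : (N, K) latent representations; y : (N,) integer class labels.
    """
    return {c: H[y == c].mean(axis=0) for c in np.unique(y)}

def predict(h, means):
    # g(h) = argmax_y E_{h' ~ T(y)} F(h', h); with phi(h) = h this is
    # argmax_y <mean_y, h>, i.e. the nearest class mean under inner product.
    return max(means, key=lambda c: float(means[c] @ h))
```

Since no classifier weights are learned, prediction is driven entirely by the structure of the latent space.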
By jointly considering classification and representation learning, the misclassification loss is specified as

$$\ell_c(y_n, y, h_n) = \max_{y \in \mathcal{Y}} \Big( 0,\; \Delta(y_n, y) + \mathbb{E}_{h \sim T(y)} F(h, h_n) - \mathbb{E}_{h \sim T(y_n)} F(h, h_n) \Big). \qquad (7)$$

Algorithm 1: Algorithm for CPM-Nets
/*Training*/
Input: Partial multi-view dataset: $D = \{\mathcal{S}_n, y_n\}_{n=1}^N$, hyperparameter $\lambda$.
Initialize: Initialize $\{h_n\}_{n=1}^N$ and $\{\Theta_r^{(v)}\}_{v=1}^V$ with random values.
while not converged do
    for v = 1 : V do
        Update the network parameters $\Theta_r^{(v)}$ with gradient descent:
        $\Theta_r^{(v)} \leftarrow \Theta_r^{(v)} - \alpha\, \partial \frac{1}{N} \sum_{n=1}^N \ell_r(\mathcal{S}_n, h_n; \Theta_r) / \partial \Theta_r^{(v)}$;
    end
    for n = 1 : N do
        Update the latent representation $h_n$ with gradient descent:
        $h_n \leftarrow h_n - \alpha\, \partial \frac{1}{N} \sum_{n=1}^N \big( \ell_r(\mathcal{S}_n, h_n; \Theta_r) + \lambda \ell_c(y_n, y, h_n) \big) / \partial h_n$;
    end
end
Output: network parameters $\{\Theta_r^{(v)}\}_{v=1}^V$ and latent representations $\{h_n\}_{n=1}^N$.
/*Test*/
Train the retuned networks ($\{\Theta_{rt}^{(v)}\}_{v=1}^V$) for test;
Calculate the latent representation with the retuned networks for the test instance;
Classify the test instance with $y = \arg\max_{y \in \mathcal{Y}} \mathbb{E}_{h \sim T(y)} F(h, h_{test})$.

Compared with the widely used parametric classification equipped with a cross-entropy loss, the clustering-like loss not only penalizes misclassification but also ensures a structured representation. Specifically, for a correctly classified sample, i.e., $y = y_n$, there is no loss.
For an incorrectly classified sample, i.e., $y \neq y_n$, it enforces the similarity between $h_n$ and the center of class $y_n$ to be larger than that between $h_n$ and the center of class $y$ (the wrong label) by a margin $\Delta(y_n, y)$. Hence, the proposed nonparametric loss naturally leads to a representation with clustering structure. Based on the above considerations, the overall objective function is induced as

$$\min_{\{h_n\}_{n=1}^N, \Theta_r} \frac{1}{N} \sum_{n=1}^N \ell_r(\mathcal{S}_n, h_n; \Theta_r) + \lambda \ell_c(y_n, y, h_n), \qquad (8)$$

where $\lambda > 0$ balances the belief degrees of the information from multiple views and from the class labels.

2.3 Test: Towards Consistency with Training Stage

The last challenge lies in the gap between the training and test stages in representation learning. To classify a test sample with incomplete views $\mathcal{S}$, we need to obtain its common representation $h$. A straightforward way is to optimize the objective $\min_h \ell_r(\mathcal{S}, f(h; \Theta_r))$ to encode the information from $\mathcal{S}$ into $h$. This raises a new issue: how to ensure that the representations obtained in the test stage are consistent with those of the training stage? The gap originates from the difference between the objectives of the training and test stages. Specifically, in test, we can obtain the unified representation with $h = \arg\min_h \ell_r(\mathcal{S}, f(h; \Theta_r))$ and then conduct classification with $y = \arg\max_{y \in \mathcal{Y}} \mathbb{E}_{h_n \sim T(y)} F(h, h_n)$. However, this differs from representation learning in the training stage, which simultaneously considers the reconstruction and classification errors. To address this issue, we introduce a fine-tuning strategy based on $\{\mathcal{S}_n, h_n\}_{n=1}^N$ obtained after training to update the networks $\{f_v(h; \Theta_r^{(v)})\}_{v=1}^V$ for a consistent mapping from observations to latent representation. Accordingly, in the test stage we obtain the retuned encoding networks $\{f'_v(h; \Theta_{rt}^{(v)})\}_{v=1}^V$ by fine-tuning the networks $\{f_v(h; \Theta_r^{(v)})\}_{v=1}^V$.
Subsequently, we solve the objective $\min_h \ell_r(\mathcal{S}, f'(h; \Theta_{rt}))$ to obtain a latent representation consistent with that in training. The optimization of the proposed CPM-Nets and the test procedure are summarized in Algorithm 1.

2.4 Discussion on key components

CPM-Nets are composed of two key components, i.e., the encoding networks and the clustering-like classification, which differ from conventional designs, so detailed explanations are provided.

Encoding schema. To encode the information from multiple views into a common representation, there is an alternative route, i.e., $\ell_r(\mathcal{S}_n, h_n) = \sum_{v=1}^V s_{nv} \| f(x_n^{(v)}; \Theta^{(v)}) - h_n \|^2$. This differs from the schema used in our model shown in Eq. (5), i.e., $\ell_r(\mathcal{S}_n, h_n) = \sum_{v=1}^V s_{nv} \| f(h_n; \Theta^{(v)}) - x_n^{(v)} \|^2$. The underlying assumption in our model is that the information in the different views originates from a latent representation $h$, and hence $h$ can be mapped to each individual view. The alternative instead assumes that the latent representation can be obtained by mapping each single view, which is basically not the case in real applications. For the alternative, ideally, minimizing the loss will enforce the representations of different views to be the same, which is not reasonable, especially for highly independent views. From the view of information theory, the encoding network for the $v$th view can be considered as a communication channel with a fixed property, i.e., $p(x^{(v)} \mid h)$ for our model and $p(h \mid x^{(v)})$ for the alternative, where the degradation process can be mimicked as data transmission.
Therefore, it is more reasonable to send comprehensive information and receive partial information, i.e., $p(x^{(v)} \mid h)$, compared with its counterpart of sending partial data and receiving comprehensive data, i.e., $p(h \mid x^{(v)})$. The theoretical results in Subsection 2.1 also support the above analysis.

Classification model. For classification, the widely used strategy is to learn a classification function based on $h$, i.e., $y = f(h; \Theta)$ parameterized with $\Theta$. Compared with this manner, the reasons for using the clustering-like classifier in our model are as follows. First, jointly learning the latent representation and a parameterized classifier is likely an under-constrained problem, which may find representations that fit the training data well but do not reflect the underlying patterns, so the generalization ability may be affected [35]. Second, the clustering-like classification produces compactness within the same class and separability between different classes for the learned representation, making the classifier interpretable. Third, the nonparametric way reduces the load of parameter tuning and reflects a simpler inductive bias, which is especially beneficial in the small-sample-size regime [36].

3 Experiments

3.1 Experiment Setting

We conduct experiments on the following datasets: ♦ ORL² The dataset contains 10 facial images for each of 40 subjects. ♦ PIE³ A subset containing 680 facial images of 68 subjects is used. ♦ YaleB Similar to previous work [37], we use a subset which contains 650 images of 10 subjects. For ORL, PIE and YaleB, three types of features are extracted: intensity, LBP and Gabor.
♦ CUB [38] The dataset contains different categories of birds; the first 10 categories are used, with deep visual features from GoogLeNet and text features from doc2vec [39] as two views. ♦ Handwritten⁴ The dataset contains 10 categories, digits '0' to '9', with 200 images per category and 6 types of image features. ♦ Animal The dataset consists of 10158 images from 50 classes with two types of deep features extracted with DECAF [40] and VGG19 [41].

We compare the proposed CPM-Nets with the following methods: (1) FeatConcate simply concatenates multiple types of features from different views. (2) CCA [9] maps multiple types of features into one common space, and subsequently concatenates the low-dimensional features of different views. (3) DCCA (Deep Canonical Correlation Analysis) [11] learns low-dimensional features with neural networks and concatenates them. (4) DCCAE (Deep Canonically Correlated AutoEncoders) [24] employs autoencoders for common representations, and then combines these projected low-dimensional features together. (5) KCCA (Kernelized CCA) [10] employs feature mappings induced by positive-definite kernels. (6) MDcR (Multi-view Dimensionality co-Reduction) [42] applies kernel matching to regularize the dependence across multiple views and projects each view onto a low-dimensional space. (7) DMF-MVC (Deep Semi-NMF for Multi-View Clustering) [43] utilizes a deep structure through semi-nonnegative matrix factorization to seek a common feature representation. (8) ITML (Information-Theoretic Metric Learning) [44] characterizes the metric using a Mahalanobis distance function and solves the problem as a particular Bregman optimization. (9) LMNN (Large Margin Nearest Neighbors) [45] searches for a Mahalanobis distance metric to optimize the k-nearest-neighbour classifier.
For metric learning methods, the original features of multiple views are concatenated, and then the new representation is obtained with the projection induced by the learned metric matrix.

² https://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html
³ http://www.cs.cmu.edu/afs/cs/project/PIE/MultiPie/Multi-Pie/Home.html
⁴ https://archive.ics.uci.edu/ml/datasets/Multiple+Features

For all methods, we tune the parameters with 5-fold cross validation. For CCA-based methods, we select the two views giving the best performance. For our CPM-Nets, we set the dimensionality (K) of the latent representation from {64, 128, 256} and tune the parameter λ from the set {0.1, 1, 10} for all datasets. We run each method 10 times and report the mean values and standard deviations. Please refer to the supplementary material for the details of the network architectures and parameter settings.

Figure 2: Performance comparison under different missing rates (η); panels: (a) ORL, (b) PIE, (c) YaleB, (d) CUB, (e) Animal, (f) Handwritten.

3.2 Experimental Results

Firstly, we evaluate our algorithm by comparing it with state-of-the-art multi-view representation learning methods, investigating the performance with respect to varying missing rate. The missing rate is defined as $\eta = \frac{\sum_v M_v}{V \times N}$, where $M_v$ indicates the number of samples without the $v$th view. Since datasets may be associated with different numbers of views, samples are randomly selected as missing multi-view ones, and the missing views are randomly selected while guaranteeing that at least one view is available. As a result, partial multi-view data are obtained with diverse missing patterns. For the compared methods, the missing views are filled with mean values computed from the available samples within the same class. From the results in Fig.
2, we have the following observations: (1) without missing views, our algorithm achieves very competitive performance on all datasets, which validates the stability of our algorithm for complete multi-view data; (2) as the missing rate increases, the performance degradation of the compared methods is much larger than that of ours. Taking the results on ORL as an example, ours and LMNN obtain accuracies of 98.4% and 98.0%, respectively, while as the missing rate increases, the performance gap becomes much larger; (3) our model is rather robust to view-missing data, since our algorithm usually performs relatively well in heavily missing cases. For example, the performance decline (on ORL) is less than 5% when increasing the missing rate from η = 0.0 to η = 0.3.

Furthermore, we also fill the missing views with the recently proposed imputation method Cascaded Residual Autoencoder (CRA) [5]. Since CRA needs a subset of samples with complete views in training, we set 50% of the data as complete-view samples and the rest as samples with missing views (missing rate η = 0.5). The comparison results are shown in Fig. 3. It is observed that filling with CRA is generally better than using mean values, due to capturing the correlation of different views. Although the missing views are filled with CRA by using part of the samples with complete views, our proposed algorithm still demonstrates a clear superiority.
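For reference, the random view-missing protocol used in these experiments (missing rate $\eta = \sum_v M_v / (V \times N)$, with at least one view kept per sample) can be simulated as follows. This is a sketch under our own assumptions: the function names and the slot-by-slot sampling scheme are illustrative, not the authors' exact procedure.

```python
import numpy as np

def random_missing_mask(n, v, eta, rng=None):
    """Random availability mask s_{nv} with missing rate eta, keeping at
    least one available view per sample (illustrative sketch)."""
    assert 0 <= eta <= (v - 1) / v, "at most v-1 views can be missing per sample"
    rng = np.random.default_rng(rng)
    mask = np.ones((n, v), dtype=int)
    n_missing = int(round(eta * n * v))  # total number of view slots to remove
    while n_missing > 0:
        i, j = rng.integers(n), rng.integers(v)
        if mask[i, j] == 1 and mask[i].sum() > 1:  # never remove the last view
            mask[i, j] = 0
            n_missing -= 1
    return mask

def missing_rate(mask):
    # eta = (number of missing view slots) / (V * N)
    n, v = mask.shape
    return (mask == 0).sum() / (n * v)
```

The mask plays the role of the indicators $s_{nv}$ in Eq. (5), so the same code path serves both the complete and the heavily missing settings.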
The proposed CPM-Nets performs best on all six datasets.

Figure 3: Performance comparison with view completion using mean values and the cascaded residual autoencoder (CRA) [5] (with missing rate η = 0.5).

Figure 4: Visualization of representations with missing rate η = 0.5, where 'U' and 'S' indicate 'unsupervised' and 'supervised' representation learning; panels: (a) FeatCon (U), (b) DCCA (U), (c) Ours (U), (d) LMNN (S), (e) ITML (S), (f) Ours (S). (Zoom in for best view.)

We visualize the representations from different methods on Handwritten to investigate the improvement of CPM-Nets. As shown in Fig. 4, subfigures (a)-(c) obtain representations in an unsupervised manner. It is observed that the latent representation from our algorithm reveals the underlying class distribution much better. With the introduction of label information, the representations from CPM-Nets are further improved, where the clusters are more compact and the margins between different classes become clearer, which validates the effectiveness of the clustering-like loss.
It is noteworthy that we jointly exploit all samples and all views under random view-missing patterns in the experiments, demonstrating the flexibility in handling partial multi-view data, while Fig. 4 supports the claim of structured representation.

4 Conclusions

We proposed a novel algorithm for partial multi-view data classification named CPM-Nets, which can jointly exploit all samples and all views and is flexible for arbitrary view-missing patterns. Our algorithm focuses on learning a complete, and thus versatile, representation to handle the complex correlations among multiple views. The common representation also endows the flexibility for handling data with an arbitrary number of views and complex view-missing patterns, which differs from existing ad hoc methods. Equipped with a clustering-like classification loss, the learned representation is well structured, making the classifier interpretable. We empirically validated that the proposed algorithm is relatively robust to heavy and complex view-missing data.
Acknowledgments

This work was partly supported by the National Natural Science Foundation of China (61976151, 61602337, 61732011, 61702358). We also appreciate the discussions with Ganbin Zhou and the valuable comments from all the reviewers.

References

[1] Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. Multimodal machine learning: A survey and taxonomy. IEEE TPAMI, 41(2):423–443, 2019.

[2] Chang Xu, Dacheng Tao, and Chao Xu. A survey on multi-view learning. arXiv preprint arXiv:1304.5634, 2013.

[3] Paramveer Dhillon, Dean Foster, and Lyle Ungar. Multi-view learning of word embeddings via CCA. In NIPS, pages 199–207, 2011.

[4] Shao-Yuan Li, Yuan Jiang, and Zhi-Hua Zhou. Partial multi-view clustering. In AAAI, pages 1968–1974, 2014.

[5] Luan Tran, Xiaoming Liu, Jiayu Zhou, and Rong Jin. Missing modalities imputation via cascaded residual autoencoder. In CVPR, pages 1405–1414, 2017.

[6] Xinwang Liu, Xinzhong Zhu, Miaomiao Li, Lei Wang, Chang Tang, Jianping Yin, Dinggang Shen, Huaimin Wang, and Wen Gao. Late fusion incomplete multi-view clustering. IEEE TPAMI, 2018.

[7] Mingxia Liu, Jun Zhang, Pew-Thian Yap, and Dinggang Shen. Diagnosis of Alzheimer's disease using view-aligned hypergraph learning with incomplete multi-modality data. In MICCAI, pages 308–316, 2016.

[8] Anusua Trivedi, Piyush Rai, Hal Daumé III, and Scott L DuVall. Multiview clustering with incomplete views. In NIPS Workshop, volume 224, 2010.

[9] Harold Hotelling. Relations between two sets of variates. Biometrika, 28(3/4):321–377, 1936.

[10] Shotaro Akaho. A kernel method for canonical correlation analysis. arXiv preprint cs/0609071, 2006.

[11] Galen Andrew, Raman Arora, Jeff Bilmes, and Karen Livescu. Deep canonical correlation analysis. In ICML, pages 1247–1255, 2013.

[12] Jian-Feng Cai, Emmanuel J Candès, and Zuowei Shen.
A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.

[13] Rahul Mazumder, Trevor Hastie, and Robert Tibshirani. Spectral regularization algorithms for learning large incomplete matrices. JMLR, 11(Aug):2287–2322, 2010.

[14] Lei Cai, Zhengyang Wang, Hongyang Gao, Dinggang Shen, and Shuiwang Ji. Deep adversarial learning for multi-modality missing data completion. In KDD, pages 1158–1166, 2018.

[15] Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y Ng. Multimodal deep learning. In ICML, pages 689–696, 2011.

[16] Lei Yuan, Yalin Wang, Paul M Thompson, Vaibhav A Narayan, and Jieping Ye. Multi-source learning for joint analysis of incomplete multi-modality neuroimaging data. In KDD, pages 1149–1157, 2012.

[17] Abhishek Kumar, Piyush Rai, and Hal Daumé. Co-regularized multi-view spectral clustering. In NIPS, pages 1413–1421, 2011.

[18] Abhishek Kumar and Hal Daumé. A co-training approach for multi-view spectral clustering. In ICML, pages 393–400, 2011.

[19] Changqing Zhang, Huazhu Fu, Si Liu, Guangcan Liu, and Xiaochun Cao. Low-rank tensor constrained multiview subspace clustering. In ICCV, pages 1582–1590, 2015.

[20] Changqing Zhang, Qinghua Hu, Huazhu Fu, Pengfei Zhu, and Xiaochun Cao. Latent multi-view subspace clustering. In CVPR, pages 4279–4287, 2017.

[21] Zhiyong Yang, Qianqian Xu, Weigang Zhang, Xiaochun Cao, and Qingming Huang. Split multiplicative multi-view subspace clustering. IEEE Transactions on Image Processing, 2019.

[22] Haichao Zhang, Thomas S Huang, Nasser M Nasrabadi, and Yanning Zhang. Heterogeneous multi-metric learning for multi-sensor fusion. In 14th International Conference on Information Fusion, pages 1–8, 2011.

[23] Heng Zhang, Vishal M Patel, and Rama Chellappa.
Hierarchical multimodal metric learning for multimodal classification. In CVPR, pages 3057–3065, 2017.

[24] Weiran Wang, Raman Arora, Karen Livescu, and Jeff Bilmes. On deep multi-view representation learning. In ICML, pages 1083–1092, 2015.

[25] Xiaohong Chen, Songcan Chen, Hui Xue, and Xudong Zhou. A unified dimensionality reduction framework for semi-paired and semi-supervised multi-view data. Pattern Recognition, 45(5):2005–2018, 2012.

[26] Nikhil Rasiwasia, Jose Costa Pereira, Emanuele Coviello, Gabriel Doyle, Gert RG Lanckriet, Roger Levy, and Nuno Vasconcelos. A new approach to cross-modal multimedia retrieval. In ACM MM, pages 251–260, 2010.

[27] Yu-An Chung, Wei-Hung Weng, Schrasing Tong, and James Glass. Unsupervised cross-modal alignment of speech and text embedding spaces. In NIPS, pages 7365–7375, 2018.

[28] Lluis Castrejon, Yusuf Aytar, Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Learning aligned cross-modal representations from weakly aligned data. In CVPR, pages 2940–2949, 2016.

[29] Joey Tianyi Zhou, Ivor W Tsang, Sinno Jialin Pan, and Mingkui Tan. Multi-class heterogeneous domain adaptation. Journal of Machine Learning Research, 20(57):1–31, 2019.

[30] Joey Tianyi Zhou, Sinno Jialin Pan, and Ivor W Tsang. A deep learning framework for hybrid heterogeneous transfer learning. Artificial Intelligence, 2019.

[31] Chao Shang, Aaron Palmer, Jiangwen Sun, Ko-Shin Chen, Jin Lu, and Jinbo Bi. VIGAN: Missing view imputation with generative adversarial networks. In ICBD, pages 766–775, 2017.

[32] Tai Sing Lee. Image representation using 2D Gabor wavelets. IEEE TPAMI, 18(10):959–971, 1996.

[33] Martha White, Xinhua Zhang, Dale Schuurmans, and Yao-liang Yu. Convex multi-view subspace learning. In NIPS, pages 1673–1681, 2012.

[34] Weiwei Liu, Ivor W Tsang, and Klaus-Robert Müller.
An easy-to-hard learning paradigm for multiple classes and multiple labels. The Journal of Machine Learning Research, 18(1):3300–3337, 2017.

[35] Lei Le, Andrew Patterson, and Martha White. Supervised autoencoders: Improving generalization performance with unsupervised regularizers. In NIPS, pages 1–11, 2018.

[36] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In NIPS, pages 1–11, 2017.

[37] Athinodoros S Georghiades, Peter N Belhumeur, and David J Kriegman. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE TPAMI, (6):643–660, 2001.

[38] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD Birds-200-2011 dataset. 2011.

[39] Quoc Le and Tomas Mikolov. Distributed representations of sentences and documents. In ICML, pages 1188–1196, 2014.

[40] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.

[41] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.

[42] Changqing Zhang, Huazhu Fu, Qinghua Hu, Pengfei Zhu, and Xiaochun Cao. Flexible multi-view dimensionality co-reduction. IEEE TIP, 26(2):648–659, 2017.

[43] Handong Zhao, Zhengming Ding, and Yun Fu. Multi-view clustering via deep matrix factorization. In AAAI, pages 2921–2927, 2017.

[44] Jason V Davis, Brian Kulis, Prateek Jain, Suvrit Sra, and Inderjit S Dhillon. Information-theoretic metric learning. In ICML, pages 209–216, 2007.

[45] Kilian Q Weinberger and Lawrence K Saul. Distance metric learning for large margin nearest neighbor classification.
JMLR, 10(Feb):207–244, 2009.