{"title": "Reconstruct & Crush Network", "book": "Advances in Neural Information Processing Systems", "page_first": 4548, "page_last": 4556, "abstract": "This article introduces an energy-based model that is adversarial regarding data: it minimizes the energy for a given data distribution (the positive samples) while maximizing the energy for another given data distribution (the negative or unlabeled samples). The model is especially instantiated with autoencoders where the energy, represented by the reconstruction error, provides a general distance measure for unknown data. The resulting neural network thus learns to reconstruct data from the first distribution while crushing data from the second distribution. This solution can handle different problems such as Positive and Unlabeled (PU) learning or covariate shift, especially with imbalanced data. Using autoencoders allows handling a large variety of data, such as images, text or even dialogues. Our experiments show the flexibility of the proposed approach in dealing with different types of data in different settings: images with CIFAR-10 and CIFAR-100 (not-in-training setting), text with Amazon reviews (PU learning) and dialogues with Facebook bAbI (next response classification and dialogue completion).", "full_text": "Reconstruct & Crush Network\n\nErin\u00e7 Merdivan1,2, Mohammad Reza Loghmani3 and Matthieu Geist4\n\n1 AIT Austrian Institute of Technology GmbH, Vienna, Austria\n\n2 LORIA (Univ. Lorraine & CNRS), CentraleSup\u00e9lec, Univ. 
Paris-Saclay, 57070 Metz, France\n\n3 Vision4Robotics lab, ACIN, TU Wien, Vienna, Austria\n\n4 Universit\u00e9 de Lorraine & CNRS, LIEC, UMR 7360, Metz, F-57070 France\n\nerinc.merdivan@ait.ac.at, loghmani@acin.tuwien.ac.at\n\nmatthieu.geist@univ-lorraine.fr\n\nAbstract\n\nThis article introduces an energy-based model that is adversarial regarding data:\nit minimizes the energy for a given data distribution (the positive samples) while\nmaximizing the energy for another given data distribution (the negative or unlabeled\nsamples). The model is especially instantiated with autoencoders where the energy,\nrepresented by the reconstruction error, provides a general distance measure for\nunknown data. The resulting neural network thus learns to reconstruct data from the\n\ufb01rst distribution while crushing data from the second distribution. This solution can\nhandle different problems such as Positive and Unlabeled (PU) learning or covariate\nshift, especially with imbalanced data. Using autoencoders allows handling a large\nvariety of data, such as images, text or even dialogues. Our experiments show\nthe \ufb02exibility of the proposed approach in dealing with different types of data in\ndifferent settings: images with CIFAR-10 and CIFAR-100 (not-in-training setting),\ntext with Amazon reviews (PU learning) and dialogues with Facebook bAbI (next\nresponse classi\ufb01cation and dialogue completion).\n\n1\n\nIntroduction\n\nThe main purpose of machine learning is to make inferences about unknown data based on encoded\ndependencies between variables learned from known data. Energy-based learning [16] is a framework\nthat achieves this goal by using an energy function that maps each point of an input space to a\nsingle scalar, called energy. 
The fact that energy-based models are not subject to the normalizability condition of probabilistic models makes them a flexible framework for dealing with tasks such as prediction or classification.\nIn recent years, with the advancement of deep learning, astonishing results have been achieved in classification [15, 25, 8, 26]. These solutions focus on the standard setting, in which the classifier learns to discriminate between K classes, based on the underlying assumption that the training and test samples belong to the same distribution. This assumption is violated in many applications in which the dynamic nature [6] or the high cardinality [19] of the problem prevents the collection of a representative training set. In the literature, this problem is referred to as covariate shift [7, 24].\nIn this work, we address the covariate shift problem by explicitly learning features that define the intrinsic characteristics of a given class of data rather than features that discriminate between different classes. The aim is to distinguish between samples of a positive class (A) and samples that do not belong to this class (\u00acA), even when test samples are not drawn from the same distribution as the training samples. We achieve this goal by introducing an energy-based model that is adversarial regarding data: it minimizes the energy for a given data distribution (the positive samples) while maximizing the energy for another given data distribution (the negative or unlabeled samples). 
The model is instantiated with autoencoders because of their ability to learn data manifolds.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nIn summary, our contributions are the following:\n\n\u2022 a simple energy-based model dealing with the A/\u00acA classification problem by providing a distance measure of unknown data as the energy value;\n\u2022 a general framework that can deal with a large variety of data (images, text and sequential data) by using features extracted from an autoencoder architecture;\n\u2022 a model that implicitly addresses the imbalanced classification problem;\n\u2022 state-of-the-art results for the dialogue completion task on the Facebook bAbI dataset and competitive results for the general A/\u00acA classification problem using different datasets such as CIFAR-10, CIFAR-100 and Amazon Reviews.\n\nThe next section introduces the proposed \u201creconstruct & crush\u201d network, section 3 positions our approach with respect to related work, section 4 presents the experimental results on the aforementioned problems and section 5 draws the conclusions.\n\n2 Model\nLet us define ppos as the probability distribution producing positive samples, xpos \u223c ppos. Similarly, write pneg for the distribution of negative samples, xneg \u223c pneg. More generally, these negative samples can be unlabeled samples (possibly containing positive samples). 
This case will be considered empirically, but we keep this notation for now.\nLet N denote a neural network that takes as input a sample x and outputs a (positive) energy value E:\n\nN (x) = E \u2208 R+.\n\nThe proposed approach aims at learning a network N that assigns low energy values to positive samples (N (xpos) small for xpos \u223c ppos) and high energy values to negative samples (N (xneg) large for xneg \u223c pneg).\nLet m > 0 be a user-defined margin; we propose to use the following loss L and associated risk R:\n\nL(xpos, xneg; N ) = N (xpos) + max(0, m \u2212 N (xneg))\n\nR(N ) = E_{xpos\u223cppos, xneg\u223cpneg}[L(xpos, xneg; N )]\n= E_{xpos\u223cppos}[N (xpos)] + E_{xneg\u223cpneg}[max(0, m \u2212 N (xneg))].\n\n(1)\n\nIdeally, minimizing this risk amounts to having no reconstruction error over positive samples and a reconstruction error greater than m (in expectation) over negative samples. The second term of the risk acts as a regularizer that forces the network to assign a low energy only to positive samples. The choice of the margin m affects the behavior of the network: if m is too small, a low energy will be assigned to all inputs (both positive and negative), while if m is too large, assigning a large energy to negative samples can prevent the network from reconstructing the positive ones.\nWe specialize our model with autoencoders, which are a natural choice to represent energy-based models. 
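As an aside for implementers, the per-pair loss above can be sketched in a few lines of plain Python (a minimal illustration with scalar energies standing in for network outputs; rcn_loss and rcn_risk are hypothetical names, not part of any released code):

```python
def rcn_loss(energy_pos, energy_neg, margin=1.0):
    # Per-pair loss of Eq. (1): the first term pulls the positive
    # energy toward 0, the hinge term pushes the negative energy
    # above the margin m (and vanishes once it gets there).
    return energy_pos + max(0.0, margin - energy_neg)

def rcn_risk(energies_pos, energies_neg, margin=1.0):
    # Empirical risk: average loss over independently sampled
    # positive and negative mini-batches of equal size.
    pairs = zip(energies_pos, energies_neg)
    return sum(rcn_loss(ep, en, margin) for ep, en in pairs) / len(energies_pos)
```

Note that a negative sample whose energy already exceeds m contributes zero loss (and hence zero gradient), which is what keeps the "crush" term from growing unbounded.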
An autoencoder is composed of two parts: the encoder (Enc), which projects the data into an encoding space, and the decoder (Dec), which reconstructs the data from this projection:\n\nEnc : X \u2192 Z\nDec : Z \u2192 X\n\nargmin_{Enc, Dec} \u2016x \u2212 Dec(Enc(x))\u2016\u00b2.\n\nHere, X is the space of the input data (either positive or negative) and Z is the space of encoded data. In this setting, the reconstruction error of a sample x can be interpreted as the energy value associated with that sample:\n\nN (x) = \u2016x \u2212 Dec(Enc(x))\u2016\u00b2 = E.\n\nOur resulting reconstruct & crush network (RCN) is thus trained to assign a low reconstruction error to xpos (reconstruct) and a high reconstruction error to xneg (crush).\nAny stochastic gradient descent method can be used to optimize the risk of Eq. (1), the mini-batches of positive and negative samples being sampled independently from the corresponding distributions.\n\n3 Related work\n\nWith the diffusion of deep neural networks, autoencoders have received a new wave of attention due to their use for layer-wise pretraining [1]. Although the concept of autoencoders goes back to the 80s [23, 3, 10], many variations have been proposed more recently, such as denoising autoencoders [27], stacked autoencoders [9] or variational autoencoders [13].\nAlthough the use of autoencoders for pretraining is no longer common practice, various research efforts still take advantage of their properties. In energy-based generative adversarial networks (EBGAN) [30], an autoencoder architecture is used to discriminate between real samples and "fake" ones produced by the generator. Despite not being a generative model, our method shares with EBGAN the interpretation of the reconstruction error provided by the autoencoder as an energy value and the fundamentals of the discriminator loss. 
However, instead of the samples produced by the generator network, we use negative or unlabeled samples to push the autoencoder to discover the data manifold during training. In other words, EBGAN searches for a generative model by training adversarial networks, while in our framework the network tries to make two distributions adversarial.\nThe use of unlabeled data (that could contain both positive and negative samples) together with positive samples during training is referred to as PU (Positive and Unlabeled) learning [5, 17]. In the literature, works in the PU learning setting [29, 18] focus on text-based applications. Instead, we show in the experiments that our work can be applied to different types of data such as images, text and sequential data.\nSimilarly to our work, [11] uses the reconstruction error as a measure to differentiate between positive and negative samples. However, they train their network with either positive or negative data only. In addition, instead of end-to-end training, they use a two-stage process in which a classifier is trained to discriminate between positive and negative samples based on the reconstruction error.\nIn the context of dialogue management systems, the score proposed in [21] has been used as a quality measure of the response. Nevertheless, [19] shows that this score fails when a correct response that largely diverges from the ground truth is given. The energy value of the RCN is a valid score to discriminate between good and bad responses, as we show in section 4.4.\n\n4 Experimental results\n\nIn this section, we evaluate the proposed RCN on various tasks with various kinds of data. 
We consider a not-in-training setting for CIFAR-10 and CIFAR-100 (sections 4.1 and 4.2), a PU learning setting for the Amazon reviews dataset (section 4.3) and a dialogue completion setting for the Facebook bAbI dataset (section 4.4).\nFor illustrative purposes, we also provide examples of reconstructed and crushed images from CIFAR-10 and CIFAR-100 in figure 1, corresponding to the experiments of sections 4.1 and 4.2.\n\n4.1 CIFAR-10\n\nCIFAR-10 consists of 60k 32x32 color images in 10 classes, with 6k images per class. There are 50k training images and 10k test images [14]. We converted the images to gray-scale and used 5k images per class.\nThis set of experiments belongs to the not-in-training setting [6]: the training set contains positive and negative samples and the test set belongs to a different distribution than the training set. The \u201cautomobile\u201d class is used as the positive class (A) and the rest of the classes are considered to be the negative class (\u00acA) (binary classification problem). All the training samples are used for training, except for those belonging to the \u201cship\u201d class. Test samples of \u201cautomobile\u201d and \u201cship\u201d are used for testing. It is worth noticing that the sizes of the positive and negative training sets are highly imbalanced: 5k positive samples and 40k negative samples.\nIn this experiment, we show the superior performance of our network with respect to standard classifiers in dealing with images of an unseen class. Since we are dealing with a binary classification problem, we define a threshold T on the energy value. This threshold is used in RCN to distinguish between the positive and the negative class. 
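In code, this decision rule is a one-liner (a hypothetical helper for illustration; the label strings are ours, and the default T = 0.5 matches the CIFAR-10 setup):

```python
def predict(energy, threshold=0.5):
    # Label a sample positive (A) when its reconstruction error
    # (energy) falls below the threshold T, negative (notA) otherwise.
    return "A" if energy < threshold else "notA"
```

With margin m = 1.0, T = 0.5 sits halfway between the target energies of the two classes (0 for positives, m for negatives).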
For our autoencoder, we used a convolutional network defined as: (32)3c1s-(32)3c1s-(64)3c2s-(64)3c2-(32)3c1s-512f-1024f, where \u201c(32)3c1s\u201d denotes a convolution layer with 32 output feature maps, kernel size 3 and stride 1, and \u201c512f\u201d denotes a fully-connected layer with 512 hidden units.\n\nFigure 1: Illustrations of reconstructed and crushed images by RCN from CIFAR-10 and CIFAR-100.\n\nThe size of the last layer corresponds to the size of the images (32x32=1024). For standard classification we add, on top of the last layer, another fully-connected layer with 2 output neurons (A/\u00acA). The choice of the architectures for the standard classifier and the autoencoder is driven by the necessity of a fair comparison. ReLU activation functions are used for all the layers, except for the last fully-connected layer of the standard classifier, in which a Softmax function is used. These models are implemented in TensorFlow and trained with the Adam optimizer [12] (learning rate of 0.0004) and a mini-batch size of 100 samples. The margin m was set to 1.0 and the threshold T to 0.5.\nTable 1 shows the true positive rate (TPR = #(correctly classified cars)/#cars) and the true negative rate (TNR = #(correctly classified ships)/#ships) obtained by the standard classifier (CNN / CNN-reduced) and our network (RCN). CNN-reduced shows the performance of the standard classifier when using the same amount of positive and negative samples. It can be noticed that RCN presents the best TNR and a TPR comparable to that of CNN-reduced. These results show that RCN is a better solution when dealing with not-in-training data. 
In addition, the TPR and TNR of our method are comparable despite the imbalanced training set.\nFigure 2 clearly shows that not-in-training samples (ship images) are positioned between positive in-training samples (automobile images) and negative in-training samples (images from all classes except automobile and ship). It can be noticed that negative in-training samples have a reconstruction loss close to the margin value 1.0.\n\nTable 1: Performances of the standard classifier (CNN / CNN-reduced) and our method (RCN) on CIFAR-10. The positive class corresponds to "automobile" and the negative class corresponds to "ship" (unseen during the training phase).\n\nMethod | True Positive Rate | True Negative Rate\nCNN-reduced | 0.82 | 0.638\nCNN | 0.74 | 0.755\nRCN | 0.81 | 0.793\n\nFigure 2: Mean reconstruction error over the epochs of positive in-training, negative in-training and negative not-in-training samples of CIFAR-10.\n\n4.2 CIFAR-100\n\nCIFAR-100 is similar to CIFAR-10, except it has 100 classes containing 600 images each (500 for training and 100 for testing) [14]. The 100 classes in CIFAR-100 are grouped into 20 super-classes with 5 classes each. Each image comes with a pair of labels: the class and the super-class.\nIn this set of experiments, the \u201cfood containers\u201d super-class is used as the positive class (A) and all the other super-classes are considered to be the negative class (\u00acA) (binary classification problem). During training, 4 out of 5 classes belonging to the \u201cfood containers\u201d super-class (\u201cbottles\u201d, \u201cbowls\u201d, \u201ccans\u201d, \u201ccups\u201d) are used as the positive training set and 4 out of 5 classes belonging to the \u201cflowers\u201d super-class (\u201corchids\u201d, \u201cpoppies\u201d, \u201croses\u201d, \u201csunflowers\u201d) are used as the negative training set. 
At test time, two in-training classes (\u201ccups\u201d and \u201csunflowers\u201d), two not-in-training classes belonging to \u201cfood containers\u201d (\u201cplates\u201d) and \u201cflowers\u201d (\u201ctulips\u201d), and two not-in-training classes belonging to external super-classes (\u201ckeyboard\u201d and \u201cchair\u201d) are used.\nIn this experiment, we show the superior performance of our network with respect to standard classifiers in dealing with data coming from unknown distributions and from unseen modes of the same distributions as the training data. The same networks and parameters of section 4.1 are used here.\nTable 2 shows the true positive rate (TPR = #(correctly classified plates)/#plates) and the true negative rate (TNR = #(correctly classified tulips)/#tulips) obtained by the standard classifier (CNN) and our network (RCN). It can be noticed that RCN presents the best results both for TNR and for TPR. These results show that RCN is a better solution when dealing with not-in-training data coming from unseen modes of the data distribution. It is worth noticing that only 4k samples (2k positive and 2k negative) have been used during training.\nFigure 3 clearly shows the effectiveness of the learning procedure of our framework: the network assigns low energy values (close to 0) to positive samples, high energy values (close to m) to negative samples related to the negative training set and medium energy values (close to m/2) to negative samples unrelated to the negative training set.\n\nTable 2: Performances of the standard classifier (CNN) and our method (RCN) on CIFAR-100. 
The positive class corresponds to "plates" and the negative class corresponds to "tulips".\n\nMethod | True Positive Rate | True Negative Rate\nCNN | 0.81 | 0.718\nRCN | 0.853 | 0.861\n\nFigure 3: Mean reconstruction error over the epochs of positive in-training and not-in-training (blue), negative in-training and not-in-training (red) and not-in-training unrelated (green, black) samples of CIFAR-100.\n\n4.3 Amazon review\n\nAmazon reviews is a dataset containing product reviews (ratings, text, helpfulness votes) and metadata (descriptions, category information, price, brand, and image features) from Amazon, including 142.8 million reviews spanning [20]. Here, we only use the ratings and text features.\nThis set of experiments belongs to the PU learning setting: the training set contains positive and unlabeled data. The positive training set contains 10k "5-star" reviews and the unlabeled training set contains 10k unlabeled reviews (containing both positive and negative reviews). The test set is composed of 10k samples: 5k "5-star" (positive) reviews and 5k "1-star" (negative) reviews. The aim here is to show that RCN performs well in the PU learning setting with unlabeled sets with different positive/negative sample ratios.\nFor handling the text data, we used pretrained GloVe word embeddings [22] with 100 feature dimensions. We set the maximum number of words in a sentence to 40 and zero-padded shorter sentences.\nFor our autoencoder, we used a 1-dimensional (1D) convolutional network defined as: (128)7c1s-(128)7c1s-(128)3c1s-(128)3c1-(128)3c1s-2048f-4000f, where \u201c(128)7c1s\u201d denotes a 1D convolution layer with 128 output feature maps, kernel size 7 and stride 1. ReLU activation functions are used for all the layers. These models are implemented in TensorFlow and trained with the Adam optimizer (learning rate of 0.0004) and a mini-batch size of 100 samples. 
The margin m was set to 0.85 and the threshold T to 0.425.\nTable 3 shows the results of different well-established PU learning methods, together with ours (RCN), on the Amazon review dataset. It can be noticed that, despite the fact that the architecture of our method is not specifically designed for the PU learning setting, it shows results comparable to the other methods, even when unlabeled training data with a considerable amount of positive samples (50%) are used.\nTable 4 presents some examples from the test set. It can be noticed that positive comments are assigned a low reconstruction error (energy) and vice-versa.\n\nTable 3: F-measure of positive samples obtained with Roc-SVM [28], Roc-EM [18], Spy-SVM [18], NB-SVM [18], NB-EM [18] and RCN (ours). The scores are obtained on two different configurations of the unlabeled training set: one containing 5% of positive samples and one containing 50% of positive samples.\n\nMethod | F-measure for pos. samples (5%-95%) | F-measure for pos. samples (50%-50%)\nRoc-SVM [28] | 0.92 | 0.89\nRoc-EM [18] | 0.91 | 0.90\nSpy-SVM [18] | 0.92 | 0.89\nNB-SVM [18] | 0.92 | 0.86\nNB-EM [18] | 0.91 | 0.86\nRCN | 0.90 | 0.83\n\nTable 4: Examples of positive (5/5 score) and negative (1/5 score) reviews from the Amazon review dataset with the corresponding reconstruction error assigned by RCN.\n\nexcellent funny fast reading i would recommend to all my friends (score 5/5, error 0.00054)\nthis is easily one of my favorite books in the series i highly recommend it (score 5/5, error 0.00055)\nsuper book liked the sequence and am looking forward to a sequel keeping the s and characters would be nice (score 5/5, error 0.00060)\ni truly enjoyed all the action and the characters in this book the interactions between all the characters keep you drawn in to the book (score 5/5, error 0.00066)\nthis book was the worst zombie book ever not even worth the review (score 1/5, error 1.00627)\nway too much sex and i am not a prude i did not finish and then deleted the book (score 1/5, error 1.00635)\nin reality it rates no stars it had a political agenda in my mind it was a waste my money (score 1/5, error 1.00742)\nfortunately this book did not cost much in time or money it was very poorly written an ok idea poorly executed and poorly developed (score 1/5, error 1.00812)\n\n4.4 Facebook bAbI dialogue\n\nFacebook bAbI dialogue is a dataset containing dialogues related to 6 different tasks in which the user books a table in a restaurant with the help of a bot [2]. For each task, 1k training and 1k test dialogues are provided. Each dialogue has 4 to 11 turns between the user and the bot, for a total of \u223c6k turns in each set (training and test) for task 1 and \u223c9.5k turns in each set for task 2. Here, we consider the training and test data associated with tasks 1 and 2 because the other tasks require querying a Knowledge Base (KB) upon user request: this is outside the scope of the paper.\nIn task 1, the user requests to make a new reservation in a restaurant by defining a query that can contain from 0 to 4 required fields (cuisine type, location, number of people and price range), and the bot asks questions to fill the missing fields. 
In task 2, the user requests to update a reservation in a restaurant between 1 and 4 times.\nThe training set is built in such a way that, for each turn in a dialogue, together with the positive (correct) response, 100 possible negative responses are selected from the candidate set (the set of all bot responses in the Facebook bAbI dialogue dataset, with a total of 4212 samples). The test set is built in such a way that, for each turn in a dialogue, all possible negative responses are selected from the candidate set. More precisely, for task 1, the test set contains approximately 6k positive and 25 million negative dialogue history-reply pairs, while for task 2, it contains approximately 9k positive and 38 million negative pairs.\nFor our autoencoder, we use a gated recurrent unit (GRU) [4] with 1024 hidden units and a projection layer on top of it in order to replicate the input sequence in the output. An upper limit of 100 was set for the sequence length and a feature size of 50 was selected for the word embeddings. The GRU uses ReLU activation and a dropout of 0.1. This model is implemented in TensorFlow and trained with the Adam optimizer (learning rate of 0.0004) and a mini-batch size of 100 samples.\nIn these experiments, our network matches the state-of-the-art performance of the memory networks presented in [2] by achieving 100% accuracy both for next response classification and for dialogue completion, where a dialogue is considered completed if all responses within the dialogue are correctly chosen.\n\n5 Conclusions\n\nWe have introduced a simple energy-based model, adversarial regarding data: it minimizes the energy of positive data and maximizes the energy of negative data. The model is instantiated with autoencoders, where the specific architecture depends on the considered task, thus providing a family of RCNs. 
Such an approach can address various covariate shift problems, such as the not-in-training and positive and unlabeled (PU) learning settings, with various types of data.\nThe efficiency of our approach has been studied with exhaustive experiments on CIFAR-10, CIFAR-100, the Amazon reviews dataset and the Facebook bAbI dialogue dataset. These experiments showed that RCN can obtain state-of-the-art results for the dialogue completion task and competitive results for the general A/\u00acA classification problem. These results also suggest that the energy value provided by RCN can be used to assess the quality of a response given the dialogue history; we plan to study this aspect further, in order to provide an alternative metric for dialogue system evaluation. Future work will extend RCN to the multi-class classification setting.\n\nAcknowledgments\n\nThis work has been funded by the European Union Horizon2020 MSCA ITN ACROSSING project (GA no. 616757). The authors would like to thank the members of the project\u2019s consortium for their valuable inputs.\n\nReferences\n[1] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1\u2013127, 2009.\n\n[2] A. Bordes and J. Weston. Learning end-to-end goal-oriented dialog. arXiv:1605.07683, 2016.\n\n[3] H. Bourlard and Y. Kamp. Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics, 59(4):291\u2013294, 1988.\n\n[4] K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.\n\n[5] F. Denis. PAC learning from positive statistical queries. Algorithmic Learning Theory, 112\u2013126, 1998.\n\n[6] G. Fei and B. Liu. 
Social media text classification under negative covariate shift. EMNLP, 2015.\n\n[7] W.H. Greene. Sample selection bias as a specification error: A comment. Econometrica: Journal of the Econometric Society, pages 795\u2013798, 1981.\n\n[8] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CVPR, 2016.\n\n[9] G.E. Hinton and R.R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504\u2013507, 2006.\n\n[10] G.E. Hinton and R.S. Zemel. Autoencoders, minimum description length and Helmholtz free energy. NIPS, 1994.\n\n[11] N. Japkowicz, C. Myers, and M. Gluck. A novelty detection approach to classification. IJCAI, 1995.\n\n[12] D. Kingma and J. Ba. Adam: A method for stochastic optimization. ICLR, 2015.\n\n[13] D. Kingma and M. Welling. Auto-encoding variational Bayes. ICLR, 2013.\n\n[14] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.\n\n[15] A. Krizhevsky, I. Sutskever, and G.E. Hinton. ImageNet classification with deep convolutional neural networks. NIPS, 2012.\n\n[16] Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, and F.J. Huang. A tutorial on energy-based learning. Technical report, MIT Press, 2006.\n\n[17] X. Li and B. Liu. Learning from positive and unlabeled examples with different data distributions. ECML, 2005.\n\n[18] B. Liu, Y. Dai, X. Li, W.-S. Lee, and P. Yu. Building text classifiers using positive and unlabeled examples. ICDM, 2003.\n\n[19] C. Liu, R. Lowe, I.V. Serban, M. Noseworthy, L. Charlin, and J. Pineau. How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. EMNLP, 2016.\n\n[20] J. McAuley and J. Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. RecSys, 2013.\n\n[21] K. Papineni, S. Roukos, T. Ward, and W. Zhu. 
BLEU: a method for automatic evaluation of machine translation. ACL, 2002.\n\n[22] J. Pennington, R. Socher, and C.D. Manning. GloVe: Global vectors for word representation. EMNLP, 2014.\n\n[23] D.E. Rumelhart, G.E. Hinton, and R.J. Williams. Learning representations by back-propagating errors. Cognitive Modeling, 5(3):1, 1988.\n\n[24] H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227\u2013244, 2000.\n\n[25] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.\n\n[26] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CVPR, 2015.\n\n[27] P. Vincent, H. Larochelle, Y. Bengio, and P. Manzagol. Extracting and composing robust features with denoising autoencoders. ACM, 2008.\n\n[28] X. Li and B. Liu. Learning to classify text using positive and unlabeled data. IJCAI, 2003.\n\n[29] H. Yu, J. Han, and K. Chang. PEBL: Positive example based learning for web page classification using SVM. KDD, 2002.\n\n[30] J. Zhao, M. Mathieu, and Y. LeCun. Energy-based generative adversarial networks. ICLR, 2017.\n", "award": [], "sourceid": 2377, "authors": [{"given_name": "Erinc", "family_name": "Merdivan", "institution": "Austrian Institute of Technology GmbH"}, {"given_name": "Mohammad Reza", "family_name": "Loghmani", "institution": "TU Wien"}, {"given_name": "Matthieu", "family_name": "Geist", "institution": "Universit\u00e9 de Lorraine"}]}