{"title": "Effective End-to-end Unsupervised Outlier Detection via Inlier Priority of Discriminative Network", "book": "Advances in Neural Information Processing Systems", "page_first": 5962, "page_last": 5975, "abstract": "Despite the wide success of deep neural networks (DNN), little progress has been made on end-to-end unsupervised outlier detection (UOD) from high dimensional data like raw images. In this paper, we propose a framework named E^3Outlier, which can perform UOD in a both effective and end-to-end manner: First, instead of the commonly-used autoencoders in previous end-to-end UOD methods, E^3Outlier for the first time leverages a discriminative DNN for better representation learning, by using surrogate supervision to create multiple pseudo classes from original unlabelled data. Next, unlike classic UOD that utilizes data characteristics like density or proximity, we exploit a novel property named inlier priority to enable end-to-end UOD by discriminative DNN. We demonstrate theoretically and empirically that the intrinsic class imbalance of inliers/outliers will make the network prioritize minimizing inliers' loss when inliers/outliers are indiscriminately fed into the network for training, which enables us to differentiate outliers directly from DNN's outputs. Finally, based on inlier priority, we propose the negative entropy based score as a simple and effective outlierness measure. 
Extensive evaluations show that E^3Outlier significantly advances UOD performance by up to 30% AUROC against state-of-the-art counterparts, especially on relatively difficult benchmarks.", "full_text": "Effective End-to-end Unsupervised Outlier Detection\n\nvia Inlier Priority of Discriminative Network\n\nSiqi Wang1\u2217, Yijie Zeng2\u2217, Xinwang Liu1, En Zhu1, Jianping Yin3, Chuanfu Xu1, Marius Kloft4\n\nwangsiqi10c@nudt.edu.cn, yzeng004@e.ntu.edu.sg, {xinwangliu, enzhu}@nudt.edu.cn\n\njpyin@dgut.edu.cn, xuchuanfu@nudt.edu.cn, kloft@cs.uni-kl.de\n\n1National University of Defense Technology,\n\n3Dongguan University of Technology,\n\n2Nanyang Technological University\n\n4Technische Universit\u00e4t Kaiserslautern\n\nAbstract\n\nDespite the wide success of deep neural networks (DNN), little progress has been\nmade on end-to-end unsupervised outlier detection (UOD) from high dimensional\ndata like raw images. In this paper, we propose a framework named E3Outlier,\nwhich can perform UOD in a both effective and end-to-end manner: First, instead of\nthe commonly-used autoencoders in previous end-to-end UOD methods, E3Outlier\nfor the \ufb01rst time leverages a discriminative DNN for better representation learning,\nby using surrogate supervision to create multiple pseudo classes from original unla-\nbelled data. Next, unlike classic UOD that utilizes data characteristics like density\nor proximity, we exploit a novel property named inlier priority to enable end-to-end\nUOD by discriminative DNN. We demonstrate theoretically and empirically that\nthe intrinsic class imbalance of inliers/outliers will make the network prioritize\nminimizing inliers\u2019 loss when inliers/outliers are indiscriminately fed into the net-\nwork for training, which enables us to differentiate outliers directly from DNN\u2019s\noutputs. Finally, based on inlier priority, we propose the negative entropy based\nscore as a simple and effective outlierness measure. 
Extensive evaluations show that E3Outlier significantly advances UOD performance by up to 30% AUROC against state-of-the-art counterparts, especially on relatively difficult benchmarks.

1 Introduction

An outlier is defined as "an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism" [1]. In some contexts in the literature, outliers are also referred to as anomalies, deviants, novelties or exceptions [2]. Outlier detection (OD) has broad applications such as financial fraud detection [3], intrusion detection [4], fault detection [5], etc. Various solutions have been proposed to tackle OD (see [6] for a comprehensive review). Based on the availability of labels, these solutions can be divided into the three categories below [7]: 1) Supervised OD (SOD) deals with the case where a training set is provided with both labelled inliers and outliers, but it suffers from expensive data labelling and the rarity of outliers in practice [6]. 2) Semi-supervised OD (SSOD) only requires pure single-class training data that are labelled as "inlier" or "normal", and no outlier is involved during training. 3) Unsupervised OD (UOD) handles completely unlabelled data mixed with outliers, and no data label is provided for training at all.

In this paper we limit our discussion to UOD, as most data are unlabelled in practice and UOD is the most widely applicable setting [7]. In particular, two clarifications of concepts must be made: First, in some literature like [8, 9], "unsupervised outlier/anomaly detection" actually refers to SSOD rather than UOD by our definition.
Second, a recent topic is out-of-distribution sample detection, which detects samples that are not from the distribution of the training samples [10, 11, 12]. It is similar to SSOD, but it requires well-labelled multi-class data for training rather than the single-class data of SSOD. Both cases above differ from the UOD considered in this paper, which uses no label information at all.

∗Authors contributed equally.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Recently, surging image/video data have inspired important UOD applications in computer vision, e.g. refining web image query results [13] and video abnormal event detection [14]. Unfortunately, despite the remarkable success of end-to-end deep neural networks (DNN) in computer vision [15], an effective and end-to-end UOD strategy is still under exploration: State-of-the-art methods [16, 17, 18] rely without exception on deep autoencoders (AE) or convolutional autoencoders (CAE) as an easily achievable route to DNN based UOD, but they all suffer from AE/CAE's ineffective representation learning (detailed in Sec. 3.1). Motivated by this gap, we aim to address UOD in a fashion that is both effective and end-to-end, with the application of detecting outlier images from contaminated datasets.

Contributions. This paper proposes an effective and end-to-end UOD framework named E3Outlier. Specifically, our contributions can be summarized as follows: 1) To liberate DNN based UOD from AE/CAE's ineffective representation learning, E3Outlier for the first time enables the adoption of powerful discriminative DNN architectures like ResNet [19] for representation learning in UOD. This is realized by surrogate supervision, which creates multiple pseudo classes by imposing various simple operations on the original unlabelled data.
2) E3Outlier discovers outliers based on a novel property of discriminative networks named inlier priority, which evidently differs from previous methods that utilize certain data characteristics (e.g. density, proximity, distance) to perform UOD. Through both theory and experiments, we demonstrate that inlier priority encourages the network to prioritize the reduction of inliers' loss during network training. On the foundation of inlier priority, E3Outlier is able to achieve end-to-end UOD by directly inspecting the DNN's outputs, which reflect each datum's priority level. In this way, it avoids the possibly suboptimal performance yielded by feeding the DNN's learned representations into a decoupled UOD method [20]. 3) Based on inlier priority, we explore several strategies and propose a simple and effective negative entropy based score to measure outlierness. Extensive experiments report a remarkable improvement by E3Outlier against state-of-the-art methods, particularly on relatively difficult benchmarks for unsupervised tasks.

2 Related Work

Classic Outlier Detection. For classic SOD, labelled data are utilized to build discriminative models with well-studied supervised binary/multi-class classification techniques, such as support vector machines (SVM) [21], random forests [22] and the recent XGBoost [23]. In contrast, SSOD, which requires only labelled inliers, is much more prevalent; it is also called one-class classification [24] or novelty detection [25]. Classic SSOD usually involves training a model on pure inliers and detecting those data that evidently deviate from this model as outliers; representative SSOD methods include SVM based methods [26, 27], replicator networks/autoencoders [28, 29], and principal component analysis (PCA)/kernel PCA [30, 31]. Compared with SOD and SSOD, UOD handles the most challenging case where no labelled data are available.
Classic UOD methods discover outliers by examining basic characteristics of the data, such as statistical properties [32], cluster membership [33, 34], density [35, 36, 37], proximity [38, 39], etc. Besides, ensemble methods like isolation forest [40] and its variants [41, 42] are popular in UOD. However, most state-of-the-art UOD methods like [40, 37, 13] still require manual feature extraction from high dimensional data like raw images.

DNN based Outlier Detection. DNN's recent success naturally inspires DNN based OD [20]. For SOD, discriminative DNNs can be directly applied, while the main issue is the class imbalance of inliers/outliers [20], which is explored by [43, 44, 45, 46]. For SSOD, the case is more difficult as only labelled inliers are provided. DNN solutions for SSOD fall into three types: Mainstream DNN based SSOD methods handle high dimensional data with label-free generative models, i.e. AE/CAE [47, 48, 49, 50] and generative adversarial networks (GAN) [51, 52, 53]. The second type extends classic SSOD methods into their deep counterparts, such as deep support vector data description [54] and deep one-class SVM [55]. The last type turns SSOD into SOD by certain means like introducing reference datasets [56], intra-class splitting [57], geometric transformations [58] or synthetic outlier generation [59]. As to UOD, the absence of both inlier and outlier labels poses great challenges to combining UOD with DNN, which has resulted in much less progress than for SOD and SSOD. In addition to the naive solution that feeds DNN's learned representations into a separate UOD method [20], to the best of our knowledge only the following works have explored DNN based UOD: Zhou et al.
[17] propose a decoupled solution that combines a deep AE with Robust PCA, which decomposes the inputs into a low-rank part from inliers and a sparse part from outliers.

Figure 1: Surrogate supervision workflow (left) and the comparison of learned representations (right).

For end-to-end UOD, Xia et al. [16] use a deep AE directly and propose a variant that estimates inliers by seeking a threshold that maximizes the inter-class variance of the AE's reconstruction loss; a loss function is designed to encourage the separation of estimated inliers/outliers. Zong et al. [18] jointly optimize a deep AE and an estimation network to perform simultaneous representation learning and density estimation for end-to-end UOD.

Surrogate Supervision. Recent studies propose surrogate supervision to improve DNN pre-training for downstream high-level tasks like image classification and object detection. It imposes certain operations on unlabelled data to create corresponding pseudo classes and provide a supervision signal, such as rotation [60], image patch permutation [61], clustering [62], etc. Surrogate supervision is also called self-supervision (see [63] for a comprehensive survey), but we use the term surrogate supervision to better distinguish it from AE/CAE, which are also viewed as "self-supervised" in some contexts. To the best of our knowledge, our work is the first to connect surrogate supervision with end-to-end UOD.

3 The Proposed E3Outlier Framework

Problem Formulation of UOD. Considering a data space 𝒳 (in this context the space of images), an unlabelled data collection X ⊆ 𝒳 consists of an inlier set X_in and an outlier set X_out, which originate from fundamentally different underlying distributions [1]. Our goal is to obtain an end-to-end UOD method S(·) that in the ideal case outputs S(x) = 1 for an inlier x ∈ X_in and S(x) = 0 for an outlier x ∈ X_out.
In practice, a smaller S(x) indicates a higher likelihood of x being an outlier.

3.1 Surrogate Supervision Based Effective Representation Learning for UOD

Why NOT AE/CAE? We note that existing DNN based UOD methods rely on AE/CAE [16, 17, 18]. However, it is hard for them to handle relatively complex datasets like CIFAR10 and SVHN: As our UOD experiments2 show in Fig. 1(b), even a sophisticated deep CAE with isolation forest [40] performs only slightly better than random guessing (50% AUROC). Similar results are reported for other AE/CAE based unsupervised tasks like deep clustering [64, 65]. This is because AE/CAE typically adopt mean square error (MSE) as the loss function, which forces them to focus on reducing low-level pixel-wise error that is not sensitive to human perception, rather than learning high-level semantic features [66, 67]. Therefore, AE/CAE based representation learning is often ineffective.

Surrogate Supervision. Discriminative DNNs like ResNet [19] and Wide ResNet (WRN) [68] have proved to be highly effective in learning high-level semantic features, but they have not been explored in UOD due to the lack of supervision. To remedy the absence of data labels and substitute for AE/CAE, we propose a surrogate supervision based discriminative network (SSD) for more effective representation learning in UOD. Specifically, we first define an operation set with K operations O = {O(·|y)}_{y=1}^{K}, where y represents the pseudo label associated with the operation O(·|y). Applying an operation O(·|y) to x generates a new datum x^{(y)} = O(x|y), and all data generated by the operation O(·|y) belong to the pseudo class with pseudo label y. Next, given a datum x^{(y′)}, a discriminative DNN with a K-node softmax layer is trained to classify the type of applied

2All UOD experiments in Sec. 3 follow the setup detailed in Sec.
4.1 and the outlier ratio is fixed to 10%.

(a) MNIST ("3") (b) F-MNIST ("bag") (c) CIFAR10 ("horse") (d) SVHN ("3")

Figure 2: Inliers' and outliers' gradient magnitudes on example cases of benchmark datasets during SSD training. The class used as inliers is in brackets.

operation, i.e. the DNN is supposed to classify x^{(y′)} into the y′-th pseudo class. With P^{(y)}(·) and θ denoting the probability output by the y-th node of the softmax layer and the DNN's learnable parameters respectively, the DNN's output probability vector over the K operations is P(x^{(y′)}|θ) = [P^{(y)}(x^{(y′)}|θ)]_{y=1}^{K}. To train such a DNN with an unlabelled data collection X = {x_i}_{i=1}^{N}, the objective function is:

\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} L_{SS}(x_i|\theta) \qquad (1)

where L_{SS}(x_i|θ) is the loss incurred by x_i under surrogate supervision. When the commonly-used cross-entropy loss is used to classify the pseudo classes of surrogate supervision, it can be written as:

L_{SS}(x_i|\theta) = -\frac{1}{K} \sum_{y=1}^{K} \log(P^{(y)}(x_i^{(y)}|\theta)) = -\frac{1}{K} \sum_{y=1}^{K} \log(P^{(y)}(O(x_i|y)|\theta)). \qquad (2)

As to the operation set O, each operation O(·|y) ∈ O is defined as a combination of one or more basic transformations from the following transformation sets: 1) Rotation: this set's transformations rotate images clockwise by a certain degree. 2) Flip: this set's transformations either flip the image or leave it unchanged.
3) Shifting: this set's transformations shift the image by some pixels along the x-axis or y-axis. 4) Patch re-arranging: this set's transformations partition the image into several equally-sized patches and re-organize them into a new image by a certain permutation. Based on them, we construct three operation subsets, i.e. the regular affine transformation set O_RA, the irregular affine transformation set O_IA and the patch re-arranging set O_PR (detailed in Sec. 1 of the supplementary material). The final operation set is O = O_RA ∪ O_IA ∪ O_PR, and Fig. 1(a) shows SSD's entire workflow. To verify SSD's effectiveness, we extract the outputs of its penultimate layer as the learned representations, while the outputs of a deep CAE's intermediate hidden layer (with the same dimension as SSD) are used for comparison. We feed them into isolation forest [40], which is generally acknowledged to be a good UOD method [69], to perform UOD under the same parameterization. As shown in Fig. 1(b), SSD's learned representations outperform CAE's by a large margin (8%-10% AUROC).

3.2 Inlier Priority: The Foundation of End-to-end UOD

Motivation. The simple solution above feeds SSD's learned representations into a decoupled UOD method, which may yield suboptimal performance because SSD and the UOD method are trained separately [18, 20]. Our goal is to achieve end-to-end UOD without using a decoupled UOD method. Recall that outliers are essentially rare patterns in a data collection [7], which implies an intrinsic class imbalance between inliers and outliers. Class imbalance is usually unfavorable in machine learning, as it leads to a bias towards the majority class during training [70, 71].
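The surrogate-supervision pipeline described above can be sketched as follows. This is a minimal NumPy sketch under our own assumptions: the operation set here is a small illustrative subset of the paper's O = O_RA ∪ O_IA ∪ O_PR (the paper uses K = 111 operations), and all function names are ours, not the authors'.

```python
import numpy as np

def build_operation_set():
    """A small illustrative operation set: rotations, a flip, a shift,
    and one patch re-arrangement (a stand-in for the paper's full set)."""
    ops = []
    for k in range(4):  # regular affine: rotations by 0/90/180/270 degrees
        ops.append(lambda img, k=k: np.rot90(img, k))
    ops.append(lambda img: img[:, ::-1])             # horizontal flip
    ops.append(lambda img: np.roll(img, 4, axis=1))  # shift 4 pixels along the x-axis

    def patch_rearrange(img):
        # Partition into 4 equally-sized patches and apply a fixed permutation.
        h, w = img.shape[0] // 2, img.shape[1] // 2
        p = [img[:h, :w], img[:h, w:], img[h:, :w], img[h:, w:]]
        top = np.concatenate([p[3], p[2]], axis=1)
        bot = np.concatenate([p[1], p[0]], axis=1)
        return np.concatenate([top, bot], axis=0)

    ops.append(patch_rearrange)
    return ops

def make_pseudo_labelled_data(X, ops):
    """Apply every operation O(.|y) to every datum; the operation index y
    becomes the pseudo label used to train the discriminative DNN."""
    data, labels = [], []
    for x in X:
        for y, op in enumerate(ops):
            data.append(op(x))
            labels.append(y)
    return np.stack(data), np.array(labels)

ops = build_operation_set()
X = np.random.rand(8, 32, 32)     # 8 toy "images"
data, labels = make_pseudo_labelled_data(X, ops)
print(data.shape, labels.shape)   # (8 * K, 32, 32) and (8 * K,) with K = len(ops)
```

The resulting `(data, labels)` pairs are what Eq. (1)-(2) would be optimized over with a standard K-way cross-entropy loss.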
However, we argue that class imbalance can be favorably exploited in UOD, as it gives rise to "inlier priority": although inliers and outliers are fed indiscriminately into SSD for training, SSD will prioritize the minimization of inliers' loss. This intuition naturally inspires an end-to-end UOD solution that measures how well the SSD's output for a datum matches its target pseudo label, which directly indicates the datum's priority level in training and its likelihood of being an inlier. We demonstrate inlier priority in terms of the two aspects below:

(a) De facto update (b) MNIST ("3") (c) F-MNIST ("bag") (d) CIFAR10 ("horse")

Figure 3: An illustration of the de facto update and example cases of the average de facto update for inliers/outliers during network training. The class used as inliers is in brackets.

Priority by Gradient Magnitude. Our first point is that inliers produce gradients of larger magnitude than outliers when updating the SSD network. To demonstrate this point, we consider an SSD with its network weights randomly initialized by an i.i.d. uniform distribution on [−1, 1]. Without loss of generality, we consider the gradients w.r.t. the weights associated with the c-th class (1 ≤ c ≤ K) between the penultimate layer and the softmax layer, w_c = [w_{s,c}]_{s=1}^{L+1} (w_{L+1,c} is the bias), because these weights are directly responsible for making predictions. For the commonly-used cross-entropy loss L, only data transformed by the c-th operation X^{(c)} = {O(x|c) | x ∈ X} are used to update w_c.
The gradient vector incurred by L is denoted by ∇_{w_c}L = [∇_{w_{s,c}}L]_{s=1}^{L+1}, which is used to update w_c in back-propagation based optimizers like Stochastic Gradient Descent (SGD) [72]. Given unlabelled data with N_in inliers and N_out outliers, X^{(c)} also contains N_in transformed inliers and N_out transformed outliers. Here we are interested in the magnitudes of the transformed inliers' and outliers' aggregated gradients used to update w_c, i.e. ||∇^{(in)}_{w_c}L|| and ||∇^{(out)}_{w_c}L||, which directly reflect inliers'/outliers' strength in affecting the training of SSD. Since SSD is randomly initialized, we compute the expectation of the gradient magnitude. As shown in Sec. 2 of the supplementary material, for a simplified SSD network with a single hidden layer and sigmoid activation, we can quantitatively derive the following approximation of inliers' and outliers' gradient magnitudes:

\frac{E(||\nabla^{(in)}_{w_c}L||^2)}{E(||\nabla^{(out)}_{w_c}L||^2)} \approx \frac{N_{in}^2}{N_{out}^2} \qquad (3)

where E(·) denotes the probability expectation. As the class imbalance between inliers and outliers leads to N_in ≫ N_out, we naturally yield E(||∇^{(in)}_{w_c}L||) ≫ E(||∇^{(out)}_{w_c}L||). Therefore, this serves as a theoretical indication that the gradient magnitude induced by inliers will be significantly larger than that of outliers for an untrained SSD network. Since it is particularly difficult to directly analyze more complex network architectures such as Wide ResNet [68], we empirically examine inliers' and outliers' gradient magnitudes during training (see Fig. 2), and the observations on different benchmarks are consistent with the above analysis of the simplified case: the magnitude of inliers' aggregated gradient is consistently larger than that of outliers throughout SSD training.

Priority by Network Updating Direction.
Our second point is that the network updating direction of SSD biases towards the direction that prioritizes reducing inliers' loss during SSD training. Since training is dynamic and a theoretical analysis is intractable, we demonstrate this point by an empirical verification that computes inliers'/outliers' average "de facto update": As illustrated by Fig. 3(a), consider a datum x_i from a batch of data X; its negative gradient −∇_θL(x_i) is the fastest network updating direction for reducing x_i's loss. However, the network weights θ are actually updated by the negative gradient of the entire batch X, −∇_θL(X) = −(1/N) Σ_i ∇_θL(x_i), which in general differs from the best updating direction for each individual datum. Thus, the de facto update d_i for x_i is the actual gradient magnitude that x_i obtains along its best loss-reduction direction from the network update direction −∇_θL(X), computed by projecting −∇_θL(X) onto the direction of −∇_θL(x_i):

d_i = -\nabla_\theta L(X) \cdot \frac{-\nabla_\theta L(x_i)}{||-\nabla_\theta L(x_i)||}.

In this way, d_i reflects how much effort the network devotes to reducing x_i's loss, and it is a direct indicator of a datum's priority during network training. We calculate the average de facto update of inliers/outliers w.r.t. the weights between SSD's penultimate and softmax layers and visualize some examples in Fig.
3(b)-3(d): Although the average de facto updates of inliers and outliers are very close at the beginning, the average de facto update of inliers becomes evidently higher than that of outliers as training continues, which implies that SSD devotes more effort to reducing inliers' loss through its network updating direction.

(a) MNIST ("3") (b) F-MNIST ("bag") (c) CIFAR10 ("horse") (d) SVHN ("3")

Figure 4: Normalized histograms of inliers'/outliers' S_pl(x). The class used as inliers is in brackets.

Remarks on Inlier Priority. 1) Based on the discussion above, inliers gain priority in terms of both the gradient magnitude and the updating direction of SSD's network weights. Such priority leads to a lower loss for inliers after training, which enables us to discern outliers from SSD's outputs and serves as the foundation of end-to-end UOD. 2) Intuitively, inlier priority should also arise in AE/CAE based end-to-end UOD methods. However, its effect is severely diminished in that case for two reasons: First, AE/CAE typically use the raw image pixels as learning targets, but the intra-class difference of inlier images can be very large, so AE/CAE usually lack a unified learning target like SSD's. Second, AE/CAE are ineffective in learning high-level representations (as discussed in Sec. 3.1), which makes it difficult to capture the common high-level semantics of inlier images.
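The de facto update above can be computed directly from per-datum gradients. A minimal sketch under our own assumptions: the gradients here are stand-in random vectors rather than gradients from a real network, and the 90/10 split mimics the inlier/outlier class imbalance.

```python
import numpy as np

def de_facto_updates(per_datum_neg_grads):
    """Given each datum's negative gradient -grad L(x_i) (one row per datum),
    compute the de facto update d_i: the projection of the batch update
    direction -grad L(X) = mean_i(-grad L(x_i)) onto each datum's own
    fastest loss-reduction direction."""
    G = np.asarray(per_datum_neg_grads, dtype=float)        # shape (N, P)
    batch_update = G.mean(axis=0)                           # -grad L(X)
    unit_dirs = G / np.linalg.norm(G, axis=1, keepdims=True)
    return unit_dirs @ batch_update                         # d_i for each datum

# Toy illustration of inlier priority: 90 "inliers" share a common gradient
# direction (a joint force), while 10 "outliers" point in random directions.
rng = np.random.default_rng(0)
common = rng.normal(size=64)
inlier_grads = common + 0.1 * rng.normal(size=(90, 64))
outlier_grads = rng.normal(size=(10, 64))
d = de_facto_updates(np.vstack([inlier_grads, outlier_grads]))
print(d[:90].mean(), d[90:].mean())  # inliers receive the larger de facto update
```

Because the batch update is dominated by the majority's shared direction, the projection is large for inliers and near zero for the randomly oriented outliers, mirroring the behaviour in Fig. 3(b)-3(d).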
Both factors above prevent inliers from acting as a joint force that dominates the training of the AE and produces a strong inlier priority effect like SSD's, which is also demonstrated by AE/CAE's poor UOD performance in our empirical evaluation (see experimental results in Sec. 4.2).

3.3 Scoring Strategies for UOD

Based on inlier priority, we need a strategy S(·) to score a datum x. Given x^{(y)} = O(x|y) and the probability vector P(x^{(y)}|θ) from SSD's softmax layer, we explore the three strategies below:

Pseudo Label based Score (PL): Inlier priority suggests that SSD will prioritize reducing inliers' loss during training. For the datum x^{(y)}, we note that the calculation of its cross-entropy loss only depends on the probability P^{(y)}(x^{(y)}|θ) that corresponds to its pseudo label y in P(x^{(y)}|θ). Thus, we propose a direct scoring strategy S_pl(x) that averages P^{(y)}(x^{(y)}|θ) over all K operations:

S_{pl}(x) = \frac{1}{K} \sum_{y=1}^{K} P^{(y)}(x^{(y)}|\theta). \qquad (4)

Maximum Probability based Score (MP): PL seems to be an ideal score. However, we note that the operations for surrogate supervision do not always create sufficiently separable classes, e.g. an image of the digit "8" is still an "8" after a flip operation. Hence, misclassifications will happen, and the probability P^{(y)}(x^{(y)}|θ) that corresponds to the pseudo label y may not be the only or the best indicator of how well a datum's loss has been reduced. Therefore, instead of P^{(y)}(x^{(y)}|θ), we alternatively adopt the maximum probability in P(x^{(y)}|θ) to calculate the score S_mp(x) as follows:

S_{mp}(x) = \frac{1}{K} \sum_{y=1}^{K} \max_{t} P^{(t)}(x^{(y)}|\theta). \qquad (5)

Negative Entropy based Score (NE): Both strategies above rely on a single probability retrieved from P(x^{(y)}|θ), while the information in the remaining (K − 1) classes' probabilities is ignored.
If we consider the entire probability distribution P(x^{(y)}|θ), the training actually encourages SSD to output a probability distribution closer to the label's one-hot distribution. With inlier priority, we can expect SSD to output a sharper probability distribution P(x^{(y)}|θ) for inliers and a more uniform P(x^{(y)}|θ) for outliers. Thus, we propose to use the information entropy H(·) [73] as a simple and effective measure of the sharpness of a distribution, which gives the negative entropy based score S_ne(x):

S_{ne}(x) = -\frac{1}{K} \sum_{y=1}^{K} H(P(x^{(y)}|\theta)) = \frac{1}{K} \sum_{y=1}^{K} \sum_{t=1}^{K} P^{(t)}(x^{(y)}|\theta) \log(P^{(t)}(x^{(y)}|\theta)). \qquad (6)

A comparison of PL/MP/NE is given in Sec. 4.2. In Fig. 4(a)-4(d), we calculate the most intuitive score S_pl(x) for inliers/outliers on the benchmarks and visualize the normalized histograms of S_pl(x), which are favorably separable for UOD. Besides, such results also verify the effectiveness of inlier priority.

4 Experiments

4.1 Experiment Setup

UOD Performance Evaluation on Image Benchmarks. We follow the standard procedure from previous image UOD literature [13, 16, 17] to construct an image set with outliers: Given a standard image benchmark, all images from a class with one common semantic concept (e.g. "horse", "bag") are retrieved as inliers, while outliers are randomly sampled from the rest of the classes by an outlier ratio ρ. We vary ρ from 5% to 25% in steps of 5%. The assigned inlier/outlier labels are strictly unknown to the UOD methods and only used for evaluation.
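The three scores of Sec. 3.3 can be computed from a K×K matrix of softmax outputs, where row y holds P(x^{(y)}|θ) for the datum transformed by operation y. A minimal sketch with our own function name; the smoothing constant `1e-12` is our own numerical safeguard, not part of the paper's formulas:

```python
import numpy as np

def uod_scores(P):
    """P[y, t] = P^(t)(x^(y) | theta). Returns (S_pl, S_mp, S_ne);
    for all three scores, smaller means more outlier-like."""
    s_pl = np.mean(np.diag(P))            # Eq. (4): probability of the pseudo label
    s_mp = np.mean(P.max(axis=1))         # Eq. (5): maximum probability
    s_ne = np.mean(np.sum(P * np.log(P + 1e-12), axis=1))  # Eq. (6): negative entropy
    return s_pl, s_mp, s_ne

# A sharp (confident, inlier-like) output vs. a uniform (outlier-like) one.
K = 4
sharp = np.full((K, K), 0.01)
np.fill_diagonal(sharp, 0.97)             # rows sum to 1: 0.97 + 3 * 0.01
uniform = np.full((K, K), 1.0 / K)
print(uod_scores(sharp))                  # all three scores high
print(uod_scores(uniform))                # all three scores low
```

Note that S_ne is the only score that uses the full distribution: a row can have a confident but wrong argmax and still be penalized if its probability mass is spread out.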
Each class of a benchmark is used as inliers in turn, and the performance on all classes is averaged as the overall UOD performance. The experiments are repeated 5 times to report average results. Five public benchmarks are used for the experiments3: MNIST [74], Fashion-MNIST (F-MNIST) [75], CIFAR10 [76], SVHN [77] and CIFAR100 [76]. Raw pixels are directly used as inputs with their intensity normalized into [−1, 1]. As for evaluation, we adopt the commonly-used Area under the Receiver Operating Characteristic curve (AUROC) and Area under the Precision-Recall curve (AUPR) as threshold-independent metrics [78].

Implementation Details and Compared Methods. For E3Outlier, we use an n = 10 layer Wide ResNet (WRN) with a widen factor k = 4 as the backbone DNN architecture. K = 111 operations are used for surrogate supervision, and NE is used as the scoring strategy. Since surrogate supervision augments the original data by K times, we train the WRN for ⌈250/K⌉ epochs. The batch size is 128. A learning rate of 0.001 and a weight decay of 0.0005 are adopted. The SGD optimizer with momentum 0.9 is used for MNIST and F-MNIST, while the Adam optimizer with β = (0.9, 0.999) is used for CIFAR10, CIFAR100 and SVHN for better convergence. We compare E3Outlier with the baselines and existing state-of-the-art DNN based UOD methods (reviewed in Sec. 2) below: 1) CAE [79], which directly uses the CAE's reconstruction loss to perform UOD. 2) CAE-IF, which feeds the CAE's learned representations into isolation forest (IF) [40] as explained in Sec. 3.1. 3) Discriminative reconstruction based autoencoder (DRAE) [16]. 4) Robust deep autoencoder (RDAE) [17]. 5) Deep autoencoding Gaussian mixture model (DAGMM) [18]. 6) SSD-IF, which shares E3Outlier's SSD part but feeds the SSD's learned representations into IF to perform UOD. For all AE based UOD methods above, we adopt the same CAE architecture from [58] with a 4-layer encoder and 4-layer decoder.
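The contaminated-benchmark protocol of Sec. 4.1 can be sketched as follows. This is an illustrative sketch under our own assumptions (toy random data, our own function name); class labels are used only to build the benchmark and for evaluation, never by the UOD method itself.

```python
import numpy as np

def build_uod_benchmark(images, labels, inlier_class, rho, seed=0):
    """All images of one semantic class become inliers; outliers are sampled
    from the remaining classes so they make up a fraction rho of the
    resulting unlabelled collection."""
    rng = np.random.default_rng(seed)
    inliers = images[labels == inlier_class]
    n_out = int(len(inliers) * rho / (1.0 - rho))
    pool = np.flatnonzero(labels != inlier_class)
    outliers = images[rng.choice(pool, size=n_out, replace=False)]
    X = np.concatenate([inliers, outliers])
    y_true = np.concatenate([np.ones(len(inliers)), np.zeros(n_out)])  # for evaluation only
    perm = rng.permutation(len(X))
    return X[perm], y_true[perm]

# Toy data: 1000 "images" over 10 classes, class 3 as inliers, rho = 10%.
images = np.random.rand(1000, 32, 32)
labels = np.repeat(np.arange(10), 100)
X, y_true = build_uod_benchmark(images, labels, inlier_class=3, rho=0.10)
print(len(X), int(y_true.sum()))  # collection size and number of inliers
```

With the hidden `y_true` and a method's scores, threshold-independent AUROC/AUPR can then be computed for evaluation.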
We do not use more complex CAEs (e.g. CAEs using skip connections [80] or more layers), since they usually lower outliers' reconstruction error as well and do not contribute to the CAE's UOD performance. The hyperparameters of the compared methods are set to their recommended values (if provided) or to the values that produce the best performance. More implementation details are given in Sec. 1 of the supplementary material. Our code and results can be verified at https://github.com/demonzyj56/E3Outlier.

4.2 UOD Performance Comparison and Discussion

UOD Performance Comparison. We report the numerical results on each benchmark under ρ = 10% and 20% in Table 1, and UOD performance by AUROC under ρ from 5% to 25% is shown in Fig. 5(a)-5(e) (full results are given in Sec. 4 of the supplementary material). AUPR-in and AUPR-out in Table 1 denote the AUPR calculated when inliers and outliers are used as the positive class respectively. We draw the following observations from the results: Above all, E3Outlier overwhelmingly outperforms existing DNN based UOD methods by a large margin. As Table 1 shows, E3Outlier usually improves AUROC/AUPR by 5% to 30% when compared with state-of-the-art UOD methods. In particular, E3Outlier produces a significant performance leap

3As all images are viewed as unlabelled in UOD, we do not split train/test sets. CIFAR100 uses its 20 superclasses.

Table 1: AUROC/AUPR-in/AUPR-out (%) for UOD methods.
The best performance is in bold.

Dataset    ρ    CAE             CAE-IF          DRAE            RDAE            DAGMM           SSD-IF          E3Outlier
MNIST      10%  68.0/92.0/32.9  85.5/97.8/49.0  66.9/93.0/30.5  71.8/93.1/35.8  64.0/92.9/26.6  93.8/99.2/68.7  94.1/99.3/67.5
           20%  64.0/82.7/40.7  81.5/93.6/57.2  67.2/86.6/42.5  67.0/84.2/43.2  65.9/86.4/41.3  90.5/97.3/71.0  91.3/97.6/72.3
F-MNIST    10%  70.3/94.3/29.3  82.3/97.2/40.3  67.1/93.9/25.5  75.3/95.8/31.7  64.0/92.7/30.3  90.6/98.5/68.6  93.3/99.0/75.9
           20%  64.4/85.3/36.8  77.8/92.2/49.0  65.7/86.9/36.6  70.9/89.2/41.4  66.0/86.7/43.5  87.6/95.6/71.4  91.2/97.1/78.9
CIFAR10    10%  55.9/91.0/14.4  54.1/90.2/13.7  56.0/90.7/14.7  55.4/90.7/14.0  56.1/91.3/15.6  64.0/93.5/18.3  83.5/97.5/43.4
           20%  54.7/81.6/25.5  53.8/80.7/25.3  55.6/81.7/26.8  54.2/81.0/25.7  54.7/81.8/26.3  60.2/85.0/28.3  79.3/93.1/52.7
SVHN       10%  51.2/90.3/10.6  55.0/91.4/11.9  51.0/90.3/10.5  52.1/90.6/10.8  50.0/90.0/19.3  73.4/95.9/22.0  86.0/98.0/36.7
           20%  50.7/80.2/20.7  54.0/82.0/22.4  50.6/80.4/20.5  51.8/80.9/21.1  50.0/79.9/29.6  69.2/89.5/33.7  81.0/93.4/47.0
CIFAR100   10%  55.2/91.0/14.5  54.5/90.7/13.8  55.6/90.9/15.0  55.8/90.9/15.0  54.9/91.1/14.2  55.6/91.5/13.0  79.2/96.8/33.3
           20%  54.4/81.7/25.6  53.5/80.9/25.1  55.5/81.8/27.0  54.9/81.5/26.5  53.8/81.5/24.7  54.3/82.1/23.4  77.0/92.4/46.5

Figure 5: UOD performance (AUROC) comparison with varying ρ from 5% to 25%. Panels: (a) MNIST, (b) F-MNIST, (c) CIFAR10, (d) SVHN, (e) CIFAR100.

(≥ 20% AUROC gain) on CIFAR10, SVHN and CIFAR100, which have consistently been difficult benchmarks for UOD. Next, the end-to-end E3Outlier almost consistently outperforms its decoupled counterpart SSD-IF. Although SSD-IF performs close to E3Outlier in simple cases, E3Outlier evidently prevails over SSD-IF on CIFAR10/SVHN/CIFAR100, with 11% to 24% AUROC gain. By contrast, the decoupled CAE-IF/RDAE achieve better UOD performance than their end-to-end counterparts CAE/DRAE/DAGMM on MNIST/F-MNIST, and all of them yield inferior performance on CIFAR10/SVHN/CIFAR100.
Hence, the observations above justify E3Outlier as a highly effective, end-to-end UOD solution. In addition, we would like to make two remarks: 1) We must point out that the data augmentation effect (surrogate supervision augments the training data by K times) is not the reason why E3Outlier outperforms existing methods by a large margin. Experiments show that when we train the CAE with the same training data as E3Outlier, the performance typically becomes worse than the original CAE (e.g. 55.5%/63.9%/54.2%/50.0%/53.8% AUROC on MNIST/F-MNIST/CIFAR10/SVHN/CIFAR100 when ρ = 10%). By contrast, E3Outlier can effectively exploit the high-level discriminative label information from the data of pseudo classes, which is fundamentally different from generative models like the AE/CAE. 2) To fairly compare the quality of the representations learned by the CAE and the SSD, the CAE's hidden layer by default shares the SSD's penultimate layer dimension, which is fixed to 256 by the Wide ResNet architecture. A different latent dimension may influence the CAE's performance, but it cannot enable the CAE to perform comparably to E3Outlier, especially on difficult datasets like CIFAR10. We also test other values for the CAE's latent dimension, and experimental results show that even a carefully selected latent dimension (e.g. 64) that performs best on most benchmarks brings minimal gain to the CAE's performance on the difficult datasets CIFAR10/CIFAR100 (e.g. 56.3%/56.1% AUROC when ρ = 10%), while on the simpler datasets (MNIST/F-MNIST/SVHN) the CAE's performance (71.9%/75.6%/53.4%, ρ = 10%) still lags far behind E3Outlier (94.1%/93.3%/86.0%) despite some limited improvement. More importantly, an a priori choice of the optimal latent dimension or CAE architecture for UOD is difficult in itself.

Discussion. We discuss five factors that are related to our E3Outlier framework's performance through experiments.
Since the trends under different values of ρ are fairly similar, we visualize the results when using ρ = 10%.

Figure 6: Different factors' influence on E3Outlier's performance under ρ = 10%. Panels: (a) Operation set, (b) Network architecture, (c) Scoring strategy, (d) Training epochs.

1) Operation set for surrogate supervision (see Fig. 6(a)): We test the UOD performance with different combinations of operation subsets as O. The results suggest that O_RA alone already works satisfactorily, but the union of O_RA, O_IA and O_PR produces the best performance, which reflects the extendibility of the operation set. 2) Network architecture (see Fig. 6(b)): In addition to WRN, we explore ResNet-20/ResNet-50 [19] and DenseNet-40 [81] for the SSD with other settings fixed. The results show that these architectures all achieve satisfactory UOD performance with minor differences, which verifies the applicability of different network architectures. In particular, we note that a more complex architecture (ResNet-50/DenseNet-40) improves the UOD performance on relatively complex datasets (CIFAR10, SVHN and CIFAR100), but its performance is inferior on simple datasets. 3) Scoring strategy (see Fig. 6(c)): Among the three scoring strategies (PL/MP/NE) proposed in Sec. 3.3, NE consistently yields the best performance, with up to 2.3% AUROC gain compared with PL/MP, while MP also outperforms the naive PL.
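As a minimal NumPy sketch of the negative entropy (NE) score, assume each sample's softmax outputs over its K transformed copies are available as a (K, C) array; the exact aggregation follows Sec. 3.3 of the paper, which is outside this excerpt, so this illustrates the idea rather than the reference implementation. Under inlier priority, inliers receive confident (low-entropy) predictions, so a higher negative entropy indicates a more inlier-like sample.

```python
import numpy as np

def negative_entropy_score(probs, eps=1e-12):
    """probs: (K, C) softmax outputs for the K transformed copies
    of one sample. Returns the negative mean entropy; confident
    (low-entropy) predictions -> higher score -> more inlier-like."""
    probs = np.clip(probs, eps, 1.0)
    entropy = -np.sum(probs * np.log(probs), axis=1)  # (K,) per-copy entropy
    return -entropy.mean()

# Toy check: a confidently classified (inlier-like) sample scores
# higher than a uniformly uncertain (outlier-like) one.
confident = np.array([[0.97, 0.01, 0.01, 0.01]] * 4)
uncertain = np.full((4, 4), 0.25)
assert negative_entropy_score(confident) > negative_entropy_score(uncertain)
```

Ranking all samples by this score and flagging the lowest-scoring fraction then yields the outlier predictions.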
Thus, we use NE by default for E3Outlier. 4) Training epochs (see Fig. 6(d)): We measure the UOD performance when the SSD is trained for 1 to 10 epochs. In general, the UOD performance improves in the initial stage of training (fewer than 3 epochs) and then stabilizes as the number of training epochs continues to increase. 5) Outlier ratio: First, we note that sometimes the ratio of outliers can be very small (e.g. ≤ 1%), so we also test E3Outlier's performance in such cases. The experiments show that E3Outlier still achieves satisfactory performance: for example, when ρ = 0.5%, E3Outlier achieves 96.0%/93.6%/87.4%/91.0%/80.7% AUROC on MNIST/F-MNIST/CIFAR10/SVHN/CIFAR100 respectively, which is even better than the cases with higher outlier ratios. We also notice that the performance of E3Outlier tends to drop as the outlier ratio ρ increases. This is reasonable in the UOD setting, because the "outlierness" of outliers decreases as their number increases, i.e. they are less likely to be viewed as "outliers" under the unsupervised setting as they gradually play a more important role in constituting the original unlabelled data.

5 Conclusion

In this paper, we propose a framework named E3Outlier to achieve effective and end-to-end UOD from raw image data. E3Outlier exploits surrogate supervision rather than a traditional AE/CAE for representation learning in UOD, while a new property named inlier priority is demonstrated theoretically and empirically as the foundation of end-to-end UOD. With inlier priority and the negative entropy based score, E3Outlier achieves a significant UOD performance leap when compared with state-of-the-art DNN based UOD methods. For future research, it is interesting to explore a quantitative measure of each operation's effectiveness for surrogate supervision and to develop effective late fusion strategies over different operations for scoring.
As an open framework, different network architectures, surrogate supervision operations and scoring strategies can also be explored for E3Outlier.

Acknowledgement

This work is supported by the National Key R&D Program of China 2018YFB1003203 and the National Natural Science Foundation of China (NSFC) under Grants No. 61773392 and 61672528. This work is also supported by the German Research Foundation (DFG) award KL 2698/2-1 and by the German Federal Ministry of Education and Research (BMBF) awards 031L0023A, 01IS18051A, and 031B0770E. Xinwang Liu, En Zhu and Jianping Yin are corresponding authors of this paper.

References

[1] Douglas M Hawkins. Identification of Outliers, volume 11. Springer.

[2] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(3):15, 2009.

[3] Mohiuddin Ahmed, Abdun Naser Mahmood, and Md Rafiqul Islam. A survey of anomaly detection techniques in financial domain. Future Generation Computer Systems, 55:278–288, 2016.

[4] Anna L Buczak and Erhan Guven. A survey of data mining and machine learning methods for cyber security intrusion detection. IEEE Communications Surveys & Tutorials, 18(2):1153–1176.

[5] Alessandra De Paola, Salvatore Gaglio, Giuseppe Lo Re, Fabrizio Milazzo, and Marco Ortolani. Adaptive distributed outlier detection for WSNs. IEEE Transactions on Cybernetics, 45(5):902–913, 2015.

[6] Charu C Aggarwal. Outlier Analysis. Springer, 2016.

[7] Varun Chandola and Vipin Kumar. Outlier detection: A survey.
ACM Computing Surveys, 41(3), 2007.

[8] Thomas Schlegl, Philipp Seeböck, Sebastian M Waldstein, Ursula Schmidt-Erfurth, and Georg Langs. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In International Conference on Information Processing in Medical Imaging, pages 146–157. Springer, 2017.

[9] B Ravi Kiran, Dilip Mathew Thomas, and Ranjith Parakkal. An overview of deep learning based methods for unsupervised and semi-supervised anomaly detection in videos. Journal of Imaging, 4(2):36, 2018.

[10] Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. International Conference on Learning Representations, 2017.

[11] Shiyu Liang, Yixuan Li, and Rayadurgam Srikant. Enhancing the reliability of out-of-distribution image detection in neural networks. International Conference on Learning Representations, 2018.

[12] Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. Neural Information Processing Systems, pages 7167–7177, 2018.

[13] Wei Liu, Gang Hua, and John R Smith. Unsupervised one-class learning for automatic outlier removal. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3826–3833, 2014.

[14] Siqi Wang, Yijie Zeng, Qiang Liu, Chengzhang Zhu, En Zhu, and Jianping Yin. Detecting abnormality without knowing normality: A two-stage approach for unsupervised video abnormal event detection. In 2018 ACM Multimedia Conference on Multimedia, pages 636–644. ACM, 2018.

[15] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.

[16] Yan Xia, Xudong Cao, Fang Wen, Gang Hua, and Jian Sun. Learning discriminative reconstructions for unsupervised outlier removal.
In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1511–1519, 2015.

[17] Chong Zhou and Randy C Paffenroth. Anomaly detection with robust deep autoencoders. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 665–674. ACM, 2017.

[18] Bo Zong, Qi Song, Martin Renqiang Min, Wei Cheng, Cristian Lumezanu, Daeki Cho, and Haifeng Chen. Deep autoencoding Gaussian mixture model for unsupervised anomaly detection. In International Conference on Learning Representations (ICLR), 2018.

[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[20] Raghavendra Chalapathy and Sanjay Chawla. Deep learning for anomaly detection: A survey. arXiv preprint arXiv:1901.03407, 2019.

[21] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

[22] Tin Kam Ho. The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis & Machine Intelligence, (8):832–844, 1998.

[23] Yue Zhao and Maciej K Hryniewicki. XGBOD: improving supervised outlier detection with unsupervised representation learning. In 2018 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2018.

[24] David Martinus Johannes Tax. One-class classification: Concept learning in the absence of counter-examples. 2002.

[25] Markos Markou and Sameer Singh. Novelty detection: a review, part 1: statistical approaches. Signal Processing, 83(12):2481–2497, 2003.

[26] Bernhard Scholkopf, Ralf Herbrich, and Alexander J Smola. A generalized representer theorem. European Conference on Computational Learning Theory, pages 416–426, 2001.

[27] David MJ Tax and Robert PW Duin.
Support vector data description. Machine Learning, 54(1):45–66, 2004.

[28] Graham Williams, Rohan Baxter, Hongxing He, Simon Hawkins, and Lifang Gu. A comparative study of RNN for outlier detection in data mining. In 2002 IEEE International Conference on Data Mining, pages 709–712. IEEE, 2002.

[29] Nathalie Japkowicz, Catherine Myers, and Mark Gluck. A novelty detection approach to classification. In International Joint Conference on Artificial Intelligence, pages 518–523, 1995.

[30] Mei-ling Shyu, Shu-ching Chen, Kanoksri Sarinnapakorn, and Liwu Chang. A novel anomaly detection scheme based on principal component classifier. In Proceedings of the IEEE Foundations and New Directions of Data Mining Workshop, in conjunction with the Third IEEE International Conference on Data Mining (ICDM'03). IEEE, 2003.

[31] Heiko Hoffmann. Kernel PCA for novelty detection. Pattern Recognition, 40(3):863–874, 2007.

[32] Frank E Grubbs. Procedures for detecting outlying observations in samples. Technometrics, 11(1):1–21, 1969.

[33] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, volume 96, pages 226–231, 1996.

[34] Zengyou He, Xiaofei Xu, and Shengchun Deng. Discovering cluster-based local outliers. Pattern Recognition Letters, 24(9-10):1641–1650, 2003.

[35] Markus M Breunig, Hans-Peter Kriegel, Raymond T Ng, and Jörg Sander. LOF: identifying density-based local outliers. In ACM SIGMOD Record, volume 29, pages 93–104. ACM, 2000.

[36] Emanuel Parzen. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3):1065–1076, 1962.

[37] JooSeuk Kim and Clayton D Scott. Robust kernel density estimation.
Journal of Machine Learning Research, 13(Sep):2529–2565, 2012.

[38] Sridhar Ramaswamy, Rajeev Rastogi, and Kyuseok Shim. Efficient algorithms for mining outliers from large data sets. In ACM SIGMOD Record, volume 29, pages 427–438. ACM, 2000.

[39] Fabrizio Angiulli and Clara Pizzuti. Fast outlier detection in high dimensional spaces. In European Conference on Principles of Data Mining and Knowledge Discovery, pages 15–27. Springer, 2002.

[40] Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. Isolation forest. In 2008 Eighth IEEE International Conference on Data Mining, pages 413–422. IEEE, 2008.

[41] Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. On detecting clustered anomalies using SCiForest. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 274–290. Springer, 2010.

[42] Sunil Aryal, Kai Ming Ting, Jonathan R Wells, and Takashi Washio. Improving iForest with relative mass. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 510–521. Springer, 2014.

[43] Chen Huang, Yining Li, Chen Change Loy, and Xiaoou Tang. Learning deep representation for imbalanced classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5375–5384, 2016.

[44] Salman H Khan, Munawar Hayat, Mohammed Bennamoun, Ferdous A Sohel, and Roberto Togneri. Cost-sensitive learning of deep feature representations from imbalanced data. IEEE Transactions on Neural Networks and Learning Systems, 29(8):3573–3587, 2018.

[45] Qi Dong, Shaogang Gong, and Xiatian Zhu. Imbalanced deep learning by minority class incremental rectification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

[46] Chen Huang, Chen Change Loy, and Xiaoou Tang. Discriminative sparse neighbor approximation for imbalanced learning.
IEEE Transactions on Neural Networks and Learning Systems, 29(5):1503–1513, 2018.

[47] Shuangfei Zhai, Yu Cheng, Weining Lu, and Zhongfei Zhang. Deep structured energy based models for anomaly detection. International Conference on Machine Learning, pages 1100–1109, 2016.

[48] Dan Xu, Yan Yan, Elisa Ricci, and Nicu Sebe. Detecting anomalous events in videos by learning deep representations of appearance and motion. Computer Vision and Image Understanding, 156:117–127, 2017.

[49] Mahmudul Hasan, Jonghyun Choi, Jan Neumann, Amit K Roy-Chowdhury, and Larry S Davis. Learning temporal regularity in video sequences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 733–742, 2016.

[50] Yiru Zhao, Bing Deng, Chen Shen, Yao Liu, Hongtao Lu, and Xian-Sheng Hua. Spatio-temporal autoencoder for video anomaly detection. In Proceedings of the 25th ACM International Conference on Multimedia, pages 1933–1941. ACM, 2017.

[51] Lucas Deecke, Robert A Vandermeulen, Lukas Ruff, Stephan Mandt, and Marius Kloft. Anomaly detection with generative adversarial networks. In European Conference on Principles of Data Mining and Knowledge Discovery, pages 3–17, 2018.

[52] Chu Wang, Yan-Ming Zhang, and Cheng-Lin Liu. Anomaly detection via minimum likelihood generative adversarial networks. In 2018 24th International Conference on Pattern Recognition (ICPR), pages 1121–1126. IEEE, 2018.

[53] Thomas Schlegl, Philipp Seeböck, Sebastian M Waldstein, Georg Langs, and Ursula Schmidt-Erfurth. f-AnoGAN: Fast unsupervised anomaly detection with generative adversarial networks. Medical Image Analysis, 54:30–44, 2019.

[54] Lukas Ruff, Nico Görnitz, Lucas Deecke, Shoaib Ahmed Siddiqui, Robert Vandermeulen, Alexander Binder, Emmanuel Müller, and Marius Kloft. Deep one-class classification.
In International Conference on Machine Learning, pages 4390–4399, 2018.

[55] Raghavendra Chalapathy, Aditya Krishna Menon, and Sanjay Chawla. Anomaly detection using one-class neural networks. arXiv preprint arXiv:1802.06360, 2018.

[56] Pramuditha Perera and Vishal M Patel. Learning deep features for one-class classification. arXiv preprint arXiv:1801.05365, 2018.

[57] Patrick Schlachter, Yiwen Liao, and Bin Yang. Deep one-class classification using data splitting. arXiv preprint arXiv:1902.01194, 2019.

[58] Izhak Golan and Ran El-Yaniv. Deep anomaly detection using geometric transformations. In Advances in Neural Information Processing Systems, pages 9758–9769, 2018.

[59] Yezheng Liu, Zhe Li, Chong Zhou, Yuanchun Jiang, Jianshan Sun, Meng Wang, and Xiangnan He. Generative adversarial active learning for unsupervised outlier detection. IEEE Transactions on Knowledge and Data Engineering, 2019.

[60] Nikos Komodakis and Spyros Gidaris. Unsupervised representation learning by predicting image rotations. In International Conference on Learning Representations (ICLR), 2018.

[61] Rodrigo Santa Cruz, Basura Fernando, Anoop Cherian, and Stephen Gould. Visual permutation learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

[62] Xu Ji, Joao F Henriques, and Andrea Vedaldi. Invariant information clustering for unsupervised image classification and segmentation. arXiv preprint, 2018.

[63] Longlong Jing and Yingli Tian. Self-supervised visual feature learning with deep neural networks: A survey. arXiv preprint, 2019.

[64] Xi Peng, Jiashi Feng, Jiwen Lu, Wei-Yun Yau, and Zhang Yi. Cascade subspace clustering. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

[65] Jianlong Chang, Lingfeng Wang, Gaofeng Meng, Shiming Xiang, and Chunhong Pan. Deep adaptive image clustering.
In Proceedings of the IEEE International Conference on Computer Vision, pages 5879–5887, 2017.

[66] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. In International Conference on Machine Learning, pages 1558–1566, 2016.

[67] Alexey Dosovitskiy and Thomas Brox. Generating images with perceptual similarity metrics based on deep networks. In Advances in Neural Information Processing Systems, pages 658–666, 2016.

[68] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In British Machine Vision Conference 2016. British Machine Vision Association, 2016.

[69] Andrew F Emmott, Shubhomoy Das, Thomas Dietterich, Alan Fern, and Weng-Keen Wong. Systematic construction of anomaly detection benchmarks from real data. In Proceedings of the ACM SIGKDD Workshop on Outlier Detection and Description, pages 16–21. ACM, 2013.

[70] Haibo He and Edwardo A Garcia. Learning from imbalanced data. IEEE Transactions on Knowledge & Data Engineering, (9):1263–1284, 2008.

[71] Justin M. Johnson and Taghi M. Khoshgoftaar. Survey on deep learning with class imbalance. Journal of Big Data, 6(1):27, 2019.

[72] Léon Bottou. Online learning and stochastic approximations. On-line Learning in Neural Networks, 17(9):142, 1998.

[73] Thomas M Cover and Joy A Thomas. Elements of Information Theory. John Wiley & Sons, 2012.

[74] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[75] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

[76] Alex Krizhevsky and Geoffrey Hinton.
Learning multiple layers of features from tiny images. Technical report, 2009.

[77] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. 2011.

[78] Jesse Davis and Mark Goadrich. The relationship between precision-recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning, pages 233–240, 2006.

[79] Jonathan Masci, Ueli Meier, Dan Cireşan, and Jürgen Schmidhuber. Stacked convolutional auto-encoders for hierarchical feature extraction. In International Conference on Artificial Neural Networks, pages 52–59. Springer, 2011.

[80] Xiaojiao Mao, Chunhua Shen, and Yu-Bin Yang. Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections. In Advances in Neural Information Processing Systems, pages 2802–2810, 2016.

[81] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708, 2017.