{"title": "Transfer Anomaly Detection by Inferring Latent Domain Representations", "book": "Advances in Neural Information Processing Systems", "page_first": 2471, "page_last": 2481, "abstract": "We propose a method to improve the anomaly detection performance on target\ndomains by transferring knowledge on related domains. Although anomaly labels\nare valuable to learn anomaly detectors, they are difficult to obtain due to their rarity.\nTo alleviate this problem, existing methods use anomalous and normal instances\nin the related domains as well as target normal instances. These methods require\ntraining on each target domain. However, this requirement can be problematic\nin some situations due to the high computational cost of training. The proposed\nmethod can infer the anomaly detectors for target domains without re-training by\nintroducing the concept of latent domain vectors, which are latent representations\nof the domains and are used for inferring the anomaly detectors. The latent\ndomain vector for each domain is inferred from the set of normal instances in the\ndomain. The anomaly score function for each domain is modeled on the basis of\nautoencoders, and its domain-specific property is controlled by the latent domain\nvector. The anomaly score function for each domain is trained so that the scores of\nnormal instances become low and the scores of anomalies become higher than those\nof the normal instances, while considering the uncertainty of the latent domain\nvectors. When target normal instances can be used during training, the proposed\nmethod can also use them for training in a unified framework. The effectiveness\nof the proposed method is demonstrated through experiments using one synthetic\nand four real-world datasets. 
In particular, the proposed method without re-training outperforms existing methods with target-specific training.", "full_text": "Transfer Anomaly Detection by Inferring Latent Domain Representations\n\nAtsutoshi Kumagai\nNTT Software Innovation Center\nNTT Secure Platform Laboratories\natsutoshi.kumagai.ht@hco.ntt.co.jp\n\nTomoharu Iwata\nNTT Communication Science Laboratories\ntomoharu.iwata.gy@hco.ntt.co.jp\n\nYasuhiro Fujiwara\nNTT Communication Science Laboratories\nyasuhiro.fujiwara.kh@hco.ntt.co.jp\n\nAbstract\n\nWe propose a method to improve the anomaly detection performance on target domains by transferring knowledge on related domains. Although anomaly labels are valuable to learn anomaly detectors, they are difficult to obtain due to their rarity. To alleviate this problem, existing methods use anomalous and normal instances in the related domains as well as target normal instances. These methods require training on each target domain. However, this requirement can be problematic in some situations due to the high computational cost of training. The proposed method can infer the anomaly detectors for target domains without re-training by introducing the concept of latent domain vectors, which are latent representations of the domains and are used for inferring the anomaly detectors. The latent domain vector for each domain is inferred from the set of normal instances in the domain. The anomaly score function for each domain is modeled on the basis of autoencoders, and its domain-specific property is controlled by the latent domain vector. The anomaly score function for each domain is trained so that the scores of normal instances become low and the scores of anomalies become higher than those of the normal instances, while considering the uncertainty of the latent domain vectors. When target normal instances can be used during training, the proposed method can also use them for training in a unified framework.
The effectiveness of the proposed method is demonstrated through experiments using one synthetic and four real-world datasets. In particular, the proposed method without re-training outperforms existing methods with target-specific training.\n\n1 Introduction\n\nAnomaly detection is an important task in artificial intelligence [8, 6]. The goal of anomaly detection is to detect anomalous instances, called anomalies or outliers, that do not conform to the expected normal pattern. Anomaly detection methods have been used in a wide variety of applications, such as intrusion detection [16], fraud detection [33], medical care [30], and industrial asset monitoring [28]. Many semi-supervised anomaly detection methods have been proposed, such as autoencoders (AEs) [43], one-class support vector machines (OSVMs) [44], and isolation forests [35]. Since they require only normal instances, which are relatively easy to prepare, they are widely used in practice. In some situations, anomaly labels that indicate anomalous instances are available. By using anomaly labels, supervised anomaly detection methods can detect anomalies much better than semi-supervised ones [29, 50, 23, 10]. Although anomaly labels are valuable, they are typically difficult to obtain since anomalies rarely occur.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nEven if anomaly labels are difficult to obtain in a domain of interest, called a target domain, they may be obtainable in related domains, called source domains. For example, in cyber-security, security companies monitor customers' networks to prevent cyber-attacks [14]. Although anomalies are difficult to obtain from a new customer's network (target domain), they may be obtained from existing customers' networks (source domains) that have long been monitored.
Similarly, in condition-based monitoring of industrial assets, such as coal mining drilling machines monitored with sensor data [27], although anomalies are difficult to obtain from a new asset, they may be obtained from related assets that have long been working.\n\nSeveral transfer anomaly detection methods have been proposed that learn the anomaly detector by using both anomalous and normal instances in the source domains [3, 7, 28, 17, 48]. These methods also use target normal instances for training. However, training after obtaining target instances can be problematic in some applications. For example, consider anomaly detection on Internet-of-Things (IoT) devices such as sensors, cameras, and cars. Since each device does not have sufficient computational resources, training is difficult to perform on these devices even when target domains containing normal instances appear. As another example, in cyber-security, a wide variety of IoT devices themselves need to be protected from cyber attacks [4]. However, it is difficult to protect all devices quickly with time-consuming training since many new devices (target domains) appear one after another.\n\nIn this paper, we propose a novel method to improve the anomaly detection performance on target domains by using anomalous and normal instances, or only normal instances, in source domains. The proposed method can infer the anomaly detectors for any domains given the sets of normal instances in the domains, without re-training. In addition, the proposed method can use target normal instances for training in a unified framework when they are available during training. With the proposed method, an anomaly score function, which outputs an anomaly score given an instance, is defined on the basis of the AE, which is widely used in recent anomaly detection methods [43, 56, 12, 1, 50].
Note that the proposed method can use other semi-supervised anomaly detection methods with learnable parameters instead of AEs, such as variational AEs [32] and energy-based models [55]. The parameters of the AE are shared across all domains to learn the common property of the domains. However, each domain has a specific property that cannot be explained only by the common property. To reflect these domain-specific properties in the anomaly score function efficiently, we introduce latent domain vectors, which are latent representations of domains. The latent domain vectors are used to condition the anomaly score function and control the property of the anomaly score function for each domain.\n\nThe latent domain vector of each domain is estimated from the set of normal instances in the domain by a neural network. This model enables us to infer the anomaly detectors for any domains given the sets of normal instances in the domains, without re-training or anomalies. To infer the latent domain vectors of different domains, the neural networks need to take sets of normal instances with different sizes as inputs. We realize this by using deep sets [54], which are neural networks that are permutation invariant to the order of instances in the sets. Specifically, we model the parameters for the posterior of the latent domain vector with deep sets.\n\nThe anomaly score function for each domain is trained so that the anomaly scores of normal instances become low, which is achieved by minimizing the reconstruction error of the normal instances. Anomaly labels are used so that the anomaly scores of anomalous instances become higher than those of normal instances, which is realized by adding a differentiable area under the curve (AUC) loss as a regularizer. This regularizer enables us to improve the performance even if the number of anomalies is small [29]. Since the latent domain vectors are estimated from data, the estimation is often uncertain. To handle this uncertainty appropriately, we take the expectation of the loss function w.r.t. the latent domain vector and regularize the posterior of the latent domain vector by a prior using the Kullback-Leibler (KL) divergence, which is used as our objective function for the domain. The parameters for the anomaly score function and the posteriors of the latent domain vectors are estimated simultaneously by minimizing the sum of the objective functions for all domains. Figure 1 illustrates the proposed method.\n\nFigure 1: The latent domain vector z_d is inferred from the set of normal instances in the domain (blue dotted line) so as to detect anomalies. The decision boundary for each domain (red line) is induced by the domain-specific anomaly score function, which is determined by z_d. The green dotted line denotes the decision boundary of the AE, which uses only target normal instances. The proposed method can detect test anomalies that the AE cannot, via data on the related domains.\n\n2 Related Work\n\nAnomaly detection, which is also called outlier detection or novelty detection, has been widely studied [8, 6]. Although many unsupervised or semi-supervised anomaly detection methods have been proposed, such as AE-based methods [43, 56, 12, 1], OSVM-based methods [44, 7, 40], and density-based methods [57, 51, 55], they cannot use anomaly labels. Supervised anomaly detection uses anomaly labels to improve the performance, which is also referred to as imbalanced classification [9, 21].
These methods assume that all instances are obtained from one domain and thus cannot perform well when there is a domain difference, which is the focus of this paper.\n\nTransfer learning or domain adaptation aims to solve a problem in a target domain using data in source domains [39]. Unsupervised approaches aim to adapt to the target domain by using labeled source data and target unlabeled data [26, 37, 5, 42, 36, 34]. Semi-supervised approaches use labeled target data as well [38, 13, 41]. Although these methods perform impressively, they usually do not assume class imbalance and thus are not appropriate for anomaly detection [47]. Although several transfer learning methods are designed to deal with class imbalance [2, 53, 18], they assume anomalous and normal instances in the target domains for training. The proposed method does not assume target anomalous instances, which is more practical since anomalies rarely occur.\n\nSome transfer anomaly detection methods do not use target anomalies to obtain anomaly detectors, as with the proposed method. The two-step approach is widely used in transfer anomaly detection [3, 7]. This approach first extracts discriminative features from data in the source domain with neural networks. After feature extraction, any semi-supervised algorithm like OSVM is applied to the target normal instances. Although this approach is effective to some extent, separating anomaly detection and feature (transfer) learning risks losing information to be transferred [7, 40]. In contrast, the proposed method learns them simultaneously in an end-to-end manner. A few methods use only normal instances in both the source and target domains [28, 17, 48]. One method uses an auxiliary dataset of outliers [22]. Another assumes target unlabeled data for training [47]. All these methods require training on each target domain, which can be problematic in some situations as described in Section 1.
The proposed method can instantly infer the anomaly detector for each domain given the set of normal instances in the domain, without re-training.\n\nOne-class data transfer learning (OTL) [11] can predict classifiers for new domains without re-training, although this ability is not mentioned by Chen and Liu [11]. OTL trains a regression model that predicts the parameters of an anomaly density from those of a normal density using labeled source data. OTL predicts a target classifier by predicting densities given target normal instances. However, OTL requires the anomaly density to be estimated for each source domain, which is quite difficult since the number of anomalies is small in anomaly detection tasks. In addition, OTL cannot use domains that contain only normal instances for training, although the proposed method can.\n\n3 Preliminary\n\nAEs are neural networks originally proposed for non-linear dimensionality reduction [25]. Due to their simplicity and effectiveness, AEs have become a fundamental component of recent semi-supervised anomaly detection [6, 43, 56, 12, 1]. Thus, we use AEs as a building block of the proposed method. Given instances X := {x_1, ..., x_N}, the AE is trained by minimizing the loss function L(θ_F, θ_G) := (1/N) Σ_{n=1}^{N} ‖x_n − G_θG(F_θF(x_n))‖², where F_θF is a neural network with parameters θ_F, called the encoder; G_θG is a neural network with parameters θ_G, called the decoder; ‖·‖ is the Euclidean norm; and ‖x − G_θG(F_θF(x))‖² is the reconstruction error of x. When the AE is trained with normal instances, the reconstruction errors of normal instances become low. In contrast, the reconstruction errors of instances dissimilar to the normal instances, i.e., anomalies, can be expected to become high since they are not learned.
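As a minimal illustration of this scoring rule (our own NumPy sketch, not the paper's implementation: a fixed linear projection stands in for the trained encoder F and decoder G, which are neural networks in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed linear stand-ins for the trained encoder/decoder; a real AE would
# learn these by minimizing the reconstruction error over normal instances.
W = rng.normal(size=(5, 2))        # 5-dim inputs, 2-dim bottleneck
W_pinv = np.linalg.pinv(W)         # least-squares 'decoder'

def anomaly_score(x):
    # Reconstruction error ||x - G(F(x))||^2 used as the anomaly score.
    x_hat = (x @ W) @ W_pinv       # G(F(x)): projection onto the modeled subspace
    return float(np.sum((x - x_hat) ** 2))

# An instance lying in the modeled subspace reconstructs almost perfectly;
# a generic instance off that subspace gets a visibly higher score.
x_normal = (rng.normal(size=5) @ W) @ W_pinv
x_anomalous = rng.normal(size=5)
print(anomaly_score(x_normal), anomaly_score(x_anomalous))
```

Because the model is fit to normal data only, the reconstruction error is low exactly where normal instances live, which is what makes it usable as an anomaly score.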
Thus, AEs can be used for anomaly detection with the reconstruction error as the anomaly score.\n\n4 Proposed Method\n\n4.1 Task\n\nLet X_d^+ := {x_dn^+}_{n=1}^{N_d^+} be a set of anomalous instances in the d-th domain, where x_dn^+ ∈ R^M is the M-dimensional feature vector of the n-th anomalous instance of the d-th domain, and N_d^+ is the number of anomalous instances in the d-th domain. Similarly, let X_d^- := {x_dn^-}_{n=1}^{N_d^-} be a set of normal instances in the d-th domain. We assume that N_d^+ ≪ N_d^- in each domain since anomalies rarely occur, and that the feature vector size M is the same in all domains, as in many existing studies [38, 20, 28, 42]. Suppose that we have both anomalous and normal instances in D_S source domains, {X_d^+ ∪ X_d^-}_{d=1}^{D_S}, and normal instances in D_T target domains, {X_d^-}_{d=D_S+1}^{D_S+D_T}. Note that the proposed method can also treat source domains that have only normal instances, although we assume that all source domains have anomalies for simplicity. Our goal is to obtain an appropriate domain-specific anomaly score function, which outputs an anomaly score given an instance, for each target domain.\n\n4.2 Domain-specific Anomaly Score Function\n\nWe define the domain-specific anomaly score function based on the reconstruction error of the AE. To represent the property of each domain efficiently, we assume that each domain has a K-dimensional latent continuous variable z_d ∈ R^K, which is called a latent domain vector in this paper. For the d-th domain, we define the anomaly score function conditioned on the latent domain vector z_d as follows:\n\ns_θ(x_dn | z_d) := ‖x_dn − G_θG(F_θF(x_dn, z_d))‖²,   (1)\n\nwhere the parameters θ := (θ_F, θ_G) are shared among all domains. Unlike the original reconstruction error, this score function depends on the latent domain vector z_d. By changing the value of z_d, the proposed method can flexibly control the property of the anomaly score function. Although we use the reconstruction error for simplicity, we can use other anomaly score functions with learnable parameters, such as autoregressive density models [19, 46] and flow-based density models [15, 31].\n\n4.3 Models for Latent Domain Vectors\n\nSince the latent domain vectors are unobserved, we have to estimate them from data. As a first step, we model the conditional probability of the latent domain vector given the set of normal instances as a multivariate Gaussian distribution with a diagonal covariance matrix:\n\nq_φ(z_d | X_d^-) := N(z_d | μ_φ(X_d^-), diag(σ_φ²(X_d^-))),   (2)\n\nwhere the mean μ_φ(X_d^-) ∈ R^K and the variance σ_φ²(X_d^-) ∈ R_+^K are modeled by neural networks with parameters φ that are shared among all domains, and diag(x) returns a diagonal matrix whose diagonal elements are x. In this model, the latent domain vector z_d depends on the set of normal instances X_d^-. With this modeling, we can infer the latent domain vectors of any domains when the sets of normal instances in the domains are given. Accordingly, we can obtain domain-specific anomaly score functions for the domains without re-training or anomalies.\n\nSince q_φ takes the set of normal instances X_d^- as an input, the neural networks for the parameters μ_φ(X_d^-) and ln σ_φ²(X_d^-) must be permutation invariant to the order of instances in the set. To achieve this, we use the recently proposed deep sets architecture [54], τ(X_d^-) = ρ(Σ_{n=1}^{N_d^-} η(x_dn^-)), where τ(X_d^-) represents one of μ_φ(X_d^-) and ln σ_φ²(X_d^-), and ρ and η are any neural networks, respectively.
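To make this concrete, here is a small NumPy sketch of Eqs. (1) and (2) under our own illustrative stand-ins (one-layer maps for η, ρ, F, and G; the dimensions and nonlinearities are assumptions, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(1)
M, K, H = 4, 2, 8                      # input dim, latent-domain dim, hidden dim

W_eta = rng.normal(size=(M, H))        # eta in the deep-sets encoder
W_rho = rng.normal(size=(H, 2 * K))    # rho: outputs [mu ; log sigma^2]
W_enc = rng.normal(size=(M + K, H))    # encoder F sees [x ; z_d]
W_dec = rng.normal(size=(H, M))        # decoder G

def domain_posterior(X):
    # q(z_d | X_d^-): permutation invariant because of the sum over the set.
    out = np.tanh(np.sum(np.tanh(X @ W_eta), axis=0)) @ W_rho
    return out[:K], np.exp(out[K:])    # mean, variance

def score(x, z):
    # s_theta(x | z_d) = ||x - G(F(x, z_d))||^2 (Eq. (1)), as a one-layer sketch.
    x_hat = np.tanh(np.concatenate([x, z]) @ W_enc) @ W_dec
    return float(np.sum((x - x_hat) ** 2))

X = rng.normal(size=(6, M))            # a set of six normal instances
mu1, var1 = domain_posterior(X)
mu2, var2 = domain_posterior(X[::-1])  # same set, reversed order
x = rng.normal(size=M)
print(np.allclose(mu1, mu2), score(x, mu1) != score(x, mu1 + 1.0))
```

Reordering the set leaves the inferred posterior unchanged, while moving z_d changes the score of the same instance; this is exactly the division of labor between Eqs. (1) and (2).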
This architecture is permutation invariant due to the summation. Although this architecture is quite simple, it can express any permutation invariant function and preserve all the properties of the set with suitable ρ and η [54]. Thus, we can capture the property of each domain well with this architecture.\n\n4.4 Objective Function\n\nWe define the objective function of the proposed method using the domain-specific anomaly score functions and latent domain vectors. First, the objective function for the d-th source domain conditioned on the latent domain vector z_d to be minimized is defined by\n\nL_d(θ | z_d) := (1 / N_d^-) Σ_{n=1}^{N_d^-} s_θ(x_dn^- | z_d) − (λ / (N_d^- N_d^+)) Σ_{n=1}^{N_d^-} Σ_{m=1}^{N_d^+} f(s_θ(x_dm^+ | z_d) − s_θ(x_dn^- | z_d)),   (3)\n\nwhere λ ≥ 0 is a hyperparameter and f is the sigmoid function, f(x) = 1 / (1 + exp(−x)). This form of the objective function was recently proposed by Iwata and Yamanaka [29] and showed better performance than existing methods, although they do not consider domain differences. The first term of Eq. (3) represents the anomaly scores of normal instances in the d-th domain. Since the anomaly scores of normal instances should be low, we minimize this term. The second term of Eq. (3) is a differentiable approximation of the AUC [52], which is effective for class-imbalanced data [23]. The anomaly scores of anomalous instances should be higher than those of normal instances, s_θ(x_dm^+ | z_d) > s_θ(x_dn^- | z_d) for any x_dm^+ ∈ X_d^+ and x_dn^- ∈ X_d^-. The AUC term encourages this since f(·) takes its maximal value of one when s_θ(x_dm^+ | z_d) ≫ s_θ(x_dn^- | z_d) and its minimal value of zero when s_θ(x_dm^+ | z_d) ≪ s_θ(x_dn^- | z_d).
When there are no anomalies or λ = 0, the second term of Eq. (3) becomes zero and the first term of Eq. (3) remains. Thus, it is a supervised extension of the AE described in Section 3.\n\nSince the latent domain vector z_d has uncertainty with the variance σ_φ², we want to take this into account appropriately in the objective function. To achieve this, we define the objective function for the d-th source domain to be minimized as follows:\n\nL_d(θ, φ) := E_{q_φ(z_d | X_d^-)}[L_d(θ | z_d)] + β D_KL(q_φ(z_d | X_d^-) ‖ p(z_d)),   (4)\n\nwhere D_KL(q_φ(z_d | X_d^-) ‖ p(z_d)) is the KL divergence between q_φ(z_d | X_d^-) and a standard Gaussian distribution p(z_d) := N(0, I), and β > 0 is a hyperparameter. The first term of Eq. (4) is the expectation of the objective function (3) w.r.t. q_φ(z_d | X_d^-). Since the expectation considers all the probabilities of z_d, it can lead to robust training. This expectation term can be effectively approximated by the reparametrization trick [32]. That is, E_{q_φ(z_d | X_d^-)}[L_d(θ | z_d)] ≈ (1/L) Σ_{ℓ=1}^{L} L_d(θ | z_d^(ℓ)), where z_d^(ℓ) = μ_φ(X_d^-) + ε_d^(ℓ) ⊙ σ_φ(X_d^-), ε_d^(ℓ) ∼ N(0, I), and ⊙ is the element-wise product. The second term of Eq. (4) is a regularization term that prevents over-fitting of the latent domain vectors, where its strength is controlled by β. This type of regularization is common in variational AEs [32, 24] and can be calculated analytically [32]. The q_φ(z_d | X_d^-) is trained so as to minimize the loss term (the first term of Eq. (4)) while being constrained to the prior p(z_d).
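A sketch of this per-domain objective (Eqs. (3) and (4)) with toy stand-ins; the Monte-Carlo sample count, the toy score, and the toy data below are our assumptions, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def kl_to_standard_normal(mu, var):
    # Closed-form KL( N(mu, diag(var)) || N(0, I) ).
    return 0.5 * np.sum(var + mu ** 2 - 1.0 - np.log(var))

def domain_objective(score, X_neg, X_pos, mu, var, lam=1.0, beta=1.0, L=10):
    # E_q[ L_d(theta | z_d) ] + beta * KL (Eq. 4), where L_d is the mean
    # normal score minus lam times the soft AUC (Eq. 3).  The expectation is
    # approximated with the reparametrization trick z = mu + eps * sigma.
    total = 0.0
    for _ in range(L):
        z = mu + rng.normal(size=mu.shape) * np.sqrt(var)
        s_neg = np.array([score(x, z) for x in X_neg])
        s_pos = np.array([score(x, z) for x in X_pos])
        soft_auc = np.mean(sigmoid(s_pos[:, None] - s_neg[None, :]))
        total += np.mean(s_neg) - lam * soft_auc
    return total / L + beta * kl_to_standard_normal(mu, var)

# Toy score: squared distance to the domain vector.
toy_score = lambda x, z: float(np.sum((x - z) ** 2))
X_neg = rng.normal(size=(8, 2))        # normal instances
X_pos = rng.normal(size=(3, 2)) + 5.0  # anomalies, far from the normals
val = domain_objective(toy_score, X_neg, X_pos, mu=np.zeros(2), var=np.ones(2))
print(val)
```

Setting lam = 0 drops the soft-AUC term, which recovers the target-domain objective of Eq. (5).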
The effectiveness of considering the uncertainty will be demonstrated in our experiments.\n\nThe objective function for the d-th target domain to be minimized is obtained by omitting the AUC loss term from Eq. (4), since the target domain does not have anomalous instances. That is,\n\nL_d(θ, φ) := E_{q_φ(z_d | X_d^-)}[(1 / N_d^-) Σ_{n=1}^{N_d^-} s_θ(x_dn^- | z_d)] + β D_KL(q_φ(z_d | X_d^-) ‖ p(z_d)),   (5)\n\nwhere the first term can also be approximated using the reparametrization trick. As a result, the objective function for the proposed method is the following weighted sum of the objective functions for each domain: L(θ, φ) := Σ_{d=1}^{D_S+D_T} α_d L_d(θ, φ), where α_d ≥ 0 is a hyperparameter. This objective function can be minimized w.r.t. θ and φ by gradient-based optimization methods. This formulation includes various settings. For example, when no target instances can be used in the training phase, we set α_d = 0 for d = D_S + 1, ..., D_S + D_T and α_d = 1 for d = 1, ..., D_S.
The proposed method can infer the anomaly score functions for domains that are not used for training, given the sets of normal instances in the domains, without re-training, as described below.\n\n4.5 Inference\n\nBy using the learned parameters (θ*, φ*) and the normal instances X_d'^-, the proposed method infers the domain-specific anomaly score function for the d'-th domain as follows:\n\ns(x_d') := ∫ s_θ*(x_d' | z_d') q_φ*(z_d' | X_d'^-) dz_d' ≈ (1/L) Σ_{ℓ=1}^{L} s_θ*(x_d' | z_d'^(ℓ)),   (6)\n\nwhere z_d'^(ℓ) = μ_φ*(X_d'^-) + ε^(ℓ) ⊙ σ_φ*(X_d'^-), ε^(ℓ) ∼ N(0, I), and x_d' is any instance in the d'-th domain. The proposed method can infer s(·) considering the uncertainty of the latent domain vectors by sampling z_d' from q_φ*(z_d' | X_d'^-), which enables robust anomaly detection. The score function can be inferred even if X_d'^- was not used for training. The computational complexity of the inference is O(N_d'^- + L).\n\n5 Experiments\n\nWe demonstrate the effectiveness of the proposed method using one synthetic and four real-world class-imbalanced datasets. To measure anomaly detection ability on target domains, we evaluated the AUC, which is a widely used measure for anomaly detection tasks, on one domain while training on the rest. We used the following setup: the CPU was an Intel Xeon E5-2660v3 2.6 GHz, the memory size was 128 GB, and the GPU was an NVIDIA Tesla K80.\n\n5.1 Data\n\nWe created the simple two-dimensional dataset shown in Figure 2a. This dataset consists of eight double circles (domains) around (0, 0).
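One plausible way to generate such a dataset (our own sketch: the radii, counts, and noise level are assumptions, with the per-domain counts chosen so the anomaly rate is roughly the reported 0.048):

```python
import numpy as np

rng = np.random.default_rng(4)

def make_double_circle(center, n_normal=100, n_anom=5,
                       r_outer=1.0, r_inner=0.3, noise=0.05):
    # One domain: normal instances on an outer circle and anomalies on an
    # inner circle, both centered at `center` (illustrative parameters).
    def ring(n, r):
        theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
        pts = r * np.stack([np.cos(theta), np.sin(theta)], axis=1)
        return pts + center + rng.normal(scale=noise, size=(n, 2))
    return ring(n_normal, r_outer), ring(n_anom, r_inner)

# Eight domains placed on a circle around the origin, as in Figure 2a.
centers = [3.0 * np.array([np.cos(a), np.sin(a)])
           for a in np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)]
domains = [make_double_circle(c) for c in centers]
X_neg, X_pos = domains[7]              # the '7'-th domain as the target
print(len(domains), X_neg.shape, X_pos.shape)
```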
Each double circle has an outer and an inner circle that consist of normal and anomalous instances, respectively. We used the '7'-th domain as the target domain and the rest as the source domains. We used four real-world public datasets: MNIST-r, Anuran Calls, Landmine, and IoT. The MNIST-r is derived from the MNIST by rotating images, and was introduced by Ghifary et al. [20]. The MNIST-r has six domains. We selected the '4' digit as the anomalous class and the rest as the normal class since it was the most difficult setting in our preliminary experiment. The Anuran Calls is a real-world dataset collected from frog croaking sounds, which is used in the multi-task anomaly detection study [28]. Following that study [28], we regarded each species as a domain; thus, the Anuran Calls has five domains. The Landmine is a real-world dataset that is widely used in multi-task learning [49]. We used ten domains that consist of the first five (1-5) and last five (25-29) domains. The IoT contains real network traffic data gathered from nine IoT devices infected by BASHLITE malware. We did not use the device that had no normal data; thus, the IoT has eight domains. The average anomaly rates N_d^+ / (N_d^+ + N_d^-) of Synthetic, MNIST-r, Anuran Calls, Landmine, and IoT are 0.048, 0.1, 0.024, 0.062, and 0.05, respectively. Due to the length limit of the paper, the details of the datasets, including download links, are provided in the supplemental material.\n\n5.2 Comparison Methods\n\nWe evaluated two variants of the proposed method: ProT and ProS. ProT uses normal instances in the target domain as well as anomalous and normal source instances for training. ProS does not use target normal instances for training. After training with source domains, ProS infers the anomaly score function for the target domain using the set of target normal instances without re-training.
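The re-training-free inference that ProS performs (Eq. (6)) is just a Monte-Carlo average over the inferred latent domain vector; a sketch with toy stand-ins for the learned posterior and score (both are our assumptions, not the trained networks):

```python
import numpy as np

rng = np.random.default_rng(5)

def infer_score_fn(score, domain_posterior, X_neg, L=10):
    # Eq. (6) sketch: given only the normal instances of a new domain, sample
    # z ~ q(z | X_neg) and average the conditioned scores, with no re-training.
    mu, var = domain_posterior(X_neg)
    zs = mu + rng.normal(size=(L, mu.shape[0])) * np.sqrt(var)
    return lambda x: float(np.mean([score(x, z) for z in zs]))

# Toy stand-ins: the posterior mean is the empirical mean of the set, and the
# score is the squared distance to the domain vector.
toy_posterior = lambda X: (X.mean(axis=0), np.ones(X.shape[1]))
toy_score = lambda x, z: float(np.sum((x - z) ** 2))

X_neg = rng.normal(size=(20, 2)) + 3.0   # normal instances of a new domain
s = infer_score_fn(toy_score, toy_posterior, X_neg)
print(s(np.array([3.0, 3.0])) < s(np.array([-3.0, -3.0])))
```

The returned function scores instances near the new domain's normal cluster lower than far-away ones, even though nothing was re-trained for that domain.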
The proposed method was implemented with Chainer [45].\n\nWe compared the proposed method with eight methods: the feed-forward neural network classifier (NN), the NN for class-imbalanced data (NNAUC), the autoencoder-based classifier for class-imbalanced data (AEAUC) [29], the autoencoder (AE) [43], the one-class support vector machine (OSVM) [44], contrastive semantic alignment (CCSA) [38], the transfer one-class support vector machine (TOSVM) [3], and one-class data transfer learning (OTL) [11]. AE and OSVM are semi-supervised anomaly detection methods, which use only normal instances in the target domain for training. NN, NNAUC, and AEAUC are supervised anomaly detection methods, which use both anomalous and normal instances in the source domains for training. AEAUC is obtained from ProS by omitting the latent domain vectors. CCSA, TOSVM, and OTL are transfer learning or transfer anomaly detection methods, which use both anomalous and normal instances in the source domains\n\nFigure 2: (a) The synthetic dataset consists of eight double circles (domains). Anomalous and normal source instances are represented by orange and blue points, respectively. Anomalous and normal target instances are represented by red and green points, respectively. (b) Posteriors of the latent domain vectors estimated by ProS. Orange points denote the mean of each posterior. (c) Heatmap of anomaly scores on the target domain obtained by ProS.
Darker color indicates a higher anomaly score. (d) Average and standard error of AUCs when λ was changed.\n\nTable 1: Average and standard deviation of AUCs [%] on each target domain on MNIST-r.\n\nTarget  ProT       ProS       NN         NNAUC      AEAUC      AE         OSVM       CCSA       TOSVM      OTL\n0       93.9(0.7)  92.5(1.2)  88.7(1.2)  87.2(1.3)  88.5(2.8)  72.6(2.5)  56.7(1.2)  87.3(1.3)  89.4(4.8)  86.5(1.3)\n15      99.2(0.3)  99.2(0.3)  98.2(0.4)  98.7(0.5)  97.7(0.3)  72.6(2.5)  63.9(1.2)  97.7(0.4)  95.5(1.8)  95.2(0.8)\n30      98.8(0.4)  98.8(0.4)  98.2(0.3)  98.1(0.6)  97.6(0.4)  71.8(3.9)  63.5(1.3)  97.5(0.4)  94.6(2.4)  95.3(1.0)\n45      97.1(0.8)  95.3(0.8)  96.4(0.8)  94.6(1.6)  94.7(0.9)  73.6(3.6)  63.9(0.9)  96.5(1.0)  90.6(4.8)  91.9(1.4)\n60      99.2(0.4)  99.2(0.2)  98.9(0.5)  98.2(0.9)  98.7(0.3)  72.8(2.3)  63.6(1.0)  98.2(0.5)  96.2(1.8)  94.8(1.4)\n75      91.2(2.0)  91.1(2.9)  88.7(2.2)  87.7(2.3)  88.5(1.4)  72.9(2.9)  71.4(2.2)  92.3(0.8)  93.5(2.8)  78.4(3.0)\nAvg     96.6(3.2)  96.0(3.6)  94.8(4.6)  94.1(4.7)  94.3(4.9)  72.7(3.0)  63.8(4.5)  94.9(4.1)  93.3(4.1)  90.3(6.4)\n\nand normal instances in the target domain to obtain the anomaly detector. Although CCSA and TOSVM use target normal instances for training, as ProT does, OTL, like ProS, does not.\n\nWe selected hyperparameters using the average validation AUC on the source domains for all methods except AE; AE used the validation reconstruction error on the target domains. We evaluated the test AUC when the method obtained the best validation AUC after 15 epochs to avoid over-fitting. We conducted experiments on ten different datasets for each target domain and report the mean test AUC. The details of the experimental setup, such as the network architectures for the proposed method and the comparison methods and the hyperparameter candidates, are described in the supplemental material.\n\n5.3 Results
First, we show how the proposed method works by using the synthetic dataset. Each domain of the synthetic dataset is located on a circle, as shown in Figure 2a. Figure 2b shows an example of the posteriors of the latent domain vectors estimated by ProS with K = 2. We found that the estimated latent domain vectors were located on a circle so as to preserve the relationships between the domains. In particular, the posterior of the target domain (the '7'-th domain) was predicted to lie between the posteriors of the '0'-th and '6'-th domains, even though the target domain was not used for training (i.e., in the objective function). The parameters of q_φ(z_d | X_d^-) are inferred from an input X_d^- so as to find anomalies. Since the parameters of q_φ(z_d | X_d^-) are continuous w.r.t. the input X_d^-, the posteriors of the z_d's become similar if the X_d^-'s are similar. Thus, similar domains (e.g., the '3'-rd and '4'-th domains) were located near each other in the latent space. Figure 2c shows an example of the heatmap of anomaly scores on the target domain obtained by ProS. We found that ProS was able to give high (low) anomaly scores to the anomalous (normal) region of the target domain, i.e., the AUC is one even though the target domain was not used for training. Figure 2d shows the average and standard error of AUCs with different λ on the target domain. We used AEAUC as a baseline since it is obtained from ProS by omitting the latent domain vectors. ProT and ProS outperformed the others by large margins when the values of λ were relatively large. This result indicates the importance of both anomalies and modeling domain differences. Overall, these results demonstrate that the proposed method can capture the properties of domains as latent domain vectors and perform well.\n\nSecond, we evaluated the anomaly detection performance on the target domain using the real-world datasets.
Tables 1–4 show the average and standard deviation of AUCs with different target domains on all datasets.

Table 2: Average and standard deviation of AUCs [%] on each target domain on Anuran Calls.

Target  ProT       ProS       NN         NNAUC      AEAUC      AE         OSVM       CCSA       TOSVM      OTL
Ade     99.9(0.1)  95.5(4.8)  84.8(19.)  95.6(4.0)  64.0(17.)  84.5(12.)  96.6(0.9)  95.2(4.4)  77.9(20.)  90.9(4.7)
Ame     99.4(1.3)  93.5(3.6)  87.1(4.5)  94.1(5.5)  85.7(5.2)  81.6(8.1)  89.1(3.0)  96.2(2.6)  93.6(5.7)  93.8(2.9)
Hyl     99.7(0.3)  98.6(2.0)  99.6(0.9)  98.9(2.0)  99.4(1.0)  93.8(3.7)  89.9(1.7)  96.1(6.5)  99.6(0.4)  95.4(6.3)
HCi     99.9(0.2)  98.9(1.6)  98.7(2.1)  99.4(0.8)  98.2(3.1)  89.5(1.2)  93.6(0.6)  96.7(2.8)  99.3(1.3)  90.4(4.5)
HCo     99.9(0.1)  97.6(3.5)  97.1(4.7)  96.2(3.6)  96.5(3.4)  94.2(6.3)  92.7(1.7)  99.5(1.2)  97.4(1.7)  97.2(2.4)
Avg     99.8(0.6)  96.8(3.8)  93.4(11.)  96.9(4.0)  88.7(16.)  88.7(8.5)  92.4(3.2)  96.7(4.1)  93.6(12.)  93.5(4.9)

Table 3: Average and standard deviation of AUCs [%] on each target domain on Landmine.

Target  ProT       ProS       NN         NNAUC      AEAUC      AE         OSVM       CCSA       TOSVM      OTL
1       83.9(2.5)  83.8(2.0)  77.8(2.0)  78.8(3.3)  80.9(2.8)  50.0(5.7)  38.0(7.2)  78.2(3.4)  55.9(6.2)  61.0(6.2)
2       78.4(2.2)  78.5(2.3)  73.8(2.0)  72.6(3.0)  76.7(2.0)  61.6(3.8)  45.9(5.7)  72.9(2.0)  61.6(5.2)  64.5(2.2)
3       80.8(1.7)  80.9(1.7)  79.4(1.3)  80.5(1.2)  79.3(2.2)  58.6(3.8)  49.2(9.7)  76.1(2.6)  60.2(4.9)  61.0(2.9)
4       84.4(2.3)  84.1(1.7)  78.2(3.0)  78.0(4.5)  80.0(2.2)  47.0(7.6)  64.0(13.)  79.7(2.9)  49.5(12.)  59.2(3.3)
5       84.1(2.8)  84.3(2.9)  77.2(2.9)  81.4(2.5)  77.6(4.9)  51.4(7.6)  55.6(11.)  74.5(7.9)  44.3(6.4)  55.5(2.3)
25      61.8(2.4)  62.8(3.6)  55.6(4.4)  60.5(6.0)  54.2(3.0)  67.3(1.9)  56.7(1.0)  51.1(1.4)  57.9(2.2)  60.2(2.4)
26      63.6(1.7)  62.4(2.7)  63.6(2.0)  62.6(1.7)  61.9(2.3)  48.3(3.2)  63.0(2.4)  63.7(1.4)  61.9(4.2)  59.9(3.2)
27      60.2(2.4)  60.9(5.2)  59.7(3.2)  59.5(3.1)  59.3(2.7)  57.4(6.0)  68.3(2.6)  64.9(1.9)  61.1(2.3)  62.2(2.4)
28      71.6(4.7)  70.8(3.9)  63.7(5.1)  66.5(5.7)  67.4(3.7)  67.5(3.1)  72.5(1.8)  66.3(3.2)  72.1(3.3)  69.0(2.3)
29      55.5(3.1)  55.3(3.4)  54.9(2.4)  56.7(2.5)  55.9(1.9)  50.6(5.3)  59.6(2.9)  57.9(2.2)  55.8(4.2)  57.5(1.9)
Avg     72.4(11.)  72.4(11.)  68.4(9.9)  69.9(11.)  69.1(9.4)  56.0(8.8)  57.3(12.)  68.5(9.5)  58.0(9.0)  61.0(4.6)

In Tables 1–6, boldface denotes the best and comparable methods according to the paired t-test at the significance level of 5%. ProT showed the best or comparable AUCs in almost all target domains (25 of 29 cases), and ProS also showed the best or comparable AUCs in many target domains (21 of 29 cases). The supervised anomaly detection methods (NN, NNAUC, and AEAUC) tended to perform better than the semi-supervised methods (AE and OSVM) on all datasets, which suggests the effectiveness of using anomaly labels on the related domains. ProT and ProS outperformed these supervised methods. Moreover, ProT and ProS outperformed the transfer learning methods (CCSA, TOSVM, and OTL) by modeling the domain differences via the latent domain vectors. Comparing ProT and ProS, ProT showed better results than ProS on MNIST-r and Anuran Calls since ProT uses target normal instances for training.
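As a hedged illustration of the comparison rule above (a paired t-test at the 5% level over per-run test AUCs), the snippet below uses made-up AUC values for ten runs, not numbers from the tables:

```python
import math

# Hedged illustration: a method is marked "comparable" to the best when a
# paired t-test on their per-run test AUCs does not reject equality at the
# 5% significance level.

def paired_t_statistic(a, b):
    """t statistic of the paired t-test for the mean of a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # unbiased variance
    return mean / math.sqrt(var / n)

# Hypothetical test AUCs [%] of two methods over ten runs (paired by run).
auc_method_a = [96.8, 95.9, 97.1, 96.2, 96.5, 97.0, 95.8, 96.9, 96.4, 96.6]
auc_method_b = [94.1, 93.5, 94.8, 93.9, 94.4, 94.6, 93.2, 94.9, 94.0, 94.3]

T_CRIT = 2.262  # two-sided 5% critical value of Student's t with 9 dof
t = paired_t_statistic(auc_method_a, auc_method_b)
significant = abs(t) > T_CRIT  # True here: method b would not be bolded
```

With ten paired runs there are nine degrees of freedom, so |t| must exceed about 2.262 to reject at the 5% level; methods below that threshold count as comparable and are also shown in bold.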
Although ProS does not use target normal instances for training, it performed almost the same as ProT on Landmine and IoT, which indicates the effectiveness of inferring domain-specific anomaly detectors without re-training. Overall, these results show that the proposed method detects anomalies on the target domains better than the existing methods.
Third, we investigated the effect of considering the uncertainty of the latent domain vectors in the proposed method. To assess this, we consider deterministic variants of ProT and ProS, called D-ProT and D-ProS, respectively. The objective function for both methods is obtained by replacing Eq. (2) with the delta distribution qφ(zd|X−d) = δ(zd − µφ(X−d)) and omitting the KL divergence terms in Eqs. (4) and (5). Note that these methods also do not exist in previous studies, and thus, we can regard them as our proposal. Table 5 shows the average AUCs over all target domains of each dataset. ProT and ProS performed better than D-ProT and D-ProS on all datasets. These results show the effectiveness of considering the uncertainty in the proposed method.
Fourth, we investigated the effect of the number of anomalous training instances. Table 6 shows the average AUCs over target domains when the anomaly rate of each source domain, ra := N+d/(N+d + N−d), was changed equally on MNIST-r. As expected, as the number of anomalous training instances decreased, ProT, ProS, and AEAUC performed worse. However, even when the anomaly rate ra was small, ProT and ProS showed better results than AE. In addition, ProT and ProS outperformed AEAUC for all values of ra. This result suggests that the proposed method is relatively robust to the anomaly rate ra.
Last, we evaluated the computation time of the proposed method. We evaluated the training time of 100 epochs for ProT, ProS, and AEAUC on MNIST-r.
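The D-ProT/D-ProS ablation described above collapses the posterior to a delta distribution, which amounts to scoring with the posterior mean µφ(X−d) instead of averaging scores over reparameterized samples z = µ + σε. A minimal sketch, with a toy score function standing in for the paper's autoencoder-based score:

```python
import random

# Hedged sketch of the ablation (toy score function, not the paper's
# autoencoder): the stochastic model draws the latent domain vector via the
# reparameterization trick, z = mu + sigma * eps, and averages the anomaly
# score over L samples; the deterministic variant collapses
# q_phi(z_d | X^-_d) to delta(z_d - mu) and just plugs in the mean.
random.seed(0)

def toy_score(x, z):
    """Stand-in for the domain-conditioned anomaly score of an instance x."""
    return (x - z[0]) ** 2 + 0.1 * z[1] ** 2

def score_stochastic(x, mu, sigma, L=10):
    total = 0.0
    for _ in range(L):
        z = [m + s * random.gauss(0, 1) for m, s in zip(mu, sigma)]
        total += toy_score(x, z)
    return total / L

def score_deterministic(x, mu):
    return toy_score(x, mu)  # posterior collapsed to a delta at its mean

mu, sigma = [0.5, -0.2], [0.05, 0.05]
s_det = score_deterministic(2.0, mu)
s_sto = score_stochastic(2.0, mu, sigma, L=1000)
```

With a confident posterior (small σ) the two scores nearly coincide; as σ grows, the averaged score also reflects how uncertain the latent domain vector is, which is exactly the information the deterministic variants discard.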
We set the hyperparameters as follows: the regularization parameter of the AUC loss λ was 10^4, the dimension of the latent domain vector K was 20, the regularization parameter of the latent domain vector β was one, and the sample size of the reparametrization trick L was one. The computation times of 100-epoch training for ProT, ProS, and AEAUC were 10.58, 9.94, and 5.56 seconds, respectively. Although ProT and ProS took more computation time than AEAUC due to the additional network for the latent domain vectors, their computation costs were not so large. Since ProT uses target normal instances

Table 4: Average and standard deviation of AUCs [%] on each target domain on IoT.

Target  ProT       ProS       NN         NNAUC      AEAUC      AE         OSVM       CCSA       TOSVM      OTL
Dbell   99.6(0.1)  99.6(0.1)  99.4(0.4)  99.5(0.2)  99.5(0.1)  82.1(13.)  99.1(0.2)  99.3(0.4)  99.5(0.2)  99.2(0.2)
Therm   99.6(0.1)  99.6(0.1)  99.6(0.1)  99.6(0.1)  99.5(0.2)  90.8(11.)  98.6(0.2)  99.6(0.1)  99.6(0.1)  97.8(0.4)
Ebell   99.6(0.2)  99.6(0.1)  99.5(0.1)  99.6(0.1)  99.6(0.2)  92.8(6.6)  97.4(0.3)  99.5(0.2)  98.4(2.0)  97.5(0.4)
Baby    94.3(1.3)  94.7(1.0)  93.1(1.4)  93.5(1.3)  90.4(2.7)  46.8(3.3)  69.3(0.8)  94.1(1.4)  80.8(9.1)  92.3(2.3)
737     98.5(0.4)  98.3(0.6)  97.9(0.6)  98.1(0.5)  98.1(1.0)  68.9(11.)  89.0(1.9)  97.5(0.5)  97.4(2.0)  97.3(0.5)
838     99.1(0.2)  99.2(0.2)  99.0(0.3)  99.1(0.3)  99.1(0.2)  68.4(12.)  87.4(2.7)  99.1(0.3)  96.8(2.9)  97.9(0.4)
Web     99.1(0.2)  99.0(0.2)  99.2(0.1)  99.2(0.1)  99.1(0.2)  85.6(12.)  97.7(0.3)  99.1(0.1)  98.8(0.6)  97.7(0.2)
1002    97.8(0.3)  97.8(0.3)  97.5(0.5)  97.7(0.2)  97.6(0.6)  56.5(15.)  72.6(2.1)  97.7(0.2)  97.3(0.4)  94.8(2.8)
Avg     98.4(1.7)  98.5(1.6)  98.2(2.1)  98.3(2.1)  97.9(3.1)  74.0(19.)  88.9(11.)  98.2(1.8)  96.1(6.8)  96.8(2.4)

Table 5: The effect of considering the uncertainty. Average and standard
deviation of AUCs [%] over all target domains of each dataset.

Data          ProT       ProS       D-ProT     D-ProS
MNIST-r       96.6(3.2)  96.0(3.6)  95.9(4.0)  95.5(4.0)
Anuran Calls  99.8(0.6)  96.8(3.8)  86.5(2.2)  84.8(2.2)
Landmine      72.4(11.)  72.4(11.)  71.1(11.)  71.1(11.)
IoT           98.4(1.7)  98.5(1.6)  98.1(2.6)  97.9(2.8)

Table 6: Average and standard deviation of AUCs [%] over all target domains when changing the anomaly rate ra on MNIST-r.

ra      ProT       ProS       AEAUC      AE
0.1     96.6(3.2)  96.0(3.6)  94.3(4.9)  72.7(3.0)
0.05    94.5(5.0)  94.1(5.6)  91.3(7.3)  72.7(3.0)
0.01    89.7(7.1)  88.7(8.5)  86.4(8.2)  72.7(3.0)
0.005   87.0(8.4)  85.0(9.8)  81.9(10.)  72.7(3.0)

as well as source instances to learn the target-specific anomaly score function, ProT took a little more computation time than ProS. ProS infers the target-specific anomaly score function from the set of target normal instances without re-training. In this experiment, ProS inferred it in 0.0027 seconds when the sample size of the reparametrization trick L was ten. This inference time was 3918 times faster than the training time of ProT. Additional experimental results, such as the dependency on λ, K, and β, are described in the supplemental material.

6 Conclusion

In this paper, we proposed a method to improve the anomaly detection performance on target domains by inferring their latent domain vectors. The proposed method can infer the anomaly detectors for any domains given the sets of normal instances in those domains, without re-training or anomalies. In addition, the proposed method can also use target normal instances for training. The most attractive point of the proposed method is that it can infer domain-specific anomaly detectors in two situations, i.e., when target normal instances can or cannot be used for training, in a unified framework.
In experiments using one synthetic and four real-world datasets, the proposed method outperformed various existing anomaly detection methods. For future work, we will apply sophisticated density models such as autoregressive models and flow-based models as the anomaly score function.