{"title": "Houdini: Fooling Deep Structured Visual and Speech Recognition Models with Adversarial Examples", "book": "Advances in Neural Information Processing Systems", "page_first": 6977, "page_last": 6987, "abstract": "Generating adversarial examples is a critical step for evaluating and improving the robustness of learning machines. So far, most existing methods only work for classification and are not designed to alter the true performance measure of the problem at hand. We introduce a novel flexible approach named Houdini for generating adversarial examples specifically tailored for the final performance measure of the task considered, be it combinatorial and non-decomposable. We successfully apply Houdini to a range of applications such as speech recognition, pose estimation and semantic segmentation. In all cases, the attacks based on Houdini achieve higher success rate than those based on the traditional surrogates used to train the models while using a less perceptible adversarial perturbation.", "full_text": "Houdini: Fooling Deep Structured Visual and Speech\n\nRecognition Models with Adversarial Examples\n\nMoustapha Cisse\n\nFacebook AI Research\n\nmoustaphacisse@fb.com\n\nNatalia Neverova*\nFacebook AI Research\nnneverova@fb.com\n\nAbstract\n\nYossi Adi*\n\nBar-Ilan University, Israel\nyossiadidrum@gmail.com\n\nJoseph Keshet\n\nBar-Ilan University, Israel\njkeshet@cs.biu.ac.il\n\nGenerating adversarial examples is a critical step for evaluating and improving\nthe robustness of learning machines. So far, most existing methods only work\nfor classi\ufb01cation and are not designed to alter the true performance measure of\nthe problem at hand. We introduce a novel \ufb02exible approach named Houdini for\ngenerating adversarial examples speci\ufb01cally tailored for the \ufb01nal performance\nmeasure of the task considered, be it combinatorial and non-decomposable. 
We\nsuccessfully apply Houdini to a range of applications such as speech recognition,\npose estimation and semantic segmentation. In all cases, the attacks based on\nHoudini achieve higher success rate than those based on the traditional surrogates\nused to train the models while using a less perceptible adversarial perturbation.\n\n1\n\nIntroduction\n\nDeep learning has rede\ufb01ned the landscape of machine intelligence [22] by enabling several break-\nthroughs in notoriously dif\ufb01cult problems such as image classi\ufb01cation [20, 16], speech recognition [2],\nhuman pose estimation [35] and machine translation [4]. As the most successful models are perme-\nating nearly all the segments of the technology industry from self-driving cars to automated dialog\nagents, it becomes critical to revisit the evaluation protocol of deep learning models and design new\nways to assess their reliability beyond the traditional metrics. Evaluating the robustness of neural\nnetworks to adversarial examples is one step in that direction [32]. Adversarial examples are synthetic\npatterns carefully crafted by adding a peculiar noise to legitimate examples. They are indistinguish-\nable from the legitimate examples by a human, yet they have demonstrated a strong ability to cause\ncatastrophic failure of state of the art classi\ufb01cation systems [12, 25, 21]. The existence of adversarial\nexamples highlights a potential threat for machine learning systems at large [28] that can limit their\nadoption in security sensitive applications. It has triggered an active line of research concerned with\nunderstanding the phenomenon [10, 11], and making neural networks more robust [29, 7] .\nAdversarial examples are crucial for reliably evaluating and improving the robustness of the mod-\nels [12]. Ideally, they must be generated to alter the task loss unique to the application considered\ndirectly. 
For instance, an adversarial example crafted to attack a speech recognition system should be designed to maximize the word error rate of the targeted system. The existing methods for generating adversarial examples exploit the gradient of a given differentiable loss function to guide the search in the neighborhood of legitimate examples [12, 25]. Unfortunately, the task loss of several structured prediction problems of interest is a combinatorial non-decomposable quantity that is not amenable to gradient-based methods for generating adversarial examples. For example, the metric for evaluating human pose estimation is the (normalized) percentage of correct keypoints. Automatic speech recognition systems are assessed using their word (or phoneme) error rate. Similarly, the quality of a semantic segmentation is measured by the intersection over union (IOU) between the ground truth and the prediction. All these evaluation measures are non-differentiable.\n\n*equal contribution\n\nFigure 1: We cause the network to generate a minion as segmentation for the adversarially perturbed version of the original image. Note that the original and the perturbed image are indistinguishable.\n\nThe solutions to this obstacle in supervised learning are of two kinds. The first route is to use a consistent differentiable surrogate loss function in place of the task loss [5], that is, a surrogate which is guaranteed to converge to the task loss asymptotically. The second option is to directly optimize the task loss by using approaches such as Direct Loss Minimization [14]. Both of these strategies have severe limitations. (1) The use of differentiable surrogates is satisfactory for classification because the relationship between such surrogates and the classification accuracy is well established [34]. The picture is different for the above-mentioned structured prediction tasks.
Indeed, there is no known\nconsistency guarantee for the surrogates traditionally used in these problems (e.g. the connectionist\ntemporal classi\ufb01cation loss for speech recognition) and designing a new surrogate is nontrivial and\nproblem dependent. At best, one can only expect a high positive correlation between the proxy and\nthe task loss. (2) The direct minimization approaches are more computationally involved because\nthey require solving a computationally expensive loss augmented inference for each parameter update.\nAlso, they are notoriously sensitive to the choice of the hyperparameters. Consequently, it is harder\nto generate adversarial examples for structured prediction problems as it requires signi\ufb01cant domain\nexpertise with little guarantee of success when surrogate does not tightly approximate the task loss.\n\nResults.\nIn this work we introduce Houdini, the \ufb01rst approach for fooling any gradient-based\nlearning machine by generating adversarial examples directly tailored for the task loss of interest be\nit combinatorial or non-differentiable. We show the tight relationship between Houdini and the task\nloss of the problem considered. We present the \ufb01rst successful attack on a deep Automatic Speech\nRecognition (ASR) system, namely a DeepSpeech-2 based architecture [1], by generating adversarial\naudio \ufb01les not distinguishable from legitimate ones by a human (as validated by an ABX experiment).\nWe also demonstrate the transferability of adversarial examples in speech recognition by fooling\nGoogle VoiceTM in a black box attack scenario: an adversarial example generated with our model\nand not distinguishable from the legitimate one by a human leads to an invalid transcription by the\nGoogle Voice application (see Figure 8). We also present the \ufb01rst successful untargeted and targetted\nattacks on a deep model for human pose estimation [26]. 
Similarly, we validate the feasibility of untargeted and targeted attacks on a semantic segmentation system [38] and show that we can make the system hallucinate an arbitrary segmentation of our choice for a given image. Figure 1 shows an experiment where we cause the network to hallucinate a minion. In all cases, our approach generates better quality adversarial examples than each of the different surrogates (expressly designed for the model considered) without additional computational overhead thanks to the analytical gradient of Houdini.\n\n2 Related Work\n\nAdversarial examples. The empirical study of Szegedy et al. [32] first demonstrated that deep neural networks could achieve high accuracy on previously unseen examples while being vulnerable to small adversarial perturbations. This finding has recently aroused keen interest in the community [12, 28, 32, 33]. Several studies have subsequently analyzed the phenomenon [10, 31, 11] and various approaches have been proposed to improve the robustness of neural networks [29, 7]. More closely related to our work are the different proposals aiming at generating better adversarial examples [12, 25]. Given an input (train or test) example (x, y), an adversarial example is a perturbed version of the original pattern x̃ = x + δx, where δx is small enough for x̃ to be indistinguishable from x by a human but causes the network to predict an incorrect target. Given the network gθ (where θ is the set of parameters) and a p-norm, the adversarial example is formally defined as:\n\nx̃ = argmax_{x̃ : ‖x̃ − x‖p ≤ ε} ℓ(gθ(x̃), y)    (1)\n\nwhere ε represents the strength of the adversary. Assuming the loss function ℓ(·) is differentiable, Shaham et al. [31] propose to take the first-order Taylor expansion of x ↦ ℓ(gθ(x), y) to compute δx by solving the following simpler problem:\n\nx̃ = argmax_{x̃ : ‖x̃ − x‖p ≤ ε} (∇x ℓ(gθ(x), y))ᵀ (x̃ − x)    (2)\n\nWhen p = ∞, then x̃ = x + ε · sign(∇x ℓ(gθ(x), y)), which corresponds to the fast gradient sign method [12]. If instead p = 2, we obtain x̃ = x + ε · ∇x ℓ(gθ(x), y), where ∇x ℓ(gθ(x), y) is often normalized. Optionally, one can perform more iterations of these steps using a smaller norm. This more involved strategy has several variants [25]. These methods are concerned with generating adversarial examples assuming a differentiable loss function ℓ(·). Therefore they are not directly applicable to the task losses of interest. However, they can be used in combination with our proposal, which derives a consistent approximation of the task loss having an analytical gradient.\n\nTask Loss Minimization. Recently, several works have focused on directly minimizing the task loss. In particular, McAllester et al. [24] presented a theorem stating that a certain perceptron-like learning rule, involving feature vectors derived from loss-augmented inference, directly corresponds to the gradient of the task loss. While this algorithm performs well in practice, it is extremely sensitive to the choice of its hyper-parameter and needs two inference operations per training iteration. Do et al. [9] generalized the notion of the ramp loss from binary classification to structured prediction and proposed a tighter bound to the task loss than the structured hinge loss.
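As a concrete illustration of the update rules above, here is a minimal NumPy sketch of the fast gradient sign method. The linear scorer and logistic loss below are toy assumptions standing in for a generic differentiable loss ℓ(gθ(x), y); the perturbation is bounded in the ℓ∞ norm by ε.

```python
import numpy as np

def fgsm(x, grad_x, eps):
    """Fast gradient sign method (l-infinity case): x_adv = x + eps * sign(grad)."""
    return x + eps * np.sign(grad_x)

# Toy differentiable loss (an illustrative assumption, not the paper's model):
# logistic loss of a linear scorer w.x with a label y in {-1, +1}.
def logistic_loss(w, x, y):
    return np.log1p(np.exp(-y * np.dot(w, x)))

def logistic_loss_grad_x(w, x, y):
    # d/dx log(1 + exp(-y * w.x)) = -y * w * sigmoid(-y * w.x)
    s = np.dot(w, x)
    return -y * w / (1.0 + np.exp(y * s))

w = np.array([1.0, -2.0, 0.5])
x = np.array([0.2, 0.1, -0.3])
y = 1.0
x_adv = fgsm(x, logistic_loss_grad_x(w, x, y), eps=0.1)
```

For this smooth convex toy loss, the one-step ascent is guaranteed to increase the loss while keeping the perturbation within the ε-ball.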
The update rule of the structured ramp loss is similar to the update rule of the direct loss minimization algorithm, and similarly it needs two inference operations per training iteration. Keshet et al. [19] generalized the notion of the binary probit loss to the structured prediction case. The probit loss is a surrogate loss function that arises naturally in PAC-Bayesian generalization theorems. It is defined as follows:\n\nℓ̄probit(gθ(x), y) = E_{ε∼N(0,I)} [ℓ(y, gθ+ε(x))]    (3)\n\nwhere ε ∈ Rᵈ is a d-dimensional isotropic Normal random vector. [18] stated finite sample generalization bounds for the structured probit loss and showed that it is strongly consistent. Strong consistency is a critical property of a surrogate since it guarantees the tight relationship to the task loss. For instance, an attacker of a given system can expect to deteriorate the task loss if she deteriorates the consistent surrogate of it. The gradient of the structured probit loss can be approximated by averaging over samples from the unit-variance isotropic normal distribution, where for each sample an inference with perturbed parameters is computed. Hundreds to thousands of inference operations are required per iteration to gain stability in the gradient computation. Hence the update rule is computationally prohibitive and limits the applicability of the structured probit loss despite its desirable properties.\nWe propose a new loss named Houdini. It shares the desirable properties of the structured probit loss while not suffering from its limitations. Like the structured probit loss and unlike most surrogates used in structured prediction (e.g. the structured hinge loss for SVMs), it is tightly related to the task loss. Therefore it allows us to reliably generate adversarial examples for a given task loss of interest.
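To make the cost of the probit loss concrete, here is a toy Monte Carlo sketch of Eq. (3). The 3-class linear scorer and 0/1 task loss are illustrative assumptions; note that each Monte Carlo sample requires one full inference with perturbed parameters, which is exactly what makes the estimator expensive.

```python
import numpy as np

# Illustrative setting (all names here are assumptions): a linear scorer over
# 3 classes and the 0/1 task loss, with parameters perturbed by Gaussian noise.
def score(theta, x, label):
    return float(theta.reshape(3, -1)[label] @ x)

def zero_one(y, y_hat):
    return float(y != y_hat)

def probit_loss_mc(theta, x, y, n_samples=1000, seed=0):
    """Monte Carlo estimate of E_{eps ~ N(0,I)} [ loss(y, decode(theta + eps, x)) ].
    Each sample triggers one inference (argmax decoding) with perturbed parameters."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_samples):
        eps = rng.standard_normal(theta.shape)
        pred = max(range(3), key=lambda lbl: score(theta + eps, x, lbl))
        total += zero_one(y, pred)
    return total / n_samples

theta = np.array([2.0, 0.0, -1.0, 0.0, 0.0, 2.0])  # 3 classes x 2 features
x = np.array([1.0, 0.0])
est = probit_loss_mc(theta, x, y=0)  # small: the unperturbed model is confident
```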
Unlike the structured probit loss and like the smooth surrogates, it has an analytical gradient, hence it requires only a single inference in its update rule. The next section presents the details of our proposal.\n\n3 Houdini\n\nLet us consider a neural network gθ parameterized by θ and the task loss of a given problem ℓ(·). We assume ℓ(y, y) = 0 for any target y. The score output by the network for an example (x, y) is gθ(x, y) and the network's decoder predicts the highest scoring target:\n\nŷ = yθ(x) = argmax_{y∈Y} gθ(x, y)    (4)\n\nUsing the terminology of section 2, finding an adversarial example fooling the model gθ with respect to the task loss ℓ(·) for a chosen p-norm and noise parameter ε boils down to solving:\n\nx̃ = argmax_{x̃ : ‖x̃ − x‖p ≤ ε} ℓ(yθ(x̃), y)    (5)\n\nThe task loss is often a combinatorial quantity which is hard to optimize, hence it is replaced with a differentiable surrogate loss, denoted ℓ̄(yθ(x̃), y). Different algorithms use different surrogate loss functions: structural SVM uses the structured hinge loss, conditional random fields use the log loss, etc. We propose a surrogate named Houdini, defined as follows for a given example (x, y):\n\nℓ̄H(θ, x, y) = P_{γ∼N(0,1)}[gθ(x, y) − gθ(x, ŷ) < γ] · ℓ(ŷ, y)    (6)\n\nIn words, Houdini is a product of two terms. The first term is a stochastic margin, that is, the probability that the difference between the score of the actual target gθ(x, y) and that of the predicted target gθ(x, ŷ) is smaller than γ ∼ N(0, 1).
It reflects the confidence of the model in its predictions. The second term is the task loss, which given two targets is independent of the model and corresponds to what we are ultimately interested in maximizing. Houdini is a lower bound of the task loss. Indeed, denoting δg(y, ŷ) = gθ(x, y) − gθ(x, ŷ) the difference between the scores assigned by the network to the ground truth and to the prediction, the probability P_{γ∼N(0,1)}(δg(y, ŷ) < γ) is smaller than 1. Hence when this probability goes to 1, or equivalently when the score assigned by the network to the target ŷ grows without bound, Houdini converges to the task loss. This is a unique property not enjoyed by most surrogates used in the applications of interest in our work. It ensures that Houdini is a good proxy of the task loss for generating adversarial examples.\nWe can now use Houdini in place of the task loss ℓ(·) in problem (5). Following the approach of section 2, we resort to a first-order approximation, which requires the gradient of Houdini with respect to the input x. The latter is obtained by the chain rule:\n\n∇x [ℓ̄H(θ, x, y)] = (∂ℓ̄H(θ, x, y) / ∂gθ(x, y)) · (∂gθ(x, y) / ∂x)    (7)\n\nTo compute the RHS of the above quantity, we only need to compute the derivative of Houdini with respect to its input (the output of the network). The rest is obtained by backpropagation.
The derivative of the loss with respect to the network's output is:\n\n∇g [ℓ̄H(θ, x, y)] = ∇g P_{γ∼N(0,1)}[gθ(x, y) − gθ(x, ŷ) < γ] · ℓ(y, ŷ) = ∇g [ (1/√(2π)) ∫_{δg(y,ŷ)}^{∞} e^{−v²/2} dv ] · ℓ(y, ŷ)    (8)\n\nTherefore, expanding the right hand side and denoting C = 1/√(2π), we have:\n\n∇g [ℓ̄H(ŷ, y)] = −C · e^{−|δg(y,ŷ)|²/2} · ℓ(y, ŷ)  if g = gθ(x, y);   C · e^{−|δg(y,ŷ)|²/2} · ℓ(y, ŷ)  if g = gθ(x, ŷ);   0  otherwise    (9)\n\nEquation 9 provides a simple analytical formula for computing the gradient of Houdini with respect to its input, hence an efficient way to obtain the gradient with respect to the input of the network x by backpropagation. The gradient can be used in combination with any gradient-based adversarial example generation procedure [12, 25] in two ways, depending on the form of attack considered. For an untargeted attack, we want to change the prediction of the network without preference on the final prediction. In that case, any alternative target y can be used (e.g. the second highest scorer as the target). For a targeted attack, we set y to be the desired final prediction. Also note that when the score of the predicted target is very close to that of the ground truth (or desired target), that is when δg(y, ŷ) is small, as we expect from the trained network we want to fool, we have e^{−|δg(y,ŷ)|²/2} ≈ 1.
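Eq. (6) and its gradient from Eq. (9) can be sketched and checked numerically in a few lines. The scalar scores and task-loss value below are toy assumptions; the Gaussian CDF is written via `math.erf`.

```python
import math

def houdini(score_y, score_yhat, task_loss):
    """Houdini surrogate, Eq. (6): P_{gamma ~ N(0,1)}[score_y - score_yhat < gamma] * task_loss."""
    delta = score_y - score_yhat
    # P(delta < gamma) = 1 - Phi(delta), with Phi the standard normal CDF
    p = 1.0 - 0.5 * (1.0 + math.erf(delta / math.sqrt(2.0)))
    return p * task_loss

def houdini_grad(score_y, score_yhat, task_loss):
    """Analytical gradient wrt the two scores, Eq. (9): -/+ C * exp(-delta^2/2) * loss."""
    delta = score_y - score_yhat
    c = math.exp(-delta * delta / 2.0) / math.sqrt(2.0 * math.pi)
    return -c * task_loss, c * task_loss
```

The surrogate is bounded by the task loss (the probability factor is at most 1), and the analytical gradient matches a central-difference check.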
In the next sections, we show the effectiveness of the proposed attack scheme on human pose estimation, semantic segmentation and automatic speech recognition systems.\n\n4 Human Pose Estimation\n\nWe evaluate the effectiveness of the Houdini loss in the context of adversarial attacks on neural models for human pose estimation. Compromising the performance of such systems can be desirable for manipulating surveillance cameras, altering the analysis of crime scenes, disrupting human-robot interaction or fooling biometric authentication systems based on gait recognition.\n\nFigure 2: Convergence dynamics for pose estimation attacks: (a) perturbation perceptibility vs nb. iterations, (b) PCKh0.5 vs nb. iterations, (c) proportion of re-positioned joints vs perceptibility.\n\nThe pose estimation task is formulated as follows: given a single RGB image of a person, determine the correct 2D positions of several pre-defined keypoints which typically correspond to skeletal joints. In practice, the performance is measured by the percentage of correctly detected keypoints (PCKh), i.e. those whose predicted locations are within a certain distance from the corresponding target positions [3]:\n\nPCKhα = (1/N) Σ_{i=1}^{N} 1(‖yi − ŷi‖ < αh)    (10)\n\nwhere ŷ and y are the predicted and the desired positions of a given joint respectively, h is the head size of the person (known at test time), α is a threshold (set to 0.5), and N is the number of annotated keypoints. Pose estimation is a good example of a problem where we observe a discrepancy between the training objective and the final evaluation measure. Instead of directly minimizing the percentage of correctly detected keypoints, state-of-the-art methods rely upon a dense prediction of heatmaps, i.e. estimation of the probability of every pixel corresponding to each of the keypoint locations.
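The PCKh metric of Eq. (10) can be sketched in a few lines; the keypoint arrays below are toy assumptions.

```python
import numpy as np

def pckh(pred, target, head_size, alpha=0.5):
    """PCKh: fraction of keypoints predicted within alpha * head_size of the target.
    pred, target: (N, 2) arrays of 2D keypoint coordinates."""
    dists = np.linalg.norm(np.asarray(pred) - np.asarray(target), axis=1)
    return float(np.mean(dists < alpha * head_size))

pred = np.array([[1.0, 0.0], [20.0, 10.0]])    # toy predictions
target = np.array([[0.0, 0.0], [10.0, 10.0]])  # toy ground truth
score = pckh(pred, target, head_size=4.0)      # threshold = 2.0 pixels
```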
These models can be trained with binary cross entropy [6], softmax [15] or MSE losses [26] applied to every pixel in the output space, separately for each plane corresponding to every keypoint. In our first experiment, we attack a state-of-the-art model for single person pose estimation based on Hourglass networks [26] and aim to minimize the value of the PCKh0.5 metric given the minimal perturbation. For this task we choose ŷ as:\n\nŷ = argmax_{ỹ : ‖p̃ − p‖ > αh} gθ(x, ỹ)    (11)\n\nwhere p is the pixel coordinate on the heatmap corresponding to the argmax value of vector y. We perform the optimization iteratively till convergence with the update rule ε · ∇x/‖∇x‖, where ∇x are the gradients with respect to the input and ε = 0.1. We perform the evaluations on the validation subset of the MPII dataset [3] consisting of 3000 images and defined as in [26]. We evaluate the perceived degree of perturbation, where perceptibility is expressed as ((1/n) Σ_i (x′i − xi)²)^{1/2}, where x and x′ are the original and distorted images, and n is the number of pixels [32]. In addition, we report the structural similarity index (SSIM) [36] which is known to correlate well with the visual perception of image structure by a human. Figure 2 shows that Houdini only requires 100 iterations to maximally deteriorate the percentage of correct key-points from 89.4 down to 0.57, while MSE deteriorates the performance to only 24.12 after 100 iterations. This observation underlines the importance of the loss function used to generate adversarial examples in structured prediction problems. Also, for untargeted attacks optimized to convergence, the perturbation generated with Houdini is up to 50% less perceptible than the one obtained with MSE.\nIn the second experiment, we perform a targeted attack in the form of pose transfer, i.e.
we force the network to hallucinate an arbitrary pose (with success defined, as before, given the target metric PCKh0.5). The experimental setup is as follows: for a given pair of images (i, j), we force the network to output the ground truth pose of picture i when the input is image j and vice versa. This task is more challenging and depends on the similarity between the original and target poses. Surprisingly, targeted attacks are still feasible even when the two ground truth poses are very different. Figure 3 shows an example where the model predicts the pose of a human body in horizontal position for an adversarially perturbed image depicting a standing person (and vice versa). A similar experiment with two persons in standing and sitting positions respectively is also shown in Figure 3.\n\nFigure 3: Examples of successful targeted attacks on a pose estimation system. Despite the important difference between the images selected, it is possible to make the network predict the wrong pose by adding an imperceptible perturbation. The images are part of the MPII dataset.\n\nMethod | SSIM @mIoU/2 | SSIM @mIoUlim | Perceptibility @mIoU/2 | Perceptibility @mIoUlim\nuntargeted: NLL loss | 0.9989 | 0.9950 | 0.0037 | 0.0117\nuntargeted: Houdini loss | 0.9995 | 0.9959 | 0.0026 | 0.0095\ntargeted: NLL loss | 0.9972 | 0.9935 | 0.0074 | 0.0389\ntargeted: Houdini loss | 0.9975 | 0.9937 | 0.0054 | 0.0392\n\nTable 1: Comparison of targeted and untargeted adversarial attacks on segmentation systems. mIoU/2 denotes a 50% performance drop according to the target metric and mIoUlim corresponds to convergence or termination after 300 iterations.
SSIM: the higher, the better; perceptibility: the lower, the better. Houdini based attacks are less perceptible.\n\n5 Semantic segmentation\n\nSemantic segmentation uses another customized metric to evaluate performance, namely the mean Intersection over Union (mIoU) measure, defined by averaging over classes the IoU = TP/(TP + FP + FN), where TP, FP and FN stand for true positive, false positive and false negative respectively, taken separately for each class. Compared to per-pixel accuracy, which appears to be overoptimistic on highly unbalanced datasets, and per-class accuracy, which under-penalizes false alarms for non-background classes, this metric favors accurate object localization with tighter masks (in instance segmentation) or bounding boxes (in detection). The models are trained with per-pixel softmax or multi-class cross entropy losses depending on the task formulation, i.e. optimized for mean per-pixel or per-class accuracy instead of mIoU. Primary targets of adversarial attacks in this group of applications are self-driving cars and robots. Xie et al. [37] have previously explored adversarial attacks in the context of semantic segmentation. However, they exploited the same proxy used for training the network. We perform a series of experiments similar to the ones described in Sec. 4. That is, we show targeted and untargeted attacks on a semantic segmentation model. We use a pre-trained Dilation10 model for semantic segmentation [38] and evaluate the success of the attacks on the validation subset of the Cityscapes dataset [8]. In the first experiment, we directly alter the target mIoU metric for a given test image in both targeted and untargeted attacks. As shown in Table 1, Houdini allows fooling the model at least as well as the training proxy (NLL) while using a less perceptible perturbation.
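The mIoU metric described above can be sketched as follows; the tiny 1D label maps are toy assumptions standing in for full segmentation maps.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """mIoU: average over classes of TP / (TP + FP + FN)."""
    pred, target = np.asarray(pred), np.asarray(target)
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (target == c))
        fp = np.sum((pred == c) & (target != c))
        fn = np.sum((pred != c) & (target == c))
        if tp + fp + fn > 0:  # skip classes absent from both maps
            ious.append(tp / (tp + fp + fn))
    return float(np.mean(ious))

# Toy flattened "segmentation maps" (illustrative only).
miou = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```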
Indeed, Houdini based adversarial perturbations generated to alter the performance of the model by 50% are about 30% less noticeable than the noise created with NLL.\nThe second set of experiments consists of targeted attacks, that is, altering the input image to obtain an arbitrary target segmentation map as the network response. In Figure 4, we show an instance of such an attack in a segmentation transfer setting, i.e. the target segmentation is the ground truth segmentation of a different image. It is clear that this type of attack is still feasible with a small adversarial perturbation (even after zooming in on the picture).\n\nFigure 4: Targeted attack on a semantic segmentation system: switching target segmentation between two images from the Cityscapes dataset [8]. The last two columns are respectively zoomed-in parts of the perturbed image and the adversarial perturbation added to the original one.\n\nFigure 1 depicts a more challenging scenario where the target segmentation is an arbitrary map (e.g. a minion). Again, we can make the network hallucinate the segmentation of our choice by adding a barely noticeable perturbation.\n\n6 Speech Recognition\n\nWe evaluate the effectiveness of Houdini concerning adversarial attacks on an Automatic Speech Recognition (ASR) system. Traditionally, ASR systems are composed of different components (e.g. acoustic model, language model, pronunciation model, etc.) where each component is trained separately. Recently, ASR research has focused on deep learning based end-to-end models. These types of models get as input a speech segment and output a transcript with no additional post-processing.
In\nthis work, we use a deep neural network as our model with similar architecture to the one presented\nby [2]. The system is composed of two convolutional layers, followed by seven layers of Bidirectional\nLSTM [17] and one fully connected layer. We optimize the Connectionist Temporal Classi\ufb01cation\n(CTC) loss function [13], which was speci\ufb01cally designed for ASR systems. The model gets as\ninput raw spectrograms (extracted using a window size of 25ms, frame-size of 10ms and Hamming\nwindow), and outputs a transcript.\nA standard evaluating metric in speech recognition is the Word Error Rate (WER) or Character Error\nRate (CER). These metrics were derived from the Levenshtein Distance [23], which is the number\nof substitutions, deletions, and insertions divided by the target length. The model achieves 12%\nWord Error Rate and 1.5% Character Error Rate on the Librispeech dataset [27], with no additional\nlanguage modeling. In order to use Houdini for attacking an end-to-end ASR model, we need to\nget g\u03b8(x, y) and g\u03b8(x, \u02c6y), which are the scores for predicting y and \u02c6y respectively. Recall, in speech\nrecognition, the target y is a sequence of characters. Hence, the score g\u03b8(x, \u02c6y) is the sum of all\npossible paths to output \u02c6y. Fortunately, we can use the forward-backward algorithm [30], which is at\nthe core of the CTC, to get the score of a given label y.\n\nABX Experiment We generated 100 audio samples of adversarial examples and performed an\nABX test with about 100 humans. An ABX test is a standard way to assess the detectable differences\nbetween two choices of sensory stimuli. We present each human with two audio samples A and\nB. Each of these two samples is either the legitimate or an adversarial version of the same sound.\nThese two samples are followed by a third sound X randomly selected to be either A or B. Next,\nthe human must decide whether X is the same as A or B. 
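The WER and CER used above are normalized Levenshtein distances; a minimal sketch follows (whitespace tokenization for WER is an assumption).

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimal number of substitutions, deletions and insertions."""
    d = list(range(len(hyp) + 1))  # d[j] = distance between ref[:i] and hyp[:j]
    for i in range(1, len(ref) + 1):
        prev = d[0]
        d[0] = i
        for j in range(1, len(hyp) + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,                              # deletion
                       d[j - 1] + 1,                          # insertion
                       prev + (ref[i - 1] != hyp[j - 1]))     # substitution / match
            prev = cur
    return d[len(hyp)]

def wer(ref, hyp):
    """Word error rate: edit distance over word sequences, divided by reference length."""
    r, h = ref.split(), hyp.split()
    return edit_distance(r, h) / len(r)
```

Character error rate is obtained the same way by running `edit_distance` directly on the character strings.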
For every audio sample, we executed the ABX test with at least nine (9) different persons. Overall, only 53.73% of the adversarial examples could be distinguished from the original ones by the humans (the optimal ratio is 50%). Therefore the generated examples are not statistically significantly distinguishable by a human ear. Subsequently, we use such indistinguishable adversarial examples to test the robustness of ASR systems.\n\nLoss | ε = 0.3 (WER / CER) | ε = 0.2 (WER / CER) | ε = 0.1 (WER / CER) | ε = 0.05 (WER / CER)\nCTC | 68 / 9.3 | 51 / 6.9 | 29.8 / 4 | 20 / 2.5\nHoudini | 96.1 / 12 | 85.4 / 9.2 | 66.5 / 6.5 | 46.5 / 4.5\n\nFigure 5: CER and WER in (%) for adversarial examples generated by both CTC and Houdini.\n\n(a) a great saint saint francis zaviour\n\n(b) i great sinkt shink t frimsuss avir\n\nFigure 6: The model's output for each of the spectrograms is located at the bottom of each spectrogram. The target transcription is: A Great Saint Saint Francis Xavier.\n\nHoudini vs Probit Loss Houdini and the Probit loss [19] are tightly related. We also initially experimented with Probit but decided not to consider it further because: (1) Houdini is computationally more efficient. It requires only one forward-backward pass to generate adversarial examples while Probit needs several times more passes, as it must average many networks to reduce the variance of the gradients. (2) The "adversarial" examples generated with Probit are not adversarial in the sense that they are easily distinguishable from the original examples by a human. This is due to the noise (added to the parameters when computing the gradients with Probit) which seems to add white noise to the sound files.
We calculated the character error rates (CER) and the percentage of examples that could be distinguished from the original ones by a human (best is 50) for Houdini and Probit on the speech task. We used a perturbation of magnitude ε = 0.05 and sampled 20 models for Probit (therefore 20x more computationally expensive than Houdini). In our results, while Probit and Houdini respectively achieve a CER of 5.97 and 4.50, adversarial examples generated with Probit are perfectly distinguishable by a human (100%) in comparison to those generated with Houdini (53.73%).\n\nUntargeted Attacks In the first experiment, we compare network performance after attacking it with both Houdini and CTC. We generate two adversarial examples, one from each of the loss functions (CTC and Houdini), for every sample from the clean test set of Librispeech (2620 speech segments). We experimented with a set of different distortion levels, using the ℓ∞ norm and WER as ℓ. For all adversarial examples, we use ŷ = "Forty Two", which is the "Answer to the Ultimate Question of Life, the Universe, and Everything." Results are summarized in Figure 5. Notice that Houdini causes a bigger decrease regarding both CER and WER than CTC for all the distortion values we have tested. In particular, for a small adversarial perturbation (ε = 0.05) the word error rate (WER) caused by an attack with Houdini is 2.3x larger than the WER obtained with CTC. Similarly, the character error rate (CER) caused by a Houdini-based attack is 1.8x larger than a CTC-based one. Fig. 6 shows the original and adversarial spectrograms for a single speech segment: (a) shows a spectrogram of the original sound file and (b) shows the spectrogram of the adversarial one.
They are visually indistinguishable.

Targeted Attacks  We push the model towards predicting a different transcription iteratively. In this case, the input to the model at iteration i is the adversarial example from iteration i − 1. Corresponding transcription samples are shown in Figure 7. We notice that when setting ŷ to be phonetically far from y, the model tends to predict wrong transcriptions, but not necessarily ones similar to the selected target. However, when picking phonetically close targets, the model behaves as expected and predicts a transcription phonetically close to ŷ. Overall, targeted attacks seem to be much more challenging for speech recognition systems than for visual systems such as pose estimators or semantic segmentation systems.

Black-box Attacks  Lastly, we experimented with a black-box attack, that is, attacking a system for which we have access not to the model's gradients but only to its predictions. In Figure 8 we show a few examples in which we use the Google Voice application to predict the transcript for both original

Manual Transcription:   a great saint saint Francis Xavier
Adversarial Target:     a green thank saint frenzier
Adversarial Prediction: a green thanked saint fredstus savia

Manual Transcription:   no thanks I am glad to give you such easy happiness
Adversarial Target:     notty am right to leave you soggy happiness
Adversarial Prediction: no to ex i am right like aluse o yve have misser

Figure 7: Examples of iteratively generated adversarial examples for targeted attacks. In all cases, the model exactly predicts the ground-truth transcription of the original (unperturbed) example.
Targeted attacks are more difficult when the speech segments are phonetically very different.

Groundtruth Transcription:
The fact that a man can recite a poem does not show he remembers any previous occasion on which he has recited it or read it.

G-Voice transcription of the original example:
The fact that a man can decide a poem does not show he remembers any previous occasion on which he has work cited or read it.

G-Voice transcription of the adversarial example:
The fact that I can rest I'm just not sure that you heard there is any previous occasion I am at he has your side it or read it.

Groundtruth Transcription:
Her bearing was graceful and animated she led her son by the hand and before her walked two maids with wax lights and silver candlesticks.

G-Voice transcription of the original example:
The bearing was graceful an animated she let her son by the hand and before he walks two maids with wax lights and silver candlesticks.

G-Voice transcription of the adversarial example:
Mary was grateful then admitted she let her son before the walks to Mays would like slice furnace filter count six.

Figure 8: Transcriptions from the Google Voice application for original and adversarial speech segments.

and adversarial audio files. The original audio clips and their adversarial versions, generated with our DeepSpeech-2 based model, are not distinguishable by humans according to our ABX test. We play each audio clip in front of an Android-based mobile phone and report the transcription produced by the application. As can be seen, while Google Voice gets almost all the transcriptions correct for legitimate examples, it largely fails to produce good transcriptions for the adversarial examples.
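The iterative scheme used in the attacks above (each iteration starts from the previous adversarial example, with the total perturbation kept within an ℓ∞ budget ε) can be sketched generically as follows; the step size, iteration count, and the `grad_loss` callable are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def iterative_linf_attack(x, grad_loss, eps=0.05, alpha=0.005, steps=100):
    """Iterative attack under an l_inf budget.

    grad_loss(x_adv) must return the gradient of the attack loss
    (e.g., a surrogate pushing toward a target transcription)
    w.r.t. the input. Each iteration starts from the previous
    adversarial example; the accumulated perturbation is projected
    back onto the l_inf ball of radius eps around the clean input x.
    """
    x_adv = x.copy()
    for _ in range(steps):
        g = grad_loss(x_adv)
        # take a signed gradient step that decreases the attack loss
        x_adv = x_adv - alpha * np.sign(g)
        # project back onto the l_inf ball around the clean input
        x_adv = np.clip(x_adv, x - eps, x + eps)
    return x_adv
```

With a toy quadratic loss pulling toward a target point t (gradient 2(x − t)), the iterate saturates the ε-ball boundary in the direction of t, which is the expected behavior of the projection step.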
As with images [32], adversarial examples for speech recognition also transfer between models.

7 Conclusion

We have introduced a novel approach to generating adversarial examples tailored to the performance measure unique to the task of interest. We have applied Houdini to challenging structured prediction problems such as pose estimation, semantic segmentation and speech recognition. In each case, Houdini allows fooling state-of-the-art learning systems with imperceptible perturbations, hence extending the use of adversarial examples beyond the task of image classification. What the eyes see and the ears hear, the mind believes. (Harry Houdini)

Acknowledgments  The authors thank Alexandre Lebrun, Pauline Luc and Camille Couprie for valuable help with code and experiments. We also thank Antoine Bordes, Laurens van der Maaten, Nicolas Usunier, Christian Wolf, Herve Jegou, Yann Ollivier, Neil Zeghidour and Lior Wolf for their insightful comments on an early draft of this paper.

References
[1] D. Amodei, R. Anubhai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, J. Chen, M. Chrzanowski, A. Coates, G. Diamos, et al. Deep speech 2: End-to-end speech recognition in English and Mandarin. arXiv preprint arXiv:1512.02595, 2015.
[2] D. Amodei, R. Anubhai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, J. Chen, M. Chrzanowski, A. Coates, G. Diamos, et al. Deep speech 2: End-to-end speech recognition in English and Mandarin. In International Conference on Machine Learning, pages 173–182, 2016.
[3] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2D human pose estimation: New benchmark and state of the art analysis. CVPR, 2014.
[4] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[5] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe.
Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
[6] A. Bulat and G. Tzimiropoulos. Human pose estimation via convolutional part heatmap regression. ECCV, 2016.
[7] M. Cisse, P. Bojanowski, E. Grave, Y. Dauphin, and N. Usunier. Parseval networks: Improving robustness to adversarial examples. arXiv preprint arXiv:1704.08847, 2017.
[8] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. CVPR, 2016.
[9] C. Do, Q. Le, C.-H. Teo, O. Chapelle, and A. Smola. Tighter bounds for structured estimation. In Advances in Neural Information Processing Systems (NIPS) 22, 2008.
[10] A. Fawzi, O. Fawzi, and P. Frossard. Analysis of classifiers' robustness to adversarial perturbations. arXiv preprint arXiv:1502.02590, 2015.
[11] A. Fawzi, S.-M. Moosavi-Dezfooli, and P. Frossard. Robustness of classifiers: from adversarial to random noise. In Advances in Neural Information Processing Systems, pages 1624–1632, 2016.
[12] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In Proc. ICLR, 2015.
[13] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pages 369–376. ACM, 2006.
[14] T. Hazan, J. Keshet, and D. A. McAllester. Direct loss minimization for structured prediction. In Advances in Neural Information Processing Systems, pages 1594–1602, 2010.
[15] K. He, G. Gkioxari, P. Dollar, and R. Girshick. Mask R-CNN. arXiv preprint arXiv:1703.06870, 2016.
[16] K. He, X. Zhang, S. Ren, and J. Sun.
Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[17] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[18] J. Keshet and D. A. McAllester. Generalization bounds and consistency for latent structural probit and ramp loss. In Advances in Neural Information Processing Systems, pages 2205–2212, 2011.
[19] J. Keshet, D. McAllester, and T. Hazan. PAC-Bayesian approach for minimization of phoneme error rate. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 2224–2227. IEEE, 2011.
[20] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[21] A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.
[22] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[23] V. I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics Doklady, volume 10, pages 707–710, 1966.
[24] D. McAllester, T. Hazan, and J. Keshet. Direct loss minimization for structured prediction. In Advances in Neural Information Processing Systems (NIPS) 24, 2010.
[25] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard. DeepFool: a simple and accurate method to fool deep neural networks. arXiv preprint arXiv:1511.04599, 2015.
[26] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. ECCV, 2016.
[27] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur. Librispeech: an ASR corpus based on public domain audio books.
In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pages 5206–5210. IEEE, 2015.
[28] N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. Berkay Celik, and A. Swami. Practical black-box attacks against deep learning systems using adversarial examples. arXiv preprint arXiv:1602.02697, 2016.
[29] N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In Security and Privacy (SP), 2016 IEEE Symposium on, pages 582–597. IEEE, 2016.
[30] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
[31] U. Shaham, Y. Yamada, and S. Negahban. Understanding adversarial training: Increasing local stability of neural nets through robust optimization. arXiv preprint arXiv:1511.05432, 2015.
[32] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. In Proc. ICLR, 2014.
[33] P. Tabacof and E. Valle. Exploring the space of adversarial images. arXiv preprint arXiv:1510.05328, 2015.
[34] A. Tewari and P. L. Bartlett. On the consistency of multiclass classification methods. Journal of Machine Learning Research, 8(May):1007–1025, 2007.
[35] A. Toshev and C. Szegedy. DeepPose: Human pose estimation via deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1653–1660, 2014.
[36] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
[37] C. Xie, J. Wang, Z. Zhang, Y. Zhou, L. Xie, and A. L. Yuille. Adversarial examples for semantic segmentation and object detection. CoRR, abs/1703.08603, 2017.
[38] F. Yu and V. Koltun.
Multi-scale context aggregation by dilated convolutions. ICLR, 2016.