{"title": "Efficient Methods for Privacy Preserving Face Detection", "book": "Advances in Neural Information Processing Systems", "page_first": 57, "page_last": 64, "abstract": null, "full_text": "Efficient Methods for Privacy Preserving Face Detection\n\nShai Avidan\nMitsubishi Electric Research Labs\n201 Broadway\nCambridge, MA 02139\navidan@merl.com\n\nMoshe Butman\nDepartment of Computer Science\nBar Ilan University\nRamat-Gan, Israel\nbutmanm@cs.biu.edu\n\nAbstract\n\nBob offers a face-detection web service where clients can submit their images for analysis. Alice would very much like to use the service, but is reluctant to reveal the content of her images to Bob. Bob, for his part, is reluctant to release his face detector, as he spent a lot of time, energy and money constructing it. Secure Multi-Party Computation uses cryptographic tools to solve this problem without leaking any information. Unfortunately, these methods are slow to compute, so we introduce two machine learning techniques that allow the parties to solve the problem while leaking a controlled amount of information. The first method is an information-bottleneck variant of AdaBoost that lets Bob find a subset of features that are enough for classifying an image patch, but not enough to actually reconstruct it. The second machine learning technique is active learning, which allows Alice to construct an online classifier based on a small number of calls to Bob\u2019s face detector. She can then use her online classifier as a fast rejector before using a cryptographically secure classifier on the remaining image patches.\n\n1 Introduction\n\nThe Internet triggered many opportunities for cooperative computing in which buyers and sellers can meet to buy and sell goods, information or knowledge. Placing classifiers on the Internet allows buyers to enjoy the power of a classifier without having to train it themselves. 
This benefit is hindered by the fact that the seller, who owns the classifier, learns a great deal about the buyer\u2019s data, needs or goals. This raises the need for privacy in Internet transactions. While it is now common to assume that the buyer and the seller can secure their data exchange from the rest of the world, we are interested in a stronger level of security that allows the buyer to hide his data from the seller as well. Of course, the same can be said about the seller, who would like to maintain the privacy of his hard-earned classifier.\n\nSecure Multi-Party Computation (SMC) is based on cryptographic tools that let two parties, Alice and Bob, engage in a protocol that allows them to achieve a common goal without revealing the content of their inputs. For example, Alice might be interested in classifying her data using Bob\u2019s classifier without revealing anything to Bob, not even the classification result, and without learning anything about Bob\u2019s classifier, other than a binary answer to her query.\n\nRecently, Avidan & Butman introduced Blind Vision [1], which is a method for securely evaluating a Viola-Jones type face detector [12]. Blind Vision uses standard cryptographic tools and is painfully slow to compute, taking a couple of hours to scan a single image. The purpose of this work is to explore machine learning techniques that can speed up the process, at the cost of a controlled leakage of information.\n\nIn our hypothetical scenario Bob has a face-detection web service where clients can submit their images to be analyzed. Alice would very much like to use the service, but is reluctant to reveal the content of the images to Bob. Bob, for his part, is reluctant to release his face detector, as he spent a lot of time, energy and money constructing it.\n\nIn our face detection protocol Alice raster scans the image and sends every image patch to Bob to be classified. 
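As a concrete illustration, the raster scan Alice performs can be sketched as follows. This is our own illustrative snippet, not the authors' code; the function name and the 24 x 24 patch size (matching the experiments in Section 5) are assumptions.

```python
# A minimal sketch (not the paper's implementation) of the raster scan:
# slide a fixed-size window over the image and collect every patch.
def raster_scan_patches(image, patch=24, step=1):
    # image is a list of rows (list of lists); returns a list of
    # (row, col, window) tuples in raster order.
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(0, rows - patch + 1, step):
        for c in range(0, cols - patch + 1, step):
            window = [row[c:c + patch] for row in image[r:r + patch]]
            out.append((r, c, window))
    return out

img = [[0] * 32 for _ in range(32)]   # a dummy 32 x 32 image
print(len(raster_scan_patches(img)))  # (32 - 24 + 1)^2 = 81 patches
```

Every one of these patches would then be submitted (in some protected form) for classification, which is exactly the cost the methods below try to reduce.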
We would like to replace cryptographically-based SMC methods with machine learning algorithms that might leak some information but are much faster to execute. The challenge is to design protocols that can explicitly control the amount of information leaked. To this end we propose two well-known machine learning techniques: one based on the information bottleneck and the other on active learning.\n\nThe first method is a privacy-preserving feature selection that uses a variant of the information-bottleneck principle to find features that are useful for classification but not for signal reconstruction. In this case, Bob can use his training data to construct different classifiers that offer different trade-offs of information leakage versus classification accuracy. Alice can then choose the trade-off that suits her best and send only those features to Bob for classification. This method can be used, for example, as a filtering step that rejects a large number of the image patches as containing no face, followed by an SMC method that securely classifies the remaining image patches using the full classifier that is known only to Bob.\n\nThe second method is active learning, which helps Alice choose which image patches to send to Bob for classification. This method can be used either with the previous method or directly with an SMC protocol. The idea is that, instead of sending all image patches to Bob for classification, Alice might try to learn as much as she can from the interaction and use her online trained classifier to reject some of the image patches herself. 
This can minimize the amount of information revealed to Bob, if the parties use the privacy-preserving features, or the computational load, if the parties use cryptographically-based SMC methods.\n\n2 Background\n\nSecure multi-party computation originated from the work of Yao [14], who gave a solution to the millionaire problem: two parties want to find which one has a larger number, without revealing anything else about the numbers themselves. Later, Goldreich et al. [5] showed that any function can be computed in such a secure manner. However, the theoretical construct was still too demanding to be of practical use. An easy introduction to cryptography is given in [9] and a more advanced and theoretical treatment is given in [4]. Since then many secure protocols have been reported for various data mining applications [7, 13, 1]. A common assumption in SMC is that the parties are honest but curious, meaning that they will follow the agreed-upon protocol but will try to learn as much as possible from the data-flow between the two parties. We will follow this assumption here.\n\nThe information bottleneck principle [10] shows how to compress a signal while preserving its information with respect to a target signal. We offer a variant of the self-consistent equations used to solve this problem and a greedy feature selection algorithm that satisfies privacy constraints, which are represented as the percentage of the power spectrum of the original signal.\n\nActive learning methods assume that the student (Alice, in our case) has access to an Oracle (Bob) for labeling. The usual motivation in active learning is that the Oracle is assumed to be a human operator and having him label data is a time-consuming task that should be avoided. 
Our motivation is similar: Alice would like to avoid using Bob because of the high computational cost involved in case of cryptographically secure protocols, or for fear of leaking information in case non-cryptographic methods are used. Typical active learning applications assume that the distribution of class sizes is similar [2, 11]. A notable exception is the work of [8], which proposes an active learning method for anomaly detection. Our case is similar, as image patches that contain faces are rare in an image.\n\n3 Privacy-preserving Feature Selection\n\nFeature selection aims at finding a subset of the features that optimizes some objective function, typically a classification task [6]. However, feature selection does not concern itself with the correlation of the feature subset with the original signal.\n\nThis is handled by the information bottleneck method [10], which takes a joint distribution p(x, y) and finds a compressed representation of X, denoted by T, that is as informative as possible about Y. This is achieved by minimizing the following functional:\n\nmin_{p(t|x)} L,  L \u2261 I(X; T) \u2212 \u03b2 I(T; Y)  (1)\n\nwhere \u03b2 is a trade-off parameter that controls the trade-off between compressing X and maintaining information about Y. The functional L admits a set of self-consistent equations that allows one to find a suitable solution.\n\nWe map the information bottleneck idea to a feature selection algorithm to obtain a Privacy-preserving Feature Selection (PPFS) and describe how Bob can construct such a feature set. Let Bob have a training set of image patches, their associated labels and a weight associated with every feature (pixel), denoted {x_n, y_n, s_n}_{n=1}^N. Bob\u2019s goal is to find a feature subset I \u2261 {i_1, . . . , i_k} s.t. a classifier F(x(I)) will minimize the classification error, where x(I) denotes a sample x that uses only the features in the set I. 
Formally, Bob needs to minimize:\n\nmin_F \u2211_{n=1}^{N} (F(x_n(I)) \u2212 y_n)^2  (2)\nsubject to \u2211_{i\u2208I} s_i < \u039b\n\nwhere \u039b is a user-defined threshold that defines the amount of information that can be leaked.\n\nWe found it useful to use the PCA spectrum to measure the amount of information. Specifically, Bob computes the PCA space of all the face images in his database and maps all the data to that space, without reducing dimensionality. The weights {s_n}_{n=1}^N are now set to the eigenvalues associated with each dimension in the PCA space. This avoids the need to compute the mutual information between pixels by making the assumption that features do not carry mutual information with other features beyond second order statistics.\n\nAlgorithm 1 Privacy-Preserving Feature Selection\nInput: {x_n, y_n, s_n}_{n=1}^N, threshold \u039b, number of iterations T\nOutput: A privacy-preserving strong classifier F(x)\n\u2022 Start with weights w_n = 1/N, n = 1, 2, . . . , N; F(x) = 0; I = \u2205\n\u2022 Repeat for t = 1, 2, . . . , T:\n  \u2013 Set working index set J = I \u222a {j | s_j + \u2211_{i\u2208I} s_i < \u039b}\n  \u2013 Repeat for j \u2208 J:\n    \u2217 Fit a regression stump g_j(x(j)) \u2261 a_j(x(j) > \u03b8_j) + b_j to the j-th feature, x(j)\n    \u2217 Compute error e_j = \u2211_{n=1}^{N} w_n (y_n \u2212 (a_j(x_n(j) > \u03b8_j) + b_j))^2 / \u2211_{n=1}^{N} w_n\n  \u2013 Set f_t = g_i where e_i \u2264 e_j \u2200j \u2208 J\n  \u2013 Update:\n    F(x) \u2190 F(x) + f_t(x)  (3)\n    w_n \u2190 w_n e^{\u2212y_n f_t(x_n)}  (4)\n    I \u2190 I \u222a {i}  (5)\n\nBoosting was used for feature selection before [12] and Bob takes a similar approach here. He uses a variant of the gentleBoost algorithm [3] to find a greedy solution to (2). Specifically, Bob uses gentleBoost with \u201cstumps\u201d as the weak classifiers, where each \u201cstump\u201d works on only one feature. The only difference from gentleBoost is in the choice of the features to be selected. 
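The budgeted selection step of Algorithm 1 can be sketched in code. This is an illustrative sketch with our own function names (a brute-force stump fit over candidate thresholds), not Bob's actual implementation.

```python
# Illustrative sketch of one selection round of Algorithm 1: among
# features still within the leakage budget Lam, fit a weighted regression
# stump a*(x > th) + b to each and pick the lowest-error feature.
import numpy as np

def fit_stump(x, y, w):
    # Brute-force search over thresholds; returns the best weighted error.
    best = float('inf')
    for th in np.unique(x):
        lo, hi = x <= th, x > th
        b = np.average(y[lo], weights=w[lo]) if lo.any() else 0.0
        a = (np.average(y[hi], weights=w[hi]) if hi.any() else b) - b
        err = float(np.sum(w * (y - (a * hi + b)) ** 2))
        best = min(best, err)
    return best

def select_feature(X, y, w, s, I, Lam):
    # Working set J: features already selected plus those whose eigenvalue
    # weight still fits under the budget Lam (the privacy constraint).
    spent = sum(s[i] for i in I)
    J = set(I) | {j for j in range(X.shape[1]) if spent + s[j] < Lam}
    return min(J, key=lambda j: fit_stump(X[:, j], y, w))
```

When the budget excludes the most discriminative pixels, the greedy step simply falls back to the best feature that is still allowed, which is how accuracy degrades gracefully as Lam shrinks.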
In the original algorithm all the features are evaluated in every iteration, but here Bob can only use a subset of the features. In each iteration Bob can use features that were already selected, or features whose addition will not increase the total weight of selected features beyond the threshold \u039b. Once Bob has computed the privacy-preserving feature subset, the amount of information it leaks and its classification accuracy, he publishes this information on the web. Alice then needs to map her image patches to this low-dimensional privacy-preserving feature space and send the data to Bob for classification.\n\n4 Privacy-Preserving Active Learning\n\nIn our face detection example Alice needs to submit many image patches to Bob for classification. This is computationally expensive if SMC methods are used, and reveals information if the privacy-preserving feature selection method discussed earlier is used. Hence, it would be beneficial if Alice could minimize the number of image patches she needs to send Bob for classification. This is where she might use active learning. Instead of raster scanning the image and submitting every image patch for classification, she sends a small number of randomly selected image patches and, based on their labels, determines the next group of image patches to be sent for classification. We found that substantial gains can be made this way.\n\nSpecifically, Alice maintains an RBF network that is trained on-line, based on the list of labeled prototypes. Let {c_j, y_j}_{j=1}^M be the list of M prototypes that were labeled so far. Then, Alice constructs a kernel matrix K where K_ij = k(c_i, c_j) and solves the least squares equation Ku = y, where y = [y_1, . . . , y_M]^T. The kernel Alice uses is a Gaussian kernel whose width is set to be the range of the prototype coordinates, in each dimension. The score of each image patch x is given by h(x) = [k(x, c_1), . . . 
, k(x, c_M)]u.\n\nFor the next round of classification Alice chooses the image patches with the highest h(x) score. This is in line with [2, 11, 8], which consider choosing the examples about which one has the least amount of information. In our case, Alice is interested in finding image patches that contain faces (which we assume are labeled +1) but most of the prototypes will be labeled \u22121, because faces are a rare event in an image. As long as Alice does not sample a face image patch she will keep exploring the space of image patches in her image, by sampling image patches that are farthest away from the current set of prototypes. If an image patch that contains a face is sampled, then her online classifier h(x) will label similar image patches with a high score, thus guiding the search towards other image patches that might contain a face. To avoid large overlap between patches, we force a minimum distance, in the image plane, between selected patches. The method is given in Algorithm 2.\n\nAlgorithm 2 Privacy-Preserving Active Learning\nInput: {x_i}_{i=1}^N unlabeled samples; number M of classification calls allowed; number s of samples to classify in each iteration\nOutput: Online classifier h(x)\n\u2022 Choose s random samples {x_i}_{i=1}^s, set C = [x_1, . . . , x_s] and obtain their labels y = [y_1, . . . , y_s] from Bob.\n\u2022 Repeat for m = 1, 2, . . . , M times:\n  \u2013 Construct the kernel matrix K_ij = k(c_i, c_j) and solve for the weight vector u through least squares Ku = y.\n  \u2013 Evaluate h(x_i) = [k(x_i, c_1), . . . , k(x_i, c_m)]u \u2200i = 1, . . . , N.\n  \u2013 Choose the top s samples with the highest h(x) score, send them to Bob for classification and add them, and their labels, to C and y, respectively.\n\n5 Experiments\n\nWe conducted two experiments to validate both methods.\n\nFigure 1: Privacy preserving feature selection. 
We show the ROC curves of four strong classifiers, each trained with 100 weak \u201cstump\u201d classifiers, but with different levels of information leakage. The information leakage is defined as the amount of PCA spectrum captured by the features used in each classifier. The number in parentheses shows how much of the eigenspectrum is captured by the features used in each classifier.\n\nThe first experiment evaluates the privacy-preserving feature selection method. The training set consisted of 9666 image patches of size 24 \u00d7 24 pixels each, split evenly into face/no-face images. The test set was of similar size. We then ran Algorithm 1 with different levels of the threshold \u039b and created a strong classifier with 100 weak \u201cstump\u201d-based classifiers. The ROC curves of several such classifiers are shown in Figure 1. We found that, for this particular dataset, setting \u039b = 0.1 gives identical results to a full classifier without any privacy constraints. Reducing \u039b to 0.01 did hurt the classification performance somewhat.\n\nThe second experiment tests the active learning approach. We assume that Alice and Bob use the classifier with \u039b = 0.05 from the previous experiment, and measure how effective the on-line classifier that Alice constructs is at rejecting no-face image patches.\n\nRecall that there are three classifiers at play. 
One is the full classifier that Bob owns, the second is the privacy-preserving classifier that Bob owns and the last is the on-line classifier that Alice constructs. Alice uses the labels of Bob\u2019s privacy-preserving classifier to construct her on-line classifier, and the question is: how many image patches can she reject without actually rejecting image patches that would be classified as faces by the full classifier (that she knows nothing about)?\n\nBefore we performed the experiment, we conducted the following pre-processing operation: we found, for each image, the scale at which the largest number of faces is detected using Bob\u2019s full classifier, and used only the image at that scale.\n\nThe experiment proceeds as follows. Alice chooses 5 image patches in each round, maps them to the reduced PCA space and sends them to Bob for classification, using his privacy-preserving classifier. Based on his labels, Alice then picks the next 5 image patches according to Algorithm 2, and so on. Alice repeats the process 10 times, resulting in 50 patches that are sent to Bob for classification. The first 5 patches are chosen at random. Figure 2 shows the 50 patches selected by Alice, the online classifier h and the corresponding rejection/recall curve, for several test images. The rejection/recall curve shows how many image patches Alice can safely reject, based on h, without rejecting a face that would be detected by Bob\u2019s full classifier. For example, in the top row of Figure 2 we see that rejecting the bottom 40% of image patches based on the on-line classifier h will not reject any face that can be detected with the full classifier. 
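The scoring step behind h can be sketched as follows. This is an illustrative snippet under our own naming assumptions (a plain least-squares solve of Ku = y followed by kernel evaluation), not the authors' implementation.

```python
# Sketch of Alice's online scorer (Algorithm 2): solve Ku = y on the
# labeled prototypes by least squares, then score every candidate patch.
import numpy as np

def gaussian_kernel(A, B, width=1.0):
    # Pairwise Gaussian kernel between the rows of A and the rows of B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * width ** 2))

def online_scores(X, C, y, width=1.0):
    # C holds the labeled prototypes; y holds +1 (face) / -1 (non-face).
    K = gaussian_kernel(C, C, width)
    u = np.linalg.lstsq(K, y, rcond=None)[0]  # least-squares solve of Ku = y
    return gaussian_kernel(X, C, width) @ u   # h(x) = [k(x, c_1), ...] u

# The s highest-scoring patches would be sent to Bob in the next round.
```

Patches near a +1 prototype score high and get queried first, which is the mechanism that pulls the search toward faces once one is found.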
Thus 50 image patches that can be quickly labeled while leaking very little information can help Alice reject thousands of image patches.\n\nNext, we conducted the same experiment on a larger set of images, consisting of 65 of the CMU+MIT database images1.\n\n1We used the 65 images in the newtest directory of the CMU+MIT dataset\n\nFigure 2: Examples of privacy-preserving feature selection and active learning. (a) The input images and the image patches (marked with white rectangles) selected by the active learning algorithm. (b) The response image computed by the online classifier (the black spots correspond to the positions of the selected image patches). Brighter means a higher score. (c) The rejection/recall curve showing how many image patches can be safely rejected. For example, panel (c-1) shows that Alice can reject almost 50% of the image patches, based on her online classifier (i.e., response image), and not miss a face that can be detected by the full classifier (that is known to Bob and not to Alice).\n\nFigure 3: Privacy preserving active learning. Results on a dataset of 65 images. The figure shows how many image patches can be rejected, based on the online classifier that Alice owns, without rejecting a face. The horizontal axis shows how many image patches are rejected, based on the on-line classifier, and the vertical axis shows how many faces are maintained. For example, the figure shows (dashed line) that rejecting 20% of all image patches, based on the on-line classifier, will maintain 80% of all faces. The solid line shows that rejecting 40% of all image patches, based on the on-line classifier, will not miss a face in at least half (i.e., the median) of the images in the dataset.\n\nFigure 3 shows the results. We found that, on average (dashed line), using only 50 labeled image patches Alice can reject up to about 20% of the image patches in an image while keeping 80% of the faces in that image (i.e., Alice will reject 20% of the image patches that Bob\u2019s full classifier would classify as a face). If we look at the median (solid line), we see that for at least half the images in the dataset, Alice can reject a little more than 40% of the image patches without erroneously rejecting a face.\n\nWe found that increasing the number of labeled examples from 50 to a few hundred does not greatly improve results; results improve only once many thousands of samples are labeled, at which point too much information might be leaked.\n\n6 Conclusions\n\nWe described two machine learning methods to accelerate cryptographically secure classification protocols. The methods greatly accelerate the performance of the system, while leaking a controlled amount of information. The two methods are a privacy preserving feature selection that is similar to the information bottleneck, and an active learning technique that was found to be useful in learning a rejector from an extremely small number of labeled examples. We plan to keep investigating these methods, apply them to classification tasks in other domains and develop new methods to make secure classification faster to use.\n\nReferences\n\n[1] S. Avidan and M. Butman. Blind vision. In Proc. of European Conference on Computer Vision, 2006.\n\n[2] Y. Baram, R. El-Yaniv, and K. Luz. Online choice of active learning algorithms. Journal of Machine Learning Research, 5:255\u2013291, March 2004.\n\n[3] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting, 1998.\n\n[4] O. Goldreich. Foundations of Cryptography: Volume 1, Basic Tools. Cambridge University Press, New York, 2001.\n\n[5] O. Goldreich, S. Micali, and A. Wigderson. 
How to play any mental game or a completeness theorem for protocols with honest majority. In ACM Symposium on Theory of Computing, pages 218\u2013229, 1987.\n\n[6] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157\u20131182, 2003.\n\n[7] Y. Lindell and B. Pinkas. Privacy preserving data mining. In Proceedings of Crypto, 2000.\n\n[8] D. Pelleg and A. Moore. Active learning for anomaly and rare-category detection. In Advances in Neural Information Processing Systems 18, 2004.\n\n[9] B. Schneier. Applied Cryptography. John Wiley & Sons, New York, 1996.\n\n[10] N. Tishby, F. Pereira, and W. Bialek. The information bottleneck method. In Proc. of the 37th Allerton Conference on Communication, Control and Computing, 1999.\n\n[11] S. Tong and D. Koller. Support vector machine active learning with applications to text classification. Journal of Machine Learning Research, 2:45\u201366, 2001.\n\n[12] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Conference on Computer Vision and Pattern Recognition (CVPR), 2001.\n\n[13] R. N. Wright and Z. Yang. Privacy-preserving Bayesian network structure computation on distributed heterogeneous data. In KDD \u201904: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 22\u201325, 2004.\n\n[14] A. C. Yao. Protocols for secure computations. In Proc. 23rd IEEE Symp. on Foundations of Computer Science, pages 160\u2013164, Chicago, 1982. IEEE.\n", "award": [], "sourceid": 3081, "authors": [{"given_name": "Shai", "family_name": "Avidan", "institution": null}, {"given_name": "Moshe", "family_name": "Butman", "institution": null}]}