{"title": "A Stability-based Validation Procedure for Differentially Private Machine Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2652, "page_last": 2660, "abstract": "Differential privacy is a cryptographically motivated definition of privacy which has gained considerable attention in the algorithms, machine-learning and data-mining communities. While there has been an explosion of work on differentially private machine learning algorithms, a major barrier to achieving end-to-end differential privacy in practical machine learning applications is the lack of an effective procedure for differentially private parameter tuning, that is, for determining a parameter value, such as a bin size in a histogram or a regularization parameter, that is suitable for a particular application. In this paper, we introduce a generic validation procedure for differentially private machine learning algorithms that applies when a certain stability condition holds on the training algorithm and the validation performance metric. The training data size and the privacy budget used for training in our procedure are independent of the number of parameter values searched over. We apply our generic procedure to two fundamental tasks in statistics and machine learning -- training a regularized linear classifier and building a histogram density estimator -- resulting in end-to-end differentially private solutions for these problems.", "full_text": "A Stability-based Validation Procedure for Differentially Private Machine Learning

Kamalika Chaudhuri
Department of Computer Science and Engineering
UC San Diego, La Jolla CA 92093
kamalika@cs.ucsd.edu

Staal Vinterbo
Division of Biomedical Informatics
UC San Diego, La Jolla CA 92093
sav@ucsd.edu

Abstract

Differential privacy is a cryptographically motivated definition of privacy which has gained considerable attention in the algorithms, machine-learning and data-mining communities.
While there has been an explosion of work on differentially private machine learning algorithms, a major barrier to achieving end-to-end differential privacy in practical machine learning applications is the lack of an effective procedure for differentially private parameter tuning, that is, for determining a parameter value, such as a bin size in a histogram or a regularization parameter, that is suitable for a particular application.
In this paper, we introduce a generic validation procedure for differentially private machine learning algorithms that applies when a certain stability condition holds on the training algorithm and the validation performance metric. The training data size and the privacy budget used for training in our procedure are independent of the number of parameter values searched over. We apply our generic procedure to two fundamental tasks in statistics and machine learning – training a regularized linear classifier and building a histogram density estimator – resulting in end-to-end differentially private solutions for these problems.

1 Introduction

Privacy-preserving machine learning algorithms are increasingly essential for settings where sensitive and personal data are mined. The emerging standard for privacy-preserving computation over the past few years is differential privacy [7]. Differential privacy is a cryptographically motivated definition, which guarantees privacy by ensuring that the log-likelihood of any outcome does not change by more than α due to the participation of a single individual; an adversary will thus have difficulty inferring the private value of a single individual when α is small. This is achieved by adding random noise to the data or to the result of a function computed on the data. The value α is called the privacy budget, and measures the level of privacy risk allowed.
As more noise is needed to achieve lower α, the price of higher privacy is reduced utility or accuracy. The past few years have seen an explosion in the literature on differentially private algorithms, and there currently exist differentially private algorithms for many statistical and machine-learning tasks such as classification [4, 15, 23, 10], regression [18], PCA [2, 5, 17, 12], clustering [2], density estimation [28, 19], among others.
Many statistics and machine learning algorithms involve one or more parameters, for example, the regularization parameter λ in Support Vector Machines and the number of clusters in k-means. Accurately setting these parameters is critical to performance. However, there is no good a priori way to set these parameters, and common practice is to run the algorithm for a few different plausible parameter values on a dataset, and then select the output that yields the best performance on held-out validation data. This process is often called parameter-tuning, and is an essential component of any practical machine-learning system.

A major barrier to achieving end-to-end differential privacy in practical machine-learning applications is the absence of an effective procedure for differentially private parameter-tuning. Most previous experimental works either assume that a good parameter value is known a priori [15, 5] or use a heuristic to determine a suitable parameter value [19, 28]. Currently, parameter-tuning with differential privacy is done in two ways. The first is to run the training algorithm on the same data multiple times. However, re-using the data leads to a degradation in the privacy guarantees, and thus, to maintain the privacy budget α, each training run must use a privacy budget that shrinks polynomially with the number of parameter values.
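The budget arithmetic behind this shrinkage can be sketched in a few lines, using the standard composition bounds for differential privacy (restated in Section 2); the function names here are illustrative, not from the paper:

```python
import math

def per_run_budget_simple(alpha_total, k):
    # k runs on the same data compose additively, so each run
    # can only spend alpha_total / k.
    return alpha_total / k

def total_budget_advanced(alpha_run, k, delta):
    # Advanced composition: k runs at alpha_run each are
    # (alpha', delta)-differentially private for this alpha'.
    return (k * alpha_run * (math.exp(alpha_run) - 1)
            + math.sqrt(2 * k * math.log(1 / delta)) * alpha_run)
```

For small per-run budgets the advanced bound grows only like the square root of k, which is why it is the preferred accounting when many parameter values are tried.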
The second procedure, used by [4], is to divide\nthe training data into disjoint sets and train for each parameter value using a different set. Both so-\nlutions are highly sub-optimal, particularly, if a large number of parameter values are involved \u2013 the\n\ufb01rst due to the lower privacy budget, and the second due to less data. Thus the challenge is to design\na differentially private validation procedure that uses the data and the privacy budget effectively, but\ncan still do parameter-tuning. This is an important problem, and has been mentioned as an open\nquestion by [28] and [4].\nIn this paper, we show that it is indeed possible to do effective parameter-tuning with differential\nprivacy in a fairly general setting, provided the training algorithm and the performance measure\nused to evaluate its output on the validation data together obey a certain stability condition. We\ncharacterize this stability condition by introducing a notion of (\u03b21, \u03b22, \u03b4)-stability; loosely speaking,\nstability holds if the validation performance measure does not change very much when one person\u2019s\nprivate value in the training set changes, when exactly the same random bits are used in the training\nalgorithm in both cases or, when one person\u2019s private value in the validation set changes. The second\ncondition is fairly standard, and our key insight is in characterizing the \ufb01rst condition and showing\nthat it can help in differentially private parameter tuning.\nWe next design a generic differentially private training and validation procedure that provides end-\nto-end privacy provided this stability condition holds. 
The training set size and the privacy budget\nused by our training algorithms are independent of k, the number of parameter values, and the\naccuracy of our validation procedure degrades only logarithmically with k.\nWe apply our generic procedure to two fundamental tasks in machine-learning and statistics \u2013 train-\ning a linear classi\ufb01er using regularized convex optimization, and building a histogram density esti-\nmator. We prove that existing differentially private algorithms for these problems obey our notion\nof stability with respect to standard validation performance measures, and we show how to combine\nthem to provide end-to-end differentially private solutions for these tasks. In particular, our appli-\ncation to linear classi\ufb01cation is based on existing differentially private procedures for regularized\nconvex optimization due to [4], and our application to histogram density estimation is based on the\nalgorithm variant due to [19].\nFinally we provide an experimental evaluation of our procedure for training a logistic regression\nclassi\ufb01er on real data.\nIn our experiments, even for a moderate value of k, our procedure out-\nperformed existing differentially private solutions for parameter tuning, and achieved performance\nonly slightly worse than knowing the best parameter to use ahead of time. We also observed that\nour procedure, in contrast to the other procedures we tested, improved the correspondence between\npredicted probabilities and observed outcomes, often referred to as model calibration.\nRelated Work. 
Differential privacy, proposed by [7], has gained considerable attention in the algo-\nrithms, data-mining and machine-learning communities over the past few years as there has been a\nlarge explosion of theoretical and experimental work on differentially private algorithms for statis-\ntical and machine-learning tasks [10, 2, 15, 19, 27, 28, 3] \u2013 see [24] for a recent survey of machine\nlearning methods with a focus on continuous data. In particular, our case study on linear classi-\n\ufb01cation is based on existing differentially private procedures for regularized convex optimization,\nwhich were proposed by [4], and extended by [23, 18, 15]. There has also been a large body of\nwork on differentially private histogram construction in the statistics, algorithms and database liter-\nature [7, 19, 27, 28, 20, 29, 14]. We use the algorithm variant due to [19].\nWhile the problem of differentially private parameter tuning has been mentioned in several works,\nto the best of our knowledge, an ef\ufb01cient systematic solution has been elusive. Most previous\nexperimental works either assume that a good parameter value is known apriori [15, 5] or use a\nheuristic to determine a suitable parameter value [19, 28]. [4] use a parameter-tuning procedure\nwhere they divide the training data into disjoint sets, and train for a parameter value on each set. [28]\n\n2\n\n\fmentions \ufb01nding a good bin size for a histogram using differentially private validation procedure as\nan open problem.\nFinally, our analysis uses ideas similar to the analysis of the Multiplicative Weights Method for\nanswering a set of linear queries [13].\n\n2 Preliminaries\n\nPrivacy De\ufb01nition and Composition Properties. 
We adopt differential privacy as our notion of\nprivacy.\nDe\ufb01nition 1 A (randomized) algorithm A whose output lies in a domain S is said to be (\u03b1, \u03b4)-\ndifferentially private if for all measurable S \u2286 S, for all datasets D and D(cid:48) that differ in the value\nof a single individual, it is the case that: Pr(A(D) \u2208 S) \u2264 e\u03b1 Pr(A(D(cid:48)) \u2208 S) + \u03b4. An algorithm is\nsaid to be \u03b1-differentially private if \u03b4 = 0.\n\nHere \u03b1 and \u03b4 are privacy parameters where lower \u03b1 and \u03b4 imply higher privacy. Differential privacy\nhas been shown to have many desirable properties, such as robustness to side information [7] and\nresistance to composition attacks [11].\nAn important property of differential privacy is that the privacy guarantees degrade gracefully if\nthe same sensitive data is used in multiple private computations. In particular, if we apply an \u03b1-\ndifferentially private procedure k times on the same data, the result is k\u03b1-differential private as\n\nwell as (\u03b1(cid:48), \u03b4)-differentially private for \u03b1(cid:48) = k\u03b1(e\u03b1 \u2212 1) +(cid:112)2k log(1/\u03b4)\u03b1 [7, 8]. These privacy\n\ncomposition results are the basis of existing differentially private parameter tuning procedures.\nTraining Procedure and Validation Score. Typical (non-private) machine learning algorithms\nhave one or more undetermined parameters, and standard practice is to run the machine learning\nalgorithm for a number of different parameter values on a training set, and evaluate the outputs on a\nseparate held-out validation dataset. The \ufb01nal output is the one which performs best on the validation\ndata. For example, in linear classi\ufb01cation, we train logistic regression or SVM classi\ufb01ers with\nseveral different values of the regularization parameter \u03bb, and then select the classi\ufb01er which has\nthe best performance on held-out validation data. 
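As a point of reference, the non-private tuning loop just described can be sketched as follows; `train` and `accuracy` stand in for an arbitrary learning algorithm and validation metric, and are assumptions of this sketch rather than functions from the paper:

```python
def tune(train, accuracy, lambdas, T, V):
    """Non-private parameter tuning: fit one model per candidate value
    and keep the (model, lambda) pair scoring best on held-out V."""
    fits = [(train(lam, T), lam) for lam in lambdas]
    return max(fits, key=lambda fit: accuracy(fit[0], V))
```

The differentially private difficulty is precisely that this loop touches the sensitive training data once per candidate value.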
Our goal in this paper is to design a differentially private version of this procedure which uses the privacy budget efficiently.
The full validation process thus has two components – a training procedure, and a validation score which evaluates how good the training procedure is.
We assume that training and validation data are drawn from a domain X, and the result of the differentially private training algorithm lies in a domain C. For example, for linear classification, X is the set of all labelled examples (x, y) where x ∈ Rd and y ∈ {−1, 1}, and C is the set of linear classifiers in d dimensions. We use n to denote the size of a training set, m to denote the size of a held-out validation set, and Θ to denote a set of parameters.
A differentially private training procedure is a randomized algorithm, which takes as input a (sensitive) training dataset, a parameter (of the training procedure), and a privacy parameter α, and outputs an element of C; the procedure is expected to be α-differentially private. For ease of exposition and proof, we represent a differentially private training procedure T as a tuple T = (G, F), where G is a density over sequences of real numbers, and F is a function, which takes as input a training set, a parameter in the parameter set Θ, a privacy parameter α, and a random sequence drawn from G, and outputs an element of C. F is thus a deterministic function, and the randomization in the training procedure is isolated in the draw from G.
Observe that any differentially private algorithm can be represented as such a tuple. For example, given x1, . . . , xn ∈ [0, 1], an α-differentially private approximation to the sample mean x̄ is x̄ + (1/(αn))Z, where Z is drawn from the standard Laplace distribution. We can represent this procedure as a tuple T = (G, F) as follows: G is the standard Laplace density over reals, and for any θ, F({x1, . . . , xn}, θ, α, r) = x̄ + r/(αn). In general, more complicated procedures will require more involved functions F.
A validation score is a function q : C × X^m → R which takes an object h in C and a validation dataset V, and outputs a score which reflects the quality of h with respect to V. For example, a common validation score used in linear classification is classification accuracy. In (non-private) validation, if hi is obtained by running the machine learning algorithm with parameter θi, then the goal is to output the i (or equivalently the hi) which maximizes q(hi, V); our goal is to output an i that approximately maximizes q(hi, V) while still preserving the privacy of V as well as the sensitive training data used in constructing the hi's.

3 Stability and Generic Validation Procedure

We now introduce and discuss our notion of stability, and provide a generic validation procedure that uses the privacy budget efficiently when this notion of stability holds.

Definition 2 ((β1, β2, δ)-Stability) A validation score q is said to be (β1, β2, δ)-stable with respect to a training procedure T = (G, F), a privacy parameter α, and a parameter set Θ if the following holds. There exists a set Σ such that Pr_{R∼G}(R ∈ Σ) ≥ 1 − δ, and whenever R ∈ Σ, the following two conditions hold:

1. Training Stability: For all θ ∈ Θ, V, and all training sets T and T′ that differ in a single entry, |q(F(T, θ, α, R), V) − q(F(T′, θ, α, R), V)| ≤ β1/n.

2.
Validation Stability: For all T , \u03b8 \u2208 \u0398, and for all V and V (cid:48) that differ in a single entry,\n\n|q(F (T, \u03b8, \u03b1, R), V ) \u2212 q(F (T, \u03b8, \u03b1, R), V (cid:48))| \u2264 \u03b22\nm .\n\nCondition (1), the training stability condition, bounds the change in the validation score q, when one\nperson\u2019s private data in the training set T changes, and the validation set V as well as the value of the\nrandom variable R remains the same. Our validation procedure critically relies on this condition,\nand our main contribution in this paper is to identify and exploit it to provide a validation procedure\nthat uses the privacy budget ef\ufb01ciently.\nAs F (T, \u03b8, \u03b1, R) is a deterministic function, Condition (2), the validation stability condition, bounds\nthe change in q when one person\u2019s private data in the validation set V changes, and the output of the\ntraining procedure remains the same. We observe that (some version of) Condition (2) is a standard\nrequirement in existing differentially private algorithms that preserve the privacy of the validation\ndataset while selecting a h \u2208 C that approximately maximizes q(h, V ), even if it is not required to\nmaintain privacy with respect to the training data.\nSeveral remarks are in order. First, observe that Condition (1) is a property of the differentially\nprivate training algorithm (in addition to q and the non-private quantity being approximated). Even\nif all else remains the same, different differentially private approximations to the same non-private\nquantity will have different values of \u03b21.\nSecond, Condition (1) does not always hold for small \u03b21 as an immediate consequence of differential\nprivacy of the training procedure. 
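To see Condition (1) concretely, consider the sample-mean procedure of Section 2 written as T = (G, F). The sketch below uses a toy 1-Lipschitz score q (illustrative only, not a score from the paper); because the same draw r is used on both sides, the noise cancels from the difference:

```python
import numpy as np

def F(T, alpha, r):
    # Deterministic training map: the Laplace draw r ~ G is an
    # explicit argument rather than being sampled inside.
    return np.mean(T) + r / (alpha * len(T))

def q(h, V):
    # Toy 1-Lipschitz validation score (illustrative only).
    return -abs(h - np.mean(V))

# Training sets differing in one entry, scored with the SAME draw r.
T1, T2, V = [0.1, 0.5, 0.9], [0.1, 0.5, 0.3], [0.4, 0.6]
r = np.random.default_rng(0).laplace()
gap = abs(q(F(T1, 1.0, r), V) - q(F(T2, 1.0, r), V))
# The shared r cancels from the difference, so
# gap <= |mean(T1) - mean(T2)|: a beta1/n-style bound.
```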
Differential privacy ensures that the probability of any outcome is\nalmost the same when the inputs differ in the value of a single individual; Condition (1) requires that\neven when the same randomness is used, the validation score evaluated on the actual output of the\nalgorithm does not change very much when the inputs differ by a single individual\u2019s private value.\nIn Section 6.1, we present an example of a problem and two \u03b1-differentially private training algo-\nrithms which approximately optimize the same function; the \ufb01rst algorithm is based on exponential\nmechanism, and the second on a maximum of Laplace random variables mechanism. We show\nthat while both provide \u03b1-differential privacy guarantees, the \ufb01rst algorithm does not satisfy train-\ning stability for \u03b21 = o(n) and small enough \u03b4 while the second one ensures training stability for\n\u03b21 = 1 and \u03b4 = 0. In Section 4, we present two case studies of commonly used differentially private\nalgorithms where Conditions (1) and (2) hold for constant \u03b21 and \u03b22.\nWhen the (\u03b21, \u03b22, \u03b4)-stability condition holds, we can design an end-to-end differentially private\nparameter tuning algorithm, which is shown in Algorithm 2. The algorithm \ufb01rst uses a validation\nprocedure to determine which parameter out of the given set \u0398 is (approximately) optimal based\non the held-out data (see Algorithm 1). In the next step, the training data is re-used along with the\nparameter output by Algorithm 1 and fresh randomness to generate the \ufb01nal output. Note that we\nuse Exp(\u03b3) to denote the exponential distribution with expectation \u03b3.\n\n4\n\n\fAlgorithm 1 Validate(\u0398, T , T , V , \u03b21, \u03b22, \u03b11, \u03b12)\n1: Inputs: Parameter list \u0398 = {\u03b81, . . . 
, θk}, training procedure T = (G, F), validation score q, training set T, validation set V, stability parameters β1 and β2, training privacy parameter α1, validation privacy parameter α2.
2: for i = 1, . . . , k do
3:   Draw Ri ∼ G. Compute hi = F(T, θi, α1, Ri).
4:   Let β = max(β1/n, β2/m).
5:   Let ti = q(hi, V) + 2βZi, where Zi ∼ Exp(1/α2).
6: end for
7: Output i∗ = argmax_i ti.

Algorithm 1 takes as input a training procedure T, a parameter list Θ, a validation score q, training and validation datasets T and V, and privacy parameters α1 and α2. It runs the training procedure T on the same training set T with privacy budget α1 for each parameter in Θ to generate outputs h1, h2, . . ., and then uses an α2-differentially private procedure to select the index i∗ such that the validation score q(hi∗, V) is (approximately) maximum. For simplicity, we use a maximum of Exponential random variables procedure, inspired by [1], to find the approximate maximum; an exponential mechanism [21] may also be used instead. Algorithm 2 then re-uses the training data set T to train with parameter θi∗ to get the final output.

Algorithm 2 End-to-end Differentially Private Training and Validation Procedure
1: Inputs: Parameter list Θ = {θ1, . . . , θk}, training procedure T = (G, F), validation score q, training set T, validation set V, stability parameters β1 and β2, training privacy parameter α1, validation privacy parameter α2.
2: i∗ = Validate(Θ, T, T, V, β1, β2, α1, α2).
3: Draw R ∼ G. Output h = F(T, θi∗, α1, R).

3.1 Performance Guarantees

Theorem 1 shows that Algorithm 1 is (α2, δ)-differentially private, and Theorem 2 shows privacy guarantees on Algorithm 2.
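A minimal Python sketch of Algorithm 1 follows, assuming caller-supplied functions `train(theta, T, alpha, rng)` (playing the role of G and F together) and a validation score `q(h, V)`; all names are illustrative:

```python
import numpy as np

def validate(thetas, train, q, T, V, beta1, beta2, alpha1, alpha2, rng=None):
    """Sketch of Algorithm 1: pick an approximately best parameter index."""
    if rng is None:
        rng = np.random.default_rng()
    beta = max(beta1 / len(T), beta2 / len(V))       # step 4
    noisy_scores = []
    for theta in thetas:
        h = train(theta, T, alpha1, rng)             # steps 2-3: fresh R_i
        z = rng.exponential(1.0 / alpha2)            # Z_i ~ Exp(1/alpha2)
        noisy_scores.append(q(h, V) + 2 * beta * z)  # step 5
    return int(np.argmax(noisy_scores))              # step 7
```

The Exp(1/α2) noise at scale 2β in step 5 is what makes the selection private under the stability condition.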
Detailed proofs of both theorems are provided in the Supplementary Material. We observe that Conditions (1) and (2) are critical to the proof of Theorem 1.

Theorem 1 (Privacy Guarantees for Validation Procedure) If the validation score q is (β1, β2, δ/k)-stable with respect to the training procedure T, the privacy parameter α1 and the parameter set Θ, then Algorithm 1 guarantees (α2, δ)-differential privacy.

Theorem 2 (End-to-end Privacy Guarantees) If the conditions in Theorem 1 hold, and if T is α1-differentially private, then Algorithm 2 is (α1 + α2, δ)-differentially private.

Theorem 3 shows guarantees on the utility of the validation procedure – that it selects an index i∗ which is not too suboptimal.

Theorem 3 (Utility Guarantees) Let h1, . . . , hk be the output of the differentially private training procedure in Step (3) of Algorithm 1. Then, with probability ≥ 1 − δ0, q(hi∗, V) ≥ max_{1≤i≤k} q(hi, V) − 2β log(k/δ0)/α2.

4 Case Studies

We next show that Algorithm 2 may be applied to design end-to-end differentially private training and validation procedures for two fundamental statistical and machine-learning tasks – training a linear classifier, and building a histogram density estimator. In each case, we use existing differentially private algorithms and validation scores for these tasks.
We show that the validation score satisfies the (β1, β2, δ)-stability property with respect to the training procedure for small values of β1 and β2, and thus we can apply Algorithm 2 with a small value of β to obtain end-to-end differential privacy.
Details of the case study for regularized linear classification are shown in Section 4.1, and those for histogram density estimation are presented in the Supplementary Material.

4.1 Linear Classification based on Logistic Regression and SVM

Given a set of labelled examples (x1, y1), . . . , (xn, yn) where xi ∈ Rd, ‖xi‖ ≤ 1 for all i, and yi ∈ {−1, 1}, the goal in linear classification is to train a linear classifier that largely separates examples from the two classes. A popular solution in machine learning is to find a classifier w∗ by solving a regularized convex optimization problem:

w∗ = argmin_{w∈Rd} (λ/2)‖w‖² + (1/n) Σ_{i=1}^{n} ℓ(w, xi, yi)    (1)

Here λ is a regularization parameter, and ℓ is a convex loss function. When ℓ is the logistic loss function ℓ(w, x, y) = log(1 + e^{−yw⊤x}), then we have logistic regression. When ℓ is the hinge loss ℓ(w, x, y) = max(0, 1 − yw⊤x), then we have Support Vector Machines. The optimal value of λ is data-dependent, and there is no good pre-defined way to select λ a priori. In practice, the optimal λ is determined by training a small number of classifiers with different λ values, and picking the one that has the best performance on a held-out validation dataset.
[4] present two algorithms for computing differentially private approximations to these regularized convex optimization problems for fixed λ: output perturbation and objective perturbation.
We restate output perturbation as Algorithm 4 (in the Supplementary Material) and objective perturbation as Algorithm 3. It was shown by [4] that provided certain conditions hold on ℓ and the data, Algorithm 4 is α-differentially private; moreover, with some additional conditions on ℓ, Algorithm 3 is (α + 2 log(1 + c/(λn)))-differentially private, where c is a constant that depends on the loss function ℓ, and λ is the regularization parameter.

Algorithm 3 Objective Perturbation for Differentially Private Linear Classification
1: Inputs: Regularization parameter λ, training set T = {(xi, yi), i = 1, . . . , n}, privacy parameter α.
2: Let G be the following density over Rd: ρG(r) ∝ e^{−‖r‖}. Draw R ∼ G.
3: Solve the convex optimization problem:

w∗ = argmin_{w∈Rd} (λ/2)‖w‖² + (1/n) Σ_{i=1}^{n} ℓ(w, xi, yi) + (2/(αn)) R⊤w    (2)

4: Output w∗.

In the sequel, we use the notation X to denote the set {x ∈ Rd : ‖x‖ ≤ 1}.

Definition 3 A function g : Rd × X × {−1, 1} → R is said to be L-Lipschitz if for all w, w′ ∈ Rd, for all x ∈ X, and for all y, |g(w, x, y) − g(w′, x, y)| ≤ L · ‖w − w′‖.

Let V = {(x̄i, ȳi), i = 1, . . . , m} be the validation dataset. For our validation score, we choose a function of the form:

q(w, V) = −(1/m) Σ_{i=1}^{m} g(w, x̄i, ȳi)    (3)

where g is an L-Lipschitz loss function. In particular, the logistic loss and the hinge loss are 1-Lipschitz, whereas the 0/1 loss is not L-Lipschitz for any L.
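For concreteness, the validation score of Equation 3 instantiated with the 1-Lipschitz hinge loss can be written as a short sketch:

```python
import numpy as np

def hinge_loss(w, x, y):
    # 1-Lipschitz in w, but unbounded as ||w|| grows.
    return max(0.0, 1.0 - y * float(np.dot(w, x)))

def validation_score(w, V):
    # q(w, V) = -(1/m) * sum_i g(w, x_i, y_i): Equation 3 with g = hinge.
    return -float(np.mean([hinge_loss(w, x, y) for (x, y) in V]))
```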
Other examples of 1-Lipschitz but non-convex losses include the ramp loss: g(w, x, y) = min(1, max(0, 1 − yw⊤x)).
The following theorem shows that any non-negative and L-Lipschitz validation score is stable with respect to Algorithms 3 and 4 and a set of regularization parameters Λ; a detailed proof is provided in the Supplementary Material. Thus we can use Algorithm 2 along with this training procedure and any L-Lipschitz validation score to get an end-to-end differentially private algorithm for linear classification.

Theorem 4 (Stability of differentially private linear classifiers) Let Λ = {λ1, . . . , λk} be a set of regularization parameters, let λmin = min_{1≤i≤k} λi, and let g∗ = max_{(x,y)∈X, w∈Rd} g(w, x, y). If ℓ is convex and 1-Lipschitz, and if g is L-Lipschitz and non-negative, then the validation score q in Equation 3 is (β1, β2, δ/k)-stable with respect to Algorithms 3 and 4, α and Λ for:

β1 = 2L/λmin,    β2 = min( g∗, (L/λmin)(1 + d log(dk/δ)/(αn)) )

Example. If g is chosen to be the hinge loss, then β1 = 2/λmin and β2 = (1/λmin)(1 + d log(dk/δ)/(αn)). This follows from the fact that the hinge loss is 1-Lipschitz, but may be unbounded for w of unbounded norm.
If g is chosen to be the ramp loss, then β1 = 2/λmin, and β2 = 1 (assuming that λmin ≤ 1). This follows from the fact that the ramp loss is 1-Lipschitz, but bounded at 1 for any w and (x, y) ∈ X.

5 Experiments

In order to evaluate Algorithm 2 empirically, we compare the regularizer parameter values and performance of regularized logistic regression classifiers the algorithm produces with those produced by four alternative methods.
We used datasets from two domains, and used 10 times 10-fold cross-validation (CV) to reduce variability in the computed performance averages.

The Methods. Each method takes input (α, Θ, T, V), where α denotes the allowed differential privacy, T is a training set, V is a validation set, and Θ = {θ1, . . . , θk} a list of k regularizer values. Also, let oplr(α, λ, T) denote the application of the objective perturbation training procedure given in Algorithm 3 such that it yields α-differential privacy.
The first of the five methods we compare is Stability, the application of Algorithm 2 with oplr used for learning classifiers, δ chosen in an ad-hoc manner to be 0.01, average negative ramp loss used as validation score q, and with α1 = α2 = α/2.
The four other methods work by performing the following 4 steps: (1) for each θi ∈ Θ, train a differentially private classifier fi = oplr(αi, θi, Ti), (2) determine the number of errors ei each fi makes on validation set V, (3) randomly choose i∗ from {1, 2, . . . , k} with probability P(i∗ = i | pi), and (4) output (θi∗, fi∗).
What differentiates the four alternative methods is how αi, Ti, and pi are determined. For alphaSplit: αi = α/k, Ti = T, pi ∝ e^{−αei/2}; dataSplit: αi = α, partition T into k equally sized sets Ti, pi ∝ e^{−αei/2} (used in [4]); Random: αi = α, Ti = T, pi ∝ 1; and Control: αi = α, Ti = T, pi ∝ 1(i = argmax_j q(fj, V)). Note that for alphaSplit, α/k > α′, where α′ is the solution of α = k(e^{α′} − 1)α′ + √(2k log(1/δ)) α′, for all of our experimental settings, except when α = 0.3, in which case α/k > α′ − 0.0003. The method Control is not private, and serves to provide an approximate upper bound on the performance of Stability. The three other alternative methods are differentially private, as we state in the following theorem.

Theorem 5 (Privacy of alternative methods) If T and V are disjoint, both alphaSplit and dataSplit are α-differentially private. Random is α-differentially private even if T and V are not disjoint, in which case alphaSplit and dataSplit are 2α-differentially private.

Procedures and Data. We performed 10 times 10-fold CV as follows. For round i in each of the CV experiments, fold i was used as a test set W on which the produced classifiers were evaluated, fold (i mod 10) + 1 was used as V, and the remaining 8 folds were used as T. Furthermore, k = 10 with Θ = {0.001, 0.112, 0.223, 0.334, 0.445, 0.556, 0.667, 0.778, 0.889, 1}. Note that the order of Θ is chosen such that i < j implies θi < θj. By Theorems 2 and 5, all methods except Control produce an (α, δ)-differentially private classifier. Classifier performance was evaluated using the area under the receiver operating characteristic curve [25] (AUC) as well as mean squared error (MSE). All computations were done using the R environment [22], and data sets were scaled such that covariate vectors were constrained to the unit ball. We used the following data available from the UCI Machine Learning Repository [9]:
Adult – 98 predictors (14 original, including categorical variables that needed to be recoded). The data set describes measurements on cases taken from the 1994 Census data base. The classification is whether or not a person has an annual income exceeding 50000 USD, which has a prevalence of 0.22. Each experiment involves computing more than 24000 classifiers.
In order to reduce computation time, we selected 52 predictors using the step procedure for a model computed by glm with family binomial and logit link function.
Magic – 10 predictors on 19020 cases. The data set describes simulated high-energy gamma particles registered by a ground-based atmospheric Cherenkov gamma telescope. The classification is whether particles are primary gammas (signal) or from hadronic showers initiated by cosmic rays in the upper atmosphere (background). The prevalence of primary gammas is 0.35.

(a) Averages of AUC for the two data sets.

(b) Averages of MSE for the two data sets.

Figure 1: A summary of 10 times 10-fold cross-validation experiments for different privacy levels α. Each point in the figure represents a summary of 100 data points. The error bars indicate a bootstrap sample estimate of the 95% confidence interval of the mean. A small amount of jitter was added to positions on the x-axes to avoid over-plotting.

Results Figure 1 summarizes classifier performances and regularizer choices for the different values of the privacy parameter α, aggregated over all cross-validation runs. Figure 1a shows average performance in terms of AUC, and Figure 1b shows average performance in terms of MSE.
Looking at AUC in our experiments, Stability significantly outperformed alphaSplit and dataSplit. However, Stability only outperformed Random for α > 1 in the Magic data set, and was in fact outperformed by Random in the Adult data set. In the Adult data set, regularizer choice did not seem to matter, as Random performed as well as Control. For MSE, on the other hand, Stability outperformed the differentially private alternatives in all experiments. We suggest the following intuition regarding these results.
The calibration of a logistic regression model instance, i.e., the difference between predicted probabilities and a 0/1 encoding of the corresponding labels, is not captured well by AUC (or 0/1 error rate), as AUC is insensitive to all strictly monotonically increasing transformations of the probabilities. MSE is often used as a measure of probabilistic model calibration, and can be decomposed into two terms: reliability (a calibration term) and refinement (a discrimination measure), the latter of which is related to the AUC. In the Adult data set, the minor change in AUC of Control and Random for α > 0.5, together with the apparent insensitivity of AUC to regularizer value, suggests that any improvement in Stability performance can only come from (the observed) improved calibration. Unlike in the Adult data set, there is an AUC performance gap between Control and Random in the Magic data set. This means that regularizer choice matters for discrimination, and we observe improvement for Stability in both discrimination and calibration.

Acknowledgements This work was supported by NIH grants R01 LM07273 and U54 HL108460, the Hellman Foundation, and NSF IIS 1253942.

References
[1] R. Bhaskar, S. Laxman, A. Smith, and A. Thakurta. Discovering frequent patterns in sensitive data. In KDD, 2010.
[2] A. Blum, C. Dwork, F. McSherry, and K. Nissim. Practical privacy: the SuLQ framework. In PODS, 2005.
[3] K. Chaudhuri and D. Hsu. Convergence rates for differentially private statistical estimation. In ICML, 2012.
[4] K. Chaudhuri, C. Monteleoni, and A. D. Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12:1069–1109, March 2011.
[5] K. Chaudhuri, A. D. Sarwate, and K. Sinha.
Near-optimal algorithms for differentially-private principal components. Journal of Machine Learning Research, 2013 (to appear).
[6] L. Devroye and G. Lugosi. Combinatorial methods in density estimation. Springer, 2001.
[7] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography, Berlin, Heidelberg, 2006.
[8] C. Dwork, G. Rothblum, and S. Vadhan. Boosting and differential privacy. In FOCS, 2010.
[9] A. Frank and A. Asuncion. UCI machine learning repository, 2013.
[10] A. Friedman and A. Schuster. Data mining with differential privacy. In KDD, 2010.
[11] S. R. Ganta, S. P. Kasiviswanathan, and A. Smith. Composition attacks and auxiliary information in data privacy. In KDD, 2008.
[12] M. Hardt and A. Roth. Beyond worst-case analysis in private singular vector computation. In STOC, 2013.
[13] M. Hardt and G. Rothblum. A multiplicative weights mechanism for privacy-preserving data analysis. In FOCS, pages 61–70, 2010.
[14] M. Hay, V. Rastogi, G. Miklau, and D. Suciu. Boosting the accuracy of differentially private histograms through consistency. PVLDB, 3(1):1021–1032, 2010.
[15] P. Jain, P. Kothari, and A. Thakurta. Differentially private online learning. In COLT, 2012.
[16] M. C. Jones, J. S. Marron, and S. J. Sheather. A brief survey of bandwidth selection for density estimation. JASA, 91(433):401–407, 1996.
[17] M. Kapralov and K. Talwar. On differentially private low rank approximation. In SODA, 2013.
[18] D. Kifer, A. Smith, and A. Thakurta. Private convex optimization for empirical risk minimization with applications to high-dimensional regression. In COLT, 2012.
[19] J. Lei. Differentially private M-estimators. In NIPS 24, 2011.
[20] A. Machanavajjhala, D. Kifer, J. M. Abowd, J. Gehrke, and L. Vilhuber. Privacy: Theory meets practice on the map. In ICDE, 2008.
[21] F. McSherry and K. Talwar.
Mechanism design via differential privacy. In FOCS, 2007.
[22] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation.
[23] B. Rubinstein, P. Bartlett, L. Huang, and N. Taft. Learning in a large function space: Privacy-preserving mechanisms for SVM learning. Journal of Privacy and Confidentiality, 2012.
[24] A. D. Sarwate and K. Chaudhuri. Signal processing and machine learning with differential privacy: Algorithms and challenges for continuous data. IEEE Signal Process. Mag., 2013.
[25] J. A. Swets and R. M. Pickett. Evaluation of Diagnostic Systems: Methods from Signal Detection Theory. Academic Press, New York, 1982.
[26] B. A. Turlach. Bandwidth selection in kernel density estimation: A review. In CORE and Institut de Statistique. Citeseer, 1993.
[27] S. Vinterbo. Differentially private projected histograms: Construction and use for prediction. In ECML, 2012.
[28] L. Wasserman and S. Zhou. A statistical framework for differential privacy. JASA, 105(489):375–389, 2010.
[29] J. Xu, Z. Zhang, X. Xiao, Y. Yang, and G. Yu. Differentially private histogram publication. In ICDE, 2012.

6 Appendix

6.1 An Example to Show Training Stability is not a Direct Consequence of Differential Privacy

We now present an example to illustrate that training stability is a property of the training algorithm and not a direct consequence of differential privacy. We present a problem and two α-differentially private training algorithms which approximately optimize the same function; the first algorithm is based on the exponential mechanism, and the second on a maximum-of-Laplace-random-variables mechanism. We show that while both provide α-differential privacy guarantees, the first algorithm does not satisfy training stability while the second one does.
Let i ∈ {1, . . .
, l}, and let f : X^n × R → [0, 1] be a function such that for all i and all datasets D and D′ of size n that differ in the value of a single individual, |f(D, i) − f(D′, i)| ≤ 1/n.
Consider the following training and validation problem. Given a sensitive dataset D, the private training procedure A outputs a tuple (i∗, t1, . . . , tl), where i∗ is the output of the α/2-differentially private exponential mechanism [21] run to approximately maximize f(D, i), and each ti is equal to f(D, i) plus an independent Laplace random variable with standard deviation 2l/(αn). For any validation dataset V, the validation score q((i∗, t1, . . . , tl), V) = ti∗.
It follows from standard results that A is α-differentially private. Moreover, A can be represented by a tuple TA = (GA, FA), where GA is the following density over sequences of real numbers of length l + 1:

GA(r0, r1, . . . , rl) = 1{0 ≤ r0 ≤ 1} · (1/2^l) e^{−(|r1| + |r2| + . . . + |rl|)}

Thus GA is the product of the uniform density on [0, 1] and l standard Laplace densities. Consider the following map E0. For r ∈ [0, 1], let

E0(r) = i, if (Σ_{j<i} e^{αnf(D,j)/4}) / (Σ_{j=1}^{l} e^{αnf(D,j)/4}) ≤ r < (Σ_{j≤i} e^{αnf(D,j)/4}) / (Σ_{j=1}^{l} e^{αnf(D,j)/4})