{"title": "Privacy-Preserving Q-Learning with Functional Noise in Continuous Spaces", "book": "Advances in Neural Information Processing Systems", "page_first": 11327, "page_last": 11337, "abstract": "We consider differentially private algorithms for reinforcement learning in continuous spaces, such that neighboring reward functions are indistinguishable. This protects the reward information from being exploited by methods such as inverse reinforcement learning. Existing studies that guarantee differential privacy are not extendable to infinite state spaces, as the noise level to ensure privacy will scale accordingly to infinity. Our aim is to protect the value function approximator, without regard to the number of states queried to the function. It is achieved by adding functional noise to the value function iteratively in the training. We show rigorous privacy guarantees by a series of analyses on the kernel of the noise space, the probabilistic bound of such noise samples, and the composition over the iterations. We gain insight into the utility analysis by proving the algorithm's approximate optimality when the state space is discrete. Experiments corroborate our theoretical findings and show improvement over existing approaches.", "full_text": "Privacy-preserving Q-Learning with Functional Noise\n\nin Continuous Spaces\n\nBaoxiang Wang\n\nThe Chinese University of Hong Kong\n\nBorealis AI, Edmonton\n\nbxwang@cse.cuhk.edu.hk\n\nNidhi Hegde\n\nBorealis AI, Edmonton\n\nnidhi.hegde@borealisai.com\n\nAbstract\n\nWe consider differentially private algorithms for reinforcement learning in contin-\nuous spaces, such that neighboring reward functions are indistinguishable. This\nprotects the reward information from being exploited by methods such as inverse\nreinforcement learning. Existing studies that guarantee differential privacy are\nnot extendable to in\ufb01nite state spaces, as the noise level to ensure privacy will\nscale accordingly to in\ufb01nity. 
Our aim is to protect the value function approximator, without regard to the number of states queried to the function. This is achieved by adding functional noise to the value function iteratively during training. We show rigorous privacy guarantees by a series of analyses on the kernel of the noise space, the probabilistic bound of such noise samples, and the composition over the iterations. We gain insight into the utility analysis by proving the algorithm's approximate optimality when the state space is discrete. Experiments corroborate our theoretical findings and show improvement over existing approaches.

1 Introduction

Increasing interest in reinforcement learning (RL) and deep reinforcement learning has led to recent advances in a wide range of algorithms [SB18]. While a large part of the advancement has been in the space of games, the applicability of RL extends to other practical cases such as recommendation systems [ZZZ+18, LSTS15] and search engines [RJG+18, HDZ+18]. With the popularity of RL algorithms increasing, so have concerns about their privacy. Namely, the released value (or policy) functions are trained based on the reward signal and other inputs, which commonly rely on sensitive data. For example, an RL recommendation system may use reward signals simulated from users' historical records. This historical information can thus be inferred by recursively querying the released functions. We consider differential privacy [DMNS06, DR14], a natural and standard privacy notion, to protect such information in RL methods.
RL methods learn by carrying out actions, receiving rewards observed for that action in a given state, and transitioning to the next states. Observation of the learned value function can reveal sensitive information: the reward function is a succinct description of the task. 
It is also connected to the users' preferences and the criteria of their decision-making; the visited states carry important contextual information on the users, such as age, gender, and occupation; the transition function includes the dynamics of the system and the impact of the actions on the environment. Among these, the reward function is the most vulnerable and valuable component, and studies have been conducted to infer this information [AN04, NR00]. In this paper, our aim is to design differentially private algorithms for RL, such that neighboring reward functions are indistinguishable.
There is a recent line of research on privacy-preserving algorithms that protect the reward function. Balle et al. [BGP16] train the private value function using a fixed set of trajectories. However, when a new state is queried, this privacy guarantee no longer holds. Similar results are also considered in

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

contextual multi-armed bandits [SS18, SS19, PFW+18], where the context vector is analogous to the state. The gap that these works leave leads us to design a private algorithm that does not depend on the number of states queried to the value function.
In order to achieve this in continuous space settings, we investigate the Gaussian process mechanism proposed by Hall et al. [HRW13]. The mechanism adds functional noise to the value function approximation, hence the function can be evaluated at arbitrarily many states while preserving privacy. We show that our choice of the reproducing kernel Hilbert space (RKHS) embeds common neural networks, hence a nonlinear value function can also be used. 
We therefore adapt Q-learning [MKS+15, WD92, Bai95] so that the value function is protected after each update, even when new states are visited.
We rigorously show differential privacy guarantees of our algorithm with a series of techniques. Notably, we derive a probabilistic bound on the sample paths, thus ensuring that the RKHS norm of the noised function can be bounded. This bound is significantly better than a union bound over all noise samples. Further, we analyze the composition of the privacy costs of the mechanism. There is no known composition result for the functional mechanism, other than the general theorems that apply to any mechanism [KOV13, DRV10, BKN10]. Inspired by these theorems, we derive a privacy guarantee that is better than existing results. On the utility analysis, though there is no known performance analysis of deep reinforcement learning, we gain insights by proving the utility guarantee under the tractable discrete state space setting. Empirically, experiments corroborate our theoretical findings and show improvement over existing methods.
Related Works. There is a recent line of research that discusses privacy-preserving approaches to online learning and stochastic multi-armed bandit problems [SB18, Sze10]. The algorithms protect neighboring reward sequences from being distinguished, which is related to our definition of neighboring reward functions. In bandit problems, the algorithms preserve privacy via mechanisms that add noise to the estimates of the reward distribution [TD17, TD16, MT15, TS13, KGGW15]. This line of work shares similar motivations with our work, but it does not scale to the continuous space because of the √N or √(N ln N) factor involved, where N is the number of arms. Similarly, in the online learning setting, the algorithms preserve the privacy of the evaluated sequence of the oracle [GUK17, ALMT17, AS17, JKT12]. 
Their analyses are based on optimizing a fixed objective and thus do not apply to our setting.
More closely related are privacy studies on contextual bandits [SS19, SS18], where a contextual vector is analogous to the states in reinforcement learning. Equivalently, differentially private policy evaluation [BGP16] considers a similar setting where the value function is learned on a one-step MDP. It is worth noting that they also consider privacy with respect to the states and the actions, though in this paper we focus only on the rewards. The major challenge in extending these works is that reinforcement learning requires an iterative process of policy evaluation and policy improvement. The additional states that are queried to the value function are not guaranteed to be visited and protected by previous iterations. We propose an approach for both the evaluation and the improvement, while also extending the algorithm to nonlinear approximations like neural networks.
Differential privacy in a Markov decision process (MDP) has been discussed [Ven13] via the input perturbation technique. In that work, the reward is reformulated as a weighted sum of the utility and the privacy measure. With this formulation, it amounts to learning the MDP under this weighted reward. Essentially, input perturbation causes a relatively large utility loss and is therefore less preferred. Similarly, output perturbation can be used to preserve privacy, as shown in our analysis, though the necessary noise level is relatively larger and also depends on more factors than our algorithm does. Therefore, more subtle techniques are required to improve on the methods of input and output perturbation.
A general approach that can be applied to continuous spaces is the differentially private deep learning framework [ACG+16, CBK+19]. The method perturbs the gradient estimator in the updates of the neural network parameters to preserve privacy. 
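As a point of comparison, the gradient-perturbation step of this framework can be sketched as follows. This is a minimal illustration, not the reference implementation of [ACG+16]; the function name and parameter values are ours:

```python
import numpy as np

def private_gradient_step(per_example_grads, clip_norm, noise_multiplier, rng):
    """One gradient-perturbation step in the style of [ACG+16] (a sketch).

    Each per-example gradient is clipped to l2-norm `clip_norm`, the clipped
    gradients are averaged over the batch, and Gaussian noise calibrated to
    the clipping threshold is added to the average.
    """
    clipped = [g * min(1.0, clip_norm / max(np.linalg.norm(g), 1e-12))
               for g in per_example_grads]
    batch_size = len(clipped)
    mean_grad = np.mean(clipped, axis=0)
    # The noise standard deviation scales as noise_multiplier * clip_norm / B,
    # reflecting the 1/B privacy amplification available in the finite-dataset
    # setting (but not, as discussed next, in our setting).
    noise = rng.normal(0.0, noise_multiplier * clip_norm / batch_size,
                       size=mean_grad.shape)
    return mean_grad + noise
```

Under neighboring reward functions, every per-example gradient in a batch can change simultaneously, which is why the 1/B factor in the noise scale above is lost in our setting.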
In our problem, applying the method would require large noise levels. In fact, the algorithm considers neighboring inputs in which at most one data point can differ, and therefore benefits from a 1/B factor via privacy amplification [KLN+11, BKN10], where B is the batch size. This no longer holds in reinforcement learning, as all reward signals can be different for neighboring reward functions, causing the noise level to scale back up by a factor of B.

2 Preliminaries

2.1 Markov Decision Process and Reinforcement Learning

A Markov decision process (MDP) is a framework to model decisions in an environment. We use the canonical setting of the discrete-time Markov decision process. An MDP is denoted by the tuple (S, A, T, r, ρ0, γ), which includes the state space S, the action space A = {1, . . . , m}, the stochastic transition kernel T : S × A × S → R+, the reward function r : S × A → R, the initial state distribution ρ0 : S → R+, and the discount factor γ ∈ [0, 1). Denote by m the number of actions in the action space. The objective is to maximize the expected discounted cumulative reward. Further define the policy function π : S × A → R+ and the corresponding action-state value function as

Qπ(s, a) = Eπ[Σ_{t≥0} γ^t r(st, at) | s0 = s, a0 = a].

When the context is clear, we omit π and write Q(s, a) instead.
We use the continuous state space setting throughout this paper, except in Appendix D. We investigate a bounded and continuous state space S ⊆ R and without loss of generality assume that S = [0, 1]. The value function Q(s, a) is treated as a set of m functions Qa(s), where each function is defined on [0, 1]. The reward function is similarly written as a set of m functions, each defined on [0, 1]. 
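To make the objective concrete, the quantity inside the expectation defining Qπ(s, a) above can be sketched as below; the function name is ours, for illustration only:

```python
def discounted_return(rewards, gamma):
    """Discounted cumulative reward sum_{t>=0} gamma^t * r_t for one
    trajectory, matching the expression inside the definition of Q^pi."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))
```

Qπ(s, a) is then the expectation of this quantity over trajectories that start with s0 = s, a0 = a and follow π.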
We do not impose any particular assumptions on the reward function.
Our algorithm is based on deep Q-learning [MKS+15, Bai95, WD92], which solves the Bellman equation. Our differential privacy guarantee can also be generalized to other Q-learning algorithms. The objective of deep Q-learning is to minimize the Bellman error

(1/2)(Q(s, a) − E[r + γ max_{a′} Q(s′, a′)])²,

where s′ ∼ T(s, a, s′) denotes the consecutive state after executing action a at state s. Similar to [MKS+15], we use a neural network to parametrize Q(s, a). We focus on outputting a learned value function such that the reward functions r(·) and r′(·) cannot be distinguished by observing Q(s, a), as long as ‖r − r′‖∞ ≤ 1. Here, without ambiguity, we write r(·), r′(·) as r, r′, and the infinity norm ‖f(s)‖∞ is defined as sup_s |f(s)|.

2.2 Differential Privacy

Differential privacy [DKM+06, DMNS06] has developed into a strong standard for privacy guarantees in data analysis. It provides a rigorous framework for privacy guarantees under various adversarial attacks.
The definition of differential privacy is based on the notion that in order to preserve privacy, data analysis should not differ at the aggregate level whether or not any given user is present in the input. This condition on the presence of any user is formalized through the notion of neighboring inputs, whose definition varies according to the problem setting. Let d, d′ ∈ D be neighboring inputs.
Definition 1. 
A randomized mechanism M : D → U satisfies (ε, δ)-differential privacy if for any two neighboring inputs d and d′ and for any subset of outputs Z ⊆ U it holds that

P(M(d) ∈ Z) ≤ exp(ε) P(M(d′) ∈ Z) + δ.

An important parameter of a mechanism is the (global) sensitivity of its output.
Definition 2. For all pairs d, d′ ∈ D of neighboring inputs, the sensitivity of a mechanism M is defined as

ΔM = sup_{d,d′∈D} ‖M(d) − M(d′)‖,   (1)

where ‖ · ‖ is a norm function defined on U.
Vector-output mechanisms. For converting vector-valued functions into an (ε, δ)-DP mechanism, one of the standard approaches is the Gaussian mechanism. This mechanism adds N(0, σ²I) to the output M(d). In this case U = R^n and ‖ · ‖ in (1) is the ℓ2-norm ‖ · ‖2.
Proposition 3 (Vector-output Gaussian mechanism; Theorem A.1 of [DR14]). If 0 < ε < 1 and σ ≥ √(2 ln(1.25/δ)) ΔM/ε, then M(d) + y is (ε, δ)-differentially private, where y is drawn from N(0, σ²I).
Function-output mechanisms. In this setting the output of the mechanism is a function, which means the mechanism is a functional. We consider the case where U is an RKHS and ‖ · ‖ in (1) is the RKHS norm ‖ · ‖H. Hall et al. [HRW13] have shown that adding Gaussian process noise G(0, σ²K) to the output M(d) is differentially private, when K is the RKHS kernel of U. Let G denote the Gaussian process distribution.
Proposition 4 (Function-output Gaussian process mechanism [HRW13]). 
If 0 < ε < 1 and σ ≥ √(2 ln(1.25/δ)) ΔM/ε, then M(d) + g is (ε, δ)-differentially private, where g is drawn from G(0, σ²K) and U is an RKHS with kernel function K.
Note that in [HRW13] the stated condition was σ ≥ √(2 ln(2/δ)) ΔM/ε. The improvement of the constant from 2 to 1.25 is natural, but for completeness we include a proof in Appendix B.

3 Differentially Private Q-Learning

3.1 Our Algorithm

We present our algorithm for privacy-preserving Q-learning in the continuous state space setting in Algorithm 1. The algorithm is based on deep Q-learning proposed by Mnih et al. [MKS+15]. We achieve privacy by perturbing the learned value function at each iteration, adding a Gaussian process noise. The noise-adding is described by lines 19-20 of the algorithm, where ĝ is the noise. This noise is a discrete estimate of the continuous sample path, evaluated at the states st visited in the trajectories. Intuitively, when (s, z) is an element of the list ĝ, it implies g(s) = z for the sample path g. Lines 14-18 describe the necessary maintenance of ĝ to simulate the Gaussian process. Lines 7-9 sample a new Gaussian process sample path every J iterations, which controls the balance between the approximation factor of privacy and the utility. 
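The maintenance of ĝ amounts to Gaussian-process conditioning under the kernel used in our analysis, K(x, y) = exp(−β|x − y|). The sketch below uses generic GP conditioning on all previously evaluated states; Algorithm 1 instead uses the closed form of Equation (2), Appendix A, which exploits the Markov property of this kernel, so this helper is an illustrative (and less efficient) stand-in with names of our choosing:

```python
import numpy as np

def ou_kernel(x, y, beta):
    """Exponential (Ornstein-Uhlenbeck) kernel K(x, y) = exp(-beta |x - y|)."""
    return np.exp(-beta * np.abs(x - y))

def conditional_noise(s, queried_s, queried_z, beta, sigma, rng):
    """Sample g(s) from G(0, sigma^2 K) conditioned on the earlier
    evaluations g(s_i) = z_i stored in the list (a sketch of maintaining ĝ)."""
    if len(queried_s) == 0:
        return rng.normal(0.0, sigma)
    S = np.asarray(queried_s, dtype=float)
    z = np.asarray(queried_z, dtype=float)
    K = sigma**2 * ou_kernel(S[:, None], S[None, :], beta)   # Gram matrix
    k_star = sigma**2 * ou_kernel(S, s, beta)                # cross-covariances
    sol = np.linalg.solve(K, k_star)
    mu = sol @ z                                             # conditional mean
    var = max(sigma**2 - k_star @ sol, 0.0)                  # conditional variance
    return rng.normal(mu, np.sqrt(var))
```

Conditioning at an already-queried state returns the stored value with zero variance, which is exactly the consistency the list ĝ is meant to preserve.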
The other steps are similar to [MKS+15].

Algorithm 1 Differentially Private Q-Learning with Functional Noise
1: Input: the environment and the reward function r(·)
2: Parameters: target privacy (ε, δ), time horizon T, batch size B, action space size m, learning rate α, reset factor J
3: Output: trained value function Qθ(s, a)
4: Initialization: s0 ∈ [0, 1] uniformly, Qθ(s, a) for each a ∈ [m], linked list ĝk[B][2] = {}
5: Compute noise level σ = √(2(T/B) ln(e + ε/δ) C(α, k, L, B))/ε;
6: for j in [T/B] do
7:   if j ≡ 0 mod T/(JB) then
8:     ĝk[B][2] ← {};
9:   end if
10:  for b in [B] do
11:    t ← jT/B + b;
12:    Execute at = argmax_a Qθ(st, a) + ĝa(st);
13:    Receive rt and st+1, s ← st+1;
14:    for a ∈ [m] do
15:      Insert s into ĝa[:][1] such that the list remains monotonically increasing;
16:      Sample zat ∼ N(μat, σ dat), according to Equation (2), Appendix A;
17:      Update the list ĝa(s) ← zat;
18:    end for
19:    yt ← rt + γ max_a Qθ(st+1, a) + ĝa(st+1);
20:    lt ← (1/2)(Qθ(st, at) + ĝa(st) − yt)²;
21:  end for
22:  Run one step of SGD: θ ← θ − α (1/B) ∇θ Σ_{t=jB}^{(j+1)B} lt;
23: end for
24: Return the trained Qθ(s, a) function;

Insight into the algorithm design. To satisfy differential privacy guarantees, we require two reward functions r and r′ to be indistinguishable upon observation of the learned functions, as long as ‖r − r′‖∞ ≤ 1. The major difficulty is that the reward signal r(s, a) can appear at any s, and all the reward signals can be different under r and r′. 
Therefore, we need a stronger mechanism of privacy that does not rely on the finite setting where at most one data point in a (finite) dataset differs, as in [ACG+16] and [BGP16]. This is also the major challenge in extending [BGP16] from policy evaluation to policy improvement. The natural approach to address the challenge is to treat a function as one "data point", which leads to our utilization of the techniques studied by Hall et al. [HRW13].

3.2 Privacy, Efficiency, and Utility of the Algorithm

Privacy analysis. There are three main components in the privacy analysis. First, we have to define the RKHS to invoke the Gaussian process mechanism in Proposition 4. This RKHS should also include the value function approximation used in the algorithm, namely, neural networks. Second, we give a privacy guarantee for composing the mechanism over T/B iterations. There is no known composition result for such a functional mechanism, other than the general theorems that apply to any mechanism [KOV13, DRV10, BKN10]; we derive a privacy guarantee that is better than these existing results. Third, as the sample path is evaluated on multiple different states, the updated value function can be unbounded, which subsequently induces the RKHS norm to be unbounded. This is addressed by showing a probabilistic uniform bound of the sample path over the state space.
Our privacy guarantee is shown in the following theorem.
Theorem 5. 
The Q-learning algorithm in Algorithm 1 is (ε, δ + J exp(−(2k − 8.68√β σ)²/2))-DP with respect to two neighboring reward functions ‖r − r′‖∞ ≤ 1, provided that 2k > 8.68√β σ and

σ ≥ √(2(T/B) ln(e + ε/δ) C(α, k, L, B))/ε,

where C(α, k, L, B) = ((4α(k + 1)/B)² + 4α(k + 1)/B)L², β = (4α(k + 1)/B)⁻¹, L is the Lipschitz constant of the value function approximation, B is the batch size, T is the number of iterations, and α is the learning rate.

Theorem 5 provides a rigorous guarantee on the privacy of the reward function. We now present three statements to address the challenges mentioned above and support the theorem.
Lemma 6 and its corollary, informally stated below and formally stated in Appendix C, describe the RKHS that is necessary both to embed the function approximators we use and to invoke the mechanism in Proposition 4.
Lemma 6 (Informal statement). The Sobolev space H¹ with order 1 and the ℓ2-norm is defined as

H¹ = {f ∈ C[0, 1] : ∂f(x) exists; ∫₀¹ (∂f(x))² dx < ∞},

where ∂f(x) denotes the weak derivative, and the RKHS kernel is K(x, y) = exp(−β|x − y|).
Immediately following Lemma 6, we show that common neural networks are in this Sobolev space. That includes neural networks with nonlinear activation layers such as the ReLU, sigmoid, or tanh functions. The proof of the following corollary is also in Appendix C.
Corollary 7. Let f̂W(x) denote a neural network with finitely many finite parameters W. For f̂W(x) with finitely many layers, if the gradient of the activation function is bounded, then f̂W(x) ∈ H¹.

By the corollary, f̂W(x) is Lipschitz continuous. 
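Putting the constants of Theorem 5 together, the noise calibration computed in line 5 of Algorithm 1 can be sketched as below; the parameter values in the test are illustrative, not taken from the paper:

```python
import math

def noise_level(T, B, alpha, k, L, eps, delta):
    """sigma = sqrt(2 (T/B) ln(e + eps/delta) C(alpha, k, L, B)) / eps,
    with C(alpha, k, L, B) = ((4 alpha (k+1)/B)^2 + 4 alpha (k+1)/B) L^2,
    as in Theorem 5 (a sketch of line 5 of Algorithm 1)."""
    v = 4.0 * alpha * (k + 1) / B
    C = (v ** 2 + v) * L ** 2
    return math.sqrt(2.0 * (T / B) * math.log(math.e + eps / delta) * C) / eps
```

Since the 1/ε factor dominates the weak ln(e + ε/δ) dependence, halving the target ε roughly doubles the required σ, matching the privacy-utility tradeoff observed in the experiments.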
Denote by L the Lipschitz constant, which only depends on the network architecture. It follows immediately from Lemma 6 that, in the algorithm, for any Q(s, a) and Q′(s, a) and each a, ‖Q(·, a) − Q′(·, a)‖²H ≤ 2r0²(1 + β/2)/(1 − γ)² + L²/β, assuming bounded rewards |r(s, a)| ≤ r0. This leads to an alternative privacy guarantee, but one less preferred than Theorem 5 due to the 1/(1 − γ)² and r0 factors.
Lines 19 and 20 use ĝ, which is the list of Gaussian random variables obtained by evaluating the Gaussian process sample paths. Using a union tail bound we can derive a probabilistic bound of these variables, but it will cause the approximation factor to be δ + O(1 − (1 − exp(2k − √β σ))^T), which is unrealistically large. We show in the lemma below that with high probability the entire sample path is uniformly bounded over any state st. We can then calibrate δ to cover the exponentially small tail bound O(exp(−u²)) of the noise. The proof is in Appendix A.
Lemma 8. Let P be the probability measure over H¹ of the sample path f generated by G(0, σ²K). Then almost surely max_{x∈[0,1]} f(x) exists, and for any u > 0,

P(max_{x∈[0,1]} f(x) ≥ 8.68√β σ + u) ≤ exp(−u²/2).

Proof of Theorem 5. Let Q and Q′ denote the learned value functions of the algorithm given r and r′, respectively, where ‖r − r′‖∞ ≤ 1. To make Q and Q′ indistinguishable, we inspect the update step in line 21. 
Let Q0 denote the value function before the update; we have

‖Q − Q0‖∞ ≤ αL(2 + ĝa(st+1) − ĝa(st))/B.

As per Lemma 8, with probability at least 1 − exp(−(2k − 8.68√β σ)²/2), we have |Q − Q0| ≤ 2αL(k + 1)/B. By the triangle inequality, for any ‖r − r′‖∞ ≤ 1, the corresponding Q and Q′ satisfy ‖Q − Q′‖∞ ≤ 4αL(k + 1)/B, given that Q0 is fixed by the previous step. Let f = Q − Q′; we have

‖f‖²H ≤ (1 + β/2)(4αL(k + 1)/B)² + L²/(2β)

by the formal statement of Lemma 6, Appendix C. We choose 1/β = 4α(k + 1)/B and obtain ‖f‖²H ≤ ((4α(k + 1)/B)² + 4α(k + 1)/B)L². Now by Proposition 4, adding g ∼ G(0, σ²K) to Q makes the update step (ε′, δ′ + exp(−(2k − 8.68√β σ)²/2))-differentially private, given that σ ≥ √(2 ln(1.25/δ′)) ‖f‖H/ε′, where K(x, y) = exp(−4αL(k + 1)|x − y|/B) is our choice of the kernel function. Thus each iteration of the update has a privacy guarantee.
It remains to analyze the composition of the T/B iterations. It is shown by the composition theorem [KOV13, Mir17] that any σ ≥ √(2(T/B) ln(1.25/δ) ln(e + ε/δ)) ‖f‖H/ε is sufficient. This is the best known general bound, but we continue to derive the specific bound for our algorithm. Let z (a function, either Q or Q′) be the output of a single update of the algorithm. Denote v = 4α(k + 1)/B and T′ = T/B for simplicity. By Lemma 11, Appendix A, we have

E0[(P1(z ∈ S)/P0(z ∈ S))^λ] ≤ exp((λ² + λ)(v² + v)L²/(2σ²)),

where P0 and P1 are the probability distributions of z given r and r′, respectively; recall that ‖f‖²H ≤ (v² + v)L². This moment generating function scales exponentially when multiple independent instances of z are drawn. Namely, let z̄ be the vector of T′ independent copies of z, and let P0^T′ and P1^T′ be its probability distributions under r and r′. For any λ > 0 we have E0[(P1^T′(z̄ ∈ S)/P0^T′(z̄ ∈ S))^λ] ≤ exp((λ² + λ)(v² + v)L² T′/(2σ²)). Thus,

P1^T′(z̄ ∈ S) − exp(ε) P0^T′(z̄ ∈ S) ≤ E0[exp(λ(ln(P1^T′(z̄)/P0^T′(z̄)) − ε) + λ ln λ − (λ + 1) ln(λ + 1))].

Since the argument holds for any λ > 0, we let λ = −(1/2)(1 − 2εσ²/(T′‖f‖²H)) > 0 and obtain

P1^T′(z̄ ∈ S) − exp(ε) P0^T′(z̄ ∈ S) ≤ exp(−(σ²/(2T′‖f‖²H))(ε − T′‖f‖²H/(2σ²))²) · (1 + (σ²/(T′‖f‖²H))(ε − T′‖f‖²H/(2σ²)))⁻¹.

We desire to find ε and δ so that this difference is less than δ, using techniques similar to those in the proof of Theorem 4.3 of [KOV13]. We choose

ε = T′‖f‖²H/(2σ²) + √(2T′‖f‖²H w/σ²),   where w = ln(e + √(T′‖f‖²H/σ²)/δ).

Then the first factor equals e⁻ʷ ≤ δ/√(T′‖f‖²H/σ²) and the second factor is (1 + √(2σ²w/(T′‖f‖²H)))⁻¹, which together guarantee that P1^T′(z̄ ∈ S) − exp(ε) P0^T′(z̄ ∈ S) ≤ δ, as required for differential privacy. We solve the expression for ε above and find the sufficient condition

σ = √((2T′‖f‖²H/ε²) ln(e + ε/δ)) ≤ √(2(T/B)((4α(k + 1)/B)² + 4α(k + 1)/B)L² ln(e + ε/δ))/ε,

as desired. When this sufficient condition is satisfied, the approximation factor is no larger than δ plus J times the uniform tail bound of Lemma 8. Namely, the algorithm achieves (ε, δ + J exp(−(2k − 8.68√β σ)²/2))-DP.
Time complexity. We show that the noise-adding mechanism in our algorithm is efficient. In fact, the most complex step introduced by the noise-adding is the insertion in line 15, which is negligible compared with steps such as computing gradients and executing actions. A complete proof of the proposition below is given in Appendix A.
Proposition 9. The noised value function (during training or after release) in Algorithm 1 can respond to Nq queries in O(Nq ln Nq) time.
Utility analysis. 
To the best of our knowledge, there has not been a study that rigorously analyzes the utility of deep reinforcement learning. In fact, in the continuous state space setting, the solution of the Bellman equation is not unique in general. Hence, it is unlikely for Q-learning to achieve a guaranteed performance, even if it converges. However, we gain insights by analyzing the algorithm's learning error in the discrete state space setting. The learning error is defined as the discrepancy between the learned state value function and the ground truth of the optimal state value function. We consider the worst case J = 1 for utility (which is the best case for (ε, δ + exp(−(2k − 8.68√β σ)²/2))-differential privacy), where the noise is the most correlated through the iterations. We show an upper bound on the utility loss, which has a limit of zero as the number of states approaches infinity. The proof involves the linear program formulation of the MDP and is given in Appendix D.
Proposition 10. Let v′ and v∗ be the value function learned by our algorithm and the optimal value function, respectively. In the case J = 1, |S| = n < ∞, and γ < 1, the utility loss of the algorithm satisfies

E[(1/n)‖v′ − v∗‖1] ≤ 2√2 σ/(√(nπ)(1 − γ)).

3.3 Discussion

Extending to other RL algorithms. Our algorithm can be extended to the case where it learns both a policy function and a value function coordinately, as in the actor-critic method [MBM+16]. If, in the updates of the policy function, only the Q function is used in the policy gradient estimation, for example ∇θπ ln π(a|s) Q(s, a), then the algorithm has the same privacy guarantee. Also, any post-processing of the private Q function will not break the privacy guarantee. This includes experience replay and ε-greedy policies [MKS+15].
However, consider the case where the reward is directly accessed in the policy gradient estimation, for example ∇θπ ln π(a|s) A(s, a) in [DWS12, SML+15], where A(s, a) = Σ_{t=0}^{T} (λγ)^t (rt + v(st+1) − v(st)). In this setting, the privacy guarantee no longer holds. To extend the privacy guarantee to this case, one should add noise to the policy function as well.
Extending to high-dimensional tasks. We assumed in our analysis S = [0, 1] for simplicity. The setting extends to any bounded S ⊆ R by scaling the interval. Our approach can also be extended to high-dimensional spaces by choosing a high-dimensional RKHS and kernel, for example the kernel function exp(−β|x − y|) where x and y now belong to R^n and | · | is the Manhattan distance. It is also possible to use other RKHSs and kernels for the Gaussian process noise, such as the space of band-limited functions. Other than the re-calibration of the noise level to the new kernel, the privacy guarantee in the theorem holds in general for the respective definition of ‖ · ‖H. We note that the time complexity derived in Proposition 9 does not extend to other kernel functions, which require the algorithm to take O(Nq²) time in the noise generating process.

Figure 1: Empirical results of Algorithm 1 at different noise levels. The y-axis denotes the return. The x-axis is the number of samples the agent has trained on. Each episode has 50 samples. The shadow denotes 1 std. The learning curves are averaged over 10 random seeds. The curves are generated without smoothing.

4 Experiments

We present empirical results to corroborate our theoretical guarantees and to demonstrate the performance of the proposed algorithm on a small example. The exact MDP we use is described in Appendix E.1. 
The implementation is attached along with the manuscript submission. We first plot the learning curves with a variety of noise levels and J values in Figure 1. As the noise level increases, the algorithm requires more samples to achieve the same return as the non-private version. This demonstrates the empirical privacy-utility tradeoff. We observe that with the noise being reset every round (J = T/B), the algorithm is likely to converge with limited sub-optimality as desired, especially when σ < 0.4. Therefore, as J exp(−(2k − 8.68√β σ)²/2) is exponentially small, we suggest using J = T/B in practice to achieve better utility.

The algorithm is then compared with a variety of baseline methods, all targeting the same (ε, δ) privacy guarantee, as shown in Figures 2(a) and 2(b). The policy evaluation method proposed by Balle, Gomrokchi and Precup [BGP16] is not differentially private in our context (though it is (ε, δ)-DP with respect to the reward sequences). We compare with it to illustrate the utility; their approach shows performance similar to ours. Note that the studies on contextual bandits by Sajed and Sheffet [SS19] and by Shariff and Sheffet [SS18] consider a one-step MDP equivalent to that of [BGP16] and thus yield the same method. We also compare our approach with the input perturbation method proposed by Venkitasubramaniam [Ven13] and the differentially private deep learning framework by Abadi et al. [ACG+16]. Both approaches are differentially private in our setting, while our algorithm significantly outperforms them. In particular, in the higher-privacy regime ε = 0.45, neither baseline improves during training due to the large noise level needed.
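The privacy-utility tradeoff observed here is consistent with the utility bound of Proposition 10; a quick numerical check (the σ, n, and γ values are illustrative only):

```python
import numpy as np

def utility_bound(sigma, n, gamma):
    """Upper bound on E[(1/n) ||v' - v*||_1] as stated in Proposition 10:
    linear in the noise level sigma and vanishing as the number of
    states n grows."""
    return 2 * np.sqrt(2) * sigma / (np.sqrt(n * np.pi) * (1 - gamma))

# more noise -> larger worst-case utility loss
assert utility_bound(0.8, 100, 0.9) > utility_bound(0.4, 100, 0.9)
# more states -> smaller average loss, with limit zero as n grows
assert utility_bound(0.4, 10_000, 0.9) < utility_bound(0.4, 100, 0.9)
```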
The baseline implementations and the exact calculation of the parameters are detailed in Appendix E.2 and E.3, respectively.

5 Conclusion

We have developed a rigorous and efficient algorithm for differentially private Q-learning in continuous state space settings. Releasing and querying the algorithm's output value function will not distinguish two neighboring reward functions. To achieve this, our method applies functional noise taken from sample paths of a Gaussian process, calibrated appropriately according to the sensitivity calculated under the RKHS measure. Theoretically, we show the privacy guarantee and give insights into the utility analysis. Empirically, experiments corroborate our theoretical findings and show improvement over existing methods. Our approach is general enough to be extended to other domains beyond reinforcement learning.

(a) Target ε = 0.9, δ = 1 · 10⁻⁴    (b) Target ε = 0.45, δ = 1 · 10⁻⁴

Figure 2: Empirical comparisons with other methods. Same configurations as Figure 1.

Acknowledgement

We would like to thank Ruitong Huang, who provided helpful insights on the composition analysis and the algorithm design, and Kry Yik Chau Lui, who suggested extending our approach to high-dimensional Sobolev spaces.

References

[ACG+16] Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318. ACM, 2016.

[ALMT17] Jacob Abernethy, Chansoo Lee, Audra McMillan, and Ambuj Tewari. Online linear optimization through the differential privacy lens. arXiv preprint arXiv:1711.10019, 2017.

[AN04] Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning.
In International Conference on Machine Learning, page 1. ACM, 2004.

[AS17] Naman Agarwal and Karan Singh. The price of differential privacy for online learning. In International Conference on Machine Learning, pages 32–40, 2017.

[Bai95] Leemon Baird. Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings 1995, pages 30–37. Elsevier, 1995.

[BGP16] Borja Balle, Maziar Gomrokchi, and Doina Precup. Differentially private policy evaluation. In International Conference on Machine Learning, pages 2130–2138, 2016.

[BKN10] Amos Beimel, Shiva Prasad Kasiviswanathan, and Kobbi Nissim. Bounds on the sample complexity for private learning and private data release. In Theory of Cryptography Conference, pages 437–454. Springer, 2010.

[CBK+19] MAP Chamikara, P Bertok, I Khalil, D Liu, and S Camtepe. Local differential privacy for deep learning. arXiv preprint arXiv:1908.02997, 2019.

[DKM+06] Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. Our data, ourselves: Privacy via distributed noise generation. In Annual International Conference on the Theory and Applications of Cryptographic Techniques, pages 486–503. Springer, 2006.

[DMNS06] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265–284. Springer, 2006.

[DR14] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3–4):211–407, 2014.

[DRV10] Cynthia Dwork, Guy N Rothblum, and Salil Vadhan. Boosting and differential privacy. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pages 51–60. IEEE, 2010.

[DWS12] Thomas Degris, Martha White, and Richard S Sutton. Off-policy actor-critic.
arXiv preprint arXiv:1205.4839, 2012.

[GUK17] Pratik Gajane, Tanguy Urvoy, and Emilie Kaufmann. Corrupt bandits for preserving local privacy. arXiv preprint arXiv:1708.05033, 2017.

[HDZ+18] Yujing Hu, Qing Da, Anxiang Zeng, Yang Yu, and Yinghui Xu. Reinforcement learning to rank in e-commerce search engine: Formalization, analysis, and application. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 368–377. ACM, 2018.

[HRW13] Rob Hall, Alessandro Rinaldo, and Larry Wasserman. Differential privacy for functions and functional data. Journal of Machine Learning Research, 14(Feb):703–727, 2013.

[JKT12] Prateek Jain, Pravesh Kothari, and Abhradeep Thakurta. Differentially private online learning. In Conference on Learning Theory, pages 24–1, 2012.

[KGGW15] Matt Kusner, Jacob Gardner, Roman Garnett, and Kilian Weinberger. Differentially private Bayesian optimization. In International Conference on Machine Learning, pages 918–927, 2015.

[KLN+11] Shiva Prasad Kasiviswanathan, Homin K Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. What can we learn privately? SIAM Journal on Computing, 40(3):793–826, 2011.

[KOV13] Peter Kairouz, Sewoong Oh, and Pramod Viswanath. The composition theorem for differential privacy. arXiv preprint arXiv:1311.0776, 2013.

[LSTS15] Elad Liebman, Maytal Saar-Tsechansky, and Peter Stone. DJ-MC: A reinforcement-learning agent for music playlist recommendation. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pages 591–599, 2015.

[MBM+16] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.

[Mir17] Ilya Mironov. Rényi differential privacy.
In Computer Security Foundations Symposium (CSF), 2017 IEEE 30th, pages 263–275. IEEE, 2017.

[MKS+15] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

[MT15] Nikita Mishra and Abhradeep Thakurta. (Nearly) optimal differentially private stochastic multi-arm bandits. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pages 592–601. AUAI Press, 2015.

[NR00] Andrew Y Ng and Stuart J Russell. Algorithms for inverse reinforcement learning. In International Conference on Machine Learning, pages 663–670. Morgan Kaufmann Publishers Inc., 2000.

[PFW+18] Yangchen Pan, Amir-massoud Farahmand, Martha White, Saleh Nabi, Piyush Grover, and Daniel Nikovski. Reinforcement learning with function-valued action spaces for partial differential equation control. arXiv preprint arXiv:1806.06931, 2018.

[RJG+18] Corby Rosset, Damien Jose, Gargi Ghosh, Bhaskar Mitra, and Saurabh Tiwary. Optimizing query evaluations using reinforcement learning for web search. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 1193–1196. ACM, 2018.

[SB18] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

[SML+15] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.

[SS18] Roshan Shariff and Or Sheffet. Differentially private contextual linear bandits. In Advances in Neural Information Processing Systems, pages 4301–4311, 2018.

[SS19] Touqir Sajed and Or Sheffet.
An optimal private stochastic-MAB algorithm based on an optimal private stopping rule. In International Conference on Machine Learning, 2019.

[Sze10] Csaba Szepesvári. Algorithms for reinforcement learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 4(1):1–103, 2010.

[TD16] Aristide CY Tossou and Christos Dimitrakakis. Algorithms for differentially private multi-armed bandits. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.

[TD17] Aristide Charles Yedia Tossou and Christos Dimitrakakis. Achieving privacy in the adversarial multi-armed bandit. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

[TS13] Abhradeep Guha Thakurta and Adam Smith. (Nearly) optimal algorithms for private online learning in full-information and bandit settings. In Advances in Neural Information Processing Systems, pages 2733–2741, 2013.

[Ven13] Parv Venkitasubramaniam. Privacy in stochastic control: A Markov decision process perspective. In Communication, Control, and Computing (Allerton), 2013 51st Annual Allerton Conference on, pages 381–388. IEEE, 2013.

[WD92] Christopher JCH Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.

[ZZZ+18] Guanjie Zheng, Fuzheng Zhang, Zihan Zheng, Yang Xiang, Nicholas Jing Yuan, Xing Xie, and Zhenhui Li. DRN: A deep reinforcement learning framework for news recommendation. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, pages 167–176. International World Wide Web Conferences Steering Committee, 2018.