pi.
3. If vi = b then h · (xi, 1) − yi < −pi,

if such exists, and φ otherwise. (Here S = (xi, yi, pi)_{i=1}^m and v = (v1, v2, . . . , vm) ∈ {1, 0, a, b}^m.)
The following algorithm partitions R^n according to GH(S), where in each iteration it "discovers" one more point in the sequence S.

Algorithm: EMPIRICAL PAYOFF MAXIMIZATION (EPM)
Input: S = (xi, yi, pi)_{i=1}^m
Output: Empirical payoff maximizer w.r.t. S
1  v ← {0}^m                                 // v = (v1, v2, . . . , vm)
2  R0 ← {v}
3  for i = 1 to m do
4      Ri ← ∅
5      for v ∈ R_{i−1} do
6          for α ∈ {1, a, b} do
7              if PVF(S, (v_{−i}, α)) ≠ φ then
8                  add (v_{−i}, α) to Ri     // (v_{−i}, α) = (v1, . . . , v_{i−1}, α, v_{i+1}, . . . , vm)
9  return v* ∈ arg max_{v ∈ Rm} ‖v‖1

Theorem 2. When running EPM on a sequence of examples S, it finds an empirical best response in poly(|S|) time.

Figure 2: An example of simple linear regression with linear strategies. On the left we have a sample sequence of size 3, along with the strategy h̄ = (ā, b̄) of the opponent (the solid line) and a best response strategy of the agent (the dashed line). On the right the hypothesis space is presented, where each pair (a, b) represents a possible strategy, and each bounded set Ri is defined by Ri = {(a, b) ∈ R² : |a · xi + b − yi| < pi}, i.e. the set of hypotheses which give xi a better prediction than h̄. Notice that (ā, b̄) lies on the boundaries of all Ri, 1 ≤ i ≤ 3. In addition, since (a*, b*) is inside R1 ∩ R2 ∩ R3, the strategy h* = (a*, b*), i.e. the line y = a* · x + b*, predicts all the points better than the opponent.
Observe that by taking any convex combination of h*, h̄, the agent not only preserves her empirical payoff but also improves her MSE score.

When we combine Theorem 2 with Lemmas 1 and 2, we get:

Corollary 1. Given ε, δ ∈ (0, 1), if we run EPM on m ≥ C/ε² · (max{⌊2n · log(n)⌋, 20} + log(1/δ)) examples sampled i.i.d. from D (for a constant C), then it outputs h* that, with probability at least 1 − δ, satisfies

π_D(h*) ≥ sup_{h′ ∈ H} π_D(h′) − ε.

A desirable achievement would be if the best response prediction algorithm also kept the loss small in the original (e.g. MSE) measure. We now show that in some cases the agent can, by slightly modifying the output of EPM, find a strategy that is not only an approximate best response, but is also robust with respect to additive functions of the discrepancies. See Figure 2 for an illustration.

Lemma 3. Assume the opponent uses a linear predictor h̄, and denote by h* the strategy output by EPM. Then, h* can be efficiently modified to a strategy which is not only an empirical best response, but also performs arbitrarily close to h̄ w.r.t. any additive function of the discrepancies.

Finally, we discuss the case where the dimension of the instance domain is part of the input. It is known that learning the best halfspace in binary classification (w.r.t. a given sequence of points) is NP-hard when the dimension of the data is not fixed (see e.g. [1]). We show that the empirical best (linear) response problem is of the same flavor.

Lemma 4. In case H is the set of linear functions in R^{n−1} and n is not fixed, the empirical best response problem is NP-hard.

4 Experimental results

We note that when n is large, the proposed method for finding an empirical best response may not be suitable.
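Before turning to the experiments, the modification behind Lemma 3 is easy to verify numerically. Below is a toy sketch (all values hand-picked for illustration; h_star stands in for an EPM output and is not produced by an actual run of EPM): sliding the agent's line toward the opponent's preserves every strict win, by convexity of the absolute value, while the MSE tends to the opponent's.

```python
# Toy illustration of the Lemma 3 modification (illustrative values only;
# h_star is a stand-in for an EPM output, not an actual run of EPM).

def predict(h, x):
    a, b = h
    return a * x + b

def payoff(h, points, p):
    """Fraction of points the agent predicts strictly better than the opponent."""
    wins = sum(1 for (x, y), pi in zip(points, p) if abs(predict(h, x) - y) < pi)
    return wins / len(points)

def mse(h, points):
    return sum((predict(h, x) - y) ** 2 for x, y in points) / len(points)

points = [(0.0, 0.5), (1.0, 1.0), (2.0, 2.5)]
h_bar = (1.0, 1 / 3)          # opponent: the least squares fit of these points
p = [abs(predict(h_bar, x) - y) for x, y in points]   # opponent's discrepancies

h_star = (0.5, 0.5)           # wins points 1 and 2, but has a poor MSE

# Slide h_star toward h_bar: strict wins are preserved for any t < 1,
# while the MSE approaches the opponent's (optimal) MSE as t -> 1.
t = 0.99
h_t = tuple((1 - t) * s + t * o for s, o in zip(h_star, h_bar))

assert payoff(h_t, points, p) == payoff(h_star, points, p) == 2 / 3
assert mse(h_star, points) > mse(h_bar, points)
assert abs(mse(h_t, points) - mse(h_bar, points)) < 1e-3
```

Here the agent keeps her payoff of 2/3 while reducing her MSE from 1/3 to within 10⁻³ of the opponent's optimal 1/18.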
Nevertheless, if the agent is interested in finding a "good" response to her opponent, she must settle for a tractable alternative. With slight modifications, the linear best response problem can be formulated as a mixed integer linear program (MILP).¹ Hence, the agent can exploit sophisticated solvers and use clever heuristics. Further, one implication of Lemma 1 is that the true payoffs uniformly converge, and hence any empirical payoff obtained by the MILP is close to its real payoff with high probability.

¹See the appendix for the mixed integer linear programming formulation.

Table 1: Experiments on the Boston Housing dataset

The opponent's strategy           Scenario   Train payoff   Test payoff
Least square errors (LSE)         TRAIN      0.699          0.641
                                  ALL        0.711          0.645
Least absolute errors (LAE)       TRAIN      0.621          0.570
                                  ALL        0.625          0.528

Results obtained on the Boston Housing dataset. Each cell in the table represents the average payoff of the agent over 1000 simulations (splits into 80% train and 20% test). The "train payoff" is the proportion of points in the training set on which the agent is more accurate, and the "test payoff" is the equivalent proportion with respect to the test (unseen) data.

In this section, we show the extent to which classical linear regression algorithms can be beaten, using the Boston housing dataset [5], a built-in dataset in the leading data science packages (e.g. scikit-learn in Python and MASS in R). The Boston housing dataset contains 506 instances, where each instance has 13 continuous attributes and one binary attribute. The label is the median value of owner-occupied homes, and among the attributes are the per capita crime rate, the average number of rooms per dwelling, the pupil-teacher ratio by town, and more.
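The two measures reported in this section, the payoff (proportion of points on which the agent is strictly more accurate) and the ratio of classical losses, are simple functions of the per-point discrepancies. A minimal sketch follows; the variable names and numbers are illustrative, not taken from the experiments.

```python
# Evaluation measures used in the experiments (illustrative data only).

def payoff(agent_err, opp_err):
    """Proportion of points on which the agent is strictly more accurate."""
    wins = sum(1 for a, o in zip(agent_err, opp_err) if a < o)
    return wins / len(agent_err)

def loss_ratio(agent_err, opp_err, power=2):
    """Agent's classical loss over the opponent's (power=2: MSE, power=1: MAE)."""
    agent_loss = sum(a ** power for a in agent_err) / len(agent_err)
    opp_loss = sum(o ** power for o in opp_err) / len(opp_err)
    return agent_loss / opp_loss

agent_err = [0.1, 0.4, 0.2, 0.5]   # per-point absolute discrepancies (made up)
opp_err = [0.3, 0.3, 0.3, 0.3]

print(payoff(agent_err, opp_err))  # 0.5 -- the agent wins 2 of the 4 points
print(round(loss_ratio(agent_err, opp_err), 3))
```

Note that a payoff above 0.5 can coexist with a classical-loss ratio above 1: here the agent wins half the points while her MSE is roughly 28% higher than the opponent's, which is exactly why Tables 1 and 2 present complementary views.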
The R-squared measure when minimizing the squared error on the Boston housing dataset is 0.74, indicating that the use of linear regression is reasonable.

As possible strategies of the opponent, we analyzed the linear least squares estimator (LSE) and the linear least absolute errors estimator (LAE). The dataset was split into training (80%) and test (20%) sets, and two scenarios were considered:

Scenario TRAIN - the opponent's model is learned from the training set only.
Scenario ALL   - the opponent's model is learned from both the training and the test sets.

In both scenarios the agent had access to the training set only, along with the opponent's discrepancy for each point in the training set. Obviously, achieving a payoff of more than 0.5 (that is, winning more than 50% of the points) in the ALL scenario is a real challenge, since the opponent has seen the test set in her learning process. We ran 1000 simulations, where each simulation is a random split of the dataset. We employed the MILP formulation and used the Gurobi solver [4] in order to find a response, where the running time of the solver was limited to one minute.²

Our findings are reported in Table 1. Notice that against both opponent strategies, and even in the case where the opponent had seen the test set, the agent still wins more than 50% of the points. In both scenarios, LAE guarantees the opponent more than LSE does; this is because the absolute error is less sensitive to large deviations. We also noticed that when the opponent learns from the whole dataset, the empirical payoff of the agent is greater.
Indeed, the latter is reasonable, as in the ALL scenario the agent's strategy fits the training set while the opponent's strategy does not.

Beyond the main analysis, we examined the success (or lack thereof) of the agent with respect to the additive loss function optimized by the opponent (corresponding to the MSE for LSE, and the MAE (mean absolute error) for LAE), hereby referred to as the "classical loss". Recall that Lemma 3 guarantees that the agent's classical loss can be arbitrarily close to that of the opponent when she plays a best response; however, the response we consider in this section (using the MILP) does not necessarily converge to a best response. Therefore, we find it interesting to consider the classical loss as well, thereby presenting the complementary view.

We report in Table 2 the average ratio between the agent's classical loss and that of the opponent under the TRAIN scenario, with respect to the training and test sets. Notice that the agent suffers less than a 0.7% increase with respect to the classical loss optimized by the opponent. In particular,

²Code for reproducing the experiments is available at https://github.com/omerbp/Best-Response-Regression

Table 2: Ratio of the classical loss

                 The opponent's strategy
                 LSE       LAE
Training set     1.007     1.005
Test set         0.999     1.002

Ratio of the agent's loss to the opponent's loss, where the loss function corresponds to the original optimization function of the opponent, under scenario TRAIN. For example, the upper leftmost cell represents the agent's MSE divided by the opponent's MSE on the training set, where the opponent uses LSE.
Similarly, the lower rightmost cell represents the agent's MAE (mean absolute error) divided by the opponent's MAE on the test data, when the opponent uses LAE.

the MSE of the agent (when she responds to LSE) on the test set is less than that of the opponent. The same phenomenon, albeit on a smaller scale, occurs against LAE: the training set ratio is greater than the test set ratio.

To conclude, the agent is not only able to obtain the majority of the points (and in some cases, up to 70% of them), but also to keep the classical loss optimized by her opponent within less than 0.2% of the optimum on the test set.

5 Discussion

This work introduces a game-theoretic view of a machine learning task. After finding sufficient conditions for learning to occur, we analyzed the induced learning problem, when the agent is restricted to a linear response. We showed that a best response with respect to a sequence of examples can be computed in time polynomial in the number of examples, as long as the instance domain has a constant dimension. Further, we showed an algorithm that for any ε, δ computes an ε-best response with probability at least 1 − δ, when it is given a sequence of poly(1/ε² · (n log n + log(1/δ))) examples drawn i.i.d.

As the reader may notice, our analysis holds as long as the hypothesis is linear in its parameters, and is therefore much more general than linear regression. Interestingly, this is a novel type of optimization problem, and so rich hypothesis classes, which are somewhat unnatural in the traditional task of regression, might be successfully employed in the proposed setting.

From an empirical standpoint, the gap between the empirical payoff and the true payoff calls for applying regularization methods to the best response problem and encourages further algorithmic research.
It would be intriguing to explore whether a response in the form of hyperplanes can be effective against a more complex strategy employed by the opponent. For instance, showing that a deep learner is beatable in this setting would be remarkable.

The main direction to follow is the analysis of the competitive environment introduced in the beginning of Section 2 as a simultaneous game: is there an equilibrium strategy? Namely, is there a linear predictor which, when used by both the agent and the opponent, is a best response to one another?

Acknowledgments

We thank Gili Baumer and Argyris Deligkas for helpful discussions, and the anonymous reviewers for their useful suggestions. This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 740435).

References

[1] E. Amaldi and V. Kann. The complexity and approximability of finding maximum feasible subsystems of linear relations. Theoretical Computer Science, 147(1-2):181-210, 1995.

[2] R. Cole and T. Roughgarden. The sample complexity of revenue maximization. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing, pages 243-252. ACM, 2014.

[3] O. Dekel, F. Fischer, and A. D. Procaccia. Incentive compatible regression learning. Journal of Computer and System Sciences, 76(8):759-777, 2010.

[4] Gurobi Optimization, Inc. Gurobi optimizer reference manual, 2016.

[5] D. Harrison and D. L. Rubinfeld. Hedonic housing prices and the demand for clean air. Journal of Environmental Economics and Management, 5(1):81-102, 1978.

[6] N. Immorlica, A. T. Kalai, B. Lucier, A. Moitra, A. Postlewaite, and M. Tennenholtz. Dueling algorithms. In Proceedings of the 43rd Annual ACM Symposium on Theory of Computing, pages 215-224. ACM, 2011.

[7] R. Meir, A. D. Procaccia, and J. S. Rosenschein. Algorithms for strategyproof classification. Artificial Intelligence, 186:123-156, 2012.

[8] N. Nisan and A. Ronen. Algorithmic mechanism design. In Proceedings of the 31st Annual ACM Symposium on Theory of Computing, pages 129-140. ACM, 1999.

[9] D. Pechyony and V. Vapnik. On the theory of learning with privileged information. In Advances in Neural Information Processing Systems, pages 1894-1902, 2010.

[10] N. Sauer. On the density of families of sets. Journal of Combinatorial Theory, Series A, 13(1):145-147, 1972.

[11] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

[12] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134-1142, 1984.

[13] V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264, 1971.

[14] V. Vapnik and A. Vashist. A new learning paradigm: Learning using privileged information. Neural Networks, 22(5):544-557, 2009.

[15] V. Vapnik, A. Vashist, and N. Pavlovitch. Learning using hidden information: Master class learning. NATO Science for Peace and Security Series, D: Information and Communication Security, 19:3-14, 2008.