{"title": "Predicting Dynamic Difficulty", "book": "Advances in Neural Information Processing Systems", "page_first": 2007, "page_last": 2015, "abstract": "Motivated by applications in electronic games as well as teaching systems, we investigate the problem of dynamic difficulty adjustment. The task here is to repeatedly find a game difficulty setting that is neither `too easy' and bores the player, nor `too difficult' and overburdens the player. The contributions of this paper are ($i$) formulation of difficulty adjustment as an online learning problem on partially ordered sets, ($ii$) an exponential update algorithm for dynamic difficulty adjustment, ($iii$) a bound on the number of wrong difficulty settings relative to the best static setting chosen in hindsight, and ($iv$) an empirical investigation of the algorithm when playing against adversaries.", "full_text": "Predicting Dynamic Dif\ufb01culty\n\nand Thomas G\u00a8artner\nOlana Missura\nUniversity of Bonn and Fraunhofer IAIS\n\nSchlo\u00df Birlinghoven\n\n{olana.missura,thomas.gaertner}@uni-bonn.de\n\n52757 Sankt Augustin, Germany\n\nAbstract\n\nMotivated by applications in electronic games as well as teaching systems, we\ninvestigate the problem of dynamic dif\ufb01culty adjustment. The task here is to re-\npeatedly \ufb01nd a game dif\ufb01culty setting that is neither \u2018too easy\u2019 and bores the\nplayer, nor \u2018too dif\ufb01cult\u2019 and overburdens the player. 
The contributions of this pa-\nper are (i) the formulation of difficulty adjustment as an online learning problem\non partially ordered sets, (ii) an exponential update algorithm for dynamic diffi-\nculty adjustment, (iii) a bound on the number of wrong difficulty settings relative\nto the best static setting chosen in hindsight, and (iv) an empirical investigation of\nthe algorithm when playing against adversaries.\n\n1 Introduction\n\nWhile difficulty adjustment is common practice in many traditional games (consider, for instance,\nthe handicap in golf or the handicap stones in go), the case for dynamic difficulty adjustment in\nelectronic games has been made only recently [7]. Still, there are already many different, more\nor less successful, heuristic approaches for implementing it. In this paper, we formalise dynamic\ndifficulty adjustment as a game between a master and a player in which the master tries to predict the\nmost appropriate difficulty setting. As the player is typically a human whose performance changes\ndepending on many hidden factors as well as luck, no assumptions about the player can be made.\nThe difficulty adjustment game is played on a partially ordered set which reflects the ‘more difficult\nthan’ relation on the set of difficulty settings. To the best of our knowledge, this paper provides\nthe first thorough theoretical treatment of dynamic difficulty adjustment as a prediction problem.\nThe contributions of this paper are: We formalise the learning problem of dynamic difficulty adjust-\nment (in Section 2), propose a novel learning algorithm for this problem (in Section 4), and give\na bound on the number of proposed difficulty settings that were not just right (in Section 5). The\nbound limits the number of mistakes the algorithm can make relative to the best static difficulty set-\nting chosen in hindsight. 
For the bound to hold, no assumptions whatsoever need to be made about the\nbehaviour of the player. Last but not least, we empirically study the behaviour of the algorithm under\nvarious circumstances (in Section 6). In particular, we investigate the performance of the algorithm\n‘against’ statistically distributed players by simulating the players as well as ‘against’ adversaries\nby asking humans to try to trick the algorithm in a simplified setting. Implementing our algorithm\nin a real game and testing it on real human players is left to future work.\n\n2 Formalisation\n\nTo be able to theoretically investigate dynamic difficulty adjustment, we view it as a game between\na master and a player, played on a partially ordered set modelling the ‘more difficult than’ relation.\nThe game is played in turns where each turn has the following elements:\n\n1. the game master chooses a difficulty setting,\n2. the player plays one ‘round’ of the game in this setting, and\n3. the game master experiences whether the setting was ‘too difficult’, ‘just right’, or ‘too\neasy’ for the player.\n\nThe master aims at making as few mistakes as possible, that is, at choosing a difficulty setting\nthat is ‘just right’ as often as possible. 
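One turn of this master-player protocol can be sketched directly. In the minimal sketch below, the class names, the bisection strategy of the toy master, and the -1/0/+1 feedback encoding are illustrative assumptions, not part of the formalisation; the paper's actual master algorithm, POSM, is given in Section 4.

```python
# Sketch of one turn of the difficulty-adjustment game described above.
# Feedback encoding (an assumption for illustration): -1 = 'too difficult',
# 0 = 'just right', +1 = 'too easy'.

class IntervalPlayer:
    # Toy player whose 'just right' zone is a fixed interval of settings.
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def play(self, setting):
        if setting < self.lo:
            return +1           # too easy for this player
        if setting > self.hi:
            return -1           # too difficult
        return 0                # just right

class BisectionMaster:
    # Toy master doing bisection on a chain of settings 0..n-1; NOT the
    # paper's algorithm, just an illustration of the turn structure.
    def __init__(self, n):
        self.lo, self.hi = 0, n - 1

    def choose_setting(self):
        self.last = (self.lo + self.hi) // 2
        return self.last

    def observe(self, feedback):
        if feedback == +1:
            self.lo = min(self.last + 1, self.hi)
        elif feedback == -1:
            self.hi = max(self.last - 1, self.lo)

def play_turn(master, player):
    setting = master.choose_setting()   # 1. master chooses a difficulty
    feedback = player.play(setting)     # 2. player plays one round
    master.observe(feedback)            # 3. master experiences the feedback
    return feedback

master, player = BisectionMaster(50), IntervalPlayer(30, 32)
mistakes = sum(play_turn(master, player) != 0 for _ in range(10))
```

Against this static toy player, bisection converges quickly; the difficulty of the real problem comes from players whose hidden zone moves between rounds.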
In this paper, we aim at developing an algorithm for the\nmaster with theoretical guarantees on the number of mistakes in the worst case while not making\nany assumptions about the player.\nTo simplify our analysis, we make the following, rather natural assumptions:\n\n• the set of difficulty settings is finite and\n• in every round, the (hidden) difficulty settings respect the partial order, that is,\n\n– no state that ‘is more difficult than’ a state which is ‘too difficult’ can be ‘just right’ or\n‘too easy’ and\n– no state that ‘is more difficult than’ a state which is ‘just right’ can be ‘too easy’.\n\nEven with these natural assumptions, in the worst case, no algorithm for the master will be able to\nmake even a single correct prediction. As we cannot make any assumptions about the player, we\nwill be interested in comparing our algorithm theoretically and empirically with the best statically\nchosen difficulty setting, as is commonly the case in online learning [3].\n\n3 Related Work\n\nAs of today there exist a few commercial games with a well-designed dynamic difficulty adjustment\nmechanism, but all of them employ heuristics and as such suffer from the typical disadvantages\n(not being easily transferable to other games, requiring extensive testing, etc.). What we would like\nto have instead of heuristics is a universal mechanism for dynamic difficulty adjustment: an online\nalgorithm that takes as input (game-specific) ways to modify difficulty and the current player’s\nin-game history (actions, performance, reactions, ...) and produces as output an appropriate\ndifficulty modification.\nBoth artificial intelligence researchers and the game development community have shown interest in the\nproblem of automatic difficulty scaling. Different approaches can be seen in the work of R. 
Hunicke\nand V. Chapman [10], R. Herbrich and T. Graepel [9], Danzi et al. [7], and others. Since the perceived\ndifficulty and the preferred difficulty are subjective parameters, a dynamic difficulty adjustment\nalgorithm should be able to choose the “right” difficulty level in a comparatively short time for any\nparticular player. Existing work on player modeling in computer games [14, 13, 5, 12] demonstrates\nthe power of utilising player models to create games or in-game situations of high interest\nand satisfaction for the players.\nAs can be seen from these examples, the problem of dynamic difficulty adjustment in video games\nhas been attacked from different angles, but a unifying and theoretically sound approach is still miss-\ning. To the best of our knowledge this work contains the first theoretical formalization of dynamic\ndifficulty adjustment as a learning problem.\nUnder the assumptions described in Section 2, we can view the partially ordered set as a directed\nacyclic graph, at each round labelled by three colours (say, red for ‘too difficult’, green for ‘just\nright’, and blue for ‘too easy’) such that\n\n• for every directed path in the graph between two equally labelled vertices, all vertices on\nthat path have the same colour and\n• there is no directed path from a green vertex to a red vertex and none from a blue vertex to\neither a red or a green vertex.\n\nThe colours are allowed to change in each round as long as they obey the above rules. The master,\ni.e., the learning algorithm, does not see the colours but must point at a green vertex as often as\npossible. If he points at a red vertex, he receives the feedback −1; if he points at a blue vertex, he\nreceives the feedback +1.\nThis setting is related to learning directed cuts with membership queries. 
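The two labelling rules above are equivalent to requiring that the label rank never increases along any edge, under the order blue (‘too easy’) < green (‘just right’) < red (‘too difficult’), which makes them easy to check mechanically. A minimal sketch follows; the successor-dictionary encoding of the DAG is an assumption for illustration.

```python
# Checks the labelling rules above on the DAG of difficulty settings,
# where an edge u -> v means u 'is more difficult than' v. Both rules
# hold exactly when the label rank never increases along an edge,
# with blue ('too easy') < green ('just right') < red ('too difficult');
# by transitivity this extends to all directed paths.

RANK = {'blue': 0, 'green': 1, 'red': 2}

def labelling_consistent(successors, labels):
    return all(RANK[labels[u]] >= RANK[labels[v]]
               for u in successors for v in successors[u])

# A chain c -> b -> a with c the most difficult setting:
edges = {'c': ['b'], 'b': ['a'], 'a': []}
ok = labelling_consistent(edges, {'c': 'red', 'b': 'green', 'a': 'blue'})    # consistent
bad = labelling_consistent(edges, {'c': 'green', 'b': 'red', 'a': 'blue'})   # green-to-red path
```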
For learning directed cuts,\ni.e., monotone subsets, Gärtner and Garriga [8] provided algorithms and bounds for the case in\nwhich the labelling does not change over time. They then showed that the intersection of a\nmonotone and an antimonotone subset is not learnable. This negative result is not applicable in our\ncase, as the feedback we receive is more powerful. They furthermore showed that directed cuts are\nnot learnable with traditional membership queries if the labelling is allowed to change over time.\nThis negative result also does not apply to our case, as the aim of the master is “only” to point at a\ngreen vertex as often as possible and as we are interested in a comparison with the best static vertex\nchosen in hindsight.\nIf we ignore the structure inherent in the difficulty settings, we are in a standard multi-armed\nbandit setting [2]: There are K arms, to which an unknown adversary assigns loss values on each\niteration (0 to the ‘just right’ arms, 1 to all the others). The goal of the algorithm is to choose an\narm on each iteration to minimize its overall loss. The difficulty of the learning problem comes from\nthe fact that only the loss of the chosen arm is revealed to the algorithm. This setting has been studied\nextensively in recent years, see [11, 6, 4, 1] and others. The standard performance measure is the\nso-called ‘regret’: the difference between the loss acquired by the learning algorithm and that of the best\nstatic arm chosen in hindsight. The best algorithm known to date that does not use any additional\ninformation is the Improved Bandit Strategy (called IMPROVEDPI in the following) [3]. The upper\nbound on its regret is of the order √(KT ln T), where T is the number of iterations. 
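This notion of regret can be computed directly from a loss table; a small sketch follows, where the loss-matrix layout is an assumption for illustration.

```python
# Regret of a sequence of arm choices against the best static arm in
# hindsight: losses[t][k] is the loss of arm k at iteration t (0 for a
# 'just right' arm, 1 otherwise); choices[t] is the arm played at t.

def regret(losses, choices):
    alg_loss = sum(losses[t][k] for t, k in enumerate(choices))
    best_static = min(sum(row[k] for row in losses)
                      for k in range(len(losses[0])))
    return alg_loss - best_static

# 3 arms over 4 iterations; arm 1 is 'just right' except at t = 2:
L = [[1, 0, 1],
     [1, 0, 1],
     [0, 1, 1],
     [1, 0, 1]]
r = regret(L, [0, 1, 1, 1])   # algorithm loses 2, best static arm (1) loses 1
```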
IMPROVEDPI\nwill be the second baseline after the best static in hindsight (BSIH) in our experiments.\n\n4 Algorithm\n\nIn this section we give an exponential update algorithm for predicting a vertex that corresponds to a\n‘just right’ difficulty setting in a finite partially ordered set (K, ≻) of difficulty settings. The partial\norder is such that for i, j ∈ K we write i ≻ j if difficulty setting i is ‘more difficult than’ difficulty\nsetting j. The learning rate of the algorithm is denoted by β. The response ot that the master algorithm\nobserves is +1 if the chosen difficulty setting was ‘too easy’, 0 if it was ‘just right’, and −1 if it\nwas ‘too difficult’. The algorithm maintains a belief w of each vertex being ‘just right’ and updates\nthis belief if the observed response implies that the setting was ‘too easy’ or ‘too difficult’.\n\nAlgorithm 1 PARTIALLY-ORDERED-SET MASTER (POSM) for Difficulty Adjustment\nRequire: parameter β ∈ (0, 1), difficulty settings K, partial order ≻ on K, and a sequence of\nobservations o1, o2, . . .\n1: ∀k ∈ K : let w1(k) = 1\n2: for each turn t = 1, 2, . . . do\n3:   ∀k ∈ K : let At(k) = Σ_{x∈K : x≽k} wt(x)\n4:   ∀k ∈ K : let Bt(k) = Σ_{x∈K : x≼k} wt(x)\n5:   PREDICT kt = argmax_{k∈K} min{Bt(k), At(k)}\n6:   OBSERVE ot ∈ {−1, 0, +1}\n7:   if ot = +1 then\n8:     ∀k ∈ K : let wt+1(k) = β·wt(k) if k ≼ kt, and wt+1(k) = wt(k) otherwise\n9:   end if\n10:  if ot = −1 then\n11:    ∀k ∈ K : let wt+1(k) = β·wt(k) if k ≽ kt, and wt+1(k) = wt(k) otherwise\n12:  end if\n13: end for\n\nThe main idea of Algorithm 1 is that in each round we want to make sure we can update as much\nbelief as possible. 
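On a chain, the totally ordered special case, Algorithm 1 can be transcribed almost line by line. Below is a minimal sketch; representing the settings as indices 0..K-1 with larger index meaning ‘more difficult’ is an assumption for illustration, while the algorithm itself works on any finite poset.

```python
# Sketch of POSM (Algorithm 1) on a chain of K difficulty settings,
# where setting i 'is more difficult than' setting j iff i > j.

class POSM:
    def __init__(self, K, beta=0.5):
        assert 0 < beta < 1
        self.w = [1.0] * K          # belief w_1(k) = 1 for all k
        self.beta = beta

    def predict(self):
        # A(k): belief at settings >= k ('above'); B(k): at settings <= k.
        A = [sum(self.w[k:]) for k in range(len(self.w))]
        B = [sum(self.w[:k + 1]) for k in range(len(self.w))]
        self.k = max(range(len(self.w)), key=lambda k: min(A[k], B[k]))
        return self.k

    def observe(self, o):
        # o = +1 ('too easy'): discount the prediction and everything easier;
        # o = -1 ('too difficult'): discount it and everything more difficult.
        if o == +1:
            for k in range(self.k + 1):
                self.w[k] *= self.beta
        elif o == -1:
            for k in range(self.k, len(self.w)):
                self.w[k] *= self.beta

master = POSM(8)
first = master.predict()    # near the middle of the chain
master.observe(+1)          # 'too easy', so belief shifts upwards
second = master.predict()
```

After the ‘too easy’ response, the belief below the prediction is discounted and the next prediction moves to a more difficult setting.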
The significance of this will be clearer when looking at the theory in the next\nsection. To ensure it, we compute for each setting k the belief ‘above’ k as well as ‘below’ k.\nThat is, At in line 3 of the algorithm collects the belief of all settings that are known to be ‘more\ndifficult’ and Bt in line 4 of the algorithm collects the belief of all settings that are known to be ‘less\ndifficult’ than k. If we observe that the proposed setting was ‘too easy’, that is, we should ‘increase\nthe difficulty’, in line 8 we update the belief of the proposed setting as well as all settings easier than\nthe proposed one. If we observe that the proposed setting was ‘too difficult’, that is, we should ‘decrease\nthe difficulty’, in line 11 we update the belief of the proposed setting as well as all settings more\ndifficult than the proposed one. The amount of belief that is updated for each mistake is thus equal to\nBt(kt) or At(kt). To gain the most information independent of the observation and thus to achieve\nthe best performance, we choose the k that gives us the best worst-case update min{Bt(k), At(k)}\nin line 5 of the algorithm.\n\n5 Theory\n\nWe will now show a bound on the number of inappropriate difficulty settings that are proposed,\nrelative to the number of mistakes the best static difficulty setting makes. We denote the number of\nmistakes of POSM until time T by m and the minimum number of times a statically chosen difficulty\nsetting would have made a mistake until time T by M. We furthermore denote the total amount of\nbelief on the partially ordered set by Wt = Σ_{k∈K} wt(k).\nThe analysis of the algorithm relies on the notion of a path cover of K, i.e., a set of paths covering\nK. A path is a subset of K that is totally ordered. A set of paths is covering K if the union of the\npaths is equal to K. 
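For the 7 × 7 grid poset used in the experiments of Section 6, for instance, the rows already form a path cover, and no cover can be smaller because the anti-diagonal is an antichain of the same size. A sketch follows; the pair encoding of grid vertices is an assumption for illustration.

```python
# Path cover of the m x n grid poset, where (i, j) is 'more difficult
# than' (k, l) iff i >= k and j >= l (and they differ). Each row is
# totally ordered, so the m rows form a cover; the anti-diagonal is an
# antichain of size min(m, n), so for m <= n this cover is minimum
# (by Dilworth's theorem).

def leq(a, b):
    return a[0] <= b[0] and a[1] <= b[1]

def row_cover(m, n):
    return [[(i, j) for j in range(n)] for i in range(m)]

def is_chain(path):
    return all(leq(path[t], path[t + 1]) for t in range(len(path) - 1))

cover = row_cover(7, 7)
antichain = [(i, 6 - i) for i in range(7)]
pairwise_incomparable = all(not leq(a, b) and not leq(b, a)
                            for a in antichain for b in antichain if a != b)
```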
Any path cover can be chosen, but the minimum path cover of K achieves the\ntightest bound. It can be found in time polynomial in |K| and its size is equal to the size of the\nlargest antichain in (K, ≻). We denote the chosen set of paths by C.\nWith this terminology, we are now ready to state the main result of our paper:\nTheorem 1. For the number of mistakes of POSM, it holds that\n\nm ≤ ⌈ (ln |K| + M ln(1/β)) / ln( 2|C| / (2|C| − 1 + β) ) ⌉ .\n\nFor all c ∈ C we denote the amount of belief on every chain by W^c_t = Σ_{x∈c} wt(x), the be-\nlief ‘above’ k on c by A^c_t(k) = Σ_{x∈c : x≽k} wt(x), and the belief ‘below’ k on c by B^c_t(k) =\nΣ_{x∈c : x≼k} wt(x). Furthermore, we denote the ‘heaviest’ chain by ct = argmax_{c∈C} W^c_t.\nUnless stated otherwise, the following statements hold for all t.\nObservation 1.1. To relate the amount of belief updated by POSM to the amount of belief on each\nchain, observe that\n\nmax_{k∈K} min{At(k), Bt(k)} = max_{c∈C} max_{k∈c} min{At(k), Bt(k)}\n≥ max_{c∈C} max_{k∈c} min{A^c_t(k), B^c_t(k)}\n≥ max_{k∈ct} min{A^{ct}_t(k), B^{ct}_t(k)} .\n\nObservation 1.2. As ct is the ‘heaviest’ among all chains and Σ_{c∈C} W^c_t ≥ Wt, it holds that\nW^{ct}_t ≥ Wt/|C|.\nWe will next show that for every chain there is a difficulty setting for which it holds that: if we\nproposed that setting and made a mistake, we would be able to update at least half of the total\nweight of that chain.\nProposition 1.1. For all c ∈ C it holds that\n\nmax_{k∈c} min{A^c_t(k), B^c_t(k)} ≥ W^c_t / 2 .\n\nProof. We choose\n\ni = argmax_{k∈c} {B^c_t(k) | B^c_t(k) < W^c_t / 2}\n\nand\n\nj = argmin_{k∈c} {B^c_t(k) | B^c_t(k) ≥ W^c_t / 2} .\n\nThis way, we obtain i, j ∈ c for which B^c_t(i) < W^c_t / 2 ≤ B^c_t(j) and which are consecutive, that\nis, ∄k ∈ c : i ≺ k ≺ j. Such i, j exist and are unique as ∀x ∈ K : wt(x) > 0. We then have\nB^c_t(i) + A^c_t(j) = W^c_t and thus also A^c_t(j) > W^c_t / 2. This immediately implies\n\nW^c_t / 2 ≤ min{A^c_t(j), B^c_t(j)} ≤ max_{k∈c} min{A^c_t(k), B^c_t(k)} .\n\nObservation 1.3. We use the previous proposition to show that in each iteration in which POSM\nproposes an inappropriate difficulty setting, we update at least a constant fraction of the total weight\nof the partially ordered set:\n\nmax_{k∈K} min{At(k), Bt(k)} ≥ max_{k∈ct} min{A^{ct}_t(k), B^{ct}_t(k)} ≥ W^{ct}_t / 2 ≥ Wt / (2|C|) .\n\nProof (of Theorem 1). From the previous observations it follows that at each mistake we update at\nleast a fraction of 1/(2|C|) of the total weight and have at most a fraction of (2|C| − 1)/(2|C|) which\nis not updated. 
This implies\n\nWt+1 ≤ β · (1/(2|C|)) · Wt + ((2|C| − 1)/(2|C|)) · Wt = ((β + 2|C| − 1)/(2|C|)) · Wt .\n\nApplying this bound recursively, we obtain for time T\n\nWT ≤ |K| · ((β + 2|C| − 1)/(2|C|))^m .\n\nAs we only update the weight of a difficulty setting if the response implied that the algorithm made\na mistake, β^M is a lower bound on the weight of one difficulty setting and hence also WT ≥ β^M.\nSolving\n\nβ^M ≤ |K| · ((β + 2|C| − 1)/(2|C|))^m\n\nfor m proves the theorem.\n\nNote that this bound is similar to the bound for the full information setting [3] despite much weaker\ninformation being available in our case. The influence of |C| is the new ingredient that changes the\nbehaviour of this bound for different partially ordered sets.\n\n6 Experiments\n\nWe performed two sets of experiments: simulating a game against a stochastic environment, as well\nas using human players to provide our algorithm with a non-oblivious adversary. To evaluate the\nperformance of our algorithm we have chosen two baselines. The first one is the best static difficulty\nsetting in hindsight: it is the difficulty that a player would pick if she knew her skill level in advance\nand had to choose the difficulty only once. The second one is the IMPROVEDPI algorithm [3].\nIn the following we call the subset of the poset’s vertices with ‘just right’ labels the zero-zone\n(because the corresponding components of the loss vector are equal to zero). In both the stochastic and the\nadversarial scenario we consider two different settings: a so-called ‘smooth’ and a ‘non-smooth’ one.\nThe settings’ names describe the way the zero-zone changes with time. 
In the ‘non-smooth’ setting\nwe don’t place any restrictions on it apart from its size, while in the ‘smooth’ setting the border of\nthe zero-zone is allowed to move only by one vertex at a time. These two settings represent two\nextreme situations: a single player whose skills change gradually with time changes the zero-zone\n‘smoothly’; different players with different skills for each new challenge the game presents will\nmake the zero-zone ‘jump’. In a more realistic scenario the zero-zone would change ‘smoothly’\nmost of the time, but would sometimes perform jumps.\n\nFigure 1: Stochastic adversary, ‘smooth’ setting, on a single chain of 50 vertices. (a) Loss. (b) Regret.\n\nFigure 2: Stochastic adversary, ‘smooth’ setting, on a grid of 7x7 vertices. (a) Loss. (b) Regret.\n\n6.1 Stochastic Adversary\n\nIn the first set of experiments we performed, the adversary is stochastic: On every iteration the zero-\nzone changes with a pre-defined probability. In the ‘smooth’ setting only one of the border vertices\nof the zero-zone at a time can change its label. For the ‘non-smooth’ setting we consider the truly evil\ncase of limiting the zero-zone to always contain only one vertex and a case where the zero-zone\nmay contain up to 20% of all the vertices in the graph. Note that even the relabeling of a single vertex\nmay break the consistency of the labeling with regard to the poset. The necessary repair procedure\nmay result in more than one vertex being relabeled at a time.\nWe consider two graphs that represent two different but typical game structures with regard to the\ndifficulty: a single chain and a 2-dimensional grid. A set of progressively more difficult challenges\nsuch as can be found in a puzzle or a time-management game can be directly mapped onto a chain\nof a length corresponding to the number of challenges. 
A 2- (or more-) dimensional grid, on the other\nhand, is more like a skill-based game, where, depending on the choices players make, different game\nstates become available to them. In our experiments the chain contains 50 vertices, while the grid is\nbuilt on 7 × 7 vertices.\nIn all considered variations of the setting the game lasts for 500 iterations and is repeated 10 times.\nThe resulting mean and standard deviation values of loss and regret, respectively, are shown in\nthe following figures: the ‘smooth’ setting in Figures 1(a), 1(b) and 2(a), 2(b); the ‘non-smooth’\nsetting in Figures 3(a), 3(b) and 4(a), 4(b). (For brevity we omit the plots with the results of other\n‘non-smooth’ variations. They all show very similar behaviour.)\nNote that in the ‘smooth’ setting POSM outperforms BSIH and, therefore, its regret is negative.\nFurthermore, in the considerably more difficult ‘non-smooth’ setting all algorithms perform badly\n(as expected). Nevertheless, in the slightly easier case of a larger zero-zone, BSIH performs best of\nthe three, and the performance of POSM starts getting better.\nWhile BSIH is a baseline that cannot be implemented, as it requires foreseeing the future, POSM is a\ncorrect algorithm for dynamic difficulty adjustment. 
Therefore it is surprising that POSM performs\nalmost as well as BSIH or even better.\n\nFigure 3: Stochastic adversary, ‘non-smooth’ setting, exactly one vertex in the zero-zone, on a single\nchain of 50 vertices. (a) Loss. (b) Regret.\n\nFigure 4: Stochastic adversary, ‘non-smooth’ setting, up to 20% of all vertices may be in the zero-\nzone, on a single chain of 50 vertices. (a) Loss. (b) Regret.\n\n6.2 Evil Adversary\n\nWhile the experiments in our stochastic environment show encouraging results, of real interest to\nus is the situation where the adversary is ‘evil’, non-stochastic, and, furthermore, non-oblivious. In\ndynamic difficulty adjustment the algorithm will have to deal with people, who learn and\nchange in hard-to-predict ways. We limit our experiments to the case of a linear order on difficulty\nsettings, in other words, the chain. Even though it is a simplified scenario, this situation is rather\nnatural for games and it demonstrates the power of our algorithm.\nTo simulate this situation, we decided to use people as adversaries. Just as players are not supposed\nto be aware of the mechanics of dynamic difficulty adjustment, our methods and goals were not\ndisclosed to the test subjects. Instead they were presented with a modified game of cups: On\nevery iteration the casino hides a coin under one of the cups; after that the player can point at\ntwo of the cups. If the coin is under one of these two, the player wins it. 
Behind the scenes the\ncups represented the vertices on the chain and the players’ choices were setting the lower and upper\nborders of the zero-zone. If the algorithm’s prediction was wrong, one of the two cups was chosen\nat random and the coin was placed under it. If the prediction was correct, no coin was awarded.\nUnfortunately, using people in such experiments places severe limitations on the size of the game.\nIn a simplified setting such as this, and without any extrinsic rewards, they can only handle short chains\nand short games before getting bored. In our case we restricted the length of the chain to 8 and the\nlength of each game to 15. It is possible to simulate a longer game by not resetting the weights of\nthe algorithm after each game is over, but at the current stage of work this was not done.\nAgain, we created the ‘smooth’ and ‘non-smooth’ settings by placing or removing restrictions on\nhow players were allowed to choose their cups. To each game either IMPROVEDPI or POSM was\nassigned. The results for the ‘smooth’ setting are in Figures 5(a), 5(b), and 5(c); for the ‘non-\nsmooth’ in Figures 6(a), 6(b), and 6(c). 
Note that, because different games were\nplayed against IMPROVEDPI and POSM, we have two different plots for their corresponding loss values.\n\nFigure 5: Evil adversary, ‘smooth’ setting, a single chain of 8 vertices. (a) Games vs IMPROVEDPI. (b) Games vs POSM. (c) Regret.\n\nFigure 6: Evil adversary, ‘non-smooth’ setting, a single chain of 8 vertices. (a) Games vs IMPROVEDPI. (b) Games vs POSM. (c) Regret.\n\nWe can see that in the ‘smooth’ setting the performance of POSM is again very close to that of BSIH.\nIn the more difficult ‘non-smooth’ one the results are also encouraging. Note that the loss of BSIH\nappears to be worse in games played by POSM. A plausible interpretation is that players had to\nfollow more difficult (less static) strategies to fool POSM and win their coins. Nevertheless, the regret\nof POSM is small even in this case.\n\n7 Conclusions\n\nIn this paper we formalised dynamic difficulty adjustment as a prediction problem on partially or-\ndered sets and proposed a novel online learning algorithm, POSM, for dynamic difficulty adjustment.\nUsing this formalisation, we were able to prove a bound on the performance of POSM relative to the\nbest static difficulty setting chosen in hindsight, BSIH. To validate our theoretical findings empiri-\ncally, we performed a set of experiments, comparing POSM and another state-of-the-art algorithm to\nBSIH in two settings: (a) simulating the player by a stochastic process and (b) simulating the player\nby humans who were encouraged to play as adversarially as possible. 
These experiments showed\nthat POSM very often performs almost as well as BSIH and, even more surprisingly, sometimes even\nbetter. As this is also better than the behaviour suggested by our mistake bound, there seems to\nbe a gap between the theoretical and the empirical performance of our algorithm.\nIn future work we will, on the one hand, investigate this gap, aiming at providing better bounds by,\nperhaps, making stronger but still realistic assumptions. On the other hand, we will implement\nPOSM in a range of computer games as well as teaching systems to observe its behaviour in real\napplication scenarios.\n\nAcknowledgments\n\nThis work was supported in part by the German Science Foundation (DFG) in the Emmy Noether\nprogram under grant ‘GA 1615/1-1’. The authors thank Michael Kamp for proofreading.\n\nReferences\n\n[1] J. Abernethy, E. Hazan, and A. Rakhlin. Competing in the dark: An efficient algorithm for\nbandit linear optimization. In COLT, 2008.\n\n[2] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire. Gambling in a rigged casino: The ad-\nversarial multi-armed bandit problem. In Foundations of Computer Science, Annual IEEE Sym-\nposium on, page 322, 1995.\n\n[3] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press,\n2006.\n\n[4] N. Cesa-Bianchi, Y. Mansour, and G. Stoltz. Improved second-order bounds for prediction\nwith expert advice. Machine Learning, 66:321–352, 2007.\n\n[5] D. Charles and M. Black. Dynamic player modeling: A framework for player-centered digital\ngames. In Proc. 
of the International Conference on Computer Games: Artificial Intelligence,\nDesign and Education, pages 29–35, 2004.\n\n[6] V. Dani and T. P. Hayes. Robbing the bandit: Less regret in online geometric optimization\nagainst an adaptive adversary. In Proceedings of the Seventeenth Annual ACM-SIAM Sympo-\nsium on Discrete Algorithms, SODA ’06, pages 937–943, New York, NY, USA, 2006. ACM.\n\n[7] G. Danzi, A. H. P. Santana, A. W. B. Furtado, A. R. Gouveia, A. Leitão, and G. L. Ramalho.\nOnline adaptation of computer games agents: A reinforcement learning approach. In II Workshop\nde Jogos e Entretenimento Digital, pages 105–112, 2003.\n\n[8] T. Gärtner and G. C. Garriga. The cost of learning directed cuts. In Proceedings of the 18th\nEuropean Conference on Machine Learning, 2007.\n\n[9] R. Herbrich, T. Minka, and T. Graepel. TrueSkill: A Bayesian skill rating system. In NIPS,\npages 569–576, 2006.\n\n[10] R. Hunicke and V. Chapman. AI for dynamic difficulty adjustment in games. In Proceedings\nof the Challenges in Game AI Workshop, Nineteenth National Conference on Artificial Intelli-\ngence, 2004.\n\n[11] H. McMahan and A. Blum. Online geometric optimization in the bandit setting against an\nadaptive adversary. In J. Shawe-Taylor and Y. Singer, editors, Learning Theory, volume 3120\nof Lecture Notes in Computer Science, pages 109–123. Springer Berlin / Heidelberg, 2004.\n\n[12] O. Missura and T. Gärtner. Player modeling for intelligent difficulty adjustment. In Discovery\nScience, pages 197–211. Springer, 2009.\n\n[13] J. Togelius, R. Nardi, and S. Lucas. Making racing fun through player modeling and track\nevolution. In SAB’06 Workshop on Adaptive Approaches for Optimizing Player Satisfaction in\nComputer and Physical Games, pages 61–70, 2006.\n\n[14] G. Yannakakis and M. Maragoudakis. Player modeling impact on player’s entertainment in\ncomputer games. 
Lecture Notes in Computer Science, 3538:74, 2005.\n", "award": [], "sourceid": 1134, "authors": [{"given_name": "Olana", "family_name": "Missura", "institution": null}, {"given_name": "Thomas", "family_name": "G\u00e4rtner", "institution": null}]}