{"title": "Algorithmic Assurance: An Active Approach to Algorithmic Testing using Bayesian Optimisation", "book": "Advances in Neural Information Processing Systems", "page_first": 5465, "page_last": 5473, "abstract": "We introduce algorithmic assurance, the problem of testing whether\nmachine learning algorithms are conforming to their intended design\ngoal. We address this problem by proposing an efficient framework\nfor algorithmic testing. To provide assurance, we need to efficiently\ndiscover scenarios where an algorithm decision deviates maximally\nfrom its intended gold standard. We mathematically formulate this\ntask as an optimisation problem of an expensive, black-box function.\nWe use an active learning approach based on Bayesian optimisation\nto solve this optimisation problem. We extend this framework to algorithms\nwith vector-valued outputs by making appropriate modification in Bayesian\noptimisation via the EXP3 algorithm. We theoretically analyse our\nmethods for convergence. Using two real-world applications, we demonstrate\nthe efficiency of our methods. The significance of our problem formulation\nand initial solutions is that it will serve as the foundation in assuring\nhumans about machines making complex decisions.", "full_text": "Algorithmic Assurance: An Active Approach to\nAlgorithmic Testing using Bayesian Optimisation\n\nShivapratap Gopakumar, Sunil Gupta\u2217, Santu Rana, Vu Nguyen, Svetha Venkatesh\n\nCentre for Pattern Recognition and Data Analytics\n\nDeakin University, Geelong, Australia\n\nAbstract\n\nWe introduce algorithmic assurance, the problem of testing whether machine\nlearning algorithms are conforming to their intended design goal. We address this\nproblem by proposing an ef\ufb01cient framework for algorithmic testing. To provide\nassurance, we need to ef\ufb01ciently discover scenarios where an algorithm decision\ndeviates maximally from its intended gold standard. 
We mathematically formulate\nthis task as an optimisation problem of an expensive, black-box function. We use an\nactive learning approach based on Bayesian optimisation to solve this optimisation\nproblem. We extend this framework to algorithms with vector-valued outputs by\nmaking appropriate modi\ufb01cation in Bayesian optimisation via the EXP3 algorithm.\nWe theoretically analyse our methods for convergence. Using two real-world\napplications, we demonstrate the ef\ufb01ciency of our methods. The signi\ufb01cance of\nour problem formulation and initial solutions is that it will serve as the foundation\nin assuring humans about machines making complex decisions.\n\n1\n\nIntroduction\n\nSupervised learning algorithms today serve as proxies for decision processes traditionally performed\nby humans. As decision making processes get increasingly automated, it is reasonable to ask if\nour algorithms are behaving as intended. How far is the algorithm from the gold standard (human\ndecision maker) it is serving as a proxy for? For example, consider a metallurgist who routinely\nmakes decisions about elemental compositions to design a target alloy. If an algorithm is built to\nserve as a proxy for this decision process, can we provide assurance that the difference in the decision\nmade by the algorithm and the metallurgist is within a stipulated bound? Similarly if an algorithm\nhas been trained to recognize digits, can we ensure that the recognition error of the algorithm is\nacceptable across all allowable visual variations within which a human can recognise digits correctly?\nTo provide such assurance we need to compare an algorithm against its gold standard and \ufb01nd the\nmaximum deviation. An exhaustive comparison may solve this problem but would be prohibitively\nexpensive as we need gold standard decisions for a large number of test instances. 
In absence of such a large set, how do we find such deviations efficiently?

Traditionally, machine learning algorithms are tested by separating a small fraction of the available data as a validation set. Considering the validation set as a collection of random samples from the data space, we may need a large validation set to have high confidence in the algorithmic assurance, i.e. that the maximal deviation of the algorithm from its gold standard is within an acceptable limit. Let us assume a hypervolume vε wherein the function takes values within 1 − ε of its maximum. Then a random search will sample this hypervolume with probability vε/V, where V is the total search space volume. Assuming V = R^d and vε ≈ r^d, where d is the input dimension, the random search scheme would need, on average, O((r/R)^−d) samples [3]. This can be expensive - e.g. if r/R = 0.01, nearly a million samples are required in just a three dimensional space. Therefore, a sample efficient alternative is needed for algorithmic assurance.

∗Corresponding author email: sunil.gupta@deakin.edu.au

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

We propose to use an active strategy for finding the maximum deviation that samples the data space such that each sample is only queried if it is aligned with the goal to find the maximum. Bayesian optimisation is one such efficient active learning method, with a convergence guarantee on the average regret of O(√(d lnT /T)) (using a Gaussian process model with a squared-exponential kernel [4, 19]), where T is the number of iterations/samples. Thus, to reach the same regret level ε, Bayesian optimisation requires far fewer samples.
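The sample-count argument above can be checked numerically (an illustrative sketch; the function name is ours):

```python
def random_search_samples(r_over_R: float, d: int) -> float:
    """Average number of uniform random draws before one lands in the
    eps-optimal hypervolume v_eps ~ r^d inside a space of volume V = R^d,
    i.e. O((r/R)^-d)."""
    return r_over_R ** (-d)

# The example from the text: r/R = 0.01 in a three dimensional space.
print(round(random_search_samples(0.01, 3)))  # about a million samples
```

Even modest dimensions make this count explode, which is the motivation for the active, Bayesian-optimisation-based alternative.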
This approach actively recommends new instances during validation for which decisions are required from both the algorithm and the human expert. Although costly, it remains practical because of the sample-efficiency guarantee of Bayesian optimisation. Our experience shows that it is common to reach the maximum within tens of samples per dimension.

We develop a Bayesian Optimisation (BO) framework to efficiently discover the scenario wherein an algorithm maximally deviates from its gold standard. Given a difference function y = f(x), where x represents an input instance and y represents the deviation of the algorithm's decision from the gold standard, our proposed algorithmic assurance framework aims to efficiently discover the instance for which the algorithmic decision differs most from the gold standard. We assume that the functions underlying the decision making of both the gold standard process and the algorithm are smooth, and therefore their difference function f(x) is also smooth. We model f(x) using a Gaussian process [15], and its predictive distribution is used to predict the deviation of the algorithm from the gold standard along with any epistemic uncertainty. This prediction is then used to construct a cheap surrogate function (acquisition function) that takes higher values at points where either the algorithm deviation or the epistemic uncertainty is high. The acquisition function is finally maximised to recommend a new instance for algorithm testing. Both the gold standard and the algorithm decisions are then acquired to evaluate f(x) at the new instance, and this information is used to update the Gaussian process model of f(x). This process iterates until convergence. We call this framework single-task algorithmic assurance.

We next move to multi-task algorithmic assurance, where we extend our framework to provide assurance for algorithms that have vector-valued outputs.
Our goal now is to find the scenario where an algorithm maximally deviates from its gold standard across any output. For example, in alloy design, elements are combined and heated, leading to phase formations. The strength of the resultant alloy is related to these phase fractions. An algorithm can be used to model the relation between the elemental composition and the phases. Some phases are more common than others, thus the statistics for each phase are not equally strong. This makes the rarer phases more error prone to predict. We therefore need to efficiently find the elemental composition where our algorithm's phase prediction maximally deviates from the true phase values across any phase, since predicting each phase is equally important. This boils down to a BO problem with C expensive, black-box functions, where our task is to find the largest global maximum across all the functions. To address this efficiently, we formulate each function as an arm of a multi-arm bandit and define the reward for pulling an arm as the best value found by BO for the corresponding function (up to any iteration). This method can efficiently switch across the C optimisation problems to quickly discover the optimum point. We theoretically analyse this algorithm and show that its simple regret has the order O(√(d lnT /T) + √(C lnC /T)).

It may appear superficially that multi-task BO [20] is related; however, that method optimises multiple related functions concurrently through mutual learning.
We note that our problem is different in two ways: (1) we do not aim to maximise each function, but rather to quickly identify the function with the largest maximum and then find its maximum point; and, (2) the multiple functions in our setting need not be related, which is a crucial assumption in multi-task BO.

We demonstrate our framework on two problems: prediction of strength-determining phases in an alloy design process, and recognition of handwritten digits under visual distortions. Our main contributions are:

• Introduction of a new notion of algorithmic assurance to assess the deviation behaviour of an algorithm from its intended use;

• Construction of an efficient framework for algorithmic assurance in both single and multi-task settings;

• Demonstration of the efficiency of our methods using two real world applications.

The significance of our problem formulation and solutions is that it will be the first step towards providing assurance to users of an algorithm.

2 An Active Approach to Algorithmic Testing

In this section we present our proposed framework for efficient algorithmic testing. Let us assume we have an unknown function a(x) to be modelled using a set of observations of the form D_n^train = {(x_i^tr, o_i^tr), i = 1, . . . , n}. Given the dataset D_n^train, a typical approach is to use a machine learning algorithm (e.g. a neural network) to learn an approximation A(x) of a(x). Define a function f(x) = L(a(x), A(x)) that measures the deviation of A(x) from a(x) at any point x. Various forms of deviation can be used, for example, L(a(x), A(x)) = (a(x) − A(x))² when dealing with a regression problem. In our proposed algorithmic testing framework, our goal is to efficiently identify a scenario x∗ wherein the algorithm output A(x∗) maximally deviates from the function a(x∗).
We express this goal through the following optimisation problem:

x∗ = argmax_{x∈X} f(x) = argmax_{x∈X} (a(x) − A(x))²     (1)

Since the function f(x) is not known analytically, the objective function in the above optimisation problem is treated as a black-box function. In addition, evaluating f(x) is expensive. The problem is thus finding the optimum of an expensive, black-box function.

Bayesian Optimisation: A method that has recently become popular for efficient global optimisation of expensive, black-box functions is Bayesian optimisation (BO) [17, 6, 7, 13]. It represents the black-box function through a probabilistic model, which is then used to reason about where in the space the optimum is located (for exploitation of available knowledge) and where we have the least knowledge about the function (for exploration for further knowledge). Based on this reasoning, the function is evaluated at a new location balancing the exploration and exploitation requirements, and the new observation is used to update the function model. This sequential procedure repeats until the global optimum is reached or the optimisation budget is exceeded. The BO algorithms [19, 4] come with an efficiency guarantee on their convergence and usually have a sub-linear growth rate for cumulative regret.

Gaussian processes are most popular for modelling the unknown function when doing Bayesian optimisation, though other models have also been used [9, 18, 14]. Using a Gaussian process prior, a function is modelled as f(x) ∼ GP(m(x), k(x, x′)), where m is a mean function and k(x, x′) contains the covariance of any two points on the function.
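Such a GP surrogate can be sketched in a few lines (a minimal zero-mean implementation with a squared-exponential kernel; the function names are ours, not from the paper's code):

```python
import numpy as np

def sq_exp_kernel(A, B, lengthscale=0.3):
    """Squared-exponential covariance k(x, x') between row-vector sets A, B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_posterior(X, y, Xq, noise=1e-4):
    """Predictive mean and variance of a zero-mean GP at query points Xq."""
    K = sq_exp_kernel(X, X) + noise * np.eye(len(X))   # K + sigma^2 I
    Kq = sq_exp_kernel(Xq, X)                          # k(x', x_i) per query
    mu = Kq @ np.linalg.solve(K, y)                    # k^T (K + s^2 I)^-1 y
    var = 1.0 - np.sum(Kq * np.linalg.solve(K, Kq.T).T, axis=1)
    return mu, var

# Observing f at a point pulls the posterior mean there toward the observation.
X = np.array([[0.2], [0.8]]); y = np.array([1.0, -0.5])
mu, var = gp_posterior(X, y, np.array([[0.2]]))
```

At an already-observed point the predictive mean is close to the observed value and the predictive variance is close to the noise level, which is exactly the behaviour the acquisition function exploits.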
With the availability of noisy observations of the form y_i = f(x_i) + ε_i, where ε_i ∼ N(0, σ_ε²), collectively denoted as D_t = {x_i, y_i}_{i=1}^t, we can derive the predictive distribution for the function value at a new observation x′ to be a Gaussian distribution [15] - its mean and variance are given as μ_t(x′) = kᵀ(K + σ_ε²I)⁻¹y and σ_t²(x′) = k(x′, x′) − kᵀ(K + σ_ε²I)⁻¹k, where, assuming k is a covariance function [15], K is a matrix of size t × t whose (i, j)-th element is defined as k(x_i, x_j) and k is a vector (overloaded notation) with its i-th element defined as k(x, x_i).

A nice property of BO when using Gaussian processes is that it usually avoids convergence to any “spurious peaks” and mostly converges to a stable peak. This property is useful for our algorithmic testing framework when we are interested in finding not just the location of the largest deviation of the algorithm but a region where the deviations are generally high. This may help in understanding the reasons for the algorithm's deviation and any potential remedies.

An illustrative example: To understand how BO avoids convergence to any “spurious peaks”, let us consider an illustrative example function f(x) with two peaks (see Figure 1) at locations x₀ and x′₀ such that f(x₀) > f(x′₀). Now consider two cases such that in the first case, the peak at x₀ is sharper (red) than in the second case (grey). When using a Gaussian process model, we can show that if the two cases have previous observations at the same locations {x₁, . . . , x_t}, the predictive mean of the Gaussian process model μ_t(x) for case-1 will be lower than that of case-2. This is because μ_t(x) = kᵀ(K + σ_ε²I)⁻¹y, and the y_i's of case-1 are only equal to or lower than the corresponding y_i's of case-2.
With the assumption of the previous observations for the two cases being at the same locations, the predictive variance of the Gaussian process model σ_t(x) for both cases would be equal. Thus an acquisition function α_t(x) (based on a typical acquisition function such as GP-UCB [19] or EI [10]) will take a lower value for case-1 than case-2. Since the two cases mainly differ around location x₀ (see Figure 1), the acquisition function α_t(x) around x₀ will be lower for case-1 than case-2. Therefore the probability of a point x around x₀ being the maximiser of α_t(x) is lower for case-1 than case-2. Further, the narrower the peak of case-1, the lower is this probability. Therefore, the Bayesian optimisation algorithm converges to the narrower peak with lower probability. This result would generally hold as long as the observations used in BO have a small measurement noise. If convergence to narrow peaks is becoming unavoidable or common, one may resort to BO methods that are customised to avoid spurious peaks [12, 5].

Figure 1 – An example function illustrating spurious (red) and wider (grey) peaks.

3 Multi-Task Algorithmic Testing

There are several applications where we need to model vector-valued outputs. In other words, this involves modelling multiple outputs or tasks. For example, in alloy design, for each composition of constituent elements, we have multiple phases. Let us assume that we have trained one machine learning model for each of these tasks. These models can be either independently or jointly trained depending on whether the tasks are independent or related. Since each task is different, the scenario where the algorithm output maximally deviates from the true output differs from task to task.
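The single-task testing loop of Section 2 can be sketched end-to-end as follows (a toy 1-D deviation function stands in for the algorithm/gold-standard pair; the GP-UCB constant and the random candidate set are our simplifications, not the paper's exact setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def kernel(A, B, ls=0.2):
    """Squared-exponential covariance for 1-D inputs."""
    return np.exp(-0.5 * ((A[:, None] - B[None, :]) ** 2) / ls ** 2)

def gp(Xo, yo, Xq, noise=1e-4):
    """Zero-mean GP posterior mean and standard deviation at queries Xq."""
    K = kernel(Xo, Xo) + noise * np.eye(len(Xo))
    Kq = kernel(Xq, Xo)
    mu = Kq @ np.linalg.solve(K, yo)
    var = 1.0 - np.sum(Kq * np.linalg.solve(K, Kq.T).T, axis=1)
    return mu, np.sqrt(np.clip(var, 1e-12, None))

# Toy stand-in for f(x) = (a(x) - A(x))^2: deviation peaks near x = 0.7.
def deviation(x):
    return np.exp(-((x - 0.7) ** 2) / 0.01)

X = list(rng.uniform(0, 1, 3)); y = [deviation(x) for x in X]
for t in range(20):                               # sequential testing loop
    cand = rng.uniform(0, 1, 200)                 # random candidate instances
    mu, sd = gp(np.array(X), np.array(y), cand)
    x_next = cand[np.argmax(mu + 2.0 * sd)]       # GP-UCB: exploit + explore
    X.append(x_next); y.append(deviation(x_next))

x_star = X[int(np.argmax(y))]   # discovered scenario of maximal deviation
```

Each iteration queries the deviation only where the surrogate suggests either a high deviation or high uncertainty, which is what makes the search sample efficient.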
Our above-mentioned single-task algorithmic testing method can be applied to this multi-task problem by aggregating the deviations over all tasks, and thus can only discover the scenario where the algorithm deviates from the true output in an average sense. However, when it is important to get the assurance for each task or output, this approach may be insufficient.

In our proposed multi-task algorithmic testing, we aim to efficiently discover the scenario wherein the algorithm maximally differs from the true function for any of the outputs or tasks. Let us assume that there are C tasks, indexed as c = 1, . . . , C, and for the c-th task, the true and the trained algorithm functions are a_c(x) and A_c(x) respectively. We denote the discrepancy functions between the algorithm and the true functions by f₁(x), . . . , f_C(x). Each function has an optimum f_c∗ = max_{x∈X} f_c(x). We aim to find both the optimal index c∗ such that c∗ = argmax_{c∈C} f_c∗ and the optimiser location x∗ = argmax_{x∈X} f_{c∗}(x).

A simple approach to solve our problem is to perform Bayesian optimisation for each function f_c(x) to obtain f_c∗ and then finally find c∗ and x∗. However, this approach is inefficient, as it unnecessarily evaluates the suboptimal functions for their complete Bayesian optimisation sequence. Our intuition is that it is possible to identify the tasks for which the algorithm has high errors within a few function evaluations from all tasks, and then mostly perform function evaluations for the tasks with high deviations from the gold standard. In multi-arm bandit (MAB) research, this problem can be thought of as identifying an arm with the best reward (or simply the “best arm”). There are several algorithms to identify the best arm, e.g. UCB1, ε-greedy, Hedge, EXP3, etc. [1, 2].
Of these, Hedge and EXP3 are the algorithms that can be used under the most general conditions, with few assumptions on the reward distributions, unlike UCB1 and ε-greedy, which require an i.i.d. assumption. In our case, at any iteration of Bayesian optimisation, we define the reward of choosing a task as the best function value reached up to that iteration for that task. Since the “best so far” statistic is not independent across iterations, the reward distribution is not i.i.d. The use of the Hedge algorithm with BO has been considered earlier by [8] in a different context to ours: the Hedge algorithm in [8] is used to select acquisition functions for Bayesian optimisation. A requirement of the Hedge algorithm is that it needs the observation of rewards from each arm at all iterations. Unfortunately this requirement is not met in our scheme, as we only receive the reward for the selected arm - a partial reward feedback scenario. Therefore, for our multi-task algorithmic testing framework we use the EXP3 algorithm [2, 16], which is capable of working in a partial reward feedback scenario.

Using the EXP3 algorithm we proceed as follows. At each iteration t, we first select a function indexed as h_t = c and then advance its (one step) Bayesian optimisation to select the next point for evaluation by maximising the acquisition function as x_t = argmax_{x∈X} α_t(x | D_t(h_t)), where D_t(h_t) are the observations up to iteration t for the task indexed as h_t. The reward for the selected function is denoted by g_t(c) and is defined as the best function value so far, i.e. g_t(c) = max_{x_i∈D_t(c)} f_c(x_i). Using the rewards g_t(c), we compute a probability p_t^c = (1 − η) ω_c/∑_{c=1}^C ω_c + η/C, where ω_c = ω_c × exp(η ĝ_t(c)/C) and η = √(C lnC/((e − 1)T)) is an EXP3 parameter pre-defined given the maximum budget T (as per Corollary 3.2 of [2]). The probability vector p_t = [p_t^1, . . . , p_t^C] indicates the promise of different tasks for obtaining high values and is used to select a function for performing Bayesian optimisation. This process continues iteratively either until convergence or until the function evaluation budget is exhausted. We refer to this algorithm as EXP3BO (see Algorithm 1).

Algorithm 1 EXP3BO Algorithm for Multi-task Algorithmic Testing
Input: η = √(C lnC/((e − 1)T)), C #categorical choices, T #max iterations
1: Init ω_c = 1, ∀c = 1 . . . C.
2: for t = 1 to T
3:   Compute the probability p_t^c = (1 − η) ω_c/∑_{c=1}^C ω_c + η/C, ∀c = 1 . . . C.
4:   Choose a categorical variable at random h_t ∈ [1, . . . , C] ∼ p_t = [p_t^1, . . . , p_t^C].
5:   Optimise the acquisition function x_t = argmax α_t(x | D_t(h_t)) given h_t.
6:   Evaluate the black-box function y_t = f([x_t, h_t = c]) and augment D_t(h_t) = D_{t−1}(h_t) ∪ (x_t, y_t).
7:   Update the reward g_t(h_t) = max_{x_i∈D_t(h_t)} f_{h_t}(x_i) and normalise it as ĝ_t(h_t) = g_t(h_t)/p_t^{h_t}.
8:   Update the weight ω_{h_t} = ω_{h_t} × exp(η ĝ_t(h_t)/C).
9: end for
Output: D_T

Convergence Analysis

We now present the convergence analysis. All the bounds are probabilistic bounds that hold with high probability. Let γ_T be the maximum information gain over any T iterations; it can be bounded for common kernels (e.g. for the SE kernel, γ_T ∼ O((lnT)^{d+1})) [19].
Lemma 1.
(Due to [19]) Let T be the number of iterations and d be the input space dimension; then we can bound the simple regret S_T after T iterations of GP-UCB by a sublinear term as

S_T = f_c∗ − max_{x_t, t≤T} f_c(x_t) ≤ (1/T) ∑_{t=1}^T (f_c∗ − f_c(x_t)) ≤ O(√(γ_T lnT /T)).

Since we do not know which function among f₁, . . . , f_C has the overall maximum f∗, as discussed earlier a naïve algorithm can divide any available function evaluation budget T equally among the C options. We refer to this algorithm as Round-robin BO. This algorithm only allocates T/C evaluations for the optimal function indexed by c∗. We next provide the convergence rate for this Round-robin algorithm and later show that our proposed EXP3BO algorithm has a tighter bound than the Round-robin BO. Another similar naïve approach (Random Categorical BO) is to randomly select a function and optimise it. On average, this approach will also allocate T/C evaluations for each function.

Lemma 2. Given C choices, the Round-robin BO and the Random Categorical BO methods have their simple regret bounded as S_T ≤ O(√(Cγ_T lnT /T)).

Proof. Since these methods allocate only T/C evaluations to optimise the optimal function f_{c∗}(x), using Lemma 1 we can write the simple regret bound as S_T = S_{T/C}(c∗) ≤ O(√(Cγ_T lnT /T)). We can see that the regret increases as O(√C).

Lemma 3. (Due to [2]) For T > 0, setting η = √(C lnC/((e − 1)T)), the expected regret of the EXP3 algorithm is bounded as

max_{h∈[C]} ∑_{t=1}^T g_t(h) − E[∑_{t=1}^T g_t(h_t)] ≤ O(√(TC lnC)),

where we denote g_t(c) = max_{x_i∈D_t(c)} f_c(x_i). The expectation is under the randomness in the algorithm to select h_t.

Theorem 4. The EXP3BO algorithm has its simple regret bounded by

E[S_T^{EXP3BO}] ≤ O(√(γ_T lnT /T) + √(C lnC/T)).

Proof. Let f∗ = max_{c∈C, x∈X} f_c(x) be the optimum value that we seek. From Lemma 3, we can write

(f∗ − E[(1/T) ∑_{t=1}^T g_t(h_t)]) − (f∗ − max_{h∈[C]} (1/T) ∑_{t=1}^T g_t(h)) < O(√(C lnC/T)).     (2)

Since (1/T) ∑_{t=1}^T g_t(h_t) ≤ g_T(h_T), we have E[S_T^{EXP3BO}] = f∗ − E[g_T(h_T)] ≤ f∗ − E[(1/T) ∑_{t=1}^T g_t(h_t)]. Denote the oracle simple regret as S_T^{Oracle} = f∗ − max_{h∈[C]} (1/T) ∑_{t=1}^T g_t(h). Further assuming that the best arm can be identified by the oracle with high probability, using Lemma 1 we have S_T^{Oracle} ≤ O(√(γ_T lnT /T)), and thus E[S_T^{EXP3BO}] < O(√(C lnC/T)) + O(√(γ_T lnT /T)).

We can see that the regret bound remains sublinear in T and is tighter than the regret bound of the Random Categorical or the Round-robin algorithm.

Figure 2 – Left: Element compositions (parallel coordinates) for max error during training and test. Bounds for each element composition in shaded regions. Right: Convergence of test error (RMSE).

4 Experiments

We evaluate single and multi-task assurance using two real world applications: (1) alloy design, and (2) handwritten digit recognition. In our algorithm, a squared exponential kernel is used for BO. All our results are reported by aggregating results from 10 runs, with each run initialised randomly.

4.1 A neural network model predicting alloy-strengthening phases

Alloys are mixtures of elements that are able to achieve properties that are not possible with a single element. Laboriously collected experimental data elaborate how a mixture of elements forms “phases”. A phase is a homogeneous part of the alloy that has uniform physical and chemical characteristics, and determines the alloy strength. Experimental data for alloys are contained in proprietary simulators (e.g. Thermocalc) and experimenters query such simulators for computed phase characteristics. These complex computations are expensive.

We construct a proxy algorithm for Thermocalc using a neural network to predict phases. We then apply our model to discover the test data point where the network prediction differs most from the Thermocalc output.
Our proxy network is trained on 1000 samples generated from Thermocalc for Aluminium 7000 series alloys, which mainly consist of Aluminium and seven other elements (Cr, Cu, Mg, Ti, Zn, Mn, Si) whose % compositions are in a defined range as shown by the shaded region in Fig. 2 (Left). The input to the network is a 7 dimensional vector of element compositions. The output is a vector of alloy phases. After consulting with domain experts, we model 16 relevant phases. Our neural network consists of 2 hidden layers with 14 and 36 nodes respectively. A 30% dropout was introduced between the second layer and the output layer. The network was trained to minimise the error averaged over the 16 phases. The neural network was trained for 100 epochs using a batch size of 5. The alloy composition corresponding to the maximal training RMSE of 0.27 was: Cr = 0.85%, Cu = 2.06%, Mg = 0.18%, Ti = 0.88%, Zn = 8.25%, Mn = 0.37%, and Si = 0.56% (Fig. 2 (Left)). We use this neural network model for single and multi-task assurance. The single task measures the average error made across all phases, whilst the multi-task setting measures the error in each phase individually. In our notation, x denotes an elemental composition and y denotes the error in phase prediction.

4.1.1 Single task assurance

We run BO for our single task assurance to discover the composition with the maximal deviation from Thermocalc. The optimisation result is shown in Fig. 2 (Right). The element composition discovered by our method corresponding to the maximal error is Cr = 0.38%, Cu = 6.04%, Mg = 4.89%, Ti = 0.48%, Zn = 6.95%, Mn = 0.86% and Si = 2.51%. As seen from Fig. 
2, our algorithm discovers a significantly different composition with a much larger error (0.59) in just about 200 iterations.

4.1.2 Multi-task assurance

We use EXP3BO for multi-task assurance to discover the composition with the maximal deviation from Thermocalc for any alloy phase. Instead of measuring the error averaged across all phases, we consider the error of each phase separately. This gives rise to 16 error functions, among which the highest error needs to be found efficiently without exhaustively optimising all of them. To evaluate the optimisation efficiency of our proposed EXP3BO algorithm we compare it against the baselines - Round-robin BO, Random Categorical BO and SMAC [9]. To find the phase that has the highest error (Oracle), we run BO for each phase separately and identify the phase with the highest error (c∗). Fig. 4 (Left) shows that EXP3BO outperforms the other baselines and reaches close to the Oracle. It also accurately identifies the “AL12MN” phase as having the highest error - see the histogram in Fig. 4 (Right). We found the maximal error for the “AL12MN” phase, RMSE = 1.05, at a substantially different element composition compared to the one found during the algorithm training stage (see Fig. 3).

Figure 3 – Element compositions (parallel coordinates) for maximal error found during training and test stages by EXP3BO.

Figure 4 – Alloy phase prediction using EXP3BO. Left: Performance comparison - RMSE vs iterations. Right: Histogram of phases selected. It converges and exploits the “AL12MN” phase more.

4.2 A convolutional neural network for handwritten digit recognition

We construct a proxy algorithm for recognising digits, and the task is to identify the level of distortion causing the largest error on a transformed MNIST [11] dataset.
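The shear-and-rotation distortions used in this experiment can be parameterised as a single 2×2 affine map on pixel coordinates (a sketch of the parameterisation only; the composition order follows the shear-then-rotate pipeline described for the training data, while the coordinate convention is our assumption):

```python
import numpy as np

def distortion_matrix(shx, shy, theta_deg):
    """2x2 affine map: shear by (shx, shy), then rotate by theta degrees."""
    shear = np.array([[1.0, shx],
                      [shy, 1.0]])
    t = np.deg2rad(theta_deg)
    rot = np.array([[np.cos(t), -np.sin(t)],
                    [np.sin(t),  np.cos(t)]])
    return rot @ shear

# A distortion x = (shx, shy, theta) is then a point in a 3-dimensional
# search space; applying M to pixel coordinates distorts a digit image.
M = distortion_matrix(0.2, -0.1, 30.0)
```

With (shx, shy, θ) = (0, 0, 0) the map is the identity, so the search space contains the undistorted images as one point.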
In our notation, x denotes a visual distortion (shear and rotation) while y denotes the recognition error. The training data consists of MNIST images distorted with shear (sh_x, sh_y) and rotation (θ). Our training dataset is created as follows: each MNIST digit is first randomly sheared with sh_x, sh_y ∈ [−0.2, 0.2], followed by a random rotation θ ∈ [0, 360]. We removed digit 9 from our data to avoid confusion with digit 6 when subjected to the rotation transform. 54,051 such sheared and rotated MNIST digits are used for training a CNN. We use the LENET-5 architecture (as in [11]) with learning rate = 10⁻³ and the number of epochs set to 20. The mean training error was found to be 4.1%.
Maximal error from grid search was found to be 5.7% at shear (shx = −0.2, shy = −0.2) and rotation θ = 3°.

4.2.1 Single assurance task

We run BO for our single-task assurance to discover the distortion with the maximal recognition error. The optimisation result is shown in Fig. 5. BO discovered a maximum error of 7.1% at distortion parameters (shx = 0.088, shy = −0.2) and rotation θ = 175.7°.

Figure 5 – Single task assurance for digit recognition: optimisation results showing recognition error vs iteration.

4.2.2 Multi-task assurance

We use EXP3BO for multi-task assurance to discover the maximal recognition error for any digit. Instead of measuring the error averaged across all digits, we consider the error of each digit separately. This is important because the error in recognising each digit may differ depending on its visual complexity and distortion. Once again we compare EXP3BO with the baselines described in Section 4.1.2. The performance of EXP3BO is superior to SMAC (Fig. 6 (Left)). EXP3BO selects digit '2' as the one with the highest error (Fig. 6 (Right)). The confusion between digits 2, 7 and 4 under shearing and rotation explains the comparable performance of the other methods.

Our implementation is available at https://github.com/shivapratap/AlgorithmicAssurance_NIPS2018.

Figure 6 – Multitask assurance using EXP3BO on digit recognition. Left: Performance comparison of recognition error vs iterations; Right: Histogram of digits selected.

5 Conclusion

We have introduced a novel problem of algorithmic assurance to assess the deviation of an algorithm from its intended use. We have developed an efficient framework for algorithmic testing in both single-task and multi-task settings.
The usefulness of our framework is demonstrated on two problems: prediction of strength-determining phases in alloy design, and recognition of handwritten digits under shear and rotation distortions. In the modern era of artificial intelligence, algorithms increasingly make decisions pertinent to our lives; it is therefore timely to build confidence that algorithms can be trusted, and our proposed algorithmic assurance framework is an early step towards this goal.

Acknowledgements

This research was partially funded by the Australian Government through the Australian Research Council (ARC). Prof Venkatesh is the recipient of an ARC Australian Laureate Fellowship (FL170100006).

References

[1] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

[2] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.

[3] J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305, 2012.

[4] A. D. Bull. Convergence rates of efficient global optimization algorithms. Journal of Machine Learning Research, 12:2879–2904, 2011.

[5] T. Dai Nguyen, S. Gupta, S. Rana, and S. Venkatesh. Stable Bayesian optimization. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 578–591. Springer, 2017.

[6] P. Hennig and C. J. Schuler. Entropy search for information-efficient global optimization. Journal of Machine Learning Research, 13:1809–1837, 2012.

[7] J. M.
Hernández-Lobato, M. W. Hoffman, and Z. Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. In Advances in Neural Information Processing Systems, pages 918–926, 2014.

[8] M. Hoffman, E. Brochu, and N. de Freitas. Portfolio allocation for Bayesian optimization. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, pages 327–336. AUAI Press, 2011.

[9] F. Hutter, H. H. Hoos, and K. Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In Learning and Intelligent Optimization, pages 507–523. Springer, 2011.

[10] D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492, 1998.

[11] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[12] J. Nogueira, R. Martinez-Cantin, A. Bernardino, and L. Jamone. Unscented Bayesian optimization for safe robot grasping. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1967–1972. IEEE, 2016.

[13] S. Rana, C. Li, S. Gupta, V. Nguyen, and S. Venkatesh. High dimensional Bayesian optimization with elastic Gaussian process. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 2883–2891, 2017.

[14] C. E. Rasmussen. The infinite Gaussian mixture model. In NIPS, volume 12, pages 554–560, 1999.

[15] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

[16] Y. Seldin, C. Szepesvári, P. Auer, and Y. Abbasi-Yadkori. Evaluation and analysis of the performance of the EXP3 algorithm in stochastic environments. In EWRL, pages 103–116, 2012.

[17] J. Snoek, H. Larochelle, and R. P. Adams.
Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951–2959, 2012.

[18] J. Snoek, O. Rippel, K. Swersky, R. Kiros, N. Satish, N. Sundaram, M. Patwary, M. Prabhat, and R. Adams. Scalable Bayesian optimization using deep neural networks. In Proceedings of the 32nd International Conference on Machine Learning, pages 2171–2180, 2015.

[19] N. Srinivas, A. Krause, S. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning, pages 1015–1022, 2010.

[20] K. Swersky, J. Snoek, and R. P. Adams. Multi-task Bayesian optimization. In Advances in Neural Information Processing Systems, pages 2004–2012, 2013.