{"title": "Streaming Weak Submodularity: Interpreting Neural Networks on the Fly", "book": "Advances in Neural Information Processing Systems", "page_first": 4044, "page_last": 4054, "abstract": "In many machine learning applications, it is important to explain the predictions of a black-box classifier. For example, why does a deep neural network assign an image to a particular class? We cast interpretability of black-box classifiers as a combinatorial maximization problem and propose an efficient streaming algorithm to solve it subject to cardinality constraints. By extending ideas from Badanidiyuru et al. [2014], we provide a constant factor approximation guarantee for our algorithm in the case of random stream order and a weakly submodular objective function. This is the first such theoretical guarantee for this general class of functions, and we also show that no such algorithm exists for a worst case stream order. Our algorithm obtains similar explanations of Inception V3 predictions 10 times faster than the state-of-the-art LIME framework of Ribeiro et al. [2016].", "full_text": "Streaming Weak Submodularity:\n\nInterpreting Neural Networks on the Fly\n\nEthan R. Elenberg\n\nDepartment of Electrical\nand Computer Engineering\n\nThe University of Texas at Austin\n\nelenberg@utexas.edu\n\nAlexandros G. Dimakis\nDepartment of Electrical\nand Computer Engineering\n\nThe University of Texas at Austin\ndimakis@austin.utexas.edu\n\nMoran Feldman\n\nDepartment of Mathematics\n\nand Computer Science\nOpen University of Israel\nmoranfe@openu.ac.il\n\nAmin Karbasi\n\nDepartment of Electrical Engineering\n\nDepartment of Computer Science\n\nYale University\n\namin.karbasi@yale.edu\n\nAbstract\n\nIn many machine learning applications, it is important to explain the predictions\nof a black-box classi\ufb01er. For example, why does a deep neural network assign\nan image to a particular class? 
We cast interpretability of black-box classifiers\nas a combinatorial maximization problem and propose an efficient streaming\nalgorithm to solve it subject to cardinality constraints. By extending ideas from\nBadanidiyuru et al. [2014], we provide a constant factor approximation guarantee\nfor our algorithm in the case of random stream order and a weakly submodular\nobjective function. This is the first such theoretical guarantee for this general class\nof functions, and we also show that no such algorithm exists for a worst case stream\norder. Our algorithm obtains similar explanations of Inception V3 predictions 10\ntimes faster than the state-of-the-art LIME framework of Ribeiro et al. [2016].\n\n1 Introduction\n\nConsider the following combinatorial optimization problem. Given a ground set N of N elements\nand a set function f : 2^N \u2192 R_{\u22650}, find the set S of size k which maximizes f(S). This formulation\nis at the heart of many machine learning applications such as sparse regression, data summarization,\nfacility location, and graphical model inference. Although the problem is intractable in general, if\nf is assumed to be submodular then many approximation algorithms have been shown to perform\nprovably within a constant factor from the best solution.\nSome disadvantages of the standard greedy algorithm of Nemhauser et al. [1978] for this problem are\nthat it requires repeated access to each data element and a large total number of function evaluations.\nThis is undesirable in many large-scale machine learning tasks where the entire dataset cannot fit in\nmain memory, or when a single function evaluation is time consuming. 
In our main application, each\nfunction evaluation corresponds to inference on a large neural network and can take a few seconds.\nIn contrast, streaming algorithms make a small number of passes (often only one) over the data and\nhave sublinear space complexity, and thus, are ideal for tasks of the above kind.\nRecent ideas, algorithms, and techniques from submodular set function theory have been used to\nderive similar results in much more general settings. For example, Elenberg et al. [2016a] used\nthe concept of weak submodularity to derive approximation and parameter recovery guarantees for\nnonlinear sparse regression. Thus, a natural question is whether recent results on streaming algorithms\nfor maximizing submodular functions [Badanidiyuru et al., 2014, Buchbinder et al., 2015, Chekuri\net al., 2015] extend to the weakly submodular setting.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nThis paper answers the above question by providing the first analysis of a streaming algorithm\nfor any class of approximately submodular functions. We use key algorithmic components of\nSIEVE-STREAMING [Badanidiyuru et al., 2014], namely greedy thresholding and binary search,\ncombined with a novel analysis to prove a constant factor approximation for \u03b3-weakly submodular\nfunctions (defined in Section 3). 
Specifically, our contributions are as follows.\n\n\u2022 An impossibility result showing that, even for 0.5-weakly submodular objectives, no randomized\nstreaming algorithm which uses o(N) memory can have a constant approximation\nratio when the ground set elements arrive in a worst case order.\n\n\u2022 STREAK: a greedy, deterministic streaming algorithm for maximizing \u03b3-weakly submodular\nfunctions which uses O(\u03b5^{\u22121} k log k) memory and has an approximation ratio of\n(1 \u2212 \u03b5) \u00b7 (\u03b3/2) \u00b7 (3 \u2212 e^{\u2212\u03b3/2} \u2212 2\u221a(2 \u2212 e^{\u2212\u03b3/2})) when the ground set elements arrive in a random order.\n\n\u2022 An experimental evaluation of our algorithm in two applications: nonlinear sparse regression\nusing pairwise products of features and interpretability of black-box neural network\nclassifiers.\n\nThe above theoretical impossibility result is quite surprising since it stands in sharp contrast to known\nstreaming algorithms for submodular objectives achieving a constant approximation ratio even for\nworst case stream order.\nOne advantage of our approach is that, while our approximation guarantees are in terms of \u03b3, our\nalgorithm STREAK runs without requiring prior knowledge about the value of \u03b3. This is important\nsince the weak submodularity parameter \u03b3 is hard to compute, especially in streaming applications,\nas a single element can alter \u03b3 drastically.\nWe use our streaming algorithm for neural network interpretability on Inception V3 [Szegedy et al.,\n2016]. For that purpose, we define a new set function maximization problem similar to LIME [Ribeiro\net al., 2016] and apply our framework to approximately maximize this function. 
Experimentally,\nwe find that our interpretability method produces explanations of similar quality to LIME, but runs\napproximately 10 times faster.\n\n2 Related Work\n\nMonotone submodular set function maximization has been well studied, starting with the classical\nanalysis of greedy forward selection subject to a matroid constraint [Nemhauser et al., 1978, Fisher\net al., 1978]. For the special case of a uniform matroid constraint, the greedy algorithm achieves\nan approximation ratio of 1 \u2212 1/e [Fisher et al., 1978], and a more involved algorithm obtains this\nratio also for general matroid constraints [C\u0103linescu et al., 2011]. In general, no polynomial-time\nalgorithm can have a better approximation ratio even for a uniform matroid constraint [Nemhauser\nand Wolsey, 1978, Feige, 1998]. However, it is possible to improve upon this bound when the data\nobeys some additional guarantees [Conforti and Cornu\u00e9jols, 1984, Vondr\u00e1k, 2010, Sviridenko et al.,\n2015]. For maximizing nonnegative, not necessarily monotone, submodular functions subject to\na general matroid constraint, the state-of-the-art randomized algorithm achieves an approximation\nratio of 0.385 [Buchbinder and Feldman, 2016b]. Moreover, for uniform matroids there is also\na deterministic algorithm achieving a slightly worse approximation ratio of 1/e [Buchbinder and\nFeldman, 2016a]. The reader is referred to Bach [2013] and Krause and Golovin [2014] for surveys\non submodular function theory.\nA recent line of work aims to develop new algorithms for optimizing submodular functions suitable\nfor large-scale machine learning applications. Algorithmic advances of this kind include\nSTOCHASTIC-GREEDY [Mirzasoleiman et al., 2015], SIEVE-STREAMING [Badanidiyuru et al.,\n2014], and several distributed approaches [Mirzasoleiman et al., 2013, Barbosa et al., 2015, 2016, Pan\net al., 2014, Khanna et al., 2017b]. 
Our algorithm extends ideas found in SIEVE-STREAMING and\nuses a different analysis to handle more general functions. Additionally, submodular set functions\nhave been used to prove guarantees for online and active learning problems [Hoi et al., 2006, Wei\net al., 2015, Buchbinder et al., 2015]. Specifically, in the online setting corresponding to our setting\n(i.e., maximizing a monotone function subject to a cardinality constraint), Chan et al. [2017] achieve\na competitive ratio of about 0.3178 when the function is submodular.\nThe concept of weak submodularity was introduced in Krause and Cevher [2010], Das and Kempe\n[2011], where it was applied to the specific problem of feature selection in linear regression. Their\nmain results state that if the data covariance matrix is not too correlated (using either incoherence or\nrestricted eigenvalue assumptions), then the goodness of fit f(S) = R^2_S, as a function of\nthe feature set S, is weakly submodular. This leads to constant factor approximation guarantees for\nseveral greedy algorithms. Weak submodularity was connected with Restricted Strong Convexity\nin Elenberg et al. [2016a,b]. This showed that the same assumptions which imply the success of\nregularization also lead to guarantees on greedy algorithms. This framework was later used for\nadditional algorithms and applications [Khanna et al., 2017a,b]. Other approximate versions of\nsubmodularity were used for greedy selection problems in Horel and Singer [2016], Hassidim and\nSinger [2017], Altschuler et al. [2016], Bian et al. [2017]. To the best of our knowledge, this is the\nfirst analysis of streaming algorithms for approximately submodular set functions.\nIncreased interest in interpretable machine learning models has led to extensive study of sparse\nfeature selection methods. For example, Bahmani et al. [2013] consider greedy algorithms for logistic\nregression, and Yang et al. 
[2016] solve a more general problem using \u2113_1 regularization. Recently,\nRibeiro et al. [2016] developed a framework called LIME for interpreting black-box neural networks,\nand Sundararajan et al. [2017] proposed a method that requires access to the network\u2019s gradients with\nrespect to its inputs. We compare our algorithm to variations of LIME in Section 6.2.\n\n3 Preliminaries\n\nFirst we establish some definitions and notation. Sets are denoted with capital letters, and all big O\nnotation is assumed to be scaling with respect to N (the number of elements in the input stream).\nGiven a set function f, we often use the discrete derivative f(B | A) := f(A \u222a B) \u2212 f(A). f is\nmonotone if f(B | A) \u2265 0 for all A, B and nonnegative if f(A) \u2265 0 for all A. Using this notation one can\ndefine weakly submodular functions based on the following ratio.\nDefinition 3.1 (Weak Submodularity, adapted from Das and Kempe [2011]). A monotone nonnegative\nset function f : 2^N \u2192 R_{\u22650} is called \u03b3-weakly submodular for an integer r if\n\n\u03b3 \u2264 \u03b3_r := min_{L,S \u2286 N : |L|, |S \u2216 L| \u2264 r} [\u2211_{j \u2208 S \u2216 L} f(j | L)] / f(S | L),\n\nwhere the ratio is considered to be equal to 1 when its numerator and denominator are both 0.\n\nThis generalizes submodular functions by relaxing the diminishing returns property of discrete\nderivatives. It is easy to show that f is submodular if and only if \u03b3_{|N|} = 1.\nDefinition 3.2 (Approximation Ratio). A streaming maximization algorithm ALG which returns\na set S has approximation ratio R \u2208 [0, 1] if E[f(S)] \u2265 R \u00b7 f(OPT), where OPT is the optimal\nsolution and the expectation is over the random decisions of the algorithm and the randomness of the\ninput stream order (when it is random).\nFormally our problem is as follows. Assume that elements from a ground set N arrive in a stream at\neither random or worst case order. The goal is then to design a one pass streaming algorithm that\ngiven oracle access to a nonnegative set function f : 2^N \u2192 
R_{\u22650} maintains at most o(N) elements in\nmemory and returns a set S of size at most k approximating\n\nmax_{|T| \u2264 k} f(T),\n\nup to an approximation ratio R(\u03b3_k). Ideally, this approximation ratio should be as large as possible,\nand we also want it to be a function of \u03b3_k and nothing else. In particular, we want it to be independent\nof k and N.\nTo simplify notation, we use \u03b3 in place of \u03b3_k in the rest of the paper. Additionally, proofs for all our\ntheoretical results are deferred to the Supplementary Material.\n\n4 Impossibility Result\n\nTo prove our negative result showing that no streaming algorithm for our problem has a constant\napproximation ratio against a worst case stream order, we first need to construct a weakly submodular\nset function f_k. Later we use it to construct a bad instance for any given streaming algorithm.\nFix some k \u2265 1, and consider the ground set N_k = {u_i, v_i}_{i=1}^k. For ease of notation, let us define\nfor every subset S \u2286 N_k\n\nu(S) = |S \u2229 {u_i}_{i=1}^k| ,    v(S) = |S \u2229 {v_i}_{i=1}^k| .\n\nNow we define the following set function:\n\nf_k(S) = min{2 \u00b7 u(S) + 1, 2 \u00b7 v(S)}    \u2200 S \u2286 N_k .\n\nLemma 4.1. f_k is nonnegative, monotone and 0.5-weakly submodular for the integer |N_k|.\nSince |N_k| = 2k, the maximum value of f_k is f_k(N_k) = 2 \u00b7 v(N_k) = 2k. We now extend the ground\nset of f_k by adding to it an arbitrarily large number d of dummy elements which do not affect f_k at all.\nClearly, this does not affect the properties of f_k proved in Lemma 4.1. However, the introduction\nof dummy elements allows us to assume that k is an arbitrarily small value compared to N, which is\nnecessary for the proof of the next theorem. In a nutshell, this proof is based on the observation that\nthe elements of {u_i}_{i=1}^k are indistinguishable from the dummy elements as long as no element of\n{v_i}_{i=1}^k has arrived yet.\nTheorem 4.2. 
For every constant c \u2208 (0, 1] there is a large enough k such that no randomized\nstreaming algorithm that uses o(N) memory to solve max_{|S| \u2264 2k} f_k(S) has an approximation ratio\nof c for a worst case stream order.\n\nWe note that f_k has strong properties.\nIn particular, Lemma 4.1 implies that it is 0.5-weakly\nsubmodular for every 0 \u2264 r \u2264 |N|.\nIn contrast, the algorithm we show later assumes weak\nsubmodularity only for the cardinality constraint k. Thus, the above theorem implies that worst\ncase stream order precludes a constant approximation ratio even for functions with much stronger\nproperties compared to what is necessary for getting a constant approximation ratio when the order is\nrandom.\nThe proof of Theorem 4.2 relies critically on the fact that each element is seen exactly once. In\nother words, once the algorithm decides to discard an element from its memory, this element is gone\nforever, which is a standard assumption for streaming algorithms. Thus, the theorem does not apply\nto algorithms that use multiple passes over N, or non-streaming algorithms that use o(N) writable\nmemory, and their analysis remains an interesting open problem.\n\n5 Streaming Algorithms\n\nIn this section we give a deterministic streaming algorithm for our problem which works in a model\nin which the stream contains the elements of N in a random order. We first describe in Section 5.1\nsuch a streaming algorithm assuming access to a value \u03c4 which approximates a\u03b3 \u00b7 f(OPT), where a\nis a shorthand for a = (\u221a(2 \u2212 e^{\u2212\u03b3/2}) \u2212 1)/2. Then, in Section 5.2 we explain how this assumption\ncan be removed to obtain STREAK and bound its approximation ratio, space complexity, and running\ntime.\n\n5.1 Algorithm with access to \u03c4\n\nConsider Algorithm 1.\nIn addition to the input instance, this algorithm gets a parameter \u03c4 \u2208\n[0, a\u03b3 \u00b7 f(OPT)]. 
One should think of \u03c4 as close to a\u03b3 \u00b7 f(OPT), although the following analysis\nof the algorithm does not rely on it. We provide an outline of the proof, but defer the technical details\nto the Supplementary Material.\nTheorem 5.1. The expected value of the set produced by Algorithm 1 is at least\n\n(\u03c4/a) \u00b7 (3 \u2212 e^{\u2212\u03b3/2} \u2212 2\u221a(2 \u2212 e^{\u2212\u03b3/2}))/2 = \u03c4 \u00b7 (\u221a(2 \u2212 e^{\u2212\u03b3/2}) \u2212 1) .\n\nAlgorithm 1 THRESHOLD GREEDY(f, k, \u03c4)\n\nLet S \u2190 \u2205.\nwhile there are more elements do\n    Let u be the next element.\n    if |S| < k and f(u | S) \u2265 \u03c4/k then\n        Update S \u2190 S \u222a {u}.\n    end if\nend while\nreturn: S\n\nAlgorithm 2 STREAK(f, k, \u03b5)\n\nLet m \u2190 0, and let I be an (originally empty) collection of instances of Algorithm 1.\nwhile there are more elements do\n    Let u be the next element.\n    if f(u) \u2265 m then\n        Update m \u2190 f(u) and u_m \u2190 u.\n    end if\n    Update I so that it contains an instance of Algorithm 1 with \u03c4 = x for every x \u2208 {(1 \u2212 \u03b5)^i | i \u2208\n    Z and (1 \u2212 \u03b5)m/(9k^2) \u2264 (1 \u2212 \u03b5)^i \u2264 mk}, as explained in Section 5.2.\n    Pass u to all instances of Algorithm 1 in I.\nend while\nreturn: the best set among all the outputs of the instances of Algorithm 1 in I and the singleton\nset {u_m}.\n\nProof (Sketch). Let E be the event that f(S) < \u03c4, where S is the output produced by Algorithm 1.\nClearly f(S) \u2265 \u03c4 whenever E does not occur, and thus, it is possible to lower bound the expected\nvalue of f(S) using E as follows.\nObservation 5.2. Let S denote the output of Algorithm 1, then E[f(S)] \u2265 (1 \u2212 Pr[E]) \u00b7 \u03c4.\nThe lower bound given by Observation 5.2 is decreasing in Pr[E]. Proposition 5.4 provides another\nlower bound for E[f(S)] which increases with Pr[E]. An important ingredient of the proof of this\nproposition is the next observation, which implies that the solution produced by Algorithm 1 is always\nof size smaller than k when E happens.\nObservation 5.3. 
If at some point Algorithm 1 has a set S of size k, then f(S) \u2265 \u03c4.\nThe proof of Proposition 5.4 is based on the above observation and on the observation that the random\narrival order implies that every time that an element of OPT arrives in the stream we may assume it\nis a random element out of all the OPT elements that did not arrive yet.\nProposition 5.4. For the set S produced by Algorithm 1,\n\nE[f(S)] \u2265 (1/2) \u00b7 (\u03b3 \u00b7 [Pr[E] \u2212 e^{\u2212\u03b3/2}] \u00b7 f(OPT) \u2212 2\u03c4) .\n\nThe theorem now follows by showing that for every possible value of Pr[E] the guarantee of the\ntheorem is implied by either Observation 5.2 or Proposition 5.4. Specifically, the former happens\nwhen Pr[E] \u2264 2 \u2212 \u221a(2 \u2212 e^{\u2212\u03b3/2}) and the latter when Pr[E] \u2265 2 \u2212 \u221a(2 \u2212 e^{\u2212\u03b3/2}).\n\n5.2 Algorithm without access to \u03c4\n\nIn this section we explain how to get an algorithm which does not depend on \u03c4. Instead, STREAK\n(Algorithm 2) receives an accuracy parameter \u03b5 \u2208 (0, 1). Then, it uses \u03b5 to run several instances of\nAlgorithm 1 stored in a collection denoted by I. The algorithm maintains two variables throughout its\nexecution: m is the maximum value of a singleton set corresponding to an element that the algorithm\nalready observed, and u_m references an arbitrary element satisfying f(u_m) = m.\nThe collection I is updated as follows after each element arrival. If previously I contained an instance\nof Algorithm 1 with a given value for \u03c4, and it no longer should contain such an instance, then the\ninstance is simply removed. In contrast, if I did not contain an instance of Algorithm 1 with a given\nvalue for \u03c4, and it should now contain such an instance, then a new instance with this value for \u03c4 is\ncreated. Finally, if I contained an instance of Algorithm 1 with a given value for \u03c4, and it should\ncontinue to contain such an instance, then this instance remains in I as is.\nTheorem 5.5. 
The approximation ratio of STREAK is at least\n\n(1 \u2212 \u03b5)\u03b3 \u00b7 (3 \u2212 e^{\u2212\u03b3/2} \u2212 2\u221a(2 \u2212 e^{\u2212\u03b3/2}))/2 .\n\nThe proof of Theorem 5.5 shows that in the final collection I there is an instance of Algorithm 1\nwhose \u03c4 provides a good approximation for a\u03b3 \u00b7 f(OPT), and thus, this instance of Algorithm 1\nshould (up to some technical details) produce a good output set in accordance with Theorem 5.1.\nIt remains to analyze the space complexity and running time of STREAK. We concentrate on bounding\nthe number of elements STREAK keeps in its memory at any given time, as this amount dominates\nthe space complexity as long as we assume that the space necessary to keep an element is at least as\nlarge as the space necessary to keep each one of the numbers used by the algorithm.\nTheorem 5.6. The space complexity of STREAK is O(\u03b5^{\u22121} k log k) elements.\nThe running time of Algorithm 1 is O(N f) where, abusing notation, f is the running time of a single\noracle evaluation of f. Therefore, the running time of STREAK is O(N f \u03b5^{\u22121} log k) since it uses at\nevery given time only O(\u03b5^{\u22121} log k) instances of the former algorithm. Given multiple threads, this\ncan be improved to O(N f + \u03b5^{\u22121} log k) by running the O(\u03b5^{\u22121} log k) instances of Algorithm 1 in\nparallel.\n\n6 Experiments\n\nWe evaluate the performance of our streaming algorithm on two sparse feature selection applications.1\nFeatures are passed to all algorithms in a random order to match the setting of Section 5.\n\n(a) Performance\n\n(b) Cost\n\nFigure 1: Logistic Regression, Phishing dataset with pairwise feature products. Our algorithm is\ncomparable to LOCALSEARCH in both log likelihood and generalization accuracy, with much lower\nrunning time and number of model fits in most cases. 
Results averaged over 40 iterations, error bars\nshow 1 standard deviation.\n[Figure 1 plot axes: Log Likelihood, Generalization Accuracy, Running Time (s), Oracle Evaluations; methods: Random, Streak(0.75), Streak(0.1), LocalSearch; k = 20, 40, 80.]\n\n1 Code for these experiments is available at https://github.com/eelenberg/streak.\n\n(a) Sparse Regression\n\n(b) Interpretability\n\nFigure 2: 2(a): Logistic Regression, Phishing dataset with pairwise feature products, k = 80\nfeatures. By varying the parameter \u03b5, our algorithm captures a time-accuracy tradeoff between\nRANDOMSUBSET and LOCALSEARCH. Results averaged over 40 iterations, standard deviation\nshown with error bars. 2(b): Running times of interpretability algorithms on the Inception V3\nnetwork, N = 30, k = 5. Streaming maximization runs 10 times faster than the LIME framework.\nResults averaged over 40 total iterations using 8 example explanations, error bars show 1 standard\ndeviation.\n\n6.1 Sparse Regression with Pairwise Features\n\nIn this experiment, a sparse logistic regression is fit on 2000 training and 2000 test observations from\nthe Phishing dataset [Lichman, 2013]. This setup is known to be weakly submodular under mild data\nassumptions [Elenberg et al., 2016a]. First, the categorical features are one-hot encoded, increasing\nthe feature dimension to 68. Then, all pairwise products are added for a total of N = 4692 features.\nTo reduce computational cost, feature products are generated and added to the stream on-the-fly as\nneeded. We compare with 2 other algorithms. RANDOMSUBSET selects the first k features from\nthe random stream. 
LOCALSEARCH first fills a buffer with the first k features, and then swaps each\nincoming feature with the feature from the buffer which yields the largest nonnegative improvement.\nFigure 1(a) shows both the final log likelihood and the generalization accuracy for RANDOMSUBSET,\nLOCALSEARCH, and our STREAK algorithm for \u03b5 \u2208 {0.75, 0.1} and k \u2208 {20, 40, 80}. As expected,\nthe RANDOMSUBSET algorithm has much larger variation since its performance depends highly on\nthe random stream order. It also performs significantly worse than LOCALSEARCH for both metrics,\nwhereas STREAK is comparable for most parameter choices. Figure 1(b) shows two measures of\ncomputational cost: running time and the number of oracle evaluations (regression fits). We note\nSTREAK scales better as k increases; for example, STREAK with k = 80 and \u03b5 = 0.1 (\u03b5 = 0.75)\nruns in about 70% (5%) of the time it takes to run LOCALSEARCH with k = 40. Interestingly, our\nspeedups are more substantial with respect to running time. In some cases STREAK actually fits\nmore regressions than LOCALSEARCH, but still manages to be faster. We attribute this to the fact\nthat nearly all of LOCALSEARCH\u2019s regressions involve k features, which are slower than many of\nthe small regressions called by STREAK.\nFigure 2(a) shows the final log likelihood versus running time for k = 80 and \u03b5 \u2208 [0.05, 0.75]. By\nvarying the precision \u03b5, we achieve a gradual tradeoff between speed and performance. This shows\nthat STREAK can reduce the running time by over an order of magnitude with minimal impact on the\nfinal log likelihood.\n\n6.2 Black-Box Interpretability\n\nOur next application is interpreting the predictions of black-box machine learning models. Specifically,\nwe begin with the Inception V3 deep neural network [Szegedy et al., 2016] trained on ImageNet. We\nuse this network for the task of classifying 5 types of flowers via transfer learning. 
This is done by\nadding a final softmax layer and retraining the network.\nWe compare our approach to the LIME framework [Ribeiro et al., 2016] for developing sparse,\ninterpretable explanations. The final step of LIME is to fit a k-sparse linear regression in the space of\ninterpretable features. Here, the features are superpixels determined by the SLIC image segmentation\nalgorithm [Achanta et al., 2012] (regions from any other segmentation would also suffice). The\nnumber of superpixels is bounded by N = 30. After a feature selection step, a final regression is\nperformed on only the selected features. The following feature selection methods are supplied by\nLIME: 1. Highest Weights: fits a full regression and keeps the k features with largest coefficients. 2.\nForward Selection: standard greedy forward selection. 3. Lasso: \u2113_1 regularization.\n[Figure 2 plot axes: (a) Log Likelihood versus Running Time (s) for Random, Streak(0.75), Streak(0.5), Streak(0.2), Streak(0.1), Streak(0.05), LocalSearch; (b) Running Time (s) for LIME + Max Wts, LIME + FS, LIME + Lasso, Streak.]\nWe introduce a novel method for black-box interpretability that is similar to but simpler than LIME.\nAs before, we segment an image into N superpixels. Then, for a subset S of those regions we can\ncreate a new image that contains only these regions and feed this into the black-box classifier. For a\ngiven model M, an input image I, and a label L1 we ask for an explanation: why did model M label\nimage I with label L1? We propose the following solution to this problem. Consider the set function\nf(S) giving the likelihood that image I(S) has label L1. We approximately solve\n\nmax_{|S| \u2264 k} f(S),\n\nusing STREAK. Intuitively, we are limiting the number of superpixels to k so that the output will\ninclude only the most important superpixels, and thus, will represent an interpretable explanation. 
In\nour experiments we set k = 5.\nNote that the set function f(S) depends on the black-box classifier and is neither monotone nor\nsubmodular in general. Still, we find that the greedy maximization algorithm produces very good\nexplanations for the flower classifier as shown in Figure 3 and the additional experiments in the\nSupplementary Material. Figure 2(b) shows that our algorithm is much faster than the LIME approach.\nThis is primarily because LIME relies on generating and classifying a large set of randomly perturbed\nexample images.\n\n7 Conclusions\n\nWe propose STREAK, the first streaming algorithm for maximizing weakly submodular functions,\nand prove that it achieves a constant factor approximation assuming a random stream order. This\nis useful when the set function is not submodular and, additionally, takes a long time to evaluate or\nhas a very large ground set. Conversely, we show that under a worst case stream order no algorithm\nwith memory sublinear in the ground set size has a constant factor approximation. We formulate\ninterpretability of black-box neural networks as set function maximization, and show that STREAK\nprovides interpretable explanations faster than previous approaches. We also show experimentally\nthat STREAK trades off accuracy and running time in nonlinear sparse regression.\nOne interesting direction for future work is to tighten the bounds of Theorems 5.1 and 5.5, which\nare nontrivial but somewhat loose. For example, there is a gap between the theoretical guarantee\nof the state-of-the-art algorithm for submodular functions and our bound for \u03b3 = 1. However, as\nour algorithm performs the same computation as that state-of-the-art algorithm when the function\nis submodular, this gap is solely an analysis issue. 
Hence, the real theoretical performance of our\nalgorithm is better than what we have been able to prove in Section 5.\n\n8 Acknowledgments\n\nThis research has been supported by NSF Grants CCF 1344364, 1407278, 1422549, 1618689, ARO\nYIP W911NF-14-1-0258, ISF Grant 1357/16, Google Faculty Research Award, and DARPA Young\nFaculty Award (D16AP00046).\n\n(a)\n\n(b)\n\n(c)\n\n(d)\n\nFigure 3: Comparison of interpretability algorithms for the Inception V3 deep neural network. We\nhave used transfer learning to extract features from Inception and train a flower classifier. In these\nfour input images the flower types were correctly classified (from (a) to (d): rose, sunflower, daisy,\nand daisy). We ask the question of interpretability: why did this model classify this image as rose?\nWe are using our framework (and the recent prior work LIME [Ribeiro et al., 2016]) to see which\nparts of the image the neural network is looking at for these classification tasks. As can be seen,\nSTREAK correctly identifies the flower parts of the images while some LIME variations do not. More\nimportantly, STREAK is creating subsampled images on-the-fly, and hence, runs approximately 10\ntimes faster. Since interpretability tasks perform multiple calls to the black-box model, the running\ntimes can be quite significant.\n\nReferences\nRadhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal Fua, and Sabine S\u00fcsstrunk.\nSLIC Superpixels Compared to State-of-the-art Superpixel Methods. IEEE Transactions on Pattern\nAnalysis and Machine Intelligence, 34(11):2274\u20132282, 2012.\n\nJason Altschuler, Aditya Bhaskara, Gang (Thomas) Fu, Vahab Mirrokni, Afshin Rostamizadeh,\nand Morteza Zadimoghaddam. Greedy Column Subset Selection: New Bounds and Distributed\nAlgorithms. In ICML, pages 2539\u20132548, 2016.\n\nFrancis R. Bach. Learning with Submodular Functions: A Convex Optimization Perspective. 
Foun-\n\ndations and Trends in Machine Learning, 6, 2013.\n\nAshwinkumar Badanidiyuru, Baharan Mirzasoleiman, Amin Karbasi, and Andreas Krause. Streaming\nSubmodular Maximization: Massive Data Summarization on the Fly. In KDD, pages 671\u2013680,\n2014.\n\nSohail Bahmani, Bhiksha Raj, and Petros T. Boufounos. Greedy Sparsity-Constrained Optimization.\n\nJournal of Machine Learning Research, 14:807\u2013841, 2013.\n\nRafael da Ponte Barbosa, Alina Ene, Huy L. Nguyen, and Justin Ward. The Power of Randomization:\nDistributed Submodular Maximization on Massive Datasets. In ICML, pages 1236\u20131244, 2015.\nRafael da Ponte Barbosa, Alina Ene, Huy L. Nguyen, and Justin Ward. A New Framework for\n\nDistributed Submodular Maximization. In FOCS, pages 645\u2013654, 2016.\n\nAndrew An Bian, Baharan Mirzasoleiman, Joachim M. Buhmann, and Andreas Krause. Guaranteed\nNon-convex Optimization: Submodular Maximization over Continuous Domains. In AISTATS,\npages 111\u2013120, 2017.\n\nNiv Buchbinder and Moran Feldman. Deterministic Algorithms for Submodular Maximization\n\nProblems. In SODA, pages 392\u2013403, 2016a.\n\nNiv Buchbinder and Moran Feldman. Constrained Submodular Maximization via a Non-symmetric\n\nTechnique. CoRR, abs/1611.03253, 2016b. URL http://arxiv.org/abs/1611.03253.\n\nNiv Buchbinder, Moran Feldman, and Roy Schwartz. Online Submodular Maximization with\n\nPreemption. In SODA, pages 1202\u20131216, 2015.\n\nGruia C\u02d8alinescu, Chandra Chekuri, Martin P\u00e1l, and Jan Vondr\u00e1k. Maximizing a Monotone Submodu-\n\nlar Function Subject to a Matroid Constraint. SIAM J. Comput., 40(6):1740\u20131766, 2011.\n\nT-H. Hubert Chan, Zhiyi Huang, Shaofeng H.-C. Jiang, Ning Kang, and Zhihao Gavin Tang. Online\nSubmodular Maximization with Free Disposal: Randomization Beats 1/4 for Partition Matroids. In\nSODA, pages 1204\u20131223, 2017.\n\nChandra Chekuri, Shalmoli Gupta, and Kent Quanrud. Streaming Algorithms for Submodular\n\nFunction Maximization. 
In ICALP, pages 318\u2013330, 2015.\n\nMichele Conforti and G\u00e9rard Cornu\u00e9jols. Submodular set functions, matroids and the greedy\nalgorithm: Tight worst-case bounds and some generalizations of the Rado-Edmonds theorem.\nDiscrete Applied Mathematics, 7(3):251\u2013274, March 1984.\n\nAbhimanyu Das and David Kempe. Submodular meets Spectral: Greedy Algorithms for Subset\nSelection, Sparse Approximation and Dictionary Selection. In ICML, pages 1057\u20131064, 2011.\n\nEthan R. Elenberg, Rajiv Khanna, Alexandros G. Dimakis, and Sahand Negahban. Restricted\nStrong Convexity Implies Weak Submodularity. CoRR, abs/1612.00804, 2016a. URL\nhttp://arxiv.org/abs/1612.00804.\n\nEthan R. Elenberg, Rajiv Khanna, Alexandros G. Dimakis, and Sahand Negahban. Restricted Strong\nConvexity Implies Weak Submodularity. In NIPS Workshop on Learning in High Dimensions with\nStructure, 2016b.\n\nUriel Feige. A Threshold of ln n for Approximating Set Cover. Journal of the ACM (JACM),\n45(4):634\u2013652, 1998.\n\nMarshall L. Fisher, George L. Nemhauser, and Laurence A. Wolsey. An analysis of approximations\nfor maximizing submodular set functions\u2013II. In M. L. Balinski and A. J. Hoffman, editors,\nPolyhedral Combinatorics: Dedicated to the memory of D.R. Fulkerson, pages 73\u201387. Springer\nBerlin Heidelberg, Berlin, Heidelberg, 1978.\n\nAvinatan Hassidim and Yaron Singer. Submodular Optimization Under Noise. In COLT, pages\n1069\u20131122, 2017.\n\nSteven C. H. Hoi, Rong Jin, Jianke Zhu, and Michael R. Lyu. Batch Mode Active Learning and its\nApplication to Medical Image Classification. In ICML, pages 417\u2013424, 2006.\n\nThibaut Horel and Yaron Singer. Maximization of Approximately Submodular Functions. In NIPS,\n2016.\n\nRajiv Khanna, Ethan R. Elenberg, Alexandros G. Dimakis, Joydeep Ghosh, and Sahand Negahban.\nOn Approximation Guarantees for Greedy Low Rank Optimization. In ICML, pages 1837\u20131846,\n2017a.\n\nRajiv Khanna, Ethan R. 
Elenberg, Alexandros G. Dimakis, Sahand Negahban, and Joydeep Ghosh.\nScalable Greedy Support Selection via Weak Submodularity. In AISTATS, pages 1560\u20131568,\n2017b.\n\nAndreas Krause and Volkan Cevher. Submodular Dictionary Selection for Sparse Representation. In\nICML, pages 567\u2013574, 2010.\n\nAndreas Krause and Daniel Golovin. Submodular Function Maximization. Tractability: Practical\nApproaches to Hard Problems, 3:71\u2013104, 2014.\n\nMoshe Lichman. UCI machine learning repository, 2013. URL\nhttp://archive.ics.uci.edu/ml.\n\nBaharan Mirzasoleiman, Amin Karbasi, Rik Sarkar, and Andreas Krause. Distributed Submodular\nMaximization: Identifying Representative Elements in Massive Data. In NIPS, pages 2049\u20132057,\n2013.\n\nBaharan Mirzasoleiman, Ashwinkumar Badanidiyuru, Amin Karbasi, Jan Vondr\u00e1k, and Andreas\nKrause. Lazier Than Lazy Greedy. In AAAI, pages 1812\u20131818, 2015.\n\nGeorge L. Nemhauser and Laurence A. Wolsey. Best Algorithms for Approximating the Maximum\nof a Submodular Set Function. Math. Oper. Res., 3(3):177\u2013188, August 1978.\n\nGeorge L. Nemhauser, Laurence A. Wolsey, and Marshall L. Fisher. An analysis of approximations\nfor maximizing submodular set functions\u2013I. Mathematical Programming, 14(1):265\u2013294, 1978.\n\nXinghao Pan, Stefanie Jegelka, Joseph E. Gonzalez, Joseph K. Bradley, and Michael I. Jordan.\nParallel Double Greedy Submodular Maximization. In NIPS, pages 118\u2013126, 2014.\n\nMarco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. \u201cWhy Should I Trust You?\u201d Explaining\nthe Predictions of Any Classifier. In KDD, pages 1135\u20131144, 2016.\n\nMukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic Attribution for Deep Networks. In\nICML, pages 3319\u20133328, 2017.\n\nMaxim Sviridenko, Jan Vondr\u00e1k, and Justin Ward. Optimal approximation for submodular and\nsupermodular optimization with bounded curvature. 
In SODA, pages 1134\u20131148, 2015.\n\nChristian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking\nthe Inception Architecture for Computer Vision. In CVPR, pages 2818\u20132826, 2016.\n\nJan Vondr\u00e1k. Submodularity and curvature: the optimal algorithm. RIMS K\u00f4ky\u00fbroku Bessatsu B23,\npages 253\u2013266, 2010.\n\nKai Wei, Rishabh Iyer, and Jeff Bilmes. Submodularity in Data Subset Selection and Active Learning.\nIn ICML, pages 1954\u20131963, 2015.\n\nZhuoran Yang, Zhaoran Wang, Han Liu, Yonina C. Eldar, and Tong Zhang. Sparse Nonlinear\nRegression: Parameter Estimation and Asymptotic Inference. In ICML, pages 2472\u20132481, 2016.\n", "award": [], "sourceid": 2145, "authors": [{"given_name": "Ethan", "family_name": "Elenberg", "institution": "University of Texas at Austin"}, {"given_name": "Alexandros", "family_name": "Dimakis", "institution": "University of Texas, Austin"}, {"given_name": "Moran", "family_name": "Feldman", "institution": "Open University of Israel"}, {"given_name": "Amin", "family_name": "Karbasi", "institution": "Yale"}]}