{"title": "Deep Submodular Functions: Definitions and Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 3404, "page_last": 3412, "abstract": "We propose and study a new class of submodular functions called deep submodular functions (DSFs). We define DSFs and situate them within the broader context of classes of submodular functions in relationship both to various matroid ranks and sums of concave composed with modular functions (SCMs). Notably, we find that DSFs constitute a strictly broader class than SCMs, thus motivating their use, but that they do not comprise all submodular functions. Interestingly, some DSFs can be seen as special cases of certain deep neural networks (DNNs), hence the name. Finally, we provide a method to learn DSFs in a max-margin framework, and offer preliminary results applying this both to synthetic and real-world data instances.", "full_text": "Deep Submodular Functions: De\ufb01nitions & Learning\n\nBrian Dolhansky\u2021 \nDept. of Computer Science and Engineering\u2021\n\nUniversity of Washington\n\nSeattle, WA 98105\n\nJeff Bilmes\u2020\u2021 \n\nDept. of Electrical Engineering\u2020\n\nUniversity of Washington\n\nSeattle, WA 98105\n\nAbstract\n\nWe propose and study a new class of submodular functions called deep submodular\nfunctions (DSFs). We de\ufb01ne DSFs and situate them within the broader context of\nclasses of submodular functions in relationship both to various matroid ranks and\nsums of concave composed with modular functions (SCMs). Notably, we \ufb01nd that\nDSFs constitute a strictly broader class than SCMs, thus motivating their use, but\nthat they do not comprise all submodular functions. 
Interestingly, some DSFs can be seen as special cases of certain deep neural networks (DNNs), hence the name. Finally, we provide a method to learn DSFs in a max-margin framework, and offer preliminary results applying this both to synthetic and real-world data instances.

1 Introduction

Submodular functions are attractive models of many physical processes primarily because they possess an inherent naturalness to a wide variety of problems (e.g., they are good models of diversity, information, and cooperative costs) while at the same time they enjoy properties sufficient for efficient optimization. For example, submodular functions can be minimized without constraints in polynomial time [12] even though they lie within a 2^n-dimensional cone in R^{2^n}. Moreover, while submodular function maximization is NP-hard, submodular maximization is one of the easiest of the NP-hard problems since constant factor approximation algorithms are often available — e.g., in the cardinality constrained case, the classic 1 − 1/e result of Nemhauser [21] via the greedy algorithm. Other problems also have guarantees, such as submodular maximization subject to knapsack or multiple matroid constraints [8, 7, 18, 15, 16].
One of the critical problems associated with utilizing submodular functions in machine learning contexts is selecting which submodular function to use, and given that submodular functions lie in such a vast space with 2^n degrees of freedom, it is a non-trivial task to find one that works well, if not optimally. One approach is to attempt to learn the submodular function based either on queries of some form or on data. This has led to results, mostly in the theory community, showing how learning submodularity can be harder or easier depending on how we judge what is being learnt.
For example, it was shown that learning submodularity in the PMAC setting is fairly hard [2], although in some cases things are a bit easier [11]. In both of these cases, learning is over all points in the hypercube. Learning can be made easier if we restrict ourselves to learn within only a subfamily of submodular functions. For example, in [24, 19], it is shown that one can learn mixtures of submodular functions using a max-margin learning framework — here the components of the mixture are fixed and it is only the mixture parameters that are learnt, leading often to a convex optimization problem. In some cases, computing gradients of the convex problem can be done using submodular maximization [19], while in other cases, even a gradient requires minimizing a difference of two submodular functions [27].
Learning over restricted families rather than over the entire cone is desirable for the same reasons that any form of regularization in machine learning is useful. By restricting the family over which learning occurs, we decrease the complexity of the learning problem, thereby increasing the chance that one finds a good model within that family. This can be seen as a classic bias-variance tradeoff, where

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Figure 1: Left: A layered DSF with K = 3 layers. Right: a 3-block DSF allowing layer skipping.

increasing bias can reduce variance. Up to now, learning over restricted families has apparently (to the authors' knowledge) been limited to learning mixtures over fixed components.
This can be quite\nlimited if the components are restricted, and if not might require a very large number of components.\nTherefore, there is a need for a richer and more \ufb02exible family of submodular functions over which\nlearning is still possible.\nIn this paper (and in [5]), we introduce a new family of submodular functions that we term \u201cdeep\nsubmodular functions,\u201d or DSFs. DSFs strictly generalize, as we show below, many of the kinds\nof submodular functions that are useful in machine learning contexts. These include the so-called\n\u201cdecomposable\u201d submodular functions, namely those that can be represented as a sum of concave\ncomposed with modular functions [25].\nWe describe the family of DSFs and place them in the context of the general submodular family. In\nparticular, we show that DSFs strictly generalize standard decomposable functions, thus theoretically\nmotivating the use of deeper networks as a family over which to learn. Moreover, DSFs can represent\na variety of complex submodular functions such as laminar matroid rank functions. These matroid\nrank functions include the truncated matroid rank function [13] that is often used to show theoretical\nworst-case performance for many constrained submodular minimization problems. We also show,\nsomewhat surprisingly, that like decomposable functions, DSFs are unable to represent all possible\ncycle matroid rank functions. This is interesting in and of itself since there are laminar matroids\nthat are not cycle matroids. On the other hand, we show that the more general DSFs share a variety\nof useful properties with decomposable functions. 
Namely, that they: (1) can leverage the vast amount of practical work on feature engineering that occurs in the machine learning community and its applications; (2) can operate on multi-modal data if the data can be featurized in the same space; (3) allow for training and testing on distinct sets since we can learn a function from the feature representation level on up, similar to the work in [19]; and (4) are useful for streaming applications since functions can be evaluated without requiring knowledge of the entire ground set. These advantages are made apparent in Section 2.
Interestingly, DSFs also share certain properties with deep neural networks (DNNs), which have become widely popular in the machine learning community. For example, DNNs with weights that are strictly non-negative correspond to DSFs. This suggests, as we show in Section 5, that it is possible to develop a learning framework over DSFs leveraging DNN learning frameworks. Unlike standard deep neural networks, however, which typically are trained either in classification or regression frameworks, learning submodularity often takes the form of trying to adjust the parameters so that a set of "summary" data sets are offered a high value. We therefore extend the max-margin learning framework of [24, 19] to apply to DSFs. Our approach can be seen as a max-margin learning approach for DNNs but restricted to DSFs. We show that DSFs can be learnt effectively in a variety of contexts (Section 6). In the below, we discuss basic definitions and an initial implementation of learning DSFs, while in [5] we provide complete definitions, properties, relationships to concavity, proofs, and a set of applications.

2 Background

Submodular functions are discrete set functions that have the property of diminishing returns.
Assume a given finite size-n set of objects V (the large ground set of data items), where each v ∈ V is a distinct data sample (e.g., a sentence, training pair, image, video, or even a highly structured object such as a tree or a graph). A valuation set function f : 2^V → R that returns a real value for any subset X ⊆ V is said to be submodular if for all X ⊆ Y and v ∉ Y the following inequality holds: f(X ∪ {v}) − f(X) ≥ f(Y ∪ {v}) − f(Y). This means that the incremental value (or gain) of adding another sample v to a subset decreases when the context in which v is considered grows from X to Y. We can define the gain of v in the context of X as f(v|X) ≜ f(X ∪ {v}) − f(X). Thus, f is submodular if f(v|X) ≥ f(v|Y). If the gain of v is identical for all different contexts, i.e., f(v|X) = f(v|Y) for all X, Y ⊆ V, then the function is said to be modular. A function might also have the property of being normalized (f(∅) = 0) and monotone non-decreasing (f(X) ≤ f(Y) whenever X ⊆ Y). If the negation of f, −f, is submodular, then f is called supermodular.
A useful class of submodular functions in machine learning is the class of decomposable functions [25], and one example of useful instances of these for applications are called feature-based functions. Given a set of non-negative monotone non-decreasing normalized (φ(0) = 0) concave functions φ_i : R_+ → R_+ and a corresponding set of non-negative modular functions m_i : V → R_+, the function f : 2^V → R_+ defined as f(X) = ∑_i φ_i(m_i(X)) = ∑_i φ_i(∑_{x∈X} m_i(x)) is known to be submodular.
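As a concrete illustration of these definitions (a minimal sketch with made-up weights, not from the paper), the following checks the diminishing-returns inequality for a small sum-of-concave-over-modular function:

```python
import math

# Hypothetical toy instance: ground set of 4 items and two concave-over-modular
# terms; the weights below are illustrative assumptions, not from the paper.
V = [0, 1, 2, 3]
m = [
    {0: 1.0, 1: 2.0, 2: 0.5, 3: 1.5},   # modular function m_1
    {0: 0.3, 1: 0.0, 2: 2.0, 3: 1.0},   # modular function m_2
]

def f(X):
    """Sum of a concave function (sqrt) composed with modular functions."""
    return sum(math.sqrt(sum(mi[x] for x in X)) for mi in m)

def gain(v, X):
    """f(v | X) = f(X + v) - f(X)."""
    return f(set(X) | {v}) - f(X)

# Diminishing returns: for X <= Y and v not in Y, f(v|X) >= f(v|Y).
X, Y, v = {0}, {0, 1, 2}, 3
assert gain(v, X) >= gain(v, Y)
```

Because sqrt is concave and normalized, f here is also normalized (f(∅) = 0) and monotone non-decreasing.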
Such functions have been called "decomposable" in the past, but in this work we will refer to them as the family of sums of concave over modular functions (SCMs). SCMs have been shown to be quite flexible [25], being able to represent a diverse set of functions such as graph cuts, set cover functions, and multiclass queuing system functions, and they yield efficient algorithms for minimization [25, 22]. Such functions are useful also for applications involving maximization. Suppose that each element v ∈ V is associated with a set of "features" U in the sense of, say, how TF-IDF is used in natural language processing (NLP). Feature-based submodular functions are those defined via the set of features, taking the form f(X) = ∑_{u∈U} w_u φ_u(m_u(X)), where φ_u is a non-decreasing non-negative univariate normalized concave function, m_u(X) is a feature-specific non-negative modular function, and w_u is a non-negative feature weight. The result is the class of feature-based submodular functions (instances of SCMs). Such functions have been successfully used for data summarization [29].
Another advantage of such functions is that they do not require the construction of a pairwise graph and therefore do not have quadratic cost as would, say, a facility location function (e.g., f(X) = ∑_{v∈V} max_{x∈X} w_{xv}), or any function based on pair-wise distances, all of which have cost O(n²) to evaluate. Feature functions have an evaluation cost of O(n|U|), linear in the ground set size, and therefore are more scalable to large data set sizes. Finally, unlike the facility location and other graph-based functions, feature-based functions do not require the use of the entire ground set for each evaluation and hence are appropriate for streaming algorithms [1, 9] where future ground elements are unavailable. Defining ψ : R^n → R as ψ(x) = ∑_i φ_i(⟨m_i, x⟩), we get a monotone non-decreasing concave function, which we refer to as a univariate sum of concaves (USC).

3 Deep Submodular Functions

While feature-based submodular functions are indisputably useful, their weakness lies in that features themselves may not interact, although one feature u′ might be partially redundant with another feature u′′. For example, when describing a sentence via its component n-gram features, higher-order n-grams always include lower-order n-grams, so n-gram features are partially redundant. We may address this problem by utilizing an additional "layer" of nested concave functions as in f(X) = ∑_{s∈S} ω_s φ_s(∑_{u∈U} w_{s,u} φ_u(m_u(X))), where S is a set of meta-features, ω_s is a meta-feature weight, φ_s is a non-decreasing concave function associated with meta-feature s, and w_{s,u} is now a meta-feature-specific feature weight. With this construct, φ_s assigns a discounted value to the set of features in U, which can be used to represent feature redundancy. Interactions between the meta-features might be needed as well, and this can be done via meta-meta-features, and so on, resulting in a hierarchy of increasingly higher-level features.
We hence propose a new class of submodular functions that we call deep submodular functions (DSFs). They may make use of a series of disjoint sets (see Figure 1-(a)): V = V^(0), which is the function's ground set, and additional sets V^(1), V^(2), . . . , V^(K). U = V^(1) can be seen as a set of "features", V^(2) as a set of meta-features, V^(3) as a set of meta-meta-features, etc., up to V^(K). The size of V^(i) is d_i = |V^(i)|.
Two successive sets (or "layers") i − 1 and i are connected by a matrix w^(i) ∈ R_+^{d_i × d_{i−1}}, for i ∈ {1, . . . , K}. Given v_i ∈ V^(i), define w^(i)_{v_i} to be the row of w^(i) corresponding to element v_i, so that w^(i)_{v_i}(v_{i−1}) is the element of matrix w^(i) at row v_i and column v_{i−1}. We may think of w^(i)_{v_i} : V^(i−1) → R_+ as a modular function defined on the set V^(i−1). Thus, this matrix contains d_i such modular functions. Further, let φ_{v_k} : R_+ → R_+ be a non-negative non-decreasing concave function. Then, a K-layer DSF f : 2^V → R_+ can be expressed as follows, for any A ⊆ V:

f(A) = φ_{v_K}( ∑_{v_{K−1} ∈ V^(K−1)} w^(K)_{v_K}(v_{K−1}) φ_{v_{K−1}}( · · · ∑_{v_2 ∈ V^(2)} w^(3)_{v_3}(v_2) φ_{v_2}( ∑_{v_1 ∈ V^(1)} w^(2)_{v_2}(v_1) φ_{v_1}( ∑_{a ∈ A} w^(1)_{v_1}(a) ) ) · · · ) )    (1)

Submodularity follows since the composition g(·) = φ(f(·)) of a monotone non-decreasing submodular function f with a monotone non-decreasing concave function φ is submodular (Theorem 1 in [20]) — a DSF is submodular via recursive application and since submodularity is closed under conic combinations.
A more general way to define a DSF (useful for the theorems below) uses recursion directly. We are given a directed acyclic graph (DAG) G = (𝒱, E) where, for any given node v ∈ 𝒱, we say pa(v) ⊂ 𝒱 are the parents of (vertices pointing towards) v. A given size-n subset of nodes V ⊂ 𝒱 corresponds to the ground set, and for any v ∈ V, pa(v) = ∅. A particular "root" node r ∈ 𝒱 \ V has the distinction that r ∉ pa(q) for any q ∈ 𝒱.
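For intuition, Equation (1) can be evaluated layer by layer as a forward pass over the characteristic vector of A. The following minimal sketch is an illustration only: the layer sizes and weights are assumptions, and every unit uses √· as its concave function for brevity:

```python
import math

# w[i] is the d_i x d_{i-1} non-negative matrix connecting layer i-1 to i.
# Here: 4 ground elements -> 2 "features" -> 1 final (root) unit.
layers = [
    [[1.0, 2.0, 0.0, 0.5],      # w(1): 2 features over 4 ground elements
     [0.0, 1.0, 1.5, 1.0]],
    [[1.0, 0.5]],               # w(2): 1 root unit over the 2 features
]
phi = math.sqrt                 # one concave function for every unit

def dsf(A):
    # x starts as the characteristic vector 1_A over the ground set.
    x = [1.0 if v in A else 0.0 for v in range(4)]
    for w in layers:
        # each unit v_i computes phi(<w_{v_i}, x>) over the previous layer
        x = [phi(sum(wv[j] * x[j] for j in range(len(x)))) for wv in w]
    return x[0]  # output of the single root unit

assert dsf(set()) == 0.0                          # normalized
assert dsf({0, 1}) <= dsf({0}) + dsf({1})         # subadditivity
```

With non-negative weights and concave non-decreasing activations, the same forward pass also satisfies the diminishing-returns inequality on all pairs of nested sets.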
Given a non-ground node v ∈ 𝒱 \ V, we define the concave function ψ_v : R^V → R_+ as

ψ_v(x) = φ_v( ∑_{u ∈ pa(v)\V} w_{uv} ψ_u(x) + ⟨m_v, x⟩ )    (2)

where φ_v : R_+ → R_+ is a non-decreasing univariate concave function, w_{uv} ∈ R_+, and m_v : R^{pa(v)∩V} → R_+ is a non-negative linear function that evaluates as ⟨m_v, x⟩ = ∑_{u∈pa(v)∩V} m_v(u)x(u) (i.e., ⟨m_v, x⟩ is a sparse dot-product over elements pa(v) ∩ V ⊆ V). The base case, where pa(v) ⊆ V, therefore has ψ_v(x) = φ_v(⟨m_v, x⟩), so ψ_v(1_A) is an SCM function with only one term in the sum (1_A is the characteristic vector of set A). A general DSF is defined as follows: for all A ⊆ V, f(A) = ψ_r(1_A) + m_±(A), where m_± : V → R is an arbitrary modular function (i.e., it may include positive and negative elements). From the perspective of defining a submodular function, there is no loss of generality in adding the final modular function m_± to a monotone non-decreasing function — this is because any submodular function can be expressed as a sum of a monotone non-decreasing submodular function and a modular function [10]. This form of DSF is more general than the layered approach mentioned above which, in the current form, would partition 𝒱 = {V^(0), V^(1), . . . , V^(K)} into layers, and where for any v ∈ V^(i), pa(v) ⊆ V^(i−1). Figure 1-(a) corresponds to a layered graph G = (𝒱, E) where r = v^3 and V = {v^0_1, v^0_2, . . . , v^0_6}. Figure 1-(b) uses the same partitioning but where units are allowed to skip by more than one layer at a time. More generally, we can order the vertices in 𝒱 with an order σ so that {σ_1, σ_2, . . . , σ_m} = 𝒱 where the first n = |V| elements constitute the ground set, σ_m = r where m = |𝒱|, and where σ_i ∈ pa(σ_j) only if i < j.
This allows an arbitrary pattern of skipping while maintaining submodularity.
The layered definition in Equation (1) is reminiscent of feed-forward deep neural networks (DNNs) owing to its multi-layered architecture. Interestingly, if one restricts the weights of a DNN at every layer to be non-negative, then for many standard hidden-unit activation functions the DNN constitutes a submodular function when given Boolean input vectors. The result follows for any activation function that is monotone non-decreasing and concave on the non-negative reals, such as the sigmoid, the hyperbolic tangent, and the rectified linear function. This suggests that DSFs can be trained in a fashion similar to DNNs, as is further developed in Section 5. The recursive definition of DSFs in Equation (2) is more general and, moreover, useful for the analysis in Section 4.
DSFs should be useful for many applications in machine learning. First, they retain the advantages of feature-based functions (i.e., they require neither O(n²) cost nor access to the entire ground set for evaluation). Hence, DSFs can be both fast and useful for streaming applications. Second, they allow for a nested hierarchy of features, similar to the advantages a deep model has over a shallow model. For example, a one-layer DSF must construct a valuation over a set of objects from a large number of low-level features, which can lead to fewer opportunities for feature sharing, while a deeper network fosters distributed representations, analogous to DNNs [3, 4]. Below, we show that DSFs constitute a strictly larger family of submodular functions than SCMs.

4 Analysis of the DSF family

DSFs represent a family that, at the very least, contains the family of SCMs. We argued intuitively that DSFs might extend SCMs as they allow components themselves to interact, and the interactions may propagate up a many-layered hierarchy.
In this section, we formally place DSFs within the context of more general submodular functions. We show that DSFs strictly generalize SCMs while preserving many of their attractive attributes. We summarize the results of this section in Figure 2, which also includes familial relationships amongst other classes of submodular functions (e.g., various matroid rank functions), useful for our main theorems.

Figure 2: Containment properties of the set of functions studied in this paper.

Thanks to concave composition closure rules [6], the root function ψ_r(x) : R^n → R in Eqn. (2) is a monotone non-decreasing multivariate concave function that, by the concave-submodular composition rule [20], yields a submodular function ψ_r(1_A). It is widely known that any univariate concave function composed with a non-negative modular function yields a submodular function. However, given an arbitrary multivariate concave function this is not the case. Consider, for example, any concave function ψ over R² that offers the following evaluations: ψ(0, 0) = ψ(1, 1) = 1, ψ(0, 1) = ψ(1, 0) = 0. Then f(A) = ψ(1_A) is not submodular. Given a multivariate concave function ψ : R^n → R, the superdifferential ∂ψ(x) at x is defined as ∂ψ(x) = {h ∈ R^n : ψ(y) − ψ(x) ≤ ⟨h, y⟩ − ⟨h, x⟩, ∀y ∈ R^n}, and a particular supergradient h_x is a member h_x ∈ ∂ψ(x). If ψ(x) is differentiable at x then ∂ψ(x) = {∇ψ(x)}. A concave function is said to have an antitone superdifferential if for all x ≤ y we have that h_x ≥ h_y for all h_x ∈ ∂ψ(x) and h_y ∈ ∂ψ(y) (here, x ≤ y ⇔ x(v) ≤ y(v) ∀v). Apparently, the following result has not been previously reported.
Theorem 4.1. Let ψ : R^n → R be a concave function.
If ψ has an antitone superdifferential, then the set function f : 2^V → R defined as f(A) = ψ(1_A) for all A ⊆ V is submodular.
A DSF's associated concave function has an antitone superdifferential. Concavity is not necessary in general (e.g., multilinear extensions of submodular functions are not concave but have properties analogous to an antitone superdifferential; see [5]).
Lemma 4.3. Composition of monotone non-decreasing scalar concave and antitone superdifferential concave functions, and conic combinations thereof, preserves superdifferential antitonicity.
Corollary 4.3.1. The concave function ψ_r associated with a DSF has an antitone superdifferential.
A matroid M [12] is a set system M = (V, I) where I = {I1, I2, . . .} is a set of subsets Ii ⊆ V that are called independent. A matroid has the property that ∅ ∈ I, that I is subclusive (i.e., given I ∈ I and I′ ⊂ I then I′ ∈ I), and that all maximally independent sets have the same size (i.e., given A, B ∈ I with |A| < |B|, there exists a b ∈ B \ A such that A + b ∈ I). The rank of a matroid, a set function r : 2^V → Z_+ defined as r(A) = max_{I∈I} |I ∩ A|, is a powerful class of submodular functions. All matroids are uniquely defined by their rank function. All monotone non-decreasing non-negative rational submodular functions can be represented by grouping and then evaluating grouped ground elements in a matroid [12].
A particularly useful matroid is the partition matroid, where a partition (V1, V2, . . . , Vℓ) of V is formed, along with a set of capacities k1, k2, . . . , kℓ ∈ Z_+. Its rank function is defined as r(X) = ∑_{i=1}^{ℓ} min(|X ∩ Vi|, ki) and, therefore, is an SCM, owing to the fact that φ(x) = min(⟨x, 1_{Vi}⟩, ki) is a USC.
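A minimal sketch of this partition matroid rank (with illustrative blocks and capacities, chosen here only for the example) makes the truncation structure explicit:

```python
# Partition of a 5-element ground set into blocks V_i with capacities k_i.
# The blocks and capacities are illustrative assumptions.
blocks = [({0, 1, 2}, 1), ({3, 4}, 2)]   # pairs (V_i, k_i)

def partition_rank(X):
    """r(X) = sum_i min(|X & V_i|, k_i): a sum of truncations, hence an SCM."""
    return sum(min(len(X & Vi), ki) for Vi, ki in blocks)

assert partition_rank({0, 1, 2}) == 1    # truncated by k_1 = 1
assert partition_rank({0, 3, 4}) == 3    # 1 from block 1, 2 from block 2
```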
A cycle matroid is a different type of matroid based on a graph G = (V, E) where the rank function r(A) for A ⊆ E is defined as the size of the maximum spanning forest (i.e., a spanning tree for each connected component) in the edge-induced subgraph G_A = (V, A). From the perspective of matroids, we can consider classes of submodular functions via their rank. If a given type of matroid cannot represent another kind, their ranks lie in distinct families. To study where DSFs are situated in the space of all submodular functions, it is useful first to study results regarding matroid rank functions.
Lemma 4.4. There are partition matroids that are not cycle matroids.
In a laminar matroid, a generalization of a partition matroid, we start with a set V and a family F = {F1, F2, . . .} of subsets Fi ⊆ V that is laminar, namely that for all i ≠ j either Fi ∩ Fj = ∅ or Fi ⊆ Fj or Fj ⊆ Fi (i.e., sets in F are either non-intersecting or comparable). In a laminar matroid, we also have for every F ∈ F an associated capacity k_F ∈ Z_+. A set I is independent if |I ∩ F| ≤ k_F for all F ∈ F. A laminar family of sets can be organized in a tree, where there is one root R ∈ F in the tree that, w.l.o.g., can be V itself. Then the immediate parents pa(F) ⊂ F of a set F ∈ F in the tree are the set of maximal subsets of F in F, i.e., pa(F) = {F′ ∈ F : F′ ⊂ F and ∄F′′ ∈ F s.t. F′ ⊂ F′′ ⊂ F}. We then define the following for all F ∈ F:

r_F(A) = min( ∑_{F′ ∈ pa(F)} r_{F′}(A ∩ F′) + |A \ ⋃_{F′ ∈ pa(F)} F′|, k_F )    (3)

A laminar matroid rank has a recursive definition r(A) = r_R(A) = r_V(A). Hence, if the family F forms a partition of V, we have a partition matroid. More interestingly, when compared to Eqn. (2), we see that a laminar matroid rank function is an instance of a DSF with a tree-structured DAG rather than the non-tree DAGs in Figure 1. Thus, within the family of DSFs lie the truncated matroid rank functions used to show information-theoretic hardness for many constrained submodular optimization problems [13]. Moreover, laminar matroids strictly generalize partition matroids.
Lemma 4.5. Laminar matroids strictly generalize partition matroids.
Since a laminar matroid generalizes a partition matroid, this portends well for DSFs generalizing SCMs. Before considering that, we already are up against some limits of laminar matroids, i.e.:
Lemma 4.6 (peeling proof). Laminar matroids cannot represent all cycle matroids.
We call this proof a "peeling proof" since it recursively peels off each layer (in the sense of a DSF) of a laminar matroid rank until it boils down to a partition matroid rank function, where the base case is clear. The proof is elucidating, moreover, since it motivates the proof of Theorem 4.14 showing that DSFs extend SCMs. We also have the immediate corollary.
Corollary 4.6.1. Partition matroids cannot represent all cycle matroids.
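The recursion of Eqn. (3) is short to state in code; in the sketch below, the laminar family, its tree structure, and the capacities are illustrative assumptions:

```python
# Each tree node is (set F, capacity k_F, list of immediate parents pa(F)),
# using the paper's convention that pa(F) are the maximal subsets of F.
leaf1 = ({0, 1}, 1, [])
leaf2 = ({2, 3}, 1, [])
root  = ({0, 1, 2, 3, 4, 5}, 3, [leaf1, leaf2])   # R = V itself

def laminar_rank(A, node):
    """Eqn. (3): r_F(A) = min(sum r_{F'}(A & F') + |A \\ union(F')|, k_F)."""
    F, kF, children = node
    covered = set().union(*(c[0] for c in children)) if children else set()
    inner = sum(laminar_rank(A & c[0], c) for c in children)
    return min(inner + len((A & F) - covered), kF)

assert laminar_rank({0, 1}, root) == 1        # truncated inside leaf1
assert laminar_rank({0, 2, 4, 5}, root) == 3  # truncated at the root
```

Each min(·, k_F) is a concave truncation of a non-negative sum, which is exactly the tree-structured DSF form noted in the text.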
We might expect, from the above results, that DSFs might generalize SCMs \u2014 this\nis not immediately obvious since SCMs are signi\ufb01cantly more \ufb02exible than partition matroid rank\nfunctions because: (1) the concave functions need not be simple truncations at integers, (2) each term\ncan have its own non-negative modular function, (3) there is no requirement to partition the ground\nelements over terms in an SCM, and (4) we may with relative impunity extend the family of SCMs\nto ones where we add an additional arbitrary modular function (what we will call SCMMs below).\nWe see, however, that SCMMs are also unable to represent the cycle matroid rank function over K4,\nvery much like the partition matroid rank function. Hence the above \ufb02exibility does not help in this\ncase. We then show that DSFs strictly generalize SCMMs, which means that DSFs indeed provide\na richer family of submodular functions to utilize, ones that as discussed above, retain many of the\nadvantages of SCMMs. We end the section by showing that DSFs, even with an additional arbitrary\nmodular function, are still unable to represent matroid rank over K4, implying that although DSFs\nextend SCMMs, they cannot express all monotone non-decreasing submodular functions.\nWe de\ufb01ne a family of sums of concave over modular functions with an additional modular term\ni \u03c6i(mi(A)) + m\u00b1(A) where each \u03c6i and mi as in an SCM,\nbut where m\u00b1 : 2V \u2192 R is an arbitrary modular function, so if m\u00b1(\u00b7) = 0 the SCMM is an SCM.\nBefore showing that DSFs extend SCMMs, we include a result showing that SCMMs are strictly\nsmaller than the set of all submodular functions. We include an unpublished result [28] showing that\nSCMMs can not represent the cycle matroid rank function, as described above, over the graph K4.\nTheorem 4.11 (Vondrak[28]). 
SCMMs \u2282 Space of Submodular Functions\nWe next show that DSFs strictly generalize SCMMs, thus providing justi\ufb01cation for using DSFs over\nSCMMs and, moreover, generalizing Lemma 4.5. The DSF we choose is, again, a laminar matroid,\nso SCMMs are unable to represent laminar matroid rank functions. Since DSFs generalize laminar\nmatroid rank functions, the result follows.\nTheorem 4.14. The DSF family is strictly larger than that of SCMs.\ni min(|X \u2229\nBi|, ki), k) using an SCMM. Theorem 4.15 also has an immediate consequence for concave\nfunctions.\nCorollary 4.14.1. There exists a non-USC concave function with an antitone superdifferential.\nThe corollary follows since, as mentioned above, DSF functions have at their core a multivarate\nconcave function with an antitone superdifferential, and thanks to Theorem 4.14 it is not always\npossible to represent this as a sum of concave over linear functions. It is currently an open problem if\nDSFs with (cid:96) layers extend the family of DSFs with (cid:96)(cid:48) < (cid:96) layers, for (cid:96)(cid:48) \u2265 2. Our \ufb01nal result shows\nthat, while DSFs are richer than SCMMs, they still do not encompass all polymatroid functions. We\nshow this by proving that the cycle matroid rank function on K4 is not achievable with DSFs.\nTheorem 4.15. DSFs \u2282 Polymatroids\nProofs of these theorems and more may be found in [5].\n\nThe proof shows that it is not possible to represent a function of the form f (X) = min((cid:80)\n\n6\n\n\f5 Learning DSFs\n\nAs mentioned above, learning submodular functions is generally dif\ufb01cult [11, 13]. Learning mixtures\nof \ufb01xed submodular component functions [24, 19], however, can give good empirical results on\nseveral tasks, including image [27] and document [19] summarization. In these examples, rather than\nattempting to learn a function at all 2n points, a max-margin approach is used only to approximate\na submodular function on its large values. 
Typically when training a summarizer, one is given a\nground set of items, and a set of representative sets of excerpts (usually human generated) each of\nwhich summarizes the ground set. Within this setting, access to an oracle function h(A) \u2014 that, if\navailable, could be used in a regression-style learning approach \u2014 might not be available. Even if\navailable, such learning is often overkill. Thus, instead of trying to learn h everywhere, we only\nseek to learn the parameters w of a function fw that lives within some family, parameterized by\nw, so that if B \u2208 argmaxA\u2286V :|A|\u2264k fw(A), then h(B) \u2265 \u03b1h(A\u2217) for some \u03b1 \u2208 [0, 1] where\nA\u2217 \u2208 argmaxA\u2286V :|A|\u2264k h(A). In practice, this corresponds to selecting the best summary for a\ngiven document based on the learnt function, in the hope that it mirrors what a human believes to be\nbest. Fortunately, the max-margin training approach directly addresses the above and is immediately\napplicable to learning DSFs. Also, given the ongoing research on learning DNNs, which have\nachieved state-of-the-art results on a plethora of machine learning tasks [17], and given the similarity\nbetween DSFs and DNNs, we may leverage the DNN learning techniques (such as dropout, AdaGrad,\nlearning rate scheduling, etc.) to our bene\ufb01t.\n5.1 Using max-margin learning for DSFs\nGiven an unknown but desired function h : 2V \u2192 R+ and a set of representative sets S =\n{S1, S2, . . .}, with Si \u2286 V and where for each S \u2208 S, h(S) is highly scored, a max-margin\nlearning approach may be used to train a DSF f so that if A \u2208 argmaxA\u2286V f (A), h(A) is also\nhighly scored by h. Under the large-margin approach [24, 19, 27], we learn the parameters w of\nfw such that for all S \u2208 S, fw(S) is high, while for A \u2208 2V , fw(A) is lower by some given\nloss. 
This may be performed by maximizing the loss-dependent margin so that for all S ∈ S and A ∈ 2^V, f_w(S) ≥ f_w(A) + ℓ_S(A). For a given loss function ℓ_S(A), optimization reduces to finding parameters so that f_w(S) ≥ max_{A∈2^V} [f_w(A) + ℓ_S(A)] is satisfied for S ∈ S. The task of finding the maximizing set is known as loss-augmented inference (LAI) [26], which for general ℓ(A) is NP-hard. With regularization, and defining the hinge operator (x)_+ = max(0, x), the optimization becomes:

min_{w≥0} ∑_{S∈S} ( max_{A∈2^V} [ f(A) + ℓ_S(A) ] − f(S) )_+ + (λ/2) ||w||_2^2.   (4)

Given a LAI procedure with maximizer A, the subgradient with respect to weight w_i is ∂f(A)/∂w_i − ∂f(S)/∂w_i + λw_i, and in the case of a DSF, each subgradient can be computed efficiently with backpropagation, similar to the approach of [23]; to retain polymatroidality of f, projected gradient descent is used to ensure w ⪰ 0.
For arbitrary set functions f(A) and ℓ_S(A), LAI is generally intractable. Even if f(A) is submodular, the choice of loss can affect the computational feasibility of LAI. For submodular ℓ(A), the greedy algorithm can find an approximately maximizing set [21]. For supermodular ℓ(A), the task of solving max_{A∈2^V\S} [f(A) + ℓ(A)] involves maximizing the difference of two submodular functions, and the submodular-supermodular procedure [14] can be used.
Once a DSF is learnt, we may wish to find max_{A⊆V:|A|≤k} f(A), and this can be done, e.g., using the greedy algorithm when m± ≥ 0. The task of summarization, however, might involve learning based on one set of ground sets and testing via a different (set of) ground set(s) [19, 27]. To do this, any particular element v ∈ V may be represented by a vector of non-negative feature weights (m_1(v), m_2(v), . . .
) (e.g., m_i(v) counts the number of times a unigram i appears in sentence v), and the feature-i weight for any set A ⊆ V is represented as the i-specific modular evaluation m_i(A) = ∑_{a∈A} m_i(a). We can treat the set of modular functions {m_i : V → R+}_i as a matrix to be used as the first layer in a DSF (e.g., w^(1) in Figure 1 (left)) that is fixed during the training of subsequent layers. This preserves submodularity, and allows all later layers (i.e., w^(2), w^(3), . . . ) to be learnt generically over any set of objects that can be represented in the same feature space; this also allows training over one set of ground sets, and testing on a totally separate set of ground sets. Max-margin learning, in fact, remains ignorant that this is happening since it sees the data only post feature representation. In fact, learning can be cross-modal: e.g., images and sentences, represented in the same feature space, can be learnt simultaneously. This is analogous to the "shells" of [19]. In that case, however, mixtures were learnt over fixed components, some of which required an O(n^2) calculation for element-pair similarity scores.

Figure 3: (a) cycle matroid; (b) laminar matroid; (c) image summarization. (a),(b) show matroid learning via a DSF is possible in a max-margin setting; (c) shows that learning and generalization of a DSF can happen, via featurization, on real image data.
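A hedged sketch of the overall procedure (greedy loss-augmented inference, then a projected subgradient step on the hinge objective (4)), using a one-layer f_w(A) = ∑_i w_i √(m_i(A)) whose subgradients ∂f_w(A)/∂w_i = √(m_i(A)) are available in closed form; the feature matrix, the modular loss ℓ_S, and all constants are illustrative assumptions:

```python
import math

V = list(range(6))                                    # toy ground set
M = [[1, 0], [1, 1], [0, 1], [2, 0], [0, 2], [1, 1]]  # fixed first-layer features

def m(i, A):
    return sum(M[a][i] for a in A)

def f(w, A):
    """One concave layer: f_w(A) = sum_i w_i * sqrt(m_i(A))."""
    return sum(w[i] * math.sqrt(m(i, A)) for i in range(len(w)))

def grad_f(w, A):
    """d f_w(A) / d w_i = sqrt(m_i(A))."""
    return [math.sqrt(m(i, A)) for i in range(len(w))]

def loss(S, A):
    """Illustrative modular loss ell_S(A): counts elements outside S."""
    return sum(1.0 for a in A if a not in S)

def greedy_lai(w, S, k):
    """Greedy loss-augmented inference: approx. argmax_A f_w(A) + ell_S(A)."""
    A = set()
    for _ in range(k):
        A.add(max((v for v in V if v not in A),
                  key=lambda v: f(w, A | {v}) + loss(S, A | {v})))
    return A

def train(summaries, k=3, lam=0.1, eta=0.2, epochs=50):
    w = [1.0, 1.0]
    for _ in range(epochs):
        for S in summaries:
            A = greedy_lai(w, S, k)
            active = f(w, A) + loss(S, A) > f(w, S)        # hinge term active?
            g = ([ga - gs for ga, gs in zip(grad_f(w, A), grad_f(w, S))]
                 if active else [0.0] * len(w))
            # Projected subgradient step: clip at 0 to keep w >= 0,
            # which preserves submodularity of f_w.
            w = [max(0.0, wi - eta * (gi + lam * wi)) for wi, gi in zip(w, g)]
    return w
```

For example, `train([{0, 1, 3}])` drives the margin of the one training summary above competing sets, up to the regularizer; a real DSF would replace `grad_f` with backpropagation through its layers.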
Via featurization in the first layer of a DSF, however, we may learn a DSF over a training set, preserving submodularity and avoiding any O(n^2) cost, and test on any new data represented in the same feature space.

6 Empirical Experiments on Learning DSFs

We offer preliminary feasibility results showing it is possible to train a DSF on synthetic datasets and, via featurization, on a real image summarization dataset.
The first synthetic experiment trains a DSF to learn a cycle matroid rank function on K4. Although Theorem 4.15 shows that a DSF cannot represent such a rank function everywhere, we show that the max-margin framework can learn a DSF that, when maximized via max_{A⊆V:|A|≤3} f(A), does not return a 3-cycle (as is desirable). We used a simple two-layer DSF, where the first hidden layer consisted of four hidden units with square-root activation functions, and a normalized sigmoid σ̂(x) = 2·(σ(x) − 0.5) at the output. Figure 3a shows that after sufficient learning iterations, greedy applied to the DSF returns independent sets of size three. Further analysis shows that the function is not learnt everywhere, as predicted by Theorem 4.15.
We next tested a scaled laminar matroid rank r(A) = (1/8) min(∑_{i=1}^{10} min(|A ∩ B_i|, 1), 8), where the B_i's are each of size 10 and form a partition of V, with |V| = 100. Thus maximal independent sets argmax_{I∈I} |I| have r(I) = 1 with |I| = 8. A DSF is trained with a hidden layer of 10 units with activation g(x) = min(x, 1), and a normalized sigmoid σ̂ at the output. We randomly generated 200 matroid bases, and trained the network. The greedy solution to max_{A⊆V:|A|≤8} f(A) on the learnt DSF produces sets that are maximally independent (Figure 3b).
For our real-world instance of learning DSFs, we use the dataset of [27], which consists of 14 distinct image sets, 100 images each.
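For reference, the scaled laminar rank from the second synthetic experiment is straightforward to evaluate directly; a minimal sketch, where the B_i are taken (as an illustrative choice) to be consecutive blocks of {0, ..., 99}:

```python
# Scaled laminar matroid rank from the second synthetic experiment:
#   r(A) = (1/8) * min(sum_{i=1}^{10} min(|A ∩ B_i|, 1), 8),
# with the B_i a partition of V = {0,...,99} into 10 blocks of size 10.
B = [set(range(10 * i, 10 * (i + 1))) for i in range(10)]

def r(A):
    """Truncated (rank-8) partition matroid rank, scaled to [0, 1]."""
    A = set(A)
    return min(sum(min(len(A & Bi), 1) for Bi in B), 8) / 8.0

# A set touching 8 distinct blocks is maximally independent: r = 1 at |A| = 8.
basis = {0, 10, 20, 30, 40, 50, 60, 70}
```

Sets returned by greedy on the learnt DSF can then be checked for maximal independence by testing whether r evaluates to 1 on them.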
The task is to select the best 10-image summary in terms of a visual ROUGE-like function that is defined over a bag of visual features. For each of the 14 ground sets, we trained on the other 13 sets and evaluated the performance of the trained DSF on the test set.
We use a simple DSF of the form f(A) = σ̂(∑_{u∈U} w_u √(m_u(A))), where m_u(A) is modular for feature u, and σ̂ is a sigmoid. We used (diagonalized) AdaGrad, a decaying learning rate, weight decay, and dropout (which was critical for test-set performance). We compared to an SCMM of comparable complexity and number of parameters (i.e., the same form and features but a linear output), and the performance of the SCMM is much worse (Figure 3c), perhaps because of a DSF's "depth." Notably, we require only |U| = 628 visual-word features (as covered in Section 5 of [27]), while the approach in [27] required 594 components of O(n^2) graph values, or roughly 5.94 million precomputed values. The loss function is ℓ(A) = 1 − R(A), where R(A) is a ROUGE-like function defined over visual words. During training, we achieve numbers comparable to [27]. We do not yet match the generalization results in [27], but we do not use strong O(n^2) graph components, and we expect better results perhaps with a deeper network and/or better base features.
Acknowledgments: Thanks to Reza Eghbali and Kai Wei for useful discussions. This material is based upon work supported by the National Science Foundation under Grant No.
IIS-1162606, the National Institutes of Health under award R01GM103544, and by a Google, a Microsoft, a Facebook, and an Intel research award. This work was supported in part by TerraSwarm, one of six centers of STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA.

References

[1] A. Badanidiyuru, B. Mirzasoleiman, A. Karbasi, and A. Krause. Streaming submodular maximization: Massive data summarization on the fly. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 671–680. ACM, 2014.
[2] M. Balcan and N. Harvey. Learning submodular functions. Technical report, arXiv:1008.2159, 2010.
[3] Y. Bengio. Learning deep architectures for AI. Foundations and Trends® in Machine Learning, 2(1):1–127, 2009.
[4] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.
[5] Jeffrey Bilmes and Wenruo Bai. Deep submodular functions. arXiv, abs/1701.08939, Jan 2017. http://arxiv.org/abs/1701.08939.
[6] S. P. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[7] Niv Buchbinder, Moran Feldman, Joseph Seffi Naor, and Roy Schwartz. Submodular maximization with cardinality constraints. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1433–1452. Society for Industrial and Applied Mathematics, 2014.
[8] Gruia Calinescu, Chandra Chekuri, Martin Pál, and Jan Vondrák.
Maximizing a monotone submodular function subject to a matroid constraint. SIAM Journal on Computing, 40(6):1740–1766, 2011.
[9] Chandra Chekuri, Shalmoli Gupta, and Kent Quanrud. Streaming algorithms for submodular function maximization. In International Colloquium on Automata, Languages, and Programming, pages 318–330. Springer, 2015.
[10] W. H. Cunningham. Testing membership in matroid polyhedra. Journal of Combinatorial Theory, Series B, 36:161–188, 1984.
[11] V. Feldman and J. Vondrák. Optimal bounds on approximation of submodular and XOS functions by juntas. CoRR, abs/1307.3301, 2013.
[12] S. Fujishige. Submodular Functions and Optimization. Number 58 in Annals of Discrete Mathematics. Elsevier Science, 2nd edition, 2005.
[13] M. X. Goemans, N. J. A. Harvey, S. Iwata, and V. Mirrokni. Approximating submodular functions everywhere. In SODA, pages 535–544, 2009.
[14] R. Iyer and J. Bilmes. Algorithms for approximate minimization of the difference between submodular functions, with applications. In Uncertainty in Artificial Intelligence (UAI), 2012.
[15] Rishabh Iyer and Jeff Bilmes. Submodular optimization with submodular cover and submodular knapsack constraints. In Neural Information Processing Society (NIPS), Lake Tahoe, CA, December 2013.
[16] Rishabh Iyer, Stefanie Jegelka, and Jeff A. Bilmes. Fast semidifferential-based submodular function optimization. In International Conference on Machine Learning (ICML), Atlanta, Georgia, 2013.
[17] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, May 2015.
[18] Jon Lee, Vahab S. Mirrokni, Viswanath Nagarajan, and Maxim Sviridenko. Non-monotone submodular maximization under matroid and knapsack constraints. In Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing, pages 323–332. ACM, 2009.
[19] H. Lin and J. Bilmes.
Learning mixtures of submodular shells with application to document summarization. In Uncertainty in Artificial Intelligence (UAI), Catalina Island, USA, July 2012. AUAI.
[20] Hui Lin and Jeff Bilmes. A class of submodular functions for document summarization. In Annual Meeting of the Association for Computational Linguistics (ACL), 2011.
[21] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. An analysis of approximations for maximizing submodular set functions-I. Mathematical Programming, 14:265–294, 1978.
[22] R. Nishihara, S. Jegelka, and M. I. Jordan. On the convergence rate of decomposable submodular function minimization. In Advances in Neural Information Processing Systems, pages 640–648, 2014.
[23] W. Pei. Max-margin tensor neural network for Chinese word segmentation. Transactions of the Association for Computational Linguistics, pages 293–303, 2014.
[24] R. Sipos, P. Shivaswamy, and T. Joachims. Large-margin learning of submodular summarization models. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 224–233. Association for Computational Linguistics, 2012.
[25] P. Stobbe and A. Krause. Efficient minimization of decomposable submodular functions. In NIPS, 2010.
[26] B. Taskar, V. Chatalbashev, D. Koller, and C. Guestrin. Learning structured prediction models: A large margin approach. In Proceedings of the 22nd International Conference on Machine Learning, pages 896–903, 2005.
[27] S. Tschiatschek, R. Iyer, H. Wei, and J. Bilmes. Learning mixtures of submodular functions for image collection summarization. In Neural Information Processing Society (NIPS), Montreal, Canada, December 2014.
[28] J. Vondrak. Personal communication, 2011.
[29] K. Wei, Y. Liu, K. Kirchhoff, and J. Bilmes. Unsupervised submodular subset selection for speech data. In Proc. IEEE Intl. Conf.
on Acoustics, Speech, and Signal Processing, Florence, Italy, 2014.