{"title": "Learning Mixtures of Submodular Functions for Image Collection Summarization", "book": "Advances in Neural Information Processing Systems", "page_first": 1413, "page_last": 1421, "abstract": "We address the problem of image collection summarization by learning mixtures of submodular functions. We argue that submodularity is very natural to this problem, and we show that a number of previously used scoring functions are submodular \u2014 a property not explicitly mentioned in these publications. We provide classes of submodular functions capturing the necessary properties of summaries, namely coverage, likelihood, and diversity. To learn mixtures of these submodular functions as scoring functions, we formulate summarization as a supervised learning problem using large-margin structured prediction. Furthermore, we introduce a novel evaluation metric, which we call V-ROUGE, for automatic summary scoring. While a similar metric called ROUGE has been successfully applied to document summarization [14], no such metric was known for quantifying the quality of image collection summaries. We provide a new dataset consisting of 14 real-world image collections along with many human-generated ground truth summaries collected using mechanical turk. We also extensively compare our method with previously explored methods for this problem and show that our learning approach outperforms all competitors on this new dataset. 
This paper provides, to our knowledge, the first systematic approach for quantifying the problem of image collection summarization, along with a new dataset of image collections and human summaries.", "full_text": "Learning Mixtures of Submodular Functions for\n\nImage Collection Summarization\n\nSebastian Tschiatschek\n\nDepartment of Electrical Engineering\n\nGraz University of Technology\ntschiatschek@tugraz.at\n\nRishabh Iyer\n\nDepartment of Electrical Engineering\n\nUniversity of Washington\n\nrkiyer@u.washington.edu\n\nHaochen Wei\n\nJeff Bilmes\n\nLinkedIn & Department of Electrical Engineering\n\nDepartment of Electrical Engineering\n\nUniversity of Washington\nweihch90@gmail.com\n\nUniversity of Washington\n\nbilmes@u.washington.edu\n\nAbstract\n\nWe address the problem of image collection summarization by learning mixtures of\nsubmodular functions. Submodularity is useful for this problem since it naturally\nrepresents characteristics such as \ufb01delity and diversity, desirable for any summary.\nSeveral previously proposed image summarization scoring methodologies, in fact,\ninstinctively arrived at submodularity. We provide classes of submodular compo-\nnent functions (including some which are instantiated via a deep neural network)\nover which mixtures may be learnt. We formulate the learning of such mixtures as a\nsupervised problem via large-margin structured prediction. As a loss function, and\nfor automatic summary scoring, we introduce a novel summary evaluation method\ncalled V-ROUGE, and test both submodular and non-submodular optimization\n(using the submodular-supermodular procedure) to learn a mixture of submodular\nfunctions. Interestingly, using non-submodular optimization to learn submodular\nfunctions provides the best results. We also provide a new data set consisting of\n14 real-world image collections along with many human-generated ground truth\nsummaries collected using Amazon Mechanical Turk. 
We compare our method with previous work on this problem and show that our learning approach outperforms all competitors on this new data set. This paper provides, to our knowledge, the first systematic approach for quantifying the problem of image collection summarization, along with a new data set of image collections and human summaries.

1 Introduction

The number of photographs being uploaded online is growing at an unprecedented rate. A recent estimate is that 500 million images are uploaded to the internet every day (considering only Flickr, Facebook, Instagram, and Snapchat), a figure which is expected to double every year [22]. Organizing this vast amount of data is becoming an increasingly important problem. Moreover, the majority of this data is in the form of personal image collections, and a natural problem is to summarize such vast collections. For example, one may have a collection of images taken on a holiday trip, and want to summarize and arrange this collection to send to a friend or family member or to post on Facebook. Here the problem is to identify a subset of the images which concisely represents all the diversity of the holiday trip. Another example is scene summarization [28], where one wants to concisely represent a scene, like the Vatican or the Colosseum. This is relevant for creating a visual summary of a particular interest point, where we want to identify a representative set of views. Another application that is gaining importance is summarizing video collections [26, 13] in order to enable efficient navigation of videos. This is particularly important in security applications, where one wishes to quickly identify representative and salient images in massive amounts of video.

These problems are closely related and can be unified via the problem of finding the most representative subset of images from an entire image collection.
We argue that many formulations of this problem are naturally instances of submodular maximization, a statement supported by the fact that a number of scoring functions previously investigated for image summarization are (apparently unintentionally) submodular [30, 28, 5, 29, 8].

A set function f(·) is said to be submodular if for any element v and sets A ⊆ B ⊆ V \ {v}, where V represents the ground set of elements, f(A ∪ {v}) − f(A) ≥ f(B ∪ {v}) − f(B). This is called the diminishing returns property and states, informally, that adding an element to a smaller set increases the function value at least as much as adding that element to a larger set. Submodular functions naturally model notions of coverage and diversity in applications, and therefore a number of machine learning problems can be modeled as forms of submodular optimization [11, 20, 18]. In this paper, we investigate structured prediction methods for learning weighted mixtures of submodular functions for image collection summarization.

Related Work: Previous work on image summarization can broadly be categorized into (a) clustering-based approaches, and (b) approaches which directly optimize certain scoring functions. The clustering papers include [12, 8, 16]. For example, [12] proposes a hierarchical clustering-based summarization approach, while [8] uses k-medoids-based clustering to generate summaries. Similarly, [16] proposes top-down clustering. A number of other methods attempt to directly optimize certain scoring functions. For example, [28] focuses on scene summarization and poses an objective capturing important summarization metrics such as likelihood, coverage, and orthogonality. While they do not explicitly mention this, their objective function is in fact submodular. Furthermore, they propose a greedy algorithm to optimize their objective.
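The diminishing-returns definition above can be checked exhaustively on a toy example. A minimal sketch, assuming a simple set-cover objective over hypothetical image "tags" (not part of the paper's data):

```python
from itertools import chain, combinations

# Toy check of the diminishing-returns definition of submodularity, using a
# set-cover function f(A) = number of distinct tags covered by the images in A.
# The tag assignments below are hypothetical, for illustration only.
tags = {1: {"beach", "sea"}, 2: {"sea", "cliff"}, 3: {"cliff", "town"}}

def f(A):
    """Coverage function: number of distinct tags covered by A (submodular)."""
    return len(set().union(*(tags[i] for i in A))) if A else 0

def gain(A, v):
    """Marginal gain f(A ∪ {v}) − f(A)."""
    return f(A | {v}) - f(A)

def subsets(S):
    S = list(S)
    return (set(c) for r in range(len(S) + 1) for c in combinations(S, r))

# Diminishing returns: for all v and nested A ⊆ B ⊆ V \ {v},
# the gain at the smaller set is at least the gain at the larger set.
V = set(tags)
ok = all(
    gain(A, v) >= gain(B, v)
    for v in V
    for B in subsets(V - {v})
    for A in subsets(B)
)
print(ok)
```

Coverage functions of this kind satisfy the inequality for every nested pair, which is why they recur as building blocks throughout this paper.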
A similar approach was proposed by [30, 29], where a set cover function (which incidentally also is submodular) is used to model coverage, and a minimum disparity formulation is used to model diversity. Interestingly, they optimize their objective using the same greedy algorithm. Similarly, [15] models the problem of diverse image retrieval via determinantal point processes (DPPs). DPPs are closely related to submodularity, and in fact their MAP inference problem is an instance of submodular maximization. Another approach for image summarization was posed by [5], where they define an objective function using a graph-cut function and attempt to solve it using a semidefinite relaxation. They unintentionally use an objective that is submodular (and approximately monotone [18]) and that can be optimized using the greedy algorithm.

Our Contributions: We introduce a family of submodular component functions for image collection summarization over which a convex mixture can be placed, and we propose a large-margin formulation for learning the mixture. We introduce a novel data set of fourteen personal image collections, along with ground truth human summaries collected via Amazon Mechanical Turk and subsequently cleaned via methods described below. Moreover, in order to automatically evaluate the quality of novel summaries, we introduce a recall-based evaluation metric, which we call V-ROUGE, to compare automatically generated summaries to the human ones. We are inspired by ROUGE [17], a well-known criterion for evaluating summaries in the document summarization community, but we are unaware of any similar efforts in the computer vision community for image summarization. We show evidence that V-ROUGE correlates well with human evaluation. Finally, we extensively validate our approach on these data sets, and show that it outperforms previously explored methods developed for similar problems.
The resulting learnt objective, moreover, matches human summarization performance on test data.

2 Image Collection Summarization

Summarization is a task that most humans perform intuitively. Broadly speaking, summarization is the task of extracting from a source the information that is both minimal and most important. The precise meaning of most important (relevance) is typically subjective, differs from individual to individual, and is hence difficult to precisely quantify. Nevertheless, we can identify two general properties that characterize good image collection summaries [19, 28]:

Fidelity: A summary should have good coverage, meaning that all of the distinct "concepts" in the collection have at least one representative in the summary. For example, a summary of a photo collection containing both mountains and beaches should contain images of both scene types.

Diversity: Summaries should be as diverse as possible, i.e., they should not contain images that are similar or identical to each other. Another term for this concept is dispersion; in computer vision, this property has been referred to as orthogonality [28].

Note that [28] also includes the notion of "likelihood," where summary images should have high similarity to many other images in the collection. We believe, however, that such likelihood is subsumed by fidelity.
That is, a summary that only has images similar to many in the collection might miss certain outlier, or minority, concepts in the collection, while a summary that has high fidelity should include a representative image for every majority and minority concept in the collection. Also, the above properties could be made arbitrarily high without imposing further size or budget constraints; i.e., the goal of a summary is to find a small, or within-budget, subset having the above properties.

2.1 Problem Formulation

We cast the problem of image collection summarization as a subset selection problem: given a collection of images I = (I₁, I₂, ..., I_|V|) represented by an index set V and given a budget c, we aim to find a subset S ⊆ V, |S| ≤ c, which best summarizes the collection. Though alternative approaches are possible, we aim to solve this problem by learning a scoring function F : 2^V → R₊, such that high quality summaries are mapped to high scores and low quality summaries to low scores. Then, image collection summarization can be performed by computing:

S* ∈ argmax_{S ⊆ V, |S| ≤ c} F(S).    (1)

For arbitrary set functions, computing S* is intractable, but for monotone submodular functions we rely on the classic result [25] that the greedy algorithm offers a constant-factor approximation guarantee. Computational tractability notwithstanding, submodular functions are natural for measuring fidelity and diversity [19], as we argue in Section 4.

2.2 Evaluation Criteria: V-ROUGE

Before describing practical submodular functions for mixture components, we discuss a crucial element for both summarization evaluation and for the automated learning of mixtures: an objective evaluation criterion for judging the quality of summaries.
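The greedy maximization referenced for Eq. (1) can be sketched in a few lines. A minimal version, with a hypothetical coverage-style objective standing in for F:

```python
# Sketch of the classic greedy algorithm for Eq. (1): for a monotone
# submodular F, repeatedly adding the element with the largest marginal gain
# yields a (1 - 1/e)-approximation under a cardinality constraint [25].
def greedy_summarize(F, V, budget):
    S = set()
    while len(S) < budget and len(S) < len(V):
        # element with the largest marginal gain F(S ∪ {j}) − F(S)
        j = max((v for v in V if v not in S), key=lambda v: F(S | {v}) - F(S))
        S.add(j)
    return S

# Toy coverage-style objective over hypothetical "concepts" per image.
concepts = {0: {"x"}, 1: {"x", "y"}, 2: {"z"}}
F = lambda S: len(set().union(*(concepts[i] for i in S))) if S else 0
print(sorted(greedy_summarize(F, set(concepts), budget=2)))  # → [1, 2]
```

Image 1 is picked first (it covers two concepts), then image 2 (the only remaining gain); image 0 adds nothing once image 1 is in the summary.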
Our criterion is constructed similarly to the popular ROUGE score used in multi-document summarization [17], which correlates well with human perception. For document summarization, ROUGE (which, in fact, is submodular [19, 20]) is defined as:

r_S(A) = [Σ_{S∈S} Σ_{w∈W} min(c_w(A), c_w(S))] / [Σ_{S∈S} Σ_{w∈W} c_w(S)]   (≜ r(A) when S is clear from the context),    (2)

where S is a set of human-generated reference summaries, W is a set of features (n-grams), and where c_w(A) is the occurrence count of w in summary A. We may extend r(·) to handle images by letting W be a set of visual words, S a set of reference summaries, and c_w(A) the occurrence count of visual word w in summary A. Visual words can, for example, be computed from SIFT descriptors [21], as common in the popular bag-of-words framework in computer vision [31]. We call this V-ROUGE (visual ROUGE). In our experiments, we use visual words extracted from color histograms, from super-pixels, and also from OverFeat [27], a deep convolutional network; details are given in Section 5.

3 Learning Framework

We construct our submodular scoring functions F_w(·) as convex combinations of non-negative submodular functions f₁, f₂, ..., f_m, i.e., F_w(S) = Σ_{i=1}^m w_i f_i(S), where w = (w₁, ..., w_m), w_i ≥ 0, Σ_i w_i = 1. The functions f_i are submodular components and assumed to be normalized: f_i(∅) = 0, and f_i(V) = 1 for polymatroid functions and max_{A⊆V} f_i(A) ≤ 1 for non-monotone functions. This ensures that the components are compatible with each other. Obviously, the merit of the scoring function F_w(·) depends on the selection of the components. In Section 4, we provide a large number of natural component choices, mixtures over which span a large diversity of submodular functions.
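Eq. (2) can be computed directly from visual-word occurrence counts. A small sketch, with hypothetical counts (the visual vocabularies actually used are described in Section 5):

```python
from collections import Counter

# Sketch of the recall score in Eq. (2), operating directly on visual-word
# occurrence counts. All counts below are hypothetical.
def v_rouge(cand_counts, ref_counts_list):
    """Sum over references S and words w of min(c_w(A), c_w(S)),
    normalized by the total word count of the references."""
    num = sum(
        min(cand_counts.get(w, 0), ref[w])
        for ref in ref_counts_list
        for w in ref
    )
    den = sum(sum(ref.values()) for ref in ref_counts_list)
    return num / den

cand = Counter({"w1": 2, "w2": 1})        # candidate summary A
refs = [Counter({"w1": 1, "w3": 1}),      # human reference summaries S
        Counter({"w2": 2})]
print(v_rouge(cand, refs))  # (1 + 0 + 1) / (2 + 2) = 0.5
```

Because the numerator clips each count at the reference count, words missing from every reference contribute nothing, which is what makes the score recall-oriented.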
Many of these component functions have appeared individually in past work and are unified into a single framework in our approach.

Large-margin Structured Prediction: We optimize the weights w of the scoring function F_w(·) in a large-margin structured prediction framework, i.e., the weights are optimized such that human summaries S are separated from competitor summaries by a loss-dependent margin:

F_w(S) ≥ F_w(S′) + ℓ(S′),   ∀S ∈ S, S′ ∈ Y \ S,    (3)

where ℓ(·) is the considered loss function, and where Y is a structured output space (for example, Y is the set of summaries that satisfy a certain budget c, i.e., Y = {S′ ⊆ V : |S′| ≤ c}). We assume the loss to be normalized, 0 ≤ ℓ(S′) ≤ 1, ∀S′ ⊆ V, to ensure mixture and loss are calibrated. Equation (3) can be stated as F_w(S) ≥ max_{S′∈Y} [F_w(S′) + ℓ(S′)], ∀S ∈ S; computing this maximum is called loss-augmented inference. We introduce slack variables and minimize the regularized sum of slacks [20]:

min_{w ≥ 0, ‖w‖₁ = 1}  Σ_{S∈S} [ max_{S′∈Y} (F_w(S′) + ℓ(S′)) − F_w(S) ] + (λ/2)‖w‖₂²,    (4)

where the non-negative orthant constraint, w ≥ 0, ensures that the final mixture is submodular. Note that a 2-norm regularizer is used on top of a 1-norm constraint ‖w‖₁ = 1, which we interpret as a prior to encourage higher-entropy, and thus more diverse, mixture distributions. Tractability depends on the choice of the loss function. An obvious choice is ℓ(S) = 1 − r(S), which yields a non-submodular optimization problem suitable for optimization methods such as [10] (and which we try in Section 7). We also consider other loss functions that retain submodularity in loss-augmented inference.
For now, assume that S̃ = argmax_{S′∈Y} [F_w(S′) + ℓ(S′)] can be estimated efficiently. The objective in (4) can then be minimized using standard stochastic gradient descent methods, where the gradient for sample S with respect to weight w_i is given as

∂/∂w_i [ F_w(S̃) + ℓ(S̃) − F_w(S) + (λ/2)‖w‖₂² ] = f_i(S̃) − f_i(S) + λw_i.    (5)

Loss Functions: A natural loss function is ℓ_{1−R}(S) = 1 − r(S), where r(S) = V-ROUGE(S). Because r(S) is submodular, 1 − r(S) is supermodular, and hence maximizing F_w(S′) + ℓ(S′) requires difference-of-submodular set function maximization [24], which is NP-hard [10]. We also consider two alternative loss functions [20], complement V-ROUGE and surrogate V-ROUGE. Complement V-ROUGE sets ℓ_c(S) = r(V \ S); it is still submodular but non-monotone. ℓ_c(·) does have the necessary characteristics of a proper loss: summaries S₊ with large V-ROUGE score are mapped to small values and summaries S₋ with small V-ROUGE score are mapped to large values. In particular, submodularity means r(S) + r(V \ S) ≥ r(V) + r(∅) = r(V), or r(V \ S) ≥ r(V) − r(S) = 1 − r(S), so complement V-ROUGE is a submodular upper bound on the ideal supermodular loss. We define surrogate V-ROUGE as ℓ_surr(A) = (1/Z) Σ_{S∈S} Σ_{w∈W_S^c} c_w(A), where W_S^c is the set of visual words that do not appear in reference summary S and Z is a normalization constant. Here, a summary has a high loss if it contains many visual words that do not occur in the reference summaries and a low loss if it mainly contains visual words that occur in the reference summaries.
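One step of the stochastic gradient update in (5) can be sketched as a projected subgradient step. The sort-based simplex projection below is a standard way to restore the constraints w ≥ 0, ‖w‖₁ = 1; it is shown as one plausible choice, not necessarily the paper's exact procedure, and the component functions are hypothetical callables:

```python
# Sketch of one stochastic subgradient step for (4)/(5). It assumes the
# loss-augmented summary S_tilde has already been found; `components` is a
# list of callables f_i. The simplex projection is a standard sort-based
# routine (an assumption, not taken from the paper).
def project_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = sorted(v, reverse=True)
    css, theta = 0.0, 0.0
    for i, ui in enumerate(u, start=1):
        css += ui
        t = (css - 1.0) / i
        if ui - t > 0:
            theta = t
    return [max(vi - theta, 0.0) for vi in v]

def sgd_step(w, components, S, S_tilde, lam, eta):
    # subgradient from Eq. (5): f_i(S_tilde) - f_i(S) + lam * w_i
    g = [fi(S_tilde) - fi(S) + lam * wi for fi, wi in zip(components, w)]
    return project_simplex([wi - eta * gi for wi, gi in zip(w, g)])
```

The projection keeps the mixture a valid convex combination after every update, so the learnt scoring function stays submodular throughout training.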
Surrogate V-ROUGE is not only monotone submodular, it is modular.

Loss-augmented Inference: Depending on the loss function, different algorithms for performing loss-augmented inference, i.e., computation of the maximum in (4), must be used. When using the surrogate loss ℓ_surr(·), the mixture function together with the loss, i.e., f_L(S) = F_w(S) + ℓ(S), is submodular and monotone. Hence, the greedy algorithm [25] can be used for maximization. This algorithm is extremely simple to implement: starting at S₀ = ∅, it iteratively chooses an element j ∉ S_t that maximizes f_L(S_t ∪ {j}), until the budget constraint is violated. While the complexity of this simple procedure is O(n²) function evaluations, it can be significantly accelerated, thanks again to submodularity [23], and in practice we find it runs in almost linear time. When using complement V-ROUGE ℓ_c(·) as the loss, f_L(S) is still submodular but no longer monotone, so we utilize the randomized greedy algorithm [2] (which is essentially a randomized variant of the greedy algorithm above, and has approximation guarantees). Finally, when using the loss 1-V-ROUGE, F_w(S) + ℓ(S) is neither submodular nor monotone and approximate maximization is intractable. However, we resort to well-motivated and scalable heuristics, such as the submodular-supermodular procedures that have shown good performance in various applications [24, 10].

Runtime Inference: Having learnt the weights for the mixture components, the resulting function F_w(S) = Σ_{i=1}^m w_i f_i(S) is monotone submodular and can be optimized by the accelerated greedy algorithm [23]. Thanks to submodularity, we can obtain near-optimal solutions very efficiently.

4 Submodular Component Functions

In this section, we consider candidate submodular component functions to use in F_w(·).
We first consider functions capturing more of the notion of fidelity, and then diversity, although the distinction is not entirely crystal clear as some functions have aspects of both. Many of the components are graph-based. We define a weighted graph G(V, E, s), with V representing the full set of images and E containing every pair of elements in V. Each edge (i, j) ∈ E has a weight s_{i,j} computed from the visual features as described in Section 7; s_{i,j} is a similarity score between images i and j.

4.1 Fidelity-like Functions

A function representing the fidelity of a subset to the whole is one that takes a large value when the subset faithfully represents that whole. An intuitively reasonable property for such a function is that it scores a summary highly if the summary, as a whole, is similar to a large majority of items in the set V. In this case, if a given summary A has a fidelity of f(A), then any superset B ⊃ A should, if anything, have higher fidelity, and thus it seems natural to use only monotone non-decreasing functions as fidelity functions. Submodularity is also a natural property, since as more and more properties of an image collection are covered by a summary, the less chance any given image not part of the summary has of offering additional coverage; in other words, submodularity is a natural model for measuring the inherent redundancy in any summary. Given this, we briefly describe some possible choices for coverage functions:

Facility Location. Given a summary S ⊆ V, we can quantify coverage of the whole image collection V by the similarity between each i ∈ V and its closest image j ∈ S. Summing these similarities yields the facility location function f_fac.loc.(S) = Σ_{i∈V} max_{j∈S} s_{i,j}. The facility location function has been used for scene summarization in [28] and as one of the components in [20].

Sum Coverage. Here, we compute the average similarity to S rather than the similarity to the best element in S only. From the graph perspective (G), we sum over the weights of edges with at least one vertex in S. Thus, f_sum cov.(S) = Σ_{i∈V} Σ_{j∈S} s_{i,j}.

Thresholded sum/truncated graph cut. This function has been used in document summarization [20] and is similar to the sum coverage function, except that the inner sum is thresholded instead of summing over all elements in S. Define σ_i(S) = Σ_{j∈S} s_{i,j}; informally, σ_i(S) conveys how much of image i is covered by S. In order to keep an element i from being overly covered by S as the cause of the objective getting large, we define f_thresh.sum(S) = Σ_{i∈V} min(σ_i(S), α σ_i(V)), which is both monotone and submodular [20]. Under budget constraints, this function avoids summaries that over-cover any images.

Feature functions. Consider a bag-of-words image model where, for i ∈ V, b_i = (b_{i,w})_{w∈W} is a bag-of-words representation of image i indexed by the set of visual words W (cf. Section 5). We can then define a feature coverage function [14] using the visual words as f_feat.cov.(S) = Σ_{w∈W} g(Σ_{i∈S} b_{i,w}), where g(·) is a monotone non-decreasing concave function. This class is both monotone and submodular, with the added benefit of scalability, since it does not require computation of an O(n²) similarity matrix like the graph-based functions proposed above.

4.2 Diversity

Diversity is another trait of a good summary, and there are a number of ways to quantify it.
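The fidelity components of Section 4.1 are short enough to sketch directly. A minimal version of two of them, over a hypothetical similarity matrix and hypothetical bag-of-words counts (g = sqrt is one possible concave choice):

```python
import math

# Sketches of two fidelity components: facility location over a toy
# similarity matrix s (s[i][j] in [0, 1]) and feature coverage over toy
# visual-word counts. Data and the choice g = sqrt are illustrative only.
def facility_location(S, s):
    """f_fac.loc.(S) = sum_i max_{j in S} s[i][j]."""
    if not S:
        return 0.0
    return sum(max(s[i][j] for j in S) for i in range(len(s)))

def feature_coverage(S, bags, g=math.sqrt):
    """f_feat.cov.(S) = sum_w g(sum_{i in S} b_{i,w}) with concave g."""
    words = set().union(*(bags[i].keys() for i in S)) if S else set()
    return sum(g(sum(bags[i].get(w, 0) for i in S)) for w in words)

s = [[1.0, 0.2], [0.2, 1.0]]           # toy 2-image similarity matrix
bags = [{"a": 1, "b": 1}, {"a": 3}]    # toy visual-word counts
print(facility_location({0}, s))       # 1.0 + 0.2 = 1.2
print(feature_coverage({0, 1}, bags))  # sqrt(4) + sqrt(1) = 3.0
```

The concavity of g is what gives feature coverage its diminishing returns: once a visual word is well covered, additional occurrences add little.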
In this case, while submodularity is still quite natural, monotonicity sometimes is not.

Penalty-based diversity/dispersion. Given a set S, we penalize similarity within S by summing over all pairs: f_dissim.(S) = −Σ_{i∈S} Σ_{j∈S, j>i} s_{i,j} [28] (a variant, also submodular, takes the form −Σ_{i,j∈S} s_{i,j} [19]). These functions are submodular and monotone decreasing, so when added to other functions they can yield non-monotone submodular functions. Such functions have occurred before in document summarization [19], as a dispersion function [1], and even for scene summarization [28] (in this last case, the submodularity property was not explicitly mentioned).

Diversity reward based on clusters. As in [20], we define a cluster-based function rewarding diversity. Given clusters P₁, P₂, ..., P_k obtained by some clustering algorithm, we define diversity reward functions f_div.reward(S) = Σ_{j=1}^k g(S ∩ P_j), where g(·) is a monotone submodular function, so that f_div.reward(·) is also monotone and submodular. Given a budget, f_div.reward(S) is maximized by selecting S to be as diverse, over different clusters, as possible, because of the diminishing credit when repeatedly choosing items from the same cluster.

5 Visual Words for Evaluation

V-ROUGE (see Section 2.2) depends on a visual "bag-of-words" vocabulary, and a multitude of choices exists for constructing one. Common choices include SIFT descriptors [21], color descriptors [34], raw image patches [7], etc. For encoding, vector quantization (histogram encoding) [4], sparse coding [35], kernel codebook encoding [4], etc. can all be used. For the construction of our V-ROUGE metric, we computed three lexical types and used their union as our visual vocabulary. The different types are intended to capture information about the images at different scales of abstraction.

Color histogram. The goal here is to capture overall image information via color information.
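The cluster-based diversity reward from Section 4.2 can be sketched as follows; here g(S ∩ P_j) is simplified to a concave function of the intersection size (sqrt), which is one common instantiation rather than the paper's general monotone submodular g, and the clusters are hypothetical:

```python
import math

# Sketch of the cluster-based diversity reward f(S) = sum_j g(S ∩ P_j),
# with g simplified to sqrt of the intersection size. Clusters are toy data.
def diversity_reward(S, clusters, g=math.sqrt):
    return sum(g(len(S & P)) for P in clusters)

clusters = [{0, 1}, {2, 3}]
spread = diversity_reward({0, 2}, clusters)  # sqrt(1) + sqrt(1) = 2.0
packed = diversity_reward({0, 1}, clusters)  # sqrt(2) + sqrt(0) ≈ 1.41
print(spread > packed)  # picking across clusters earns more reward
```

The concavity of g is what makes repeated picks from the same cluster earn diminishing credit, pushing a budgeted maximizer to spread the summary across clusters.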
We follow the method proposed in [34]: First, we extract the most frequent colors in RGB color space from the images in an image collection using 10 × 10 pixel patches. Second, these frequent colors are clustered by k-means into 128 clusters, resulting in 128 cluster centers. Finally, we quantize the most frequent colors in every 10 × 10 pixel image patch using nearest-neighbor vector quantization. For every image, the resulting bag-of-colors is normalized to unit ℓ₁-norm.

Super pixels. Here, we wish to capture information about small objects or image regions that are identified by segmentation. Images are first segmented using the quick shift algorithm implemented in VLFeat [33]. For every segment, dense SIFT descriptors are computed and clustered into 200 clusters. Then, a patch-wise intermediate bag of words b_patch is computed by vector quantization, and the RGB color histogram of the corresponding patch c_patch is appended to that set of words. This results in intermediate features φ_patch = [b_patch, c_patch]. These intermediate features are again clustered into 200 clusters. Finally, the intermediate features are vector-quantized according to their ℓ₁-distance. This final bag-of-words representation is normalized to unit ℓ₁-norm.

Deep convolutional neural network. Our deep neural network based words are meant to capture high-level information from the images. We use OverFeat [27], an image recognizer and feature extractor based on a convolutional neural network, for extracting medium- to high-level image features. A sliding window is moved across the input picture such that every image is divided into 10 × 10 blocks (using a 50% overlap), and the pixels within the window are presented to OverFeat as input. The activations on layer 17 are taken as intermediate features φ_k and clustered by k-means into 300 clusters. Then, each φ_k is encoded by kernel codebook encoding [4].
For every image, the resulting bag-of-words representation is normalized to unit ℓ₁-norm.

6 Data Collection

Dataset. One major contribution of our paper is our new data set, which we plan to release publicly soon. Our data set consists of 14 image collections, each comprising 100 images. The image collections are typical real-world personal image collections, as they were, for the most part, taken during holiday trips. For each collection, human-generated summaries were collected using Amazon Mechanical Turk. Workers were asked to select a subset of 10 images from an image collection such that it summarizes the collection in the best possible way.¹ In contrast to previous work on movie summarization [13], Turkers were not tested for their ability to produce high quality summaries. Every Turker was rewarded 10 US cents for every summary.

Pruning of poor human-generated summaries. The summaries collected using Amazon Mechanical Turk differ drastically in quality. For example, some of the collected summaries have low quality because they do not represent an image collection properly, e.g., they consist only of pictures of the same people but no pictures showing, say, architecture. Even though we went through several distinct iterations of summary collection via Amazon Mechanical Turk, improving the quality of our instructions each time, it was impossible to ensure that all individuals produced meaningful summaries. Such low quality summaries can drastically degrade performance of the learning algorithm. We thus developed a strategy to automatically prune away bad summaries, where "bad" is defined as the worst V-ROUGE score relative to a current set of human summaries. The strategy is depicted in Algorithm 1. Each pruning step removes the worst human summary, and then creates a new instance of V-ROUGE using the updated pruned summaries.
Pruning proceeds as long as a significant fraction (greater than a desired "p-value") of null-hypothesis summaries (generated uniformly at random) scores better than the worst human summary. We chose a significance value of p = 0.10.

7 Experiments

To validate our approach, we learned mixtures of submodular functions with 594 component functions using the data set described in Section 6. In this data set, all human-generated reference summaries have size 10, and we evaluated the performance of our learnt mixtures by also producing size-10 summaries.

¹We did not provide explicit instructions on precisely how to summarize an image collection and instead only asked that they choose a representative subset. We relied on their high-level intuitive understanding that the gestalt of the image collection should be preserved in the summary.

Algorithm 1 Algorithm for pruning poor human-generated summaries.
Require: Confidence level p, human summaries S, number of random summaries N
  Sample N size-10 image sets uniformly at random, to be used as summaries R = (R₁, ..., R_N)
  Instantiate V-ROUGE score r_S(·) with summaries S
  o ← (1/|R|) Σ_{R∈R} 1{r_S(R) > min_{S∈S} r_S(S)}   // fraction of random summaries better than worst human
  while o > p do
    S ← S \ (argmin_{S∈S} r_S(S))
    Re-instantiate V-ROUGE score r_S(·) using the updated pruned human summaries S
    Recompute the overlap o as above, but with the updated V-ROUGE score
  end while
  return Pruned human summaries S

Figure 1: Three example 10×10 image collections from our new data set.

The component functions were the monotone submodular functions described in Section 4, using features described in Section 5. For weight optimization, we used AdaGrad [6], an adaptive subgradient method allowing for informative gradient-based learning.
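Algorithm 1 above can be sketched in a few lines. In the version below, `score(S, refs)` stands in for the V-ROUGE score r_S(·) instantiated with the current reference set (re-instantiated on each pass, as in the algorithm); the function name, signature, and the synthetic data are all illustrative:

```python
import random

# Sketch of Algorithm 1: iteratively drop the worst-scoring human summary
# while more than a fraction p of random summaries beats it. The scorer and
# data below are placeholders, not the paper's actual V-ROUGE instantiation.
def prune_summaries(human, score, V, p=0.10, n_random=1000, size=10, seed=0):
    rng = random.Random(seed)
    randoms = [set(rng.sample(sorted(V), size)) for _ in range(n_random)]
    while human:
        worst = min(human, key=lambda S: score(S, human))
        worst_score = score(worst, human)
        # fraction of random summaries scoring better than the worst human one
        o = sum(score(R, human) > worst_score for R in randoms) / n_random
        if o <= p:
            break
        human = [S for S in human if S != worst]
    return human

# Synthetic check with a dummy scorer (sum of indices): the low-scoring
# summary {0, 1} is pruned, the high-scoring one survives.
dummy = lambda S, refs: sum(S)
print(prune_summaries([{100, 101}, {0, 1}], dummy, set(range(10)), size=2))
```

Note that the random summaries are sampled once up front, while the scorer is re-applied with the shrinking reference set on every pass, mirroring the re-instantiation step of Algorithm 1.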
We make 20 passes through the samples in the collection.

We considered two types of experiments: 1) cheating experiments, to verify that our proposed mixture components can effectively learn good scoring functions; and 2) a 14-fold cross-validation experiment, to test our approach in real-world scenarios. In the cheating experiments, training and testing are performed on the same image collection, and this is repeated 14 times. By contrast, for our 14-fold cross-validation experiments, training is performed on 13 out of 14 image collections and testing is performed on the held-out collection, again repeating this 14 times. In both experiment types, since our learnt functions are always monotone submodular, we compute summaries S* of size 10 that approximately maximize the scoring functions using the greedy algorithm. For these summaries, we compute the V-ROUGE score r(S*). For easy score interpretation, we normalize it according to sc(S*) = (r(S*) − R)/(H − R), where R is the average V-ROUGE score of random summaries (computed from 1000 summaries) and where H is the average V-ROUGE score of the collected final pruned human summaries. The result sc(S*) is smaller than zero if S* scores worse than the average random summary and larger than one if it scores better than the average human summary.

The best cheating results are shown as Cheat in Table 1, learnt using 1-V-ROUGE as a loss. The results in column Min are computed by constrained minimization of V-ROUGE via the methods of [11], and the results in column Max are computed by maximizing V-ROUGE using the greedy algorithm. Therefore, the Max column is an approximate upper bound on our achievable performance.
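The score normalization defined above is a simple affine rescaling; a one-line sketch:

```python
# The reported normalization sc(S*) = (r(S*) - R) / (H - R): values below 0
# are worse than the average random summary, values above 1 beat the average
# human summary. Inputs here are illustrative numbers, not the paper's.
def normalized_score(r_star, r_random_mean, r_human_mean):
    return (r_star - r_random_mean) / (r_human_mean - r_random_mean)

print(normalized_score(0.9, 0.5, 0.9))  # matches average human: 1.0
print(normalized_score(0.5, 0.5, 0.9))  # matches average random: 0.0
```

This makes scores comparable across collections whose raw V-ROUGE ranges differ.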
Clearly, we are able to learn good scoring functions, as on average we significantly exceed average human performance: we achieve an average score of 1.42, while the average human score is 1.00.

Results for the cross-validation experiments are presented in Table 1. In the columns Our Methods we present the performance of our mixtures learnt using the proposed loss functions described in Section 3. We also present a set of baseline comparisons, using similarity scores computed via a histogram intersection [32] method over the visual words used in the construction of V-ROUGE. We present baseline results for the following schemes:

FL: the facility location objective f_fac.loc.(S) alone;
FLpen: the facility location objective mixed with a λ-weighted penalty, i.e., f_fac.loc.(S) + λf_dissim.(S);
MMR: maximal marginal relevance [3], using λ to trade off between relevance and diversity;
GCpen: graph cut mixed with a λ-weighted penalty, analogous to FLpen but with graph cut in place of facility location;
kM: k-medoids clustering [9, Algorithm 14.2], with initial cluster centers selected uniformly at random. As the dissimilarity score between images i and j, we used 1 − s_ij. Clustering was run 20 times, and we used the cluster centers of the best clustering as the summary.

In each of the above cases where a λ weight is used, we take for each image collection the λ ∈ {0, 0.1, 0.2, ..., 0.9, 1.0} that produced a submodular function that, when maximized, gave the best average V-ROUGE score on the 13 training image sets. This approach therefore selects the best possible baseline under a grid search on the training sets. Note that both λ-dependent functions, i.e., FLpen and GCpen, are non-monotone submodular.
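A minimal sketch of the penalized facility-location baseline (the similarity matrix and the particular form of the dissimilarity penalty are illustrative assumptions; the exact component definitions are given in Section 4):

```python
import numpy as np

# toy similarity matrix s_ij between three images (illustrative values)
sim = np.array([[1.0, 0.2, 0.1],
                [0.2, 1.0, 0.3],
                [0.1, 0.3, 1.0]])

def facility_location(sim, S):
    """f_fac.loc.(S): each image in the collection is credited with its
    similarity to the most similar summary element (coverage)."""
    return float(sim[:, S].max(axis=1).sum()) if S else 0.0

def dissim_penalty(sim, S):
    """A redundancy penalty: minus the total pairwise similarity inside
    the summary (one plausible form of f_dissim.). Submodular but
    non-monotone, since adding redundant elements lowers the score."""
    return -sum(sim[i, j] for k, i in enumerate(S) for j in S[k + 1:])

def fl_pen(sim, S, lam):
    """The FLpen baseline: coverage plus a lambda-weighted penalty;
    non-monotone submodular for lam > 0."""
    return facility_location(sim, S) + lam * dissim_penalty(sim, S)

# lambda is chosen per collection from the grid {0, 0.1, ..., 1.0}
lam_grid = [l / 10 for l in range(11)]
```

Because the penalized mixtures are non-monotone, the plain greedy algorithm's (1 − 1/e) guarantee no longer applies, which motivates the randomized greedy algorithm used for their maximization.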
Therefore, we used the randomized greedy algorithm [2] for maximization, which carries an approximation guarantee for non-monotone submodular functions (we ran the algorithm 10 times and used the best result).

Table 1 shows that using 1-V-ROUGE as the loss significantly outperforms the other methods. Furthermore, the performance is on average better than human performance, i.e., we achieve an average score of 1.13 while the average human score is 1.00. This indicates that we can efficiently learn scoring functions suitable for image collection summarization. For the other two losses, i.e., surrogate and complement V-ROUGE, performance is significantly worse. In this case it thus seems advantageous to use the proper (supermodular) loss together with heuristic optimization (the submodular-supermodular procedure [24, 10]) for loss-augmented inference during training, rather than an approximate (submodular or modular) loss combined with a loss-augmented inference algorithm that has strong guarantees. This could, however, perhaps be circumvented by constructing a more accurate, strictly submodular surrogate loss, but we leave this to future work.

Table 1: Cross-validation experiments (see text for details). Average human performance is 1.00; average random performance is 0.00. For each image collection, the best result achieved by any of Our Methods and by any of the Baseline Methods is marked with an asterisk (*).

        Limits                 Our Methods             Baseline Methods
No.    Min    Max  Cheat   ℓ1−R     ℓc  ℓsurr     FL  FLpen    MMR  GCpen     kM
 1   -2.55   2.78   1.71  1.51*   0.87  -0.36  1.45*   1.06   0.82  -0.51   1.23
 2   -2.06   2.22   1.38  1.27*   1.26   0.44  0.18    0.21   0.58   0.65   0.89*
 3   -2.07   2.24   1.64  1.46*   0.95   0.23  0.47   -0.53   0.94*  0.85   0.52
 4   -3.20   2.04   1.42  1.04*   0.81  -0.18  0.71   -0.02   1.01   0.51   1.32*
 5   -1.65   1.92   1.60  1.11*   1.06   0.58  0.96*  -1.28   0.93   0.95   0.70
 6   -2.83   2.40   1.81  1.47*   0.65   0.27  1.26*   0.20   1.16  -0.08   1.05
 7   -2.44   2.07   1.07  1.07*   0.96   0.15  0.93   -0.84   0.70  -0.33   0.97*
 8   -1.66   2.04   1.45  1.13*   0.96   0.07  0.62   -1.27   0.38   0.57   0.91*
 9   -2.32   2.59   1.73  1.21*   1.13   0.51  0.81   -0.59   0.94*  0.09   0.38
10   -1.46   2.34   1.39  1.06*   0.78   0.14  1.58*   0.07   0.99  -0.26   0.73
11   -1.55   1.85   1.22  0.95*   0.92  -0.08  0.43    0.05   0.56* -0.29   0.26
12   -1.74   2.39   1.57  1.11*   0.58   0.12  0.78*  -0.01   0.54   0.02   0.63
13   -0.94   1.72   0.77  0.32    0.53*  0.14  0.02   -0.04  -0.06   0.52*  0.02
14   -1.46   1.75   1.07  1.08*   0.97   0.77  0.23   -0.80   0.14   0.22   0.29*
Avg. -2.00   2.17   1.42  1.13    0.89   0.20  0.75   -0.27   0.69   0.21   0.71

8 Conclusions and Future Work

We have considered the task of automated summarization of image collections. We presented a new data set together with many human-generated ground truth summaries, and introduced a novel automated evaluation metric called V-ROUGE. Based on large-margin structured prediction, and either submodular or non-submodular optimization, we proposed a method for learning scoring functions for image collection summarization and demonstrated its empirical effectiveness. In future work, we would like to scale our methods to much larger image collections. A key step in this direction is to consider low-complexity and highly scalable classes of submodular functions. Another challenge for larger image collections is how to collect ground truth, as it would be difficult for a human to summarize a collection of, say, 10,000 images.

Acknowledgments: This material is based upon work supported by the National Science Foundation under Grant No. (IIS-1162606), the Austrian Science Fund under Grant No.
(P25244-N15), a Google and a Microsoft award, and by the Intel Science and Technology Center for Pervasive Computing. Rishabh Iyer is also supported by a Microsoft Research Fellowship award.

References

[1] A. Borodin, H. C. Lee, and Y. Ye. Max-sum diversification, monotone submodular functions and dynamic updates. In Proc. of the 31st Symposium on Principles of Database Systems, pages 155–166. ACM, 2012.
[2] N. Buchbinder, M. Feldman, J. Naor, and R. Schwartz. Submodular maximization with cardinality constraints. In SODA, 2014.
[3] J. Carbonell and J. Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Research and Development in Information Retrieval, pages 335–336, 1998.
[4] K. Chatfield, V. Lempitsky, A. Vedaldi, and A. Zisserman. The devil is in the details: an evaluation of recent feature encoding methods. In British Machine Vision Conference (BMVC), 2011.
[5] T. Denton, M. Demirci, J. Abrahamson, A. Shokoufandeh, and S. Dickinson. Selecting canonical views for view-based 3-D object recognition. In ICPR, volume 2, pages 273–276, Aug 2004.
[6] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research (JMLR), 12:2121–2159, July 2011.
[7] R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised scale-invariant learning. In CVPR, volume 2, pages 264–271. IEEE, June 2003.
[8] Y. Hadi, F. Essannouni, and R. O. H. Thami. Video summarization by k-medoid clustering. In Symposium on Applied Computing (SAC), pages 1400–1401, 2006.
[9] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning, volume 2. Springer New York, 2nd edition, 2009.
[10] R. Iyer and J. Bilmes. Algorithms for approximate minimization of the difference between submodular functions, with applications.
In UAI, 2012.
[11] R. Iyer, S. Jegelka, and J. Bilmes. Fast semidifferential-based submodular function optimization. In ICML, 2013.
[12] A. Jaffe, M. Naaman, T. Tassa, and M. Davis. Generating summaries and visualization for large collections of geo-referenced photographs. In MIR, pages 89–98. ACM, 2006.
[13] A. Khosla, R. Hamid, C.-J. Lin, and N. Sundaresan. Large-scale video summarization using web-image priors. In Conference on Computer Vision and Pattern Recognition (CVPR), June 2013.
[14] K. Kirchhoff and J. Bilmes. Submodularity for data selection in machine translation. In Empirical Methods in Natural Language Processing (EMNLP), October 2014.
[15] A. Kulesza and B. Taskar. k-DPPs: Fixed-size determinantal point processes. In Proceedings of the 28th International Conference on Machine Learning, 2011.
[16] X. Li, L. Chen, L. Zhang, F. Lin, and W.-Y. Ma. Image annotation by large-scale content-based image retrieval. In Proc. 14th Ann. ACM Int. Conf. on Multimedia, pages 607–610. ACM, 2006.
[17] C.-Y. Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 74–81, July 2004.
[18] H. Lin and J. Bilmes. Multi-document summarization via budgeted maximization of submodular functions. In NAACL, 2010.
[19] H. Lin and J. Bilmes. A class of submodular functions for document summarization. In ACL/HLT-2011, Portland, OR, June 2011.
[20] H. Lin and J. Bilmes. Learning mixtures of submodular shells with application to document summarization. In Conference on Uncertainty in Artificial Intelligence (UAI), pages 479–490, 2012.
[21] D. Lowe. Object recognition from local scale-invariant features. In International Conference on Computer Vision (ICCV), volume 2, pages 1150–1157, 1999.
[22] M. Meeker and L. Wu. Internet trends. Technical report, Kleiner Perkins Caufield & Byers, 2013.
[23] M.
Minoux. Accelerated greedy algorithms for maximizing submodular set functions. Optimization Techniques, pages 234–243, 1978.
[24] M. Narasimhan and J. Bilmes. A submodular-supermodular procedure with applications to discriminative structure learning. In Conference on Uncertainty in Artificial Intelligence (UAI), Edinburgh, Scotland, July 2005. Morgan Kaufmann Publishers.
[25] G. Nemhauser, L. Wolsey, and M. Fisher. An analysis of approximations for maximizing submodular set functions–I. Mathematical Programming, 14(1):265–294, 1978.
[26] C.-W. Ngo, Y.-F. Ma, and H. Zhang. Automatic video summarization by graph modeling. In International Conference on Computer Vision (ICCV), volume 1, pages 104–109, Oct 2003.
[27] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. ArXiv e-prints, Dec. 2013.
[28] I. Simon, N. Snavely, and S. M. Seitz. Scene summarization for online image collections. In ICCV, 2007.
[29] P. Sinha and R. Jain. Extractive summarization of personal photos from life events. In International Conference on Multimedia and Expo (ICME), pages 1–6, 2011.
[30] P. Sinha, S. Mehrotra, and R. Jain. Summarization of personal photologs using multidimensional content and context. In International Conference on Multimedia Retrieval (ICMR), pages 1–8, 2011.
[31] J. Sivic, B. Russell, A. Efros, A. Zisserman, and W. Freeman. Discovering objects and their location in images. In International Conference on Computer Vision (ICCV), volume 1, pages 370–377, Oct 2005.
[32] M. J. Swain and D. H. Ballard. Color indexing. Int. Journal of Computer Vision, 7(1):11–32, 1991.
[33] A. Vedaldi and B. Fulkerson. VLFeat: An open and portable library of computer vision algorithms. http://www.vlfeat.org/, 2008.
[34] C. Wengert, M. Douze, and H. Jégou.
Bag-of-colors for improved image search. In International Conference on Multimedia (MM), pages 1437–1440. ACM, 2011.
[35] J. Yang, K. Yu, Y. Gong, and T. S. Huang. Linear spatial pyramid matching using sparse coding for image classification. In CVPR, pages 1794–1801, 2009.