{"title": "Diverse Sequential Subset Selection for Supervised Video Summarization", "book": "Advances in Neural Information Processing Systems", "page_first": 2069, "page_last": 2077, "abstract": "Video summarization is a challenging problem with great application potential. Whereas prior approaches, largely unsupervised in nature, focus on sampling useful frames and assembling them as summaries, we consider video summarization as a supervised subset selection problem. Our idea is to teach the system to learn from human-created summaries how to select informative and diverse subsets, so as to best meet evaluation metrics derived from human-perceived quality. To this end, we propose the sequential determinantal point process (seqDPP), a probabilistic model for diverse sequential subset selection. Our novel seqDPP heeds the inherent sequential structures in video data, thus overcoming the deficiency of the standard DPP, which treats video frames as randomly permutable items. Meanwhile, seqDPP retains the power of modeling diverse subsets, essential for summarization. 
Our extensive results of summarizing videos from 3 datasets demonstrate the superior performance of our method, compared to not only existing unsupervised methods but also naive applications of the standard DPP model.", "full_text": "Diverse Sequential Subset Selection for Supervised Video Summarization\n\nBoqing Gong\u2217, Department of Computer Science, University of Southern California, Los Angeles, CA 90089, boqinggo@usc.edu\n\nKristen Grauman, Department of Computer Science, University of Texas at Austin, Austin, TX 78701, grauman@cs.utexas.edu\n\nWei-Lun Chao\u2217, Department of Computer Science, University of Southern California, Los Angeles, CA 90089, weilunc@usc.edu\n\nFei Sha, Department of Computer Science, University of Southern California, Los Angeles, CA 90089, feisha@usc.edu\n\nAbstract\n\nVideo summarization is a challenging problem with great application potential. Whereas prior approaches, largely unsupervised in nature, focus on sampling useful frames and assembling them as summaries, we consider video summarization as a supervised subset selection problem. Our idea is to teach the system to learn from human-created summaries how to select informative and diverse subsets, so as to best meet evaluation metrics derived from human-perceived quality. To this end, we propose the sequential determinantal point process (seqDPP), a probabilistic model for diverse sequential subset selection. Our novel seqDPP heeds the inherent sequential structures in video data, thus overcoming the deficiency of the standard DPP, which treats video frames as randomly permutable items. Meanwhile, seqDPP retains the power of modeling diverse subsets, essential for summarization. 
Our extensive results of summarizing videos from 3 datasets demonstrate the superior performance of our method, compared to not only existing unsupervised methods but also naive applications of the standard DPP model.\n\n1 Introduction\n\nIt is an impressive yet alarming fact that there is far more video being captured\u2014by consumers, scientists, defense analysts, and others\u2014than can ever be watched or browsed efficiently. For example, 144,000 hours of video are uploaded to YouTube daily; lifeloggers with wearable cameras amass gigabytes of video daily; 422,000 CCTV cameras perched around London survey happenings in the city 24/7. With this explosion of video data comes an ever-pressing need to develop automatic video summarization algorithms. By taking a long video as input and producing a short video (or keyframe sequence) as output, video summarization has great potential to rein in raw video and make it substantially more browseable and searchable.\n\nVideo summarization methods often pose the problem in terms of subset selection: among all the frames (subshots) in the video, which key frames (subshots) should be kept in the output summary? There is a rich literature in computer vision and multimedia developing a variety of ways to answer this question [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]. Existing techniques explore a plethora of properties that a good summary should capture, designing criteria that the algorithm should prioritize when deciding which subset of frames (or subshots) to select. \n\n\u2217Equal contribution\n\n
These include representativeness (the frames should depict the main contents of the videos) [1, 2, 10], diversity (they should not be redundant) [4, 11], interestingness (they should have salient motion/appearance [2, 3, 6] or trackable objects [5, 12, 7]), or importance (they should contain important objects that drive the visual narrative) [8, 9].\n\nDespite valuable progress in developing the desirable properties of a summary, prior approaches are impeded by their unsupervised nature. Typically the selection algorithm favors extracting content that satisfies criteria like the above (diversity, importance, etc.), and performs some sort of frame clustering to discover events. Unfortunately, this often requires some hand-crafting to combine the criteria effectively. After all, the success of a summary ultimately depends on human perception. Furthermore, due to the large number of possible subsets that could be selected, it is difficult to directly optimize the criteria jointly on the selected frames as a subset; instead, sampling methods that identify independently useful frames (or subshots) are common.\n\nTo address these limitations, we propose to consider video summarization as a supervised subset selection problem. The main idea is to use examples of human-created summaries\u2014together with their original source videos\u2014to teach the system how to select informative subsets. In doing so, we can escape the hand-crafting often necessary for summarization, and instead directly optimize the (learned) factors that best meet evaluation metrics derived from human-perceived quality. Furthermore, rather than independently select \u201chigh scoring\u201d frames, we aim to capture the interlocked dependencies between a given frame and all others that could be chosen.\n\nTo this end, we propose the sequential determinantal point process (seqDPP), a new probabilistic model for sequential and diverse subset selection. 
The determinantal point process (DPP) has recently emerged as a powerful method for selecting a diverse subset from a \u201cground set\u201d of items [13], with applications including document summarization [14] and information retrieval [15]. However, existing DPP techniques have a fatal modeling flaw if applied to video (or documents) for summarization: they fail to capture the data\u2019s inherent sequential nature. That is, a standard DPP treats the inputs as bags of randomly permutable items, agnostic to any temporal structure. Our novel seqDPP overcomes this deficiency, making it possible to faithfully represent the temporal dependencies in video data. At the same time, it lets us pose summarization as a supervised learning problem.\n\nWhile learning how to summarize from examples sounds appealing, why should it be possible\u2014particularly if the input videos are expected to vary substantially in their subject matter?1 Unlike more familiar supervised visual recognition tasks, where test data can be reasonably expected to look like the training instances, a supervised approach to video summarization must be able to learn generic properties that transcend the specific content of the training set. For example, the learner can recover a \u201cmeta-cue\u201d for representativeness, if the input features record profiles of the similarity between a frame and its increasingly distant neighbor frames. Similarly, category-independent cues about an object\u2019s placement in the frame, the camera person\u2019s active manipulation of viewpoint/zoom, etc., could play a role. 
In any such case, we can expect the learning algorithm to focus on those meta-cues that are shared by the human-selected frames in the training set, even though the subject matter of the videos may differ.\n\nIn short, our main contributions are: a novel learning model (seqDPP) for selecting diverse subsets from a sequence, its application to video summarization (the model is applicable to other sequential data as well), an extensive empirical study with three benchmark datasets, and a successful first-step/proof-of-concept towards using human-created video summaries for learning to select subsets. The rest of the paper is organized as follows. In section 2, we review DPP and its application to document summarization. In section 3, we describe our seqDPP method, followed by a discussion of related work in section 4. We report results in section 5, then conclude in section 6.\n\n1After all, not all videos on YouTube are about cats.\n\n2 Determinantal point process (DPP)\n\nThe DPP was first used to characterize the Pauli exclusion principle, which states that two identical particles cannot occupy the same quantum state simultaneously [16]. The notion of exclusion has made DPP an appealing tool for modeling diversity in applications such as document summarization [14, 13] and image search and ranking [17]. In what follows, we give a brief account of DPP and how to apply it to document summarization, where the goal is to generate a summary by selecting several sentences from a long document [18, 19]. The selected sentences should be not only diverse (i.e., different), to reduce the redundancy in the summary, but also representative of the document.\n\nBackground Let Y = {1, 2, \u00b7\u00b7\u00b7 , N} be a ground set of N items (e.g., sentences). In its simplest form, a DPP defines a discrete probability distribution over all the 2^N subsets of Y. Let Y denote the random variable of selecting a subset. 
Y is then distributed according to\n\nP (Y = y) = det(Ly) / det(L + I),   (1)\n\nfor y \u2286 Y. The kernel L \u2208 S_+^{N\u00d7N} is the DPP\u2019s parameter and is constrained to be positive semidefinite. I is the identity matrix. Ly is the principal minor (sub-matrix) with rows and columns selected according to the indices in y. The determinant function det(\u00b7) gives rise to the interesting property of pairwise repulsion. To see that, consider selecting a subset of two items i and j. We have\n\nP (Y = {i, j}) \u221d LiiLjj \u2212 Lij^2.   (2)\n\nIf the items i and j are the same, then P (Y = {i, j}) = 0 (because Lij = Lii = Ljj). Namely, identical items should not appear together in the same set. A more general case also holds: if i and j are similar to each other, then the probability of observing i and j in a subset together is going to be less than that of observing either one of them alone (see the excellent tutorial [13] for details). The most diverse subset of Y is thus the one that attains the highest probability\n\ny\u2217 = arg maxy P (Y = y) = arg maxy det(Ly),   (3)\n\nwhere y\u2217 results from MAP inference. This is an NP-hard combinatorial optimization problem. However, there are several approaches to obtaining approximate solutions [13, 20].\n\nLearning DPPs for document summarization Suppose we model selecting a subset of sentences as a DPP over all sentences in a document. We are given a set of training samples in the form of documents (i.e., ground sets) and the ground-truth summaries. How can we discover the underlying parameter L so as to use it for generating summaries for new documents? Note that the new documents will likely have sentences that have not been seen before in the training samples. Thus, the kernel matrix L needs to be reparameterized in order to generalize to unseen documents. 
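As a concrete illustration of eq. (1) and the repulsion property in eq. (2), here is a minimal numpy sketch; the toy kernel below is illustrative only, not the paper's learned kernel:

```python
import numpy as np

def dpp_prob(L, y):
    """P(Y = y) = det(L_y) / det(L + I) for a DPP with kernel L."""
    Ly = L[np.ix_(y, y)]
    num = np.linalg.det(Ly) if len(y) else 1.0  # det of the empty minor is 1
    return num / np.linalg.det(L + np.eye(L.shape[0]))

# Toy kernel: items 0 and 1 have identical feature vectors, item 2 is distinct.
phi = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
L = phi @ phi.T
p_dup = dpp_prob(L, [0, 1])  # identical pair: L00*L11 - L01^2 = 0
p_mix = dpp_prob(L, [0, 2])  # diverse pair: strictly positive
```

The det(L + I) denominator normalizes the distribution, so the probabilities over all 2^N subsets sum to 1.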
[14] proposed a special reparameterization called the quality/diversity decomposition:\n\nLij = qi \u03c6i^T \u03c6j qj,   qi = exp( (1/2) \u03b8^T xi ),   (4)\n\nwhere \u03c6i is the normalized TF-IDF vector of sentence i, so that \u03c6i^T \u03c6j computes the cosine angle between two sentences. The \u201cquality\u201d feature vector xi encodes the contextual information about i and its representativeness of other items. In document summarization, xi are the sentence lengths, positions of the sentences in the texts, and other meta cues. The parameter \u03b8 is then optimized with maximum likelihood estimation (MLE) such that the target subsets have the highest probabilities\n\n\u03b8\u2217 = arg max\u03b8 \u2211n log P (Y = yn\u2217; Ln(\u03b8)),   (5)\n\nwhere Ln is the L matrix formulated using sentences in the n-th ground set, and yn\u2217 is the corresponding ground-truth summary.\n\nDespite its success in document summarization [14], a direct application of DPP to video summarization is problematic. The DPP model is agnostic about the order of the items. For video (and to a large degree, text data), it does not respect the inherent sequential structures. The second limitation is that the quality-diversity decomposition, while cleverly leading to a convex optimization, limits the power of modeling complex dependencies among items. Specifically, only the quality factor qi is optimized on the training data. We develop new approaches to overcoming those limitations.\n\n3 Approach\n\nIn what follows, we describe our approach for video summarization. 
Our approach contains three components: (1) a preparatory yet crucial step that generates ground-truth summaries from multiple human-created ones (section 3.1); (2) a new probabilistic model\u2014the sequential determinantal point process (seqDPP)\u2014that models the process of sequentially selecting diverse subsets (section 3.2); (3) a novel way of re-parameterizing seqDPP that enables learning more flexible and powerful representations for subset selection from standard visual and contextual features (section 3.3).\n\nFigure 1: The agreement among human-created summaries is high, as is the agreement between the oracle summary generated by our algorithm (cf. section 3.1) and human annotations.\n\n3.1 Generating ground-truth summaries\n\nThe first challenge we need to address is what to provide to our learning algorithm as ground-truth summaries. In many video datasets, each video is annotated (manually summarized) by multiple human users. While the users were often well instructed on the annotation task, discrepancies are expected due to many uncontrollable individual factors such as whether the person was attentive, idiosyncratic viewing preferences, etc. There are some studies on how to evaluate automatically generated summaries in the presence of multiple human-created annotations [21, 22, 23]. However, for learning, our goal is to generate one single ground-truth or \u201coracle\u201d summary per video. Our main idea is to synthesize the oracle summary that maximally agrees with all annotators. Our hypothesis is that despite the discrepancies, those summaries nonetheless share the common traits of reflecting the subject matter in the video. 
These commonalities, to be discovered by our synthesis algorithm, will provide strong enough signals for our learning algorithm to be successful.\n\nTo begin with, we first describe a few metrics for quantifying the agreement in the simplest setting where there are only two summaries. These metrics will also be used later in our empirical studies to evaluate various summarization methods. Using those metrics, we then analyze the consistency of human-created summaries in two video datasets to validate our hypothesis. Finally, we present our algorithm for synthesizing one single oracle summary per video.\n\nEvaluation metrics Given two video summaries A and B, we measure how much they are in agreement by first matching their frames, as they might be of different lengths. Following [24], we compute the pairwise distances between all frames across the two summaries. Two frames are then \u201cmatched\u201d if their visual difference is below some threshold; a frame is constrained to appear in the matched pairs at most once. After the matching, we compute the following metrics (commonly known as Precision, Recall, and F-score):\n\nPAB = (#matched frames) / (#frames in A),   RAB = (#matched frames) / (#frames in B),   FAB = (PAB \u00b7 RAB) / (0.5 (PAB + RAB)).\n\nAll of them lie between 0 and 1, and higher values indicate better agreement between A and B. Note that these metrics are not symmetric \u2013 if we swap A and B, the results will be different. Our idea for examining the consistency among all summaries is to treat each summary in turn as if it were the gold standard (and assign it as B) while treating the other summaries as A\u2019s. We report our analysis of existing video datasets next.\n\nConsistency in existing video databases We analyze video summaries in two video datasets: 50 videos from the Open Video Project (OVP) [25] and another 50 videos from Youtube [24]. Details about these two video datasets are in section 5. 
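The metric computation above can be sketched as follows; the greedy nearest-pair matching under a distance threshold is an assumption on our part (the exact matching procedure of [24] may differ):

```python
import numpy as np

def match_count(D, thresh):
    """Greedily match the closest frame pairs whose distance is below
    `thresh`; each frame participates in at most one matched pair."""
    used_a, used_b, m = set(), set(), 0
    pairs = sorted(((i, j) for i in range(D.shape[0]) for j in range(D.shape[1])),
                   key=lambda p: D[p])
    for i, j in pairs:
        if D[i, j] <= thresh and i not in used_a and j not in used_b:
            used_a.add(i); used_b.add(j); m += 1
    return m

def prf(D, thresh):
    """P, R, F between summaries A (rows of D) and B (columns of D)."""
    m = match_count(D, thresh)
    P, R = m / D.shape[0], m / D.shape[1]
    F = P * R / (0.5 * (P + R)) if m else 0.0  # equals the harmonic mean 2PR/(P+R)
    return P, R, F
```

For example, a 3-frame summary A matching 2 of the 2 frames of B gives P = 2/3, R = 1, and F = 0.8.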
We briefly point out that the two datasets have very different subject matters and composition styles. Each of the 100 videos has 5 annotated summaries. For each video, we compute the pairwise evaluation metrics in precision, recall, and F-score by forming in total 20 (ordered) pairs of summaries from two different annotators. We then average them per video. We plot how these averaged metrics distribute in Fig. 1. The plots show the number of videos (out of 100) whose averaged metrics exceed certain thresholds, marked on the horizontal axes. For example, more than 80% of the videos have an averaged F-score greater than 0.6, and 60% greater than 0.7. Note that there are many videos (\u224820) with averaged F-scores greater than 0.8, indicating that on average, human-created summaries have a high degree of agreement. Note that the mean values of the averaged metrics per video are also high.\n\nGreedy algorithm for synthesizing an oracle summary Encouraged by our findings, we develop a greedy algorithm for synthesizing one oracle summary per video from multiple human-created ones. This algorithm is adapted from a similar one for document summarization [14]. Specifically, for each video, we initialize the oracle summary with the empty set y\u2217 = \u2205. Iteratively, we then add to y\u2217 one frame i at a time from the video sequence:\n\ny\u2217 \u2190 y\u2217 \u222a arg maxi \u2211u Fy\u2217\u222ai,yu .   (6)\n\nIn words, the frame i is selected to maximally increase the F-score between the new oracle summary and the human-created summaries yu. To avoid adding all frames in the video sequence, we stop the greedy process as soon as there is no frame that can increase the F-score. We measure the quality of the synthesized oracle summaries by computing their mean agreement with the human annotations. The results are shown in Fig. 1 too. 
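The greedy synthesis of eq. (6) can be sketched as below; for simplicity this sketch scores summaries by an exact-match F-score on frame-index sets, a simplification of the threshold-based matching described above:

```python
def set_fscore(a, b):
    """F-score between two frame-index sets under exact matching."""
    m = len(a & b)
    if m == 0:
        return 0.0
    P, R = m / len(a), m / len(b)
    return 2 * P * R / (P + R)

def oracle_summary(candidates, user_summaries):
    """Greedily grow y* by the frame that most increases the total F-score
    against all user summaries; stop when no frame gives an increase."""
    y, current = set(), 0.0
    while True:
        best, best_val = None, current
        for i in candidates - y:
            val = sum(set_fscore(y | {i}, u) for u in user_summaries)
            if val > best_val:
                best, best_val = i, val
        if best is None:
            return sorted(y)
        y.add(best)
        current = best_val
```

For example, with user summaries {1, 5} and {1, 7}, the oracle is [1, 5, 7]: frame 1 is added first (it appears in both), then 5 and 7, and no further frame helps.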
The quality is high: more than 90% of the oracle summaries agree well with the other summaries, with an F-score greater than 0.6. In what follows, we will treat the oracle summaries as ground truth to inform our learning algorithms.\n\n3.2 Sequential determinantal point processes (seqDPP)\n\nThe determinantal point process, as described in section 2, is a powerful tool for modeling diverse subset selection. However, video frames are more than items in a set. In particular, in DPP, the ground set is a bag \u2013 items are randomly permutable such that the most diverse subset remains unchanged. Translating this into video summarization, this modeling property essentially suggests that we could randomly shuffle video frames and expect to get the same summary!\n\nTo address this serious deficiency, we propose sequential DPP, a new probabilistic model that introduces strong dependency structures between items. As a motivating example, consider a video portraying the sequence of someone leaving home for school, coming back home for lunch, leaving for the market, and coming back for dinner. If only visual appearance cues are available, a vanilla DPP model will likely select only one frame from the home scene and repel other frames occurring at the home. Our model, on the other hand, will recognize that the temporal span implies those frames are still diverse despite their visual similarity. Thus, our modeling intuition is that diversity should be a weaker prior for temporally distant frames but ought to act more strongly for closely neighboring frames. We now explain how our seqDPP method implements this intuition.\n\nModel definition Given a ground set (a long video sequence) Y, we partition it into T disjoint yet consecutive short segments \u222a_{t=1}^T Yt = Y. At time t, we introduce a subset selection variable Yt. We impose a DPP over two neighboring segments where the ground set is Ut = Yt \u222a yt\u22121, i.e., the union between the video segment and the subset selected in the immediate past. Let \u2126t denote the L-matrix defined over the ground set Ut. The conditional distribution of Yt is thus given by\n\nP (Yt = yt | Yt\u22121 = yt\u22121) = det(\u2126 yt\u22121\u222ayt) / det(\u2126t + It).   (7)\n\nAs before, the subscript yt\u22121 \u222a yt selects the corresponding rows and columns from \u2126t. It is a diagonal matrix of the same size as Ut; the elements corresponding to yt\u22121 are zeros and the elements corresponding to Yt are ones (see [13] for details). Readers who are familiar with DPP might recognize that this conditional distribution is also a DPP, restricted to the ground set Yt.\n\nThe conditional probability is defined in such a way that at time t, the subset selected should be diverse within Yt as well as diverse from the previously selected yt\u22121. However, beyond those two priors, the subset is not constrained by subsets selected in the distant past. Fig. 2 illustrates the idea in graphical model notation. In particular, the joint distribution of all subsets is factorized as\n\nP (Y1 = y1, Y2 = y2, \u00b7\u00b7\u00b7 , YT = yT ) = P (Y1 = y1) \u220f_{t=2}^T P (Yt = yt | Yt\u22121 = yt\u22121).   (8)\n\nInference and learning The MAP inference for the seqDPP model eq. (8) is as hard as for the standard DPP model. Thus, we propose to use the following online inference, analogous to Bayesian belief updates (as in Kalman filtering):\n\ny1\u2217 = arg max_{y\u2286Y1} P (Y1 = y),\ny2\u2217 = arg max_{y\u2286Y2} P (Y2 = y | Y1 = y1\u2217),\n\u00b7\u00b7\u00b7\nyt\u2217 = arg max_{y\u2286Yt} P (Yt = y | Yt\u22121 = yt\u22121\u2217),\n\u00b7\u00b7\u00b7\n\nFigure 2: Our sequential DPP for modeling sequential video data, drawn as a Bayesian network.\n\nNote that, at each step, the ground set could be quite small; thus an exhaustive search for the most diverse subset is feasible. The parameter learning is similar to the standard DPP model. We describe the details in the supplementary material.\n\n3.3 Learning representations for diverse subset selection\n\nAs described previously, the kernel L of DPP hinges on the reparameterization with features of the items that can generalize across different ground sets. The quality-diversity decomposition in eq. (4), while elegantly leading to convex optimization, is severely limited in its power to model complex items and dependencies among them. In particular, learning the subset selection rests solely on learning the quality factor, as the diversity component remains handcrafted and fixed. We overcome this deficiency with more flexible and powerful representations. Concretely, let fi stand for the feature representation for item (frame) i, including both low-level visual cues and meta-cues such as contextual information. 
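A minimal sketch of this online inference: each small segment's subsets are enumerated exhaustively, and the conditional of eq. (7) is maximized given the previous selection (within a step the denominator det(\u2126t + It) is constant, so maximizing det(\u2126_{yt\u22121\u222ay}) suffices). The toy features and the Gram-matrix kernel with a small diagonal ridge are illustrative assumptions, a stand-in for the learned kernels of section 3.3:

```python
import numpy as np
from itertools import chain, combinations

def all_subsets(items):
    return chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))

def online_map(segments, feats, ridge=0.1):
    """Sequentially pick, for each segment Y_t, the subset maximizing
    det(Omega restricted to y_{t-1} union y) -- the numerator of eq. (7)."""
    selected, prev = [], []
    for seg in segments:
        ground = list(prev) + list(seg)
        F = feats[ground]
        Omega = F @ F.T + ridge * np.eye(len(ground))  # toy kernel: Gram + ridge
        idx = {g: k for k, g in enumerate(ground)}
        best, best_det = (), -np.inf
        for sub in all_subsets(seg):
            keep = [idx[g] for g in list(prev) + list(sub)]
            d = np.linalg.det(Omega[np.ix_(keep, keep)]) if keep else 1.0
            if d > best_det:
                best, best_det = sub, d
        selected.extend(best)
        prev = list(best)
    return selected

# Frames 0 and 1 are visually identical; frames 2 and 3 are partly similar.
feats = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
summary = online_map([[0, 1], [2, 3]], feats)
```

On this toy input the inference keeps one of the duplicate frames from the first segment and the most distinct frame from the second.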
We reparameterize the L matrix with fi in two ways.\n\nLinear embeddings The simplest way is to linearly transform the original features:\n\nLij = fi^T W^T W fj,   (9)\n\nwhere W is the transformation matrix.\n\nNonlinear hidden representation We use a one-hidden-layer neural network to infer a hidden representation for fi:\n\nLij = zi^T W^T W zj,   where zi = tanh(U fi),   (10)\n\nand tanh(\u00b7) stands for the hyperbolic tangent transfer function. To learn the parameters W (or U and W ), we use maximum likelihood estimation (cf. eq. (5)), optimized with gradient descent. Details are given in the supplementary material.\n\n4 Related work\n\nSpace does not permit a thorough survey of video summarization methods. Broadly speaking, existing approaches develop a variety of selection criteria to prioritize frames for the output summary, often combined with temporal segmentation. Prior work often aims to retain diverse and representative frames [2, 1, 10, 4, 11], and/or defines novel metrics for object and event saliency [3, 2, 6, 8]. When the camera is known to be stationary, background subtraction and object tracking are valuable cues (e.g., [5]). Recent developments tackle summarization for dynamic cameras that are worn or handheld [10, 8, 9] or design online algorithms to process streaming data [7].\n\nWhereas existing methods are largely unsupervised, our idea to explicitly learn subset selection from human-given summaries is novel. 
Some prior work includes supervised learning components that are applied during selection (e.g., to generate learned region saliency metrics [8] or train classifiers for canonical viewpoints [10]), but they do not train/learn the subset selection procedure itself. Our idea is also distinct from \u201cinteractive\u201d methods, which assume a human is in the loop to give supervision/feedback on each individual test video [26, 27, 12].\n\nOur focus on the determinantal point process as the building block is largely inspired by its appealing property of modeling diversity in subset selection, as well as its success in search and ranking [17], document summarization [14], news headline displaying [28], and pose estimation [29]. Applying DPP to video summarization, however, is novel to the best of our knowledge. Our seqDPP is closest in spirit to the recently proposed Markov DPP [28]. While both models enjoy the Markov property by defining conditional probabilities depending only on the immediate past, Markov DPP\u2019s ground set is still the whole video sequence, whereas seqDPP selects diverse sets from the present time. Thus, one potential drawback of applying Markov DPP is that it may select video frames out of temporal order, thus failing to model the sequential nature of the data faithfully.\n\nTable 1: Performance of various video summarization methods on OVP. Ours and its variants perform the best.\n\n  | Unsupervised methods:          |            |             |             | Supervised subset selection: |  Ours (seqDPP+):           |           |\n  | DT [30]                        | STIMO [31] | VSUMM1 [24] | VSUMM2 [24] | DPP + Q/D [14]               | Q/D       | LINEAR    | N.NETS\nF | 57.6                           | 63.4       | 70.3        | 68.2        | 70.8\u00b10.3                 | 68.5\u00b10.3 | 75.5\u00b10.4 | 77.7\u00b10.4\nP | 67.7                           | 60.3       | 70.6        | 73.1        | 71.5\u00b10.4                 | 66.9\u00b10.4 | 77.5\u00b10.5 | 75.0\u00b10.5\nR | 53.2                           | 72.2       | 75.8        | 69.1        | 74.5\u00b10.3                 | 75.8\u00b10.5 | 78.4\u00b10.5 | 87.2\u00b10.3\n\nTable 2: Performance of our method with different representation learning.\n\n        | VSUMM2 [24]:      |      |      | seqDPP+LINEAR:          |               |           | seqDPP+N.NETS:         |               |\n        | F                 | P    | R    | F          | P          | R          | F          | P          | R\nYoutube | 55.7              | 59.7 | 58.7 | 57.8\u00b10.5 | 54.2\u00b10.7 | 69.8\u00b10.5 | 60.3\u00b10.5 | 59.4\u00b10.6 | 64.9\u00b10.5\nKodak   | 68.9              | 75.7 | 80.6 | 75.3\u00b10.7 | 77.8\u00b11.0 | 80.4\u00b10.9 | 78.9\u00b10.5 | 81.9\u00b10.8 | 81.1\u00b10.9\n\n5 Experiments\n\nWe validate our approach of sequential determinantal point processes (seqDPP) for video summarization on several datasets, and obtain superior performance compared to competing methods.\n\n5.1 Setup\n\nData We benchmark various methods on 3 video datasets: the Open Video Project (OVP), the Youtube dataset [24], and the Kodak consumer video dataset [32]. They have 50, 39,2 and 18 videos, respectively. The first two have 5 human-created summaries per video and the last has one human-created summary per video. Thus, for the first two datasets, we follow the algorithm described in section 3.1 to create an oracle summary per video. We follow the same procedure as in [24] to preprocess the video frames. We uniformly sample one frame per second and then apply two stages of pruning to remove uninformative frames. Details are in the supplementary material.\n\nFeatures Each frame is encoded with an \u21132-normalized 8192-dimensional Fisher vector \u03c6i [33], computed from SIFT features [34]. The Fisher vector represents well the visual appearance of the video frame, and is hence used to compute the pairwise correlations of the frames in the quality-diversity decomposition (cf. eq. (4)). We derive the quality features xi by measuring the representativeness of the frame. Specifically, we place a contextual window centered around the frame of interest, and then compute its mean correlation (using the SIFT Fisher vector) to the other frames in the window. By varying the size of the window from 5 to 15, we obtain 12-dimensional contextual features. We also add features computed from the frame saliency map [35]. 
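The representativeness meta-cue above might be sketched as follows; the treatment of boundary frames (clipped windows) and the exact set of window sizes are assumptions here, since the precise sizes that yield the paper's 12 dimensions are not spelled out:

```python
import numpy as np

def contextual_features(Phi, window_sizes=tuple(range(5, 16))):
    """For each frame, the mean correlation between its normalized descriptor
    and the other frames inside a centered window, computed for several
    window sizes (one feature dimension per window size)."""
    Phi = Phi / np.linalg.norm(Phi, axis=1, keepdims=True)
    S = Phi @ Phi.T  # pairwise cosine correlations between frames
    N = len(Phi)
    X = np.zeros((N, len(window_sizes)))
    for k, w in enumerate(window_sizes):
        half = w // 2
        for i in range(N):
            lo, hi = max(0, i - half), min(N, i + half + 1)
            others = [j for j in range(lo, hi) if j != i]
            X[i, k] = S[i, others].mean() if others else 0.0
    return X
```

Frames that resemble their temporal neighborhood get high values across window sizes, which is the "representativeness" profile the learner can pick up on.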
To apply our method for learning representations (cf. section 3.3), however, we do not make a distinction between the two types of features, and instead compose a feature vector fi by concatenating xi and \u03c6i. The dimension of our linearly transformed features W fi is 10, 40, and 100 for OVP, Youtube, and Kodak, respectively. For the neural network, we use 50 hidden units and 50 output units.\n\nOther details For each dataset, we randomly choose 80% of the videos for training and use the remaining 20% for testing. We run 100 rounds of experiments and report the average performance, which is evaluated by the aforementioned F-score, Precision, and Recall (cf. section 3.1). For evaluation, we follow the standard procedure: for each video, we treat each human-created summary as the gold standard and assess the quality of the summary output by our algorithm. We then average over all human annotators to obtain the evaluation metrics for that video.\n\n5.2 Results\n\nWe contrast our approach to several state-of-the-art methods for video summarization\u2014which include several leading unsupervised methods\u2014as well as the vanilla DPP model that has been successfully used for document summarization but does not model sequential structures. We compare the methods in greater detail on the OVP dataset. Table 1 shows the results.\n\n2In total there are 50 Youtube videos. We keep 39 of them after excluding the cartoon videos.\n\nFigure 3: Exemplar video summaries by our seqDPP (LINEAR) vs. VSUMM1 [24], on Youtube Video 99 (seqDPP: F=70, P=60, R=88; VSUMM1: F=59, P=65, R=55) and Kodak Video 4 (seqDPP: F=86, P=75, R=100; VSUMM1: F=50, P=100, R=33).\n\nUnsupervised or supervised? The four unsupervised methods are DT [30], STIMO [31], VSUMM1 [24], and VSUMM2, which adds a postprocessing step to VSUMM1 to improve the precision of the results. We implement VSUMM ourselves using features described in the original paper and tune its parameters to obtain the best test performance. All 4 methods use clustering-like procedures to identify key frames as video summaries. 
Results of DT and STIMO are taken from their original papers. They generally underperform VSUMM. What is interesting is that the vanilla DPP does not outperform the unsupervised methods, despite its success in other tasks. On the other end, our supervised method seqDPP, when coupled with the linear or neural-network representation learning, performs significantly better than all other methods. We believe the improvement can be attributed to two factors working in concert: (1) modeling sequential structures of the video data, and (2) more flexible and powerful representation learning. This is evidenced by the rather poor performance of seqDPP with the quality/diversity (Q/D) decomposition, where the representation of the items is severely limited such that modeling temporal structures alone is simply insufficient.\n\nLinear or nonlinear? Table 2 concentrates on comparing the effectiveness of these two types of representation learning. The performance of VSUMM is provided for reference only. We see that learning representations with neural networks generally outperforms the linear representations.\n\nQualitative results We present exemplar video summaries by different methods in Fig. 3. The challenging Youtube video illustrates the advantage of sequential diverse subset selection. The visual variance in the beginning of the video is far greater (due to close-shots of people) than that at the end (zooming out). Thus the clustering-based VSUMM method is prone to select key frames from the first half of the video, collapsing the latter part. In contrast, our seqDPP copes with time-varying diversity very well. The Kodak video demonstrates again our method\u2019s ability to attain high recall when users only make diverse selections locally but not globally. 
VSUMM fails to acknowledge that temporally distant frames can be diverse despite their visual similarity.

6 Conclusion

Our novel learning model seqDPP is a successful first step towards using human-created summaries to learn to select subsets for the challenging video summarization problem. We have only scratched the surface of this fruitful direction. We plan to investigate how to learn more powerful representations from low-level visual cues.

Acknowledgments B. G., W. C. and F. S. are partially supported by DARPA D11-AP00278, NSF IIS-1065243, and ARO #W911NF-12-1-0241. K. G. is supported by ONR YIP Award N00014-12-1-0754 and gifts from Intel and Google. B. G. and W. C. also acknowledge support from the USC Viterbi Doctoral Fellowship and the USC Annenberg Graduate Fellowship. We are grateful to Jiebo Luo for providing the Kodak dataset [32].

[Figure 3 panel scores. Youtube: Sequential LINEAR (F=70, P=60, R=88) vs. VSUMM1 (F=59, P=65, R=55); Kodak: Sequential LINEAR (F=86, P=75, R=100) vs. VSUMM1 (F=50, P=100, R=33).]

References

[1] R. Hong, J. Tang, H. Tan, S. Yan, C. Ngo, and T. Chua. Event driven summarization for web videos. In ACM SIGMM Workshop on Social Media, 2009.
[2] C.-W. Ngo, Y.-F. Ma, and H.-J. Zhang. Automatic video summarization by graph modeling. In ICCV, 2003.
[3] Yu-Fei Ma, Lie Lu, Hong-Jiang Zhang, and Mingjing Li. A user attention model for video summarization. In ACM MM, 2002.
[4] Tiecheng Liu and John R. Kender. Optimization algorithms for the selection of key frame sequences of variable length. In ECCV, 2002.
[5] Y. Pritch, A. Rav-Acha, A. Gutman, and S. Peleg. Webcam synopsis: Peeking around the world. In ICCV, 2007.
[6] H. Kang, X. Chen, Y. Matsushita, and X. Tang. Space-time video montage. In CVPR, 2006.
[7] Shikun Feng, Zhen Lei, Dong Yi, and Stan Z. Li. Online content-aware video condensation. In CVPR, 2012.
[8] Yong Jae Lee, Joydeep Ghosh, and Kristen Grauman. Discovering important people and objects for egocentric video summarization. In CVPR, 2012.
[9] Zheng Lu and Kristen Grauman. Story-driven summarization for egocentric video. In CVPR, 2013.
[10] A. Khosla, R. Hamid, C.-J. Lin, and N. Sundaresan. Large-scale video summarization using web-image priors. In CVPR, 2013.
[11] Hong-Jiang Zhang, Jianhua Wu, Di Zhong, and Stephen W. Smoliar. An integrated system for content-based video retrieval and browsing. Pattern Recognition, 30(4):643–658, 1997.
[12] D. Liu, Gang Hua, and Tsuhan Chen. A hierarchical visual model for video object summarization. PAMI, 32(12):2178–2190, 2010.
[13] Alex Kulesza and Ben Taskar. Determinantal point processes for machine learning. Foundations and Trends in Machine Learning, 5(2-3):123–286, 2012.
[14] Alex Kulesza and Ben Taskar. Learning determinantal point processes. In UAI, 2011.
[15] Jennifer Gillenwater, Alex Kulesza, and Ben Taskar. Discovering diverse and salient threads in document collections. In EMNLP-CoNLL, 2012.
[16] Odile Macchi. The coincidence approach to stochastic point processes. Advances in Applied Probability, 7(1):83–122, 1975.
[17] Alex Kulesza and Ben Taskar. k-DPPs: Fixed-size determinantal point processes. In ICML, 2011.
[18] Hoa Trang Dang. Overview of DUC 2005. In Document Understanding Conf., 2005.
[19] Hui Lin and Jeff Bilmes. Multi-document summarization via budgeted maximization of submodular functions. In NAACL/HLT, 2010.
[20] Jennifer Gillenwater, Alex Kulesza, and Ben Taskar. Near-optimal MAP inference for determinantal point processes. In NIPS, 2012.
[21] Víctor Valdés and José M. Martínez. Automatic evaluation of video summaries. ACM Trans. on Multimedia Computing, Communications, and Applications, 8(3):25, 2012.
[22] Emilie Dumont and Bernard Mérialdo. Automatic evaluation method for rushes summary content. In ICME, 2009.
[23] Yingbo Li and Bernard Mérialdo. VERT: automatic evaluation of video summaries. In ACM MM, 2010.
[24] Sandra Eliza Fontes de Avila, Ana Paula Brandão Lopes, et al. VSUMM: A mechanism designed to produce static video summaries and a novel evaluation method. Pattern Recognition Letters, 32(1):56–68, 2011.
[25] Open video project. http://www.open-video.org/.
[26] M. Ellouze, N. Boujemaa, and A. Alimi. IM(S)2: Interactive movie summarization system. J VCIR, 21(4):283–294, 2010.
[27] Dan B. Goldman, Brian Curless, and Steven M. Seitz. Schematic storyboarding for video visualization and editing. In SIGGRAPH, 2006.
[28] R. H. Affandi, A. Kulesza, and E. B. Fox. Markov determinantal point processes. In UAI, 2012.
[29] A. Kulesza and B. Taskar. Structured determinantal point processes. In NIPS, 2011.
[30] Padmavathi Mundur, Yong Rao, and Yelena Yesha. Keyframe-based video summarization using Delaunay clustering. Int'l J. on Digital Libraries, 6(2):219–232, 2006.
[31] Marco Furini, Filippo Geraci, Manuela Montangero, and Marco Pellegrini. STIMO: Still and moving video storyboard for the web scenario. Multimedia Tools and Applications, 46(1):47–69, 2010.
[32] Jiebo Luo, Christophe Papin, and Kathleen Costello. Towards extracting semantically meaningful key frames from personal video clips: from humans to computers. IEEE Trans. on Circuits and Systems for Video Technology, 19(2):289–301, 2009.
[33] Florent Perronnin and Christopher Dance. Fisher kernels on visual vocabularies for image categorization. In CVPR, 2007.
[34] David G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.
[35] Esa Rahtu, Juho Kannala, Mikko Salo, and Janne Heikkilä. Segmenting salient objects from images and videos. In ECCV, 2010.