{"title": "Bayesian Video Shot Segmentation", "book": "Advances in Neural Information Processing Systems", "page_first": 1009, "page_last": 1015, "abstract": null, "full_text": "Bayesian video shot segmentation\n\nNuno Vasconcelos, Andrew Lippman\n\nMIT Media Laboratory, 20 Ames St, E15-354, Cambridge, MA 02139\n{nuno,lip}@media.mit.edu, http://www.media.mit.edu/~nuno\n\nAbstract\n\nPrior knowledge about video structure can be used both as a means to improve the performance of content analysis and to extract features that allow semantic classification. We introduce statistical models for two important components of this structure, shot duration and activity, and demonstrate the usefulness of these models by introducing a Bayesian formulation for the shot segmentation problem. The new formulation is shown to extend standard thresholding methods in an adaptive and intuitive way, leading to improved segmentation accuracy.\n\n1 Introduction\n\nGiven the recent advances in video coding and streaming technology and the pervasiveness of video as a form of communication, there is currently a strong interest in the development of techniques for browsing, categorizing, retrieving, and automatically summarizing video. In this context, two tasks are of particular relevance: the decomposition of a video stream into its component units, and the extraction of features for the automatic characterization of these units. Unfortunately, current video characterization techniques rely on image representations based on low-level visual primitives (such as color, texture, and motion) that, while practical and computationally efficient, fail to capture most of the structure that is relevant for the perceptual decoding of the video. As a result, it is difficult to design systems that are truly useful for naive users. 
Significant progress can only be attained by a deeper understanding of the relationship between the message conveyed by the video and the patterns of visual structure that it exhibits.\n\nThere are various domains where these relationships have been thoroughly studied, albeit not always from a computational standpoint. For example, it is well known by film theorists that the message strongly constrains the stylistic elements of the video [1, 6], which are usually grouped into two major categories: the elements of montage and the elements of mise-en-scene. Montage refers to the temporal structure, namely the aspects of film editing, while mise-en-scene deals with spatial structure, i.e. the composition of each image, and includes variables such as the type of set in which the scene develops, the placement of the actors, aspects of lighting, focus, camera angles, and so on. Building computational models for these stylistic elements can prove useful in two ways: on the one hand, it will allow the extraction of semantic features enabling video characterization and classification much closer to those used by people than current descriptors based on texture properties or optical flow. On the other hand, it will provide constraints for the low-level analysis algorithms required to perform tasks such as video segmentation, keyframing, and so on.\n\nThe first point is illustrated by Figure 1, where we show how a collection of promotional trailers for commercially released feature films populates a 2-D feature space based on the most elementary characterization of montage and mise-en-scene: average shot duration vs. average shot activity (see footnote 1). Despite the coarseness of this characterization, it captures aspects that are important for semantic movie classification: close inspection of the genre assigned to each movie by the Motion Picture Association of America reveals that in this space the movies cluster by genre! 
\n\n[Figure 1 here: scatter plot of average shot activity (horizontal axis) vs. average shot duration (vertical axis) for the 23 trailers, each point labeled by an abbreviated title. Legend: \"Circle of Friends\" (circle), \"French Kiss\" (french), \"Miami Rhapsody\" (miami), \"The Santa Clause\" (santa), \"Exit to Eden\" (eden), \"A Walk in the Clouds\" (clouds), \"While You Were Sleeping\" (sleeping), \"Bad Boys\" (badboys), \"Junior\" (junior), \"Crimson Tide\" (tide), \"The Scout\" (scout), \"The Walking Dead\" (walking), \"Ed Wood\" (edwood), \"The Jungle Book\" (jungle), \"Puppet Master\" (puppet), \"A Little Princess\" (princess), \"Judge Dredd\" (dredd), \"The River Wild\" (riverwild), \"Terminal Velocity\" (terminal), \"Blankman\" (blankman), \"In the Mouth of Madness\" (madness), \"Street Fighter\" (fighter), \"Die Hard: With a Vengeance\" (vengeance).]\n\nFigure 1: Shot activity vs. duration features. The genre of each movie is identified by the symbol used to represent the movie in the plot.\n\nIn this paper, we concentrate on the second point, i.e. how the structure exhibited by Figure 1 can be exploited to improve the performance of low-level processing tasks such as shot segmentation. Because knowledge about the video structure is a form of prior knowledge, Bayesian procedures provide a natural way to accomplish this goal. We therefore introduce computational models for shot duration and activity and develop a Bayesian framework for segmentation that is shown to significantly outperform current approaches. 
\n\n2 Modeling shot duration\n\nBecause shot boundaries can be seen as arrivals over discrete, non-overlapping temporal intervals, a Poisson process seems an appropriate model for shot duration [3]. However, events generated by Poisson processes have inter-arrival times characterized by the exponential density, which is a monotonically decreasing function of time. This is clearly not the case for shot duration, as can be seen from the histograms of Figure 2. In this work, we consider two alternative models, the Erlang and Weibull distributions.\n\n2.1 The Erlang model\n\nLetting τ be the time since the previous boundary, the Erlang distribution [3] is described by\n\nε_{r,λ}(τ) = λ (λτ)^{r-1} e^{-λτ} / (r-1)!   (1)\n\nFootnote 1: The activity features are described in section 3.\n\nFigure 2: Shot duration histogram, and maximum likelihood fit obtained with the Erlang (left) and Weibull (right) distributions.\n\nIt is a generalization of the exponential density, characterized by two parameters: the order r, and the expected inter-arrival time (1/λ) of the underlying Poisson process. When r = 1, the Erlang distribution becomes the exponential distribution. For larger values of r, it characterizes the time until the r-th inter-arrival of the Poisson process. This leads to an intuitive explanation for the use of the Erlang distribution as a model of shot duration: for a given order r, the shot is modeled as a sequence of r events which are themselves the outcomes of Poisson processes. Such events may reflect properties of the shot content, such as \"setting the context\" through a wide-angle view followed by \"zooming in on the details\" when r = 2, or \"emotional buildup\" followed by \"action\" and \"action outcome\" when r = 3. Figure 2 presents a shot duration histogram, obtained from the training set to be described in section 5, and its maximum likelihood (ML) Erlang fit. 
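To make the fit concrete: for a fixed order r, the ML estimate of λ has a closed form (r divided by the sample mean of the durations), and r itself can be chosen by comparing log-likelihoods over a few candidate integer orders. The sketch below is our own illustration, not the authors' code; all function names are ours.

```python
import math

def erlang_pdf(tau, r, lam):
    """Erlang density of order r and rate lam, as in eq. (1)."""
    return lam * (lam * tau) ** (r - 1) * math.exp(-lam * tau) / math.factorial(r - 1)

def fit_erlang(durations, r):
    """ML estimate of lam for a fixed order r: lam = r / sample mean."""
    return r * len(durations) / sum(durations)

def log_likelihood(durations, r, lam):
    """Log-likelihood of the observed shot durations under the Erlang model."""
    return sum(math.log(erlang_pdf(t, r, lam)) for t in durations)

def fit_erlang_order(durations, max_r=5):
    """Pick the integer order with the best ML fit, then return (r, lam)."""
    best_r = max(range(1, max_r + 1),
                 key=lambda r: log_likelihood(durations, r, fit_erlang(durations, r)))
    return best_r, fit_erlang(durations, best_r)
```

Setting the derivative of the log-likelihood with respect to λ to zero gives nrλ^{-1} = Στ, hence λ = r / mean, which is what `fit_erlang` computes.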
\n\n2.2 The Weibull model\n\nWhile the Erlang model provides a good fit to the empirical density, it is of limited practical utility due to the constant arrival rate assumption [5] inherent to the underlying Poisson process. Because λ is a constant, the expected rate of occurrence of a new shot boundary is the same whether 10 seconds or 1 hour have elapsed since the occurrence of the previous one. An alternative model that does not suffer from this problem is the Weibull distribution [5], which generalizes the exponential distribution by considering an expected rate of arrival of new events that is a function of time τ,\n\nλ(τ) = α τ^{α-1} / β,\n\nand of the parameters α and β, leading to a probability density of the form\n\nw_{α,β}(τ) = (α τ^{α-1} / β) exp(-τ^α / β).   (2)\n\nFigure 2 presents the ML Weibull fit to the shot duration histogram. Once again we obtain a good approximation to the empirical density estimate.\n\n3 Modeling shot activity\n\nThe color histogram distance has been widely used as a measure of (dis)similarity between images for the purposes of object recognition [7], content-based retrieval [4], and temporal video segmentation [2]. A histogram is first computed for each image in the sequence and the distance between successive histograms is used as a measure of local activity. A standard metric for video segmentation [2] is the L1 norm of the histogram difference,\n\nV(a, b) = Σ_{i=1}^{B} |a_i - b_i|,   (3)\n\nwhere a and b are histograms of successive frames, and B the number of histogram bins.\n\nStatistical modeling of the histogram distance features requires the identification of the various states through which the video may progress. For simplicity, in this work we restrict ourselves to a video model composed of two states: \"regular frames\" (S = 0) and \"shot transitions\" (S = 1). The fundamental principles are however applicable to more complex models. 
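The activity feature of (3) is simple to compute. The following sketch is our own illustration (representing a frame as a flat list of gray-level pixel values is an assumption made for concreteness; real systems typically use color histograms):

```python
def histogram(frame, bins=16, max_val=256):
    """Normalized gray-level histogram of a frame given as a flat list of pixels."""
    h = [0.0] * bins
    for p in frame:
        h[p * bins // max_val] += 1
    n = float(len(frame))
    return [c / n for c in h]

def l1_distance(a, b):
    """Activity feature V(a, b) of eq. (3): L1 norm of the histogram difference."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

def activity_sequence(frames, bins=16):
    """L1 distance between each pair of successive frames in a sequence."""
    hists = [histogram(f, bins) for f in frames]
    return [l1_distance(hists[i], hists[i + 1]) for i in range(len(hists) - 1)]
```

With normalized histograms the feature is bounded: identical frames give 0, and frames with disjoint histogram support give 2.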
As illustrated by Figure 3, for \"regular frames\" the distribution is asymmetric about the mean, always positive, and concentrated near zero. This suggests that a mixture of Erlang distributions is an appropriate model for this state, a suggestion that is confirmed by the fit to the empirical density obtained with EM, also depicted in the figure. On the other hand, for \"shot transitions\" the fit obtained with a simple Gaussian model is sufficient to achieve a reasonable approximation to the empirical density. In both cases, a uniform mixture component is introduced to account for the tails of the distributions.\n\nFigure 3: Left: Conditional activity histogram for regular frames, and best fit by a mixture with three Erlang and a uniform component. Right: Conditional activity histogram for shot transitions, and best fit by a mixture with a Gaussian and a uniform component.\n\n4 A Bayesian framework for shot segmentation\n\nBecause shot segmentation is a prerequisite for virtually any task involving the understanding, parsing, indexing, characterization, or categorization of video, the grouping of video frames into shots has been an active topic of research in the area of multimedia signal processing. Extensive evaluation of various approaches has shown that simple thresholding of histogram distances performs surprisingly well and is difficult to beat [2]. In this work, we consider an alternative formulation that regards the problem as one of statistical inference between two hypotheses:\n\n• H0: no shot boundary occurs between the two frames under analysis (S = 0),\n• H1: a shot boundary occurs between the two frames (S = 1),\n\nfor which the optimal decision is provided by a likelihood ratio test where H1 is chosen if\n\nL = log [P(V|S = 1) / P(V|S = 0)] > 0,   (4)\n\nand H0 is chosen otherwise. 
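A minimal numerical sketch of the test in (4), written as our own illustration: the true class-conditional densities are the Erlang-mixture and Gaussian fits of section 3, but here they are replaced by two toy Gaussians just to show the decision rule itself.

```python
import math

def gaussian_pdf(v, mu, sigma):
    """Univariate Gaussian density."""
    return math.exp(-0.5 * ((v - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def log_likelihood_ratio(v, p1, p0):
    """log P(V|S=1) - log P(V|S=0) for an observed activity value v."""
    return math.log(p1(v)) - math.log(p0(v))

# Toy class-conditional densities (assumed parameters, for illustration only):
# shot transitions tend to produce much larger activity values.
p_transition = lambda v: gaussian_pdf(v, mu=1.2, sigma=0.4)
p_regular = lambda v: gaussian_pdf(v, mu=0.1, sigma=0.2)

def declare_boundary(v, threshold=0.0):
    """Eq. (4): choose H1 when the log likelihood ratio exceeds the threshold."""
    return log_likelihood_ratio(v, p_transition, p_regular) > threshold
```

The `threshold` argument anticipates section 4.2, where the fixed value 0 is replaced by a time-varying prior term.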
It is well known that standard thresholding is a particular case of this formulation, in which both conditional densities are assumed to be Gaussians with the same covariance. From the discussion in the previous section, it is clear that this does not hold for real video. One further limitation of the thresholding model is that it does not take into account the fact that the likelihood of a new shot transition is dependent on how much time has elapsed since the previous one. On the other hand, the statistical formulation can easily incorporate the shot duration models developed in section 2.\n\n4.1 Notation\n\nBecause video is a discrete process, characterized by a given frame rate, shot boundaries are not instantaneous, but last for one frame period. To account for this, states are defined over time intervals, i.e. instead of S_t = 0 or S_t = 1, we have S_{t,t+δ} = 0 or S_{t,t+δ} = 1, where t is the start of a time interval, and δ its duration. We designate the features observed during the interval [t, t+δ] by V_{t,t+δ}.\n\nTo simplify the notation, we reserve t for the temporal instant at which the last shot boundary has occurred and make all temporal indexes relative to this instant, i.e. instead of S_{t+τ,t+τ+δ} we write S_{τ,τ+δ}, or simply S_δ if τ = 0. Furthermore, we reserve the symbol δ for the duration of the interval between successive frames (inverse of the frame rate), and use the same notation for a single frame interval and a vector of frame intervals (the temporal indexes being themselves enough to avoid ambiguity). That is, while S_{τ,τ+δ} = 0 indicates that no shot boundary is present in the interval [t+τ, t+τ+δ], S_{τ+δ} = 0 indicates that no shot boundary has occurred in any of the frames between t and t+τ+δ. Similarly, V_{τ+δ} represents the vector of observations in [t, t+τ+δ]. 
\n\n4.2 Bayesian formulation\n\nGiven that there is a shot boundary at time t and no boundaries occur in the interval [t, t+τ], the posterior probability that the next shot change happens during the interval [t+τ, t+τ+δ] is, using Bayes rule,\n\nP(S_{τ,τ+δ} = 1 | S_τ = 0, V_{τ+δ}) = γ P(V_{τ+δ} | S_τ = 0, S_{τ,τ+δ} = 1) P(S_{τ,τ+δ} = 1 | S_τ = 0),\n\nwhere γ is a normalizing constant. Similarly, the probability of no change in [t+τ, t+τ+δ] is\n\nP(S_{τ,τ+δ} = 0 | S_τ = 0, V_{τ+δ}) = γ P(V_{τ+δ} | S_{τ+δ} = 0) P(S_{τ,τ+δ} = 0 | S_τ = 0),\n\nand the posterior odds ratio between the two hypotheses is\n\nP(S_{τ,τ+δ} = 1 | S_τ = 0, V_{τ+δ}) / P(S_{τ,τ+δ} = 0 | S_τ = 0, V_{τ+δ})\n  = [P(V_{τ,τ+δ} | S_{τ,τ+δ} = 1) / P(V_{τ,τ+δ} | S_{τ,τ+δ} = 0)] [P(S_{τ,τ+δ} = 1 | S_τ = 0) / P(S_{τ,τ+δ} = 0 | S_τ = 0)]\n  = [P(V_{τ,τ+δ} | S_{τ,τ+δ} = 1) / P(V_{τ,τ+δ} | S_{τ,τ+δ} = 0)] [P(S_{τ,τ+δ} = 1, S_τ = 0) / P(S_{τ+δ} = 0)],   (5)\n\nwhere we have assumed that, given S_{τ,τ+δ}, V_{τ,τ+δ} is independent of all other V and S. In this expression, while the first term on the right-hand side is the ratio of the conditional likelihoods of activity given the state sequence, the second term is simply the ratio of probabilities that there may (or not) be a shot transition τ units of time after the previous one. Hence, the shot duration density becomes a prior for the segmentation process. This is intuitive, since knowledge about the shot duration is a form of prior knowledge about the structure of the video that should be used to favor segmentations that are more plausible. Assuming further that V is stationary, defining Δτ = [t+τ, t+τ+δ], considering the probability density function p(τ) for the time elapsed until the first scene change after t, and taking logarithms, leads to a log posterior odds ratio L_post of the form\n\nL_post = log [P(V_{Δτ} | S_{Δτ} = 1) / P(V_{Δτ} | S_{Δτ} = 0)] + log [∫_τ^{τ+δ} p(α) dα / ∫_{τ+δ}^∞ p(α) dα].   (6)
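The duration prior enters (6) only through two integrals of p. The sketch below, our own illustration using midpoint-rule integration, evaluates that term for an arbitrary density; with an exponential p it reproduces the memorylessness that motivates moving away from the Poisson model (the term is then constant in τ).

```python
import math

def log_prior_odds(p, tau, delta, horizon=50.0, step=0.005):
    """Second term of eq. (6): log of the probability that the next boundary
    falls in [tau, tau+delta] over the probability that it falls later.
    `horizon` stands in for the infinite upper limit of integration."""
    def integrate(lo, hi):
        # Midpoint-rule approximation of the integral of p over [lo, hi].
        n = max(1, int((hi - lo) / step))
        h = (hi - lo) / n
        return sum(p(lo + (k + 0.5) * h) for k in range(n)) * h
    return math.log(integrate(tau, tau + delta) / integrate(tau + delta, horizon))
```

For p(t) = λ e^{-λt} the exact value is log(e^{λδ} - 1) regardless of τ, which is precisely the fixed-threshold behavior the Erlang and Weibull priors are meant to replace.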
\n\nThe optimal answer to the question of whether a shot change occurs in [t+τ, t+τ+δ] is thus to declare that a boundary exists if\n\nlog [P(V_{Δτ} | S_{Δτ} = 1) / P(V_{Δτ} | S_{Δτ} = 0)] > log [∫_{τ+δ}^∞ p(α) dα / ∫_τ^{τ+δ} p(α) dα] = T(τ),   (7)\n\nand that there is no boundary otherwise. Comparing this with (4), it is clear that the inclusion of the shot duration prior transforms the fixed thresholding approach into an adaptive one, where the threshold depends on how much time has elapsed since the previous shot boundary.\n\n4.2.1 The Erlang model\n\nIt can be shown that, under the Erlang assumption,\n\n∫_τ^∞ ε_{r,λ}(α) dα = (1/λ) Σ_{i=1}^{r} ε_{i,λ}(τ),   (8)\n\nand the threshold of (7) becomes\n\nT_ε(τ) = log [ Σ_{i=1}^{r} ε_{i,λ}(τ+δ) / Σ_{i=1}^{r} (ε_{i,λ}(τ) - ε_{i,λ}(τ+δ)) ].   (9)\n\nIts variation over time is presented in Figure 4. While in the initial segment of the shot the threshold is large and shot changes are unlikely to be accepted, the threshold decreases as the scene progresses, increasing the likelihood that shot boundaries will be declared.\n\nFigure 4: Temporal evolution of the Bayesian threshold for the Erlang (left) and Weibull (center) priors. Right: Total number of errors for all thresholds.\n\nEven though, qualitatively, this is the behavior one would desire, a closer observation of the figure reveals the major limitation of the Erlang prior: its steady-state behavior. Ideally, in addition to decreasing monotonically over time, the threshold should not be lower bounded by a positive value, as this may lead to situations in which its steady-state value is high enough to miss several consecutive shot boundaries. This limitation is a consequence of the constant arrival rate assumption discussed in section 2 and can be avoided by relying instead on the Weibull prior. 
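A numerical check of (8)-(9), written as our own sketch with the Erlang terms spelled out: the threshold decreases with τ but stays above a positive floor whenever the frame period δ is much smaller than the mean shot duration, which is the limitation discussed above.

```python
import math

def erlang_term(i, lam, tau):
    """epsilon_{i,lambda}(tau) = lam * (lam*tau)**(i-1) * exp(-lam*tau) / (i-1)!"""
    return lam * (lam * tau) ** (i - 1) * math.exp(-lam * tau) / math.factorial(i - 1)

def erlang_survival(r, lam, tau):
    """Eq. (8): integral of the order-r Erlang density from tau to infinity."""
    return sum(erlang_term(i, lam, tau) for i in range(1, r + 1)) / lam

def erlang_threshold(r, lam, tau, delta):
    """Eq. (9): adaptive threshold T_eps(tau) under the Erlang prior."""
    num = sum(erlang_term(i, lam, tau + delta) for i in range(1, r + 1))
    den = sum(erlang_term(i, lam, tau) - erlang_term(i, lam, tau + delta)
              for i in range(1, r + 1))
    return math.log(num / den)
```

As τ grows, the ratio inside the log tends to 1/(e^{λδ} - 1), so the threshold settles at -log(e^{λδ} - 1), a positive constant when λδ < log 2.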
\n\n4.2.2 The Weibull model\n\nIt can be shown that, under the Weibull assumption,\n\n∫_τ^∞ w_{α,β}(α') dα' = exp(-τ^α / β),   (10)\n\nfrom which\n\nT_w(τ) = -log { exp[((τ+δ)^α - τ^α) / β] - 1 }.   (11)\n\nAs illustrated by Figure 4, unlike the threshold associated with the Erlang prior, T_w(τ) tends to -∞ as τ grows without bound. This guarantees that a new shot boundary will always be found if one waits long enough. In summary, both the Erlang and the Weibull priors lead to adaptive thresholds that are more intuitive than the fixed threshold commonly employed for shot segmentation.\n\n5 Segmentation Results\n\nThe performance of Bayesian shot segmentation was evaluated on a database containing the promotional trailers of Figure 1. Each trailer consists of 2 to 5 minutes of video, and the total number of shots in the database is 1959. In all experiments, performance was evaluated by the leave-one-out method. Ground truth was obtained by manual segmentation of all the trailers.\n\nWe evaluated the performance of Bayesian models with Erlang, Weibull, and Poisson shot duration priors and compared them against the best possible performance achievable with a fixed threshold. For the latter, the optimal threshold was obtained by brute force, i.e. testing several values and selecting the one that performed best. Error rates for all priors are shown in Figure 4, where it is visible that, while the Poisson prior leads to worse accuracy than the static threshold, both the Erlang and the Weibull priors lead to significant improvements. The Weibull prior achieves the overall best performance, decreasing the error rate of the static threshold by 20%.\n\nThe reasons for the improved performance of Bayesian segmentation are illustrated by Figure 5, which presents the evolution of the thresholding process for a segment from one of the trailers in the database (\"blankman\"). 
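By contrast with the Erlang case, the Weibull threshold of (11) is unbounded below. A short sketch (our own illustration) implementing (10) and (11) makes the qualitative claims checkable:

```python
import math

def weibull_survival(alpha, beta, tau):
    """Eq. (10): P(shot duration > tau) = exp(-tau**alpha / beta)."""
    return math.exp(-tau ** alpha / beta)

def weibull_threshold(alpha, beta, tau, delta):
    """Eq. (11): T_w(tau) = -log(exp(((tau+delta)**alpha - tau**alpha)/beta) - 1)."""
    return -math.log(math.exp(((tau + delta) ** alpha - tau ** alpha) / beta) - 1.0)
```

Substituting (10) into the generic threshold (7) recovers (11) exactly, and for α > 1 the exponent grows with τ, so the threshold eventually becomes negative and keeps decreasing.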
Two thresholding approaches are depicted: Bayesian with the Weibull prior, and standard fixed thresholding. The adaptive behavior of the Bayesian threshold significantly increases the robustness against spurious peaks of the activity metric originated by events such as very fast motion, explosions, camera flashes, etc.\n\nFigure 5: An example of the thresholding process. Top: Bayesian. The likelihood ratio and the Weibull threshold are shown. Bottom: Fixed. Histogram distances and optimal threshold (determined by leave-one-out using the remainder of the database) are presented. Errors are indicated by circles.\n\nReferences\n\n[1] D. Bordwell and K. Thompson. Film Art: an Introduction. McGraw-Hill, 1986.\n[2] J. Boreczky and L. Rowe. Comparison of Video Shot Boundary Detection Techniques. In Proc. SPIE Conf. on Visual Communication and Image Processing, 1996.\n[3] A. Drake. Fundamentals of Applied Probability Theory. McGraw-Hill, 1987.\n[4] W. Niblack et al. The QBIC project: Querying images by content using color, texture, and shape. In Storage and Retrieval for Image and Video Databases, pages 173-181, SPIE, Feb. 1993, San Jose, California.\n[5] R. Hogg and E. Tanis. Probability and Statistical Inference. Macmillan, 1993.\n[6] K. Reisz and G. Millar. The Technique of Film Editing. Focal Press, 1968.\n[7] M. Swain and D. Ballard. Color Indexing. International Journal of Computer Vision, 7(1):11-32, 1991.\n", "award": [], "sourceid": 1812, "authors": [{"given_name": "Nuno", "family_name": "Vasconcelos", "institution": null}, {"given_name": "Andrew", "family_name": "Lippman", "institution": null}]}