{"title": "Dynamic Foreground/Background Extraction from Images and Videos using Random Patches", "book": "Advances in Neural Information Processing Systems", "page_first": 929, "page_last": 936, "abstract": null, "full_text": "Dynamic Foreground/Background Extraction from Images and Videos using Random Patches\n\nLe Lu Integrated Data Systems Department Siemens Corporate Research Princeton, NJ 08540 le-lu@siemens.com\n\nGregory Hager Department of Computer Science Johns Hopkins University Baltimore, MD 21218 hager@cs.jhu.edu\n\nAbstract\nIn this paper, we propose a novel exemplar-based approach to extract dynamic foreground regions from a changing background within a collection of images or a video sequence. By using image segmentation as a pre-processing step, we convert this traditional pixel-wise labeling problem into a lower-dimensional supervised, binary labeling procedure on image segments. Our approach consists of three steps. First, a set of random image patches is spatially and adaptively sampled within each segment. Second, these sets of extracted samples are formed into two \"bags of patches\" to model the foreground/background appearance, respectively. We perform a novel bidirectional consistency check between new patches from incoming frames and the current \"bags of patches\" to reject outliers, control model rigidity and make the model adaptive to new observations. Within each bag, image patches are further partitioned and resampled to create an evolving appearance model. Finally, the foreground/background decision over segments in an image is formulated using an aggregation function defined on the similarity measurements of sampled patches relative to the foreground and background models. The essence of the algorithm is conceptually simple and can be easily implemented within a few hundred lines of Matlab code. 
We evaluate and validate the proposed approach with extensive real examples of object-level image mapping and tracking within a variety of challenging environments. We also show that it is straightforward to apply our problem formulation to non-rigid object tracking on difficult surveillance videos.\n\n1 Introduction\nIn this paper, we study the problem of object-level figure/ground segmentation in images and video sequences. The core problem can be defined as follows: given an image X with known figure/ground labels L, infer the figure/ground labels L' of a new image X' closely related to X. For example, we may want to extract a walking person in an image using the figure/ground mask of the same person in another image of the same sequence. Our approach is based on training a classifier from the appearance of a pixel and its surrounding context (i.e., an image patch centered at the pixel) to recognize other similar pixels across images. To apply this process to a video sequence, we also evolve the appearance model over time. A key element of our approach is the use of a prior segmentation to reduce the complexity of the segmentation process. As argued in [22], image segments are a more natural primitive for image modeling than pixels. More specifically, an image segmentation provides a natural dimensional reduction from the spatial resolution of the image to a much smaller set of spatially compact and relatively homogeneous regions. We can then focus on representing the appearance characteristics of these regions.\n\nThis work was done while the first author was a graduate student at Johns Hopkins University.\n\n\f\nBorrowing a term from [22], we can think of each region as a \"superpixel\" which represents a complex connected spatial region of the image using a rich set of derived image features. We can then consider how to classify each superpixel (i.e. 
image segment) as foreground or background, and then project this classification back into the original image to create the pixel-level foreground/background segmentation we are interested in. The original superpixel representation in [22, 19, 18] is a feature vector created from the image segment's color histogram [19], filter bank responses [22], oriented energy [18] and contourness [18]. These features are effective for image segmentation [18], or for finding perceptually important boundaries from segmentation by supervised training [22]. However, as shown in [17], these features do not work well for matching different classes of image regions across different images. Instead, we propose using a set of spatially randomly sampled image patches as a non-parametric, statistical superpixel representation. This non-parametric \"bag of patches\" model1 can be easily and robustly evolved with the spatial-temporal appearance information from video, while maintaining the model size (the number of image patches per bag) using adaptive sampling. Foreground/background classification is then posed as the problem of matching sets of random patches from the image with these models. Our major contributions are a demonstration of the effectiveness and computational simplicity of a nonparametric random patch representation for semantically labelling superpixels, and a novel bidirectional consistency check and resampling strategy for robust foreground/background appearance adaptation over time.\n\nFigure 1: (a) An example indoor image, (b) the segmentation result using [6] coded in random colors, (c)\nthe boundary pixels between segments shown in red, with the image segments associated with the foreground, a walking person here, shown in blue, (d) the associated foreground/background mask. Notice that the color in (a) is not very saturated; this is common in our indoor experiments, which use no specific lighting controls.\n\nWe organize the paper as follows. 
We first describe several image patch based representations and the associated matching methods. In section 3, the algorithm used in our approach is presented in detail. We demonstrate the validity of the proposed approach using experiments on real examples of object-level figure/ground image mapping and non-rigid object tracking under dynamic conditions, from videos of different resolutions, in section 4. Finally, we summarize the contributions of the paper and discuss possible extensions and improvements.\n\n2 Image Patch Representation and Matching\nBuilding stable appearance representations of image patches is fundamental to our approach. There are many derived features that can be used to represent the appearance of an image patch. In this paper, we evaluate our algorithm based on: 1) an image patch's raw RGB intensity vector, 2) its mean color vector, 3) color + texture descriptors (filter bank responses or Haralick features [17]), and 4) PCA, LDA and NDA (Nonparametric Discriminant Analysis) features [7, 3] on the raw RGB vectors. For completeness, we give a brief description of each of these techniques. Texture descriptors: To compute texture descriptions, we first apply the Leung-Malik (LM) filter bank [13], which consists of 48 isotropic and anisotropic filters with 6 directions, 3 scales and 2 phases. Thus each image patch is represented by a 48-component feature vector. The Haralick texture descriptor [10] was used for image classification in [17]. Haralick features are derived from the Gray Level Co-occurrence Matrix, which is a tabulation of how often different combinations of pixel brightness values (grey levels) occur in an image region. We selected 5 out of 14 texture\n1 Highly distinctive local features [16] are not adequate substitutes for image patches. 
Their spatial sparseness limits their representativeness within each individual image segment, especially for nonrigid, unstructured and flexible foreground/background appearances.\n\n\f\ndescriptors [10], including dissimilarity, Angular Second Moment (ASM), mean, standard deviation (STD) and correlation. For details, refer to [10, 17]. Dimension reduction representations: The Principal Component Analysis (PCA) algorithm is used to reduce the dimensionality of the raw color intensity vectors of image patches. PCA makes no prior assumptions about the labels of data. However, recall that we construct the \"bag of patches\" appearance model from sets of labelled image patches. This supervised information can be used to project the bags of patches into a subspace where they are best separated, using the Linear Discriminant Analysis (LDA) or Nonparametric Discriminant Analysis (NDA) algorithm [7, 3], assuming Gaussian or non-Gaussian class-specific distributions. Patch matching: After image patches are represented using one of the above methods, we must match them against the foreground/background models. Two methods are investigated in this paper: nearest-neighbor matching using Euclidean distance, and KDE (Kernel Density Estimation) [12] in PCA/NDA subspaces. For nearest-neighbor matching we find, for each patch p, its nearest neighbors p^F_n, p^B_n in the foreground/background bags, and then compute d^F_p = ||p - p^F_n|| and d^B_p = ||p - p^B_n||. Alternatively, an image patch's matching scores m^F_p and m^B_p are evaluated as probability density values from the KDE functions KDE(p, Ω^F) and KDE(p, Ω^B), where Ω^{F|B} are the bag-of-patches models. The segment-level classification is then performed as in section 3.2.\n\n3 Algorithms\nWe briefly summarize our labeling algorithm as follows. We assume that each image of interest has been segmented into spatial regions. A set of random image patches is spatially and adaptively sampled within each segment. 
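The nearest-neighbor patch matching described in section 2 can be sketched as follows. This is a minimal NumPy sketch; the function names and the one-patch-per-row array layout are our own assumptions, not the paper's Matlab implementation.

```python
import numpy as np

def nn_distance(patch, bag):
    # Euclidean distance from one patch feature vector to its nearest
    # neighbor in a bag of patches (one flattened patch per row).
    diffs = np.asarray(bag, dtype=float) - np.asarray(patch, dtype=float)
    return float(np.sqrt((diffs ** 2).sum(axis=1)).min())

def match_patch(patch, fg_bag, bg_bag):
    # d_F, d_B: nearest-neighbor distances of patch p to the foreground
    # and background bags, later aggregated per segment (section 3.2).
    return nn_distance(patch, fg_bag), nn_distance(patch, bg_bag)
```

In practice each row of a bag would be a flattened feature vector (raw RGB, PCA, or NDA features) of one sampled patch.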
These sets of extracted samples are formed into two \"bags of patches\" to model the foreground/background appearance, respectively. The foreground/background decision for any segment in a new image is computed using one of two aggregation functions on the appearance similarities between its interior image patches and the foreground and background models. Finally, for videos, within each bag, new patches from new frames are integrated through a robust bidirectional consistency check, and all image patches are then partitioned and resampled to create an evolving appearance model. As described below, this process prunes classification inaccuracies in the nonparametric image patch representations and adapts them to current changes in foreground/background appearance. We describe each of these steps in more detail below for video tracking of foreground/background segments; image matching is treated as a special case by simply omitting steps 3 and 4 in Figure 2.\n\nNon-parametric Patch Appearance Modelling-Matching Algorithm\nInputs: pre-segmented images X_t, t = 1, 2, ..., T; label L_1.\nOutputs: labels L_t, t = 2, ..., T; two \"bags of patches\" appearance models Ω^{F|B}_T for foreground/background.\n1. Sample segmentation-adaptive random image patches {P_1} from image X_1.\n2. Construct two new bags of patches Ω^{F|B}_1 for foreground/background using patches {P_1} and label L_1; set t = 1.\n3. t = t + 1; sample segmentation-adaptive random image patches {P_t} from image X_t; match {P_t} with Ω^{F|B}_{t-1} and classify the segments of X_t to generate label L_t by aggregation.\n4. Classify and reject ambiguous patch samples, probable outliers and redundant appearance patch samples from the newly extracted image patches {P_t} against Ω^{F|B}_{t-1}; then integrate the filtered {P_t} into Ω^{F|B}_{t-1} and evaluate the probability of survival p_s for each patch inside Ω^{F|B}_{t-1} against the original unprocessed {P_t} (bidirectional consistency check).\n5. 
Perform the random partition and resampling process, according to the normalized product of the probability of survival p_s and the partition-wise sampling rate inside Ω^{F|B}_{t-1}, to generate Ω^{F|B}_t.\n6. If t = T, output L_t, t = 2, ..., T and Ω^{F|B}_T; exit. If t < T, go to step 3.\n\nFigure 2: Non-parametric Patch Appearance Modelling-Matching Algorithm\n\n\f\nFigure 3: Left: Segment-adaptive random patch sampling from an image with known figure/ground labels.\nGreen dots are samples for background; dark brown dots are samples for foreground. Right: Segment-adaptive random patch sampling from a new image for figure/ground classification, shown as blue dots.\n\n3.1 Sample Random Image Patches We first employ an image segmentation algorithm2 [6] to pre-segment all the images or video frames in our experiments. A typical segmentation result is shown in Figure 1. We use X_t, t = 1, 2, ..., T to represent a sequence of video frames. Given an image segment, we formulate its representation as a distribution on the appearance variation over all possible extracted image patches inside the segment. To keep this representation to a manageable size, we approximate this distribution by sampling a random subset of patches.\nWe denote an image segment as S_i, with S^F_i for a foreground segment and S^B_i for a background segment, where i is the index of the (foreground/background) image segment within an image. Accordingly, P_i, P^F_i and P^B_i represent sets of random image patches sampled from S_i, S^F_i and S^B_i respectively. The cardinality N_i of an image segment S_i generated by [6] typically ranges from 50 to thousands of pixels. However, small and large superpixels are expected to have roughly the same degree of uniformity. Therefore the sampling rate λ_i of S_i, defined as λ_i = size(P_i)/N_i, should decrease with increasing N_i. For simplicity, we keep λ_i constant for all superpixels, unless N_i is above a predefined threshold τ (typically 2500~3000), above which size(P_i) is held fixed. 
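The segment-adaptive sampling rate just described can be sketched as follows. The rate 0.06 and threshold 2500 follow the settings reported in the paper; the function itself and its names are our own illustration.

```python
def num_patch_samples(segment_size, rate=0.06, threshold=2500):
    # Constant sampling rate for ordinary segments; once a segment
    # exceeds the threshold, the sample count is held fixed (section 3.1),
    # so very large superpixels are sampled proportionally more sparsely.
    return round(rate * min(segment_size, threshold))
```

A 100-pixel segment thus receives 6 samples, while every segment above 2500 pixels receives the same capped count.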
This sampling adaptivity is illustrated in Figure 3. Notice that large image segments have much more sparsely sampled patches than small image segments. From our experiments, this adaptive spatial sampling strategy is sufficient to represent image segments of different sizes.\n\n3.2 Label Segments by Aggregating Over Random Patches For an image segment S_i from a new frame to be classified, we again first sample a set of random patches P_i as its representative set of appearance samples. For each patch p ∈ P_i, we calculate its distances d^F_p, d^B_p or matching scores m^F_p, m^B_p towards the foreground and background appearance models respectively, as described in Section 2. The decision of assigning S_i to foreground or background is an aggregation process over all {d^F_p, d^B_p} or {m^F_p, m^B_p}, p ∈ P_i. Since P_i is considered a set of i.i.d. samples of the appearance distribution of S_i, we use the average of {d^F_p, d^B_p} or {m^F_p, m^B_p} (i.e., first-order statistics) as its distances D^F_{P_i}, D^B_{P_i} or fitness values M^F_{P_i}, M^B_{P_i} with respect to the foreground/background models. In terms of distances {d^F_p, d^B_p}, D^F_{P_i} = mean_{p∈P_i}(d^F_p) and D^B_{P_i} = mean_{p∈P_i}(d^B_p). The segment's foreground/background fitness is then set as the inverse of the distances: M^F_{P_i} = 1/D^F_{P_i} and M^B_{P_i} = 1/D^B_{P_i}. In terms of KDE matching scores {m^F_p, m^B_p}, M^F_{P_i} = mean_{p∈P_i}(m^F_p) and M^B_{P_i} = mean_{p∈P_i}(m^B_p). Finally, S_i is classified as foreground if M^F_{P_i} > M^B_{P_i}, and vice versa. The median robust operator can also be employed in our experiments, without noticeable difference in performance. Another choice is to classify each p ∈ P_i from m^F_p and m^B_p, then take the majority vote over patches as the foreground/background decision for S_i. 
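The distance-based aggregation rule above can be sketched as follows (a minimal sketch under the same notation; NumPy and the function name are our own choices):

```python
import numpy as np

def classify_segment(d_F, d_B):
    # Segment-level decision of section 3.2: the fitness M is the inverse
    # of the mean per-patch nearest-neighbor distance to each bag, and the
    # segment takes the label with the larger fitness.
    M_F = 1.0 / np.mean(d_F)
    M_B = 1.0 / np.mean(d_B)
    return 'foreground' if M_F > M_B else 'background'
```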
The performance of this voting scheme is similar to that of the mean and median.\n2 Because we are not focused on image segmentation algorithms, we choose Felzenszwalb's segmentation code, which generates good results and is publicly available at http://people.cs.uchicago.edu/pff/segment/.\n\n\f\n3.3 Construct a Robust Online Nonparametric Foreground/Background Appearance Model with Temporal Adaptation From sets of random image patches extracted from superpixels with known figure/ground labels, two foreground/background \"bags of patches\" are composed. The bags are the non-parametric form of the foreground/background appearance distributions. When we intend to \"track\" the figure/ground model sequentially through a sequence, these models need to be updated by integrating new image patches extracted from new video frames. However, the size (the number of patches) of each bag will become unacceptably large if we do not also remove some redundant information over time. More importantly, imperfect segmentation results from [6] can cause inaccurate segment-level figure/ground labels. For robust image patch level appearance modeling of Ω_t, we propose a novel bidirectional consistency check and resampling strategy to tackle various noise and labelling uncertainties.\nMore precisely, we classify newly extracted image patches {P_t} as {P^F_t} or {P^B_t} according to Ω^{F|B}_{t-1}, and reject ambiguous patch samples whose distances d^F_p, d^B_p towards the respective Ω^{F|B}_{t-1} have no good contrast (simply, the ratio between d^F_p and d^B_p falls into the range 0.8 to 1/0.8). We further sort the distance list of the newly classified foreground patches {P^F_t} towards Ω^F_{t-1}, filtering out image patches at the top of the list, which have too-large distances and are probably outliers, and ones at the bottom of the list, which have too-small distances and probably contain appearances redundant with Ω^F_{t-1} (see footnote 3). We perform the same process with {P^B_t} according to Ω^B_{t-1}. 
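The ratio-based ambiguity test in this consistency check can be sketched as follows. Only the 0.8 to 1/0.8 acceptance band comes from the text; the input format and function name are our own illustrative assumptions.

```python
def filter_ambiguous(patch_distances, low=0.8):
    # Keep only patches whose foreground/background distances show good
    # contrast: reject a patch when d_F / d_B falls inside (low, 1/low).
    high = 1.0 / low
    kept = []
    for patch, d_F, d_B in patch_distances:
        if not (low < d_F / d_B < high):
            kept.append(patch)
    return kept
```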
Then the filtered {P_t} are integrated into Ω^{F|B}_{t-1}, and we evaluate the probability of survival p_s for each patch inside the augmented Ω^{F|B}_{t-1} against the original unprocessed {P_t} with their labels4. Next, we cluster all image patches of Ω^{F|B}_{t-1} into k partitions [8], and randomly resample image patches within each partition. This is roughly equivalent to finding the modes of an arbitrary distribution and sampling from each mode. Ideally, the resampling rate should decrease with increasing partition size, similar to the segment-wise sampling rate λ. For simplicity, we define the resampling rate as a constant value for all partitions, and set a threshold on the minimal required size5 of partitions after resampling. If we performed resampling directly over patches without partitioning, some modes of the appearance distribution might be mistakenly removed. This strategy represents all partitions with a sufficient number of image patches, regardless of their different sizes. In all, we resample the image patches of Ω^{F|B}_{t-1}, according to the normalized product of the probability of survival p_s and the partition-wise sampling rate, to generate Ω^{F|B}_t. By approximately fixing the expected bag model size, the number of image patches from a given frame X_t remaining in the bag decays exponentially in time. The problem of partitioning image patches in the bag can be formulated as the NP-hard k-center problem. The definition of k-center is as follows: given a data set of n points and a predefined cluster number k, find a partition of the points into k subgroups P_1, P_2, ..., P_k and data centers c_1, c_2, ..., c_k that minimize the maximum radius of the clusters, max_i max_{p∈P_i} ||p - c_i||, where i is the index of clusters. Gonzalez [8] proposed an efficient greedy algorithm, farthest-point clustering, which was proved to give an approximation factor of 2 of the optimum. 
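Gonzalez's farthest-point clustering, summarized next, admits a compact sketch. In this minimal NumPy sketch we seed with the first point instead of a random one, so the result is deterministic; the variable names are our own.

```python
import numpy as np

def farthest_point_centers(points, k):
    # Greedy k-center: repeatedly add the point farthest from the
    # current center set, where the distance of a point to the set is
    # its distance to the nearest already-chosen center.
    pts = np.asarray(points, dtype=float)
    centers = [0]                                  # deterministic seed
    d = np.linalg.norm(pts - pts[0], axis=1)       # distances to center set
    for _ in range(1, k):
        nxt = int(d.argmax())                      # farthest remaining point
        centers.append(nxt)
        d = np.minimum(d, np.linalg.norm(pts - pts[nxt], axis=1))
    return centers
```

The returned indices are the cluster centers; assigning each point to its nearest center then yields the k partitions.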
The algorithm operates as follows: pick a random point p_1 as the first cluster center and add it to the center set C; for iterations i = 2, ..., k, find the point p_i farthest from the current center set C, where the distance to the set is d(p_i, C) = min_{c∈C} ||p_i - c||, and add p_i to C; finally, assign each data point to its nearest center and recompute the means of the clusters in C. Compared with the popular k-means algorithm, this algorithm is computationally efficient and theoretically bounded6. In this paper, we employ the Euclidean distance between an image patch and a cluster center, using the raw RGB intensity vector or the feature representations discussed in section 2.\n3 Simply, we reject patches with distances d^F_p that are larger than mean(d^F_p) + α·std(d^F_p) or smaller than mean(d^F_p) - α·std(d^F_p), where α controls the range of accepted patch samples of Ω^F_{t-1}, called the model rigidity.\n4 For example, we compute the distance of each patch in Ω^F_{t-1} to {P^F_t}, and convert these distances to survival probabilities using an exponential function over negative covariance-normalized distances. Patches with smaller distances have higher survival chances during resampling, and vice versa. We perform the same process with Ω^B_{t-1} according to {P^B_t}.\n5 All image patches from partitions that are already smaller than the threshold are kept during resampling.\n6 The random initialization of all k centers and the local iterative smoothing process in k-means, which are time-consuming in high-dimensional spaces and can converge to an undesirable local minimum, are avoided.\n\n\f\n4 Experiments\nWe have evaluated the image patch representations described in Section 2 for figure/ground mapping between pairs of images and on video sequences taken with both static and moving cameras. Here we summarize our results. 
4.1 Evaluation on Object-level Figure/Ground Image Mapping We first evaluate our algorithm on object-level figure/ground mapping between pairs of images under eight configurations of different image patch representations and matching criteria. They are as follows: nearest-neighbor distance matching on the image patch's mean color vector (MCV); on the raw color intensity vector with regular patch scanning over the image (RCV) or with segment-adaptive patch sampling (SCV); on color + filter bank responses (CFB); on color + Haralick texture descriptors (CHA); on the PCA feature vector (PCA); on the NDA feature vector (NDA); and kernel density evaluation on PCA features (KDE). In general, 8000~12000 random patches are sampled per image. There is no apparent difference in classification accuracy for patch sizes ranging from 9 to 15 pixels and sampling rates from 0.02 to 0.10. The PCA/NDA feature vector has 20 dimensions, and KDE is evaluated on the first 3~6 PCA features. Because the foreground figure has fewer pixels than the background, we conservatively measure classification accuracy from the foreground's detection precision and recall on pixels. Precision is the ratio of the number of correctly detected foreground pixels to the total number of detected foreground pixels; recall is the ratio of the number of correctly detected foreground pixels to the total number of foreground pixels in the image. The patch size is 11 by 11 pixels, and the segment-wise patch sampling rate λ is fixed at 0.06, unless stated otherwise. Using 40 pairs of (720 x 480) images with labelled figure/ground segmentations, we compare the average classification accuracies in Table 1.\n\n           MCV   RCV   SCV   CFB   CHA   PCA   NDA   KDE\nPrecision  0.46  0.81  0.97  0.92  0.89  0.93  0.96  0.69\nRecall     0.28  0.89  0.95  0.85  0.81  0.85  0.87  0.98\n\nTable 1: Evaluation of classification accuracy (ratio). The first row is precision; the second row is recall. 
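The pixel-level precision and recall defined above can be computed as follows (a minimal sketch; representing the foreground masks as sets of pixel coordinates is our own choice):

```python
def precision_recall(detected, truth):
    # detected, truth: sets of foreground pixel coordinates.
    correct = len(detected & truth)       # correctly detected foreground pixels
    precision = correct / len(detected)   # fraction of detections that are right
    recall = correct / len(truth)         # fraction of true foreground found
    return precision, recall
```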
For figure/ground extraction accuracy, SCV has the best classification ratio, using the raw color intensity vector without any dimension reduction. MCV has the worst accuracy, which shows that pixel color alone gives poor separability between figure and ground in our data set. The four feature-based representations with reduced dimensions, CFB, CHA, PCA and NDA, have similar performance, with NDA slightly better than the others. KDE tends to be biased towards the foreground class because the background usually has a wider, flatter density distribution. The superiority of SCV over RCV demonstrates that our segment-wise random patch sampling strategy is more effective at classifying image segments than regularly scanning the image, even with more samples. As shown in Figure 4 (b), some small or irregularly-shaped image segments do not have enough patch samples to produce stable classifications.\n\n(a) MCV\n\n(b) RCV\n\n(c) SCV\n\n(d) CFB\n\n(e) CHA\n\n(f) PCA\n\n(g) NDA\n\n(h) KDE\n\nFigure 4: An example of evaluation on object-level figure/ground image mapping. The labeled figure image\nsegments are coded in blue.\n\n\f\n4.2 Figure/Ground Segmentation Tracking with a Moving Camera From Figure 4 (h), we see that KDE tends to produce some false positives for the foreground. However, this problem can be effectively tackled by multiplying the appearance KDE by a spatial prior, which is also formulated as a KDE function of image patch coordinates. For videos with complex, appearance-changing figure/ground, imperfect segmentation results [6] are not completely avoidable and can cause superpixel-based figure/ground labelling errors. However, our robust bidirectional consistency check and resampling strategy, as shown below, enables us to successfully track dynamic figure/ground segmentations in challenging scenarios, with outlier rejection, model rigidity control and temporal adaptation (as described in section 3.3). 
Karsten.avi shows a person walking in an uncontrolled indoor environment while tracked with a handheld camera. After we manually label frame 1, the foreground/background appearance model starts to develop, classify new frames and get updated online. Eight example tracking frames are shown in Figure 5. Notice the significant non-rigid deformations and large scale changes of the walking person, and that the original background is completely replaced after the subject changes direction. In frame 258, we manually eliminate some false positives of the figure. The reason for this failure is that some image regions which were behind the subject begin to appear as the person walks from the left to the center of the image (starting from frame 220). Compared to the online foreground/background appearance models by then, these newly appearing image regions have quite different appearance from both the foreground and the background. Therefore the foreground's spatial prior dominates the classification. We leave this issue for future work.\n\n(a) 12#\n\n(b) 91#\n\n(c) 155#\n\n(d) 180#\n\n(e) 221#\n\n(f) 257#\n\n(g) 308#\n\n(h) 329#\n\nFigure 5: Eight example frames (720 by 480 pixels) from the video sequence Karsten.avi of 330 frames. The\nvideo is captured using a handheld Panasonic PV-GS120 in standard NTSC format. Notice the significant non-rigid deformations and large scale changes of the walking person, and that the original background is completely replaced after the subject changes direction. The red pixels are on the boundary of segments; the tracked image segments associated with the foreground walking person are coded in blue.\n\n4.3 Non-rigid Object Tracking from Surveillance Videos We can also apply our nonparametric treatment of dynamic random patches (Figure 2) to tracking non-rigid objects of interest in surveillance videos. 
The difficulty is that surveillance cameras normally capture small non-rigid figures, such as a walking person or a moving car, in low-contrast and low-resolution footage. To adapt our method to this problem, we make the following modifications. Because our task changes to localizing the figure object automatically over time, we can simply model the figure/ground regions using rectangles, so no pre-segmentation [6] is needed. Random figure/ground patches are then extracted from the image regions within these two rectangles. Using the two sets of random image patches, we train an online classifier for the figure/ground classes at each time step, generate a figure appearance confidence map for the next frame and, similarly to [1], apply mean shift [4] to find the next object location by mode seeking. In our solution, the temporal evolution of the dynamic image patch appearance models is carried out by the bidirectional consistency check and resampling described in section 3.3. Whereas [1] uses boosting for both temporal appearance model updating and classification, our online binary classification training can employ any off-the-shelf classifier, such as k-Nearest Neighbors (KNN) or a support vector machine (SVM). Our results compare favorably with the state-of-the-art algorithms [1, 9], even under more challenging scenarios.\n\n5 Conclusion and Discussion\nAlthough quite simple both conceptually and computationally, our algorithm for dynamic foreground/background extraction in images and videos using non-parametric appearance\n\n\f\nmodels produces very promising and reliable results in a wide variety of circumstances. For tracking figure/ground segments, to the best of our knowledge, it is the first attempt to solve this difficult \"video matting\" problem [15, 25] by robust and automatic learning. For surveillance video tracking, our results are very competitive with the state of the art [1, 9] under even more challenging conditions. 
Our approach does not depend on an image segmentation algorithm that fully respects the boundaries of the foreground object. Our novel bidirectional consistency check and resampling process has been demonstrated to be robust and adaptive. We leave the exploration of supervised dimension reduction and density modeling techniques for image patch sets, optimal random patch sampling strategies, and self-tuned selection of the optimal image patch size as future work. In this paper, we extract the foreground/background by classifying individual image segments. Modeling their pairwise spatial relationships as well might improve figure/ground segmentation accuracy; this could be done using generative or discriminative random field (MRF/DRF) models or boosting on logistic classifiers [11]. In this paper, we focus on learning binary dynamic appearance models, assuming the figure/ground appearance distributions are somewhat separable. Other cues, such as object shape regularization and motion dynamics for tracking, can be combined to improve performance.\n\nReferences\n[1] S. Avidan, Ensemble Tracking, CVPR, 2005.\n[2] Y. Boykov and M. Jolly, Interactive Graph Cuts for Optimal Boundary and Region Segmentation of Objects in N-D Images, ICCV, 2001.\n[3] M. Bressan and J. Vitria, Nonparametric discriminant analysis and nearest neighbor classification, Pattern Recognition Letters, 2003.\n[4] D. Comaniciu and P. Meer, Mean shift: A robust approach toward feature space analysis, IEEE Trans. PAMI, 2002.\n[5] A. Efros and T. Leung, Texture Synthesis by Non-parametric Sampling, ICCV, 1999.\n[6] P. Felzenszwalb and D. Huttenlocher, Efficient Graph-Based Image Segmentation, IJCV, 2004.\n[7] K. Fukunaga and J. Mantock, Nonparametric discriminant analysis, IEEE Trans. PAMI, Nov. 1983.\n[8] T. Gonzalez, Clustering to minimize the maximum intercluster distance, Theoretical Computer Science, 38:293-306, 1985.\n[9] B. Han and L. 
Davis, On-Line Density-Based Appearance Modeling for Object Tracking, ICCV, 2005.\n[10] R. Haralick, K. Shanmugam and I. Dinstein, Texture features for image classification, IEEE Trans. SMC, 1973.\n[11] D. Hoiem, A. Efros and M. Hebert, Automatic Photo Pop-up, SIGGRAPH, 2005.\n[12] A. Ihler, Kernel Density Estimation Matlab Toolbox, http://ssg.mit.edu/~ihler/code/kde.shtml.\n[13] T. Leung and J. Malik, Representing and Recognizing the Visual Appearance of Materials using Three-Dimensional Textons, IJCV, 2001.\n[14] Y. Li, J. Sun, C.-K. Tang and H.-Y. Shum, Lazy Snapping, SIGGRAPH, 2004.\n[15] Y. Li, J. Sun and H.-Y. Shum, Video Object Cut and Paste, SIGGRAPH, 2005.\n[16] D. Lowe, Distinctive image features from scale-invariant keypoints, IJCV, 2004.\n[17] L. Lu, K. Toyama and G. Hager, A Two Level Approach for Scene Recognition, CVPR, 2005.\n[18] J. Malik, S. Belongie, T. Leung and J. Shi, Contour and Texture Analysis for Image Segmentation, IJCV, 2001.\n[19] D. Martin, C. Fowlkes and J. Malik, Learning to Detect Natural Image Boundaries Using Local Brightness, Color, and Texture Cues, IEEE Trans. PAMI, 26(5):530-549, May 2004.\n[20] A. Mittal and N. Paragios, Motion-based Background Subtraction using Adaptive Kernel Density Estimation, CVPR, 2004.\n[21] E. Nowak, F. Jurie and B. Triggs, Sampling Strategies for Bag-of-Features Image Classification, ECCV, 2006.\n[22] X. Ren and J. Malik, Learning a classification model for segmentation, ICCV, 2003.\n[23] C. Rother, V. Kolmogorov and A. Blake, Interactive Foreground Extraction using Iterated Graph Cuts, SIGGRAPH, 2004.\n[24] Y. Sheikh and M. Shah, Bayesian Object Detection in Dynamic Scenes, CVPR, 2005.\n[25] J. Wang, P. Bhat, A. Colburn, M. Agrawala and M. Cohen, Interactive Video Cutout, SIGGRAPH, 2005.\n\n\f\n", "award": [], "sourceid": 3016, "authors": [{"given_name": "Le", "family_name": "Lu", "institution": null}, {"given_name": "Gregory", "family_name": "Hager", "institution": null}]}