{"title": "Coresets for k-Segmentation of Streaming Data", "book": "Advances in Neural Information Processing Systems", "page_first": 559, "page_last": 567, "abstract": "Life-logging video streams, financial time series, and Twitter tweets are a few examples of high-dimensional signals over practically unbounded time. We consider the problem of computing optimal segmentation of such signals by k-piecewise linear function, using only one pass over the data by maintaining a coreset for the signal. The coreset enables fast further analysis such as automatic summarization and analysis of such signals. A coreset (core-set) is a compact representation of the data seen so far, which approximates the data well for a specific task -- in our case, segmentation of the stream. We show that, perhaps surprisingly, the segmentation problem admits coresets of cardinality only linear in the number of segments k, independently of both the dimension d of the signal, and its number n of points. More precisely, we construct a representation of size O(klog n /eps^2) that provides a (1+eps)-approximation for the sum of squared distances to any given k-piecewise linear function. Moreover, such coresets can be constructed in a parallel streaming approach. Our results rely on a novel eduction of statistical estimations to problems in computational geometry. We empirically evaluate our algorithms on very large synthetic and real data sets from GPS, video and financial domains, using 255 machines in Amazon cloud.", "full_text": "Coresets for k-Segmentation of Streaming Data\n\nGuy Rosman \u2217\u2020\nCSAIL, MIT\n\n32 Vassar St., 02139,\nCambridge, MA USA\nrosman@csail.mit.edu\n\nMikhail Volkov \u2020\nCSAIL, MIT\n\n32 Vassar St., 02139,\nCambridge, MA USA\nmikhail@csail.mit.edu\n\nDanny Feldman \u2020\n\nCSAIL, MIT\n\n32 Vassar St., 02139,\nCambridge, MA USA\ndannyf@csail.mit.edu\n\nJohn W. 
Fisher III\n\nCSAIL, MIT\n\n32 Vassar St., 02139,\nCambridge, MA USA\n\ufb01sher@csail.mit.edu\n\nDaniela Rus \u2020\nCSAIL, MIT\n\n32 Vassar St., 02139,\nCambridge, MA USA\n\nrus@csail.mit.edu\n\nAbstract\n\nLife-logging video streams, \ufb01nancial time series, and Twitter tweets are a few ex-\namples of high-dimensional signals over practically unbounded time. We consider\nthe problem of computing optimal segmentation of such signals by a k-piecewise\nlinear function, using only one pass over the data by maintaining a coreset for the\nsignal. The coreset enables fast further analysis such as automatic summarization\nand analysis of such signals.\nA coreset (core-set) is a compact representation of the data seen so far, which\napproximates the data well for a speci\ufb01c task \u2013 in our case, segmentation of the\nstream. We show that, perhaps surprisingly, the segmentation problem admits\ncoresets of cardinality only linear in the number of segments k, independently\nof both the dimension d of the signal, and its number n of points. More pre-\ncisely, we construct a representation of size O(k log n/\u03b52) that provides a (1 + \u03b5)-\napproximation for the sum of squared distances to any given k-piecewise linear\nfunction. Moreover, such coresets can be constructed in a parallel streaming ap-\nproach. Our results rely on a novel reduction of statistical estimations to problems\nin computational geometry. We empirically evaluate our algorithms on very large\nsynthetic and real data sets from GPS, video and \ufb01nancial domains, using 255\nmachines in Amazon cloud.\n\nIntroduction\n\n1\nThere is an increasing demand for systems that learn long-term, high-dimensional data streams.\nExamples include video streams from wearable cameras, mobile sensors, GPS, \ufb01nancial data and\nbiological signals. 
In each, a time instance is represented as a high-dimensional feature, for example location vectors, stock prices, or image content feature histograms.

We develop real-time algorithms for summarization and segmentation of large streams, by compressing the signals into a compact meaningful representation. This representation can then be used to enable fast analyses such as summarization, state estimation and prediction. The proposed algorithms support data streams that are too large to store in memory, afford easy parallelization, and are generic in that they apply to different data types and analyses. For example, the summarization of wearable video data can be used to efficiently detect different scenes and important events, while collecting GPS data for citywide drivers can be used to learn weekly transportation patterns and characterize driver behavior.

* Guy Rosman was partially supported by the MIT-Technion fellowship.
† Support for this research has been provided by the Hon Hai/Foxconn Technology Group and MIT Lincoln Laboratory. The authors are grateful for this support.

In this paper we use a data reduction technique called coresets [1, 9] to enable rapid content-based segmentation of data streams. Informally, a coreset D is a problem-dependent compression of the original data P, such that running algorithm A on the coreset D yields a result A(D) that provably approximates the result A(P) of running the algorithm on the original data. If the coreset D is small and its construction is fast, then computing A(D) is fast even if computing the result A(P) on the original data is intractable.
See Definition 2 for the specific coreset which we develop in this paper.

1.1 Main Contribution

The main contributions of the paper are: (i) A new coreset for the k-segmentation problem (as given in Subsection 1.2) that can be computed in one pass over streaming data (with O(log n) insertion time/space) and supports distributed computation. Unlike previous results, the insertion time per new observation and the required memory are only linear in both the dimension of the data and the number k of segments. This result is summarized in Theorem 4, and proven in the supplementary material. Our algorithm is scalable, parallelizable, and provides a provable approximation of the cost function. (ii) Using this novel coreset, we demonstrate a new system for segmentation and compression of streaming data. Our approach allows real-time summarization of large-scale video streams in a way that preserves the semantic content of the aggregated video sequences, and is easily extendable. (iii) Experiments that demonstrate our approach on various data types: video, GPS, and financial data. We evaluate performance with respect to output size, running time and quality, and compare our coresets to uniform and random sample compression. We demonstrate the scalability of our algorithm by running our system on an Amazon cluster with 255 machines with near-perfect parallelism, as demonstrated on 256,000 frames. We also demonstrate the effectiveness of our algorithm by running several analysis algorithms on the computed coreset instead of the full data. Our implementation summarizes the video in less than 20 minutes, and allows real-time segmentation of video streams at 30 frames per second on a single machine.

Streaming and Parallel computations. Perhaps the most important property of coresets is that even an efficient off-line construction implies a fast construction that can be computed (a) embarrassingly in parallel (e.g. on clouds and GPUs), and (b) in the streaming model, where the algorithm passes only once over the (possibly unbounded) streaming data. Only a small amount of memory and update time (~ log n) per new point insertion is allowed, where n is the number of observations so far.

1.2 Problem Statement

The k-segment mean problem optimally fits a given discrete time signal of n points by a set of k linear segments over time, where k ≥ 1 is a given integer. That is, we wish to partition the signal into k consecutive time intervals such that the points in each time interval lie close to a single line; see Fig. 1 (left) and the following formal definition.

We make the following assumptions with respect to the data: (a) the data is represented by a feature space that suitably represents its underlying structure; (b) the content of the data includes at most k segments that we wish to detect automatically (examples include scenes in a video, phases in the market as seen in stock behaviour, etc.); and (c) the dimensionality of the feature space is often quite large (from tens to thousands of features), with the specific choice of features being application dependent; several examples are given in Section 3. This motivates the following problem definition.

Definition 1 (k-segment mean). A set P in R^(d+1) is a signal if P = {(1, p1), (2, p2), ..., (n, pn)}, where pi ∈ R^d is the point at time index i for every i ∈ [n] = {1, ..., n}. For an integer k ≥ 1, a k-segment is a k-piecewise linear function f : R → R^d that maps every time i ∈ R to a point f(i) in R^d. The fitting error at time i is the squared distance between pi and its corresponding projected point f(i) on the k-segment.
The fitting cost of f to P is the sum of these squared distances,

    cost(P, f) = Σ_{i=1}^{n} ‖pi − f(i)‖²,    (1)

where ‖·‖ denotes the Euclidean norm. The function f is a k-segment mean of P if it minimizes cost(P, f).

For the case k = 1, the 1-segment mean is the solution to the linear regression problem. If we restrict each of the k segments to be horizontal, then each segment will be the mean height of the corresponding input points. The resulting problem is similar to the k-means problem, except that each of the Voronoi cells is forced to be a single region in time instead of a nearest-center assignment, i.e. the regions are contiguous.

Figure 1: For every k-segment f, the cost of the input points (red) is approximated by the cost of the coreset (dashed blue lines). Left: An input signal and a 3-segment f (green), along with the regression distance to one point (dashed black vertical lines). The cost of f is the sum of these squared distances from all the input points. Right: The coreset consists of the projection of the input onto a few segments, with an approximate per-segment representation of the data.

In this paper we are interested in a compact representation D that approximates cost(P, f) for every k-segment f, via a suitable cost function cost′(D, f) defined below. We call such a set D a (k, ε)-coreset, according to the following definition.

Definition 2 ((k, ε)-coreset). Let P ⊆ R^(d+1), let k ≥ 1 be an integer, and let ε > 0 be small. A set D with a cost function cost′(·) is a (k, ε)-coreset for P if for every k-segment f we have

    (1 − ε) cost(P, f) ≤ cost′(D, f) ≤ (1 + ε) cost(P, f).

We present a new coreset construction with provable approximations for a family of natural k-segmentation optimization problems.
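To make the fitting cost in Eq. (1) concrete, the following sketch (our own illustration; representing a k-segment as a list of per-interval line coefficients is an assumption for this example, not the paper's data structure) evaluates cost(P, f):

```python
import numpy as np

def k_segment_cost(P, intervals):
    """cost(P, f) = sum_i ||p_i - f(i)||^2 (Eq. 1), where the k-segment f is
    given as a list of (begin, end, a, c): on times begin..end, f(t) = a*t + c.
    P is an n x d array; row i-1 holds the point p_i at time i."""
    total = 0.0
    for b, e, a, c in intervals:
        a, c = np.asarray(a, float), np.asarray(c, float)
        t = np.arange(b, e + 1)
        f_t = np.outer(t, a) + c          # f(t) on this interval, one row per time
        total += ((P[b - 1:e] - f_t) ** 2).sum()
    return total
```

For a signal lying exactly on a 2-segment the cost is zero; perturbing one point by δ raises the cost by δ².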
This is the first such construction whose running time is linear in the number of data points n, their dimensionality d, and the number k of desired segments. The resulting coreset consists of O(dk/ε²) points and approximates the sum of squared distances for any k-piecewise linear function (k segments over time). In particular, we can use this coreset to compute the k-piecewise linear function that minimizes the sum of squared distances to the input points, given arbitrary constraints or weights (priors) on the desired segmentation. Such a generalization is useful, for example, when we are already given a set of candidate segments (e.g. maps or distributions of images) and wish to choose the right k segments that approximate the input signal.

Previous results on coresets for k-segmentation achieved running time or coreset size that are at least quadratic in d and cubic in k [12, 11]. As such, they cannot be used with very large data, for example long streaming video data, which is usually high-dimensional and contains a large number of scenes. This prior work is based on non-uniform sampling of the input data. In order to achieve our results, we had to replace the sampling approach with a new set of deterministic algorithms that carefully select the coreset points.

1.3 Related Work

Our work builds on several important contributions in coresets, k-segmentation, and video summarization.

Approximation Algorithms. One of the main challenges in providing provable guarantees for segmentation w.r.t. segmentation size and quality is global optimization.
Current provable algorithms for data segmentation are cubic-time in the number of desired segments, quadratic in the dimension of the signal, and cannot handle both parallel and streaming computation as desired for big data. The closest work that provides provable approximations is that of [12].

Several works attempt to summarize high-dimensional data streams in various application domains. For example, [19] describe the video stream as a high-dimensional stream and run approximate clustering algorithms such as k-center on the points of the stream; see [14] for surveys on stream summarization in robotics. The resulting k centers of the clusters comprise the video summarization. The main disadvantages of these techniques are: (i) they partition the data stream into k clusters that do not provide a k-segmentation over time; (ii) computing the k-center takes time exponential in both d and k [16]; in [19] heuristics were used for dimension reduction, and in [14] a 2-approximation was suggested for the off-line case, which was replaced by a heuristic for streaming; (iii) in the context of analysis of video streams, they use a feature space that is often simplistic and does not utilize the large available data efficiently. In our work the feature space can be updated on-line using a coreset for k-means clustering of the features seen so far.

k-segment Mean. The k-segment mean problem can be solved exactly using dynamic programming [4]. However, this takes O(dn²k) time and O(dn²) memory, which is impractical for streaming data. In [15, Theorem 8] a (1 + ε)-approximation was suggested using O(n(dk)⁴ log n / ε) time. While the algorithm in [15] supports efficient streaming, it is not parallel. Since it returns a k-segmentation and not a coreset, it cannot be used to solve other optimization problems with additional priors or constraints. In [12] an improved algorithm that takes O(nd²k + ndk³) time was suggested.
The algorithm is based on a coreset of size O(dk³/ε³). Unlike the coreset in this paper, the running time of [12] is quadratic in d and cubic in k. The result in [12] is the last in a line of research on the k-segment mean problem and its variations; see the survey in [11, 15, 13]. The application was segmentation of a 3-dimensional GPS signal (time, latitude, longitude). The coreset construction in [12] and previous papers takes time and memory that is quadratic in the dimension d and cubic in the number of segments k. In contrast, our coreset construction takes time only linear in both k and d. While recent results suggest running time linear in n, and space that is near-logarithmic in n, the computation time is still cubic in k, the number of segments, and quadratic in d, the dimension. Since the number k represents the number of scenes, and d is the feature dimensionality, this complexity is prohibitive.

Video Summarization. One motivating application for us is online video summarization, where the input video stream can be represented by a set of points over time in an appropriate feature space. Every point in the feature space represents a frame, and we aim to produce a compact approximation of the video in terms of this space and its Euclidean norm. Application-aware summarization and analysis of ad-hoc video streams is a difficult task, with many attempts aimed at tackling it from various perspectives [5, 18, 2]. The problem is highly related to video action classification, scene classification, and object segmentation [18]. Applications where life-long video stream analysis is crucial include mapping and navigation, medical/assistive interaction, and augmented-reality applications, among others. Our goal differs from video compression in that compression is geared towards preserving image quality for all frames, and therefore stores semantically redundant content. Instead, we seek a summarization approach that allows us to represent the video content by a set of key segments, for a given feature space.

This paper is organized as follows. We describe the k-segmentation problem, the proposed coresets, their construction, and their properties in Section 2. We then present experiments that validate the proposed approach on data collected from GPS and wearable web-cameras, and demonstrate the aggregation and analysis of multiple long sequences of wearable user video, in Section 3. Section 4 concludes the paper and discusses future directions.

2 A Novel Coreset for k-segment Mean

The key insights for constructing the k-segment coreset are: (i) for the case k = 1, a 1-segment coreset can be easily obtained using SVD; (ii) for the general case k ≥ 2, we can partition the signal into a suitable number of intervals and compute a 1-segment coreset for each interval. If the number of intervals and their lengths are carefully chosen, most of them will be well approximated by every k-segmentation, and the remaining intervals will not incur a large error contribution.

Based on these observations, we propose the following construction. 1) Estimate the signal's complexity, i.e., the approximate fitting cost of its k-segment mean; we denote this step as a call to the algorithm BICRITERIA. 2) Given a complexity measure for the data, approximate the data by a set of segments with auxiliary information; this is the proposed coreset, denoted as the output of algorithm BALANCEDPARTITION.

We then prove that the resulting coreset allows us to approximate, with guarantees, the fitting cost of any k-segmentation of the data, as well as to compute an optimal k-segmentation. We state the main result in Theorem 4, and describe the proposed algorithms as Algorithms 1 and 2.
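The two-stage construction above can be sketched in a simplified, merely illustrative form. The sketch below is our own: each interval stores only its least-squares line and boundaries (the real BALANCEDPARTITION additionally stores a (1, ε)-coreset per interval), and the complexity estimate σ is taken as given rather than computed by BICRITERIA:

```python
import numpy as np

def fit_line(T):
    """Least-squares 1-segment fit to points T = [(t, p)], for scalar p.
    Returns the line coefficients [slope, intercept] and the fit cost."""
    t = np.array([ti for ti, _ in T], dtype=float)
    X = np.column_stack([t, np.ones_like(t)])
    Y = np.array([p for _, p in T], dtype=float)
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ coef
    return coef, float((resid ** 2).sum())

def balanced_partition(P, sigma):
    """Greedy interval partition in the spirit of BALANCEDPARTITION: grow an
    interval until its 1-segment fitting cost exceeds sigma, then close the
    tuple (coef, begin, end) and start a new interval at the current point."""
    D, Q, begin = [], [], 1
    for i, p in enumerate(P, start=1):
        Q.append((i, p))
        coef, cost_Q = fit_line(Q)
        if cost_Q > sigma and len(Q) > 1:
            Q.pop()                        # point i opens the next interval
            coef, _ = fit_line(Q)
            D.append((coef, begin, i - 1))
            Q, begin = [(i, p)], i
    coef, _ = fit_line(Q)                  # flush the final interval
    D.append((coef, begin, len(P)))
    return D
```

On a 1D signal with two flat levels and a small σ, this produces exactly two tuples, one per level, each carrying the line fitted to its interval.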
We refer the reader to the supplementary material for further details and proofs.

2.1 Computing a k-Segment Coreset

We would like to compute a (k, ε)-coreset for our data. A (k, ε)-coreset D for a set P approximates the fitting cost of any query k-segment to P up to a small multiplicative error of 1 ± ε. We note that a (1, 0)-coreset can be computed using SVD; see the supplementary material for details and proof. However, for k > 1, we cannot approximate the data by a representative point set (we prove this in the supplementary material). Instead, we define a data structure D as our proposed coreset, and define a new cost function cost′(D, f) that approximates the cost of P to any k-segment f.

The set D consists of tuples of the type (C, g, b, e). Each tuple corresponds to a different time interval [b, e] in R and represents the set P(b, e) of points in this interval. Here g is the 1-segment mean of the data P in the interval [b, e], and the set C is a (1, ε)-coreset for P(b, e).

We note the following: 1) If all the points {f(t) | b ≤ t ≤ e} of the k-segment f lie on one linear segment over this time interval, then the cost from P(b, e) to f can be approximated well by C, up to a (1 + ε) multiplicative error. 2) If we project the points of P(b, e) onto their 1-segment mean g, then the projected set L of points approximates the cost of P(b, e) to f well, even if f corresponds to more than one segment in the time interval [b, e]; unlike the previous case, the error here is additive. 3) Since f is a k-segment, there are at most k − 1 time intervals that intersect more than one segment of f, so the overall additive error is small. This motivates the following definition of D and cost′.

Definition 3 (cost′(D, f)).
Let D = {(Ci, gi, bi, ei)}_{i=1}^m, where for every i ∈ [m] we have Ci ⊆ R^(d+1), gi : R → R^d and bi ≤ ei ∈ R. For a k-segment f : R → R^d and i ∈ [m], we say that Ci is served by one segment of f if {f(t) | bi ≤ t ≤ ei} is a linear segment. We denote by Good(D, f) ⊆ [m] the set of indices i such that Ci is served by one segment of f. We also define Li = {gi(t) | bi ≤ t ≤ ei}, the projection of Ci on gi. We define

    cost′(D, f) = Σ_{i ∈ Good(D,f)} cost(Ci, f) + Σ_{i ∈ [m] \ Good(D,f)} cost(Li, f).

Our coreset construction for general k > 1 is based on an input parameter σ > 0 such that for an appropriate σ the output is a (k, ε)-coreset; σ characterizes the complexity of the approximation. The BICRITERIA algorithm, given as Algorithm 1, provides us with such an approximation. Properties of this algorithm are described in the supplementary material.

Theorem 4. Let P = {(1, p1), ..., (n, pn)} such that pi ∈ R^d for every i ∈ [n]. Let D be the output of a call to BALANCEDPARTITION(P, ε, σ), and let f be the output of BICRITERIA(P, k); let σ = cost(P, f). Then D is a (k, ε)-coreset for P of size |D| = O(k log n / ε²), and it can be computed in O(dn/ε⁴) time.

Proof. We give a sketch of the proof; the full proof appears as Theorem 10 in the supplementary material, together with accompanying theorems. Lemma 8 states that given an estimate σ of the optimal segmentation cost, BALANCEDPARTITION(P, ε, σ) provides a (k, ε)-coreset of the data P. This hinges on the observation that given a fine enough segmentation of the time domain, for each segment we can approximate the data by an SVD with bounded error.
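The SVD-based approximation at the heart of the proof can be checked numerically. The sketch below is our own illustration of the (1, 0)-coreset idea (see the supplementary material for the formal statement): keeping only the (d+2) × (d+2) matrix ΣVᵀ of the matrix Z = [t, 1, P] reproduces the fitting cost of any 1-segment exactly, independently of n:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
t = np.arange(1, n + 1, dtype=float)
P = rng.normal(size=(n, d))

T = np.column_stack([t, np.ones(n)])   # time design matrix [t_i, 1]
Z = np.column_stack([T, P])            # n x (d+2)

# The (1,0)-coreset: Sigma * V^T from a thin SVD of Z, of size (d+2) x (d+2).
_, S, Vt = np.linalg.svd(Z, full_matrices=False)
C = S[:, None] * Vt

def line_cost_full(W):
    """cost of the 1-segment f(t) = W^T [t, 1] against all n points."""
    return np.linalg.norm(T @ W - P) ** 2

def line_cost_coreset(W):
    """Same cost, computed from the small matrix C only:
    with M = [W; -I_d] we have Z @ M = T @ W - P, and ||Z M||_F = ||C M||_F."""
    M = np.vstack([W, -np.eye(d)])
    return np.linalg.norm(C @ M) ** 2

W = rng.normal(size=(2, d))            # an arbitrary query 1-segment
assert np.allclose(line_cost_full(W), line_cost_coreset(W))
```

The equality holds because Z = UΣVᵀ with orthonormal U, so ‖ZM‖_F = ‖ΣVᵀM‖_F for every M; this is what makes the 1-segment coreset exact rather than approximate.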
This approximation is exact for 1-segments (see Claim 2 in the supplementary material), and can be bounded for k-segments because of the bounded number of segment intersections. According to Theorem 9 of the supplementary material, σ as computed by BICRITERIA(P, k) provides such an approximation.

Algorithm 1: BICRITERIA(P, k)
Input: A set P ⊆ R^(d+1) and an integer k ≥ 1.
Output: A bicriteria (O(log n), O(log n))-approximation to the k-segment mean of P.
1  if n ≤ 2k + 1 then
2      f := a 1-segment mean of P
3      return f
4  Set t1 ≤ ··· ≤ tn and p1, ..., pn ∈ R^d such that P = {(t1, p1), ..., (tn, pn)}
5  m := |{t ∈ R | (t, p) ∈ P}|
6  Partition P into 4k consecutive sets P1, ..., P4k ⊆ P such that for every i ∈ [4k − 1]:
   (i) |{t | (t, p) ∈ Pi}| = ⌊m/(4k)⌋, and (ii) if (t, p) ∈ Pi and (t′, p′) ∈ Pi+1 then t < t′
7  for i := 1 to 4k do
8      compute a 2-approximation gi to the 1-segment mean of Pi
9  Q := the union of the k + 1 signals Pi with the smallest value cost(Pi, gi) among i ∈ [4k]
10 h := BICRITERIA(P \ Q, k)  ; recurse on the segments that do not have a good approximation
11 Set f(t) := gi(t) if there exists (t, p) ∈ Pi such that Pi ⊆ Q, and f(t) := h(t) otherwise
12 return f

Algorithm 2: BALANCEDPARTITION(P, ε, σ)
Input: A set P = {(1, p1), ..., (n, pn)} in R^(d+1), an error parameter ε ∈ (0, 1/10) and σ > 0.
Output: A set D that satisfies Theorem 4.
1  Q := ∅; D := ∅; p_{n+1} := an arbitrary point in R^d
2  for i := 1 to n + 1 do
3      Q := Q ∪ {(i, pi)}  ; add the new point to the current tuple
4      f* := a linear approximation of Q; λ := cost(Q, f*)
5      if λ > σ or i = n + 1 then
6          T := Q \ {(i, pi)}  ; close the current tuple
7          C := a (1, ε/4)-coreset for T  ; approximate the points by a local representation
8          g := a linear approximation of T; b := i − |T|; e := i − 1  ; save the endpoints
9          D := D ∪ {(C, g, b, e)}  ; save the tuple
10         Q := {(i, pi)}  ; start a new tuple at the current point
11 return D

For efficient k-segmentation we run a k-segment mean algorithm on our small coreset instead of the original large input. Since the coreset is small, we can apply dynamic programming (as in [4]) in an efficient manner. In order to compute a (1 + ε)-approximation to the k-segment mean of the original signal P, it suffices to compute a (1 + ε)-approximation to the k-segment mean of the coreset, where cost is replaced by cost′. However, since D is not a simple signal, but a more involved data structure, it is not clear how to run existing algorithms on D. In the supplementary material we show how to apply such algorithms to our coresets. In particular, we can run naive dynamic programming [4] on the coreset and get a (1 + ε)-approximate solution in an efficient manner, as we summarize as follows.

Theorem 5. Let P be a d-dimensional signal. A (1 + ε)-approximation to the k-segment mean of P can be computed in O(ndk/ε + d(k log n / ε)^O(1)) time.

2.2 Parallel and Streaming Implementation

One major advantage of coresets is that they can be constructed in parallel as well as in a streaming setting. The main observation is that the union of coresets is a coreset: if a data set is split into subsets and we compute a coreset for every subset, then the union of the coresets is a coreset of the whole data set. This allows each machine to separately compute a coreset for a part of the data, with a central node that approximately solves the optimization problem; see [10, Theorem 10.1] for more details and a formal proof.
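The "union of coresets" principle can be illustrated with a toy exact summary (our own example, not the k-segment coreset itself): summarizing each chunk of a stream by (n, Σp, Σ‖p‖²) reproduces the 1-mean cost of the whole stream for any query point, and summaries of disjoint chunks merge by addition, which is exactly the compositional structure the streaming and parallel constructions exploit:

```python
import numpy as np

class Chunk:
    """Exact summary of a point set for the 1-mean cost: for any query x,
    sum_i ||p_i - x||^2 = s2 - 2 <s1, x> + n ||x||^2."""
    def __init__(self, n, s1, s2):
        self.n, self.s1, self.s2 = n, s1, s2

    @classmethod
    def of(cls, P):
        return cls(len(P), P.sum(axis=0), float((P ** 2).sum()))

    def merge(self, other):
        # Union of coresets: summaries of disjoint chunks simply add up.
        return Chunk(self.n + other.n, self.s1 + other.s1, self.s2 + other.s2)

    def cost(self, x):
        return self.s2 - 2.0 * (self.s1 @ x) + self.n * (x @ x)

rng = np.random.default_rng(1)
P = rng.normal(size=(10_000, 3))
x = rng.normal(size=3)

# Stream P in chunks (or summarize chunks on separate machines), then merge.
summary = Chunk.of(np.empty((0, 3)))
for part in np.array_split(P, 8):
    summary = summary.merge(Chunk.of(part))

assert np.allclose(summary.cost(x), ((P - x) ** 2).sum())
```

Here the merged summary reproduces the cost on the full data exactly; in the k-segment case the merge is approximate, with the errors of the composed coresets controlled as in [10].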
As we show in the supplementary material, this allows us to use coresets in the streaming and parallel models.

3 Experimental Results

We now demonstrate the results of our algorithm on four data types of varying length and dimensionality. We compare our algorithms against several other segmentation algorithms. We also show that the coreset effectively improves the performance of several segmentation algorithms by running the algorithms on our coreset instead of the full data.

3.1 Segmentation of Large Datasets

We first examine the behavior of the algorithm on synthetic data, which provides us with easy ground truth, to evaluate the quality of the approximation as well as the efficiency and scalability of the coreset algorithms. We generate synthetic test data by drawing a discrete k-segment P with k = 20, and then adding Gaussian and salt-and-pepper noise. We then benchmark the computed (k, ε)-coreset D by comparing it against piecewise linear approximations with (1) a uniformly sampled subset of control points U and (2) randomly placed control points R. For a fair comparison between the (k, ε)-coreset D and the corresponding approximations U, R, we allow the same number of coefficients for each approximation. Coresets are evaluated by computing the fitting cost to a query k-segment Q that is constructed based on the a-priori parameters used to generate P.

Figure 2: (a) Coreset size vs. coreset error; (b) (k, ε)-coreset size vs. CPU time; (c) coreset dimension vs. coreset error. Figure 2a shows the coreset error (ε) decreasing as a function of coreset size. The dotted black line indicates the point at which the coreset size is equal to the input size. Figure 2b shows the coreset construction time in minutes as a function of coreset size. Trendlines show the linear increase in construction time with coreset size.
Figure 2c shows the reduction in coreset error as a function of the dimensionality of the 1-segment coreset, for fixed input size (the dimensionality can often be reduced down to R²).

Figure 3: Segmentation from Google Glass. Black vertical lines indicate segment boundaries, overlaid on top of the bag-of-words representation. Icon images are taken from the middle of each segment.

Approximation Power: Figure 2a shows the aggregated fitting cost error for 1500 experiments on synthetic data. We varied the assumed segment complexity k′; the plot shows how well a given k′ performed as a guess for the true value of k. As Figure 2a shows, we significantly outperform the other schemes. As the coreset size approaches the size of P, the error decreases to zero, as expected.

Coreset Construction Time: Figure 2b shows the linear relationship between input size and construction time of D for different coreset sizes. Figure 2c shows how high dimensionality benefits coreset construction. This is even more apparent in real data, which tends to be sparse, so that in practice we are typically able to further reduce the coreset dimension in each segment.

Scalability: The coresets presented in this work are parallelizable, as discussed in Section 2.2. We demonstrate scalability by conducting very large scale experiments on both real and synthetic data, running our algorithm on a network of 255 Amazon EC2 vCPU nodes. We compress a 256,000-frame bag-of-words (BOW) stream in approximately 20 minutes with almost-perfect scalability. For a comparable single node running on the same dataset, we estimate a total running time of approximately 42 hours.

3.2 Real Data Experiments

We compare our coreset against uniform sample and random sample coresets, as well as two other segmentation techniques: the Ramer-Douglas-Peucker (RDP) algorithm [20, 8] and the Dead Reckoning (DR) algorithm [23].
We also show that we can combine our coreset with segmentation algorithms by running the algorithm on the coreset itself. We emphasize that the segmentation techniques were chosen as simple examples and are not intended to reflect the state of the art, but rather to demonstrate how the k-segment coreset can improve on any given algorithm.

To demonstrate the general applicability of our techniques, we run our algorithm on financial (1D) time series data as well as GPS data. For the 1D case we use price data from the Mt.Gox Bitcoin exchange. Bitcoin is of interest because its price has grown exponentially with its popularity in the past two years. Bitcoin has also sustained several well-documented market crashes [3], [6] that we can relate to our analysis. For the 2D case we use GPS data from 343 taxis in San Francisco. This is of interest because a taxi-route segmentation has an intuitive interpretation that we can easily evaluate, while GPS data forms an increasingly large information source in which we are interested.

Figure 4a shows the results for the Bitcoin data. Price extrema are highlighted by local price highs (green) and lows (red). We observe that running the DR algorithm on our k-segment coreset captures these events quite well. Figures 4b and 4c show example results for a single taxi. Again, we observe that the DR segmentation produces segments with a meaningful spatial interpretation. Figure 5 shows a plot of coreset errors for the first 50 taxis (right), and the table gives a summary of experimental results for the Bitcoin and GPS experiments.

3.3 Semantic Video Segmentation

In addition, we demonstrate the use of the proposed coreset for video stream summarization. While different choices of frame representation for video summarization are available [22, 17, 18], we used color-augmented SURF features, quantized into 5000 visual words, trained on the ImageNet 2013 dataset [7].
The resulting histograms are compressed into a streaming coreset. Computation on a single core runs at 6 Hz; a parallel version achieves 30 Hz on a single i7 machine, processing 6 hours of video in 4 hours on a single machine, i.e., faster than real-time.

In Figure 3 we demonstrate segmentation of a video taken from Google Glass. We visualize the BOWs, as well as the segments suggested by the k-segment mean algorithm [4] run on the coreset. Inspecting the results, most segment transitions occur at scene and room changes.

Even though optimal segmentation cannot be done in real-time, the proposed coreset is computed in real-time and can further be used to automatically summarize the video by associating representative frames with segments. To evaluate the "semantic" quality of our segmentation, we compared the resulting segments to uniform segmentation by contrasting them with a human annotation of the video into scenes. Our method gave a 25% improvement (in Rand index [21]) over a 3000-frame sequence.

Figure 4: (a) shows the MTGOXUSD daily Bitcoin prices from 2013 on, overlaid with a DR segmentation computed on our coreset; the red/green triangles indicate prominent market events. (b) shows normalized GPS taxi data overlaid with a DR segmentation computed on our coreset. (c) shows a lat/long plot of the same taxi data, demonstrating that the segmentation yields a meaningful spatial interpretation.

Average ε                  Bitcoin data   GPS data
k-segment coreset          0.0014         0.0092
Uniform sample coreset     0.0121         1.8726
Random sample coreset      0.0214         8.0110
RDP on original data       0.0231         0.0366
RDP on k-segment           0.0051         0.0335
DeadRec on original data   0.0417         0.0851
DeadRec on k-segment       0.0385         0.0619

Figure 5: Table: Summary for Bitcoin / GPS data.
Plot: Errors / standard deviations for the first 50 cabs.
4 Conclusions
In this paper we demonstrated a new framework for segmentation and event summarization of high-dimensional data. We have shown the effectiveness and scalability of the proposed algorithms, and their applicability to large distributed video analysis. In the context of video processing, we demonstrated that, with the right framework for analysis and clustering, even relatively straightforward representations of image content lead to meaningful and reliable segmentation of video streams at real-time speeds.

References
[1] P. K. Agarwal, S. Har-Peled, and K. R. Varadarajan. Geometric approximations via coresets. Combinatorial and Computational Geometry - MSRI Publications, 52:1-30, 2005.
[2] S. Bandla and K. Grauman. Active learning of an action detector from untrimmed videos. In ICCV, 2013.
[3] BBC. Bitcoin panic selling halves its value, 2013.
[4] R. Bellman. On the approximation of curves by line segments using dynamic programming. Commun. ACM, 4(6):284, 1961.
[5] W. Churchill and P. Newman. Continually improving large scale long term visual navigation of a vehicle in dynamic urban environments. In Proc.
IEEE Intelligent Transportation Systems Conference (ITSC), Anchorage, USA, September 2012.
[6] CNBC. Bitcoin crash spurs race to create new exchanges, April 2013.
[7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In Computer Vision and Pattern Recognition, 2009.
[8] D. H. Douglas and T. K. Peucker. Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. Cartographica: The International Journal for Geographic Information and Geovisualization, 10(2):112-122, 1973.
[9] D. Feldman and M. Langberg. A unified framework for approximating and clustering data. In STOC, 2010. Manuscript available at arXiv.org.
[10] D. Feldman, M. Schmidt, and C. Sohler. Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering. In SODA, 2013.
[11] D. Feldman, A. Sugaya, and D. Rus. An effective coreset compression algorithm for large scale sensor networks. In IPSN, pages 257-268, 2012.
[12] D. Feldman, C. Sung, and D. Rus. The single pixel GPS: learning big data signals from tiny coresets. In Proceedings of the 20th International Conference on Advances in Geographic Information Systems, pages 23-32. ACM, 2012.
[13] A. C. Gilbert, S. Guha, P. Indyk, Y. Kotidis, S. Muthukrishnan, and M. J. Strauss. Fast, small-space algorithms for approximate histogram maintenance. In STOC, pages 389-398. ACM, 2002.
[14] Y. Girdhar and G. Dudek. Efficient on-line data summarization using extremum summaries. In ICRA, pages 3490-3496. IEEE, 2012.
[15] S. Guha, N. Koudas, and K. Shim. Approximation and streaming algorithms for histogram construction problems. ACM Transactions on Database Systems (TODS), 31(1):396-438, 2006.
[16] D. S. Hochbaum. Approximation algorithms for NP-hard problems. PWS Publishing Co., 1996.
[17] Y. Li, D. J. Crandall, and D. P.
Huttenlocher. Landmark classification in large-scale image collections. In ICCV, pages 1957-1964, 2009.
[18] Z. Lu and K. Grauman. Story-driven summarization for egocentric video. In CVPR, pages 2714-2721, 2013.
[19] R. Paul, D. Feldman, D. Rus, and P. Newman. Visual precis generation using coresets. In ICRA. IEEE Press, 2014. Accepted.
[20] U. Ramer. An iterative procedure for the polygonal approximation of plane curves. Computer Graphics and Image Processing, 1(3):244-256, 1972.
[21] W. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846-850, 1971.
[22] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In ICCV, volume 2, pages 1470-1477, Oct. 2003.
[23] G. Trajcevski, H. Cao, P. Scheuermann, O. Wolfson, and D. Vaccaro. On-line data reduction and the quality of history in moving objects databases. In MobiDE, pages 19-26, 2006.