{"title": "Structural epitome: a way to summarize one\u2019s visual experience", "book": "Advances in Neural Information Processing Systems", "page_first": 1027, "page_last": 1035, "abstract": "In order to study the properties of total visual input in humans, a single subject wore a camera for two weeks capturing, on average, an image every 20 seconds (www.research.microsoft.com/~jojic/aihs). The resulting new dataset contains a mix of indoor and outdoor scenes as well as numerous foreground objects. Our first analysis goal is to create a visual summary of the subject\u2019s two weeks of life using unsupervised algorithms that would automatically discover recurrent scenes, familiar faces or common actions. Direct application of existing algorithms, such as panoramic stitching (e.g. Photosynth) or appearance-based clustering models (e.g. the epitome), is impractical due to either the large dataset size or the dramatic variation in the lighting conditions. As a remedy to these problems, we introduce a novel image representation, the \u201cstel epitome,\u201d and an associated efficient learning algorithm. In our model, each image or image patch is characterized by a hidden mapping T, which, as in previous epitome models, defines a mapping between the image-coordinates and the coordinates in the large all-I-have-seen\" epitome matrix. The limited epitome real-estate forces the mappings of different images to overlap, with this overlap indicating image similarity. However, in our model the image similarity does not depend on direct pixel-to-pixel intensity/color/feature comparisons as in previous epitome models, but on spatial configuration of scene or object parts, as the model is based on the palette-invariant stel models. 
As a result, stel epitomes capture structure that is invariant to non-structural changes, such as illumination, that tend to uniformly affect pixels belonging to a single scene or object part.", "full_text": "Structural epitome: A way to summarize one\u2019s visual experience\n\nNebojsa Jojic\nMicrosoft Research\n\nAlessandro Perina\nMicrosoft Research\nUniversity of Verona\n\nVittorio Murino\nItalian Institute of Technology\nUniversity of Verona\n\nAbstract\n\nIn order to study the properties of total visual input in humans, a single subject wore a camera for two weeks capturing, on average, an image every 20 seconds. The resulting new dataset contains a mix of indoor and outdoor scenes as well as numerous foreground objects. Our first goal is to create a visual summary of the subject\u2019s two weeks of life using unsupervised algorithms that would automatically discover recurrent scenes, familiar faces or common actions. Direct application of existing algorithms, such as panoramic stitching (e.g., Photosynth) or appearance-based clustering models (e.g., the epitome), is impractical due to either the large dataset size or the dramatic variations in the lighting conditions. As a remedy to these problems, we introduce a novel image representation, the \u201cstructural element (stel) epitome,\u201d and an associated efficient learning algorithm. In our model, each image or image patch is characterized by a hidden mapping T which, as in previous epitome models, defines a mapping between the image coordinates and the coordinates in the large \u201call-I-have-seen\u201d epitome matrix. The limited epitome real-estate forces the mappings of different images to overlap, which indicates image similarity. 
However, the image similarity no longer depends on direct pixel-to-pixel intensity/color/feature comparisons as in previous epitome models, but on the spatial configuration of scene or object parts, as the model is based on the palette-invariant stel models. As a result, stel epitomes capture structure that is invariant to non-structural changes, such as illumination changes, that tend to uniformly affect pixels belonging to a single scene or object part.\n\n1 Introduction\n\nWe develop a novel generative model which combines the powerful invariance properties achieved through the use of hidden variables in epitome [2] and stel (structural element) models [6, 8]. The latter set of models has a hidden stel index si for each image pixel i. The number of discrete states si can take is small, typically 4-10, as the stel indices point to a small palette of distributions over local measurements, e.g., color. The actual local measurement xi (e.g., color) for pixel i is assumed to have been generated from the appropriate palette entry. This constrains the pixels with the same stel index s to have similar colors, or whatever local measurements xi represent. The indexing scheme is further assumed to change little across different images of the same scene/object, while the palettes can vary significantly. For example, two images of the same scene captured at different levels of overall illumination would still have very similar stel partitions, even though their palettes may be vastly different. In this way, the image representation rises above a matrix of local measurements in favor of a matrix of stel indices which can survive remarkable non-structural image changes, as long as these can be explained away by a change in the (small) palette. For example, in Fig. 1B, images of pedestrians are captured by a model that has a prior distribution of stel assignments shown in the first row. 
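The stel generative process just described can be sketched in a few lines. This is an illustrative toy, not the authors' code; all names (stel_prior, palette_mu, ...) and the toy sizes are our own assumptions.

```python
import numpy as np

# Sketch of the stel idea from [6, 8]: each pixel i draws a hidden stel index
# s_i from a per-pixel prior p(s_i = s); its color is then drawn from a small
# per-image palette of Gaussians. Names and sizes here are illustrative only.
rng = np.random.default_rng(0)
H, W, S = 8, 8, 3                                    # tiny image, 3 stels
stel_prior = rng.dirichlet(np.ones(S), size=(H, W))  # p(s_i = s); sums to 1 per pixel

def sample_image(stel_prior, palette_mu, palette_var, rng):
    """Sample stel indices pixel-wise, then colors from this image's palette."""
    H, W, S = stel_prior.shape
    flat = stel_prior.reshape(-1, S)
    s = np.array([rng.choice(S, p=p) for p in flat]).reshape(H, W)
    x = rng.normal(palette_mu[s], np.sqrt(palette_var[s]))
    return s, x

# Two "images of the same scene": identical stel prior, very different
# palettes (e.g. day vs. night); the stel segmentation s survives the change.
s_day, x_day = sample_image(stel_prior, np.array([0.9, 0.5, 0.1]), np.full(S, 1e-4), rng)
s_night, x_night = sample_image(stel_prior, np.array([0.2, 0.05, 0.6]), np.full(S, 1e-4), rng)
```

The point of the toy is the invariance argument of this paragraph: the two sampled images share nothing at the pixel level, yet both are explained by the same stel prior with different palettes.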
The prior on stel probabilities for each pixel adds up to one, and the 6 images showing these prior probabilities add up to a uniform image of ones. Several pedestrian images are shown\n\nFigure 1: A) Graphical model of Epitome, Probabilistic index map (PIM) and Stel epitome. B) Examples of PIM parameters. C) Example of stel epitome parameters. D) Four frames. E) Alignment of the frames with the stel epitome. F) Alignment of the intensity images. G) The original epitome model [2] trained on these four frames.\n\nbelow with their posterior distributions over stel assignments, as well as the mean color of each stel. This illustrates that the different parts of the pedestrian images are roughly matched. Torso pixels, for instance, are consistently assigned to stel s = 3, despite the fact that different people wore shirts or coats of very different colors. Such a consistent segmentation is possible because torso pixels tend to have similar colors within any given image and because the torso is roughly in the same position across images (though misalignment of up to half the size of the segments is largely tolerated). While the figure shows the model with S=6 stels, larger numbers of stels were shown to lead to further segmentation of the head and even splitting of the left from the right leg [6]. Motivated by similar insights as in [6], a number of models followed, e.g., [7, 13, 14, 8], as the described addition of hidden variables s achieves the remarkable level of intensity invariance first demonstrated through the use of similarity templates [12], but at a much lower computational cost.\nIn this paper, we embed the stel image representation within a large stel epitome: a stel prior matrix, like the one shown in the top row of Fig. 1B, but much larger so that it can contain representations of multiple objects or scenes. This requires additional transformation variables T for each image, whose role is to align it with the epitome. 
The model is thus qualitatively enriched in two ways: 1) the model is now less sensitive to misalignment of images, as through alignment to the epitome, the images are aligned to each other, and 2) interesting structure emerges when the epitome real estate is limited so that, though it is much larger than the size of a single image, it is still much smaller than the real estate needed to simply tile all images without overlap. In that case, a large collection of images must naturally undergo an unsupervised clustering in order for this real estate to be used as well as possible (or as well as the local minimum obtained by the learning algorithm allows). This clustering is quite different from the traditional notion of clustering. As in the original epitome models, the transformation variables play both the alignment and cluster indexing roles. Different models over the typical scenes/objects have to compete over the positions in the epitome, with a panoramic version of each scene emerging in different parts of the epitome, finally providing a rich image indexing scheme. Such a panoramic scene submodel within the stel epitome is illustrated in Fig. 1C. A portion of the larger stel epitome is shown with 3 images that map into this region. The region represents one of two home offices in the dataset analyzed in the experiments. 
Stel s=1 captures the laptop screen, while the other stels capture other parts of the scene, as well as large shadowing effects (while the overall changes in illumination and color changes in object parts rarely affect stel representations, the shadows can break the stel invariance, and so the model learned to cope with them by breaking the shadows across multiple stels). The three images shown, mapping to different parts of this region, have very different colors as they were taken at different times of day and across different days, and yet their alignment is not adversely affected, as is evident in their posterior stel segmentation aligned to the epitome.\nTo further illustrate the panoramic alignment, we used the epitome mapping to show for the 4 different images in Fig. 1D how they overlap with stel s=4 of another office image (Fig. 1E), as well as how multiple images of this scene, including these 4, look when they are aligned and overlapped as intensity images in Fig. 1F. To illustrate the gain from palette-invariance that motivated this work, we show in Fig. 1G the original epitome model [2] trained on images of this scene. Without the invariances afforded by the stel representation, the standard color epitome has to split the images of the scene into two clusters, and so the laptop screen is doubled there.\nQualitatively quite different from both epitomes and previous stel models, the stel epitome is a model flexible enough to be applied to a very diverse set of images. 
In particular, we are interested in datasets that might represent well a human\u2019s total visual input over a longer period of time, and so we captured two weeks\u2019 worth of SenseCam images, taken at a frequency of roughly one image every 20 seconds during all waking hours of a human subject over a period of two weeks (www.research.microsoft.com/\u223cjojic/aihs).\n\n2 Stel epitome\n\nThe graphical model describing the dependencies in stel epitomes is provided in Fig. 1A. The parametric forms for the conditional distributions are standard multinomial and Gaussian distributions, just as the ones used in [8]. We first consider the generation of a single image or an image patch (depending on which visual scale we are epitomizing), and, for brevity, temporarily omit the subscript t indexing different images.\nThe epitome is a matrix of multinomial distributions over S indices s \u2208 {1, 2, ..., S}, associated with each two-dimensional epitome location i:\n\np(si = s) = ei(s). (1)\n\nThus each location in the epitome contains S probabilities (adding to one) for different indices. Indices for the image are assumed to be generated from these distributions. The distribution over the entire collection of pixels (either from an entire image, or a patch), p({xi}|{si}, T, \u039b), depends on the parametrization of the transformations T. We adopt the discrete transformation model used previously in graphical models, e.g. 
[1, 2], where the shifts are separated from other transformations such as scaling or rotation, T = (\u2113, r), with \u2113 being a 2-dimensional shift and r being the index into the set of other transformations, e.g., combinations of rotation and scaling:\n\np({xi}|{si}, T, \u039b) = \u220f_i p(x^r_{i\u2212\u2113}|si, \u039b) = \u220f_i p(x^r_{i\u2212\u2113}|\u039b_{si}), (2)\n\nwhere superscript r indicates transformation of the image x by the r-th transformation, and i \u2212 \u2113 is the mod-difference between the two-dimensional variables with respect to the edges of the epitome (the shifts wrap around). \u039b is the palette associated with the image, and \u039bs is its s-th entry. Various palette models for probabilistic index / structural element map models have been reviewed in [8]. For brevity, in this paper we focus on the simplest case where the image measurements are simply pixel colors, and the palette entries are simply Gaussians with parameters \u039bs = (\u00b5s, \u03c6s). In this case, p(x^r_{i\u2212\u2113}|\u039b_{si}) = N(x^r_{i\u2212\u2113}; \u00b5_{si}, \u03c6_{si}), and the joint likelihood over observed and hidden variables can be written as\n\nP = p(\u039b)p(\u2113, r) \u220f_i \u220f_s (N(x^r_{i\u2212\u2113}; \u00b5s, \u03c6s) ei(s))^{[si=s]}, (3)\n\nwhere [\u00b7] is the indicator function.\nTo derive the inference and learning algorithms for the model, we start with a posterior distribution model Q and the appropriate free energy \u2211 Q log(Q/P). The standard variational approach, however, is not as straightforward as we might hope, as major obstacles need to be overcome to avoid local minima and slow convergence. To focus on these important issues, we further simplify the problem and omit both the non-shift part of the transformations (r) and palette priors p(\u039b), and for consistency, we also omit these parts of the model in the experiments. 
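A direct, unvectorized evaluation of the (shift-only) joint above can be sketched as follows. This is our own toy, not the paper's code; toy sizes are arbitrary, and the priors p(\u2113) and p(\u039b) are taken as uniform and folded into the constant.

```python
import numpy as np

# Sketch (ours) of log p(x, s | l, palette) from Eq. (3) for a shift-only
# transformation with wrap-around shifts; uniform priors are dropped into
# the constant. Toy sizes and names are assumptions for illustration.
rng = np.random.default_rng(1)
M, N, m, n, S = 12, 12, 4, 4, 3
e = rng.dirichlet(np.ones(S), size=(M, N))   # epitome prior e_i(s)

def log_joint(x, s, l, mu, phi, e):
    """x: (m, n) image; s: (m, n) stel map; l: 2-D shift; mu, phi: palette."""
    out = 0.0
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            ei, ej = (i + l[0]) % e.shape[0], (j + l[1]) % e.shape[1]  # wrap around
            k = s[i, j]
            out += np.log(e[ei, ej, k])                                # stel prior term
            out += -0.5 * np.log(2 * np.pi * phi[k]) \
                   - (x[i, j] - mu[k]) ** 2 / (2 * phi[k])             # Gaussian palette term
    return out

x = rng.normal(size=(m, n))
s = rng.integers(0, S, size=(m, n))
val = log_joint(x, s, (3, 5), np.zeros(S), np.ones(S), e)
```

In practice the same quantity is evaluated for all M\u00b7N shifts at once via convolutions, which is what makes the learning algorithm efficient; the double loop here is only for clarity.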
These two elements of the model can be dealt with in the manner proposed previously: the R discrete transformations (scale/rotation combinations, for example) can be inferred in a straightforward way that makes the entire algorithm that follows R times slower (see [1] for using such transformations in a different context), and the various palette models from [8] can all be inserted here with the update rules adjusted appropriately.\nA large stel epitome is difficult to learn because decoupling of all hidden variables in the posterior leads to severe local minima, with all images either mapped to a single spot in the epitome, or mapped everywhere in the epitome so that the stel distribution is flat. This problem becomes particularly evident in larger epitomes, due to the imbalance in the cardinalities of the three types of hidden variables. To resolve this, we either need a very high numerical precision (and considerable patience), or the severe variational approximations need to be avoided as much as possible. It is indeed possible to tractably use a rather expressive posterior\n\nQ = q(\u2113) \u220f_s q(\u039bs|\u2113) \u220f_i q(si), (4)\n\nfurther setting q(\u039bs|\u2113) = \u03b4(\u00b5s \u2212 \u02c6\u00b5_{s,\u2113})\u03b4(\u03c6s \u2212 \u02c6\u03c6_{s,\u2113}), where \u03b4 is the Dirac function. This leads to\n\nF = H(Q) + \u2211_{s,\u2113,i} q(\u2113)q(si = s) \u02c6\u00b5\u00b2_{s,\u2113}/(2\u02c6\u03c6_{s,\u2113}) + \u2211_{s,\u2113,i} q(\u2113)q(si = s) x\u00b2_{i\u2212\u2113}/(2\u02c6\u03c6_{s,\u2113}) \u2212 \u2211_{s,\u2113,i} q(\u2113)q(si = s) \u02c6\u00b5_{s,\u2113} x_{i\u2212\u2113}/\u02c6\u03c6_{s,\u2113} + \u2211_{s,i} q(si = s) log ei(s), (5)\n\nwhere H(Q) is the entropy of the posterior distribution. 
Setting to zero the derivatives of this free energy with respect to the variational parameters \u2013 the probabilities q(si = s), q(\u2113), and the palette mean and variance estimates \u02c6\u00b5_{s,\u2113}, \u02c6\u03c6_{s,\u2113} \u2013 we obtain a set of updates for iterative inference.\n\n2.1 E STEP\n\nThe following steps are iterated for a single image x on an m \u00d7 n grid and for a given epitome distribution e(s) on an M \u00d7 N grid. Index i corresponds to the epitome coordinates, and masks m are used to describe which of all M \u00d7 N coordinates correspond to image coordinates. In the variational EM learning on a collection of images indexed by t, these steps are done for each image, yielding posterior distributions indexed by t, and then the M step is performed as described below. We initialize q(si = s) = ei(s) and then iterate the following steps in the following order.\n\nPalette updates\n\n\u02c6\u00b5_{s,\u2113} = [\u2211_i m_{i\u2212\u2113} q(si = s) q(\u2113) x_{i\u2212\u2113}] / [\u2211_i m_{i\u2212\u2113} q(si = s) q(\u2113)]\n\n\u02c6\u03c6_{s,\u2113} = [\u2211_i m_{i\u2212\u2113} q(si = s) q(\u2113) x\u00b2_{i\u2212\u2113}] / [\u2211_i m_{i\u2212\u2113} q(si = s) q(\u2113)] \u2212 \u02c6\u00b5\u00b2_{s,\u2113}\n\nEpitome mapping update\n\nlog q(\u2113) = const \u2212 (1/2) \u2211_{i,s} q(si = s) log 2\u03c0\u02c6\u03c6_{s,\u2113}\n\nThis update is derived from the free energy and from the expression for \u02c6\u03c6 above. 
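For a single fixed shift \u2113 (so that the q(\u2113) factors cancel from numerator and denominator), the palette updates above reduce to responsibility-weighted first and second moments. A minimal sketch, with names of our own choosing:

```python
import numpy as np

# Sketch of the palette update for one fixed shift l: mu and phi are the
# q(s_i = s)-weighted mean and variance of the pixels under the image mask.
# Function and variable names are ours, not the paper's.
def palette_update(x, q_s):
    """x: (P,) pixels inside the mask; q_s: (P, S) stel responsibilities."""
    w = q_s.sum(axis=0)                                      # total mass per stel
    mu = (q_s * x[:, None]).sum(axis=0) / w                  # weighted mean
    phi = (q_s * x[:, None] ** 2).sum(axis=0) / w - mu ** 2  # weighted variance
    return mu, phi

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(0.0, 0.1, 50), rng.normal(5.0, 0.1, 50)])
q_s = np.zeros((100, 2))
q_s[:50, 0] = 1.0    # first half of the pixels assigned to stel 0
q_s[50:, 1] = 1.0    # second half assigned to stel 1
mu, phi = palette_update(x, q_s)   # mu close to [0, 5], phi close to [0.01, 0.01]
```

With hard responsibilities, as here, the update recovers the per-stel sample mean and variance; soft q(s) interpolates between them.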
This equation can be used as is when the epitome e(s) is well defined (that is, when the entropy of the component stel distributions is low in the later iterations), as long as the usual care is taken in exponentiation before normalization: the maximum log q(\u2113) should be subtracted from all elements of the M \u00d7 N matrix log q(\u2113) before exponentiation.\nIn the early iterations of EM, however, when the distributions ei(s) have not converged yet, numerical imprecision can stop the convergence, leaving the algorithm at a point which is not even a local minimum. The reason for this is that after the normalization step we described, q(\u2113) will still be very peaky, even for relatively flat e(s), due to the large number of pixels in the image. The consequence is that low alignment probabilities are rounded down to zero, as after exponentiation and normalization their values go below numerical precision. If there are areas of the epitome where no single image is mapped with high probability, then the update in those areas in the M step would have to depend on the low-probability mappings for different images, and their relative probabilities would determine which of the images contribute more and which less to updating these areas of the epitome. To preserve the numerical precision needed for this, we set k thresholds \u03c4k, and compute log \u02dcq(\u2113)k, the distributions at the k different precision levels:\n\nlog \u02dcq(\u2113)k = [log q(\u2113) \u2265 \u03c4k] \u00b7 \u03c4k + [log q(\u2113) < \u03c4k] \u00b7 log q(\u2113),\n\nwhere [\u00b7] is the indicator function. This limits how high the highest probability in the map is allowed to be. 
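The clipping just defined can be sketched directly; the thresholds below are illustrative values of our own, chosen only to show the underflow effect:

```python
import numpy as np

# Sketch of the multi-precision trick: cap log q(l) at each threshold tau_k,
# then do the usual max-subtraction and exponentiation. At coarser levels
# (lower tau_k) low-probability alignments stay representable in float64.
def clipped_posteriors(log_q, taus):
    """Return one normalized distribution per precision level k."""
    levels = []
    for tau in taus:
        clipped = np.minimum(log_q, tau)   # values above tau_k are set to tau_k
        clipped = clipped - clipped.max()  # usual max-subtraction
        p = np.exp(clipped)
        levels.append(p / p.sum())
    return levels

log_q = np.array([0.0, -50.0, -800.0])     # a very peaky alignment posterior
levels = clipped_posteriors(log_q, [0.0, -100.0, -700.0])
# at tau = 0 the -800 entry underflows to exactly zero after exponentiation;
# at tau = -700 it is only 100 nats below the cap and survives
```

This is exactly the failure mode described above: without the caps, the weakly-mapped epitome areas receive identically zero mass from every image and the M step there stalls.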
The k-th distribution sets all values above \u03c4k to be equal to \u03c4k. We can now normalize these k distributions as discussed above:\n\n\u02dcq(\u2113)k = exp{log \u02dcq(\u2113)k \u2212 max_\u2113 log \u02dcq(\u2113)k} / \u2211_\u2113 exp{log \u02dcq(\u2113)k \u2212 max_\u2113 log \u02dcq(\u2113)k}\n\nTo keep track of which precision level is needed for different \u2113, we calculate the masks\n\n\u02dcm_{i,k} = \u2211_\u2113 \u02dcq(\u2113)k \u00b7 m_{i\u2212\u2113},\n\nwhere m is the mask with ones in the upper-left corner\u2019s m \u00d7 n entries and zeros elsewhere, designating the default image position for a shift of \u2113 = 0 (or, given that shifts are defined with a wrap-around, the shift of \u2113 = (M, N)). Masks \u02dcm_{i,k} provide the total weight of the image mapping at the appropriate epitome location at different precision levels.\n\nPosterior stel distribution q(s) update at multiple precision levels\n\nlog \u02dcq(si = s)k = const \u2212 \u2211_{\u2113 | i\u2212\u2113 \u2208 C} \u02dcq(\u2113)k [ \u02c6\u00b5\u00b2_{s,\u2113}/(2\u02c6\u03c6_{s,\u2113}) + x\u00b2_{i\u2212\u2113}/(2\u02c6\u03c6_{s,\u2113}) \u2212 \u02c6\u00b5_{s,\u2113} x_{i\u2212\u2113}/\u02c6\u03c6_{s,\u2113} ] + \u02dcm_{i,k} \u00b7 log e(si = s). (9)\n\nTo keep track of these different precision levels, we also define a mask M so that Mi = k indicates that the k-th level of detail should be used for epitome location i. 
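The mask computation \u02dcm_{i,k} = \u2211_\u2113 \u02dcq(\u2113)k \u00b7 m_{i\u2212\u2113} is a circular convolution, so it can be done with a 2-D FFT. A sketch under our own toy sizes:

```python
import numpy as np

# Sketch (ours) of the aggregate mapping mask: a wrap-around convolution of
# the alignment posterior q~(l)_k with the m x n image-support mask m,
# computed in the frequency domain. Sizes below are illustrative.
def mapping_mask(q_l, m_img):
    """q_l: (M, N) alignment posterior; m_img: (M, N) 0/1 image mask."""
    return np.real(np.fft.ifft2(np.fft.fft2(q_l) * np.fft.fft2(m_img)))

M, N, m, n = 8, 8, 3, 3
q_l = np.zeros((M, N)); q_l[2, 5] = 1.0        # all mass on the shift l = (2, 5)
m_img = np.zeros((M, N)); m_img[:m, :n] = 1.0  # default placement for l = 0
w = mapping_mask(q_l, m_img)
# with a delta posterior, the result is the image mask shifted by (2, 5),
# and the total weight is m * n, one unit per image pixel
```

The same FFT trick applies to all the \u2113-sums in this section, which is what keeps the per-image E step near O(MN log MN) rather than O((MN)\u00b2).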
The k-th level is reserved for those locations that have only the values from up to the k-th precision band of q(\u2113) mapped there (we will have m \u00d7 n mappings of the original image to each epitome location, as this many different shifts will align the image so as to overlap with any given epitome location). One simple, though not the most efficient, way to define this matrix is Mi = 1 + \u230a\u2211_k \u02dcm_{i,k}\u230b.\nWe now normalize log \u02dcq(si = s)k to compute the distribution at k different precision levels, \u02dcq(si = s)k, and compute q(s) integrating the results from different numerical precision levels as q(si = s) = \u2211_k [Mi = k] \u00b7 \u02dcq(si = s)k.\n\n2.2 M STEP\n\nThe highest k for each epitome location, Di = max_t{M^t_i}, is determined over all images xt in the dataset, so that we know the appropriate precision level at which to perform summation and normalization. Then the epitome update consists of:\n\ne(si = s) = \u2211_k [Di = k] \u00b7 [\u2211_t [M^t_i = k] \u00b7 qt(si = s)] / [\u2211_t [M^t_i = k]].\n\nNote that most of the summations can be performed by convolution operations and, as a result, the complexity of the algorithm is O(SMN log MN) for M \u00d7 N epitomes.\n\nFigure 2: Some examples from the dataset (www.research.microsoft.com/\u223cjojic/aihs)\n\n3 Experiments\n\nUsing a SenseCam wearable camera, we have obtained two weeks\u2019 worth of images, taken at the rate of one frame every 20 seconds during all waking hours of a human subject. The resulting image dataset captures the subject\u2019s (summer) life rather completely in the following sense: the majority of images can be assigned to one of the emergent categories (Fig. 2), and the same categories represent the majority of images from any time period of a couple of days. We are interested in appropriate summarization, browsing, and recognition tasks on this dataset. 
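At a single precision level, the M step above is just an indicator-weighted average of the per-image posterior stel maps. A minimal sketch, with names and toy sizes of our own:

```python
import numpy as np

# Sketch (ours) of the M step at one precision level: e_i(s) averages the
# posterior stel maps q_t(s_i = s) over the images t whose mapping reaches
# epitome location i; untouched locations fall back to a uniform stel prior.
def m_step(q_maps, reach):
    """q_maps: (T, M, N, S) posteriors; reach: (T, M, N) 0/1 indicators."""
    S = q_maps.shape[-1]
    num = (q_maps * reach[..., None]).sum(axis=0)
    den = reach.sum(axis=0)[..., None]
    return np.where(den > 0, num / np.maximum(den, 1e-12), 1.0 / S)

rng = np.random.default_rng(3)
T, M, N, S = 5, 4, 4, 3
q_maps = rng.dirichlet(np.ones(S), size=(T, M, N))   # one stel posterior per image
reach = (rng.random((T, M, N)) > 0.3).astype(float)  # which images reach which location
e = m_step(q_maps, reach)   # each e_i(.) remains a distribution over the S stels
```

In the full algorithm the indicator [M^t_i = k] plays the role of `reach`, and the per-level results are combined with the [Di = k] selector.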
This dataset also proved to be fundamental for testing stel epitomes, as the illumination and viewing angle variations are significant across images, and we found that the previous approaches to scene recognition provide only modest recognition rates. For the purposes of evaluation, we manually labeled a random collection of 320 images and compared our method with other approaches on supervised and unsupervised classification. We divided this reduced dataset into 10 different recurrent scenes (32 images per class); some examples are depicted in Fig. 2. In all the experiments with the reduced dataset we used an epitome area 14 times larger than the image area and five stels (S=5). The numerical results reported in the tables are averaged over 4 train/test splits.\nIn supervised learning the scene labels are available during the stel epitome learning. We used this information to aid both the original epitome [9] and the stel epitome, modifying the models by the addition of an observed scene class variable c in two ways: i) by linking c in the Bayesian network with e, and so learning p(e|c), and ii) by linking c with T, inferring p(T|c). In the latter strategy, where we model p(T|c), we learn a single epitome, but we assume that the epitome locations are linked with certain scenes, and this mapping is learned for each epitome pixel. Then, the distribution p(c|\u2113) over scene labels can be used for inference of the scene label for the test data. For a previously unseen test image xt, recognition is achieved by computing the label posterior p(ct|xt) = \u2211_\u2113 p(c|\u2113) \u00b7 p(\u2113|xt).\nWe compared our approach with the epitomic location recognition method presented in [9], with Latent Dirichlet allocation (LDA) [4], and with the Torralba approach [11]. We also compared with baseline discriminative classifiers and with the pyramid matching kernel approach [5], using SIFT features [3]. 
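The classification rule p(ct|xt) = \u2211_\u2113 p(c|\u2113) \u00b7 p(\u2113|xt) is a single matrix-vector product; a sketch with illustrative numbers of our own:

```python
import numpy as np

# Sketch of the location-based scene classifier: marginalize the learned
# per-location label distribution p(c | l) against the test image's
# alignment posterior p(l | x). The arrays below are made-up toy values.
def scene_posterior(p_c_given_l, q_l):
    """p_c_given_l: (L, C), rows summing to 1; q_l: (L,) alignment posterior."""
    return q_l @ p_c_given_l

p_c_given_l = np.array([[0.9, 0.1],    # location 0 belongs mostly to class 0
                        [0.2, 0.8],    # location 1 belongs mostly to class 1
                        [0.5, 0.5]])   # location 2 is ambiguous
q_l = np.array([0.7, 0.2, 0.1])        # test image aligns mostly to location 0
p_c = scene_posterior(p_c_given_l, q_l)   # a proper distribution over classes
pred = int(np.argmax(p_c))
```

Because both inputs are normalized distributions, the output is automatically a distribution over scene labels, and the argmax gives the predicted class.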
For the above techniques that are based on topic models, representing images as spatially disorganized bags of features, the codebook of SIFT features was based on 16x16 pixel patches computed over a grid spaced by 8 pixels. We chose a number of topics Z = 45 and 200 codewords (W = 200). The same quantized dictionary was employed in [5].\nTo provide a fair comparison between generative and discriminative methods, we also used the free energy optimization strategy presented in [10], which provides an extra layer of discriminative training for an arbitrary generative model. The comparisons are provided in Table 1. Accuracies achieved using the free energy optimization strategy [10] are reported in the Opt. column.\n\nFigure 2 scene classes: Bike, Car, Dining room, Home office, Laptop room, Kitchen, Work office, Outside home, Tennis field, Living room.\n\nTable 1: Classification accuracies.\n\nMethod | Accuracy | [10] Opt.\nStel epitome, p(T|c) | 70,06% | 98,70%\nStel epitome, p(e|c) | 88,67% | 79,14%\nEpitome [9], p(T|c) | 74,36% | n.a.\nEpitome [9], p(e|c) | 69,80% | n.a.\nLDA [4] | 74,23% | 80,11%\nGMM [11] | 56,81% | n.a.\nSIFT + K-NN | 79,42% | n.a.\nSpatial pyramid [5] (C=3) | 96,67% | n.a.\n\nWe also trained both the regular epitome and the stel epitome in an unsupervised way. An illustration of the resulting stel epitome is provided in Fig. 3. The 5 panels marked s = 1, ..., 5 show the stel epitome distribution. Each of these panels is an image ei(s) for an appropriate s. On top of the stel epitome, four enlarged epitome regions are shown to highlight panoramic reconstructions of a few classes. We also show the result of averaging all images according to their mapping to the stel epitome (Fig. 3D) for comparison with the traditional epitome (Fig. 3C), which models colors rather than stels. As opposed to the stel epitome, the learned color epitome [2] has to have multiple versions of the same scene in different illumination conditions. 
Furthermore, many different scenes tend to overlap in the color epitome, especially indoor scenes, which all look equally beige. Finally, in Fig. 3B we show examples of some images of different scenes mapped onto the stel epitome, whose organization is illustrated by a rendering of all images averaged into the appropriate location (similarly to the original color epitomes). Note that the model automatically clusters images using the structure, and not colors, even in the face of the variation of colors present in the exemplars of the \u201cCar\u201d or the \u201cWork office\u201d classes (see also the supplemental video that illustrates the mapping dynamically). The regular epitome cannot capture these invariances, and it clusters images based on overall intensity more readily than based on the structure of the scene. We evaluated the two models numerically in the following way. Using the two types of unsupervised epitomes, and the known labels for the images in the training set, we assigned labels to the test set using the same classification rule explained in the previous paragraph. This semi-supervised test reveals how consistent the clustering induced by epitomes is with the human labeling. The stel epitome accuracy, 73,06%, outperforms the standard epitome model [9], 69,42%, with statistical significance.\nWe have also trained both types of epitomes over a real estate 35 times larger than the original image size, using different random sets of 5000 images taken from the dataset. The stel epitomes trained in an unsupervised way are qualitatively equivalent, in that they consistently capture around six of the most prominent scenes from Fig. 2, whereas the traditional epitomes tended to capture only three.\n\n4 Conclusions\n\nThe idea of recording our experiences is not new. (For a review and interesting research directions, see [15]). 
It is our opinion that recording, summarizing and browsing continuous visual input is particularly interesting. With the recent substantial increases in radio connectivity, battery life, display size, and computing power of small devices, and the availability of even greater computing power offline, summarizing one\u2019s total visual input is now both a practically feasible and scientifically interesting target for vision research. In addition, a variety of applications may arise once this functionality is provided. As a step in this direction, we provide a new dataset that contains a mix of indoor and outdoor scenes as a result of two weeks of continuous image acquisition, as well as a simple algorithm that deals with some of the invariances that have to be incorporated in a model of such data. However, it is likely that modeling the geometry of the imaging process will lead to even more interesting results. Although straightforward application of panoramic stitching algorithms, such as Photosynth, did not work on this dataset, because of both the sheer number of images and the significant variations in the lighting conditions, such methods or insights from their development will most likely be very helpful in further development of unsupervised learning algorithms for such types of datasets. The geometry constraints may lead to more reliable background alignments for the next logical phase in modeling for \u201cAll-I-have-seen\u201d datasets: the learning of foreground object categories such as family members\u2019 faces. 
As this and other such datasets grow in size, the unsupervised techniques for modeling the data in a way where interesting visual components emerge over time will become both more practically useful and scientifically interesting.\n\nFigure 3: Stel epitome of images captured by a wearable camera. A) Stel epitome (panels s = 1, ..., s = 5), with Car, Kitchen, Home office and Work office stel-panoramas. B) Image mappings on the stel epitome. C) Epitome. D) Stel epitome reconstruction.\n\nReferences\n[1] B. Frey and N. Jojic, \u201cTransformation-invariant clustering using the EM algorithm,\u201d TPAMI 2003, vol. 25, no. 1, pp. 1-17.\n[2] N. Jojic, B. Frey, A. Kannan, \u201cEpitomic analysis of appearance and shape,\u201d ICCV 2003.\n[3] D. Lowe, \u201cDistinctive Image Features from Scale-Invariant Keypoints,\u201d IJCV, 2004, vol. 60, no. 2, pp. 91-110.\n[4] L. Fei-Fei, P. Perona, \u201cA Bayesian Hierarchical Model for Learning Natural Scene Categories,\u201d IEEE CVPR 2005, pp. 524-531.\n[5] S. Lazebnik, C. Schmid, J. Ponce, \u201cBeyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories,\u201d IEEE CVPR 2006, pp. 2169-2178.\n[6] N. Jojic and C. Caspi, \u201cCapturing image structure with probabilistic index maps,\u201d IEEE CVPR 2004, pp. 212-219.\n[7] J. Winn and N. Jojic, \u201cLOCUS: Learning Object Classes with Unsupervised Segmentation,\u201d ICCV 2005.\n[8] N. Jojic, A. Perina, M. Cristani, V. Murino and B. Frey, \u201cStel component analysis: modeling spatial correlation in image class structure,\u201d IEEE CVPR 2009.\n[9] K. Ni, A. Kannan, A. Criminisi and J. Winn, \u201cEpitomic Location Recognition,\u201d IEEE CVPR 2008.\n[10] A. Perina, M. Cristani, U. Castellani, V. Murino and N. Jojic, \u201cFree energy score-space,\u201d NIPS 2009.\n[11] A. Torralba, K.P. Murphy, W.T. Freeman and M.A. 
Rubin, \u201cContext-based vision system for place and object recognition,\u201d ICCV 2003, pp. 273-280.\n[12] C. Stauffer, E. Miller, and K. Tieu, \u201cTransform invariant image decomposition with similarity templates,\u201d NIPS 2003.\n[13] V. Ferrari and A. Zisserman, \u201cLearning Visual Attributes,\u201d NIPS 2007.\n[14] B. Russell, A. Efros, J. Sivic, B. Freeman, A. Zisserman, \u201cSegmenting Scenes by Matching Image Composites,\u201d NIPS 2009.\n[15] G. Bell and J. Gemmell, Total Recall. Dutton Adult, 2009.\n", "award": [], "sourceid": 1298, "authors": [{"given_name": "Nebojsa", "family_name": "Jojic", "institution": null}, {"given_name": "Alessandro", "family_name": "Perina", "institution": null}, {"given_name": "Vittorio", "family_name": "Murino", "institution": null}]}