{"title": "Nonparametric Bayesian Texture Learning and Synthesis", "book": "Advances in Neural Information Processing Systems", "page_first": 2313, "page_last": 2321, "abstract": "We present a nonparametric Bayesian method for texture learning and synthesis. A texture image is represented by a 2D-Hidden Markov Model (2D-HMM) where the hidden states correspond to the cluster labeling of textons and the transition matrix encodes their spatial layout (the compatibility between adjacent textons). 2D-HMM is coupled with the Hierarchical Dirichlet process (HDP) which allows the number of textons and the complexity of transition matrix grow as the input texture becomes irregular. The HDP makes use of Dirichlet process prior which favors regular textures by penalizing the model complexity. This framework (HDP-2D-HMM) learns the texton vocabulary and their spatial layout jointly and automatically. The HDP-2D-HMM results in a compact representation of textures which allows fast texture synthesis with comparable rendering quality over the state-of-the-art image-based rendering methods. We also show that HDP-2D-HMM can be applied to perform image segmentation and synthesis.", "full_text": "Nonparametric Bayesian Texture Learning and\n\nSynthesis\n\nLong (Leo) Zhu1 Yuanhao Chen2 William Freeman1 Antonio Torralba1\n2Department of Statistics, UCLA\n{leozhu, billf, antonio}@csail.mit.edu\n\nyhchen@stat.ucla.edu\n\n1CSAIL, MIT\n\nAbstract\n\nWe present a nonparametric Bayesian method for texture learning and synthesis.\nA texture image is represented by a 2D Hidden Markov Model (2DHMM) where\nthe hidden states correspond to the cluster labeling of textons and the transition\nmatrix encodes their spatial layout (the compatibility between adjacent textons).\nThe 2DHMM is coupled with the Hierarchical Dirichlet process (HDP) which al-\nlows the number of textons and the complexity of transition matrix grow as the\ninput texture becomes irregular. 
The HDP makes use of a Dirichlet process prior which favors regular textures by penalizing the model complexity. This framework (HDP-2DHMM) learns the texton vocabulary and their spatial layout jointly and automatically. The HDP-2DHMM results in a compact representation of textures which allows fast texture synthesis with rendering quality comparable to the state-of-the-art patch-based rendering methods. We also show that the HDP-2DHMM can be applied to perform image segmentation and synthesis. The preliminary results suggest that the HDP-2DHMM is generally useful for further applications in low-level vision problems.\n\n1 Introduction\n\nTexture learning and synthesis are important tasks in computer vision and graphics. Recent attempts can be categorized into two different styles. The first style emphasizes the modeling and understanding problems and develops statistical models [1, 2] which are capable of representing texture using textons and their spatial layout. But the learning is rather sensitive to the parameter settings, and the rendering quality and speed are still not satisfactory. The second style relies on patch-based rendering techniques [3, 4] which focus on rendering quality and speed, but forego the semantic understanding and modeling of texture.\nThis paper aims at texture understanding and modeling with fast synthesis and high rendering quality. Our strategy is to augment the patch-based rendering method [3] with nonparametric Bayesian modeling and statistical learning. We represent a texture image by a 2D Hidden Markov Model (2D-HMM) (see figure (1)) where the hidden states correspond to the cluster labeling of textons and the transition matrix encodes the texton spatial layout (the compatibility between adjacent textons). The 2D-HMM is coupled with the Hierarchical Dirichlet process (HDP) [5, 6] which allows the number of textons (i.e. 
hidden states) and the complexity of the transition matrix to grow as more\ntraining data is available or the randomness of the input texture becomes large. The Dirichlet pro-\ncess prior penalizes the model complexity to favor reusing clusters and transitions and thus regular\ntexture which can be represented by compact models. This framework (HDP-2DHMM) discovers\nthe semantic meaning of texture in an explicit way that the texton vocabulary and their spatial layout\nare learnt jointly and automatically (the number of textons is fully determined by HDP-2DHMM).\nOnce the texton vocabulary and the transition matrix are learnt, the synthesis process samples the\nlatent texton labeling map according to the probability encoded in the transition matrix. The \ufb01nal\n\n1\n\n\fFigure 1: The \ufb02ow chart of texture learning and synthesis. The colored rectangles correspond to the index\n(labeling) of textons which are represented by image patches. The texton vocabulary shows the correspondence\nbetween the color (states) and the examples of image patches. The transition matrices show the probability\n(indicated by the intensity) of generating a new state (coded by the color of the top left corner rectangle),\ngiven the states of the left and upper neighbor nodes (coded by the top and left-most rectangles). The inferred\ntexton map shows the state assignments of the input texture. The top-right panel shows the sampled texton\nmap according to the transition matrices. The last panel shows the synthesized texture using image quilting\naccording to the correspondence between the sampled texton map and the texton vocabulary.\n\nimage is then generated by selecting the image patches based on the sampled texton labeling map.\nHere, image quilting [3] is applied to search and stitch together all the patches so that the boundary\ninconsistency is minimized. 
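The minimum-error boundary cut at the heart of image quilting [3] can be sketched as a small dynamic program over the overlap error surface. This is our own illustrative sketch under that reading of [3] (function and variable names are ours, not the authors' code):

```python
import numpy as np

def min_error_boundary_cut(overlap_a, overlap_b):
    """Dynamic-programming seam through the squared-difference surface of
    two overlapping patch strips (a vertical overlap), as in image quilting.
    Returns, for each row, the column where the cut passes."""
    err = (overlap_a.astype(float) - overlap_b.astype(float)) ** 2
    if err.ndim == 3:                      # sum the error over color channels
        err = err.sum(axis=2)
    h, w = err.shape
    cost = err.copy()
    for y in range(1, h):                  # accumulate minimal path cost downward
        for x in range(w):
            lo, hi = max(0, x - 1), min(w, x + 2)
            cost[y, x] += cost[y - 1, lo:hi].min()
    # backtrack the cheapest seam from bottom to top
    seam = np.empty(h, dtype=int)
    seam[-1] = int(np.argmin(cost[-1]))
    for y in range(h - 2, -1, -1):
        x = seam[y + 1]
        lo, hi = max(0, x - 1), min(w, x + 2)
        seam[y] = lo + int(np.argmin(cost[y, lo:hi]))
    return seam
```

Pixels on either side of the returned seam are taken from the respective patch, which hides the overlap boundary.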
In contrast to [3], our method only needs to search a much smaller set of candidate patches within a local texton cluster. Therefore, the synthesis cost is dramatically reduced. We show that the HDP-2DHMM is able to synthesize texture in one second (25 times faster than image quilting) with comparable quality. In addition, the HDP-2DHMM is less sensitive to the patch size, which has to be tuned for each input image in [3].\nWe also show that the HDP-2DHMM can be applied to perform image segmentation and synthesis. The preliminary results suggest that the HDP-2DHMM is generally useful for further applications in low-level vision problems.\n\n2 Previous Work\n\nOur primary interest is texture understanding and modeling. The FRAME model [7] provides a principled way to learn Markov random field models according to the marginal image statistics. This model is very successful in capturing stochastic textures, but may fail for more structured textures due to the lack of spatial modeling. Zhu et al. [1, 2] extend it to explicitly learn the textons and their spatial relations, which are represented by extra hidden layers. This new model is parametric (the number of texton clusters has to be tuned by hand for different texture images), and model selection, which might be unstable in practice, is needed to avoid overfitting. Therefore, the learning is sensitive to the parameter settings. Inspired by recent progress in machine learning, we extend the nonparametric Bayesian framework of coupling a 1D HMM with the HDP [6] to deal with 2D texture images. A new model (HDP-2DHMM) is developed to learn the texton vocabulary and spatial layout jointly and automatically.\nSince the HDP-2DHMM is designed to generate appropriate image statistics, but not pixel intensity, a patch-based texture synthesis technique, called image quilting [3], is integrated into our system to sample image patches. 
The texture synthesis algorithm has also been applied to image inpainting [8].\n\nFigure 2: Graphical representation of the HDP-2DHMM. α, α′, γ are hyperparameters set by hand. β are state parameters. θ and π are emission and transition parameters, respectively. i is the index of nodes in the HMM. L(i) and T(i) are the two nodes on the left of and above node i. zi are the hidden states of node i. xi are the observations (features) of the image patch at position i.\n\nMalik et al. [9, 10] and Varma and Zisserman [11] study filter representations of textons, which are related to our implementation of visual features. But the interactions between textons are not explicitly considered. Liu et al. [12, 13] address texture understanding by discovering regularity without explicit statistical texture modeling.\nOur work has partial similarities with the epitome [14] and jigsaw [15] models for non-texture images, which also model appearance and spatial layout jointly. The major difference is that their models, which are parametric, cannot grow automatically as more data becomes available. Our method is closely related to [16], which is not designed for texture learning. They use a hierarchical Dirichlet process, but the models and the image feature representations, including both the image filters and the data likelihood model, are different. The structure of the 2DHMM is also discussed in [17]. Other work using Dirichlet priors includes [18, 19].\nTree-structured vector quantization [20] has been used to speed up existing image-based rendering algorithms. While this is orthogonal to our work, it may help us optimize the rendering speed. 
The meaning of “nonparametric” in this paper is in the context of the Bayesian framework, which differs from the non-Bayesian terminology used in [4].\n\n3 Texture Modeling\n\n3.1 Image Patches and Features\n\nA texture image I is represented by a grid of image patches {xi}, each of size 24 × 24 in this paper, where i denotes the location. {xi} will be grouped into different textons by the HDP-2DHMM. We begin with a simplified model where the positions of textons represented by image patches are pre-determined by the image grid, and not allowed to shift. We will remove this constraint later.\nEach patch xi is characterized by a set of filter responses {w_i^{l,h,b}} which correspond to values b of image filter response h at location l. More precisely, each patch is divided into 6 by 6 cells (i.e. l = 1..36), each of which contains 4 by 4 pixels. For each pixel in cell l, we calculate 37 (h = 1..37) image filter responses, which include the 17 filters used in [21], Difference of Gaussian (DOG, 4 filters), Difference of Offset Gaussian (DOOG, 12 filters) and colors (R, G, B and L). w_i^{l,h,b} equals one if the averaged value of the filter responses of the 4 × 4 pixels covered by cell l falls into bin b (the response values are divided into 6 bins), and zero otherwise. Therefore, each patch xi is represented by 7992 (= 37 × 36 × 6) binary feature responses {w_i^{l,h,b}} in total. We let q = 1..7992 denote the index of the responses of visual features.\nIt is worth emphasizing that our feature representation differs from standard methods [10, 2], where k-means clustering is applied to form the visual vocabulary first. 
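The patch-feature construction described above can be sketched as follows. The equal-width binning per filter is an assumption on our part, since the paper does not specify how the 6 bins are chosen:

```python
import numpy as np

def patch_features(responses, n_bins=6, cell=4):
    """Binary feature vector for one 24x24 patch, following the paper's
    recipe: average each of the 37 filter responses over the 6x6 grid of
    4x4-pixel cells, then one-hot quantize each average into one of 6 bins.
    `responses` has shape (n_filters, 24, 24). Bin edges here are
    equal-width over each filter's observed range (an assumption)."""
    n_filters, h, w = responses.shape
    cells_y, cells_x = h // cell, w // cell            # 6 x 6 cells
    # average response of each filter within each cell
    avg = responses.reshape(n_filters, cells_y, cell, cells_x, cell).mean(axis=(2, 4))
    feats = np.zeros((n_filters, cells_y * cells_x, n_bins))
    for f in range(n_filters):
        lo, hi = responses[f].min(), responses[f].max()
        edges = np.linspace(lo, hi, n_bins + 1)
        # np.digitize against the interior edges gives a bin index in 0..n_bins-1
        idx = np.clip(np.digitize(avg[f].ravel(), edges[1:-1]), 0, n_bins - 1)
        feats[f, np.arange(cells_y * cells_x), idx] = 1.0
    return feats.ravel()                               # length 37 * 36 * 6 = 7992
```

Exactly one bin fires per (filter, cell) pair, so the vector sums to 37 × 36.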
By contrast, we skip the clustering step and leave the learning of the texton vocabulary, together with the spatial layout learning, to the HDP-2DHMM, which takes over the role of k-means.\n\n3.2 HDP-2DHMM: Coupling Hidden Markov Model with Hierarchical Dirichlet Process\n\nA texture is modeled by a 2D Hidden Markov Model (2DHMM) where the nodes correspond to the image patches xi and the compatibility is encoded by the edges connecting 4 neighboring nodes. See the graphical representation of the 2DHMM in figure 2. For any node i, let L(i), T(i), R(i), D(i) denote the four neighbors: left, upper, right and lower, respectively. We use zi to index the states of node i, which correspond to the cluster labeling of textons.\n\n• β ∼ GEM(α)\n• For each state z ∈ {1, 2, 3, ...}\n  – θ_z ∼ Dirichlet(γ)\n  – π_{zL} ∼ DP(α′, β)\n  – π_{zT} ∼ DP(α′, β)\n• For each pair of states (zL, zT)\n  – π_{zL,zT} ∼ DP(α′, β)\n• For each node i in the HMM\n  – if L(i) ≠ ∅ and T(i) ≠ ∅: zi | (zL(i), zT(i)) ∼ Multinomial(π_{zL(i),zT(i)})\n  – if L(i) ≠ ∅ and T(i) = ∅: zi | zL(i) ∼ Multinomial(π_{zL(i)})\n  – if L(i) = ∅ and T(i) ≠ ∅: zi | zT(i) ∼ Multinomial(π_{zT(i)})\n  – xi ∼ Multinomial(θ_{zi})\n\nFigure 3: HDP-2DHMM for texture modeling\n\nThe likelihood model p(xi|zi), which specifies the probability of the visual features, is defined by a multinomial distribution parameterized by θ_{zi}, specific to the hidden state zi:\n\nxi ∼ Multinomial(θ_{zi})  (1)\n\nwhere θ_{zi} specifies the weights of the visual features.\nFor a node i which is connected to nodes above and on the left (i.e. L(i) ≠ ∅ and T(i) ≠ ∅), the probability p(zi|zL(i), zT(i)) of its state zi is determined only by the states (zL(i), zT(i)) of the connected nodes. The distribution is a multinomial parameterized by π_{zL(i),zT(i)}:\n\nzi ∼ Multinomial(π_{zL(i),zT(i)})  (2)\n\nwhere π_{zL(i),zT(i)} encodes the transition matrix and thus the spatial layout of textons.\nFor the nodes in the top row or the left-most column (i.e. L(i) = ∅ or T(i) = ∅), the distributions of their states are modeled by Multinomial(π_{zL(i)}) or Multinomial(π_{zT(i)}), which can be considered simpler cases. We assume the top-left corner can be sampled from any state according to the marginal statistics of states. Without loss of generality, we skip the details of the boundary cases and focus on the nodes whose states are determined jointly by their top and left neighbors.\nTo make a nonparametric Bayesian representation, we need to allow the number of states zi to be countably infinite and to put prior distributions over the parameters θ_{zi} and π_{zL(i),zT(i)}. We achieve this by tying the 2DHMM together with the hierarchical Dirichlet process [5]. We define the prior of θ_z as a conjugate Dirichlet prior:\n\nθ_z ∼ Dirichlet(γ)  (3)\n\nwhere γ is the concentration hyperparameter which controls how uniform the distribution of θ_z is (note θ_z specifies weights of visual features): as γ increases, it becomes more likely that the visual features have equal probability. 
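The generative process of figure (3) can be sketched with a truncated stick-breaking construction. The truncation level K and the Dirichlet(α′β) finite approximation to DP(α′, β) are our own simplifications for illustration; the hyperparameter values match those reported in the experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
K, F = 20, 50                             # truncation level; toy feature count
alpha, alpha_p, gamma = 10.0, 1.0, 0.5    # hyperparameters as in the paper

# beta ~ GEM(alpha): truncated stick-breaking, renormalized at level K
sticks = rng.beta(1.0, alpha, size=K)
beta = sticks * np.concatenate(([1.0], np.cumprod(1.0 - sticks)[:-1]))
beta /= beta.sum()

# theta_z ~ Dirichlet(gamma): per-state feature weights
theta = rng.dirichlet(np.full(F, gamma), size=K)

# pi ~ DP(alpha', beta), here approximated by Dirichlet(alpha' * beta)
pi_L  = rng.dirichlet(alpha_p * beta + 1e-8, size=K)        # left-neighbor-only rows
pi_T  = rng.dirichlet(alpha_p * beta + 1e-8, size=K)        # top-neighbor-only rows
pi_LT = rng.dirichlet(alpha_p * beta + 1e-8, size=(K, K))   # joint (zL, zT) table

# raster-scan sampling of the state grid; the top-left corner uses beta
H = W = 8
z = np.zeros((H, W), dtype=int)
for i in range(H):
    for j in range(W):
        if i == 0 and j == 0:
            p = beta
        elif i == 0:
            p = pi_L[z[i, j - 1]]         # top row: condition on left neighbor
        elif j == 0:
            p = pi_T[z[i - 1, j]]         # left column: condition on top neighbor
        else:
            p = pi_LT[z[i, j - 1], z[i - 1, j]]
        z[i, j] = rng.choice(K, p=p)
```

Sampling `theta[z[i, j]]` at each node would then generate the per-patch feature observations.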
Since the likelihood model p(xi|zi) is of multinomial form, the posterior distribution of θ_z has an analytic form: it is still a Dirichlet distribution.\nThe transition parameters π_{zL,zT} are modeled by a hierarchical Dirichlet process (HDP):\n\nβ ∼ GEM(α)  (4)\nπ_{zL,zT} ∼ DP(α′, β)  (5)\n\nwhere we first draw global weights β according to the stick-breaking prior distribution GEM(α). The stick-breaking weights β specify the probabilities of the states, which are globally shared among all nodes. The stick-breaking prior produces exponentially decaying weights in expectation, such that simple models with fewer representative clusters (textons) are favored given few observations, but there is always a small probability that new clusters are created to capture details revealed by large, complex textures. The concentration hyperparameter α controls the sparseness of states: a larger α leads to more states. The prior of the transition parameter π_{zL,zT} is modeled by a Dirichlet process DP(α′, β), which is a distribution over distributions centered on β. α′ is a hyperparameter which controls the variability of π_{zL,zT} over different states across all nodes: as α′ increases, the state transitions become more regular. Therefore, the HDP makes use of a Dirichlet process prior to place a soft bias towards simpler models (in terms of the number of states and the regularity of state transitions) which explain the texture.\nThe generative process of the HDP-2DHMM is described in figure (3). We now have the full representation of the HDP-2DHMM. But this simplified model does not allow the textons (image patches) to be shifted. We remove this constraint by introducing two hidden variables (ui, vi) which indicate the displacements of the textons associated with node i. 
We only need to adjust the correspondence between the image features xi and the hidden states zi: xi is modified to be x_{ui,vi}, which refers to the image features located at displacement (ui, vi) from position i. The random variables (ui, vi) are connected only to the observation xi (not shown in figure 2). (ui, vi) have a uniform prior, but are limited to a small neighborhood of i (maximum 10% shift on one side).\n\n4 Learning HDP-2DHMM\n\nIn a Bayesian framework, the task of learning the HDP-2DHMM (also called Bayesian inference) is to compute the posterior distribution p(θ, π, z|x). It is trivial to sample the hidden variables (u, v) because of their uniform prior; for simplicity, we skip the details of sampling u, v. Here, we present an inference procedure for the HDP-2DHMM based on Gibbs sampling. Our procedure alternates between two sampling stages: (i) sampling the state assignments z; (ii) sampling the global weights β. Given fixed values for z and β, the posterior of θ can be easily obtained by aggregating statistics of the observations assigned to each state. The posterior of π is Dirichlet. For more details on Dirichlet processes, see [5].\nWe first instantiate a random hidden state labeling and then iteratively repeat the following two steps.\nSampling z. In this stage we sample a state for each node. The probability of node i being assigned state t is given by:\n\nP(zi = t | z_{-i}, β) ∝ f_t^{-xi}(xi) · P(zi = t | zL(i), zT(i)) · P(zR(i) | zi = t, zT(R(i))) · P(zD(i) | zL(D(i)), zi = t)  (6)\n\nThe first term f_t^{-xi}(xi) denotes the posterior probability of observation xi given all other observations assigned to state t, and z_{-i} denotes all state assignments except zi. Let n_{qt} be the number of observations of feature w^q with state t. f_t^{-xi}(xi) is calculated by:\n\nf_t^{-xi}(xi) = ∏_q ( (n_{qt} + γ_q) / (Σ_{q'} n_{q't} + Σ_{q'} γ_{q'}) )^{w_i^q}  (7)\n\nwhere γ_q is the weight for visual feature w^q.\nThe next term P(zi = t | zL(i) = r, zT(i) = s) is the probability of state t, given the states of the nodes on the left and above, i.e. L(i) and T(i). Let n_{rst} be the number of observations with state t whose left and upper neighbors' states are r for L(i) and s for T(i). The probability of generating state t is given by:\n\nP(zi = t | zL(i) = r, zT(i) = s) = (n_{rst} + α′ β_t) / (Σ_{t'} n_{rst'} + α′)  (8)\n\nwhere β_t refers to the weight of state t. This calculation follows the properties of the Dirichlet distribution [5].\nThe last two terms, P(zR(i) | zi = t, zT(R(i))) and P(zD(i) | zL(D(i)), zi = t), are the probabilities of the states of the right and lower neighbor nodes (R(i), D(i)) given zi. These two terms can be computed in a form similar to equation (8).\nSampling β. In the second stage, given the assignments z = {zi}, we sample β using the Dirichlet distribution as described in [5].\n\nFigure 4: The colors of the rectangles in columns 2 and 3 correspond to the index (labeling) of textons, which are represented by 24×24 image patches. The synthesized images are all 384×384 (16×16 textons/patches). Our method captures both stochastic textures (the last two rows) and more structured textures (the first three rows; see the horizontal and gridded layouts). The inferred texton maps for structured textures are simpler (fewer states/textons) and more regular (less cluttered texton maps) than those for stochastic textures.\n\n5 Texture Synthesis\n\nOnce the texton vocabulary and the transition matrix are learnt, the synthesis process first samples the latent texton labeling map according to the probability encoded in the transition matrix. 
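Before turning to synthesis, the per-node Gibbs update of equations (6)–(8) above can be sketched as follows. The array shapes and toy hyperparameters are our own illustrative choices; the right- and lower-neighbor factors of equation (6) have the same form as equation (8) and are omitted for brevity:

```python
import numpy as np

def likelihood_term(x_i, n_qt, gamma_q):
    """Equation (7): posterior predictive of node i's binary features x_i (Q,)
    under each state t, from feature-state counts n_qt (Q, T) and Dirichlet
    weights gamma_q (Q,). Returns an unnormalized vector over states t."""
    p_qt = (n_qt + gamma_q[:, None]) / (n_qt.sum(axis=0) + gamma_q.sum())
    # product over the features that fire in x_i, done in log space for stability
    return np.exp(x_i @ np.log(p_qt))

def transition_term(n_rst, r, s, beta, alpha_p):
    """Equation (8): P(z_i = t | zL = r, zT = s) for all t, from transition
    counts n_rst (T, T, T) and the global stick-breaking weights beta (T,)."""
    return (n_rst[r, s] + alpha_p * beta) / (n_rst[r, s].sum() + alpha_p)

def gibbs_probs(x_i, r, s, n_qt, n_rst, beta, gamma_q, alpha_p):
    """Equation (6) without the right/lower-neighbor factors; returns the
    normalized distribution over states t for node i."""
    p = likelihood_term(x_i, n_qt, gamma_q) * transition_term(n_rst, r, s, beta, alpha_p)
    return p / p.sum()
```

With symmetric counts the update is uniform over states, which is a quick sanity check on the formulas.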
But the HDP-2DHMM is generative only for image features, not image intensity. To make it practical for image synthesis, image quilting [3] is integrated with the HDP-2DHMM. The final image is then generated by selecting image patches according to the texton labeling map. Image quilting is applied to select and stitch together all the patches in a top-left-to-bottom-right order so that the boundary inconsistency is minimized. The width of the overlap edge is 8 pixels. In contrast to [3], which needs to search over all image patches to ensure high rendering quality, our method only needs to search the candidate patches within a local cluster. The HDP-2DHMM is capable of producing high rendering quality because the patches have been grouped based on visual features. Therefore, the synthesis cost is dramatically reduced. We show that the HDP-2DHMM is able to synthesize a texture image of size 384×384 with comparable quality in one second (25 times faster than image quilting).\n\nFigure 5: More synthesized texture images (for each pair, the left is the input texture and the right is the synthesized one).\n\n6 Experimental Results\n\n6.1 Texture Learning and Synthesis\n\nWe use the texture images in [3]. The hyperparameters {α, α′, γ} are set to 10, 1, and 0.5, respectively. The image patch size is fixed to 24×24. All the parameter settings are identical for all images. The learning runs with 10 random initializations, each of which takes about 30 sampling iterations to converge. A computer with a 2.4 GHz CPU was used. For each image, it takes 100 seconds for learning and 1 second for synthesis (almost 25 times faster than [3]).\nFigure (4) shows the inferred texton labeling maps, the sampled texton maps and the synthesized texture images. More synthesized images are shown in figure (5). The rendering quality is visually comparable with [3] (not shown) for both structured textures and stochastic textures. 
It is interesting to see that the HDP-2DHMM captures different types of texture patterns, such as vertical, horizontal and gridded layouts. This suggests that our method is able to discover the semantic meaning of texture by learning the texton vocabulary and the spatial relations between textons.\n\nFigure 6: Image segmentation and synthesis. The first three rows show that the HDP-2DHMM is able to segment images containing mixtures of textures and to synthesize new textures. The last row shows a failure example where the textons are not well aligned.\n\n6.2 Image Segmentation and Synthesis\n\nWe also apply the HDP-2DHMM to perform image segmentation and synthesis. Figure (6) shows several examples of natural images which contain mixtures of textured regions. The segmentation results are represented by the inferred state assignments (the texton map). In figure (6), one can see that our method successfully divides images into meaningful regions and that the synthesized images look visually similar to the input images. These results suggest that the HDP-2DHMM framework is generally useful for low-level vision problems. The last row in figure (6) shows a failure example where the textons are not well aligned.\n\n7 Conclusion\n\nThis paper describes a novel nonparametric Bayesian method for texture learning and synthesis. The 2D Hidden Markov Model (HMM) is coupled with the hierarchical Dirichlet process (HDP), which allows the number of textons and the complexity of the transition matrix to grow as the input texture becomes irregular. The HDP makes use of a Dirichlet process prior which favors regular textures by penalizing model complexity. This framework (HDP-2DHMM) learns the texton vocabulary and the textons' spatial layout jointly and automatically. We demonstrated that the resulting compact representation obtained by the HDP-2DHMM allows fast texture synthesis (under one second) with rendering quality comparable to the state-of-the-art image-based rendering methods. 
Our results\non image segmentation and synthesis suggest that the HDP-2DHMM is generally useful for further\napplications in low-level vision problems.\nAcknowledgments. This work was supported by NGA NEGI-1582-04-0004, MURI Grant N00014-\n06-1-0734, ARDA VACE, and gifts from Microsoft Research and Google. Thanks to the anonymous\nreviewers for helpful feedback.\n\n8\n\n\fReferences\n[1] Y. N. Wu, S. C. Zhu, and C.-e. Guo, \u201cStatistical modeling of texture sketch,\u201d in ECCV \u201902:\nProceedings of the 7th European Conference on Computer Vision-Part III, 2002, pp. 240\u2013254.\n[2] S.-C. Zhu, C.-E. Guo, Y. Wang, and Z. Xu, \u201cWhat are textons?\u201d International Journal of\n\nComputer Vision, vol. 62, no. 1-2, pp. 121\u2013143, 2005.\n\n[3] A. A. Efros and W. T. Freeman, \u201cImage quilting for texture synthesis and transfer,\u201d in Siggraph,\n\n2001.\n\n[4] A. Efros and T. Leung, \u201cTexture synthesis by non-parametric sampling,\u201d in International Con-\n\nference on Computer Vision, 1999, pp. 1033\u20131038.\n\n[5] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei, \u201cHierarchical dirichlet processes,\u201d Journal\n\nof the American Statistical Association, 2006.\n\n[6] M. J. Beal, Z. Ghahramani, and C. E. Rasmussen, \u201cThe in\ufb01nite hidden markov model,\u201d in\n\nNIPS, 2002.\n\n[7] S. C. Zhu, Y. Wu, and D. Mumford, \u201cFilters, random \ufb01elds and maximum entropy (frame):\nTowards a uni\ufb01ed theory for texture modeling,\u201d International Journal of Computer Vision,\nvol. 27, pp. 1\u201320, 1998.\n\n[8] A. Criminisi, P. Perez, and K. Toyama, \u201cRegion \ufb01lling and object removal by exemplar-based\n\ninpainting,\u201d IEEE Trans. on Image Processing, 2004.\n\n[9] J. Malik, S. Belongie, J. Shi, and T. Leung, \u201cTextons, contours and regions: Cue integration in\n\nimage segmentation,\u201d IEEE International Conference on Computer Vision, vol. 2, 1999.\n\n[10] T. Leung and J. 
Malik, \u201cRepresenting and recognizing the visual appearance of materials us-\ning three-dimensional textons,\u201d International Journal of Computer Vision, vol. 43, pp. 29\u201344,\n2001.\n\n[11] M. Varma and A. Zisserman, \u201cTexture classi\ufb01cation: Are \ufb01lter banks necessary?\u201d IEEE Com-\n\nputer Society Conference on Computer Vision and Pattern Recognition, vol. 2, 2003.\n\n[12] Y. Liu, W.-C. Lin, and J. H. Hays, \u201cNear regular texture analysis and manipulation,\u201d ACM\n\nTransactions on Graphics (SIGGRAPH 2004), vol. 23, no. 1, pp. 368 \u2013 376, August 2004.\n\n[13] J. Hays, M. Leordeanu, A. A. Efros, and Y. Liu, \u201cDiscovering texture regularity as a higher-\norder correspondence problem,\u201d in 9th European Conference on Computer Vision, May 2006.\n[14] N. Jojic, B. J. Frey, and A. Kannan, \u201cEpitomic analysis of appearance and shape,\u201d in In ICCV,\n\n2003, pp. 34\u201341.\n\n[15] A. Kannan, J. Winn, and C. Rother, \u201cClustering appearance and shape by learning jigsaws,\u201d in\n\nIn Advances in Neural Information Processing Systems. MIT Press, 2007.\n\n[16] J. J. Kivinen, E. B. Sudderth, and M. I. Jordan, \u201cLearning multiscale representations of natural\nscenes using dirichlet processes,\u201d IEEE International Conference on Computer Vision, vol. 0,\n2007.\n\n[17] J. Domke, A. Karapurkar, and Y. Aloimonos, \u201cWho killed the directed model?\u201d in IEEE\n\nComputer Society Conference on Computer Vision and Pattern Recognition, 2008.\n\n[18] L. Cao and L. Fei-Fei, \u201cSpatially coherent latent topic model for concurrent object segmenta-\ntion and classi\ufb01cation,\u201d in Proceedings of IEEE International Conference on Computer Vision,\n2007.\n\n[19] X. Wang and E. Grimson, \u201cSpatial latent dirichlet allocation,\u201d in NIPS, 2007.\n[20] L.-Y. Wei and M. 
Levoy, \u201cFast texture synthesis using tree-structured vector quantization,\u201d\nin SIGGRAPH \u201900: Proceedings of the 27th annual conference on Computer graphics and\ninteractive techniques, 2000, pp. 479\u2013488.\n\n[21] J. Winn, A. Criminisi, and T. Minka, \u201cObject categorization by learned universal visual dictio-\nnary,\u201d in Proceedings of the Tenth IEEE International Conference on Computer Vision, 2005.\n\n9\n\n\f", "award": [], "sourceid": 173, "authors": [{"given_name": "Long", "family_name": "Zhu", "institution": null}, {"given_name": "Yuanahao", "family_name": "Chen", "institution": null}, {"given_name": "Bill", "family_name": "Freeman", "institution": null}, {"given_name": "Antonio", "family_name": "Torralba", "institution": null}]}