{"title": "Active Matting", "book": "Advances in Neural Information Processing Systems", "page_first": 4590, "page_last": 4600, "abstract": "Image matting is an ill-posed problem. It requires a user input trimap or some strokes to obtain an alpha matte of the foreground object. A fine user input is essential to obtain a good result, which is either time consuming or suitable for experienced users who know where to place the strokes. In this paper, we explore the intrinsic relationship between the user input and the matting algorithm to address the problem of where and when the user should provide the input. Our aim is to discover the most informative sequence of regions for user input in order to produce a good alpha matte with minimum labeling efforts. To this end, we propose an active matting method with recurrent reinforcement learning. The proposed framework involves human in the loop by sequentially detecting informative regions for trivial human judgement. Comparing to traditional matting algorithms, the proposed framework requires much less efforts, and can produce satisfactory results with just 10 regions. Through extensive experiments, we show that the proposed model reduces user efforts significantly and achieves comparable performance to dense trimaps in a user-friendly manner. We further show that the learned informative knowledge can be generalized across different matting algorithms.", "full_text": "Active Matting\n\nXin Yang\u2217\n\nDalian University of Technology\nCity University of Hong Kong\nxinyang@dlut.edu.cn\n\nShaozhe Chen\n\nDalian University of Technology\ncsz@mail.dlut.edu.cn\n\nKe Xu\u2217\n\nDalian University of Technology\nCity University of Hong Kong\n\nkkangwing@mail.dlut.edu.cn\n\nShengfeng He\u2020\n\nSouth China University of Technology\n\nhesfe@scut.edu.cn\n\nBaocai Yin\n\nDalian University of Technology\n\nybc@dlut.edu.cn\n\nRynson W.H. 
Lau\u2021\n\nCity University of Hong Kong\n\nrynson.lau@cityu.edu.hk\n\nAbstract\n\nImage matting is an ill-posed problem. It requires a user input trimap or some\nstrokes to obtain an alpha matte of the foreground object. A \ufb01ne user input is\nessential to obtain a good result, but it is either time-consuming to create or only suitable for\nexperienced users who know where to place the strokes. In this paper, we explore\nthe intrinsic relationship between the user input and the matting algorithm to\naddress the problem of where and when the user should provide the input. Our\naim is to discover the most informative sequence of regions for user input in\norder to produce a good alpha matte with minimal labeling effort. To this\nend, we propose an active matting method with recurrent reinforcement learning.\nThe proposed framework involves a human in the loop by sequentially detecting\ninformative regions for trivial human judgement. Compared with traditional matting\nalgorithms, the proposed framework requires much less effort, and can produce\nsatisfactory results with just 10 regions. Through extensive experiments, we show\nthat the proposed model reduces user effort signi\ufb01cantly and achieves comparable\nperformance to dense trimaps in a user-friendly manner. We further show that\nthe learned informative knowledge can be generalized across different matting\nalgorithms.\n\n1\n\nIntroduction\n\nAlpha matting (or image matting) refers to accurately extracting a foreground object of interest from\nan input image. This problem, as well as its inverse process (known as image composition), has been\nwell studied by both the research and industrial communities. Mathematically, alpha matting can be\nmodeled by the following under-constrained equation:\n\nI_z = \u03b1_z F_z + (1 \u2212 \u03b1_z) B_z,\n\n(1)\n\nwhere z = (x, y) denotes the pixel position in the input image I. F and B refer to the output\nforeground and background images. \u03b1 is the alpha matte, whose values range between [0, 1], de\ufb01ning\nthe opacity of the foreground.\n\n\u2217Joint \ufb01rst authors.\n\u2020Corresponding author.\n\u2021This work was led by Rynson Lau.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\f(a) Input and GT\n\n(b) Trimap\n\n(c) Coarse\n\n(d) Inappropriate\n\n(e) Ours\n\nFigure 1: Limitations of existing matting approaches. Creating a trimap is tedious, and scribbles may\nproduce unsatisfactory mattes if they are inappropriately placed or not dense enough. The proposed\napproach is able to generate comparable results to using trimaps, with only a few labelled regions.\n\nThis highly ill-posed problem is typically addressed by introducing additional information, in the\nform of user-de\ufb01ned trimaps [6, 19] or strokes [8, 21], to con\ufb01ne the scale of unknowns. However,\nthe quality of the trimap can largely affect the accuracy of the \ufb01nal output [13, 24], and labeling a\n\ufb01ne trimap is also tedious. Although drawing strokes may be a more user-friendly way of obtaining\nthe foreground-background information [11, 13], scribbles provide only a small set of labelled pixels,\nwhich can be regarded as a less-accurate trimap and may not be able to provide suf\ufb01cient constraints\nto solve Eq. 1 (see Figure 1(c)). More importantly, if scribbles are not correctly drawn, the matting\nalgorithm may be misled to produce unsatisfactory mattes (see Figure 1(d)).\nThe above observations indicate that a good output matte requires dense labeling (i.e., manual effort)\nas well as appropriately placed labels (i.e., skill). In this paper, we aim to automatically\ndetermine \u201cwhere\u201d to best place the labels, to minimize human effort and the reliance on user\npro\ufb01ciency in the matting problem. 
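For reference, the compositing model of Eq. 1, which every matting algorithm ultimately inverts, can be sketched in a few lines of Python (a minimal illustration; the function name, array names and toy values below are our own, not from the paper):

```python
import numpy as np

def composite(fg, bg, alpha):
    """Per-pixel compositing, Eq. 1: I_z = alpha_z * F_z + (1 - alpha_z) * B_z.

    fg, bg: H x W x 3 float arrays; alpha: H x W matte with values in [0, 1].
    """
    a = alpha[..., None]          # broadcast the matte over the color channels
    return a * fg + (1.0 - a) * bg

# A 1x2 toy image: left pixel fully foreground, right pixel a 50/50 mix.
fg = np.array([[[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]])   # red foreground
bg = np.array([[[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]])   # blue background
alpha = np.array([[1.0, 0.5]])
print(composite(fg, bg, alpha))   # [[[1. 0. 0.] [0.5 0. 0.5]]]
```

Matting is the inverse problem: given only I, recover alpha (and F, B), which is under-constrained and is exactly why the user input discussed above is needed.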
To this end, we propose to learn to detect the regions that affect\nthe matting performance most. We refer to these regions as informative regions. We propose an\nactive matting model with a recurrent neural network to discover these informative regions. The user\nis involved in the loop, but is only required to label a suggested region as foreground or background.\nThis strategy proposes a sequence of regions to the user for labeling one by one, and learns from\nthe user\u2019s feedbacks. We adopt the reinforcement learning strategy to allow direct supervision\nbased on an arbitrary matting algorithm and a ground-truth alpha matte. The proposed network is\nable to output an accurate matte in a few iterations. Extensive experiments show that our model is a\npromising solution to address the problem of \u201cwhere to draw\u201d so as to improve the interaction speed\nand matting accuracy.\nThe main contributions of this paper are as follows: 1. We propose an active model to learn to detect\ninformative regions for alpha matting. Our model can actively suggest a small number of informative\nregions for the user to label. 2. We propose a recurrent network with the reinforcement learning\nstrategy. The network is trained via user feedbacks. 3. We delve into the problem of informative\nregions for matting, and show that the learned informative knowledge can be generalized to a variety\nof matting algorithms.\n\n2 Methodology\n\n2.1 Overview\n\nOur goal here is to explore the problem of where to label image regions for obtaining a good matte.\nDue to the dependency of the subsequent selection of informative regions on the user\u2019s feedback, we\nconsider it inappropriate to determine all the informative regions at the same time (discussed in\nSection 3). Instead, we factorize this problem into an interactive search for a sequence of informative
(All rectangular boxes of the same color share the\nsame parameters.)\n\nregions. In this way, the image uncertainty problem can be gradually resolved, depending on the\nimage content, previous suggested regions and the corresponding user feedbacks.\nWe have considered two key problems in our model design: how to detect the next most informative\nregion based on the previously suggested regions and user feedbacks, and how to allow the proposed\nmethod to work with arbitrary matting algorithms. We address the \ufb01rst problem by using a RNN unit,\nwhich is able to make sequential decisions based on previous knowledge, and the second problem by\nusing a matte solver together with our reinforcement training strategy.\nFigure 2 shows the proposed pipeline, which can be summarized as follows. An input image is\n\ufb01rst fed into the Feature Extraction Net to extract the image features g0. g0 is then fed into the\nRNN Unit to provide the \u201cvisual\u201d information for prediction, which is then decoded by the Location\nDecoding Net to obtain the \ufb01rst suggested informative region (represented as a 2D coordinate l1). At\neach iteration after a region is suggested, our model asks the user to indicate if the region belongs\nto the foreground or background layer. Each pixel within this region will then be assigned with a\ncorresponding label in the accumulated trimap. The Matte Solver takes the input image and the\naccumulated trimap as input and computes a matte, which is then fed with the 2D coordinate of the\npreviously suggested region to the Joint Encoding Net to jointly encode the relationship between\nthe previous region suggestion and the resulting matte. Finally, the RNN Unit uses the information\nthat encodes the previous region-matte relationships and the initial visual features to suggest the next\ninformative region for user input. 
The proposed network learns from the user feedbacks, and we\nadopt reinforcement learning to assign a training reward to each detected informative region.\n\n2.2 Architecture\n\nWe \ufb01rst present the detailed architecture of our active matting model.\nThe Feature Extraction Net. This network serves as a feature extraction module. It analyzes the\ninput image I and projects it to a lower feature space: g0 = fExtra(I; \u03b8Extra), where \u03b8Extra are\nthe network parameters, and g0 is a 1 \u00d7 1000 vector. In our implementation, we adapt the VGG16\nnetwork [18] here without its \ufb01nal softmax layer.\nThe RNN Unit. We use the Long Short Term Memory (LSTM) unit [10] to fuse the image features\nand the region-matte relations to predict the location for the next informative region: vi+1 =\nfrnn({gk}; \u03b8rnn), where k = 1, 2, ..., i, and \u03b8rnn represents the parameters. vi is a 128-dimensional vector.\n\n3\n\n\fIn this way, our model can suggest the next region by considering all the previously suggested\nregions and their resulting mattes.\nThe Location Decoding Net. It takes the predicted information vi from the RNN Unit and decodes\nit into a 2D coordinate: li = floc(vi; \u03b8loc), where i is the iteration number, and \u03b8loc refers to the\nnetwork parameters. Our model actively proposes 20 \u00d7 20 regions centered at li.\nThe Matte Solver. The accumulated trimap si is generated by the current and all previous pairs of\nsuggested regions and their corresponding user inputs. We pass si and image I through the matte\nsolver to obtain the latest matte \u03b1i: \u03b1i = fsolver(si, I). In our implementation, we use shared\nmatting [7] as the matte solver because it is ef\ufb01cient for training. However, as will be shown in\nSection 3, the sequence of informative regions produced by our model is general and can be used\nwith any matting algorithm.\nThe Joint Encoding Net. 
Once we have a proposed 2D coordinate li, the joint encoding net fuses\nthis coordinate with its corresponding matte, for the purpose of establishing the relation between\nthe suggested region and the resulting matte. The relation is encoded as gi = fjEnc(li, \u03b1i; \u03b8jEnc),\nwhere \u03b8jEnc refers to the network parameters. Similar to g0, gi is also a 1 \u00d7 1000 vector. In our\nimplementation, the joint encoding net takes the estimated coordinates and a local patch of the\nresulting matte centered at the estimated coordinates as input, and outputs a vector for the following\nRNN Unit. The patch size is \ufb01xed at 75 \u00d7 75.\n\n2.3 Reinforcement Learning of the Sequence\n\nThe proposed active model needs to establish a connection between the matting solver, the suggested\nregions, and the ground truth matte. To obtain supervision for the proposed model, each suggested\nregion (i.e., the accumulated trimap si) is fed to the matting solver, and its output matte is then\ncompared with the ground truth matte. However, this training process involves a matting solver,\nwhich is non-differentiable. Thus, we are not able to train it using traditional back-propagation\nstrategies, and we take an alternative solution with reinforcement learning [22].\nWe \ufb01rst measure the accuracy of the alpha matte using the Root Mean Square Error (RMSE) metric:\n\nRMSE = \u221a( (1/N) \u03a3_{z=1}^{N} \u2016\u02c6\u03b1_z \u2212 \u03b1^{gt}_z\u2016_2^2 ),\n\n(2)\n\nwhere N refers to the total number of pixels. \u02c6\u03b1 and \u03b1^{gt} represent the estimated and ground truth\nmattes, respectively. Assuming that the alpha values have a Gaussian distribution, minimizing Eq. 2\nis approximately equivalent to maximizing log p(\u03b1|I, \u0398), where \u0398 refers to the model parameters,\nand p(\u00b7) refers to the likelihood function. 
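Concretely, the RMSE of Eq. 2 amounts to the following (a small numpy sketch; the function and variable names are ours):

```python
import numpy as np

def matte_rmse(alpha_est, alpha_gt):
    """Root mean square error between estimated and ground-truth mattes (Eq. 2)."""
    diff = alpha_est.astype(np.float64) - alpha_gt.astype(np.float64)
    return np.sqrt(np.mean(diff ** 2))

# Toy 2x2 mattes: one pixel is off by 0.5, so RMSE = sqrt(0.25 / 4) = 0.25.
est = np.array([[0.0, 0.5], [1.0, 1.0]])
gt  = np.array([[0.0, 1.0], [1.0, 1.0]])
print(matte_rmse(est, gt))   # 0.25
```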
We note that this log-likelihood term log p(\u03b1|I, \u0398)\ncan be maximized by marginalizing over a sequence of proposed regions {li}: log p(\u03b1|I, \u0398) =\nlog \u03a3_l p(l|I, \u0398) p(\u03b1|l, I, \u0398). This marginalized function can be learned by maximizing its lower\nbound as discussed in [3]:\n\nF = \u03a3_l p(l|I, \u0398) log p(\u03b1|l, I, \u0398).\n\n(3)\n\nBy taking the derivative of Eq. 3 w.r.t. the model parameters \u0398, we obtain the training rules as:\n\n\u2202F/\u2202\u0398 = \u03a3_l p(l|I, \u0398) [ \u2202log p(\u03b1|l, I, \u0398)/\u2202\u0398 + log p(\u03b1|l, I, \u0398) \u2202log p(l|I, \u0398)/\u2202\u0398 ].\n\n(4)\n\nNote that the \ufb01rst term in Eq. 4 is the gradient of the matte estimation with respect to the model\nparameters. We have designed our matte solver as a plug-and-play module, and excluded this term\nfrom our \ufb01nal loss function. In this way, our model can focus on the task of \ufb01nding the informative\nregions, and will not be in\ufb02uenced by the matting method used, allowing the suggested regions to be\nsuf\ufb01ciently general for different matting methods.\nTo avoid an exponentially growing solution space of location l, we then adopt Monte Carlo sampling\nfor approximation: \u02dcl^m_i \u223c p(li|I, \u0398) = N(li; \u02c6li, \u2126), where \u02c6li is the estimated location at the i-th\niteration and \u2126 is a prede\ufb01ned standard deviation. Hence, Eq. 4 can be rewritten as:\n\n\u2202F/\u2202\u0398 = (1/M) \u03a3_{m=1}^{M} \u03a3_{i=1}^{T} [ log p(\u03b1|\u02dcl^m_i, I, \u0398) \u2202log p(\u02dcl^m_i|I, \u0398)/\u2202\u0398 ],\n\n(5)\n\n4\n\n\fwhere M is the number of training episodes, and T is the total number of proposals in each episode.\nThe last problem that we need to solve is that the log-likelihood log p(\u03b1|\u02dcl^m_i, I, \u0398) may introduce\nunbounded high variance to the gradient estimator on bad regions suggested during training. This can\nbe addressed by introducing the gradient variance reduction strategy [12] to our model. We replace\nthe log-likelihood with the difference between the reward function R and the output of a baseline\nnetwork: bi = fbase(vi; \u03b8base), where vi is the output of the RNN unit at the i-th iteration and \u03b8base\nrefers to the network parameters. The baseline network is trained to learn the expected value of R in\norder to normalize the reward to be mean zero [20]. Hence, the training loss is formulated as:\n\n\u2202F/\u2202\u0398 \u2248 (1/M) \u03a3_{m=1}^{M} \u03a3_{i=1}^{T} (R^m_i \u2212 b_i) \u2202log p(\u02dcl^m_i|I, \u0398)/\u2202\u0398.\n\n(6)\n\nIn this way, we factorize the goal of maximizing the reduction in RMSE in a stepwise manner, which\nmeans that our network is able to propose a sequence of regions, each of which leads to a maximum\ndecline in RMSE. In other words, these suggested, ordered regions are informative regions for the\nmatting problem.\nTraining Strategy and Reward De\ufb01nition. At the i-th iteration, our model suggests a region li and\nreceives a reward Ri, which indicates the quality of li. 
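Before the reward itself is defined, note that the estimator of Eqs. 5-6 is the standard REINFORCE rule with a learned baseline, applied to a Gaussian location policy. The numpy sketch below is our own simplification, not the paper's implementation: it covers a single episode, uses a fixed standard deviation Omega, and takes gradients w.r.t. the predicted location rather than the full parameter set Theta (the chain rule then carries them into the network parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
OMEGA = 0.05  # predefined std of the Gaussian location policy (a value we made up)

def sample_location(l_hat):
    """Monte Carlo sample l~ ~ N(l_hat, OMEGA^2) around the predicted 2D coordinate."""
    return l_hat + OMEGA * rng.standard_normal(2)

def grad_log_gaussian(l_sample, l_hat):
    """d log p(l | I, Theta) / d l_hat for an isotropic Gaussian policy."""
    return (l_sample - l_hat) / OMEGA ** 2

def reinforce_grad(l_samples, l_hats, rewards, baselines):
    """Eq. 6 for a single episode (M = 1): sum over the T steps of
    (R_i - b_i) * grad log p(l~_i | I, Theta)."""
    grads = [(r - b) * grad_log_gaussian(ls, lh)
             for ls, lh, r, b in zip(l_samples, l_hats, rewards, baselines)]
    return np.sum(grads, axis=0)
```

Subtracting the baseline b_i leaves the estimator unbiased (the baseline does not depend on the sampled location) while shrinking its variance, which is the point of the strategy borrowed from [12, 20].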
Suppose that there are S possible locations to\nbe selected for suggestion at the i-th iteration. We can then dynamically de\ufb01ne the reward by how much\nthe j-th region affects the estimated matte as:\n\nR^j_i = [ \u2016\u03b1_{i\u22121} \u2212 \u03b1^{gt}\u2016 \u2212 \u2016\u03b1^j_i \u2212 \u03b1^{gt}\u2016 ] / max_{j\u2208S} { \u2016\u03b1_{i\u22121} \u2212 \u03b1^{gt}\u2016 \u2212 \u2016\u03b1^j_i \u2212 \u03b1^{gt}\u2016 },\n\n(7)\n\nwhere the region corresponding to the maximum RMSE reduction is regarded as the ground truth\nregion suggestion and will receive a reward of 1. The other regions will receive rewards according to\ntheir percentages of RMSE reduction.\nTo train our model at the i-th iteration, we run our network S times repeatedly and then collect S\nlatent regions as suggestions for the i-th iteration. Each latent region receives a reward according to\nits RMSE reduction. Based on the given rewards, we select the best one as the suggested region to\nfacilitate the training of \ufb01nding informative regions. S cannot be set to 1. Otherwise, the proposed\nmodel fails to distinguish what a good region is, as it always receives a reward of 1 no matter how\nmuch the RMSE declines. We empirically compare the performance by selecting S from {1, 2, 5}. While\nS = 2 performs \u223c 30% better than S = 1, S = 5 performs only \u223c 1.3% better than S = 2 but the\ncomputation time is doubled. In addition, a large value (e.g., S = 10) will lead to very costly training.\nAs a result, we set S to 2 during training, which achieves a good balance between accuracy and\nef\ufb01ciency. We \ufb01x S to 1 during inference.\nDuring training, each suggested region is answered automatically according to the ground truth matte.\nSpeci\ufb01cally, we only provide two possible answers, i.e., foreground or background. 
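The reward of Eq. 7 is simply each candidate's RMSE reduction normalized by the best reduction among the S candidates; a minimal sketch (names ours; it assumes at least one candidate actually reduces the error):

```python
import numpy as np

def region_rewards(err_prev, errs_after):
    """Eq. 7: reward for each of the S candidate regions at iteration i.

    err_prev:   ||alpha_{i-1} - alpha_gt||, the matte error before this iteration.
    errs_after: one ||alpha_i^j - alpha_gt|| per candidate region j.
    The candidate with the largest error reduction gets reward 1; the others are
    scaled by their fraction of that best reduction.
    """
    reductions = err_prev - np.asarray(errs_after, dtype=float)
    return reductions / reductions.max()

# Two candidates: one cuts the error 0.10 -> 0.06, the other 0.10 -> 0.08.
print(region_rewards(0.10, [0.06, 0.08]))   # best candidate -> 1.0, other -> 0.5
```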
If a suggested\nregion contains unknown pixels, we simply skip that iteration and our model will not receive a reward.\nIn this way, we want our model to focus on the foreground/background informative regions only,\nwhich are easy for the user to label. In the test stage, our model behaves as a fully forward model,\nwhich actively suggests regions for user feedbacks.\n\n3 Experiments\n\n3.1 Experimental Setup\n\nExperiment Environment. Our active matting model is implemented using Tensor\ufb02ow [1], and\ntrained and tested on a PC with an i7-6700K CPU, 8GB RAM and an NVIDIA GTX 1080 GPU. The\ninput images are resized to 400 \u00d7 400. Generally, it takes about 0.35s for our model to suggest a\nregion after receiving the user feedback. We train our network from scratch with the truncated normal\ninitializer. The learning rate is set to 10\u22123 initially and then goes through an exponential decay to the\nminimum learning rate of 10\u22124. We also clip the gradients to prevent gradient explosion.\nExperiment Datasets. We have conducted our experiments on three challenging datasets: the\nportrait dataset [17], the matting benchmark [14], and our rendered-100 dataset.\n\n5\n\n\f(a) Evaluation on the matting benchmark [14]\n\n(b) Evaluation on the portrait dataset [17]\n\n(c) Evaluation on the Rendered-100 dataset\n\n(d) Our method applied with different matting algorithms\nFigure 3: Evaluation of the proposed method. (a)-(c) Comparison with the four baselines on three\ndatasets. (d) The learned informative knowledge is used with different matting algorithms. IFM\nrefers to [2], DCNN refers to [5], KNN refers to [4], ClosedForm refers to [11], and SharedMatting\nrefers to [7].\n\nThe portrait dataset contains 1,700 training images, 300 testing images, and their corresponding ground truth mattes. The\nmatting benchmark consists of 27 images with user-de\ufb01ned trimaps and ground truth mattes, and 8\nimages without trimaps or mattes. 
We use the portrait testing images and the 27 images of the matting\nbenchmark for evaluation.\nWe train our model using the training set of the portrait dataset. To avoid over\ufb01tting, we propose a\nrendered-100 dataset for \ufb01ne tuning, which has 100 images and their corresponding ground truth\nmattes. We use 90 images for \ufb01ne tuning with data augmentation, and 10 images for testing. To build\nthe rendered-100 dataset, we select different 3D models as foreground objects (e.g., bunny, hairball\nand metal sphere), and use natural images as backgrounds. In particular, we select foreground objects\nwith thin structures (e.g., furs and leaves), or with textures similar to the background, to simulate\nchallenging scenarios. We show sample images in our supplemental material. The complete rendered-100\ndataset (including rendered images, extracted foreground objects, and ground-truth alpha mattes) can\nbe found at 4.\nFour Baselines. We have constructed the following baselines in our experiments:\n\nB1: Since we sequentially generate the informative regions, and their effectiveness is measured by\nthe RMSE between the produced alpha matte and the ground truth matte, there must be an\nideal sequence that would produce the minimum RMSE (or maximum RMSE reduction)\nat each step. We exhaustively search every region in the image at each step. It takes a few\nhours per image to obtain this ideal sequence, which can be used to show the upper\nbound performance.\n\nB2: This baseline is to compute the informative regions without the recurrent mechanism. We\ntrain seven CNNs such that each of them computes 1, 5, 10, 15, 20, 25 or 30 regions\nsimultaneously. We refer to them as Region-CNNs. Each of them takes an image as input,\nencodes the extracted features to a 2 \u00d7 N vector using a fully-connected layer, and then\noutputs N coordinates representing N regions. 
They are all trained on the top-N ideal\nregions, where N=1, 5, 10, 15, 20, 25 or 30.\n\n4http://www.cs.cityu.edu.hk/~rynson/projects/matting/ActiveMatting.html\n\n6\n\n\fTable 1: RMSE comparison of different input strategies with different matting methods. (\u201cActive\u201d\nrefers to our sequence of 20 informative regions, while \u201cIdeal Sequence\u201d refers to the sequence of 20\nground truth regions.) We report the RMSEs of 8 example images and the average RMSEs for all 27\nimages from the matting benchmark [14]. The best results of different types of input are shown in\ngray background. The time costs (seconds) are shown below Trimap+IFM [2] and Active+IFM [2].\n\nMethod | Img 1 | Img 2 | Img 3 | Img 4 | Img 5 | Img 6 | Img 7 | Mean RMSE\nTrimap+IFM [2] | 0.094 | 0.046 | 0.021 | 0.018 | 0.051 | 0.017 | 0.022 | 0.031\n(time) | 181s | 209s | 195s | 192s | 215s | 187s | 205s | 204s\nIdeal Sequence+IFM [2] | 0.096 | 0.048 | 0.023 | 0.019 | 0.064 | 0.019 | 0.022 | 0.032\nTrimap+DCNN [5] | 0.117 | 0.049 | 0.021 | 0.019 | 0.079 | 0.019 | 0.021 | 0.033\nIdeal Sequence+DCNN [5] | 0.121 | 0.051 | 0.022 | 0.022 | 0.081 | 0.021 | 0.021 | 0.034\nActive+IFM [2] | 0.099 | 0.051 | 0.024 | 0.019 | 0.071 | 0.021 | 0.024 | 0.034\n(time) | 55s | 66s | 58s | 62s | 54s | 55s | 53s | 57s\nActive+DCNN [5] | 0.121 | 0.052 | 0.024 | 0.023 | 0.081 | 0.021 | 0.023 | 0.035\nTrimap+Shared [7] | 0.104 | 0.059 | 0.023 | 0.021 | 0.071 | 0.024 | 0.022 | 0.035\nActive+Shared [7] | 0.111 | 0.062 | 0.027 | 0.023 | 0.078 | 0.026 | 0.024 | 0.039\nTrimap+KNN [4] | 0.097 | 0.053 | 0.029 | 0.031 | 0.066 | 0.022 | 0.037 | 0.043\nTrimap+Global [9] | 0.111 | 0.059 | 0.024 | 0.023 | 0.072 | 0.018 | 0.024 | 0.045\nActive+KNN [4] | 0.102 | 0.058 | 0.034 | 0.035 | 0.074 | 0.023 | 0.044 | 0.049\nTrimap+Learning [23] | 0.126 | 0.107 | 0.028 | 0.015 | 0.084 | 0.022 | 0.015 | 0.051\nTrimap+ClosedForm [11] | 0.136 | 0.117 | 0.029 | 0.016 | 0.097 | 0.023 | 0.016 | 0.056\n\nB3: We develop this baseline based on the clustering strategy. The input image is divided into 5, 10,\n15, 20, 25 and 30 clusters based on the pre-de\ufb01ned features suggested in [16]. 
After that, the\ncenter grids of each cluster are proposed for user labeling.\nB4: We use a randomly generated sequence as another baseline.\n\n3.2 Comparison with Baselines\n\nFigure 3(a-c) compares the proposed method with the four baselines in terms of root mean squared\nerror (RMSE). Results show that the proposed method has comparable performance to the ideal\nbaseline B1 on all three datasets, particularly after 15 iterations. Although the proposed method is\ntrained on extremely unbalanced datasets (1,700 portrait and 90 rendered images), it still shows\ngood generalization to natural images on the matting benchmark.\nWe also validate the proposed method against the two baselines that directly generate multiple informative\nregions. The \ufb01rst uses CNNs to automatically learn the features for predicting regions (B2) and the\nsecond uses hand-crafted features with a clustering strategy (B3). Figure 3(a-c) shows that our model\nwith sequential learning performs better than these two baselines. This is because generating multiple\ninformative regions simultaneously does not consider the dependency between informative regions\nand user feedbacks. While some regions may be informative by themselves, they may provide similar\ninformation to the matting solver. Sequentially generating regions produces complementary results.\nFinally, we can see that the learned informative knowledge outperforms the random sequence (B4).\n\n3.3 Comparison with Trimaps\n\nHere, we show the comparison between using a \ufb01ne trimap and using 20 informative regions as input.\nDifferent matting algorithms are fed with a trimap or our informative regions. The ideal sequence is\nalso shown for reference.\nIn order to have a fair comparison with trimap generation in terms of interaction time, we asked\n10 users to generate a trimap for each image from scratch. 
Speci\ufb01cally, users were \ufb01rst asked to\ndraw scribbles to indicate foreground/background and a bounding box of the foreground object.\nGrabcut [15] was then used to generate a trimap based on the user input. Regions in the trimap with\nlow con\ufb01dence were set to unknown regions. Users were able to iteratively re\ufb01ne the input trimap\nbased on the alpha matting result. This process stopped when the users were satis\ufb01ed with the output\nmatte. The time taken for the entire process was recorded, including drawing, Grabcut computation,\nmatting computation and re\ufb01nement. The average RMSEs and times are reported.\n\n7\n\n\fFigure 4: Step-wise comparison between our active model and user scribbles. In each step, the left\ncolumn is generated by our active matting, the middle and right columns are the scribbles drawn by\nthe experienced and inexperienced users, respectively. We report the labeling time (in seconds) and\nRMSE for each iteration.\n\nTable 1 shows the comparison of different input strategies (i.e., trimaps, ideal sequence, and the\nproposed sequence). We report 8 images from the matting benchmark [14] as well as the average\nperformance on the whole matting benchmark (27 images). The interaction time cost is reported\nbelow the best methods of the trimap input and active matting. Note that the proposed method\nhas not been trained on these images. Results show that the proposed 20 informative regions yield\ncomparable matting performance to a \ufb01ne-labelled trimap in terms of quality. On the other hand,\ngenerating a \ufb01ne trimap costs users considerable effort, and it takes about 3 minutes to obtain a good alpha\nmatte. Instead, our active model is free from trimaps, and it takes around 1 minute from feeding in an\nimage to obtaining an alpha matte. 
We show that the proposed informative regions can achieve\ncomparable results to the trimap-based method with less labeling effort.\nFurthermore, although we use shared matting [7] for computing mattes during training, results\nshow that our informative regions are general across different matting algorithms. As shown in\nFigure 3(d), the RMSEs drop signi\ufb01cantly with the \ufb01rst 5 informative regions for all the methods,\nand the subsequent regions gradually re\ufb01ne the resulting alpha mattes. This result implies that our\ninformative regions are independent of the matting algorithms used and can be used as an image\nfeature for other applications.\n\n8\n\n\f(a) Image\n\n(b) GT Matte\n\n(c) Suggested Regions\n\n(d) Matte from (c)\n\nFigure 5: A failure case. The input image in (a) contains thin structures that are smaller than the size\nof the suggested regions (red boxes in (c)), resulting in an unsatisfactory matte as shown in (d).\n\n3.4 Comparison with Scribbles\n\nWe conduct a stepwise comparison with user scribbles. As shown in Figure 4, the active model\nproposes 10 regions to a user (one with little knowledge about matting) for labeling, and two users\n(one experienced user familiar with matting and one inexperienced user) are asked to draw 10\nscribbles on the image for comparison. The results of each step and the interaction time are shown.\nAt the beginning (step 1), the proposed active model can accurately \ufb01nd the right region such that the\nmatting algorithm [7] can distinguish the foreground and background layers by color distributions.\nIn contrast, neither user\u2019s scribbles are suf\ufb01cient to generate the mattes. Moreover,\nthe proposed model can generate satisfactory results within 5 steps, while the results from scribbles\nstill suffer from background-foreground separation. 
Our resulting matte is further re\ufb01ned in the\nsubsequent 5 steps and achieves a better alpha matte with less labeling time than using scribbles.\n\n4 Conclusion and Limitation\n\nIn this paper, we propose a novel active model for matting. Our model can actively \ufb01nd and propose\ninformative regions for users to label as the foreground or background layer. After several question-and-answer\niterations, a high quality matte can be obtained with minimal human effort. We integrate\nthe idea of informative regions into the matting process by formulating it as a reinforcement learning\nproblem. The proposed informative regions are general across different matting algorithms.\nThe limitation of our active matting model is that it does not perform well on thin structures. As\nshown in Figure 5, the foreground object contains thin structures that are smaller than the suggested\nregions, leading to an unsatisfactory result. A possible solution to this problem is to incorporate\nscribbles, in addition to suggested regions, for labeling \ufb01ne details. We leave this as future work.\nFor further research, we plan to extend our model to a real-time scribbling-based system, i.e., providing\nsuggestions to the user directly on the image with the probability of being informative in real time.\nThis provides meaningful guidance for users, allowing them to directly draw scribbles on the most\ninformative regions. On the other hand, we plan to develop a differentiable model using convolutional\nneural networks, instead of existing matting solvers, to fully exploit the information obtained from\nthe proposed informative regions.\n\n5 Acknowledgements\n\nWe thank the anonymous reviewers for their insightful and constructive comments, and NVIDIA for\nthe generous donation of GPU cards for our experiments. This work is in part supported by an SRG grant\nfrom City University of Hong Kong (Ref. 7004889), NSFC grants from the National Natural Science\nFoundation of China (Ref. 
91748104, 61632006, 61425002, 61702194), and the Guangzhou Key Industrial Technology Research fund (No. 201802010036).