{"title": "Learning to Agglomerate Superpixel Hierarchies", "book": "Advances in Neural Information Processing Systems", "page_first": 648, "page_last": 656, "abstract": "An agglomerative clustering algorithm merges the most similar pair of clusters at every iteration. The function that evaluates similarity is traditionally hand- designed, but there has been recent interest in supervised or semisupervised settings in which ground-truth clustered data is available for training. Here we show how to train a similarity function by regarding it as the action-value function of a reinforcement learning problem. We apply this general method to segment images by clustering superpixels, an application that we call Learning to Agglomerate Superpixel Hierarchies (LASH). When applied to a challenging dataset of brain images from serial electron microscopy, LASH dramatically improved segmentation accuracy when clustering supervoxels generated by state of the boundary detection algorithms. The naive strategy of directly training only supervoxel similarities and applying single linkage clustering produced less improvement.", "full_text": "Learning to Agglomerate Superpixel Hierarchies\n\nViren Jain\n\nJanelia Farm Research Campus\nHoward Hughes Medical Institute\n\nSrinivas C. Turaga\n\nBrain & Cognitive Sciences\n\nMassachusetts Institute of Technology\n\nKevin L. Briggman, Moritz N. Helmstaedter, Winfried Denk\n\nDepartment of Biomedical Optics\n\nMax Planck Institute for Medical Research\n\nH. Sebastian Seung\n\nHoward Hughes Medical Institute\n\nMassachusetts Institute of Technology\n\nAbstract\n\nAn agglomerative clustering algorithm merges the most similar pair of clusters\nat every iteration. The function that evaluates similarity is traditionally hand-\ndesigned, but there has been recent interest in supervised or semisupervised set-\ntings in which ground-truth clustered data is available for training. 
Here we show how to train a similarity function by regarding it as the action-value function of a reinforcement learning problem. We apply this general method to segment images by clustering superpixels, an application that we call Learning to Agglomerate Superpixel Hierarchies (LASH). When applied to a challenging dataset of brain images from serial electron microscopy, LASH dramatically improved segmentation accuracy when clustering supervoxels generated by state-of-the-art boundary detection algorithms. The naive strategy of directly training only supervoxel similarities and applying single linkage clustering produced less improvement.\n\n1 Introduction\n\nA clustering is defined as a partitioning of a set of elements into subsets called clusters. Roughly speaking, similar elements should belong to the same cluster and dissimilar ones to different clusters. In the traditional unsupervised formulation of clustering, the true membership of elements in clusters is completely unknown. Recently there has been interest in the supervised or semisupervised setting [5], in which true membership is known for some elements and can serve as training data. The goal is to learn a clustering algorithm that generalizes to new elements and new clusters. A convenient objective function for learning is the agreement between the output of the algorithm and the true clustering, for which the standard measure is the Rand index [25].\nClustering is relevant for many application domains. One prominent example is image segmentation, the division of an image into clusters of pixels that correspond to distinct objects in the scene. Traditional approaches treated image segmentation as unsupervised clustering. However, it is becoming popular to utilize a supervised clustering approach in which a segmentation algorithm is trained on a set of images for which ground truth is known [23, 32]. 
The Rand index has become increasingly popular for evaluating the accuracy of image segmentation [34, 3, 13, 15, 35], and has recently been used as an objective function for supervised learning of this task [32].\nThis paper focuses on agglomerative algorithms for clustering, which iteratively merge pairs of clusters that maximize a similarity function. Equivalently, the merged pairs may be those that minimize a distance or dissimilarity function, which is like a similarity function up to a change of sign. Speed is a chief advantage of agglomerative algorithms. The number of evaluations of the similarity function is polynomial in the number of elements to be clustered. In contrast, the popular approach of using a Markov random field to partition a graph, with nodes that are the elements to be clustered and edge weights given by their similarities, involves a computation that can be NP-hard [18].\nInefficient inference becomes even more costly for learning, which generally involves many iterations of inference. To deal with this problem, many researchers have developed learning methods for graphical models that depend on efficient approximate inference. However, once such approximations are introduced, many of the desirable theoretical properties of this framework no longer apply and performance in practice may be arbitrarily poor, as several authors have recently noted [36, 19, 8]. Here we avoid such issues by basing learning on agglomerative clustering, which is an efficient inference procedure in the first place.\nWe show that an agglomerative clustering algorithm can be regarded as a policy for a deterministic Markov decision process (DMDP) in which a state is a clustering, an action is a merging of two clusters, and the immediate reward is the change in the Rand index with respect to the ground-truth clustering. 
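The Rand-index reward just described can be made concrete in a few lines. The following is a minimal sketch, not the paper's implementation: clusterings are stored as flat label lists, and all names are illustrative.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of element pairs on which two clusterings agree
    (same-cluster vs. different-cluster)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                for i, j in pairs)
    return agree / len(pairs)

def merge_reward(labels, truth, u, v):
    """Immediate reward of merging clusters u and v: the resulting
    change in Rand index with respect to the ground truth."""
    merged = [u if l == v else l for l in labels]
    return rand_index(merged, truth) - rand_index(labels, truth)
```

Merging two elements that the ground truth keeps together yields a positive reward; merging across a ground-truth boundary yields a negative one.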
In this formulation, the optimal action-value function turns out to be the optimal similarity function for agglomerative clustering. This DMDP formulation is helpful because it enables the application of ideas from reinforcement learning (RL) to find an approximation to the optimal similarity function.\nOur formalism is generally applicable to any type of clustering, but is illustrated with a specific application to segmenting images by clustering superpixels. These are defined as groups of pixels from an oversegmentation produced by some other algorithm [27]. Recent research has shown that agglomerating superpixels using a hand-designed similarity function can improve segmentation accuracy [3]. It is plausible that it would be even more powerful to learn the similarity function from training data. Here we apply our RL framework to accomplish this, yielding a new method called Learning to Agglomerate Superpixel Hierarchies (LASH). LASH works by iteratively updating an approximation to the optimal similarity function. It uses the current approximation to generate a sequence of clusterings, and then improves the approximation on all possible actions on these clusterings.\nLASH is an instance of a strategy called on-policy control in RL. This strategy has seen many empirical successes, but the theoretical guarantees are rather limited. Furthermore, LASH is implemented here for simplicity using infinite temporal discounting, though it could be extended to the case of finite discounting. Therefore we empirically evaluated LASH on the problem of segmenting images of brain tissue from serial electron microscopy, which has recently attracted a great deal of interest [6, 15]. 
We find that LASH substantially improves upon state-of-the-art convolutional network and random forest boundary-detection methods for this problem, reducing segmentation error (as measured by the Rand error) by 50% compared to the next best technique.\nWe also tried the simpler strategy of directly training superpixel similarities, and then applying single linkage clustering [2]. This produced less accurate test-set segmentations than LASH.\n\n2 Agglomerative clustering as reinforcement learning\n\nA Markov decision process (MDP) is defined by a state s, a set of actions A(s) at each state, a function P(s, a, s′) specifying the probability of the s → s′ transition after taking action a ∈ A(s), and a function R(s, a, s′) specifying the immediate reward. A policy π is a map from states to actions, a = π(s). The goal of reinforcement learning (RL) is to find a policy π that maximizes the expected value of total reward.\nTotal reward is defined as the sum of immediate rewards ∑_{t=0}^{T−1} R(s_t, a_t) up to some time horizon T. Alternatively, it is defined as the sum of discounted immediate rewards ∑_{t=0}^{∞} γ^t R(s_t, a_t), where 0 ≤ γ ≤ 1 is the discount factor. Many RL methods are based on finding an optimal action-value function Q∗(s, a), which is defined as the sum of discounted rewards obtained by taking action a at state s and following the optimal policy thereafter. An optimal policy can be extracted from this function by π∗(s) = argmax_a Q∗(s, a).\nWe can define agglomerative clustering as an MDP. Its state s is a clustering of a set of objects. For each pair of clusters in s_t, there is an action a_t ∈ A(s_t) that merges them to yield the clustering s_{t+1} = a_t(s_t). Since the merge action is deterministic, we have the special case of a deterministic MDP, rather than a stochastic one. 
To define the rewards of the MDP, we make use of the Rand index, a standard measure of agreement between two clusterings of the same set [25]. A clustering is equivalent to classifying all pairs of objects as belonging to the same cluster or different clusters. The Rand index RI(s, s′) is the fraction of object pairs on which the clusterings s and s′ agree. Therefore, we can define the immediate reward of action a as the resulting increase in the Rand index with respect to a ground-truth clustering s∗: R(s, a) = RI(a(s), s∗) − RI(s, s∗).\nAn agglomerative clustering algorithm is a policy of this MDP, and the optimal similarity function is given by the optimal action-value function Q∗. The sum of undiscounted immediate rewards "telescopes" to the simple result ∑_{t=0}^{T−1} R(s_t, a_t) = RI(s_T, s∗) − RI(s_0, s∗) [21]. Therefore RL for a finite time horizon T is equivalent to maximizing the Rand index RI(s_T, s∗) of the clustering at time T.\nWe will focus on the simple case of infinite discounting (γ = 0). Then the optimal action-value function Q∗(s, a) is equal to R(s, a). In other words, R(s, a) is the best similarity function. We know R(s, a) exactly for the training data, but we would also like it to apply to data for which ground truth is unknown. Therefore we train a function approximator Q_θ so that Q_θ(s, a) ≈ R(s, a) on the training data, and hope that it generalizes to the test data. The following procedure is a simple way of doing this.\n\n1. Generate an initial sequence of clusterings (s_1, . . . , s_T) by using R(s, a) as a similarity function: iterate a_t = argmax_a R(s_t, a) and s_{t+1} = a_t(s_t), terminating when max_a R(s_t, a) ≤ 0.\n\n2. Train the parameters θ so that Q_θ(s_t, a) ≈ R(s_t, a) for all s_t and for all a ∈ A(s_t).\n\n3. 
Generate a new sequence of clusterings by using Q_θ(s, a) as a similarity function: iterate a_t = argmax_a Q_θ(s_t, a) and s_{t+1} = a_t(s_t), terminating when max_a Q_θ(s_t, a) ≤ 0.\n\n4. Go to step 2.\n\nHere the clustering s_1 is the trivial one in which each element is its own cluster. (The termination of the clustering is equivalent to the continued selection of a "do-nothing" action that leaves the clustering the same, s_{t+1} = s_t.) This is an example of "on-policy" learning, because the function approximator Q_θ is trained on clusterings generated by using it as a policy. It makes intuitive sense to optimize Q_θ for the kinds of clusterings that it actually sees in practice, rather than for all possible clusterings. However, there is no theoretical guarantee that such on-policy learning will converge, since we are using a nonlinear function approximation. Guarantees exist only if the action-value function is represented by a lookup table or a linear approximation. Nevertheless, the nonlinear approach has achieved practical success in a number of problem domains. Later we will present empirical results supporting the effectiveness of on-policy learning in our application.\nThe assumption of infinite discounting removes a major challenge of RL: dealing with temporally delayed reward. Are we losing anything by this assumption? If our approximation to the action-value function were perfect, Q_θ(s, a) = R(s, a), then agglomerative clustering would amount to greedy maximization of the Rand index. It is straightforward to show that this yields the clustering that globally maximizes the Rand index. In practice, the approximation will be imperfect, and extending the above procedure to finite discounting could be helpful.\n\n3 Agglomerating superpixels for image segmentation\n\nThe introduction of the Berkeley segmentation database (BSD) provoked a renaissance of the boundary detection and segmentation literature. 
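The greedy policy at the core of the Section 2 procedure can be sketched compactly. This is an illustrative toy, not the paper's implementation: a clustering is a tuple of frozensets, and `similarity` stands in for either R(s, a) (step 1) or the learned Q_θ(s, a) (step 3).

```python
from itertools import combinations

def merge(state, pair):
    """Deterministic transition a(s): replace the two clusters in
    `pair` with their union."""
    a, b = pair
    return tuple(c for c in state if c is not a and c is not b) + (a | b,)

def agglomerate(elements, similarity):
    """Greedy policy: repeatedly take the highest-scoring merge,
    terminating when every candidate merge scores <= 0."""
    state = tuple(frozenset([e]) for e in elements)
    while len(state) > 1:
        scored = [(similarity(state, p), p) for p in combinations(state, 2)]
        best_score, best_pair = max(scored, key=lambda t: t[0])
        if best_score <= 0:  # the "do-nothing" action is now optimal
            break
        state = merge(state, best_pair)
    return state
```

One round of training then fits Q_θ to R(s, a) on every state this loop visits, and the next round re-runs `agglomerate` with the updated Q_θ in place of R.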
The creation of a ground-truth segmentation database enabled learning-driven methods for low-level boundary detection, which were found to outperform classic methods such as Canny's [23, 10]. Global and multi-scale features were added to improve performance even further [26, 22, 29], and recently learning methods have been developed that directly optimize measures of segmentation performance [32, 13].\n\n[Figure 1 here: precision-recall curves of connected pixel-pairs (recall 0.86 to 1.0, precision 0.9 to 1.0) for Standard CN, ilastik, BLOTC CN, MALIS CN, Single Linkage, and LASH, alongside the following table.]\n\nMethod            Rand Error   Pair Recall   Pair Precision\nBaseline          .0499        0             n/a\nCN [14, 33]       .0084        91.85         91.33\nilastik [31]      .0064        87.85         99.31\nBLOTC CN [13]     .0056        94.32         94.57\nMALIS CN [32]     .0055        94.97         93.41\nSingle Linkage    .0049        96.48         93.78\nLASH              .0029        94.83         99.30\n\nFigure 1: Performance comparison on a one megavoxel test set; parameters, such as the binarization threshold for the convolutional network (CN) affinity graphs, were determined based on the optimal value on the training set. 
CNs used a field of view of 16 × 16 × 16, ilastik used a field of view of 23 × 23 × 23, and LASH used a field of view of 50 × 50 × 50. LASH leads to a substantial decrease in Rand error (1 − Rand index), and much higher connected pixel-pair precision at similar levels of recall as compared to other state-of-the-art methods. The 'Connected pixel-pairs' curve measures the accuracy of connected pixel pairs relative to ground truth. This measure corrects for the imbalance in the Rand error for segmentations in which most pixels are disconnected from one another, as in the case of EM reconstruction of dense brain wiring. For example, 'Trivial Baseline' above represents the trivial segmentation in which all pixels are disconnected from one another, and achieves relatively low Rand error but of course zero connected-pair recall.\n\nHowever, boundary detectors alone have so far failed to produce segmentations that rival human levels of accuracy. Therefore many recent studies use boundary detectors to generate an oversegmentation of the image into fragments, and then attempt to cluster the resulting "superpixels". This approach has been shown to improve the accuracy of segmenting natural images [3, 30].\nA similar approach [2, 1, 17, 35, 16] has also been employed to segment 3d nanoscale images from serial electron microscopy [11, 9]. In principle, it should be possible to map the connections between neurons by analyzing these images [20, 12, 28]. Since this analysis is highly laborious, it would be desirable to have automated computer algorithms for doing so [15]. First, each synapse must be identified. 
Second, the "wires" of the brain, its axons and dendrites, must be traced, i.e., segmented. If these two tasks are solved, it is then possible to establish which pairs of neurons are connected by synapses.\nFor our experiments, images of rabbit retina inner plexiform layer were acquired using Serial Block Face Scanning Electron Microscopy (SBF-SEM) [9, 4]. The tissue was specially stained to enhance cell boundaries while suppressing contrast from intracellular structures (e.g., mitochondria). The image volume was acquired at 22 × 22 × 25 nm resolution, yielding a nearly isotropic 3d dataset with excellent slice-to-slice registration. Two training sets were created by human tracing and proofreading of subsets of the 3d image. The training sets were augmented with their eight 3d orthogonal rotations and reflections to yield 16 training images that contained roughly 80 megavoxels of labeled training data. A separate one megavoxel labeled test set was used to evaluate algorithm performance.\n\n3.1 Boundary Detectors\n\nFor comparison purposes, as well as to provide supervoxels for LASH, we tested several state-of-the-art boundary detection algorithms on the data. A convolutional network (CN) was trained to produce affinity graphs that can be segmented using connected components or watershed [14, 33]. We also trained CNs using MALIS and BLOTC, which are recently proposed machine learning algorithms that optimize true metrics of segmentation performance. MALIS directly optimizes the Rand index [32]. 
BLOTC, originally introduced for 2d boundary maps and here generalized to 3d affinity graphs, optimizes 'warping error,' a measure of topological disagreement derived from concepts introduced in digital topology [13].\n\n[Figure 2 here: four panels (SBF-SEM Z-X reslice, human labeling, BLOTC CN, LASH) and a bar chart of % of volume occupied vs. supervoxel size range (0 to 100, 100 to 1,000, 1,000 to 1e4, 1e4 to 1e5, more than 1e5 voxels).]\n\nFigure 2: (Left) Visual comparison of output from a state-of-the-art boundary detector, BLOTC CN [13], and Learning to Agglomerate Superpixel Hierarchies (LASH). Image and segmentations are from a Z-X axis resectioning of the 100 × 100 × 100 voxel test set. Segmentations were performed in 3d though only a single 2d 100 × 100 reslice is shown here. White circle shows an example location in which BLOTC CN merged two separate objects due to weak staining in an adjacent image slice; orange ellipse shows an example location in which BLOTC CN split up a single thin object. LASH avoids both of these errors. (Right) Distribution of supervoxel sizes, as measured by percentage of image volume occupied by specific size ranges of supervoxels.\n\nFinally, we trained 'ilastik,' a random-forest based boundary detector [31]. Unlike the CNs, which operated on the raw image and learned features as part of the training process, ilastik uses a predefined set of image features that represent low-level image structure such as intensity gradients and texture. 
The CNs used a field of view of 16 × 16 × 16 voxels to make decisions about any particular image location, while ilastik used features from a field of view of up to 23 × 23 × 23 voxels.\nTo generate segmentations of the test set, we found connected components of the thresholded boundary detector output, and then performed marker-based watershed to grow out these regions until they touched. Figure 1 shows the Rand index attained by the CNs and ilastik. Here we convert the index into an error measure by subtracting it from 1. Segmentation performance is sensitive to the threshold used to binarize boundary detector output, so we used the threshold that minimized Rand error on the training set.\n\n3.2 Supervoxel Agglomeration\n\nSupervoxels were generated from BLOTC convolutional network output, using connected components applied at a high threshold (0.9) to avoid undersegmented regions (in the test set, there was only one supervoxel in the initial oversegmentation which contained more than one ground-truth region). Regions were then grown out using marker-based watershed. The size of the supervoxels varied considerably, but the majority of the image volume was assigned to supervoxels larger than 1,000 voxels in size (as shown in Figure 2, right).\nFor each pair of neighboring supervoxels, we computed a 138-dimensional feature vector, as described in the Appendix. This was used as input to the learned similarity function Q_θ, which we represented by a decision-tree boosting classifier [7]. We followed the procedure given in Section 2, but with two modifications. First, the examples used in each training iteration were collected by segmenting all the images in the training set, not only a single image. Second, Q_θ was trained to approximate H(R(s_t, a)) rather than R(s_t, a), where H is the Heaviside step function, and the log-loss was optimized. 
This was done because our function approximator was suitable for classification, but some other approximator suitable for regression could also be used. The loop in the procedure of Section 2 was terminated when training error stopped decreasing by a significant amount, after 3 cycles. Then the learned similarity function was applied to agglomerate supervoxels in the test set to yield the results in Figure 1. The agglomeration terminated after around 5000 steps.\nThe results show a substantial decrease in Rand error compared to state-of-the-art techniques (MALIS and BLOTC CN). A potential subtlety in interpreting these results is the small absolute values of the Rand error for all of these techniques. The Rand error is defined as the probability of incorrectly classifying pairs of voxels as belonging to the same or different clusters. This classification task is highly imbalanced, because the vast majority of voxel pairs belong to different ground-truth clusters. Hence even a completely trivial segmentation in which every voxel is its own cluster can achieve fairly low Rand error (Figure 1). Precision and recall are better quantifications of performance for imbalanced classification [23]. Figure 1 shows that LASH achieves much higher precision at similar recall. For the task of segmenting neurons in EM images, high precision is especially important, as false positives can lead to false-positive neuron-pair connections.\nVisual comparison of segmentation performance is shown in Figure 2. LASH avoids both split and merge errors that result from segmenting BLOTC CN output. 
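The connected-pair precision and recall reported in Figure 1 follow directly from the pair-classification view of a segmentation. A minimal sketch over flat label lists (illustrative names, not the paper's evaluation code):

```python
from itertools import combinations

def pair_precision_recall(pred, truth):
    """Precision and recall over pairs of elements predicted to be
    connected (same cluster), relative to the ground truth."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(pred)), 2):
        same_pred = pred[i] == pred[j]
        same_true = truth[i] == truth[j]
        tp += same_pred and same_true        # correctly connected pair
        fp += same_pred and not same_true    # false merge
        fn += (not same_pred) and same_true  # false split
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Unlike the Rand error, these scores ignore the overwhelming number of correctly disconnected pairs, which is why the trivial all-disconnected segmentation gets zero recall despite its low Rand error.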
BLOTC CN in turn was previously shown to outperform other techniques such as Boosted Edge Learning, multi-scale normalized cut, and gPb-OWT-UCM [13].\n\n3.3 Naive training of the similarity function on superpixel pairs\n\nIn the conventional algorithms for agglomerative clustering, the similarity S(A, B) of two clusters A and B can be reduced to the similarities S(x, y) of elements x ∈ A and y ∈ B. For example, single linkage clustering assumes that S(A, B) = max_{x∈A, y∈B} S(x, y). The maximum operation is replaced by the minimum or average in other common algorithms. LASH does not impose any such constraint of reducibility on the similarity function. Consequently, LASH must truly compute new similarities after each agglomerative step. In contrast, conventional algorithms can start by computing the matrix of similarities between the elements to be clustered, and all further similarities between clusters follow from trivial computations.\nTherefore another method of learning agglomerative clustering is to train a similarity function on pairs of superpixels only, and then apply a standard agglomerative algorithm such as single linkage clustering. This has previously been done for images from serial electron microscopy [2]. (Note that single linkage clustering is equivalent to creating a graph in which nodes are superpixels and edge weights are their similarities, and then finding the connected components of the thresholded graph.) As shown in Figure 1, clustering superpixels in this way improves upon boundary detection algorithms. However, the improvement is substantially less than that achieved by LASH.\n\nDiscussion\n\nWhy did LASH achieve better accuracy than other approaches? One might argue that the comparison is unfair, because the CNs and ilastik detected boundaries using a field of view considerably smaller than that used in the LASH features (up to 50 × 50 × 50 for the SVF feature computation). 
If these competing methods were allowed to use the same context, perhaps their accuracy would improve dramatically. This is possible, but training time would also increase dramatically. Training a CN with MALIS or BLOTC on 80 megavoxels of training data with a 16³ field of view already takes on the order of a week, using an optimized GPU implementation [24]. Adding the additional layers to the CN required to achieve a field of view of 50³ might require months of additional training.¹ In contrast, the entire LASH training process is completed within roughly one day. This can be attributed to the efficiency gains associated with computations on supervoxels rather than voxels. In short, LASH is more accurate because it is efficient enough to utilize more image context in its computations.\n\n¹Using a much larger field of view with a CN will likely require new architectures that incorporate multi-scale capabilities.\n\nFigure 3: Example of SVF feature computation. Blue and red are two different supervoxels. Left panel shows a rendering of the objects; right panel shows smoothed vector fields (thin arrows), along with chosen center-of-mass orientation vectors (thick blue/red lines) and the line connecting the two centers of mass (thick green line). The angle between the thick blue/red and green lines is used as a feature during LASH.\n\nWhy does LASH outperform the naive method of directly training superpixel similarities used in single linkage clustering? The naive method uses the same amount of image context. In this case, LASH is probably superior because it trains the similarities by optimizing the clustering that they actually yield. 
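The equivalence noted in Section 3.3, between the naive method's single linkage step and connected components of the thresholded similarity graph, can be sketched with a union-find pass. Illustrative names; `similarities` is assumed to map superpixel-index pairs to learned scores:

```python
def single_linkage_components(n, similarities, threshold):
    """Cluster n superpixels by keeping every edge whose similarity
    meets `threshold` and returning connected-component labels."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for (i, j), score in similarities.items():
        if score >= threshold:
            parent[find(i)] = find(j)  # union the two components
    return [find(i) for i in range(n)]
```

Because cluster-level similarities never need recomputation here, one pass over the precomputed pairwise scores suffices, in contrast to LASH, which re-evaluates Q_θ after every merge.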
The naive method resembles LASH, but with the modification that the action-value function is trained only for the actions possible on the clustering s_1, rather than on the entire sequence of clusterings (see Step 2 of the procedure in Section 2).\nWe have conceptualized LASH in the framework of reinforcement learning. Previous work has applied reinforcement learning to other structured prediction problems [21]. An additional closely related approach to structured prediction is SEARN, introduced by Daumé et al. [8]. As in our approach, SEARN uses a single classifier repeatedly on a (structured) input to iteratively solve an inference problem. The major difference between our approach and theirs is the way the classifier is trained. In particular, SEARN begins with a manually specified policy (given by ground truth or heuristics) and then iteratively degrades the policy as a classifier is trained and 'replaces' the initial policy. In our approach, the initial policy may exhibit poor performance (i.e., for random initial θ), and then improves through training.\nWe have implemented LASH with infinite discounting of future rewards, but extending to finite discounting might produce better results. Generalizing the action space to include splitting of clusters as well as agglomeration might also be advantageous. Finally, the objective function optimized by learning might be tailored to better reflect task-specific criteria, such as the number of locations that a human might have to correct ('proofread') to yield an error-free segmentation by semiautomated means. These directions will be explored in future work.\n\nAppendix\n\nFeatures of supervoxel pairs used by the similarity function\n\nThe similarity function that we trained with LASH required as input a set of features for each supervoxel pair that might be merged. 
For each supervoxel pair, we first computed a 'decision point,' defined as the midpoint of the shortest line that connects any two points of the supervoxels. From this decision point, we computed several types of features that encode information about the underlying affinity graph as well as the shape of the supervoxel objects near the decision point: (1) size of each supervoxel in the pair, (2) distance between the two supervoxels, (3) analog affinity value of the graph edge at which the two supervoxels would merge if grown out using watershed, and the distance from the decision point to this edge, (4) 'Smoothed Vector Field' (SVF), a novel shape feature described below, computed at various spatial scales (maximum 50 × 50 × 50). This feature measures the orientation of each supervoxel near the decision point.\nFinally, for each supervoxel in the pair we also included the above features for the closest 4 other decision points that involve that supervoxel. Overall, this feature set yielded a 138-dimensional feature vector for each supervoxel pair.\nThe smoothed vector field (SVF) shape feature attempts to determine the orientation of a supervoxel near some specific location (e.g., the decision point used in reference to some other supervoxel). The main challenge in computing such an orientation is dealing with high-frequency noise and irregularities in the precise shape of the supervoxel. We developed a novel approach that deals with this issue by smoothing a vector field derived from image moments. For a binary 3d image, SVF is computed in the following manner:\n\n1. A spherical mask of radius 5 is selected around each image location I_{x,y,z}, and v_{x,y,z} is then computed as the largest eigenvector of the 3 × 3 second-order image moment matrix for that window.\n\n2. 
The vector field is smoothed via 3 iterations of 'Ising-like' interactions among nearest-neighbor vector fields: v_{x,y,z} ← f(∑_{i=x−1}^{x+1} ∑_{j=y−1}^{y+1} ∑_{k=z−1}^{z+1} v_{i,j,k}), where f represents a (non-linear) renormalization such that the magnitude of each vector remains 1.\n\n3. The smoothed vector at the center of mass of the supervoxel is used to compute the angular orientation of the supervoxel (see Figure 3).\n\nReferences\n\n[1] B. Andres, J. H. Kappes, U. Köthe, C. Schnörr, and F. A. Hamprecht. An empirical comparison of inference algorithms for graphical models with higher order factors using OpenGM. In M. Goesele, S. Roth, A. Kuijper, B. Schiele, and K. Schindler, editors, Pattern Recognition, volume 6376 of Lecture Notes in Computer Science, pages 353–362. Springer, 2010.\n\n[2] B. Andres, U. Koethe, M. Helmstaedter, W. Denk, and F. Hamprecht. Segmentation of SBFSEM volume data of neural tissue by hierarchical classification. In Proceedings of the 30th DAGM Symposium on Pattern Recognition, pages 142–152. Springer, 2008.\n\n[3] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. From contours to regions: An empirical evaluation. Computer Vision and Pattern Recognition, IEEE Computer Society Conference on, 0:2294–2301, 2009.\n\n[4] K. L. Briggman and W. Denk. Towards neural circuit reconstruction with volume electron microscopy techniques. Current Opinion in Neurobiology, 16(5):562–570, 2006.\n\n[5] O. Chapelle, B. Schölkopf, and A. Zien. Semi-Supervised Learning. The MIT Press, 2010.\n\n[6] D. Chklovskii, S. Vitaladevuni, and L. Scheffer. Semi-automated reconstruction of neural circuits using electron microscopy. Current Opinion in Neurobiology, 2010.\n\n[7] M. Collins, R. Schapire, and Y. Singer. Logistic regression, AdaBoost and Bregman distances. Machine Learning, 48(1):253–285, 2002.\n\n[8] H. Daumé, III, J. 
Langford, and D. Marcu. Search-based structured prediction. Machine Learning,\n\n75:297\u2013325, 2009. 10.1007/s10994-009-5106-x.\n\n[9] W. Denk and H. Horstmann. Serial block-face scanning electron microscopy to reconstruct three-\n\ndimensional tissue nanostructure. PLoS Biol, 2(11):e329, 2004.\n\n[10] P. Dollar, Z. Tu, and S. Belongie. Supervised learning of edges and object boundaries. Computer Vision\n\nand Pattern Recognition, IEEE Computer Society Conference on, 2:1964\u20131971, 2006.\n\n[11] K. J. Hayworth, N. Kasthuri, R. Schalek, and J. W. Lichtman. Automating the collection of ultrathin serial\n\nsections for large volume TEM reconstructions. Microscopy and Microanalysis, 12(S02):86\u201387, 2006.\n\n[12] M. Helmstaedter, K. L. Briggman, and W. Denk. 3D structural imaging of the brain with photons and\n\nelectrons. Current Opinion in Neurobiology, 18(6):633\u2013641, 2008.\n\n[13] V. Jain, B. Bollmann, M. Richardson, D. Berger, M. Helmstaedter, K. Briggman, W. Denk, J. Bowden,\nJ. Mendenhall, W. Abraham, K. Harris, N. Kasthuri, K. Hayworth, R. Schalek, J. Tapia, J. Lichtman, and\nH. Seung. Boundary Learning by Optimization with Topological Constraints. In Computer Vision and\nPattern Recognition, IEEE Computer Society Conference on, 2010.\n\n[14] V. Jain, J. F. Murray, F. Roth, S. C. Turaga, V. Zhigulin, K. L. Briggman, M. N. Helmstaedter, W. Denk,\nand H. S. Seung. Supervised learning of image restoration with convolutional networks. Computer Vision,\nIEEE International Conference on, 0:1\u20138, 2007.\n\n[15] V. Jain, H. Seung, and S. Turaga. Machines that learn to segment images: a crucial technology for\n\nconnectomics. Current opinion in neurobiology, 2010.\n\n[16] E. Jurrus, R. Whitaker, B. W. Jones, R. Marc, and T. Tasdizen. An optimal-path approach for neural circuit\nreconstruction. In Biomedical Imaging: From Nano to Macro, 2008. ISBI 2008. 5th IEEE International\nSymposium on, pages 1609\u20131612, May 2008.\n\n[17] V. Kaynig, T. 
Fuchs, and J. M. Buhmann. Neuron geometry extraction by perceptual grouping in sstem\n\nimages. In Computer Vision and Pattern Recognition, IEEE Computer Society Conference on, 2010.\n\n[18] V. Kolmogorov and R. Zabih. What energy functions can be minimizedvia graph cuts? IEEE Transactions\n\non Pattern Analysis and Machine Intelligence, pages 147\u2013159, 2004.\n\n[19] A. Kulesza, F. Pereira, et al. Structured learning with approximate inference. Advances in neural infor-\n\nmation processing systems, 20, 2007.\n\n8\n\n\f[22] M. Maire, P. Arbel\u00e1ez, C. Fowlkes, and J. Malik. Using contours to detect and localize junctions in natural\nimages. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1\u20138.\nIEEE, 2008.\n\n[23] D. R. Martin, C. C. Fowlkes, and J. Malik. Learning to detect natural image boundaries using local\n\nbrightness, color, and texture cues. IEEE Trans. Patt. Anal. Mach. Intell., pages 530\u2013549, 2004.\n\n[24] J. Mutch, U. Knoblich, and T. Poggio. CNS: a GPU-based framework for simulating cortically-organized\n\nnetworks. Technical report, Massachussetts Institute of Technology, 2010.\n\n[25] W. M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statis-\n\ntical association, 66(336):846\u2013850, 1971.\n\n[26] X. Ren. Multi-scale improves boundary detection in natural images. In Proceedings of the 10th European\n\nConference on Computer Vision: Part III, pages 533\u2013545. Springer-Verlag, Springer, 2008.\n\n[27] X. Ren and J. Malik. Learning a Classi\ufb01cation Model for Segmentation. In Proceedings of the Ninth\n\nIEEE International Conference on Computer Vision-Volume 2, page 10. IEEE Computer Society, 2003.\n\n[28] H. Seung. Reading the Book of Memory: Sparse Sampling versus Dense Mapping of Connectomes.\n\nNeuron, 62(1):17\u201329, 2009.\n\n[20] J. W. Lichtman and J. R. Sanes. Ome sweet ome: what can the genome tell us about the connectome?\n\nCurr. Opin. 
Neurobiol., 18(3):346\u2013353, Jun 2008.\n\n[21] F. Maes, L. Denoyer, and P. Gallinari. Structured prediction with reinforcement learning. Machine\n\nlearning, 77(2):271\u2013301, 2009.\n\n[29] E. Sharon, A. Brandt, and R. Basri. Fast multiscale image segmentation. In Computer Vision and Pattern\n\nRecognition, 2000. Proceedings. IEEE Conference on, volume 1, pages 70\u201377. IEEE, 2000.\n\n[30] E. Sharon, M. Galun, D. Sharon, R. Basri, and A. Brandt. Hierarchy and adaptivity in segmenting visual\n\nscenes. Nature, 442(7104):810\u2013813, 2006.\n\n[31] C. Sommer, C. Straehle, U. K\u00f6the, and F. A. Hamprecht. \"ilastik: Interactive learning and segmentation\n\ntoolkit\". In 8th IEEE International Symposium on Biomedical Imaging (ISBI 2011), in press, 2011.\n\n[32] S. C. Turaga, K. L. Briggman, M. Helmstaedter, W. Denk, and H. S. Seung. Maximin af\ufb01nity learning of\n\nimage segmentation. In NIPS, 2009.\n\n[33] S. C. Turaga, J. F. Murray, V. Jain, F. Roth, M. Helmstaedter, K. Briggman, W. Denk, and H. S. Seung.\nConvolutional networks can learn to generate af\ufb01nity graphs for image segmentation. Neural Computa-\ntion, 22(2):511\u2013538, 2010.\n\n[34] R. Unnikrishnan, C. Pantofaru, and M. Hebert. Toward objective evaluation of image segmentation algo-\n\nrithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6):929, 2007.\n\n[35] S. N. Vitaladevuni and R. Basri. Co-clustering of image segments using convex optimization applied\nto EM neuronal reconstruction. In Computer Vision and Pattern Recognition, IEEE Computer Society\nConference on, 2010.\n\n[36] M. Wainwright. Estimating the wrong graphical model: Bene\ufb01ts in the computation-limited setting. 
The\n\nJournal of Machine Learning Research, 7:1829\u20131859, 2006.\n\n9\n\n\f", "award": [], "sourceid": 451, "authors": [{"given_name": "Viren", "family_name": "Jain", "institution": null}, {"given_name": "Srinivas", "family_name": "Turaga", "institution": null}, {"given_name": "K", "family_name": "Briggman", "institution": null}, {"given_name": "Moritz", "family_name": "Helmstaedter", "institution": null}, {"given_name": "Winfried", "family_name": "Denk", "institution": null}, {"given_name": "H.", "family_name": "Seung", "institution": null}]}
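The three SVF steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `svf_orientation`, the dictionary-based representation of the vector field, and the zero-vector edge handling are choices made here for clarity.

```python
import numpy as np

def svf_orientation(mask, radius=5, iters=3):
    """Minimal sketch of the Smoothed Vector Field (SVF) feature for a
    single supervoxel given as a binary 3-d array. Returns a unit vector
    estimating the supervoxel's orientation at its center of mass."""
    coords = np.argwhere(mask)                     # (N, 3) foreground voxels
    # Step 1: local orientation from second-order image moments. For each
    # foreground voxel, take a spherical window and the eigenvector with
    # the largest eigenvalue of the windowed second-order moment matrix.
    vecs = {}
    for p in coords:
        d = coords - p
        nbr = d[(d ** 2).sum(axis=1) <= radius ** 2]
        moment = nbr.T @ nbr / len(nbr)            # 3x3 moment matrix
        _, V = np.linalg.eigh(moment)              # eigenvalues ascending
        v = V[:, -1]                               # largest-eigenvalue axis
        vecs[tuple(p)] = v / (np.linalg.norm(v) + 1e-12)
    # Step 2: 'Ising-like' smoothing -- sum each vector's 3x3x3
    # neighborhood (center included) and renormalize to unit length.
    offsets = [(i, j, k) for i in (-1, 0, 1)
               for j in (-1, 0, 1) for k in (-1, 0, 1)]
    for _ in range(iters):
        new = {}
        for p in vecs:
            s = np.zeros(3)
            for i, j, k in offsets:
                q = (p[0] + i, p[1] + j, p[2] + k)
                if q in vecs:
                    s += vecs[q]
            n = np.linalg.norm(s)
            new[p] = s / n if n > 0 else vecs[p]   # keep old vector if sum cancels
        vecs = new
    # Step 3: read out the smoothed vector at (the voxel nearest to) the
    # supervoxel's center of mass.
    com = coords.mean(axis=0)
    nearest = tuple(coords[np.argmin(((coords - com) ** 2).sum(axis=1))])
    return vecs[nearest]
```

Note that principal eigenvectors are only defined up to sign; the text does not specify how sign consistency is maintained across voxels, so this sketch simply sums the raw vectors before renormalizing.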