{"title": "Combinatorial Energy Learning for Image Segmentation", "book": "Advances in Neural Information Processing Systems", "page_first": 1966, "page_last": 1974, "abstract": "We introduce a new machine learning approach for image segmentation that uses a neural network to model the conditional energy of a segmentation given an image. Our approach, combinatorial energy learning for image segmentation (CELIS) places a particular emphasis on modeling the inherent combinatorial nature of dense image segmentation problems. We propose efficient algorithms for learning deep neural networks to model the energy function, and for local optimization of this energy in the space of supervoxel agglomerations. We extensively evaluate our method on a publicly available 3-D microscopy dataset with 25 billion voxels of ground truth data. On an 11 billion voxel test set, we find that our method improves volumetric reconstruction accuracy by more than 20% as compared to two state-of-the-art baseline methods: graph-based segmentation of the output of a 3-D convolutional neural network trained to predict boundaries, as well as a random forest classifier trained to agglomerate supervoxels that were generated by a 3-D convolutional neural network.", "full_text": "Combinatorial Energy Learning for Image\n\nSegmentation\n\nJeremy Maitin-Shepard\nUC Berkeley Google\n\njbms@google.com\n\nViren Jain\n\nGoogle\n\nviren@google.com\n\nMichal Januszewski\n\nGoogle\n\nmjanusz@google.com\n\nPeter Li\nGoogle\n\nphli@google.com\n\nPieter Abbeel\nUC Berkeley\n\npabbeel@cs.berkeley.edu\n\nAbstract\n\nWe introduce a new machine learning approach for image segmentation that uses a\nneural network to model the conditional energy of a segmentation given an image.\nOur approach, combinatorial energy learning for image segmentation (CELIS)\nplaces a particular emphasis on modeling the inherent combinatorial nature of\ndense image segmentation problems. 
We propose ef\ufb01cient algorithms for learning\ndeep neural networks to model the energy function, and for local optimization of\nthis energy in the space of supervoxel agglomerations. We extensively evaluate\nour method on a publicly available 3-D microscopy dataset with 25 billion voxels\nof ground truth data. On an 11 billion voxel test set, we \ufb01nd that our method\nimproves volumetric reconstruction accuracy by more than 20% as compared to\ntwo state-of-the-art baseline methods: graph-based segmentation of the output\nof a 3-D convolutional neural network trained to predict boundaries, as well as a\nrandom forest classi\ufb01er trained to agglomerate supervoxels that were generated by\na 3-D convolutional neural network.\n\n1\n\nIntroduction\n\nMapping neuroanatomy, in the pursuit of linking hypothesized computational models consistent\nwith observed functions to the actual physical structures, is a long-standing fundamental problem\nin neuroscience. One primary interest is in mapping the network structure of neural circuits by\nidentifying the morphology of each neuron and the locations of synaptic connections between\nneurons, a \ufb01eld called connectomics. Currently, the most promising approach for obtaining such\nmaps of neural circuit structure is volume electron microscopy of a stained and \ufb01xed block of\ntissue. [4, 16, 17, 10] This technique was \ufb01rst used successfully decades ago in mapping the structure\nof the complete nervous system of the 302-neuron Caenorhabditis elegans; due to the need to\nmanually cut, image, align, and trace all neuronal processes in about 8000 50 nm serial sections, even\nthis small circuit required over 10 years of labor, much of it spent on image analysis. 
[31] At the time,\nscaling this approach to larger circuits was not practical.\nRecent advances in volume electron microscopy [11, 20, 15] make feasible the imaging of large\ncircuits, potentially containing hundreds of thousands of neurons, at suf\ufb01cient resolution to discern\neven the smallest neuronal processes. [4, 16, 17, 10] The high image quality and near-isotropic\nresolution achievable with these methods enables the resultant data to be treated as a true 3-D volume,\nwhich signi\ufb01cantly aids reconstruction of processes that do not run parallel to the sectioning axis, and\nis potentially more amenable to automated image processing.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fFigure 1: Illustration of computation of global energy for a single candidate segmentation S. The\nlocal energy Es(x; S; I) \u2208 [0, 1], computed by a deep neural network, is summed over all shape\ndescriptor types s and voxel positions x.\n\nImage analysis remains a key challenge, however. The primary bottleneck is in segmenting the\nfull volume, which is \ufb01lled almost entirely by heavily intertwined neuronal processes, into the\nvolumes occupied by each individual neuron. While the cell boundaries shown by the stain provide\na strong visual cue in most cases, neurons can extend for tens of centimeters in path length while\nin some places becoming as narrow as 40 nm; a single mistake anywhere along the path can render\nconnectivity information for the neuron largely inaccurate. 
Existing automated and semi-automated\nsegmentation methods do not suf\ufb01ciently reduce the amount of human labor required: a recent\nreconstruction of 950 neurons in the mouse retina required over 20000 hours of human labor, even\nwith an ef\ufb01cient method of tracing just a skeleton of each neuron [18]; a recent reconstruction of\n379 neurons in the Drosophila medulla column (part of the visual pathway) required 12940 hours of\nmanual proof-reading/correction of an automated segmentation [26].\nRelated work: Algorithmic approaches to image segmentation are often formulated as variations on\nthe following pipeline: a boundary detection step establishes local hypotheses of object boundaries, a\nregion formation step integrates boundary evidence into local regions (i.e. superpixels or supervoxels),\nand a region agglomeration step merges adjacent regions based on image and object features. [1, 19,\n30, 2] Although extensive integration of machine learning into such pipelines has begun to yield\npromising segmentation results [3, 14, 22], we argue that such pipelines, as previously formulated,\nfundamentally neglect two potentially important aspects of achieving accurate segmentation: (i) the\ncombinatorial nature of reasoning about dense image segmentation structure,1 and (ii) the fundamental\nimportance of shape as a criterion for segmentation quality.\nContributions: We propose a method that attempts to overcome these de\ufb01ciencies. In particular,\nwe propose an energy-based model that scores segmentation quality using a deep neural network\nthat \ufb02exibly integrates shape and image information: Combinatorial Energy Learning for Image\nSegmentation (CELIS). 
In pursuit of such a model, this paper makes several specific contributions:\na novel connectivity region data structure for efficiently computing the energy of configurations of\n3-D objects; a binary shape descriptor for efficient representation of 3-D shape configurations; a\nneural network architecture that splices the intermediate unit output from a trained convolutional\nnetwork as input to a deep fully-connected neural network architecture that scores a segmentation\nand 3-D image; a training procedure that uses pairwise object relations within a segmentation to\nlearn the energy-based model; and an experimental evaluation of the proposed and baseline automated\nreconstruction methods on a massive and (to our knowledge) unprecedented scale that reflects the\ntrue size of connectomic datasets required for biological analysis (many billions of voxels).\n\n2 Conditional energy modeling of segmentations given images\n\nWe define a global, translation-invariant energy model for predicting the cost of a complete segmentation S given a corresponding image I. This cost can be seen as analogous to the negative\n\n1While prior work [30, 14, 2] has recognized the importance of combinatorial reasoning, the previously\nproposed global optimization methods allow local decisions to interact only in a very limited way.\n\n2\n\n[Figure 1 schematic: the image I feeds a convolutional neural network (boundary classification) and an initial over-segmentation; agglomeration yields a candidate segmentation S; shape descriptors and image features feed a fully-connected layer producing local energies Es1, Es2, Es3, which are summed over all voxel positions x into the global energy E(S; I).]\flog-likelihood of the segmentation given the image, but we do not actually treat it probabilistically.\nOur goal is to define a model such that the true segmentation corresponding to a given image can be\nfound by minimizing the cost; the energy can reflect both a prior over object configurations alone, as\nwell as compatibility between object configurations and the image.\nAs shown in Fig.
1, we define the global energy E(S; I) as the sum over local energy models (defined\nby a deep neural network) Es(x; S; I) at several different scales s, computed in sliding-window\nfashion centered at every position x within the volume:\n\nE(S; I) := Σ_s Σ_x Es(x; S; I),    where    Es(x; S; I) := Ês(rs(x; S); φ(x; I)).\n\nThe local energy Es(x; S; I) depends on the local image context centered at position x by way\nof a vector representation φ(x; I) computed by a deep convolutional neural network, and on the\nlocal shape/object configuration at scale s by way of a novel local binary shape descriptor rs(x; S),\ndefined in Section 3.\nTo find (locally) minimal-cost segmentations under this model, we use local search over the space of\nagglomerations starting from some initial supervoxel segmentation. Using a simple greedy policy,\nat each step we consider all possible agglomeration actions, i.e. merges between any two adjacent\nsegments, and pick the action that results in the lowest energy.\nNaïvely, computing the energy for just a single segmentation requires computing shape descriptors\nand then evaluating the energy model at every voxel position within the volume; a small volume may\nhave tens or hundreds of millions of voxels. At each stage of the agglomeration, there may be\nthousands, or tens of thousands, of potential next agglomeration steps, each of which results in a\nunique segmentation. In order to choose the best next step, we must know the energy of all of these\npotential next segmentations. 
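The greedy local search described above can be sketched as follows. This is a minimal illustration in which `energy` stands in for the learned global energy E(S; I) and every candidate merge is re-scored from scratch; the actual implementation instead computes energy changes incrementally.

```python
def merge(labels, a, b):
    """Return a new labeling with segments a and b merged (b relabeled to a)."""
    return [a if l == b else l for l in labels]

def adjacent_pairs(labels, edges):
    """Candidate merges: pairs of distinct segment ids joined by at least one edge."""
    return {tuple(sorted((labels[i], labels[j])))
            for i, j in edges if labels[i] != labels[j]}

def greedy_agglomerate(labels, edges, energy):
    """Greedy local search over agglomerations: at each step apply the merge
    that lowers the energy the most; stop at a local minimum.  Illustrative
    only -- here each candidate segmentation is scored from scratch."""
    while True:
        base = energy(labels)
        best_delta, best = 0.0, None
        for a, b in adjacent_pairs(labels, edges):
            delta = energy(merge(labels, a, b)) - base
            if delta < best_delta:
                best_delta, best = delta, (a, b)
        if best is None:  # no merge lowers the energy: local minimum reached
            return labels
        labels = merge(labels, *best)
```

With a toy energy such as the number of distinct segments, the loop simply merges everything that is connected; with the learned model, it stops once every remaining merge would raise the energy.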
The computational cost to perform these computations directly would\nbe tremendous, but in the supplement, we prove a collection of theorems that allow for an efficient\nimplementation that computes these energy terms incrementally.\n\n3 Representing 3-D Shape Configurations with Local Binary Descriptors\n\nWe propose a binary shape descriptor based on subsampled pairwise connectivity information: given\na specification s of k pairs of position offsets {a1, b1}, . . . , {ak, bk} relative to the center of some\nfixed-size bounding box of size Bs, the corresponding k-bit binary shape descriptor r(U) for a\nparticular segmentation U of that bounding box is defined by\n\nri(U) := 1 if ai is connected to bi in U, and 0 otherwise, for i ∈ [1, k].\n\nAs shown in Fig. 2a, each bit of the descriptor specifies whether a particular pair of positions are\npart of the same segment, which can be determined in constant time by the use of a suitable data\nstructure. In the limit case, if we use the list of all C(n, 2) pairs of positions within an n-voxel bounding\nbox, no information is lost and the Hamming distance between two descriptors is precisely equal to\nthe Rand index. [23] In general we can sample a subset of only k pairs out of the C(n, 2) possible; if we\nsample uniformly at random, we retain the property that the expected Hamming distance between\ntwo descriptors is equal to the Rand index. We found that picking k = 512 bits provides a reasonable\ntrade-off between fidelity and representation size. While the pairs may be randomly sampled initially,\nnaturally to obtain consistent results when learning models based on these descriptors we must use\nthe same fixed list of positions for defining the descriptor at both training and test time.2\nNote that this descriptor serves in general as a type of sketch of a full segmentation of a given\nbounding box. 
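A descriptor of this form can be sketched as below. This is an illustrative sketch, not the authors' code: `connected(a, b)` is a hypothetical stand-in for the constant-time connectivity query, and the fixed pair list is sampled once and then reused at training and test time.

```python
import random

def make_descriptor_spec(box_shape, k=512, seed=0):
    """Sample a fixed list of k position pairs inside the bounding box.
    The same fixed list must be reused whenever descriptors are compared."""
    rng = random.Random(seed)
    positions = [(x, y, z)
                 for x in range(box_shape[0])
                 for y in range(box_shape[1])
                 for z in range(box_shape[2])]
    return [tuple(rng.sample(positions, 2)) for _ in range(k)]

def shape_descriptor(spec, connected):
    """k-bit descriptor: bit i is 1 iff pair (a_i, b_i) lies in one segment."""
    return [1 if connected(a, b) else 0 for a, b in spec]

def hamming(r1, r2):
    """Hamming distance between two descriptors; per the text, with uniformly
    sampled pairs its expectation matches the Rand index comparison of the
    two underlying segmentations."""
    return sum(b1 != b2 for b1, b2 in zip(r1, r2))
```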
By restricting one of the two positions of each pair to be the center position of the\nbounding box, we instead obtain a sketch of just the single segment containing the center position.\nWe refer to the descriptor in this case as center-based, and to the general case as pairwise, as shown\nin Fig. 2b. We will use these shape descriptors to represent only local sub-regions of a segmentation.\nTo represent shape information throughout a large volume, we compute shape descriptors densely at\nall positions in a sliding window fashion, as shown in Fig. 2c.\n\n2The BRIEF descriptor [5] is similarly defined as a binary descriptor based on a subset of the pairs of points\nwithin a patch, but each bit is based on the intensity difference, rather than connectivity, between each pair.\n\n3\n\n\f(a) Sequence showing computation of a shape descriptor.\n(b) Shape descriptors are computed at multiple scales. Pairwise descriptors (shown left and center) consider\narbitrary pairwise connectivity, while center-based shape descriptors (shown right) restrict one position of each\npair to be the center point.\n(c) Shape descriptors are computed densely at every position within the volume.\n\nFigure 2: Illustration of shape descriptors. The connected components of the bounding box U for\nwhich the descriptor is computed are shown in distinct colors. The pairwise connectivity relationships\nthat define the descriptor are indicated by dashed lines; connected pairs are shown in white, while\ndisconnected pairs are shown in black. Connectivity is determined based on the connected components\nof the underlying segmentation, not the geometry of the line itself. 
While this illustration is 2-D, in\nour experiments shape descriptors are computed fully in 3-D.\n\nConnectivity Regions\n\nAs de\ufb01ned, a single shape descriptor represents the segmentation within its \ufb01xed-size bounding box;\nby shifting the position of the bounding box we can obtain descriptors corresponding to different\nlocal regions of some larger segmentation. The size of the bounding box determines the scale of the\nlocal representation. This raises the question of how connectivity should be de\ufb01ned within these local\nregions. Two voxels may be connected only by a long path well outside the descriptor bounding box.\nAs we would like the shape descriptors to be consistent with the local topology, such pairs should\nbe considered disconnected. Shape descriptors are, therefore, de\ufb01ned with respect to connectivity\nwithin some larger connectivity region, which necessarily contains one or more descriptor bounding\nboxes but may in general be signi\ufb01cantly smaller than the full segmentation; conceptually, the shape\ndescriptor bounding box slides around to all possible positions contained within the connectivity\nregion. (This sliding necessarily results in some minor inconsistency in context between different\npositions, but reduces computational and memory costs.) To obtain shape descriptors at all positions,\nwe simply tile the space with overlapping rectangular connectivity regions of appropriate uniform\nsize and stride, as shown in the supplement. The connectivity region size determines the degree\nof locality of the connectivity information captured by the shape descriptor (independent of the\ndescriptor bounding box size). 
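The constant-time connectivity queries behind each descriptor bit can be provided by a disjoint-set (union-find) structure built once per connectivity region. This is an illustrative sketch under that assumption, not the authors' implementation:

```python
class UnionFind:
    """Disjoint-set forest with path compression and union by size."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path compression
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]

    def connected(self, a, b):
        return self.find(a) == self.find(b)
```

Building the structure amounts to calling `union` on every pair of adjacent voxels inside the region that share a segment id; each descriptor bit then costs a single amortized near-constant-time `connected` query.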
It also affects computational costs, as described in the supplement.\n\n4\n\n\f4 Energy model learning\n\nWe define the local energy model Ês(r; v) for each shape descriptor type/scale s by a learned neural\nnetwork model that computes a real-valued score in [0, 1] from a shape descriptor r and image feature\nvector v.\nTo simplify the presentation, we define the following notation for the forward discrete derivative of f\nwith respect to S: Δ^e_S f(S) := f(S + e) − f(S).\nBased on this notation, we have the discrete derivative of the energy function Δ^e_S E(S; I) =\nE(S + e; I) − E(S; I), where S + e denotes the result of merging the two supervoxels corresponding to e in the existing segmentation S. To agglomerate, our greedy policy simply chooses at\nstep t the action e that minimizes Δ^e_St E(St; I), where St denotes the current segmentation at step t.\nAs in prior work [22], we treat this as a classification problem, with the goal of matching the sign of\nΔ^e_St E(St; I) to Δ^e_St error(St, S∗), the corresponding change in segmentation error with respect to a\nground truth segmentation S∗, measured using Variation of Information [21].\n\n4.1 Local training procedure\n\nBecause the Δ^e_St E(St; I) term is simply the sum of the change in energies from each position\nand descriptor type s, as a heuristic we optimize the parameters of the energy model Ês(r; v)\nindependently for each shape descriptor type/scale s. We seek to minimize the expectation\n\nE_i[ ℓ(Δ^ei_Si error(Si, S∗), Ês(rs(xi; Si + ei); φ(xi; I))) + ℓ(−Δ^ei_Si error(Si, S∗), Ês(rs(xi; Si); φ(xi; I))) ],\n\nwhere i indexes over training examples that correspond to a particular sampled position xi and a\nmerge action ei applied to a segmentation Si. 
ℓ(y, a) denotes a binary classification loss function,\nwhere a ∈ [0, 1] is the predicted probability that the true label y is positive, weighted by |y|. Note that\nif Δ^ei_Si error(Si, S∗) < 0, then action ei improved the score and therefore we want a low predicted\nscore for the post-merge descriptor rs(xi; Si + ei) and a high predicted score for the pre-merge\ndescriptor rs(xi; Si); if Δ^ei_Si error(Si, S∗) > 0 the opposite applies. We tested the standard log loss\nℓ(y, a) := −|y| · [1_{y>0} log(a) + 1_{y<0} log(1 − a)], as well as the signed linear loss ℓ(y, a) := y · a,\nwhich more closely matches how the Es(x; Si; I) terms contribute to the overall Δ^e_S E(S; I) scores.\nStochastic gradient descent (SGD) is used to perform the optimization.\nWe obtain training examples by agglomerating using the expert policy that greedily optimizes\nerror(St, S∗). At each segmentation state St during an agglomeration step (including the initial state),\nfor each possible agglomeration action e, and each position x within the volume, we compute the shape\ndescriptor pair rs(x; St) and rs(x; St + e) reflecting the pre-merge and post-merge states, respectively.\nIf rs(x; St) ≠ rs(x; St + e), we emit a training example corresponding to this descriptor pair. We\nthereby obtain a conceptual stream of examples ⟨e, Δ^e_St error(St, S∗), φ(x; I), rs(x; St), rs(x; St + e)⟩.\nThis stream of examples may contain billions of examples (and many highly correlated), far more\nthan required to learn the parameters of Es. To reduce resource requirements, we use priority\nsampling [12], based on |Δ^e_S error(S, S∗)|, to obtain a fixed number of weighted samples without\nreplacement for each descriptor type s. 
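Priority sampling [12] as used here can be sketched as follows: each example of weight w > 0 receives priority w/u for u uniform in (0, 1], the k largest priorities form the sample, and the (k+1)-th priority serves as the threshold for adjusting kept weights. An illustrative sketch, not the production implementation:

```python
import heapq
import random

def priority_sample(weighted_items, k, seed=0):
    """Priority sampling (Duffield et al.): each (item, w) with w > 0 gets
    priority w / u with u ~ Uniform(0, 1]; the k largest priorities form a
    weighted sample without replacement.  Kept items get adjusted weight
    max(w, tau), where tau is the (k+1)-th largest priority."""
    rng = random.Random(seed)
    prioritized = [(w / (1.0 - rng.random()), item, w)  # 1 - random() is in (0, 1]
                   for item, w in weighted_items if w > 0]
    top = heapq.nlargest(k + 1, prioritized)
    if len(top) <= k:
        # Fewer than k eligible items: keep everything with exact weights.
        return [(item, w) for _, item, w in top]
    tau = top[k][0]  # threshold priority
    return [(item, max(w, tau)) for _, item, w in top[:k]]
```

The adjusted weights make subset-sum estimates over the sample unbiased, which is what lets a fixed-size sample stand in for the full example stream.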
We equalize the total weight of true merge examples (Δ^e_S error(S, S∗) < 0) and false merge examples (Δ^e_S error(S, S∗) > 0) in order to avoid learning degenerate models.3\n\n5 Experiments\n\nWe tested our approach on a large, publicly available electron microscopy dataset, called Janelia FIB-25, of a portion of the Drosophila melanogaster optic lobe. The dataset was collected at 8 × 8 × 8 nm\n\n3For example, if most of the weight is on false merge examples, as would often occur without balancing, the\nmodel can simply learn to assign a score that increases with the number of 1 bits in the shape descriptor.\n\n5\n\n\fMethod | VI | Rand F1\nCELIS (this paper) | 1.672 | 0.691\n3d-CNN+GALA | 2.069 | 0.597\n3d-CNN+Watershed | 2.143 | 0.629\n7colseg1 | 2.981 | 0.099\nOracle | 0.428 | 0.901\n\nFigure 3: Segmentation accuracy on 11-gigavoxel FIB-25 test set. Left: Pareto frontiers of\ninformation-theoretic split/merge error, as used previously to evaluate segmentation accuracy. [22]\nRight: Comparison of Variation of Information (lower is better) and Rand F1 score (higher is better).\nFor CELIS, 3d-CNN+GALA, and 3d-CNN+Watershed, the hyperparameters were optimized for each\nmetric on the training set.\n\nresolution using Focused Ion Beam Scanning Electron Microscopy (FIB-SEM); a labor-intensive\nsemi-automated approach was used to segment all of the larger neuronal processes within a ≈ 20,000\ncubic micron volume (comprising about 25 billion voxels). 
[27] To our knowledge, this challenging\ndataset is the largest publicly available electron microscopy dataset of neuropil with a corresponding\n\u201cground truth\u201d segmentation.\nFor our experiments, we split the dataset into separate training and testing portions along the z axis:\nthe training portion comprises z-sections 2005\u20135005, and the testing portion comprises z-sections\n5005\u20138000 (about 11 billion voxels).\n\n5.1 Boundary classification and oversegmentation\n\nTo obtain image features and an oversegmentation to use as input for agglomeration, we trained\nconvolutional neural networks to predict, based on a 35 × 35 × 9 voxel image context region, whether\nthe center voxel is part of the same neurite as the adjacent voxel in each of the x, y, and z directions, as\nin prior work. [29] We optimized the parameters of the network using stochastic gradient descent with\nlog loss. We trained several different networks, varying as hyperparameters the amount of dilation of\nboundaries in the training data (in order to increase extracellular space) from 0 to 8 voxels and whether\ncomponents smaller than 10000 voxels were excluded. See the supplementary information for a\ndescription of the network architecture. Using these connection affinities, we applied a watershed\nalgorithm [33, 34] to obtain an (approximate) oversegmentation. We used parameters Tl = 0.95,\nTh = 0.95, Te = 0.5, and Ts = 1000 voxels.\n\n5.2 Energy model architecture\n\nWe used five types of 512-dimensional shape descriptors: three pairwise descriptor types with 9³,\n17³, and 33³ bounding boxes, and two center-based descriptor types with 17³ and 33³ bounding\nboxes, respectively. 
The connectivity positions within the bounding boxes for each descriptor type\nwere sampled uniformly at random.\nWe used the 512-dimensional fully-connected penultimate layer output of the low-level classification\nconvolutional neural network as the image feature vector φ(x; I). For each shape descriptor type s,\nwe used the following architecture for the local energy model Ês(r; v): we concatenated the shape\ndescriptor vector and the image feature vector to obtain a 1024-dimensional input vector. We used\ntwo 2048-dimensional fully-connected rectified linear hidden layers, followed by a logistic output\nunit, and applied dropout (with p = 0.5) after the last hidden layer. While this effectively computes a\n\n6\n\n[Figure 3 plot axes: merge error H(t|p) vs. split error H(p|t), both ranging 0.0\u20133.5.]\fscore from a raw image patch and a shape descriptor, by segregating expensive convolutional image\nprocessing that does not depend on the shape descriptor, this architecture allows us to benefit from\npre-training and precomputation of the intermediate image feature vector φ(x; I) for each position x.\nTraining for both the energy models and the boundary classifier was performed using asynchronous\nSGD on a distributed architecture. [9]\n\n5.3 Evaluation\n\nWe compared our method to the state-of-the-art agglomeration method GALA [22], which trains\na random forest classifier to predict merge decisions using image features derived from boundary\nprobabilities. 
4 To obtain such probabilities from our low-level convolutional neural network classi\ufb01er,\nwhich predicts edge af\ufb01nities between adjacent voxels rather than per-voxel predictions, we compute\nfor each voxel the minimum connection probability to any voxel in its 6-connectivity neighborhood,\nand treat this as the probability/score of it being cell interior.\nFor comparison, we also evaluated a watershed procedure applied to the CNN af\ufb01nity graph output,\nunder varying parameter choices, to measure the accuracy of the deep CNN boundary classi\ufb01cation\nwithout the use of an agglomeration procedure. Finally, we evaluated the accuracy of the publicly\nreleased automated segmentation of FIB-25 (referred to as 7colseg1) [13] that was the basis of\nthe proofreading process used to obtain the ground truth; it was produced by applying watershed\nsegmentation and a variant of GALA agglomeration to the predictions made by an Ilastik [25]-trained\nvoxel classi\ufb01er.\nWe tested both GALA and CELIS using the same initial oversegmentations for the training and test\nregions. To compare the accuracy of the reconstructions, we computed two measures of segmentation\nconsistency relative to the ground truth: Variation of Information [21] and Rand F1 score, de\ufb01ned as\nthe F1 classi\ufb01cation score over connectivity between all voxel pairs within the volumes; these are the\nprimary metrics used in prior work. [28, 8, 22] The former has the advantage of weighing segments\nlinearly in their size rather than quadratically.\nBecause any agglomeration method is ultimately limited by the quality of the initial oversegmentation,\nwe also computed the accuracy of an oracle agglomeration policy that greedily optimizes the\nerror metric directly. (Computing the true globally-optimal agglomeration under either metric is\nintractable.) 
This serves as an (approximate) upper bound that is useful for separating the error due to\nagglomeration from the error due to the initial oversegmentation.\n\n6 Results\n\nFigure 3 shows the Pareto optimal trade-offs between test set split and merge error of each method\nobtained by varying the choice of hyperparameters and agglomeration thresholds, as well as the\nVariation of Information and Rand F1 scores obtained from the training set-optimal hyperparameters.\nCELIS consistently outperforms all other methods by a significant margin under both metrics. The\nlarge gap between the Oracle results and the best automated reconstruction indicates, however, that\nthere is still large room for improvement in agglomeration.\nWhile the evaluations are done on a single dataset, it is a single very large dataset; to verify that\nthe improvement due to CELIS is broad and general (rather than localized to a very specific part of\nthe image volume), we also evaluated accuracy independently on 18 non-overlapping 500³-voxel\nsubvolumes evenly spaced within the test region. On all subvolumes CELIS outperformed the best\nexisting method under both metrics, with a median reduction in Variation of Information error of 19%\nand in Rand F1 error of 22%. This suggests that CELIS is improving accuracy in many parts of the\nvolume that span significant variations in shape and image characteristics.\n\n4GALA also supports multi-channel image features, potentially representing predicted probabilities of\nadditional classes, such as mitochondria, but we did not make use of this functionality as we did not have training\ndata for additional classes.\n\n7\n\n\f7 Discussion\n\nWe have introduced CELIS, a framework for modeling image segmentations using a learned energy\nfunction that specifically exploits the combinatorial nature of dense segmentation. 
We have described\nhow this approach can be used to model the conditional energy of a segmentation given an image, and\nhow the resulting model can be used to guide supervoxel agglomeration decisions. In our experiments\non a challenging 3d microscopy reconstruction problem, CELIS improved volumetric reconstruction\naccuracy by 20% over the best existing method, and offered a strictly better trade-off between split\nand merge errors, by a wide margin, compared to existing methods.\nThe experimental results are unique in the scale of the evaluations: the 11-gigavoxel test region is 2\u20134\norders of magnitude larger than used for evaluation in prior work, and we believe this large scale of\nevaluation to be critically important; we have found evaluations on smaller volumes, containing only\nshort neurite fragments, to be unreliable at predicting accuracy on larger volumes (where propagation\nof merge errors is a major challenge). While more computationally expensive than many prior\nmethods, CELIS is nonetheless practical: we have successfully run CELIS on volumes approaching\n\u2248 1 teravoxel in a matter of hours, albeit using many thousands of CPU cores.\nIn addition to advancing the state of the art in learning-based image segmentation, this work also has\nsigni\ufb01cant implications for the application area we have studied, connectomic reconstruction. The\nFIB-25 dataset re\ufb02ects state-of-the-art techniques in sample preparation and imaging for large-scale\nneuron reconstruction, and in particular is highly representative of much larger datasets actively\nbeing collected (e.g. of a full adult \ufb02y brain). 
We expect, therefore, that the signi\ufb01cant improvements\nin automated reconstruction accuracy made by CELIS on this dataset will directly translate to a\ncorresponding decrease in human proof-reading effort required to reconstruct a given volume of tissue,\nand a corresponding increase in the total size of neural circuit that may reasonably be reconstructed.\nFuture work in several speci\ufb01c areas seems particularly fruitful:\n\n\u2022 End-to-end training of the CELIS energy modeling pipeline, including the CNN model\nfor computing the image feature representation and the aggregation of local energies at\neach position and scale. Because the existing pipeline is fully differentiable, it is directly\namenable to end-to-end training.\n\n\u2022 Integration of the CELIS energy model with discriminative training of a neural network-\nbased agglomeration policy. Such a policy could depend on the distribution of local energy\nchanges, rather than just the sum, as well as other per-object and per-action features proposed\nin prior work. [22, 3]\n\n\u2022 Use of a CELIS energy model for \ufb01xing undersegmentation errors. While the energy\nminimization procedure proposed in this paper is based on a greedy local search limited to\nperforming merges, the CELIS energy model is capable of evaluating arbitrary changes to\nthe segmentation. Evaluation of candidate splits (based on a hierarchical initial segmentation\nor other heuristic criteria) would allow for the use of a potentially more robust simulated\nannealing energy minimization procedure capable of both splits and merges.\n\nSeveral recent works [24, 32, 7, 6] have integrated deep neural networks into pairwise-potential\nconditional random \ufb01eld models. 
Similar to CELIS, these approaches combine deep learning with\nstructured prediction, but differ from CELIS in several key ways:\n\n\u2022 Through a restriction to models that can be factored into pairwise potentials, these ap-\nproaches are able to use mean \ufb01eld and pseudomarginal approximations to perform ef\ufb01cient\napproximate inference. The CELIS energy model, in contrast, sacri\ufb01ces factorization for the\nricher combinatorial modeling provided by the proposed 3-D shape descriptors.\n\n\u2022 More generally, these prior CRF methods are focused on re\ufb01ning predictions (e.g. improving\nboundary localization/detail for semantic segmentation) made by a feed-forward neural\nnetwork that are correct at a high level. In contrast, CELIS is designed to correct fundamental\ninaccuracy of the feed-forward convolutional neural network in critical cases of ambiguity,\nwhich is re\ufb02ected in the much greater complexity of the structured model.\n\nAcknowledgments\n\nThis material is based upon work supported by the National Science Foundation under Grant No.\n1118055.\n\n8\n\n\fReferences\n[1] B. Andres, U. K\u00f6the, M. Helmstaedter, W. Denk, and F. Hamprecht. Segmentation of SBFSEM volume data of neural tissue by hierar-\n\nchical classi\ufb01cation. Pattern recognition, pages 142\u2013152, 2008. 2\n\n[2] Bjoern Andres, Thorben Kroeger, Kevin L Briggman, Winfried Denk, Natalya Korogod, Graham Knott, Ullrich Koethe, and Fred A\nHamprecht. Globally optimal closed-surface segmentation for connectomics. In Computer Vision\u2013ECCV 2012, pages 778\u2013791. Springer,\n2012. 2\n\n[3] John A Bogovic, Gary B Huang, and Viren Jain. Learned versus hand-designed feature representations for 3d agglomeration.\n\narXiv:1312.6159, 2013. 2, 8\n\n[4] Kevin L Briggman and Winfried Denk. Towards neural circuit reconstruction with volume electron microscopy techniques. Current\n\nOpinion in Neurobiology, 16(5):562 \u2013 570, 2006. 
[5] Michael Calonder, Vincent Lepetit, Christoph Strecha, and Pascal Fua. BRIEF: Binary robust independent elementary features. In European Conference on Computer Vision, pages 778–792. Springer, 2010.

[6] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. CoRR, abs/1412.7062, 2014.

[7] Liang-Chieh Chen, Alexander G Schwing, Alan L Yuille, and Raquel Urtasun. Learning deep structured models. In Proc. ICML, 2015.

[8] Dan Claudiu Ciresan, Alessandro Giusti, Luca Maria Gambardella, and Jürgen Schmidhuber. Deep neural networks segment neuronal membranes in electron microscopy images. In NIPS, pages 2852–2860, 2012.

[9] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Marc'Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, Quoc V. Le, and Andrew Y. Ng. Large scale distributed deep networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1223–1231. Curran Associates, Inc., 2012.

[10] Winfried Denk, Kevin L Briggman, and Moritz Helmstaedter. Structural neurobiology: missing link to a mechanistic understanding of neural computation. Nature Reviews Neuroscience, 13(5):351–358, 2012.

[11] Winfried Denk and Heinz Horstmann. Serial block-face scanning electron microscopy to reconstruct three-dimensional tissue nanostructure. PLoS Biology, 2(11):e329, 2004.

[12] Nick Duffield, Carsten Lund, and Mikkel Thorup. Priority sampling for estimation of arbitrary subset sums. Journal of the ACM (JACM), 54(6):32, 2007.

[13] Janelia FlyEM. https://www.janelia.org/project-team/flyem/data-and-software-release. Accessed: 2016-05-19.
[14] Jan Funke, Bjoern Andres, Fred A Hamprecht, Albert Cardona, and Matthew Cook. Efficient automatic 3d-reconstruction of branching neurons from EM data. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 1004–1011. IEEE, 2012.

[15] KJ Hayworth, N Kasthuri, R Schalek, and JW Lichtman. Automating the collection of ultrathin serial sections for large volume TEM reconstructions. Microscopy and Microanalysis, 12(Supplement S02):86–87, 2006.

[16] Moritz Helmstaedter, Kevin L Briggman, and Winfried Denk. 3d structural imaging of the brain with photons and electrons. Current Opinion in Neurobiology, 18(6):633–641, 2008.

[17] Moritz Helmstaedter, Kevin L Briggman, and Winfried Denk. High-accuracy neurite reconstruction for high-throughput neuroanatomy. Nature Neuroscience, 14(8):1081–1088, 2011.

[18] Moritz Helmstaedter, Kevin L Briggman, Srinivas C Turaga, Viren Jain, H Sebastian Seung, and Winfried Denk. Connectomic reconstruction of the inner plexiform layer in the mouse retina. Nature, 500(7461):168–174, 2013.

[19] Viren Jain, Srinivas C Turaga, Kevin L Briggman, Moritz N Helmstaedter, Winfried Denk, and H Sebastian Seung. Learning to agglomerate superpixel hierarchies. Advances in Neural Information Processing Systems, 2(5), 2011.

[20] Graham Knott, Herschel Marchman, David Wall, and Ben Lich. Serial section scanning electron microscopy of adult brain tissue using focused ion beam milling. The Journal of Neuroscience, 28(12):2959–2964, 2008.

[21] Marina Meilă. Comparing clusterings—an information based distance. Journal of Multivariate Analysis, 98(5):873–895, 2007.

[22] Juan Nunez-Iglesias, Ryan Kennedy, Toufiq Parag, Jianbo Shi, and Dmitri B Chklovskii. Machine learning of hierarchical clustering to segment 2d and 3d images. PLoS ONE, 8(8):e71715, 2013.

[23] William M. Rand.
Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850, 1971.

[24] Alexander G Schwing and Raquel Urtasun. Fully connected deep structured networks. arXiv preprint arXiv:1503.02351, 2015.

[25] Christoph Sommer, Christoph Straehle, Ullrich Kothe, and Fred A Hamprecht. ilastik: Interactive learning and segmentation toolkit. In Biomedical Imaging: From Nano to Macro, 2011 IEEE International Symposium on, pages 230–233. IEEE, 2011.

[26] Shin-ya Takemura, Arjun Bharioke, Zhiyuan Lu, Aljoscha Nern, Shiv Vitaladevuni, Patricia K Rivlin, William T Katz, Donald J Olbris, Stephen M Plaza, Philip Winston, et al. A visual motion detection circuit suggested by Drosophila connectomics. Nature, 500(7461):175–181, 2013.

[27] Shin-ya Takemura, C Shan Xu, Zhiyuan Lu, Patricia K Rivlin, Toufiq Parag, Donald J Olbris, Stephen Plaza, Ting Zhao, William T Katz, Lowell Umayam, et al. Synaptic circuits and their variations within different columns in the visual system of Drosophila. Proceedings of the National Academy of Sciences, 112(44):13711–13716, 2015.

[28] Srinivas Turaga, Kevin Briggman, Moritz Helmstaedter, Winfried Denk, and Sebastian Seung. Maximin affinity learning of image segmentation. In Advances in Neural Information Processing Systems 22, pages 1865–1873. MIT Press, Cambridge, MA, 2009.

[29] Srinivas C. Turaga, Joseph F. Murray, Viren Jain, Fabian Roth, Moritz Helmstaedter, Kevin Briggman, Winfried Denk, and H. Sebastian Seung. Convolutional networks can learn to generate affinity graphs for image segmentation. Neural Computation, 22(2):511–538, 2010.

[30] Amelio Vazquez-Reina, Michael Gelbart, Daniel Huang, Jeff Lichtman, Eric Miller, and Hanspeter Pfister. Segmentation fusion for connectomics. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 177–184. IEEE, 2011.

[31] J. G.
White, E. Southgate, J. N. Thomson, and S. Brenner. The Structure of the Nervous System of the Nematode Caenorhabditis elegans. Philosophical Transactions of the Royal Society of London. B, Biological Sciences, 314(1165):1–340, 1986.

[32] Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip HS Torr. Conditional random fields as recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1529–1537, 2015.

[33] Aleksandar Zlateski. A design and implementation of an efficient, parallel watershed algorithm for affinity graphs. PhD thesis, Massachusetts Institute of Technology, 2011.

[34] Aleksandar Zlateski and H. Sebastian Seung. Image segmentation by size-dependent single linkage clustering of a watershed basin graph. CoRR, 2015.