{"title": "Recurrent Registration Neural Networks for Deformable Image Registration", "book": "Advances in Neural Information Processing Systems", "page_first": 8758, "page_last": 8768, "abstract": "Parametric spatial transformation models have been successfully applied to image\nregistration tasks. In such models, the transformation of interest is parameterized\nby a fixed set of basis functions as for example B-splines. Each basis function\nis located on a fixed regular grid position among the image domain because the\ntransformation of interest is not known in advance. As a consequence, not all basis\nfunctions will necessarily contribute to the final transformation which results in a\nnon-compact representation of the transformation. We reformulate the pairwise\nregistration problem as a recursive sequence of successive alignments. For each\nelement in the sequence, a local deformation defined by its position, shape, and\nweight is computed by our recurrent registration neural network. The sum of all lo-\ncal deformations yield the final spatial alignment of both images. Formulating the\nregistration problem in this way allows the network to detect non-aligned regions in\nthe images and to learn how to locally refine the registration properly. In contrast to\ncurrent non-sequence-based registration methods, our approach iteratively applies\nlocal spatial deformations to the images until the desired registration accuracy\nis achieved. We trained our network on 2D magnetic resonance images of the\nlung and compared our method to a standard parametric B-spline registration. The\nexperiments show, that our method performs on par for the accuracy but yields a\nmore compact representation of the transformation. 
Furthermore, we achieve a\nspeedup of around 15 compared to the B-spline registration.", "full_text": "Recurrent Registration Neural Networks for\n\nDeformable Image Registration\n\nRobin Sandk\u00fchler\n\nDepartment of Biomedical Engineering\n\nUniversity of Basel, Switzerland\nrobin.sandkuehler@unibas.ch\n\nSimon Andermatt\n\nDepartment of Biomedical Engineering\n\nUniversity of Basel, Switzerland\nsimon.andermatt@unibas.ch\n\nGrzegorz Bauman\n\nDivision of Radiological Physics\n\nDepartment of Radiology\n\nUniversity of Basel Hospital, Switzerland\n\ngrzegorz.bauman@usb.ch\n\nSylvia Nyilas\n\nPediatric Respiratory Medicine\n\nDepartment of Pediatrics\n\nInselspital, Bern University Hospital\n\nUniversity of Bern, Switzerland\nsylvia.nyilas@insel.ch\n\nChristoph Jud\n\nPhilippe C. Cattin\n\nDepartment of Biomedical Engineering\n\nDepartment of Biomedical Engineering\n\nUniversity of Basel, Switzerland\nchristoph.jud@unibas.ch\n\nUniversity of Basel, Switzerland\nphilippe.cattin@unibas.ch\n\nAbstract\n\nParametric spatial transformation models have been successfully applied to image\nregistration tasks. In such models, the transformation of interest is parameterized\nby a \ufb01xed set of basis functions as for example B-splines. Each basis function\nis located on a \ufb01xed regular grid position among the image domain because the\ntransformation of interest is not known in advance. As a consequence, not all basis\nfunctions will necessarily contribute to the \ufb01nal transformation which results in a\nnon-compact representation of the transformation. We reformulate the pairwise\nregistration problem as a recursive sequence of successive alignments. For each\nelement in the sequence, a local deformation de\ufb01ned by its position, shape, and\nweight is computed by our recurrent registration neural network. The sum of all lo-\ncal deformations yield the \ufb01nal spatial alignment of both images. 
Formulating the registration problem in this way allows the network to detect non-aligned regions in the images and to learn how to locally refine the registration properly. In contrast to current non-sequence-based registration methods, our approach iteratively applies local spatial deformations to the images until the desired registration accuracy is achieved. We trained our network on 2D magnetic resonance images of the lung and compared our method to a standard parametric B-spline registration. The experiments show that our method performs on par in terms of accuracy but yields a more compact representation of the transformation. Furthermore, we achieve a speedup of around 15 compared to the B-spline registration.

1 Introduction

Image registration is essential for medical image analysis methods, where corresponding anatomical structures in two or more images need to be spatially aligned. Misalignment often occurs between images of the same structure acquired with different imaging modalities (CT, SPECT, MRI) or during

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Sequence-based registration process for pairwise deformable image registration of a fixed image F and a moving image M.

the acquisition of dynamic time series (2D+t, 4D). An overview of registration methods and their different categories is given in [24]. In this work, we focus on parametric transformation models in combination with learning-based registration methods. There are two major classes of parametric transformation models used in medical image registration. The first class are the dense transformation models, also called optical flow [11].
Here, the transformation of each pixel in the image is estimated directly (Figure 2a). The second class consists of interpolating transformation models (Figure 2b). Interpolating transformation models approximate the transformation between both images with a set of fixed basis functions (e.g. Gaussian, B-spline) placed on a fixed grid over the image domain [22, 27, 15, 14]. These models reduce the number of free parameters for the optimization, but restrict the space of admissible transformations. Both transformation models have advantages and disadvantages. Dense models allow the preservation of local discontinuities of the transformation, while interpolating models achieve global smoothness if the chosen basis function is smooth.

Although the computation time for registration has been reduced in the past, image registration is still computationally costly, because a non-linear optimization problem needs to be solved for each pair of images. In order to reduce the computation time and to increase the accuracy of the registration result, learning-based registration methods have recently been introduced. As the registration is now separated into a training part and an inference part, a major reduction of the registration time is achieved. A detailed overview of deep learning methods for image registration is given in [8]. FlowNet [6] uses a convolutional neural network (CNN) to learn the optical flow between two input images; the network is trained in a supervised fashion using ground-truth transformations from synthetic data sets. Based on the idea of spatial transformer networks [13], unsupervised learning-based registration methods were introduced [5, 4, 26, 12]. All of these methods have in common that the output of the network is directly the final transformation.
In contrast, sequence-based methods do not estimate the final transformation in one step but rather as a series of transformations, each based on observations of the previous transformation result. This process is continued iteratively until the desired accuracy is achieved. Applying a sequence of local or global deformations is inspired by how a human would align two images. Sequence-based methods for rigid [18, 20] and deformable [17] registration based on reinforcement learning were introduced in the past. However, the action space for deformable image registration can be very large, and the training of deep reinforcement learning methods is still very challenging.

In this work, we present the Recurrent Registration Neural Network (R2N2), a novel sequence-based registration method for deformable image registration. Figure 1 shows the registration process with the R2N2. Instead of learning the transformation as a whole, we iteratively apply a network to detect local differences between two images and to determine how to align them using a parameterized local deformation. Modeling the final transformation of interest as a sequence of local parametric transformations instead of a fixed set of basis functions enables our method to extend the space of admissible transformations while still achieving global smoothness. Furthermore, we are able to achieve a compact representation of the final transformation.
As we define the resulting transformation as a recursive sequence of local transformations, we base our architecture on recurrent neural networks. To the best of our knowledge, recurrent neural networks have not been used before for deformable image registration.

Figure 2: Dense, interpolating, and proposed transformation models.

2 Background

Given two images that need to be aligned, the fixed image F : X → R and the moving image M : X → R on the image domain X ⊂ R^d, the pairwise registration problem can be defined as a regularized minimization problem

f* = arg min_f S[F, M ◦ f] + λ R[f].  (1)

Here, f* : X → R^d is the transformation of interest and a minimizer of (1). The image loss S : X × X → R determines the image similarity of F and M ◦ f, with (M ◦ f)(x) = M(x + f(x)). In order to restrict the transformation space by using prior knowledge of the transformation, a regularization loss R : R^d → R and the regularization weight λ are added to the optimization problem. The regularizer is chosen depending on the expected transformation characteristics (e.g. global smoothness or piece-wise smoothness).

2.1 Transformation

In order to optimize (1), a transformation model f_θ is needed. The minimization problem then becomes

θ* = arg min_θ S[F, M ◦ f_θ] + λ R[f_θ],  (2)

where θ are the parameters of the transformation model. There are two major classes of transformation models used in image registration: dense and interpolating. In the dense case, the transformation at position x in the image is defined by a displacement vector

f_θ(x) = θ_x,  (3)

with θ_x = (ϑ_1, ϑ_2, …, ϑ_d) ∈ R^d. For the interpolating case, the transformation at position x is normally defined in a smooth basis

f_θ(x) = Σ_{i=1}^{N} θ_i k(x, c_i).  (4)

Here, {c_i}_{i=1}^{N}, c_i ∈ X are the positions of the fixed regular grid points in the image domain, k : X × X → R is the basis function, and N is the number of grid points. The transformation between the control points c_i is an interpolation of the control point values θ_i ∈ R^d with the basis function k. A visualization of a dense and an interpolating transformation model is shown in Figure 2.

2.2 Recurrent Neural Networks

Recurrent neural networks (RNNs) are a class of neural networks designed for sequential data. A simple RNN has the form

h_t = φ(W x_t + U h_{t−1}),  (5)

where W is a weighting matrix of the input at time t, U is the weight matrix of the last output at time t − 1, and φ is an activation function such as the hyperbolic tangent or the logistic function.

Figure 3: Network architecture of the presented Recurrent Registration Neural Network.
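As a concrete illustration of the interpolating model in (4), here is a minimal numpy sketch with a Gaussian basis function; the kernel width `sigma` and the toy 5 × 5 control grid are illustrative choices, not values from the paper:

```python
import numpy as np

def interpolating_transform(x, centers, theta, sigma=0.1):
    """Evaluate f_theta(x) = sum_i theta_i * k(x, c_i) with a Gaussian basis k (Eq. 4).

    x:       (d,) query position
    centers: (N, d) fixed grid positions c_i
    theta:   (N, d) control-point displacement values
    sigma:   kernel width (illustrative choice)
    """
    sq_dist = np.sum((centers - x) ** 2, axis=1)   # ||x - c_i||^2
    k = np.exp(-0.5 * sq_dist / sigma ** 2)        # Gaussian basis k(x, c_i)
    return k @ theta                               # (d,) displacement at x

# 5x5 control grid on [0, 1]^2; zero displacement everywhere except the
# middle control point, which produces a smooth local bump.
cx, cy = np.meshgrid(np.linspace(0, 1, 5), np.linspace(0, 1, 5))
centers = np.stack([cx.ravel(), cy.ravel()], axis=1)
theta = np.zeros_like(centers)
theta[12] = [0.05, 0.0]                            # displace the middle control point
f = interpolating_transform(np.array([0.5, 0.5]), centers, theta)
```

At the bump's center the basis evaluates to one, so the displacement there equals the control-point value; away from it, the displacement decays smoothly, which is exactly the interpolation behavior described above.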
Since the output at time t directly depends on the weighted previous output h_{t−1}, RNNs are well suited for detecting sequential information that is encoded in the sequence itself. RNNs provide an elegant way of incorporating the whole previous sequence without adding a large number of parameters. Besides the advantages of RNNs for sequential data, there are some difficulties to address, e.g. the problem of learning long-term dependencies. The long short-term memory (LSTM) architecture was introduced in order to overcome these problems of the basic RNN [10]. A variation of the LSTM, the gated recurrent unit (GRU), was presented by [3].

3 Methods

In the following, we present our Recurrent Registration Neural Network (R2N2) for the application of sequence-based pairwise medical image registration of 2D images.

3.1 Sequence-Based Image Registration

Sequence-based registration methods do not estimate the final transformation in one step but rather as a series of local transformations. The minimization problem for the sequence-based registration is given as

θ* = arg min_θ (1/T) Σ_{t=1}^{T} S[F, M ◦ f_t^θ] + λ R[f_T].  (6)

Compared to the registration problem (2), the transformation f_t^θ is now defined as a recursive function of the form

f_t^θ(x, F, M) = { 0,                                        if t = 0,
                  f_{t−1}^θ + l(x, g_θ(F, M ◦ f_{t−1}^θ)),   else.     (7)

Here, g_θ is the function that outputs the parameters of the next local transformation given the two images F and M ◦ f_t^θ.
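The recursion (7), together with the Gaussian local transformation defined in (8)–(9) below, can be sketched in a few lines of numpy. The stub predictor that always proposes the same bump stands in for the network g_θ, and the warping of the moving image is omitted (identity) to keep the example short; all numeric values are illustrative:

```python
import numpy as np

def local_gaussian(shape, center, sx, sy, alpha, v):
    """Local deformation l(x) = v * exp(-0.5 (x-c)^T Sigma^-1 (x-c)), cf. Eqs. (8)-(9)."""
    R = np.array([[np.cos(alpha), -np.sin(alpha)],
                  [np.sin(alpha),  np.cos(alpha)]])
    Sigma = R @ np.diag([sx, sy]) @ R.T            # rotated shape matrix, cf. Eq. (9)
    Sinv = np.linalg.inv(Sigma)
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    d = np.stack([xs - center[0], ys - center[1]], axis=-1).astype(float)
    expo = -0.5 * np.einsum('...i,ij,...j->...', d, Sinv, d)
    return np.exp(expo)[..., None] * np.asarray(v)  # (H, W, 2) displacement field

def register_sequence(fixed, moving, predict, T=25):
    """f_t = f_{t-1} + l_t, cf. Eq. (7); warping is left out for brevity."""
    f = np.zeros(fixed.shape + (2,))                # f_0 = 0
    for _ in range(T):
        f = f + local_gaussian(fixed.shape, *predict(fixed, moving))
    return f

# Stub predictor: always the same bump at pixel (8, 8), widths 4 and 2,
# no rotation, weight (0.02, 0) -- purely illustrative values.
predict = lambda F, M: ((8, 8), 4.0, 2.0, 0.0, (0.02, 0.0))
F = np.zeros((16, 16)); M = np.zeros((16, 16))
f = register_sequence(F, M, predict, T=25)
```

After 25 steps the accumulated displacement peaks at the bump's center with 25 × 0.02 = 0.5 pixels in x, illustrating how many small local deformations sum to the final transformation.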
In each time step t, a local transformation l : X × X → R^2 is computed and added to the transformation f_t^θ. After transforming the moving image M with f_t^θ, the result is used as input for the next time step in order to compute the next local transformation, as shown in Figure 1. This procedure is repeated until both input images are aligned. We define a local transformation as a Gaussian function

l(x, x̃_t, Γ_t, v_t) = v_t exp( −(1/2) (x − x̃_t)^T Σ(Γ_t)^{−1} (x − x̃_t) ),  (8)

where x̃_t = (x_t, y_t) ∈ X is the position, v_t = (v_t^x, v_t^y) ∈ [−1, 1]^2 the weight, and Γ_t = {σ_t^x, σ_t^y, α_t} the shape parameter with

Σ(Γ_t) = [cos(α_t)  −sin(α_t); sin(α_t)  cos(α_t)] [σ_t^x  0; 0  σ_t^y] [cos(α_t)  −sin(α_t); sin(α_t)  cos(α_t)]^T.  (9)

Here, σ_t^x, σ_t^y ∈ R_{>0} control the width and α_t ∈ [0, π] the rotation of the Gaussian function. The output of g_θ is defined as g_θ = {x̃_t, Γ_t, v_t}. Compared to the interpolating registration model shown in Figure 2b, the position x̃_t and shape Γ_t of the basis functions are not fixed during the registration in our method (Figure 2c).

Figure 4: Architectures for the position network (a), the parameter network (b), and the gated recurrent registration unit (c).

3.2 Network Architecture

We developed a network architecture to approximate the unknown function g_θ, where θ are the parameters of the network. Since the transformation of the registration is defined as a recursive sequence, we base our network on GRUs due to their efficient gated architecture. An overview of the complete network architecture is shown in Figure 3. The inputs of the network are two images, the fixed image F and the moving image M ◦ f_t. As suggested in [19], we attach the position of each pixel as two additional coordinate channels to improve the handling of spatial representations by the convolution layers. Our network contains three major sub-networks that generate the parameters of the local transformation: the gated recurrent registration unit (GR2U), the position network, and the parameter network.

Gated Recurrent Registration Unit Our network contains three GR2Us for different spatial resolutions (128 × 128, 64 × 64, 32 × 32). Each GR2U has an internal structure as shown in Figure 4c. The input of the GR2U block is passed through a residual network with three stacked residual blocks [9]. If not stated otherwise, we use the hyperbolic tangent as activation function in the network. The core of each GR2U is the C-GRU block. For this, we adopt the original GRU equations shown in [3] in order to use convolutions instead of fully connected layers, as presented in [1]. In contrast to [1], we adapt the proposal gate (12) for use with convolutions, but without factoring r^j out of the convolution.
The C-GRU is then defined by:

r^j = ψ( Σ_{i=1}^{I} (x ∗ w_r^{i,j}) + Σ_{k=1}^{J} (h_{t−1}^k ∗ u_r^{k,j}) + b_r^j ),  (10)

z^j = ψ( Σ_{i=1}^{I} (x ∗ w_z^{i,j}) + Σ_{k=1}^{J} (h_{t−1}^k ∗ u_z^{k,j}) + b_z^j ),  (11)

h̃_t^j = φ( Σ_{i=1}^{I} (x ∗ w^{i,j}) + Σ_{k=1}^{J} ((r^j ⊙ h_{t−1}^k) ∗ u^{k,j}) + b^j ),  (12)

h_t^j = (1 − z^j) ⊙ h_{t−1}^j + z^j ⊙ h̃_t^j.  (13)

Here, r represents the reset gate, z the update gate, h̃_t the proposal state, and h_t the output at time t. We define φ(·) as the hyperbolic tangent, ψ(·) represents the logistic function, and ⊙ is the Hadamard product. The convolution is denoted as ∗ and u, w, b are the parameters to be learned. The indices i, j, k correspond to the input and output/state channel indices. We also apply a skip connection from the output of the residual block to the output of the C-GRU.

Position Network The architecture of the position network is shown in Figure 4a and contains two paths. In the left path, the position of the local transformation x_t^n is calculated using a convolution layer followed by the spatial softmax function [7]. Here, n is the level of the spatial resolution. The spatial softmax function is defined as

p^k(c_{ij}^k) = exp(c_{ij}^k) / ( Σ_{i′} Σ_{j′} exp(c_{i′j′}^k) ),  (14)

where i and j are the spatial indices of the k-th feature map c.
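A single time step of the C-GRU (10)–(13) can be sketched for one input and one state channel with `scipy.ndimage.convolve`; the 3 × 3 kernels, random weights, and tiny feature maps are illustrative stand-ins for the paper's multi-channel layers, and the bias terms are omitted:

```python
import numpy as np
from scipy.ndimage import convolve

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cgru_step(x, h_prev, w, u):
    """One C-GRU step, cf. Eqs. (10)-(13), for one input/state channel.

    x, h_prev : (H, W) feature maps
    w, u      : dicts of 3x3 kernels for the reset ('r'), update ('z'),
                and proposal ('h') gates
    """
    conv = lambda a, k: convolve(a, k, mode='constant')
    r = sigmoid(conv(x, w['r']) + conv(h_prev, u['r']))            # reset gate, Eq. (10)
    z = sigmoid(conv(x, w['z']) + conv(h_prev, u['z']))            # update gate, Eq. (11)
    h_tilde = np.tanh(conv(x, w['h']) + conv(r * h_prev, u['h']))  # r gates h *before* the conv, Eq. (12)
    return (1 - z) * h_prev + z * h_tilde                          # Eq. (13)

rng = np.random.default_rng(0)
kernels = lambda: {g: rng.normal(scale=0.1, size=(3, 3)) for g in ('r', 'z', 'h')}
w, u = kernels(), kernels()
h = np.zeros((8, 8))
for x in rng.normal(size=(5, 8, 8)):      # unroll 5 time steps
    h = cgru_step(x, h, w, u)
```

Note the detail stressed in the text: the reset gate multiplies the previous state before the convolution, rather than being factored out of it as in [1].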
The position is then calculated by

x_t^n = ( Σ_i Σ_j p(c_{ij}) X_{ij}^n , Σ_i Σ_j p(c_{ij}) Y_{ij}^n ),  (15)

where (X_{ij}^n, Y_{ij}^n) ∈ X are the coordinates of the image pixel grid. As shown in Figure 3, an estimate of the current transformation position is computed on all three spatial levels. The final position is calculated as a weighted sum

x̃_t = ( Σ_{n=1}^{3} x_t^n w_t^n ) / ( Σ_{n=1}^{3} w_t^n ).  (16)

The weights w_t^n ∈ R are calculated on the right side of the position block. For this, a second convolution layer and a second spatial softmax layer are applied to the input of the block. We calculate the similarity of the left spatial softmax p_l(c_{ij}) and the right spatial softmax p_r(c_{ij}) as the weight of the position at each spatial location

w_t^n = 2 − Σ_i Σ_j | p_l(c_{ij}) − p_r(c_{ij}) |.  (17)

This weighting factor can be interpreted as a certainty measure of the estimate of the current position at each spatial resolution.

Parameter Network The parameter network is located at the end of the network. Its detailed structure is shown in Figure 4b. The input of the parameter block is first passed through a convolution layer. After the convolution layer, the first half of the output feature maps is passed through a second convolution layer. The second half is applied to a spatial softmax layer. For each element in both outputs, a point-wise multiplication is applied, followed by an average pooling layer down to a spatial resolution of 1 × 1. We use a fully connected layer with one hidden layer in order to reduce the output to the number of needed parameters.
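The position path (14)–(17) amounts to a spatial softmax followed by a soft-argmax and an agreement-based certainty weight; a minimal numpy sketch, where the synthetic feature map stands in for the GR2U output:

```python
import numpy as np

def spatial_softmax(c):
    """Eq. (14): softmax over all spatial locations of one feature map."""
    e = np.exp(c - c.max())                  # max-shift for numerical stability
    return e / e.sum()

def soft_argmax(p):
    """Eq. (15): probability-weighted average of the pixel coordinates."""
    ys, xs = np.mgrid[0:p.shape[0], 0:p.shape[1]]
    return np.array([(p * xs).sum(), (p * ys).sum()])

# A feature map with a single strong activation at (row 5, col 3):
c = np.zeros((9, 9)); c[5, 3] = 10.0
p_left = spatial_softmax(c)
pos = soft_argmax(p_left)                    # close to (x=3, y=5)

# Eq. (17): certainty weight from the agreement of the two softmax paths;
# identical maps give the maximal weight of 2.
p_right = spatial_softmax(c)
w = 2 - np.abs(p_left - p_right).sum()
```

When the left and right softmax maps agree, the L1 distance vanishes and the level's weight is maximal, so confident levels dominate the weighted sum in (16).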
The final output parameters are then defined as

σ_t^x = ψ(c_t^1) σ_max,  σ_t^y = ψ(c_t^2) σ_max,  v_t^x = φ(c_t^3),  v_t^y = φ(c_t^4),  α_t = ψ(c_t^5) π,  (18)

where φ(·) is the hyperbolic tangent, ψ(·) the logistic function, and σ_max the maximum extension of the shape.

Figure 5: Maximum inspiration (top row) and maximum expiration (bottom row) for different slice positions of one patient from back to front.

4 Experiments and Results

Image Data We trained our network on images of 2D+t magnetic resonance (MR) image series of the lung. Due to the low proton density of the lung parenchyma in comparison to other body tissues, as well as strong magnetic susceptibility effects, it is very challenging to acquire MR images with a sufficient signal-to-noise ratio. Recently, a novel MR pulse sequence called ultra-fast steady-state free precession (ufSSFP) was proposed [2]. ufSSFP allows detecting physiological signal changes in lung parenchyma caused by respiratory and cardiac cycles, without the need for intravenous contrast agents or hyperpolarized gas tracers. Multi-slice 2D+t ufSSFP acquisitions are performed in free breathing. For complete chest volume coverage, the lung is scanned at different slice positions as shown in Figure 5. At each slice position, a dynamic 2D+t image series with 140 images is acquired. For the further analysis of the image data, all images of one slice position need to be spatially aligned. We choose the image closest to the mean respiratory cycle as the fixed image of the series. The other images of the series are then registered to this image. Our data set consists of 48 lung acquisitions of 42 different patients. Each lung scan contains between 7 and 14 slices.
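The squashing in (18) maps the five raw network outputs c_t^1, …, c_t^5 onto their valid ranges; a minimal sketch (the raw values are arbitrary):

```python
import numpy as np

def squash_parameters(c, sigma_max=0.3):
    """Eq. (18): map raw outputs to sigma_x, sigma_y in (0, sigma_max),
    v_x, v_y in (-1, 1), and alpha in (0, pi)."""
    psi = lambda a: 1.0 / (1.0 + np.exp(-a))   # logistic function psi
    c1, c2, c3, c4, c5 = c
    return {'sx': psi(c1) * sigma_max,         # width in x
            'sy': psi(c2) * sigma_max,         # width in y
            'vx': np.tanh(c3),                 # weight in x
            'vy': np.tanh(c4),                 # weight in y
            'alpha': psi(c5) * np.pi}          # rotation angle

p = squash_parameters([0.3, -1.2, 0.8, -0.5, 2.0])
```

The logistic function keeps the widths strictly positive and bounded by σ_max, while the hyperbolic tangent allows the weights to push pixels in either direction.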
We used the data of 34 patients for the training set, 4 for the evaluation set, and 4 for the test set.

Network Training The network was trained in an unsupervised fashion for ∼180,000 iterations with a fixed sequence length of t = 25. Figure 6 shows an overview of the training procedure. We used the Adam optimizer [16] with the AMSGrad option [21] and a learning rate of 0.0001. The maximum shape size is set to σ_max = 0.3 and the regularization weight to λ_R2N2 = 0.1. For the regularization of the network parameters, we use a combination of [25], particularly the use of Gaussian multiplicative noise, and dropconnect [28]. We apply multiplicative Gaussian noise N(1, √(0.5/0.5)) to the parameters of the proposal and the output of the C-GRU. As image loss function S the mean squared error (MSE) loss is used, and as transformation regularizer R the isotropic total variation (TV). The training of the network was performed on an NVIDIA Tesla V100 GPU.

Figure 6: Unsupervised training setup (W_t is the transformed moving image).

Table 1: Mean target registration error (TRE) for the proposed method R2N2 and a standard B-spline registration (BS) for the test data set in millimeters.
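The training objective combines the MSE image loss with an isotropic TV regularizer on the displacement field; a minimal sketch of this loss, assuming a common forward-difference formulation of the isotropic TV (the exact discretization used in the paper is not specified):

```python
import numpy as np

def mse_loss(fixed, warped):
    """Mean squared error between the fixed and the warped moving image."""
    return np.mean((fixed - warped) ** 2)

def isotropic_tv(f):
    """Isotropic total variation of a displacement field f of shape (H, W, 2):
    sum over pixels of sqrt(|df/dx|^2 + |df/dy|^2), via forward differences."""
    dx = np.diff(f, axis=1)[:-1, :, :]         # (H-1, W-1, 2)
    dy = np.diff(f, axis=0)[:, :-1, :]         # (H-1, W-1, 2)
    return np.sqrt((dx ** 2 + dy ** 2).sum(axis=-1)).sum()

def registration_loss(fixed, warped, f, lam=0.1):
    """Training loss: S + lambda * R with S = MSE and R = isotropic TV."""
    return mse_loss(fixed, warped) + lam * isotropic_tv(f)

# A constant displacement field has zero TV, so only the image term remains.
F = np.ones((8, 8)); W = np.zeros((8, 8))
f = np.full((8, 8, 2), 0.5)
loss = registration_loss(F, W, f, lam=0.1)
```

Since the TV term penalizes spatial variation of the displacement rather than its magnitude, smooth global deformations are cheap while sharp local kinks are penalized.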
The number in parentheses is the maximum TRE over all images for this slice.

Patient        Slice 1      Slice 2      Slice 3      Slice 4      Slice 5      Slice 6      Slice 7      Slice 8      mean
1     R2N2     1.26 (1.85)  1.08 (2.14)  1.13 (1.82)  1.23 (2.58)  1.47 (2.74)  1.12 (1.51)  0.92 (1.33)  1.04 (1.87)  1.16
1     BS       1.28 (1.81)  1.16 (2.00)  1.40 (2.52)  1.15 (2.67)  0.96 (1.71)  0.99 (1.41)  0.84 (1.14)  1.02 (1.65)  1.10
2     R2N2     0.84 (1.99)  0.92 (2.49)  0.79 (1.04)  0.81 (1.20)  0.74 (1.43)  –            –            –            0.82
2     BS       1.50 (5.07)  0.69 (1.73)  0.73 (1.05)  0.77 (1.13)  0.86 (1.76)  –            –            –            0.91
3     R2N2     1.65 (3.88)  1.06 (2.55)  0.86 (2.08)  0.83 (1.48)  0.80 (1.39)  0.73 (1.08)  –            –            0.99
3     BS       1.15 (2.73)  0.81 (1.42)  0.75 (1.64)  0.79 (1.14)  0.72 (0.94)  0.83 (1.95)  –            –            0.84
4     R2N2     1.30 (3.03)  0.77 (0.98)  0.79 (2.07)  1.09 (1.92)  0.84 (1.12)  –            –            –            0.96
4     BS       1.09 (3.15)  0.78 (1.01)  0.73 (1.73)  1.09 (2.50)  0.79 (1.13)  –            –            –            0.90

Experiments We compare our method against a standard B-spline registration (BS) implemented in the AirLab framework [23]. The B-spline registration uses three spatial resolutions (64, 128, 256) with kernel sizes of (7, 21, 57) pixels. As image loss the MSE is used and as regularizer the isotropic TV, with the regularization weight λ_BS = 0.01. We use the Adam optimizer [16] with the AMSGrad option [21] and a learning rate of 0.001, and we perform 250 iterations per resolution level. From the test set, we select 21 images of each slice position, which corresponds to one breathing cycle. We then select corresponding landmarks in all 21 images in order to compute the registration accuracy. The target registration error (TRE) of the registration is defined as the mean root square error of the landmark distances after the registration. The results in Table 1 show that our presented method performs on par with the standard B-spline registration in terms of accuracy. Since the slice positions are manually selected for each patient, we are not able to provide the same number of slices for each patient.
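The TRE as described (mean root square error of the landmark distances) can be sketched as follows; the landmark coordinates and pixel spacing are made-up values for illustration:

```python
import numpy as np

def target_registration_error(landmarks_fixed, landmarks_warped, spacing=1.0):
    """Mean Euclidean distance between corresponding landmarks after registration.

    landmarks_fixed, landmarks_warped : (N, 2) arrays of point coordinates (pixels)
    spacing : pixel size in mm (assumed isotropic here)
    """
    d = (landmarks_fixed - landmarks_warped) * spacing
    per_landmark = np.sqrt((d ** 2).sum(axis=1))   # root square error per landmark
    return per_landmark.mean()                     # mean over all landmarks

lm_f = np.array([[10.0, 12.0], [30.0, 5.0], [22.0, 40.0]])
lm_w = lm_f + np.array([1.0, 0.0])                 # residual 1-pixel shift in x
tre = target_registration_error(lm_f, lm_w, spacing=1.5)
```

A residual shift of one pixel at 1.5 mm spacing yields a TRE of 1.5 mm for every landmark, matching the scale of the per-slice values reported in Table 1.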
Although the image data is different at each slice position, we observe a good generalization ability of our network, which performs an accurate registration independently of the slice position at which the images are acquired. Our method achieves a compact representation of the final transformation, using only ∼7.6% of the number of parameters of the final B-spline transformation. Here, the number of parameters of the network is not taken into account, only the number of parameters needed to describe the final transformation. For the evaluation of the computation time for the registration of one image pair, we ran both methods on an NVIDIA GeForce GTX 1080. The computation of the B-spline registration takes ∼4.5 s compared to ∼0.3 s for our method.

Figure 7: Top row: Registration result of the proposed recurrent registration neural network for one image pair: (a) fixed image, (b) moving image, (c) warped image, (d) final displacement. Bottom row: Sequence of local transformations after different time steps: (e) t = 2, (f) t = 4, (g) t = 8, (h) t = 25.

An example registration result of our presented method is shown in Figure 7. It can be seen that the first local transformations the network creates are placed below the diaphragm (white dashed line) (Figure 7a), where the magnitude of the motion between the images is maximal. Also, the shape and rotation of the local transformations are computed such that the deformation is applied only to the liver and the lung and not to the ribs. During the next time steps, we can observe that the shape of the local transformations is reduced to align finer details of the images (Figure 7g-h).

5 Conclusion

In this paper, we presented the Recurrent Registration Neural Network for the task of deformable image registration.
We define the registration process of two images as a recursive sequence of local deformations. The sum of all local deformations yields the final spatial alignment of both images. Our network can be trained end-to-end in an unsupervised fashion. The results show that our method is able to register two images with an accuracy similar to a standard B-spline registration method. We achieve a speedup of ∼15 in computation time compared to the B-spline registration. In addition, we need only ∼7.6% of the number of parameters to describe the final transformation compared to the final transformation of the standard B-spline registration. In this paper, we have shown that our method is able to register two images in a recursive manner using a fixed number of steps. For future work, we will include uncertainty measures for the registration result as a possible stopping criterion. This could then be used to automatically determine the number of steps needed for the registration. Furthermore, we will extend our method to the registration of 3D volumes.

Acknowledgements

We would like to thank Oliver Bieri, Orso Pusterla (Division of Radiological Physics, Department of Radiology, University Hospital Basel, Switzerland), and Philipp Latzin (Pediatric Respiratory Medicine, Department of Pediatrics, Inselspital, Bern University Hospital, University of Bern, Switzerland) for their support during the development of this work. Furthermore, we thank the Swiss National Science Foundation for funding this project (SNF 320030_149576).

References

[1] Simon Andermatt, Simon Pezold, and Philippe Cattin. Automated segmentation of multiple sclerosis lesions using multi-dimensional gated recurrent units. In International Workshop on Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries. Springer, 2017.

[2] Grzegorz Bauman, Orso Pusterla, and Oliver Bieri.
Ultra-fast steady-state free precession\npulse sequence for Fourier decomposition pulmonary MRI. Magnetic Resonance in Medicine,\n75(4):1647\u20131653, 2016.\n\n[3] Kyunghyun Cho, Bart van Merrienboer, \u00c7aglar G\u00fcl\u00e7ehre, Fethi Bougares, Holger Schwenk,\nand Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical\nmachine translation. CoRR, abs/1406.1078, 2014.\n\n[4] Adrian V. Dalca, Guha Balakrishnan, John Guttag, and Mert R. Sabuncu. Unsupervised learning\nfor fast probabilistic diffeomorphic registration. In Alejandro F. Frangi, Julia A. Schnabel,\nChristos Davatzikos, Carlos Alberola-L\u00f3pez, and Gabor Fichtinger, editors, Medical Image\nComputing and Computer Assisted Intervention \u2013 MICCAI 2018, pages 729\u2013738, Cham, 2018.\nSpringer International Publishing.\n\n[5] Bob D. de Vos, Floris F. Berendsen, Max A. Viergever, Marius Staring, and Ivana I\u0161gum.\nEnd-to-end unsupervised deformable image registration with a convolutional neural network.\n\n9\n\n\fIn M. Jorge Cardoso, Tal Arbel, Gustavo Carneiro, Tanveer Syeda-Mahmood, Jo\u00e3o Manuel R.S.\nTavares, Mehdi Moradi, Andrew Bradley, Hayit Greenspan, Jo\u00e3o Paulo Papa, Anant Madabhushi,\nJacinto C. Nascimento, Jaime S. Cardoso, Vasileios Belagiannis, and Zhi Lu, editors, Deep\nLearning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support,\npages 204\u2013212, Cham, 2017. Springer International Publishing.\n\n[6] Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov,\nPatrick Van Der Smagt, Daniel Cremers, and Thomas Brox. Flownet: Learning optical \ufb02ow with\nconvolutional networks. In Proceedings of the IEEE International Conference on Computer\nVision, pages 2758\u20132766, 2015.\n\n[7] Chelsea Finn, Xin Yu Tan, Yan Duan, Trevor Darrell, Sergey Levine, and Pieter Abbeel. 
Deep\nIn 2016 IEEE International Conference on\n\nspatial autoencoders for visuomotor learning.\nRobotics and Automation (ICRA), pages 512\u2013519. IEEE, 2016.\n\n[8] Grant Haskins, Uwe Kruger, and Pingkun Yan. Deep learning in medical image registration: A\n\nsurvey. arXiv preprint arXiv:1903.02026, 2019.\n\n[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image\nrecognition. In Proceedings of the IEEE conference on computer vision and pattern recognition,\npages 770\u2013778, 2016.\n\n[10] Sepp Hochreiter and J\u00fcrgen Schmidhuber. Long short-term memory. Neural computation,\n\n9:1735\u201380, 12 1997.\n\n[11] Berthold KP Horn and Brian G Schunck. Determining optical \ufb02ow. Arti\ufb01cial intelligence,\n\n17(1-3):185\u2013203, 1981.\n\n[12] Yipeng Hu, Marc Modat, Eli Gibson, Wenqi Li, Nooshin Ghavami, Ester Bonmati, Guotai Wang,\nSteven Bandula, Caroline M. Moore, Mark Emberton, S\u00e9bastien Ourselin, J. Alison Noble,\nDean C. Barratt, and Tom Vercauteren. Weakly-supervised convolutional neural networks for\nmultimodal image registration. Medical Image Analysis, 49:1 \u2013 13, 2018.\n\n[13] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. In\n\nAdvances in neural information processing systems, pages 2017\u20132025, 2015.\n\n[14] Christoph Jud, Nadia M\u00f6ri, Benedikt Bitterli, and Philippe C. Cattin. Bilateral regularization in\nreproducing kernel hilbert spaces for discontinuity preserving image registration. In Li Wang,\nEhsan Adeli, Qian Wang, Yinghuan Shi, and Heung-Il Suk, editors, Machine Learning in\nMedical Imaging, pages 10\u201317, Cham, 2016. Springer International Publishing.\n\n[15] Christoph Jud, Nadia M\u00f6ri, and Philippe C. Cattin. Sparse kernel machines for discontinuous\nregistration and nonstationary regularization. 
In 2016 IEEE Conference on Computer Vision\nand Pattern Recognition Workshops (CVPRW), pages 449\u2013456, June 2016.\n\n[16] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint\n\narXiv:1412.6980, 2014.\n\n[17] Julian Krebs, Tommaso Mansi, Herv\u00e9 Delingette, Li Zhang, Florin C Ghesu, Shun Miao,\nAndreas K Maier, Nicholas Ayache, Rui Liao, and Ali Kamen. Robust non-rigid registration\nthrough agent-based action learning. In International Conference on Medical Image Computing\nand Computer-Assisted Intervention, pages 344\u2013352. Springer, 2017.\n\n[18] Rui Liao, Shun Miao, Pierre de Tournemire, Sasa Grbic, Ali Kamen, Tommaso Mansi, and\nIn Thirty-First AAAI\n\nDorin Comaniciu. An arti\ufb01cial agent for robust image registration.\nConference on Arti\ufb01cial Intelligence, 2017.\n\n[19] Rosanne Liu, Joel Lehman, Piero Molino, Felipe Petroski Such, Eric Frank, Alex Sergeev,\nand Jason Yosinski. An intriguing failing of convolutional neural networks and the coordconv\nsolution. In Advances in Neural Information Processing Systems, pages 9605\u20139616, 2018.\n\n[20] Shun Miao, Sebastien Piat, Peter Fischer, Ahmet Tuysuzoglu, Philip Mewes, Tommaso Mansi,\nand Rui Liao. Dilated fcn for multi-agent 2d/3d medical image registration. In Thirty-Second\nAAAI Conference on Arti\ufb01cial Intelligence, 2018.\n\n10\n\n\f[21] Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond.\n\narXiv preprint arXiv:1904.09237, 2019.\n\n[22] D. Rueckert, L. I. Sonoda, C. Hayes, D. L. G. Hill, M. O. Leach, and D. J. Hawkes. Nonrigid\nregistration using free-form deformations: application to breast mr images. IEEE Transactions\non Medical Imaging, 18(8):712\u2013721, Aug 1999.\n\n[23] Robin Sandk\u00fchler, Christoph Jud, Simon Andermatt, and Philippe C. Cattin. Airlab: Autograd\n\nimage registration laboratory. 
arXiv preprint arXiv:1806.09907, 2018.\n\n[24] Aristeidis Sotiras, Christos Davatzikos, and Nikos Paragios. Deformable medical image\n\nregistration: A survey. IEEE transactions on medical imaging, 32(7):1153\u20131190, 2013.\n\n[25] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov.\nDropout: a simple way to prevent neural networks from over\ufb01tting. The Journal of Machine\nLearning Research, 15(1):1929\u20131958, 2014.\n\n[26] Christodoulidis Stergios, Sahasrabudhe Mihir, Vakalopoulou Maria, Chassagnon Guillaume,\nRevel Marie-Pierre, Mougiakakou Stavroula, and Paragios Nikos. Linear and deformable image\nregistration with 3d convolutional neural networks. In Image Analysis for Moving Organ, Breast,\nand Thoracic Images, pages 13\u201322. Springer, 2018.\n\n[27] V. Vishnevskiy, T. Gass, G. Szekely, C. Tanner, and O. Goksel.\n\nIsotropic total variation\nregularization of displacements in parametric image registration. IEEE Transactions on Medical\nImaging, 36(2):385\u2013395, Feb 2017.\n\n[28] Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. Regularization of\nneural networks using dropconnect. In International conference on machine learning, pages\n1058\u20131066, 2013.\n\n11\n\n\f", "award": [], "sourceid": 4724, "authors": [{"given_name": "Robin", "family_name": "Sandk\u00fchler", "institution": "University of Basel"}, {"given_name": "Simon", "family_name": "Andermatt", "institution": "Center for medical Image Analysis and Navigation"}, {"given_name": "Grzegorz", "family_name": "Bauman", "institution": "University of Basel Hospital"}, {"given_name": "Sylvia", "family_name": "Nyilas", "institution": "Bern University Hospital"}, {"given_name": "Christoph", "family_name": "Jud", "institution": "University of Basel"}, {"given_name": "Philippe C.", "family_name": "Cattin", "institution": "University of Basel"}]}
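The accumulation scheme summarized above — a fixed number of recursive steps, each contributing one local deformation defined by its position, shape, and weight, with the sum giving the final displacement field — can be sketched as follows. This is a minimal NumPy illustration, not the paper's network: the Gaussian parameterization of a local deformation and all function names are assumptions made for the example.

```python
import numpy as np

def gaussian_deformation(grid_y, grid_x, pos, sigma, weight):
    """One local deformation: a Gaussian bump centered at `pos`, with
    isotropic shape `sigma` and a 2D displacement vector `weight`.
    (Illustrative parameterization, not the paper's exact model.)"""
    g = np.exp(-((grid_y - pos[0]) ** 2 + (grid_x - pos[1]) ** 2)
               / (2.0 * sigma ** 2))
    # Broadcast the scalar bump to a dense (H, W, 2) displacement field.
    return g[..., None] * np.asarray(weight, dtype=float)

def accumulate_deformations(shape, steps):
    """Sum the local deformations of all recursive steps into the
    final displacement field, as described in the conclusion."""
    grid_y, grid_x = np.mgrid[0:shape[0], 0:shape[1]].astype(float)
    field = np.zeros(shape + (2,))
    for pos, sigma, weight in steps:
        field += gaussian_deformation(grid_y, grid_x, pos, sigma, weight)
    return field
```

In the actual method each step's `(pos, sigma, weight)` would be predicted by the recurrent network from the current misalignment; here they are simply supplied, e.g. `accumulate_deformations((64, 64), [((32, 32), 5.0, (1.0, -0.5))])`.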