{"title": "Learning Spherical Convolution for Fast Features from 360\u00b0 Imagery", "book": "Advances in Neural Information Processing Systems", "page_first": 529, "page_last": 539, "abstract": "While 360\u00b0 cameras offer tremendous new possibilities in vision, graphics, and augmented reality, the spherical images they produce make core feature extraction non-trivial. Convolutional neural networks (CNNs) trained on images from perspective cameras yield \u201cflat\" filters, yet 360\u00b0 images cannot be projected to a single plane without significant distortion. A naive solution that repeatedly projects the viewing sphere to all tangent planes is accurate, but much too computationally intensive for real problems. We propose to learn a spherical convolutional network that translates a planar CNN to process 360\u00b0 imagery directly in its equirectangular projection. Our approach learns to reproduce the flat filter outputs on 360\u00b0 data, sensitive to the varying distortion effects across the viewing sphere. The key benefits are 1) efficient feature extraction for 360\u00b0 images and video, and 2) the ability to leverage powerful pre-trained networks researchers have carefully honed (together with massive labeled image training sets) for perspective images. We validate our approach compared to several alternative methods in terms of both raw CNN output accuracy as well as applying a state-of-the-art \u201cflat\" object detector to 360\u00b0 data. Our method yields the most accurate results while saving orders of magnitude in computation versus the existing exact reprojection solution.", "full_text": "Learning Spherical Convolution\n\nfor Fast Features from 360\u00b0 Imagery\n\nYu-Chuan Su\n\nKristen Grauman\n\nThe University of Texas at Austin\n\nAbstract\n\nWhile 360\u00b0 cameras offer tremendous new possibilities in vision, graphics, and\naugmented reality, the spherical images they produce make core feature extrac-\ntion non-trivial. 
Convolutional neural networks (CNNs) trained on images from\nperspective cameras yield \u201c\ufb02at\" \ufb01lters, yet 360\u00b0 images cannot be projected to a\nsingle plane without signi\ufb01cant distortion. A naive solution that repeatedly projects\nthe viewing sphere to all tangent planes is accurate, but much too computationally\nintensive for real problems. We propose to learn a spherical convolutional network\nthat translates a planar CNN to process 360\u00b0 imagery directly in its equirectan-\ngular projection. Our approach learns to reproduce the \ufb02at \ufb01lter outputs on 360\u00b0\ndata, sensitive to the varying distortion effects across the viewing sphere. The key\nbene\ufb01ts are 1) ef\ufb01cient feature extraction for 360\u00b0 images and video, and 2) the\nability to leverage powerful pre-trained networks researchers have carefully honed\n(together with massive labeled image training sets) for perspective images. We\nvalidate our approach compared to several alternative methods in terms of both raw\nCNN output accuracy as well as applying a state-of-the-art \u201c\ufb02at\" object detector\nto 360\u00b0 data. Our method yields the most accurate results while saving orders of\nmagnitude in computation versus the existing exact reprojection solution.\n\nIntroduction\n\n1\nUnlike a traditional perspective camera, which samples a limited \ufb01eld of view of the 3D scene\nprojected onto a 2D plane, a 360\u00b0 camera captures the entire viewing sphere surrounding its optical\ncenter, providing a complete picture of the visual world\u2014an omnidirectional \ufb01eld of view. 
As such,\nviewing 360\u00b0 imagery provides a more immersive experience of the visual content compared to\ntraditional media.\n360\u00b0 cameras are gaining popularity as part of the rising trend of virtual reality (VR) and augmented\nreality (AR) technologies, and will also be increasingly in\ufb02uential for wearable cameras, autonomous\nmobile robots, and video-based security applications. Consumer level 360\u00b0 cameras are now common\non the market, and media sharing sites such as Facebook and YouTube have enabled support for\n360\u00b0 content. For consumers and artists, 360\u00b0 cameras free the photographer from making real-time\ncomposition decisions. For VR/AR, 360\u00b0 data is essential to content creation. As a result of this great\npotential, computer vision problems targeting 360\u00b0 content are capturing the attention of both the\nresearch community and application developer.\nImmediately, this raises the question: how to compute features from 360\u00b0 images and videos?\nArguably the most powerful tools in computer vision today are convolutional neural networks (CNN).\nCNNs are responsible for state-of-the-art results across a wide range of vision problems, including\nimage recognition [17, 42], object detection [12, 30], image and video segmentation [16, 21, 28], and\naction detection [10, 32]. Furthermore, signi\ufb01cant research effort over the last \ufb01ve years (and really\ndecades [27]) has led to well-honed CNN architectures that, when trained with massive labeled image\ndatasets [8], produce \u201cpre-trained\" networks broadly useful as feature extractors for new problems.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fFigure 1: Two existing strategies for applying CNNs to 360\u00b0 images. 
Top: The \ufb01rst strategy unwraps the\n360\u00b0 input into a single planar image using a global projection (most commonly equirectangular projection),\nthen applies the CNN on the distorted planar image. Bottom: The second strategy samples multiple tangent\nplanar projections to obtain multiple perspective images, to which the CNN is applied independently to obtain\nlocal results for the original 360\u00b0 image. Strategy I is fast but inaccurate; Strategy II is accurate but slow. The\nproposed approach learns to replicate \ufb02at \ufb01lters on spherical imagery, offering both speed and accuracy.\n\nIndeed such networks are widely adopted as off-the-shelf feature extractors for other algorithms and\napplications (c.f., VGG [33], ResNet [17], and AlexNet [25] for images; C3D [36] for video).\nHowever, thus far, powerful CNN features are awkward if not off limits in practice for 360\u00b0 imagery.\nThe problem is that the underlying projection models of current CNNs and 360\u00b0 data are different.\nBoth the existing CNN \ufb01lters and the expensive training data that produced them are \u201c\ufb02at\", i.e., the\nproduct of perspective projection to a plane. In contrast, a 360\u00b0 image is projected onto the unit\nsphere surrounding the camera\u2019s optical center.\nTo address this discrepancy, there are two common, though \ufb02awed, approaches. In the \ufb01rst, the\nspherical image is projected to a planar one,1 then the CNN is applied to the resulting 2D image [19,26]\n(see Fig. 1, top). However, any sphere-to-plane projection introduces distortion, making the resulting\nconvolutions inaccurate. In the second existing strategy, the 360\u00b0 image is repeatedly projected\nto tangent planes around the sphere, each of which is then fed to the CNN [34, 35, 38, 41] (Fig. 1,\nbottom). In the extreme of sampling every tangent plane, this solution is exact and therefore accurate.\nHowever, it suffers from very high computational cost. 
Not only does it incur the cost of rendering\neach planar view, but also it prevents amortization of convolutions: the intermediate representation\ncannot be shared across perspective images because they are projected to different planes.\nWe propose a learning-based solution that, unlike the existing strategies, sacri\ufb01ces neither accuracy\nnor ef\ufb01ciency. The main idea is to learn a CNN that processes a 360\u00b0 image in its equirectangular\nprojection (fast) but mimics the \u201c\ufb02at\" \ufb01lter responses that an existing network would produce on\nall tangent plane projections for the original spherical image (accurate). Because convolutions are\nindexed by spherical coordinates, we refer to our method as spherical convolution (SPHCONV). We\ndevelop a systematic procedure to adjust the network structure in order to account for distortions.\nFurthermore, we propose a kernel-wise pre-training procedure which signi\ufb01cantly accelerates the\ntraining process.\nIn addition to providing fast general feature extraction for 360\u00b0 imagery, our approach provides a\nbridge from 360\u00b0 content to existing heavily supervised datasets dedicated to perspective images.\nIn particular, training requires no new annotations\u2014only the target CNN model (e.g., VGG [33]\npre-trained on millions of labeled images) and an arbitrary collection of unlabeled 360\u00b0 images.\nWe evaluate SPHCONV on the Pano2Vid [35] and PASCAL VOC [9] datasets, both for raw convolu-\ntion accuracy as well as impact on an object detection task. We show that it produces more precise\noutputs than baseline methods requiring similar computational cost, and similarly precise outputs as\nthe exact solution while using orders of magnitude less computation. Furthermore, we demonstrate\nthat SPHCONV can successfully replicate the widely used Faster-RCNN [30] detector on 360\u00b0 data\nwhen training with only 1,000 unlabeled 360\u00b0 images containing unrelated objects. 
For a similar cost as the baselines, SPHCONV generates better object proposals and recognition rates.\n\n1 e.g., with equirectangular projection, where latitudes are mapped to horizontal lines of uniform spacing.\n\n2 Related Work\n360\u00b0 vision Vision for 360\u00b0 data is quickly gaining interest in recent years. The SUN360 project samples multiple perspective images to perform scene viewpoint recognition [38]. PanoContext [41] parses 360\u00b0 images using 3D bounding boxes, applying algorithms like line detection on perspective images then backprojecting results to the sphere. Motivated by the limitations of existing interfaces for viewing 360\u00b0 video, several methods study how to automate field-of-view (FOV) control for display [19, 26, 34, 35], adopting one of the two existing strategies for convolutions (Fig. 1). In these methods, a noted bottleneck is feature extraction cost, which is hampered by repeated sampling of perspective images/frames, e.g., to represent the space-time \u201cglimpses\u201d of [34, 35]. This is exactly where our work can have positive impact. Prior work studies the impact of panoramic or wide-angle images on hand-crafted features like SIFT [11, 14, 15]. While not applicable to CNNs, such work supports the need for features specific to 360\u00b0 imagery, and thus motivates SPHCONV.\nKnowledge distillation Our approach relates to knowledge distillation [3, 5, 13, 18, 29, 31, 37], though we explore it in an entirely novel setting. Distillation aims to learn a new model given existing model(s). 
Rather than optimize an objective function on annotated data, it learns the new model\nthat can reproduce the behavior of the existing model, by minimizing the difference between their\noutputs. Most prior work explores distillation for model compression [3, 5, 18, 31]. For example,\na deep network can be distilled into a shallower [3] or thinner [31] one, or an ensemble can be\ncompressed to a single model [18]. Rather than compress a model in the same domain, our goal is to\nlearn across domains, namely to link networks on images with different projection models. Limited\nwork considers distillation for transfer [13, 29]. In particular, unlabeled target-source paired data can\nhelp learn a CNN for a domain lacking labeled instances (e.g., RGB vs. depth images) [13], and\nmulti-task policies can be learned to simulate action value distributions of expert policies [29]. Our\nproblem can also be seen as a form of transfer, though for a novel task motivated strongly by image\nprocessing complexity as well as supervision costs. Different from any of the above, we show how to\nadapt the network structure to account for geometric transformations caused by different projections.\nAlso, whereas most prior work uses only the \ufb01nal output for supervision, we use the intermediate\nrepresentation of the target network as both input and target output to enable kernel-wise pre-training.\nSpherical image projection Projecting a spherical image into a planar image is a long studied\nproblem. There exists a large number of projection approaches (e.g., equirectangular, Mercator,\netc.) [4]. None is perfect; every projection must introduce some form of distortion. The properties of\ndifferent projections are analyzed in the context of displaying panoramic images [40]. 
In this work,\nwe unwrap the spherical images using equirectangular projection because 1) this is a very common\nformat used by camera vendors and researchers [1, 35, 38], and 2) it is equidistant along each row and\ncolumn so the convolution kernel does not depend on the azimuthal angle. Our method in principle\ncould be applied to other projections; their effect on the convolution operation remains to be studied.\nCNNs with geometric transformations There is an increasing interest in generalizing convolu-\ntion in CNNs to handle geometric transformations or deformations. Spatial transformer networks\n(STNs) [20] represent a geometric transformation as a sampling layer and predict the transformation\nparameters based on input data. STNs assume the transformation is invertible such that the subsequent\nconvolution can be performed on data without the transformation. This is not possible in spheri-\ncal images because it requires a projection that introduces no distortion. Active convolution [22]\nlearns the kernel shape together with the weights for a more general receptive \ufb01eld, and deformable\nconvolution [7] goes one step further by predicting the receptive \ufb01eld location. These methods are\ntoo restrictive for spherical convolution, because they require a \ufb01xed kernel size and weight. In\ncontrast, our method adapts the kernel size and weight based on the transformation to achieve better\naccuracy. Furthermore, our method exploits problem-speci\ufb01c geometric information for ef\ufb01cient\ntraining and testing. Some recent work studies convolution on a sphere [6, 24] using spectral analysis,\nbut those methods require manually annotated spherical images as training data, whereas our method\ncan exploit existing models trained on perspective images as supervision. 
Also, it is unclear whether CNNs in the spectral domain can reach the same accuracy and efficiency as CNNs on a regular grid.\n\n3 Approach\nWe describe how to learn spherical convolutions in equirectangular projection given a target network trained on perspective images. We define the objective in Sec. 3.1. Next, we introduce how to adapt the structure from the target network in Sec. 3.2. Finally, Sec. 3.3 presents our training process.\n\nFigure 2: Inverse perspective projections P\u22121 to equirectangular projections at different polar angles \u03b8. The same square image will distort to different sizes and shapes depending on \u03b8. Because equirectangular projection unwraps the 180\u00b0 longitude, a line will be split into two if it passes through the 180\u00b0 longitude, which causes the double curve in \u03b8 = 36\u00b0.\n\n3.1 Problem Definition\nLet Is be the input spherical image defined on spherical coordinates (\u03b8, \u03c6), and let Ie \u2208 I^(We\u00d7He\u00d73) be the corresponding flat RGB image in equirectangular projection. Ie is defined by pixels on the image coordinates (x, y) \u2208 De, where each (x, y) is linearly mapped to a unique (\u03b8, \u03c6). We define the perspective projection operator P which projects an \u03b1-degree field of view (FOV) from Is to W pixels on the tangent plane \u02c6n = (\u03b8, \u03c6). That is, P(Is, \u02c6n) = Ip \u2208 I^(W\u00d7W\u00d73). The projection operator is characterized by the pixel size \u2206\u03b8p = \u03b1/W in Ip, where Ip denotes the resulting perspective image. Note that we assume \u2206\u03b8 = \u2206\u03c6 following common digital imagery. Given a target network^2 Np trained on perspective images Ip with receptive field (Rf) R \u00d7 R, we define the output on spherical image Is at \u02c6n = (\u03b8, \u03c6) as\n\nNp(Is)[\u03b8, \u03c6] = Np(P(Is, (\u03b8, \u03c6))), (1)\n\nwhere w.l.o.g. we assume W = R for simplicity. Our goal is to learn a spherical convolution network Ne that takes an equirectangular map Ie as input and, for every image position (x, y), produces as output the results of applying the perspective projection network to the corresponding tangent plane for spherical image Is:\n\nNe(Ie)[x, y] \u2248 Np(Is)[\u03b8, \u03c6], \u2200(x, y) \u2208 De, where (\u03b8, \u03c6) = (180\u00b0 \u00d7 y / He, 360\u00b0 \u00d7 x / We). (2)\n\nThis can be seen as a domain adaptation problem where we want to transfer the model from the domain of Ip to that of Ie. However, unlike typical domain adaptation problems, the difference between Ip and Ie is characterized by a geometric projection transformation rather than a shift in data distribution. Note that the training data to learn Ne requires no manual annotations: it consists of arbitrary 360\u00b0 images coupled with the \u201ctrue\u201d Np outputs computed by exhaustive planar reprojections, i.e., evaluating the rhs of Eq. 1 for every (\u03b8, \u03c6). Furthermore, at test time, only a single equirectangular projection of the entire 360\u00b0 input will be computed using Ne to obtain the dense (inferred) Np outputs, which would otherwise require multiple projections and evaluations of Np.\n\n3.2 Network Structure\nThe main challenge for transferring Np to Ne is the distortion introduced by equirectangular projection. The distortion is location dependent\u2014a k \u00d7 k square in perspective projection will not be a square in the equirectangular projection, and its shape and size will depend on the polar angle \u03b8 (see Fig. 2). The convolution kernel should transform accordingly. 
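To make the pixel-to-sphere mapping of Eq. 2 and the distortion pattern concrete, here is a minimal sketch; the function names and the small-patch expansion approximation are ours for illustration, not part of the paper's method:

```python
import numpy as np

def erp_to_sphere(x, y, W_e, H_e):
    """Eq. 2: linear map from an equirectangular pixel (x, y) to spherical
    coordinates (theta, phi) in degrees."""
    return 180.0 * y / H_e, 360.0 * x / W_e

def content_expansion(fov_deg, theta_deg):
    """Approximate horizontal span, in longitude degrees, that a tangent-plane
    patch with the given FOV occupies after equirectangular unwarping at polar
    angle theta. At the equator (theta = 90) there is no expansion; toward the
    poles the same patch stretches across many more columns. A hypothetical
    small-patch approximation, not the paper's exact backprojection P^-1."""
    sin_theta = np.sin(np.radians(theta_deg))
    return min(360.0, fov_deg / max(sin_theta, 1e-6))
```

For example, a patch spans its own FOV in longitude at the equator but roughly twice that at theta = 30 degrees, consistent with the horizontal stretching visible in Fig. 2.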
Our approach 1) adjusts the shape of the convolution kernel to account for the distortion, in particular the content expansion, and 2) reduces the number of max-pooling layers to match the pixel sizes in Ne and Np, as we detail next.\nWe adapt the architecture of Ne from Np using the following heuristic. The goal is to ensure each kernel receives enough information from the input in order to compute the target output. First, we untie the weights of the convolution kernels at different \u03b8 by learning one kernel K^y_e for each output row y. Next, we adjust the shape of K^y_e such that it covers the Rf of the original kernel. We consider K^y_e \u2208 Ne to cover Kp \u2208 Np if more than 95% of the pixels in the Rf of Kp are also in the Rf of Ke in Ie. The Rf of Kp in Ie is obtained by backprojecting the R \u00d7 R grid to \u02c6n = (\u03b8, 0) using P\u22121, where the center of the grid aligns on \u02c6n. Ke should be large enough to cover Kp, but it should also be as small as possible to avoid overfitting. Therefore, we optimize the shape of K^(l,y)_e for layer l as follows. The shape of K^(l,y)_e is initialized as 3 \u00d7 3. We first adjust the height kh, increasing kh by 2 until the height of the Rf is larger than that of Kp in Ie. We then adjust the width kw in the same way. Furthermore, we restrict the kernel size kh \u00d7 kw to be smaller than an upper bound Uk. See Fig. 4. Because the Rf of K^l_e depends on K^(l\u22121)_e, we search for the kernel size starting from the bottom layer.\n\n2 e.g., Np could be AlexNet [25] or VGG [33] pre-trained for a large-scale recognition task.\n\nFigure 3: Spherical convolution. The kernel weight in spherical convolution is tied only along each row of the equirectangular image (i.e., \u03c6), and each kernel convolves along the row to generate 1D output. Note that the kernel size differs at different rows and layers, and it expands near the top and bottom of the image.\n\nIt is important to relax the kernel from being square to being rectangular, because equirectangular projection will expand content horizontally near the poles of the sphere (see Fig. 2). If we restrict the kernel to be square, the Rf of Ke can easily be taller but narrower than that of Kp, which leads to overfitting. It is also important to restrict the kernel size; otherwise the kernel can grow wide rapidly near the poles and eventually cover the entire row. Although cutting off the kernel size may lead to information loss, the loss is not significant in practice because pixels in equirectangular projection do not distribute on the unit sphere uniformly; they are denser near the pole, and the pixels are by nature redundant in the region where the kernel size expands dramatically.\nBesides adjusting the kernel sizes, we also adjust the number of pooling layers to match the pixel size \u2206\u03b8 in Ne and Np. We define \u2206\u03b8e = 180\u00b0/He and restrict We = 2He to ensure \u2206\u03b8e = \u2206\u03c6e. Because max-pooling introduces shift invariance up to kw pixels in the image, which corresponds to kw \u00d7 \u2206\u03b8 degrees on the unit sphere, the physical meaning of max-pooling depends on the pixel size. Since the pixel size is usually larger in Ie and max-pooling increases the pixel size by a factor of kw, we remove the pooling layer in Ne if \u2206\u03b8e \u2265 \u2206\u03b8p.\nFig. 3 illustrates how spherical convolution differs from an ordinary CNN. Note that we approximate one layer in Np by one layer in Ne, so the number of layers and output channels in each layer is exactly the same as in the target network. However, this does not have to be the case. For example, we could use two or more layers to approximate each layer in Np. 
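The greedy kernel-shape search described above (grow the height, then the width, capped by Uk) can be sketched as follows. The rf_per_tap_* arguments are a hypothetical simplification standing in for the paper's backprojection of the R x R grid with P^-1: they assume each extra kernel tap contributes a fixed amount of receptive field.

```python
def select_kernel_shape(target_rf_h, target_rf_w,
                        rf_per_tap_h, rf_per_tap_w, U_k=(7, 7)):
    """Simplified sketch of the Sec. 3.2 heuristic: start from a 3 x 3 kernel,
    grow the height by 2 until the kernel's receptive field covers the
    backprojected target Rf height, then grow the width the same way, never
    exceeding the upper bound U_k."""
    k_h, k_w = 3, 3
    while k_h * rf_per_tap_h < target_rf_h and k_h + 2 <= U_k[0]:
        k_h += 2
    while k_w * rf_per_tap_w < target_rf_w and k_w + 2 <= U_k[1]:
        k_w += 2
    return k_h, k_w
```

Near the poles the target width grows rapidly, so the width loop hits the Uk cap, mirroring the truncation the paper argues is harmless due to pixel redundancy there.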
Although doing so may improve accuracy, it would also introduce significant overhead, so we stick with the one-to-one mapping.\n\nFigure 4: Method to select the kernel height kh. We project the receptive field of the target kernel to equirectangular projection Ie and increase kh until it is taller than the target kernel in Ie. The kernel width kw is determined using the same procedure after kh is set. We restrict the kernel size kw \u00d7 kh by an upper bound Uk.\n\n3.3 Training Process\nGiven the goal in Eq. 2 and the architecture described in Sec. 3.2, we would like to learn the network Ne by minimizing the L2 loss E[(Ne(Ie) \u2212 Np(Is))^2]. However, the network converges slowly, possibly due to the large number of parameters. Instead, we propose a kernel-wise pre-training process that disassembles the network and initially learns each kernel independently.\nTo perform kernel-wise pre-training, we further require Ne to generate the same intermediate representation as Np in all layers l:\n\nN^l_e(Ie)[x, y] \u2248 N^l_p(Is)[\u03b8, \u03c6], \u2200l \u2208 Ne. (3)\n\nGiven Eq. 3, every layer l \u2208 Ne is independent of the others. In fact, every kernel is independent and can be learned separately. We learn each kernel by taking the \u201cground truth\u201d value of the previous layer N^(l\u22121)_p(Is) as input and minimizing the L2 loss E[(N^l_e(Ie) \u2212 N^l_p(Is))^2], except for the first layer. Note that N^l_p refers to the convolution output of layer l before applying any non-linear operation, e.g. ReLU, max-pooling, etc. It is important to learn the target value before applying ReLU because it provides more information. We combine the non-linear operation with K^(l+1)_e during kernel-wise pre-training, and we use dilated convolution [39] to increase the Rf size instead of performing max-pooling on the input feature map.\nFor the first convolution layer, we derive the analytic solution directly. The projection operator P is linear in the pixels in equirectangular projection: P(Is, \u02c6n)[x, y] = \u2211_ij cij Ie[i, j], for coefficients cij from, e.g., bilinear interpolation. Because convolution is a weighted sum of input pixels, Kp \u2217 Ip = \u2211_xy wxy Ip[x, y], we can combine the weight wxy and the interpolation coefficient cij into a single convolution operator:\n\nK^1_p \u2217 Is[\u03b8, \u03c6] = \u2211_xy wxy (\u2211_ij cij Ie[i, j]) = \u2211_ij (\u2211_xy wxy cij) Ie[i, j] = K^1_e \u2217 Ie. (4)\n\nThe output value of N^1_e will be exact and requires no learning. Of course, the same is not possible for l > 1 because of the non-linear operations between layers.\nAfter kernel-wise pre-training, we can further fine-tune the network jointly across layers and kernels by minimizing the L2 loss of the final output. Because the pre-trained kernels cannot fully recover the intermediate representation, fine-tuning can help to adjust the weights to account for residual errors. We ignore the constraint introduced in Eq. 3 when performing fine-tuning. Although Eq. 3 is necessary for kernel-wise pre-training, it restricts the expressive power of Ne and degrades the performance if we only care about the final output. Nevertheless, the weights learned by kernel-wise pre-training are a very good initialization in practice, and we typically only need to fine-tune the network for a few epochs.\nOne limitation of SPHCONV is that it cannot handle very close objects that span a large FOV. Because the goal of SPHCONV is to reproduce the behavior of models trained on perspective images, the capability and performance of the model is bounded by the target model Np. 
However, perspective\ncameras can only capture a small portion of a very close object in the FOV, and very close objects are\nusually not available in the training data of the target model Np. Therefore, even though 360\u00b0 images\noffer a much wider FOV, SPHCONV inherits the limitations of Np, and may not recognize very close\nlarge objects. Another limitation of SPHCONV is the resulting model size. Because it unties the\nkernel weights along \u03b8, the model size grows linearly with the equirectangular image height. The\nmodel size can easily grow to tens of gigabytes as the image resolution increases.\n4 Experiments\nTo evaluate our approach, we consider both the accuracy of its convolutions as well as its applicability\nfor object detections in 360\u00b0 data. We use the VGG architecture3 and the Faster-RCNN [30] model as\nour target network Np. We learn a network Ne to produce the topmost (conv5_3) convolution output.\nDatasets We use two datasets: Pano2Vid for training, and Pano2Vid and PASCAL for testing.\nPano2Vid: We sample frames from the 360\u00b0 videos in the Pano2Vid dataset [35] for both training\nand testing. The dataset consists of 86 videos crawled from YouTube using four keywords: \u201cHiking,\u201d\n\u201cMountain Climbing,\u201d \u201cParade,\u201d and \u201cSoccer\u201d. We sample frames at 0.05fps to obtain 1,056 frames\nfor training and 168 frames for testing. We use \u201cMountain Climbing\u201d for testing and others for\ntraining, so the training and testing frames are from disjoint videos. See Supp. for sampling process.\nBecause the supervision is on a per pixel basis, this corresponds to N \u00d7 We \u00d7 He \u2248 250M (non\ni.i.d.) samples. 
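As a quick sanity check on that sample count (assuming N counts all 1,056 + 168 sampled frames, at the 640 x 320 equirectangular resolution given in Sec. 4):

```python
# Supervision is per pixel: every equirectangular position in every sampled
# frame is a regression target, so the sample count is N * W_e * H_e.
W_e, H_e = 640, 320            # equirectangular resolution (Sec. 4)
n_frames = 1056 + 168          # frames sampled at 0.05 fps from Pano2Vid
samples = n_frames * W_e * H_e
print(samples)                 # 250675200, i.e., the ~250M quoted above
```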
Note that most object categories targeted by the Faster-RCNN detector do not appear in Pano2Vid, meaning that our experiments test the content-independence of our approach.\nPASCAL VOC: Because the target model was originally trained and evaluated on PASCAL VOC 2007, we \u201c360-ify\u201d it to evaluate the object detector application. We test with the 4,952 PASCAL images, which contain 12,032 bounding boxes. We transform them to equirectangular images as if they originated from a 360\u00b0 camera. In particular, each object bounding box is backprojected to 3 different scales {0.5R, 1.0R, 1.5R} and 5 different polar angles \u03b8 \u2208 {36\u00b0, 72\u00b0, 108\u00b0, 144\u00b0, 180\u00b0} on the 360\u00b0 image sphere using the inverse perspective projection, where R is the resolution of the target network\u2019s Rf. Regions outside the bounding box are zero-padded. See Supp. for details. Backprojection allows us to evaluate the performance at different levels of distortion in the equirectangular projection.\n\n3 https://github.com/rbgirshick/py-faster-rcnn\n\nMetrics We generate the output widely used in the literature (conv5_3) and evaluate it with the following metrics.\nNetwork output error measures the difference between Ne(Ie) and Np(Is). In particular, we report the root-mean-square error (RMSE) over all pixels and channels. For PASCAL, we measure the error over the Rf of the detector network.\nDetector network performance measures the performance of the detector network in Faster-RCNN using multi-class classification accuracy. We replace the ROI-pooling in Faster-RCNN by pooling over the bounding box in Ie. 
Note that the bounding box is backprojected to equirectangular projection and is no longer a square region.\nProposal network performance evaluates the proposal network in Faster-RCNN using average Intersection-over-Union (IoU). For each bounding box centered at \u02c6n, we project the conv5_3 output to the tangent plane \u02c6n using P and apply the proposal network at the center of the bounding box on the tangent plane. Given the predicted proposals, we compute the IoUs between foreground proposals and the bounding box and take the maximum. The IoU is set to 0 if there is no foreground proposal. Finally, we average the IoU over bounding boxes.\nWe stress that our goal is not to build a new object detector; rather, we aim to reproduce the behavior of existing 2D models on 360\u00b0 data with lower computational cost. Thus, the metrics capture how accurately and how quickly we can replicate the exact solution.\nBaselines We compare our method with the following baselines.\n\u2022 EXACT \u2014 Compute the true target value Np(Is)[\u03b8, \u03c6] for every pixel. This serves as an upper bound in performance and does not consider the computational cost.\n\u2022 DIRECT \u2014 Apply Np on Ie directly. We replace max-pooling with dilated convolution to produce a full-resolution output. This is Strategy I in Fig. 1 and is used in 360\u00b0 video analysis [19, 26].\n\u2022 INTERP \u2014 Compute Np(Is)[\u03b8, \u03c6] every S pixels and interpolate the values for the others. We set S such that the computational cost is roughly the same as our SPHCONV. This is a more efficient variant of Strategy II in Fig. 1.\n\u2022 PERSPECT \u2014 Project Is onto a cube map [2] and then apply Np on each face of the cube, which is a perspective image with 90\u00b0 FOV. The result is backprojected to Ie to obtain the feature on Ie. We use W = 960 for the cube map resolution so \u2206\u03b8 is roughly the same as in Ip. This is a second variant of Strategy II in Fig. 1, used in PanoContext [41].\nSPHCONV variants We evaluate three variants of our approach:\n\u2022 OPTSPHCONV \u2014 To compute the output for each layer l, OPTSPHCONV computes the exact output for layer l\u22121 using Np(Is), then applies spherical convolution for layer l. OPTSPHCONV serves as an upper bound for our approach, where it avoids accumulating any error across layers.\n\u2022 SPHCONV-PRE \u2014 Uses the weights from kernel-wise pre-training directly without fine-tuning.\n\u2022 SPHCONV \u2014 The full spherical convolution with joint fine-tuning of all layers.\nImplementation details We set the resolution of Ie to 640\u00d7320. For the projection operator P, we map \u03b1 = 65.5\u00b0 to W = 640 pixels following SUN360 [38]. The pixel size is therefore \u2206\u03b8e = 360\u00b0/640 for Ie and \u2206\u03b8p = 65.5\u00b0/640 for Ip. Accordingly, we remove the first three max-pooling layers so Ne has only one max-pooling layer following conv4_3. The kernel size upper bound is Uk = 7 \u00d7 7, following the max kernel size in VGG. We insert batch normalization for conv4_1 to conv5_3. See Supp. for details.\n4.1 Network output accuracy and computational cost\nFig. 5a shows the output error of layers conv3_3 and conv5_3 on the Pano2Vid [35] dataset (see Supp. for similar results on other layers). The error is normalized by that of the mean predictor. We evaluate the error at 5 polar angles \u03b8 uniformly sampled from the northern hemisphere, since error is roughly symmetric with the equator.\n\nFigure 5: (a) Network output error on Pano2Vid; lower is better. Note the error of EXACT is 0 by definition. Our method\u2019s convolutions are much closer to the exact solution than the baselines\u2019. (b) Computational cost vs. accuracy on PASCAL. 
Our approach yields accuracy closest to the exact solution while requiring orders of magnitude less computation time (left plot). Our cost is similar to the other approximations tested (right plot). Plot titles indicate the y-labels, and error is measured by root-mean-square error (RMSE).
Figure 6: Three AlexNet conv1 kernels (left squares) and their corresponding four SPHCONV-PRE kernels at θ ∈ {9°, 18°, 36°, 72°} (left to right).
First we discuss the three variants of our method. OPTSPHCONV performs the best for all layers and θ, validating our main idea of spherical convolution. It performs particularly well in the lower layers, because the Rf is larger in higher layers and the distortion becomes more significant. Overall, SPHCONV-PRE performs second best, but, as is to be expected, its gap with OPTSPHCONV grows in higher layers because of error propagation. SPHCONV outperforms SPHCONV-PRE in conv5_3 at the cost of larger error in lower layers (as seen here for conv3_3). It also has larger error at θ=18°, for two possible reasons. First, the learning curve indicates that the network learns more slowly near the pole, possibly because the Rf is larger and the pixels degenerate. Second, we optimize the joint L2 loss, which may trade error near the pole for error at the center.
Comparing against the baselines, ours achieves the lowest errors. DIRECT performs the worst among all methods, underscoring that convolutions on the flattened sphere—though fast—are inadequate. INTERP performs better than DIRECT, and its error decreases in higher layers. This is because the Rf is larger in the higher layers, so an S-pixel shift in Ie causes relatively smaller changes in the Rf, and therefore in the network output. PERSPECTIVE performs similarly across layers and outperforms INTERP in lower layers.
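As a concrete reference, the normalized error used in these comparisons can be sketched as follows: RMSE against the exact output, divided by the RMSE of a mean predictor. This is a minimal NumPy sketch, not our evaluation code; the function and variable names are illustrative.

```python
import numpy as np

def normalized_rmse(pred, target):
    """RMSE of `pred` against `target`, normalized by the RMSE of a
    mean predictor (one that always outputs the mean of `target`).
    A score of 1.0 is no better than predicting the mean; 0.0 is an
    exact reproduction of the target features."""
    rmse = np.sqrt(np.mean((pred - target) ** 2))
    baseline = np.sqrt(np.mean((target - target.mean()) ** 2))
    return rmse / baseline

# A perfect predictor scores 0; the mean predictor itself scores 1.
target = np.array([1.0, 2.0, 3.0, 4.0])
print(normalized_rmse(target, target))                     # 0.0
print(normalized_rmse(np.full(4, target.mean()), target))  # 1.0
```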
The error of PERSPECTIVE is particularly large at θ=54°, which lies close to the boundary of the perspective image, where perspective distortion is larger.
Fig. 5b shows the accuracy vs. cost tradeoff. We measure computational cost by the number of Multiply-Accumulate (MAC) operations. The leftmost plot shows cost on a log scale. Here we see that EXACT—whose outputs we wish to replicate—is about 400 times slower than SPHCONV, and SPHCONV approaches EXACT's detector accuracy much better than all baselines. The second plot shows that SPHCONV is about 34% faster than INTERP (while performing better on all metrics). PERSPECTIVE is the fastest of all methods and is 60% faster than SPHCONV, followed by DIRECT, which is 23% faster than SPHCONV. However, both baselines are noticeably inferior to SPHCONV in accuracy.
To visualize what our approach has learned, we learn the first layer of the AlexNet [25] model provided by the Caffe package [23] and examine the resulting kernels. Fig. 6 shows the original kernel Kp and the corresponding kernels Ke at different polar angles θ. Ke is usually a re-scaled version of Kp, but the weights are often amplified because multiple pixels in Kp fall into the same pixel in Ke, as in the second example. We also observe situations where the high-frequency signal in the kernel is reduced, as in the third example, possibly because the kernel is smaller. Note that we learn the first convolution layer for visualization purposes only, since l = 1 (only) has an analytic solution (cf. Sec. 3.3). See Supp. for the complete set of kernels.
4.2 Object detection and proposal accuracy
Having established that our approach provides accurate and efficient Ne convolutions, we now examine how important that accuracy is for object detection on 360° inputs. Fig. 7a shows the result of the Faster-RCNN detector network on PASCAL in 360° format.
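For reference, the proposal IoU metric used in this evaluation (the best IoU among the foreground proposals generated for each box, zero when there are none, averaged over boxes) can be sketched in plain Python. The box format and function names below are our illustrative choices, not part of the Faster-RCNN implementation.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def proposal_iou(gt_boxes, proposals_per_box):
    """Average, over ground-truth boxes, of the best IoU among the
    foreground proposals for that box (0 if no foreground proposal)."""
    scores = []
    for gt, proposals in zip(gt_boxes, proposals_per_box):
        scores.append(max((iou(p, gt) for p in proposals), default=0.0))
    return sum(scores) / len(scores)

# An exact proposal scores 1; no foreground proposal scores 0.
print(proposal_iou([(0, 0, 2, 2)], [[(0, 0, 2, 2), (1, 1, 3, 3)]]))  # 1.0
print(proposal_iou([(0, 0, 2, 2)], [[]]))                            # 0.0
```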
Figure 7: Faster-RCNN object detection accuracy on a 360° version of PASCAL across polar angles θ, for both the (a) detector network and (b) proposal network. R refers to the Rf of Np. Best viewed in color.
Figure 8: Object detection examples on 360° PASCAL test images. Images show the top 40% of the equirectangular projection; black regions are undefined pixels. Text gives predicted label, multi-class probability, and IoU, respectively. Our method successfully detects objects undergoing severe distortion, some of which are barely recognizable even for a human viewer.
OPTSPHCONV performs almost as well as EXACT. The performance degrades with SPHCONV-PRE because of error accumulation, but it still significantly outperforms DIRECT and is better than INTERP and PERSPECTIVE in most regions. Although joint training (SPHCONV) improves the output error near the equator, the error is larger near the pole, which degrades the detector performance. Note that the Rf of the detector network spans multiple rows, so the error is the weighted sum of the errors at different rows. This result, together with Fig. 5a, suggests that SPHCONV reduces the conv5_3 error in parts of the Rf but increases it in other parts. The detector network needs accurate conv5_3 features throughout the Rf in order to generate good predictions.
DIRECT again performs the worst. In particular, its performance drops significantly at θ=18°, showing that it is sensitive to the distortion. In contrast, INTERP performs better near the pole because the samples are denser on the unit sphere. In fact, INTERP should converge to EXACT at the pole.
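For reference, the polar angles θ reported throughout index rows of the equirectangular input Ie. Under the 640×320 resolution stated in the implementation details, the standard mapping from spherical angles to pixel coordinates can be sketched as follows (a sketch under those stated dimensions; the function name is ours):

```python
def equirect_pixel(theta_deg, phi_deg, width=640, height=320):
    """Map spherical angles to equirectangular pixel coordinates:
    azimuth phi in [0, 360) spans the columns and polar angle theta
    in [0, 180] (measured from the pole) spans the rows, so each
    pixel covers 360/640 = 0.5625 degrees. Illustrative sketch only."""
    col = int(phi_deg / 360.0 * width) % width
    row = min(int(theta_deg / 180.0 * height), height - 1)
    return row, col

# theta = 90 degrees is the equator, i.e. the middle row.
print(equirect_pixel(90, 0))    # (160, 0)
print(equirect_pixel(18, 180))  # (32, 320)
```

Rows near θ=0° cover a vanishing solid angle on the sphere, which is why samples become denser toward the pole.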
PERSPECTIVE outperforms INTERP near the equator but is worse in other regions. Note that θ∈{18°, 36°} falls on the top face of the cube, and θ=54° is near the border of a face. The result suggests that PERSPECTIVE is still sensitive to the polar angle; it performs best when the object is near the center of a face, where the perspective distortion is small.
Fig. 7b shows the performance of the object proposal network for two scales (see Supp. for more). Interestingly, the result differs from the detector network. OPTSPHCONV still performs almost the same as EXACT, and SPHCONV-PRE performs better than the baselines. However, DIRECT now outperforms the other baselines, suggesting that the proposal network is not as sensitive as the detector network to the distortion introduced by equirectangular projection. The performance of the methods is similar when the object is larger (right plot), even though the output error differs significantly. The only exception is PERSPECTIVE, which performs poorly for θ∈{54°, 72°, 90°} regardless of the object scale. This again suggests that objectness is sensitive to which perspective image is sampled.
Fig. 8 shows examples of objects successfully detected by our approach in spite of severe distortions. See Supp. for more examples.
5 Conclusion
We propose to learn spherical convolutions for 360° images. Our solution entails a new form of distillation across camera projection models. Compared to current practices for feature extraction on 360° images/video, spherical convolution benefits efficiency by avoiding multiple perspective projections, and it benefits accuracy by adapting the kernels to the distortions of equirectangular projection.
Results on two datasets demonstrate how it successfully transfers state-of-the-art vision models from the realm of limited-FOV 2D imagery into the realm of omnidirectional data.
Future work will explore SPHCONV in the context of other dense prediction problems like segmentation, as well as the impact of different projection models within our basic framework.
References
[1] https://facebook360.fb.com/editing-360-photos-injecting-metadata/.
[2] https://code.facebook.com/posts/1638767863078802/under-the-hood-building-360-video/.
[3] J. Ba and R. Caruana. Do deep nets really need to be deep? In NIPS, 2014.
[4] A. Barre, A. Flocon, and R. Hansen. Curvilinear perspective, 1987.
[5] C. Buciluǎ, R. Caruana, and A. Niculescu-Mizil. Model compression. In ACM SIGKDD, 2006.
[6] T. Cohen, M. Geiger, J. Köhler, and M. Welling. Convolutional networks for spherical signals. arXiv preprint arXiv:1709.04893, 2017.
[7] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. In ICCV, 2017.
[8] J. Deng, W. Dong, R. Socher, L. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[9] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136, Jan. 2015.
[10] C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In CVPR, 2016.
[11] A. Furnari, G. M. Farinella, A. R. Bruna, and S. Battiato.
Affine covariant features for fisheye distortion local modeling. IEEE Transactions on Image Processing, 26(2):696–710, 2017.
[12] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[13] S. Gupta, J. Hoffman, and J. Malik. Cross modal distillation for supervision transfer. In CVPR, 2016.
[14] P. Hansen, P. Corke, W. Boles, and K. Daniilidis. Scale-invariant features on the sphere. In ICCV, 2007.
[15] P. Hansen, P. Corke, W. Boles, and K. Daniilidis. Scale invariant feature matching with wide angle images. In IROS, 2007.
[16] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In ICCV, 2017.
[17] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[18] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[19] H.-N. Hu, Y.-C. Lin, M.-Y. Liu, H.-T. Cheng, Y.-J. Chang, and M. Sun. Deep 360 pilot: Learning a deep agent for piloting through 360° sports video. In CVPR, 2017.
[20] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In NIPS, 2015.
[21] S. Jain, B. Xiong, and K. Grauman. FusionSeg: Learning to combine motion and appearance for fully automatic segmentation of generic objects in video. In CVPR, 2017.
[22] Y. Jeon and J. Kim. Active convolution: Learning the shape of convolution for image classification. In CVPR, 2017.
[23] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM MM, 2014.
[24] R. Khasanova and P. Frossard. Graph-based classification of omnidirectional images. arXiv preprint arXiv:1707.08301, 2017.
[25] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[26] W.-S. Lai, Y. Huang, N. Joshi, C. Buehler, M.-H. Yang, and S. B. Kang. Semantic-driven generation of hyperlapse from 360° video. IEEE Transactions on Visualization and Computer Graphics, PP(99):1–1, 2017.
[27] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In Proc. of the IEEE, 1998.
[28] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[29] E. Parisotto, J. Ba, and R. Salakhutdinov. Actor-mimic: Deep multitask and transfer reinforcement learning. In ICLR, 2016.
[30] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[31] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. FitNets: Hints for thin deep nets. In ICLR, 2015.
[32] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
[33] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[34] Y.-C. Su and K. Grauman. Making 360° video watchable in 2D: Learning videography for click free viewing. In CVPR, 2017.
[35] Y.-C. Su, D. Jayaraman, and K. Grauman. Pano2Vid: Automatic cinematography for watching 360° videos. In ACCV, 2016.
[36] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
[37] Y.-X. Wang and M. Hebert. Learning to learn: Model regression networks for easy small sample learning. In ECCV, 2016.
[38] J. Xiao, K. A. Ehinger, A. Oliva, and A. Torralba. Recognizing scene viewpoint using panoramic place representation. In CVPR, 2012.
[39] F. Yu and V. Koltun.
Multi-scale context aggregation by dilated convolutions. In ICLR, 2016.
[40] L. Zelnik-Manor, G. Peters, and P. Perona. Squaring the circle in panoramas. In ICCV, 2005.
[41] Y. Zhang, S. Song, P. Tan, and J. Xiao. PanoContext: A whole-room 3D context model for panoramic scene understanding. In ECCV, 2014.
[42] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using Places database. In NIPS, 2014.