{"title": "NeurVPS: Neural Vanishing Point Scanning via Conic Convolution", "book": "Advances in Neural Information Processing Systems", "page_first": 866, "page_last": 875, "abstract": "We present a simple yet effective end-to-end trainable deep network with geometry-inspired convolutional operators for detecting vanishing points in images. Traditional convolutional neural networks rely on aggregating edge features and do not have mechanisms to directly exploit the geometric properties of vanishing points as the intersections of parallel lines. In this work, we identify a canonical conic space in which the neural network can effectively compute the global geometric information of vanishing points locally, and we propose a novel operator named conic convolution that can be implemented as regular convolutions in this space. This new operator explicitly enforces feature extractions and aggregations along the structural lines and yet has the same number of parameters as the regular 2D convolution. Our extensive experiments on both synthetic and real-world datasets show that the proposed operator significantly improves the performance of vanishing point detection over traditional methods. The code and dataset have been made publicly available at https://github.com/zhou13/neurvps.", "full_text": "NeurVPS: Neural Vanishing Point Scanning\n\nvia Conic Convolution\n\nYichao Zhou\u2217\nUC Berkeley\n\nHaozhi Qi\nUC Berkeley\n\nJingwei Huang\n\nStandford University\n\nYi Ma\u2020\n\nUC Berkeley\n\nzyc@berkeley.edu\n\nhqi@berkeley.edu\n\njingweih@stanford.edu\n\nyima@eecs.berkeley.edu\n\nAbstract\n\nWe present a simple yet effective end-to-end trainable deep network with geometry-\ninspired convolutional operators for detecting vanishing points in images. Tradi-\ntional convolutional neural networks rely on aggregating edge features and do not\nhave mechanisms to directly exploit the geometric properties of vanishing points\nas the intersections of parallel lines. 
In this work, we identify a canonical conic space in which the neural network can effectively compute the global geometric information of vanishing points locally, and we propose a novel operator named conic convolution that can be implemented as regular convolutions in this space. This new operator explicitly enforces feature extractions and aggregations along the structural lines and yet has the same number of parameters as the regular 2D convolution. Our extensive experiments on both synthetic and real-world datasets show that the proposed operator significantly improves the performance of vanishing point detection over traditional methods. The code and dataset have been made publicly available at https://github.com/zhou13/neurvps.

1 Introduction

Vanishing point detection is a classic and important problem in 3D vision. Given the camera calibration, vanishing points give us the directions of 3D lines, and thus let us infer 3D information of the scene from a single 2D image. A robust and accurate vanishing point detection algorithm enables and enhances applications such as camera calibration [10], 3D reconstruction [18], photo forensics [35], object detection [19], wireframe parsing [48, 49], and autonomous driving [28].

Although there has been a lot of work on this seemingly basic vision problem, no solution seems to be quite satisfactory yet. Traditional methods (see [46, 27, 41] and references therein) usually first use edge/line detectors to extract straight lines and then cluster them into multiple groups. Many recent methods have proposed to improve the detection by training deep neural networks with labeled data. However, such neural networks often offer only a coarse estimate of the position of vanishing points [26] or horizon lines [45], and the output is usually a component of a multi-stage system, used as an initialization to remove outliers from line clustering. 
Arguably, the main reason for neural networks' poor precision in vanishing point detection (compared to line-clustering-based methods) is that existing neural network architectures are not designed to represent or learn the special geometric properties of vanishing points and their relations to structural lines.

*We thank Yikai Li from SJTU and Jiajun Wu from MIT for their useful suggestions.
†This work is partially supported by funding from the Berkeley EECS Startup fund, the Berkeley FHL Vive Center for Enhanced Reality, and research grants from Sony Research and Bytedance Research Lab (Silicon Valley).

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

To address this issue, we propose a new convolutional neural network, called Neural Vanishing Point Scanner (NeurVPS), that explicitly encodes and hence exploits the global geometric information of vanishing points, and that can be trained in an end-to-end manner to both robustly and accurately predict vanishing points. Our method samples a sufficient number of point candidates, and the network then determines which of them are valid. A common criterion for a valid vanishing point is whether it lies on the intersection of a sufficient number of structural lines. Therefore, the role of our network is to measure the intensity of the signals of the structural lines passing through a candidate point. Although this notion is simple and clear, learning such a geometric concept is challenging for neural networks, since the relationship between a candidate point and the structural lines depends not only on the global line orientations but also on their pixel locations. In this work, we identify a canonical conic space in which this relationship depends only on local line orientations. 
For each pixel, we define this space as a local coordinate system in which the x-axis is chosen to be the direction from the pixel to the candidate point, so the associated structural lines in this space are always horizontal. We propose a conic convolution operator, which applies regular convolution for each pixel in this conic space. This is similar to applying regular convolutions on a rectified image in which the related structural lines are transformed into horizontal lines, so the network can determine how to use the signals based on local orientations. In addition, feature aggregation in this rectified image also becomes geometrically meaningful, since horizontal aggregation in the rectified image is identical to feature aggregation along the structural lines.

Based on the canonical space and the conic convolution operator, we design a convolutional neural network that accurately predicts vanishing points. We conduct extensive experiments and show improvements on both synthetic and real-world datasets. With ablation studies, we verify the importance of the proposed conic convolution operator.

2 Related Work

Vanishing Point Detection. Vanishing point detection is a fundamental and yet surprisingly challenging problem in computer vision. Since it was initially proposed by [3], researchers have been trying to tackle this problem from different perspectives. Early approaches estimate vanishing points using sphere geometry [3, 30, 40], hierarchical Hough transformation [36], or EM algorithms [46, 27]. Works such as [43, 33, 4, 1] use the Manhattan world assumption [12] to improve the accuracy and reliability of the detection. [2] extends the mutual orthogonality assumption to a set of mutually orthogonal vanishing points (the Atlanta world [37]).

The dominant approaches are line-based vanishing point detection algorithms, which are often divided into several stages. 
Firstly, a set of lines is detected [8, 42]. Then a line clustering algorithm [32] is used to propose several guesses of the target vanishing point position based on geometric cues. The clustering methods include RANSAC [5], J-linkage [41], the Hough transform [20], and EM [46, 27]. [50] uses contour detection and J-linkage in natural scenes, but only one dominant vanishing point can be detected. Our method does not rely on existing line detectors, and it can automatically learn the line features in the conic space to predict any number of vanishing points in an image.

Recently, with the help of convolutional neural networks, the vision community has tried to tackle the problem with data-driven, supervised learning approaches. [9, 6, 47] formulate vanishing point detection as a patch classification problem; they can only detect vanishing points within the image frame, a limitation our method does not have. [45] detects vanishing points by first estimating horizontal vanishing line candidates and scoring them by the vanishing points they pass through, using an ImageNet pre-trained neural network fine-tuned on Google street images. [26] uses the inverse gnomonic image and regresses the sphere image representation of vanishing points. Both works rely on traditional line detection algorithms, while our method learns line features implicitly in the conic space.

Structured Convolution Operators. Recently, more and more operators have been proposed to model spatial and geometric properties in images. For instance, the wavelet-based scattering networks (ScatNet) [7, 39] were introduced to ensure certain transformation (say, translational) invariances of the network. [22] first explores geometric deformation with modern neural networks. [14, 23] re-parameterize the global deformable transformation into local convolution operators to improve performance on image classification, object detection, and semantic segmentation. 
More recently, structured and free-form filters have been composed [38]. While these methods allow the network to learn the space on which the convolution operates, we explicitly define the space from first principles and exploit its geometric information. Our method is similar to [22] in the sense that both rectify the input to a canonical space; the difference is that they learn a global rectification transformation while our transformation is local. Different from [14, 23], our convolutional kernel shape is not learned but designed according to the desired geometric property.

Figure 1: Illustration of the sampled locations of 3 × 3 conic convolutions. The bright yellow region is the output pixel and v stands for the vanishing point. The upper and lower figures illustrate the cases when the vanishing point is outside and inside the image, respectively.

Figure 2: Illustration of the overall network structure. The numbers in each convolutional block are the kernel size and the output dimension, respectively. The number in each fully connected block is the output dimension. The max pooling layers use kernel size 3 and stride 2. Batch normalization and ReLU activation are appended after each conv/fc layer, except the last one, which uses a sigmoid activation.

Guided design of convolution kernels in a canonical space is well practiced for irregular data. For spherical images, [11] designs operators for rotation-invariant features, while [24] operates in the space defined by longitude and latitude, which is more meaningful for climate data. In 3D vision, geodesic CNN [31] adopts mesh convolution in spherical coordinates, while TextureNet [21] operates in a canonical space defined by globally smoothed principal directions. 
Although we are dealing with regular images, we observe a strong correlation between vanishing points and the conic space, in which the conic operator is more effective than regular 2D convolution.

3 Methods

3.1 Overview

Figure 2 illustrates the overall structure of our NeurVPS network. Taking an image and a candidate vanishing point as input, our network predicts the probability of the candidate being near a ground-truth vanishing point. Our network has two parts: a backbone feature extraction network and a conic convolution sub-network. The backbone is a conventional CNN that extracts semantic features from images. We use a single-stack hourglass network [34] for its ability to possess a large receptive field while maintaining fine spatial details. The conic convolutional network (Section 3.4) takes feature maps from the backbone as input and determines the existence of vanishing points around candidate positions (as a classification problem). The conic convolution operators (Section 3.3) exploit the geometric priors of vanishing points, and thus allow our algorithm to achieve superior performance without resorting to line detectors. Our system is end-to-end trainable.

Due to the classification nature of our model, we need to sample a sufficient number of candidate points during inference. It is computationally infeasible to directly sample sufficiently dense candidates. Therefore, we use a coarse-to-fine approach (Section 3.5). We first sample Nd points on the unit sphere and calculate their likelihoods of being the line direction (Section 3.2) of a vanishing point using the trained neural network classifier. We then pick the top K candidates and sample another Nd points around each of them. This step is repeated until we reach the desired resolution.

3.2 Basic Geometry and Representations of Vanishing Points

The position of a vanishing point encodes the 3D line direction. 
For a 3D ray o + λd, in which o is its origin and d is its unit direction vector, its 2D projection on the image is

    z · [px, py, 1]^T = K · (o + λd),  where K = [[f, 0, cx], [0, f, cy], [0, 0, 1]],    (1)

where px and py are the coordinates in the image space, z is the depth in the camera space, K is the calibration matrix, f is the focal length, and [cx, cy]^T ∈ R² is the optical center of the camera. The vanishing point is the point with λ → ∞, whose image coordinate is v = [vx, vy]^T := lim_{λ→∞} [px, py]^T ∈ R². We can then derive the 3D direction of a line in terms of its vanishing point:

    d = [vx − cx, vy − cy, f]^T ∈ R³.    (2)

In the literature, the normalized line direction vector d is also called the Gaussian sphere representation [3] of the vanishing point v. The use of d instead of v avoids the degenerate cases in which d is parallel to the image plane. It also gives a natural metric that defines the distance between two vanishing points: the angle between their normalized line direction vectors, arccos|d_i^T d_j| for two unit line directions d_i, d_j ∈ S².

Figure 3: Illustration of vanishing points' Gaussian sphere representation of an image from the SU3 wireframe dataset [49] and our multi-resolution sampling procedure in the coarse-to-fine inference. In the right three figures, the red triangles represent the ground-truth vanishing points and the dots represent the sampled locations. 
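To make Eq. (2) and the angle metric concrete, here is a minimal numpy sketch (illustrative only; the function names are ours, not from the released code) that maps an image-plane vanishing point to its Gaussian sphere representation and back, and computes the distance between two vanishing points:

```python
import numpy as np

def vp_to_direction(v, f, c):
    """Gaussian sphere representation (Eq. 2): unit 3D line direction
    proportional to [vx - cx, vy - cy, f] for an image-plane vanishing point v."""
    d = np.array([v[0] - c[0], v[1] - c[1], f], dtype=float)
    return d / np.linalg.norm(d)

def direction_to_vp(d, f, c):
    """Inverse map; degenerates when d is parallel to the image plane (d_z = 0)."""
    if abs(d[2]) < 1e-12:
        raise ValueError("vanishing point at infinity: d is parallel to the image plane")
    return np.array([c[0] + f * d[0] / d[2], c[1] + f * d[1] / d[2]])

def vp_distance(di, dj):
    """Angle between two (sign-ambiguous) unit line directions: arccos|<di, dj>|."""
    return np.arccos(np.clip(abs(np.dot(di, dj)), 0.0, 1.0))
```

Note that the absolute value inside `vp_distance` makes the metric invariant to the sign ambiguity of d, matching the arccos|d_i^T d_j| definition above.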
Finally, sampling vanishing points with the Gaussian sphere representation is easy, as it is equivalent to sampling on a unit sphere, while it remains ambiguous how to sample vanishing points directly in the image plane.

3.3 Conic Convolution Operators in Conic Space

In order for the network to effectively learn vanishing-point-related line features, we want to apply convolutions in the space where the related lines can be determined locally. We define the conic space for each pixel in the image domain as a rotated regular local coordinate system whose x-axis is the direction from the pixel to the vanishing point. In this space, related lines can be identified locally by checking whether their orientation is horizontal. Accordingly, we propose a new convolution operator, named conic convolution, which applies the regular convolution in this conic space. This operator effectively encodes global geometric cues for classifying whether a candidate point (Section 3.6) is a valid vanishing point. Figure 1 illustrates how this operator works.

A 3 × 3 conic convolution takes the input feature map x and the coordinate of the convolution center v (the candidate vanishing point position) and outputs a feature map y with the same resolution. The output feature map y can be computed as

    y(p) = Σ_{δx=−1}^{1} Σ_{δy=−1}^{1} w(δx, δy) · x(p + δx · t + δy · R_{π/2} t),  where t := (v − p) / ‖v − p‖₂ ∈ R².    (3)

Here p ∈ R² is the coordinate of the output pixel, w is a 3 × 3 trainable convolution filter, R_{π/2} ∈ R^{2×2} is the rotation matrix that rotates a 2D vector by 90° counterclockwise, and t is the normalized direction vector that points from the output pixel p to the convolution center v. 
We use bilinear interpolation to access the values of x at non-integer coordinates.

Intuitively, conic convolution makes edge detection easier and more accurate. An ordinary convolution may need hundreds of filters to recognize edges with different orientations, while conic convolution requires far fewer filters to recognize edges aligned with the candidate vanishing point, because the filters are first rotated toward the vanishing point. The strong/weak responses (depending on whether the candidate is positive/negative) are then aggregated by the subsequent fully connected layers.

3.4 Conic Convolutional Network

The conic convolutional network is a classifier that takes the image feature map x and a candidate vanishing point position v̂ as input. For each angle threshold γ ∈ Γ, the network predicts whether there exists a real vanishing point v in the image such that the angle between the 3D line directions of v and v̂ is less than the threshold γ. The choice of Γ will be discussed in Section 3.5.

Figure 2 shows the structure diagram of the proposed conic convolutional network. We first reduce the dimension of the feature map from the backbone with a 1 × 1 convolution layer to save GPU memory. Then 4 consecutive conic convolution (with ReLU activation) and max-pooling layers are applied to capture the geometric information at different spatial resolutions. The channel dimension is increased by a factor of two in each layer to compensate for the reduced spatial resolution. After that, we flatten the feature map and use two fully connected layers to aggregate the features. 
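As a concrete illustration of Eq. (3), the following is a minimal single-channel numpy sketch of a 3 × 3 conic convolution (our illustrative re-implementation, assuming zero padding outside the image; the paper's actual version is implemented in PyTorch by modifying the "im2col + GEMM" function, see Section 4.2):

```python
import numpy as np

def bilinear(x, q):
    """Bilinear interpolation of a single-channel map x at q = (qx, qy);
    samples falling outside the image contribute zero."""
    h, w = x.shape
    qx, qy = q
    x0, y0 = int(np.floor(qx)), int(np.floor(qy))
    ax, ay = qx - x0, qy - y0
    val = 0.0
    for i, wx in ((x0, 1 - ax), (x0 + 1, ax)):
        for j, wy in ((y0, 1 - ay), (y0 + 1, ay)):
            if 0 <= i < w and 0 <= j < h:
                val += wx * wy * x[j, i]
    return val

def conic_conv3x3(x, weight, v):
    """3x3 conic convolution (Eq. 3): at every output pixel p, the sampling
    grid is rotated so that its x-axis points from p toward the candidate v."""
    h, w = x.shape
    y = np.zeros_like(x)
    R90 = np.array([[0.0, -1.0], [1.0, 0.0]])  # rotates a 2D vector by 90 degrees
    for py in range(h):
        for px in range(w):
            p = np.array([px, py], dtype=float)
            t = v - p
            n = np.linalg.norm(t)
            # fall back to an axis-aligned grid if p coincides with v
            t = t / n if n > 1e-9 else np.array([1.0, 0.0])
            s = R90 @ t
            acc = 0.0
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    q = p + dx * t + dy * s       # rotated sampling location
                    acc += weight[dy + 1, dx + 1] * bilinear(x, q)
            y[py, px] = acc
    return y
```

With a filter that keeps only the center tap, the operator reduces to the identity regardless of v, since the sampling grid is only rotated about p; this is a quick sanity check on the indexing.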
Finally, a sigmoid classifier with binary cross-entropy loss is applied on top of the features to discriminate positive and negative samples with respect to the different thresholds from Γ.

3.5 Coarse-to-fine Inference

With the backbone and the conic convolutional network, we can compute the probability of a vanishing point over the hemisphere of unit line direction vectors d̂ ∈ S², as shown in Figure 3. We utilize a multi-resolution strategy to quickly pinpoint the locations of the vanishing points. We use R rounds to search for the vanishing points. In the r-th round, we uniformly sample Nd line direction vectors on the surface of the unit spherical cap with direction n_r and polar angle γ_r using the Fibonacci lattice [17]. Mathematically, the n-th sampled line direction vector can be written as

    d^r_n = cos φ^r_n · n_r + sin φ^r_n (cos θ^r_n · n⊥_{r,1} + sin θ^r_n · n⊥_{r,2}),
    θ^r_n := (1 + √5) π n,
    φ^r_n := arccos(1 + (cos γ_r − 1) · n / Nd),

in which n⊥_{r,1} and n⊥_{r,2} are two arbitrary orthogonal unit vectors that are perpendicular to n_r, as shown in Figure 4.

Figure 4: Illustration of the variables used in uniform spherical cap sampling.

We initialize n_1 ← (0, 0, 1) and γ_1 ← π. For round r + 1, we set the threshold γ_{r+1} ← ρ · max_{w ∈ S²} min_n arccos|⟨w, d^r_n⟩|, and set n_{r+1} to the d^r_n whose vanishing point obtains the best score from the conic convolutional network classifier with angle threshold γ = γ_{r+1}. Here, ρ is a hyperparameter controlling the distance between two nearby spherical caps. Therefore, we set the threshold set Γ in Section 3.4 to be {γ_{r+1} | r ∈ {1, 2, . . .
, R}} accordingly.

The above process detects a single dominant vanishing point in a given image. To search for more than one vanishing point, one can modify the first round to find the best K line directions and use the same process for each line direction in the remaining rounds.

3.6 Vanishing Point Sampling for Training

During training, we need to generate positive and negative samples. For each ground-truth vanishing point with line direction d and threshold γ, we sample N⁺ positive vanishing points and N⁻ negative vanishing points. The positive vanishing points are uniformly sampled from S⁺ = {w ∈ S² : arccos|⟨w, d⟩| < γ} and the negative vanishing points are uniformly sampled from S⁻ = {w ∈ S² : γ < arccos|⟨w, d⟩| < 2γ}. In addition, we sample N* random vanishing points for each image to reduce the sampling bias. The line directions of those vanishing points are uniformly sampled from the unit hemisphere.

4 Experiments

4.1 Datasets and Metrics

We conduct experiments on both synthetic [49] and real-world [50, 13] datasets.

Natural Scene [50]. This dataset contains images of natural scenes from AVA and Flickr. The authors pick the images that contain only one dominant vanishing point and label its location. There are 2,275 images in the dataset. We randomly divide them into 2,000 training images and 275 test images. Because this dataset does not contain camera calibration information, we set the focal length to half of the sensor width for vanishing point sampling and evaluation. Such a focal length simulates the wide-angle lenses used in landscape photography.

ScanNet [13]. ScanNet is a 3D indoor environment dataset with reconstructed meshes and RGB images captured by mobile devices. 
For each scene, we find the three orthogonal principal directions that align with most of the surface normals and use them to compute the vanishing points for each RGB image. We split the dataset as suggested by the ScanNet v2 tasks and train the network to predict the three vanishing points given an RGB image. There are 266,844 training images. We randomly sample 500 images from the validation set as our test set.

SU3 Wireframe [49]. The "ground-truth" vanishing point positions in real-world datasets are often inaccurate. To systematically evaluate the performance of our algorithm, we also test our method on the recent synthetic SceneCity Urban 3D (SU3) wireframe dataset [49]. This photo-realistic dataset is created with a procedural building generator, in which the vanishing points are directly computed from the CAD models of the buildings. It contains 22,500 training images and 500 validation images.

Evaluation Metrics. Previous methods usually use horizon detection accuracy [2, 29, 45] or pixel consistency [50] to evaluate their methods. These metrics are indirect for our task. To better characterize the performance of our algorithm, we propose a new metric, called angle accuracy (AA). For each predicted vanishing point, we calculate the angle between it and the ground truth. We then count the percentage of predictions whose angle difference is within a pre-defined threshold. By varying the threshold, we can plot the angle accuracy curve. AA_θ is defined as the area under this curve between [0, θ], divided by θ. In our experiments, the upper bound θ is set to 0.2°, 0.5°, and 1.0° on the synthetic dataset and 1°, 2°, and 10° on the real-world datasets. Two angle accuracy curves (coarse and fine level) are plotted for each dataset. 
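Our reading of the AA definition can be sketched as follows (an illustrative numpy implementation, not the paper's evaluation code): the cumulative accuracy curve acc(t), the fraction of predictions with angle error at most t, is integrated over [0, θ] and normalized by θ.

```python
import numpy as np

def angle_accuracy(angle_errors_deg, theta):
    """AA_theta: area under the cumulative accuracy curve on [0, theta],
    divided by theta. angle_errors_deg are per-prediction angle differences
    to the ground truth, in degrees."""
    errs = np.sort(np.asarray(angle_errors_deg, dtype=float))
    n = len(errs)
    area, prev = 0.0, 0.0
    for i, e in enumerate(errs):
        if e > theta:
            break
        area += (e - prev) * i / n   # the curve is a step function of recall
        prev = e
    # last segment up to theta, at the final accuracy level
    area += (theta - prev) * np.searchsorted(errs, theta, side="right") / n
    return area / theta
```

For example, a single prediction with a 0.1° error evaluated at θ = 0.2° yields AA = 0.5, and AA reaches 1.0 only when every prediction has zero error.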
Our metric is able to show the algorithm's performance under different precision requirements. For a fair comparison, we also report the performance in pixel consistency, as in the dataset paper [50].

4.2 Implementation Details

We implement the conic convolution operator in PyTorch by modifying the "im2col + GEMM" function according to Equation (3), similar to the method used in [14]. Input images are resized to 512 × 512. During training, the Adam optimizer [25] is used. The learning rate and weight decay are set to 4 × 10⁻⁴ and 1 × 10⁻⁵, respectively. All experiments are conducted on two NVIDIA RTX 2080Ti GPUs, with each GPU holding 6 mini-batches. For the synthetic data [49], we train for 30 epochs and reduce the learning rate by a factor of 10 at the 24th epoch. We use ρ = 1.2, N⁺ = N⁻ = 1, and N* = 3. For the Natural Scene dataset, we train the model for 100 epochs and decay the learning rate at the 60th epoch. For ScanNet [13], we train the model for 3 epochs. We augment the data with horizontal flips. We set Nd = 64 and use R_SU3 = 5, R_NS = 4, and R_SN = 3 rounds of coarse-to-fine inference for the SU3, Natural Scene, and ScanNet datasets, respectively. During inference, the results from the backbone network can be shared, so only the conic convolution layers need to be forwarded multiple times. Using the Natural Scene dataset as an example, we conduct 4 rounds of coarse-to-fine inference, in each of which we sample 64 vanishing points, so we forward the conic convolution part 256 times for each image during testing. The evaluation speed is about 1.5 vanishing points per second on a single GPU.

4.3 Ablation Studies on the Synthetic Dataset

Comparison with Baseline Methods. We compare our method with both traditional line-detection-based methods and neural-network-based methods. Sample images and results can be found in Figure 3 and the supplementary materials. 
For line-based algorithms, LSD with J-linkage clustering [42, 41, 16] is probably the most widely used method for vanishing point detection. Note that LSD is a strong competitor on the synthetic SU3 dataset, as the images contain many sharp edges and long lines.

We aim to compare with pure neural network methods that rely only on raw pixels. Existing methods such as [9, 15, 6] can only detect vanishing points inside the image frame, and [45, 26] rely on an external line map as initial input. To the best of our knowledge, there is no existing pure neural network method that is general enough to handle our case. Therefore, we propose two intuitive baselines. The first baseline, called REG, is a neural network that directly regresses the value of d using a chamfer-ℓ2 loss, similar to [49]. We change all the conic convolutions to traditional 2D convolutions to keep the number of parameters the same.

Table 1: Ablation study of our method. "REG" denotes the baseline that directly regresses the line direction in the camera space. "CLS" denotes the baseline that performs vanishing point classification using the image feature and its coordinate. 
Conic×K denotes our methods with a varying number of conic convolution layers.

              mean    median   AA0.2°   AA0.5°   AA1.0°
  LSD [16]    3.89°   0.21°    27.9     47.9     61.5
  REG         2.07°   1.48°    2.2      6.5      15.0
  CLS         1.77°   0.99°    2.2      9.1      23.7
  Conic×2     0.78°   0.43°    10.5     28.9     50.3
  Conic×4     0.15°   0.09°    47.5     74.2     86.3
  Conic×6     0.14°   0.09°    49.1     74.0     86.2

Figure 5: Angle accuracy curves for different methods on the SU3 wireframe dataset [49]. (a) Angle difference ranges from 0° to 1°. (b) Angle difference ranges from 0° to 6°.

Figure 6: Angle accuracy curves for different methods on the Natural Scene dataset [50]. (a) Angle difference ranges from 0° to 2°. (b) Angle difference ranges from 0° to 20°.

Figure 7: Angle accuracy curves for different methods on the ScanNet dataset [13]. (a) Angle difference ranges from 0° to 2°. (b) Angle difference ranges from 0° to 20°.

The second baseline, called CLS, uses our coarse-to-fine classification approach. We change all the conic convolutions to their traditional counterparts, and concatenate d to the feature map right before feeding it to the NeurVPS head, to make the neural network aware of the position of the vanishing point.

The results are shown in Table 1 and Figure 5. By utilizing the geometric priors and large-scale training data, our method significantly outperforms the other baselines across all the metrics. We note that, compared to LSD, the neural network baselines perform better in terms of mean error but much worse in terms of AA. This is because line-based methods are generally more accurate, while data-driven approaches are less likely to produce outliers. This phenomenon is also observed in Figure 5b, where the neural network baselines achieve a higher percentage when the angle error is larger than 4.5°.

Effect of Conic Convolution. We now examine the effect of different numbers of conic convolution layers. We test with 2/4/6 conic convolution layers, denoted as Conic×2/4/6, respectively. For Conic×2, we keep only the last two conic convolutions and replace the others with their plain counterparts. For Conic×6, we add two more conic convolution layers at the finest level, without max pooling appended. The results are shown in Table 1 and Figure 5. We observe that the performance keeps increasing as more conic convolutions are added. We hypothesize that this is because stacking multiple conic convolutions enables our model to capture higher-order edge information and thus significantly increases the performance. 
The performance improvement saturates at Conic×6.

4.4 NeurVPS on the Real-World Datasets

Natural Scene [50]. We validate our method on real-world datasets to test its effectiveness and generalizability. The angle accuracy results on the Natural Scene dataset [50] are shown in Table 2 and Figure 6. We also compare the performance in the consistency measure, a metric used by the baseline method (a contour-based clustering algorithm, labeled vpdet) in the dataset paper [50], in Figure 8. Our method outperforms this strong baseline algorithm vpdet by a fairly large margin in terms of all metrics. Our experiments also show that the naive CNN baselines under-perform vpdet until the angle tolerance reaches around 4°. This is consistent with the results from [50], in which vpdet is better than the previous deep learning method [45] in the region that requires high precision. Such phenomena indicate that our geometry-aware network is able to accurately locate vanishing points in images, while naive CNNs can only roughly determine their positions.

ScanNet [13]. The results on the ScanNet dataset [13] are shown in Table 3 and Figure 7. As a baseline for traditional methods, we only compare our method with LSD + J-linkage, because other methods such as [50] are not directly applicable when there are three vanishing points in a scene. Our method reduces the mean and median errors by 6 and 4 times, respectively. The angle accuracy also improves by a large margin. ScanNet [13] is a large dataset, so both CLS and REG work reasonably well. However, because traditional convolutions cannot fully exploit the geometric structure of vanishing points, the performance of these baseline algorithms is worse than that of our conic convolutional neural network. 
It is also worth mentioning that the ground-truth vanishing points of the ScanNet dataset contain considerable error due to inaccurate 3D reconstruction and low-budget capture devices, which is probably why the performance gap between conic convolutional networks and traditional 2D convolutional networks is less significant here.

One drawback of our data-driven method is its need for a large amount of training data. We do not evaluate our method on datasets such as YUD [15], ECD [2], and HLW [44] because there is no suitable public dataset for training. In the future, we will study how to exploit geometric information in unsupervised or semi-supervised settings to alleviate the data scarcity problem.

Table 2: Performance of algorithms on the Natural Scene dataset [50]. vpdet is the method from the dataset paper.

             mean    median   AA1°   AA2°   AA10°
REG          5.09°   3.20°     2.4    9.9    58.8
CLS          5.80°   2.79°     4.4   14.5    62.4
vpdet [50]  12.6°    1.56°    18.5   33.0    60.0
Ours         1.83°   0.87°    29.1   50.3    85.5

Table 3: Performance of algorithms on ScanNet [13].

             mean    median   AA1°   AA2°   AA10°
LSD [16]    12.6°    11.8°     1.7    5.4    24.8
REG          6.9°     5.0°     1.5    5.1    45.1
CLS          5.3°     3.6°     2.0    8.1    55.9
Ours         4.5°     3.0°     3.4   11.5    61.7

Figure 8: Consistency measure on the Natural Scene dataset [50], comparing Conic ×4 and vpdet [50].

References

[1] Michel Antunes and Joao P Barreto. A global approach for the detection of vanishing points and mutually orthogonal vanishing directions. In CVPR, 2013.

[2] Olga Barinova, Victor Lempitsky, Elena Tretiak, and Pushmeet Kohli. Geometric image parsing in man-made environments. In ECCV, 2010.

[3] Stephen T Barnard. Interpreting perspective images.
Artificial Intelligence, 1983.

[4] Jean-Charles Bazin, Yongduek Seo, Cedric Demonceaux, Pascal Vasseur, Katsushi Ikeuchi, In So Kweon, and Marc Pollefeys. Globally optimal line clustering and vanishing point estimation in Manhattan world. In CVPR, 2012.

[5] Robert C Bolles and Martin A Fischler. A RANSAC-based approach to model fitting and its application to finding cylinders in range data. In IJCAI, 1981.

[6] Ali Borji. Vanishing point detection with convolutional neural networks. arXiv preprint, 2016.

[7] J. Bruna and S. Mallat. Invariant scattering convolution networks. IEEE TPAMI, 35(8):1872–1886, 2013.

[8] John Canny. A computational approach to edge detection. Morgan Kaufmann Publishers Inc., 1987.

[9] Chin-Kai Chang, Jiaping Zhao, and Laurent Itti. DeepVP: Deep learning for vanishing point detection on 1 million street view images. In ICRA, 2018.

[10] Roberto Cipolla, Tom Drummond, and Duncan P Robertson. Camera calibration from vanishing points in image of architectural scenes. In BMVC, 1999.

[11] Taco S Cohen, Mario Geiger, Jonas Köhler, and Max Welling. Spherical CNNs. In ICLR, 2018.

[12] James M Coughlan and Alan L Yuille. Manhattan world: Compass direction from a single image by Bayesian inference. In ICCV, 1999.

[13] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In CVPR, 2017.

[14] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In ICCV, 2017.

[15] Patrick Denis, James H Elder, and Francisco J Estrada. Efficient edge-based methods for estimating Manhattan frames in urban imagery. In ECCV, 2008.

[16] Chen Feng, Fei Deng, and Vineet R Kamat. Semi-automatic 3D reconstruction of piecewise planar building models from single image.
CONVR, 2010.

[17] Álvaro González. Measurement of areas on a sphere using Fibonacci and latitude–longitude lattices. Mathematical Geosciences, 2010.

[18] Erwan Guillou, Daniel Meneveaux, Eric Maisel, and Kadi Bouatouch. Using vanishing points for camera calibration and coarse 3D reconstruction from a single image. The Visual Computer, 2000.

[19] Derek Hoiem, Alexei A Efros, and Martial Hebert. Putting objects in perspective. IJCV, 2008.

[20] Paul VC Hough. Machine analysis of bubble chamber pictures. In International Conference on High Energy Accelerators and Instrumentation, 1959.

[21] Jingwei Huang, Haotian Zhang, Li Yi, Thomas Funkhouser, Matthias Nießner, and Leonidas Guibas. TextureNet: Consistent local parametrizations for learning from high-resolution signals on meshes. In CVPR, 2019.

[22] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. In NIPS, 2015.

[23] Yunho Jeon and Junmo Kim. Active convolution: Learning the shape of convolution for image classification. In CVPR, 2017.

[24] Chiyu Jiang, Jingwei Huang, Karthik Kashinath, Philip Marcus, Matthias Niessner, et al. Spherical CNNs on unstructured grids. In ICLR, 2019.

[25] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint, 2014.

[26] Florian Kluger, Hanno Ackermann, Michael Ying Yang, and Bodo Rosenhahn. Deep learning for vanishing point detection using an inverse gnomonic projection. In GCPR, 2017.

[27] J. Kosecka and W. Zhang. Video compass. In ECCV, 2002.

[28] Seokju Lee, Junsik Kim, Jae Shin Yoon, Seunghak Shin, Oleksandr Bailo, Namil Kim, Tae-Hee Lee, Hyun Seok Hong, Seung-Hoon Han, and In So Kweon. VPGNet: Vanishing point guided network for lane and road marking detection and recognition. In ICCV, 2017.

[29] José Lezama, Rafael Grompone von Gioi, Gregory Randall, and Jean-Michel Morel.
Finding vanishing points via point alignments in image primal and dual domains. In CVPR, 2014.

[30] Michael J Magee and Jake K Aggarwal. Determining vanishing points from perspective images. Computer Vision, Graphics, and Image Processing, 1984.

[31] Jonathan Masci, Davide Boscaini, Michael Bronstein, and Pierre Vandergheynst. Geodesic convolutional neural networks on Riemannian manifolds. In ICCV Workshop, 2015.

[32] GF McLean and D Kotturi. Vanishing point detection by line clustering. PAMI, 1995.

[33] Faraz M Mirzaei and Stergios I Roumeliotis. Optimal estimation of vanishing points in a Manhattan world. In ICCV, 2011.

[34] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016.

[35] James F O'Brien and Hany Farid. Exposing photo manipulation with inconsistent reflections. ACM ToG, 2012.

[36] Long Quan and Roger Mohr. Determining perspective structures using hierarchical Hough transform. Pattern Recognition Letters, pages 279–286, 1989.

[37] Grant Schindler and Frank Dellaert. Atlanta world: An expectation maximization framework for simultaneous low-level edge grouping and camera calibration in complex man-made environments. In CVPR, 2004.

[38] Evan Shelhamer, Dequan Wang, and Trevor Darrell. Blurring the line between structure and learning to optimize and adapt receptive fields. arXiv preprint, 2019.

[39] L. Sifre and S. Mallat. Rotation, scaling and deformation invariant scattering for texture discrimination. In CVPR, 2013.

[40] Marco Straforini, C Coelho, and Marco Campani. Extraction of vanishing points from images of indoor and outdoor scenes. Image and Vision Computing, 1993.

[41] Jean-Philippe Tardif. Non-iterative approach for fast and accurate vanishing point detection.
In ICCV, 2009.

[42] Rafael Grompone von Gioi, Jérémie Jakubowicz, Jean-Michel Morel, and Gregory Randall. LSD: A fast line segment detector with a false detection control. PAMI, 2008.

[43] Horst Wildenauer and Allan Hanbury. Robust camera self-calibration from monocular images of Manhattan worlds. In CVPR, 2012.

[44] Scott Workman, Menghua Zhai, and Nathan Jacobs. Horizon lines in the wild. In BMVC, 2016.

[45] Menghua Zhai, Scott Workman, and Nathan Jacobs. Detecting vanishing points using global image context in a non-Manhattan world. In CVPR, 2016.

[46] W. Zhang and J. Kosecka. Efficient detection of vanishing points. In ICRA, 2002.

[47] Xiaodan Zhang, Xinbo Gao, Wen Lu, Lihuo He, and Qi Liu. Dominant vanishing point detection in the wild with application in composition analysis. Neurocomputing, 2018.

[48] Yichao Zhou, Haozhi Qi, and Yi Ma. End-to-end wireframe parsing. In ICCV, 2019.

[49] Yichao Zhou, Haozhi Qi, Yuexiang Zhai, Qi Sun, Zhili Chen, Li-Yi Wei, and Yi Ma. Learning to reconstruct 3D Manhattan wireframes from a single image. In ICCV, 2019.

[50] Zihan Zhou, Farshid Farhat, and James Z Wang. Detecting dominant vanishing points in natural scenes with application to composition-sensitive image retrieval. IEEE Transactions on Multimedia, 2017.