{"title": "Reflection Separation using a Pair of Unpolarized and Polarized Images", "book": "Advances in Neural Information Processing Systems", "page_first": 14559, "page_last": 14569, "abstract": "When we take photos through glass windows or doors, the transmitted background scene is often blended with undesirable reflection. Separating two layers apart to enhance the image quality is of vital importance for both human and machine perception. In this paper, we propose to exploit physical constraints from a pair of unpolarized and polarized images to separate reflection and transmission layers. Due to the simplified capturing setup, the system becomes more underdetermined compared with existing polarization based solutions that take three or more images as input. We propose to solve semireflector orientation estimation first to make the physical image formation well-posed and then learn to reliably separate two layers using a refinement network with gradient loss. Quantitative and qualitative experimental results show our approach performs favorably over existing polarization and single image based solutions.", "full_text": "Reflection Separation using a Pair of\nUnpolarized and Polarized Images\n\nYouwei Lyu1\u266f\u2020 Zhaopeng Cui2\u266f Si Li1\u2217 Marc Pollefeys2 Boxin Shi3,4\u2217\n\n1Beijing University of Posts and Telecommunications\n\n2Department of Computer Science, ETH Z\u00fcrich\n\n3National Engineering Laboratory for Video Technology, Peking University\n\n4Peng Cheng Laboratory\n\n{youweilv, zhpcui}@gmail.com, lisi@bupt.edu.cn,\nmarc.pollefeys@inf.ethz.ch, shiboxin@pku.edu.cn\n\nAbstract\n\nWhen we take photos through glass windows or doors, the transmitted background\nscene is often blended with undesirable reflection. Separating two layers apart\nto enhance the image quality is of vital importance for both human and machine\nperception. 
In this paper, we propose to exploit physical constraints from a pair of\nunpolarized and polarized images to separate re\ufb02ection and transmission layers.\nDue to the simpli\ufb01ed capturing setup, the system becomes more underdetermined\ncompared with existing polarization based solutions that take three or more images\nas input. We propose to solve semire\ufb02ector orientation estimation \ufb01rst to make the\nphysical image formation well-posed and then learn to reliably separate two layers\nusing a re\ufb01nement network with gradient loss. Quantitative and qualitative experi-\nmental results show our approach performs favorably over existing polarization\nand single image based solutions.\n\n1\n\nIntroduction\n\nTaking photos of a scene behind semire\ufb02ectors (e.g., glass windows and doors) without re\ufb02ection\ncontamination is not an easy task for photographers, because the captured image often contains two\nlayers of the scene: the layer transmitting through the surface and the other layer re\ufb02ected by the\nsurface. To separate the re\ufb02ection and transmission layers is not an easy task for computer vision\nresearchers either, because recovering two images from a single mixture image is highly ill-posed\nand the number of unknowns is twice as many as that of given measurements. Strong priors crafted\nfrom natural image statistics (e.g., gradient sparsity [14]) or learned from deep neural networks\n(e.g., [6]) can solve the problem if the assumed priors are well observed in the input. The problem\nnaturally becomes less ill-posed if multiple images are captured from different viewpoints (e.g., \ufb01ve\nimages in [15]) or different polarization angles (e.g., at least three images in [12]). The motions\nbetween the layers present in multiple images provide a strong and effective constraint, but aligning\nmultiple-view images contaminated by re\ufb02ections is not a trivial task [15]. 
Rotating a polarizer to\ncapture multiple images doesn\u2019t suffer from the alignment issue [12], but it requires skillful operations\nand the polarized images always filter part of the incoming light.\n\n\u266fAuthors contributed equally to this work.\n\u2020Part of this work was finished as a visiting student at Peking University.\n\u2217Corresponding authors.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nIn this paper, we propose to separate reflection and transmission layers using a pair of unpolarized\nand polarized images. Such a setup takes fewer images than existing polarization based solutions\n[18, 12, 22] and keeps an unpolarized image to maintain high light energy throughput. Directly\nsolving the two layers is still an ill-posed problem, but we find that the problem has a closed-form\nsolution when the semireflector surface normal is known. By assuming the semireflector is\nmostly planar, we can use only two parameters to determine the complete physical image formation\nmodel that encodes the solution to layer separation. Based on these physical and mathematical\ndeductions, we propose an end-to-end deep neural network for reflection separation using two\n(un)polarized images. More specifically, we design a cascaded architecture consisting of three\nmodules: semireflector orientation estimation to determine key variables for a well-posed physical\nimage formation model, polarization-guided separation based on the physical model, and separated\nlayers refinement with gradient loss to enhance the sharpness. 
The code and test data are available at\nhttps://github.com/YouweiLyu/reflection_separation_with_un-polarized_images.\nThe main contributions of this paper can be summarized as follows:\n\n\u2022 We propose to solve reflection separation using a pair of unpolarized and polarized images\nfor the first time, which integrates polarization cues with a simpler and light-efficient setup.\n\u2022 We derive a new formulation based on semireflector orientation estimation, which induces a\nwell-posed physical image formation model to be reliably learned for layer separation.\n\u2022 We design an end-to-end deep neural network with gradient loss to solve the separation\nproblem and show superior performance over existing polarization and single image based\nsolutions.\n\n2 Related Work\n\nIn terms of input, reflection separation can take a single image or multiple images. The single image\nproblem has the most relaxed requirement, since it only needs an image captured by an ordinary\ncamera in the wild. But such a problem is also highly ill-posed, so handcrafted priors\n[13, 14, 16, 19, 21, 1] or features learned from large-scale training data [6, 20, 25, 24]\nare explored to facilitate the separation. By taking multiple images from different viewpoints, the\ndifference of projected motion from reflection and transmission layers due to the visual parallax\nprovides useful cues to the separation [2, 8, 23]. 
By taking multiple images under different polarization\nangles, the differently polarized images provide \"independent\" representations of re\ufb02ection and\ntransmission layers based on physical image formation model to leverage the separation using\nindependent component analysis [7, 10, 3], closed form expressions [18, 12], or deep learning [22].\nMultiple images usually bring more promising separation quality than relying on only a single image,\nbut request more complicated and careful image capturing operations.\nIn terms of solutions, re\ufb02ection separation can be solved by non-learning based methods or learning\nmethods. Adopted priors of re\ufb02ection and transmission layers by non-learning based methods include\nthe sparse gradient prior [14, 13], blur level differences between two layers [16], the ghosting effect\ndue to thick glass [19, 5], and the Laplacian data \ufb01delity term [1]. Such handcrafted priors may get\nviolated in various real scenarios when expected properties are weakly observed. Learning based\nmethods are bene\ufb01ted by the comprehensive modeling ability of deep neural networks. It can be\nsolved by learning the gradient inference and image restoration sequentially [4, 6] or concurrently\n[20], by incorporating perceptual losses [25], and by considering bidirectional constraints [24]. 
With\ndifferently polarized images available, a simple encoder-decoder architecture is shown to be effective\nfor separating two layers using physics based image formation model [22].\nOur work belongs to the learning based approach using multiple images and physical constraints.\nDifferent from previous works exploring polarization cues [18, 12, 22] that require at least three\nimages with different polarization angles, we take a pair of unpolarized and polarized images and\nlearn to solve a more underdetermined system.\n\n3 Physical Image Formation Model\n\nGiven a pair of unpolarized and polarized images captured at the same view, we aim to separate\nthe re\ufb02ection layer and the transmission layer. In this section, we will \ufb01rst review the re\ufb02ection\nand transmission model, and then describe the relationship between polarization properties and\n\n2\n\n\fFigure 1: Illustration of physical image formation model.\n\nsemire\ufb02ector surface geometry. By assuming the medium is planar, we prove that the separation\ntightly relies on only two parameters of the plane.\n\n3.1 Re\ufb02ection and Transmission Image Formation\n\nSuppose It, the intensity of light from the transmission scene, and Ir, the intensity of light from\nthe re\ufb02ection scene, are both unpolarized. 
After being reflected or transmitted, the intensity of light\nobserved at pixel x changes depending on \u03b8(x), the angle of incidence (AoI) at the reflected point\ncorresponding to pixel x, as the following [12]:\n\n$$I_{\\mathrm{unpol}}(x) = \\frac{R_{\\perp}(\\theta(x)) + R_{\\parallel}(\\theta(x))}{2} \\cdot I_r(x) + \\frac{T_{\\perp}(\\theta(x)) + T_{\\parallel}(\\theta(x))}{2} \\cdot I_t(x), \\tag{1}$$\n\nwhere R represents the relative strength of light reflected off a glass surface, T represents the relative\nstrength of light transmitted through glass, and subscripts \u22a5 and \u2225 correspond to the polarized\ncomponents perpendicular and parallel to the plane of incidence (PoI), respectively.\nWhen we place a linear polarizer with a polarization angle \u03c6 in front of the camera, according to\nMalus\u2019 law [9], the intensity at pixel x is\n\n$$I_{\\mathrm{pol}}(x) = \\frac{R_{\\perp}(\\theta(x)) \\cos^2(\\phi - \\phi_{\\perp}(x)) + R_{\\parallel}(\\theta(x)) \\sin^2(\\phi - \\phi_{\\perp}(x))}{2} \\cdot I_r(x) + \\frac{T_{\\perp}(\\theta(x)) \\cos^2(\\phi - \\phi_{\\perp}(x)) + T_{\\parallel}(\\theta(x)) \\sin^2(\\phi - \\phi_{\\perp}(x))}{2} \\cdot I_t(x), \\tag{2}$$\n\nwhere \u03c6\u22a5(x) is the orientation of the polarizer for the best transmission of the component perpendicular\nto the PoI. For easy representation, we denote\n\n$$\\xi(x) = R_{\\perp}(\\theta(x)) + R_{\\parallel}(\\theta(x)), \\tag{3}$$\n\n$$\\zeta(x) = R_{\\perp}(\\theta(x)) \\cos^2(\\phi - \\phi_{\\perp}(x)) + R_{\\parallel}(\\theta(x)) \\sin^2(\\phi - \\phi_{\\perp}(x)). \\tag{4}$$\n\nThe glass can be considered as a double-surfaced semireflector, and we have R\u22a5(\u03b8(x)) + T\u22a5(\u03b8(x)) =\n1 and R\u2225(\u03b8(x)) + T\u2225(\u03b8(x)) = 1 for each pixel x approximately [12]. 
Then Equation (1) and\nEquation (2) can be rewritten as\n\n$$I_{\\mathrm{unpol}}(x) = \\frac{\\xi(x)}{2} \\cdot I_r(x) + \\frac{2 - \\xi(x)}{2} \\cdot I_t(x), \\tag{5}$$\n\n$$I_{\\mathrm{pol}}(x) = \\frac{\\zeta(x)}{2} \\cdot I_r(x) + \\frac{1 - \\zeta(x)}{2} \\cdot I_t(x), \\tag{6}$$\n\nwhere \u03be(x) \u2208 (0, 2) and \u03b6(x) \u2208 (0, 1). Given the value of \u03be(x) and \u03b6(x), the reflection layer and the\ntransmission layer can be computed by\n\n$$I_r(x) = 2 \\cdot \\frac{(2 - \\xi(x)) \\cdot I_{\\mathrm{pol}}(x) - (1 - \\zeta(x)) \\cdot I_{\\mathrm{unpol}}(x)}{2\\zeta(x) - \\xi(x)}, \\tag{7}$$\n\n$$I_t(x) = 2 \\cdot \\frac{\\zeta(x) \\cdot I_{\\mathrm{unpol}}(x) - \\xi(x) \\cdot I_{\\mathrm{pol}}(x)}{2\\zeta(x) - \\xi(x)}, \\tag{8}$$\n\nexcept for 2\u03b6(x) = \u03be(x) where \u03c6 \u2212 \u03c6\u22a5(x) = \u00b145\u00b0 or \u00b1135\u00b0. The angle of a polarizer \u03c6 can be\nmeasured by calibration. Associated with surface geometry of semireflector, \u03c6\u22a5(x) is not constant\nbut spatially varying over the whole image plane. There may exist trivial \u03c6 \u2212 \u03c6\u22a5(x) corresponding\nto a few pixels, which have negligible effect on the separation.\nIn short, the reflection layer Ir(x) and the transmission layer It(x) are determined by \u03be(x) and \u03b6(x)\nwhen a pair of unpolarized and polarized images are given.\n\n3.2 Semireflector Surface Geometry\n\nIn order to recover the reflection layer Ir and the transmission layer It, we first have to solve \u03be(x)\nand \u03b6(x) according to Equations (5) and (6), which can be further computed by \u03b8(x) and \u03c6 \u2212 \u03c6\u22a5(x)\naccording to Equations (3) and (4). 
In this section, we will describe how we compute \u03b8(x) and\n\u03c6 \u2212 \u03c6\u22a5(x) for each pixel given the surface normal of the semireflector and camera parameters.\nWe assume the semireflector has a planar surface, and the camera coordinate is the same as the world\ncoordinate. Then the semireflector plane can be expressed as\n\n$$\\sin\\alpha \\cdot x - \\cos\\alpha \\sin\\beta \\cdot y + \\cos\\alpha \\cos\\beta \\, (z - z_0) = 0, \\tag{9}$$\n\nwhere \u03b1 represents the rotation angle around y-axis and \u03b2 represents the angle around x-axis. The\nplane normal is thus given by\n\n$$n_{\\mathrm{glass}} = \\begin{bmatrix} 1 & 0 & 0 \\\\ 0 & \\cos\\beta & -\\sin\\beta \\\\ 0 & \\sin\\beta & \\cos\\beta \\end{bmatrix} \\begin{bmatrix} \\cos\\alpha & 0 & \\sin\\alpha \\\\ 0 & 1 & 0 \\\\ -\\sin\\alpha & 0 & \\cos\\alpha \\end{bmatrix} \\begin{bmatrix} 0 \\\\ 0 \\\\ 1 \\end{bmatrix} = \\begin{bmatrix} \\sin\\alpha \\\\ -\\cos\\alpha \\sin\\beta \\\\ \\cos\\alpha \\cos\\beta \\end{bmatrix}. \\tag{10}$$\n\nLet f be the focal length of the camera, and (px, py) be the coordinate of the principal point. For the\npixel x located at (u, v) on the image plane, we can easily compute its corresponding 3D point X on\nthe medium plane as\n\n$$X = \\frac{z_0 \\cos\\alpha \\cos\\beta}{f \\cos\\alpha \\cos\\beta + (u - p_x) \\sin\\alpha - (v - p_y) \\cos\\alpha \\sin\\beta} \\begin{bmatrix} u - p_x \\\\ v - p_y \\\\ f \\end{bmatrix}. \\tag{11}$$\n\nLet $\\bar{X} = X / \\|X\\|$, then the AoI corresponding to pixel x can be calculated as\n\n$$\\theta(x) = \\arccos \\left| n_{\\mathrm{glass}} \\cdot \\bar{X} \\right|. \\tag{12}$$\n\nWe calculate the absolute value for the above term since \u03b8(x) \u2208 [0, 90\u00b0]. The normal of PoI\n$n_{\\mathrm{PoI}} = (x_{\\mathrm{PoI}}, y_{\\mathrm{PoI}}, z_{\\mathrm{PoI}})^{\\top}$ is then calculated as\n\n$$n_{\\mathrm{PoI}} = n_{\\mathrm{glass}} \\times \\bar{X}, \\tag{13}$$\n\nand the projection of $n_{\\mathrm{PoI}}$ on imaging plane is $(x_{\\mathrm{PoI}}, y_{\\mathrm{PoI}})^{\\top}$ denoting orientation of \u03c6\u22a5(x). 
For \u03c6\u22a5(x) \u2208 [0, 360\u00b0), we have\n\n$$\\phi_{\\perp}(x) = \\arctan \\frac{y_{\\mathrm{PoI}}}{x_{\\mathrm{PoI}}}. \\tag{14}$$\n\nWe combine the reflection and transmission image formation and semireflector surface geometry\nto compute \u03c6\u22a5(x) and \u03b8(x) for each pixel. Note they are not affected by z0, because physically\nthe transparent plane can be projected to parallel plane with arbitrary intercept about z-axis and\nmathematically before computing arctan and arccos, z0 has been eliminated according to Equations\n(12) and (14).\nIn short, it is the normal of glass that matters, and we only need to estimate coefficients \u03b1 and \u03b2 to\ndetermine the semireflector plane.\n\nFigure 2: Our method takes a cascaded architecture with three modules: semireflector orientation\nestimation, polarization-guided separation, and separated layers refinement.\n\n4 Reflection Separation Network\n\nIn this section, we introduce the proposed reflection separation network which makes use of physical\nmodel discussed in Section 3, and details about loss function and network training.\n\n4.1 Network Architecture\n\nAs shown in Figure 2, our network takes a cascaded architecture which consists of three modules:\nsemireflector orientation estimation, polarization-guided separation, and separated layers refinement.\nTaking a pair of unpolarized and polarized images, the semireflector orientation module aims to\npredict coefficients of the glass plane, i.e., \u03b1 and \u03b2. As we only need to estimate two parameters, the\npose estimation module is pretty light, and consists of seven convolutional layers followed by two fully\nconnected layers. The polarization-guided separation module takes \u03b1 and \u03b2 as inputs, and computes\nthe reflection layer $\\hat{I}_r$ and transmission layer $\\hat{I}_t$. This module only relies on the physical image\nformation model in Section 3 using analytic equations, so we do not have any parameters to learn\nhere. 
The layers separated using these equations may not be satisfactory due to the gap between the physical\nmodel and real data. A numerical problem also occurs when the denominators in Equation (7) and\nEquation (8) approach zero, and the computed results are degenerate. Fortunately, this happens only\nfor a few pixels and the remaining non-degenerate calculations can guide a refinement network to\nproduce compelling separation results. We therefore further feed $\\hat{I}_r$ and $\\hat{I}_t$ with the original input images\nand \u03be, \u03b6 into the separated layers refinement module to improve the initial estimation. The refinement\nmodule has a widely adopted encoder-decoder architecture. In detail, the encoder consists of eight\nconvolutional layers and the decoder consists of five deconvolutional layers.\nWe want the network to reconstruct as many details of the original image as possible, so we define\nthe loss on both the estimated image and its gradient as:\n\n$$L = \\lambda_1 L_r(I_r) + \\lambda_2 L_t(I_t) + \\lambda_3 L_r(g_r) + \\lambda_4 L_t(g_t), \\tag{15}$$\n\nwhere Lr(Ir) and Lt(It) define the loss on the estimated reflection and transmission layers, Lr(gr)\nand Lt(gt) define the loss on the gradients of the estimated reflection and transmission layers, and\n\u03bb1,2,3,4 are the weighting parameters. The mean square error (MSE) is used for all loss terms. We\nimplement our model using the PyTorch deep learning framework [17]. Adam [11] is used as the\noptimizer with a starting learning rate of 0.0004, \u03b21 = 0.9 and \u03b22 = 0.999. The learning rate is\ndecreased to 0.0002 and 0.00008 after the 12th and 18th epochs, respectively. \u03bb1,2,3,4 are set to 1.2,\n1.5, 1.0, and 1.5, respectively, for our training.\n\n4.2 Training Data Generation\n\nThe deep-learning method tends to be data-hungry, but it is difficult to obtain pairwise reflection and\ntransmission images with both polarized and unpolarized observations at a large scale. 
It is possible\nto directly use Equation (1) and Equation (2) to generate the synthetic data, but it is expected that the\nnetwork trained with such data may not generalize well on real scenarios. Therefore, we propose an\neffective data generation pipeline to better match images of real-world scenes.\n\nTable 1: Quantitative evaluation results on synthetic data.\n\nLayer | Metric | Ours | Ours-Initial | ReflectNet-Finetuned | Ours-2% noise | Ours-8% noise | Ours-16% noise\nTransmission | SSIM | 0.9708 | 0.8324 | 0.9627 | 0.9691 | 0.9668 | 0.9619\nTransmission | PSNR | 28.23 | 21.61 | 27.52 | 28.08 | 27.31 | 27.17\nReflection | SSIM | 0.8953 | 0.6253 | 0.8303 | 0.8785 | 0.8418 | 0.8022\nReflection | PSNR | 20.92 | 13.90 | 18.50 | 20.53 | 19.18 | 18.26\n\nAt the first step, we randomly pick two images from PLACE2 dataset [26] as original reflection and\ntransmission layers. Based on a commonly adopted assumption that people take photos focusing on\nthe background scene (transmission layer) so the reflection layer is likely to be blurry [6], a Gaussian\nsmoothing kernel with a random kernel size in the range of 3 to 7 pixels is applied to a portion of\nreflection images. We also need to simulate coefficients \u03b1 and \u03b2 of the semireflector plane. We\nassume people rarely take photos in front of the glass that inclines by a weird angle, e.g., glass nearly\northogonal to the image plane, so we set \u03b1 \u2208 (\u221265\u00b0, 65\u00b0) and \u03b2 \u2208 (\u221235\u00b0, 35\u00b0). 
For the virtual\ncamera, we set the focal length as 1.4 times as long as the image width, and the image resolution as\n256 \u00d7 256. By \ufb01xing these factors, the normal of glass is speci\ufb01ed, \u03b8(x) and \u03c6\u22a5(x) can be derived\nfrom Equation (12) and Equation (14), respectively. \u03c6 can be an arbitrary value in the range of [0, 2\u03c0),\nas long as the polarization images are captured under the same polarizer angle. In our experiment, we\nset \u03c6 to be 0. Additionally, real-world scenes are generally high-dynamic-range (HDR), so we apply\ndynamic range manipulation as conducted in [22] to simulate appearance of re\ufb02ections in a more\nrealistic manner. Finally, the synthetic unpolarized image Iunpol and the polarized image Ipol can be\nobtained by Equation (5) and Equation (6).\n\n5 Experimental Results\n\nWe evaluate our method on both synthetic and real data with extensive experiments including the\ncomparison with related work and ablation study. For all quantitative evaluations, both the peak-\nsignal-to-noise ratio (PSNR) and the structural similarity index measure (SSIM) are used to evaluate\nthe quality of separated images.\n\n5.1 Evaluation on Synthetic Data\n\nWe use 5000 pairs of images from our synthetic validation dataset with ground truth re\ufb02ection\nand transmission layers to quantitatively compare our method with state-of-the-art approaches.\nRe\ufb02ectNet [22] is a learning based method using three polarized images; Zhang et al. [25], CRRN\n[20], CEILNet [6] are deep learning based solutions using a single image; and LB14 [16] is a non-\nlearning method using a single image. To test the performance of Re\ufb02ectNet [22], we generated two\nadditional polarization images for each pair of (un)polarized images in our dataset, and \ufb01netuned\nRe\ufb02ectNet using Adam solver with a learning rate of 0.005 for 5 epochs. The experimental results\nare shown in Figure 3 and Table 1. 
We can see that, compared to all single-image based methods,\nour method has much better performance, which shows the advantage of the additional polarized\nimage. We can also see that all single-image based methods perform poorly on reflection\nlayers (whose brightness is increased for visualization purposes) due to their weak signal in the input images. Our method also outperforms ReflectNet [22]\nwhich requires three polarized images as input, especially in terms of the quality of the reflection\nlayer, although our method only needs one polarized image in addition to an unpolarized image.\nMoreover, our method performs the best in suppressing undesired reflection in the transmission layer\nand recovers a high-quality reflection layer as well, as indicated by the corresponding SSIM and PSNR\nvalues under the images in Figure 3. We also evaluate our initial polarization-guided separation $\\hat{I}_r$ and $\\hat{I}_t$\n(\"Ours-Initial\") in Table 1, and we can see that the initial separation is effective, and our refinement\nnetwork helps eliminate the artifact and noise caused by rough estimation of \u03be and \u03b6. 
Finally, we test\nour method against Gaussian noise added to images with different standard deviations. The results are\nshown in Table 1. We can see that our method performs consistently well and is robust to Gaussian\nnoise.\n\nFigure 3: Quantitative and qualitative evaluation on synthetic data, compared with ReflectNet [22],\nZhang et al. [25], CRRN [20], CEILNet [6], and LB14 [16].\n\nFigure 4: Qualitative evaluation on real data, compared with ReflectNet [22], Zhang et al. [25], CRRN\n[20], CEILNet [6], and LB14 [16].\n\n5.2 Evaluation on Real Data\n\nWe use the Lucid Vision Phoenix polarization camera\u00b9 to capture the real dataset. The polarization\ncamera can take four images with different polarizer angles at a single shot. We use three of them\nas input images to ReflectNet [22] and one of them as polarized input image to our method. The\nunpolarized input image is calculated by summing two polarized images captured with orthogonal\npolarizer angles [9]. 
Note the polarization camera has no color filter, so we can only provide results in\ngray scale, as displayed in Figure 4\u00b2. These scenes contain strong reflections with complex textures,\nand all single-image based methods fail to recover the transmissions while removing the reflections.\nThanks to the polarimetric cues, both ReflectNet [22] and our method show obvious advantage over\nsingle-image based methods. Compared to ReflectNet [22], our method shows stronger capability in\nsuppressing the ghost in transmission and clear extraction of the reflection layer.\n\n\u00b9https://thinklucid.com/product/phoenix-5-0-mp-polarized-model/\n\u00b2For better visualization, the minimum and maximum intensity values of different algorithms are stretched in\na consistent range.\n\nTable 2: Quantitative evaluation results in ablation study.\n\nLayer | Metric | Ours | W/o \u03be & \u03b6 | ReflectNet-refinement | W/o ori. est. | W/o grad. loss | Ours-Parabola\nTransmission | SSIM | 0.9708 | 0.9632 | 0.9594 | 0.9647 | 0.9674 | 0.8846\nTransmission | PSNR | 28.23 | 27.38 | 27.20 | 27.47 | 27.82 | 24.40\nReflection | SSIM | 0.8953 | 0.8721 | 0.8084 | 0.8015 | 0.9131 | 0.4833\nReflection | PSNR | 20.92 | 20.02 | 18.30 | 18.84 | 21.94 | 13.69\n\n5.3 Ablation Study\n\nWe first verify the contribution of semireflector orientation estimation by directly estimating \u03be(x)\nand \u03b6(x) from the network (without inferring \u03b1 and \u03b2 first). In other words, we also use an encoder-\ndecoder architecture to estimate \u03be(x) and \u03b6(x) directly from a given pair of (un)polarized images.\nSSIM and PSNR averaged over 5000 validation images are shown in \"W/o ori. 
est.\" column of Table 2.\nFrom Table 2, we can see that, with more prior knowledge encoded in the network, the orientation\nestimation with only two parameters is easier to learn and also better than directly estimating \u03be(x)\nand \u03b6(x) for each pixel.\nWe further evaluate different loss functions, and train our network without the gradient loss. The\nresults are listed in \"W/o grad. loss\" column of Table 2. We \ufb01nd the gradient loss is particularly useful\nin improving the quality of transmission layer estimation (background scene with more interests),\nthough it may hurt the accuracy of re\ufb02ection layer (usually treated as noise to be removed [16]).\nIn order to compare our method with Re\ufb02ectNet [22] thoroughly, we remove \u03be and \u03b6 from the input\nof our re\ufb01nement network, and feed the results of Re\ufb02ectNet and our polarization-guided separation\ninto this re\ufb01nement network. Under this setup, the quantitative results of re\ufb02ection and transmission\nare listed in \"Re\ufb02ectNet-re\ufb01nement\" and \"W/o \u03be & \u03b6\" columns of Table 2. We can see that even with\nthis re\ufb01nement Re\ufb02ectNet still performs worse than our full pipeline. It also shows the importance of\nfeeding \u03be and \u03b6 into the re\ufb01nement network.\nOur model assumes the semire\ufb02ector approximately has a planar shape. When it becomes a curved\nshape such as the windshield in a car, our semire\ufb02ector orientation estimation module will fail, and\nthus the performance of our method will deteriorate. We generate the test data using the parabola\nsurface simulation as Re\ufb02ectNet, and directly test using our current model. The result is listed\nin Table 2. 
We can see that the performance becomes much worse especially for the reflection.\nThe performance might be improved if we modify the semireflector orientation estimation module\naccordingly, and we will consider this as our future work.\n\n6 Conclusion\n\nWe solved the problem of integrating polarimetric constraints from a pair of unpolarized and polarized\nimages to separate reflection and transmission layers. To deal with the ill-posedness introduced\nby using fewer polarized images, we derived a semireflector orientation constraint to make the\nphysical image formation for layer separation valid given our setup, and trained a neural network\nto successfully separate two layers, showing state-of-the-art performance. Our simple yet unique\ncapturing setup not only explored polarimetric constraints for separating reflection and transmission\nlayers as reliably as existing approaches using three or more polarized images, but also could be\npotentially integrated into smart phones without affecting the original photography quality by not\nmaking all images polarized.\n\nAcknowledgments\n\nThis work was supported by the National Natural Science Foundation of China under Grant 61872012\nand 61702047.\n\nReferences\n\n[1] N. Arvanitopoulos, R. Achanta, and S. S\u00fcsstrunk. Single image reflection suppression. In Proc. CVPR, 2017.\n[2] E. Be\u2019Ery and A. Yeredor. Blind separation of superimposed shifted images using parameterized joint diagonalization. IEEE TIP, 17(3):340\u2013353, 2008.\n[3] A. M. Bronstein, M. M. Bronstein, M. Zibulevsky, and Y. Y. Zeevi. Sparse ICA for blind separation of transmitted and reflected images. International Journal of Imaging Systems and Technology, 15(1):84\u201391, 2005.\n[4] P. Chandramouli, M. Noroozi, and P. Favaro. Convnet-based depth estimation, reflection separation and deblurring of plenoptic images. In Proc. ACCV, 2016.\n[5] Y. Diamant and Y. Y. Schechner. 
Overcoming visual reverberations. In Proc. CVPR, 2008.\n[6] Q. Fan, J. Yang, G. Hua, B. Chen, and D. Wipf. A generic deep architecture for single image reflection removal and image smoothing. In Proc. ICCV, 2017.\n[7] H. Farid and E. H. Adelson. Separating reflections and lighting using independent components analysis. In Proc. CVPR, 1999.\n[8] K. Gai, Z. Shi, and C. Zhang. Blind separation of superimposed moving images using image statistics. IEEE TPAMI, 34(1):19\u201332, 2012.\n[9] E. Hecht. Optics. Pearson education. Addison-Wesley, 2002.\n[10] Hermanto, A. K. D. B. Filho, T. Yamamura, and N. Ohnishi. Separating virtual and real objects using independent component analysis. IEICE Transactions on Information and Systems, 84(9):1241\u20131248, 2001.\n[11] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.\n[12] N. Kong, Y. Tai, and J. S. Shin. A physically-based approach to reflection separation: from physical modeling to constrained optimization. IEEE TPAMI, 36(2):209\u2013221, 2014.\n[13] A. Levin and Y. Weiss. User assisted separation of reflections from a single image using a sparsity prior. In Proc. ECCV, 2004.\n[14] A. Levin and Y. Weiss. User assisted separation of reflections from a single image using a sparsity prior. IEEE TPAMI, 29(9):1647\u20131654, 2007.\n[15] Y. Li and M. S. Brown. Exploiting reflection change for automatic reflection removal. In Proc. ICCV, 2013.\n[16] Y. Li and M. S. Brown. Single image layer separation using relative smoothness. In Proc. CVPR, 2014.\n[17] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In Proc. NIPS Autodiff Workshop, 2017.\n[18] Y. Y. Schechner, J. Shamir, and N. Kiryati. Polarization and statistical analysis of scenes containing a semireflector. 
Journal of the Optical Society of America, 17(2):276\u2013284, 2000.\n[19] Y. Shih, D. Krishnan, F. Durand, and W. T. Freeman. Reflection removal using ghosting cues. In Proc. CVPR, 2015.\n[20] R. Wan, B. Shi, L.-Y. Duan, A.-H. Tan, and A. C. Kot. CRRN: Multi-scale guided concurrent reflection removal network. In Proc. CVPR, 2018.\n[21] R. Wan, B. Shi, A. H. Tan, and A. C. Kot. Depth of field guided reflection removal. In Proc. ICIP, 2016.\n[22] P. Wieschollek, O. Gallo, J. Gu, and J. Kautz. Separating reflection and transmission images in the wild. In Proc. ECCV, 2018.\n[23] T. Xue, M. Rubinstein, C. Liu, and W. T. Freeman. A computational approach for obstruction-free photography. ACM TOG, 34(4):79:1\u201379:11, 2015.\n[24] J. Yang, D. Gong, L. Liu, and Q. Shi. Seeing deeply and bidirectionally: A deep learning approach for single image reflection removal. In Proc. ECCV, 2018.\n[25] X. Zhang, R. Ng, and Q. Chen. Single image reflection separation with perceptual losses. In Proc. CVPR, 2018.\n[26] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba. Places: A 10 million image database for scene recognition. IEEE TPAMI, 40(6):1452\u20131464, 2017.\n", "award": [], "sourceid": 8242, "authors": [{"given_name": "Youwei", "family_name": "Lyu", "institution": "Beijing University of Posts and Telecommunications"}, {"given_name": "Zhaopeng", "family_name": "Cui", "institution": "ETH Zurich"}, {"given_name": "Si", "family_name": "Li", "institution": "Beijing University of Posts and Telecommunications"}, {"given_name": "Marc", "family_name": "Pollefeys", "institution": "ETH Zurich"}, {"given_name": "Boxin", "family_name": "Shi", "institution": "Peking University"}]}