{"title": "Computational Mirrors: Blind Inverse Light Transport by Deep Matrix Factorization", "book": "Advances in Neural Information Processing Systems", "page_first": 14311, "page_last": 14321, "abstract": "We recover a video of the motion taking place in a hidden scene by observing changes in indirect illumination in a nearby uncalibrated visible region. We solve this problem by factoring the observed video into a matrix product between the unknown hidden scene video and an unknown light transport matrix. This task is extremely ill-posed, as any non-negative factorization will satisfy the data. Inspired by recent work on the Deep Image Prior, we parameterize the factor matrices using randomly initialized convolutional neural networks trained in a one-off manner, and show that this results in decompositions that reflect the true motion in the hidden scene.", "full_text": "Computational Mirrors: Blind Inverse Light\n\nTransport by Deep Matrix Factorization\n\nMiika Aittala\n\nMIT\n\nMIT\n\nmiika@csail.mit.edu\n\nprafull@mit.edu\n\nlmurmann@mit.edu\n\nadamy@mit.edu\n\nPrafull Sharma\n\nLukas Murmann\n\nAdam B. Yedidia\n\nMIT\n\nMIT\n\nGregory W. Wornell\n\nMIT\n\ngww@mit.edu\n\nWilliam T. Freeman\nMIT, Google Research\n\nbillf@mit.edu\n\nAbstract\n\nFr\u00e9do Durand\n\nMIT\n\nfredo@mit.edu\n\nWe recover a video of the motion taking place in a hidden scene by observing\nchanges in indirect illumination in a nearby uncalibrated visible region. We solve\nthis problem by factoring the observed video into a matrix product between the\nunknown hidden scene video and an unknown light transport matrix. This task is\nextremely ill-posed as any non-negative factorization will satisfy the data. 
Inspired\nby recent work on the Deep Image Prior, we parameterize the factor matrices using\nrandomly initialized convolutional neural networks trained in a one-off manner,\nand show that this results in decompositions that re\ufb02ect the true motion in the\nhidden scene.\n\n1 Introduction\n\nWe study the problem of recovering a video of activity taking place outside our \ufb01eld of view by\nobserving its indirect effect on shadows and shading in an observed region. This allows us to turn, for\nexample, the pile of clutter in Figure 1 into a \u201ccomputational mirror\u201d with a low-resolution view into\nnon-visible parts of the room.\nThe physics of light transport tells us that the image observed on the clutter is related to the hidden\nimage by a linear transformation. If we knew this transformation, we could solve for the\nhidden video by matrix inversion (we demonstrate this baseline approach in Section 3). Unfortunately,\nobtaining the transport matrix by measurement requires an expensive pre-calibration step and access\nto the scene setup. We instead tackle the hard problem of estimating both the hidden video and the\ntransport matrix simultaneously from a single input video of the visible scene. For this, we cast the\nproblem as matrix factorization of the observed clutter video into a product of an unknown transport\nmatrix and an unknown hidden video matrix.\nMatrix factorization is known to be very ill-posed. Factorizations for any matrix are in principle\neasy to \ufb01nd: we can simply choose one of the factors at will (as a full-rank matrix) and recover a\ncompatible factor by pseudoinversion. Unfortunately, the vast majority of these factorizations are\nmeaningless for a particular problem. The general strategy for \ufb01nding meaningful factors is to impose\nproblem-dependent priors or constraints \u2014 for example, non-negativity or spatial smoothness. 
While\nsuccessful in many applications, meaningful image priors can be hard to express computationally.\nIn particular, we are not aware of any successful demonstrations of the inverse light transport\nfactorization problem in the literature. We \ufb01nd that classical factorization approaches produce\nsolutions that are scrambled beyond recognition.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fOur key insight is to build on the recently developed Deep Image Prior [37] to generate the factor\nmatrices as outputs of a pair of convolutional neural networks trained in a one-off fashion. That is,\nwe randomly initialize the neural net weights and inputs and then perform a gradient descent to \ufb01nd\na set of weights such that their outputs\u2019 matrix product yields the known video. No training data\nor pre-training is involved in the process. Rather, the structure of convolutional neural networks,\nalternating convolutions and nonlinear transformations, induces a bias towards factor matrices that\nexhibit consistent image-like structure, resulting in recovered videos that closely match the hidden\nscene, although global ambiguities such as re\ufb02ections and rotations remain. We found that this holds\ntrue of the video factor as well as the transport factor, in which columns represent the scene\u2019s response\nto an impulse in the hidden scene, and exhibit image-like qualities.\nThe source code, supplemental material, and a video demonstrating the results can be found at the\nproject webpage at compmirrors.csail.mit.edu.\n\n2 Related Work\n\nMatrix Factorization. Matrix factorization is a fundamental topic in computer science and mathe-\nmatics. Indeed, many widely used matrix transformations and decompositions, such as the singular\nvalue decomposition, eigendecomposition, and LU decomposition, are instances of constrained\nmatrix factorization. 
There has been extensive research in the \ufb01eld of blind or lightly constrained\nmatrix factorization. The problem has applications in facial and object recognition [19], sound\nseparation [40], representation learning [36], and automatic recommendations [44]. Neural nets have\nbeen used extensively in this \ufb01eld [11, 28, 29, 17], and are often used for matrix completion under a low-rank\nassumption [36, 44].\nBlind deconvolution [20, 18, 5, 21, 31, 12] is closely related to our work but involves a more restricted\nclass of matrices. This greatly reduces the number of unknowns (a kernel rather than a full matrix)\nand makes the problem less ill-posed, although still quite challenging.\nKoenderink et al. [16] analyze a class of problems where one seeks to simultaneously estimate some\nproperty, and calibrate the linear relationship between that property and the available observations.\nOur blind inverse light transport problem falls into this framework.\nDeep Image Prior. In 2018, Ulyanov et al. [37] published their work on the Deep Image Prior\u2014the\nremarkable discovery that due to their structure, convolutional neural nets inherently impose a natural-\nimage-like prior on the outputs they generate, even when they are initialized with random weights\nand without any pretraining. Since the publication of [37], there have been several other papers that\nmake use of the Deep Image Prior for a variety of applications, including compressed sensing [38],\nimage decomposition [6], denoising [4], and image compression [9]. In concurrent work, the Deep\nImage Prior and related ideas have also been applied to blind deconvolution [1, 27].\nLight Transport Measurement. There has been extensive past work on measuring and approximating\nlight transport matrices using a variety of techniques, including compressed sensing [25], kernel\nNystr\u00f6m methods [41], and Monte Carlo methods [33]. 
[32] and [7] study the recovery of an image\u2019s\nre\ufb02ectance \ufb01eld, which is the light transport matrix between the incident and exitant light \ufb01elds.\nNon-Line-of-Sight (NLoS) Imaging. Past work in active NLoS imaging focuses primarily on\ntechniques using time-of-\ufb02ight information to resolve scenes [43, 34, 10]. Time-of-\ufb02ight information\nallows the recovery of a wealth of information about hidden scenes, including the number of people [42],\nobject tracking [15], and general 3D structure [8, 24, 39]. In contrast, past work in passive NLoS\nimaging has focused primarily on occluder-based imaging methods. These methods can\nsimply treat objects in the surrounding environment as pinspecks or pinholes to reconstruct hidden\nscenes, as in [35] or [30]. Others have used corners to get 1D reconstructions of moving scenes [3],\nor used sophisticated models of occluders to infer light \ufb01elds [2].\n\n3 Inverse Light Transport\n\nWe preface the development of our factorization method by introducing the inverse light transport\nproblem, and presenting numerical real-world experiments with a classical matrix inversion solution\nwhen the transport matrix is known. In later sections we study the case of unknown transport matrices.\n\n\fFigure 1: A typical experimental setup used throughout this paper. A camera views a pile of clutter,\nwhile a hidden video L is being projected outside the direct view Z of the camera. We wish to recover\nthe hidden video from the shadows and shading observed on the clutter. The room lights in this\nphotograph were turned on for the purpose of visualization only. During regular capture, we try to\nminimize any sources of ambient light. We encourage the reader to view the supplemental video to\nsee the data and our results in motion.\n\n3.1 Problem Formulation\n\nThe problem addressed in this paper is illustrated in Figure 1. 
We observe a video Z of, for example,\na pile of clutter, while a hidden video L plays on a projector behind the camera. Our aim is to recover\nL by observing the subtle changes in illumination it causes in the observed part of the scene. Such\ncues include e.g. shading variations, and the clutter casting moving shadows in ways that depend on\nthe distribution of incoming light (i.e. the latent image of the surroundings). The problem statement\ndiscussed here holds in more general situations than just the projector-clutter setup, but we focus on\nit to simplify the exposition and experimentation.\nThe key property of light transport we use is that it is linear [13, 26]: if we light up two pixels on\nthe hidden projector image in turn and sum the respective observed images of the clutter scene, the\nresulting image is the same as if we had taken a single photograph with both pixels lit at once. More\ngenerally, for every pixel in the hidden projector image, there is a corresponding response image in\nthe observed scene. When an image is displayed on the projector, the observed image is a weighted\nsum of these responses, with weights given by the intensities of the pixels in the projector image. In\nother words, the image formation is the matrix product\n\nZ = T L\n\n(1)\nwhere Z \u2208 RIJ\u00d7t is the observed image of resolution I \u2217 J at t time instances, and L \u2208 Rij\u00d7t is the\nhidden video of resolution i\u2217 j (of same length t), and the light transport matrix T \u2208 RIJ\u00d7ij contains\nall of the response images in its columns.1 T has no dependence on time, because the camera, scene\ngeometry, and re\ufb02ectance are static.\nThis equation is the model we work with in the remainder of the paper. In the subsequent sections,\nwe make heavy use of the fact that all of these matrices exhibit image-like spatial (and temporal)\ncoherence across their dimensions. 
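The packed-matrix convention of Equation 1 can be sketched numerically. The following toy example (sizes chosen arbitrarily for illustration) verifies that the packed matrix product agrees with the underlying 4-D tensor contraction, with `np.einsum` standing in for the latter:

```python
import numpy as np

# Illustrative sizes: observed resolution I x J, hidden resolution i x j, t frames.
I, J, i, j, t = 4, 5, 3, 3, 7

rng = np.random.default_rng(0)
T4 = rng.random((I, J, i, j))    # 4-D light transport tensor (non-negative)
L3 = rng.random((i, j, t))       # hidden video as a 3-D tensor

# Pack into matrices as in the paper: T is IJ x ij, L is ij x t.
T = T4.reshape(I * J, i * j)
L = L3.reshape(i * j, t)
Z = T @ L                        # observed video, IJ x t  (Eq. 1)

# The same observation, written as the underlying tensor contraction.
Z4 = np.einsum('IJij,ijt->IJt', T4, L3)
assert np.allclose(Z, Z4.reshape(I * J, t))

# Column k of T, unpacked to I x J, is the scene's response to hidden pixel k.
response_image = T[:, 0].reshape(I, J)
```

The last line illustrates the "one hidden pixel at a time" reading of the columns of T.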
A useful intuition for understanding the matrix T is that by\nviewing each of its columns in turn, one would see the scene as if illuminated by just a single pixel of\nthe hidden image.\n\n3.2 Inversion with Known Light Transport Matrix\n\nWe \ufb01rst describe a baseline method for inferring the hidden video from the observed one in the\nnon-blind case. In addition to the observed video Z, we assume that we have previously measured\nthe light transport T by lighting up the projector pixels individually in sequence and recording the\nresponse in the observed scene. We discretize the projector into 32 \u00d7 32 square pixels, corresponding\nto i = j = 32. As we now know two of the three matrices in Equation 1, we can recover L as a\nsolution to the least-squares problem argmin_L ||T L \u2212 Z||_2^2. We augment this with a penalty on spatial\ngradients.1 In practice, we measure T in a mixed DCT and PCA basis to improve the signal-to-noise\nratio, and invert the basis transformation post-solve. We also subtract a black frame with all projector\npixels off from all measured images to eliminate the effect of ambient light.\n\n1Throughout this paper, we use notation such as T \u2208 RIJ\u00d7ij to imply that T is, in principle, a 4-dimensional\ntensor with dimensions I, J, i and j, and that we have packed it into a 2-dimensional tensor (matrix) by stacking\nthe I and J dimensions together into the columns, and i and j in the rows. Thus, when we refer to e.g. columns\nof this matrix as images, we are really referring to the unpacked I \u00d7 J array corresponding to that column.\n\n\fFigure 2: Reconstructions with known light transport matrix.\n\nFigure 2 shows the result of this inversion for two datasets, one where a video was projected on the\nwall, and the other in a live-action scenario. The recovered video matches the ground truth, although\nunsurprisingly it is less sharp. 
This non-blind solution provides a sense of an upper bound for the much\nharder blind case.\nThe amount of detail recoverable depends on the content of the observed scene and the corresponding\ntransport matrix [23]. In this and subsequent experiments, we consider the case where the objects in\nthe observed scene have some mutual occlusion, so that they cast shadows onto one another. This\nimproves the conditioning of the light transport matrix. In contrast, for example, a plain diffuse wall\nwith no occluders would act as a strong low-pass \ufb01lter and eliminate the high-frequency details in the\nhidden video. However, we stress that we do not rely on explicit knowledge of the geometry, nor on\ndirect shadows being cast by the hidden scene onto the observed scene.\n\n4 Deep Image Prior based Matrix Factorization\n\nOur goal is to recover the latent factors when we do not know the light transport matrix. In this section,\nwe describe a novel matrix factorization method that uses the Deep Image Prior [37] to encourage\nnatural-image-like structure in the factor matrices. We \ufb01rst describe numerical experiments with\none-dimensional toy versions of the light transport problem, as well as general image-like matrices.\nWe also demonstrate the failure of classical methods to solve this problem. Applications to real light\ntransport con\ufb01gurations will be described in the next section.\n\n4.1 Problem Formulation\n\nIn many inference problems, it is known that the observed quantities are formed as a product of latent\nmatrices, and the task is to recover these factors. Matrix factorization techniques seek to decompose\na matrix Z into a product Z \u2248 T L (using our naming conventions from Section 3), either exactly\nor approximately. 
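The non-uniqueness is easy to demonstrate numerically: as noted in the introduction, fixing one factor arbitrarily (as a full-rank matrix) and pseudoinverting yields an exact, yet meaningless, factorization of any given matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
Z = rng.random((30, 40))          # any observed matrix

# Pick an arbitrary full-rank "transport" factor...
T = rng.standard_normal((30, 30))
# ...and recover a compatible second factor by pseudoinversion.
L = np.linalg.pinv(T) @ Z
assert np.allclose(T @ L, Z)      # an exact factorization, unrelated to the truth
```

Any invertible choice of T works equally well here, which is precisely why the data alone cannot identify the true factors.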
The key dif\ufb01culty is that a very large space of different factorizations satisfies any\ngiven matrix.\nThe most common solution is to impose additional priors on the factors, for example, non-negativity [19, 40] and spatial continuity [21]. They are tailored on a per-problem basis, and\noften capture the desired properties (e.g. likeness to a natural image) only in a loose sense. Much\nof the work in nonnegative matrix factorization assumes that the factor matrices are low-rank rather\nthan image-like.\nThe combination of being severely ill-posed and non-convex makes matrix factorization highly sensitive\nto not only the initial guess, but also the dynamics of the optimization. While this makes analysis\nhard, it also presents an opportunity: by shaping the loss landscape via suitable parameterization, we\ncan guide the steps towards better local optima. This is a key motivation behind our method.\n\n4.2 Method\n\nWe are inspired by the Deep Image Prior [37] and Double-DIP [6], where an image or a pair of images\nis parameterized via convolutional neural networks that are optimized in a one-off manner for each\ntest instance. We propose to use a pair of one-off trained CNNs to generate the two factor matrices\nin our problem (Figure 3).\n\n\fFigure 3: High level overview of our matrix factorization approach. The CNNs are initialized\nrandomly and \u201cover\ufb01tted\u201d to map two vectors of noise onto two matrices T and L, with the goal of\nmaking their product match the input matrix Z. In contrast to optimizing directly for the entries of T\nand L, this procedure regularizes the factorization to prefer image-like structure in these matrices.\n\nWe start with two randomly initialized CNNs, each one outputting a\nrespective matrix L and T . Similarly to [37], these CNNs are not trained from pairs of input/output\nlabeled data, but are trained only once and speci\ufb01cally to the one target matrix. 
The optimization\nadjusts the weights of these networks with the objective of making the product of their output matrices\nidentical to the target matrix being factorized. The key idea is that the composition of convolutions\nand pointwise nonlinearities has an inductive bias towards generating image-like structure, and\ntherefore is more likely to result in factors that have the appearance of natural images. The general\nformulation of our method is the minimization problem\n\nargmin_{\u03b8,\u03c6} d(T(NT; \u03b8)L(NL; \u03c6), Z)\n\n(2)\n\nwhere Z \u2208 Rh\u00d7w is the matrix we are looking to factorize, and T : RnT \u2192 Rh\u00d7q and L : RnL \u2192\nRq\u00d7w are functions implemented by convolutional neural networks, parametrized by weights \u03b8 and\n\u03c6, respectively. These are the optimization variables. q is a chosen inner dimension of the factors (by\ndefault the full rank choice q = min(w, h)). d : Rh\u00d7w \u00d7 Rh\u00d7w \u2192 R is any problem-speci\ufb01c loss\nfunction, e.g. a pointwise difference between matrices. The inputs NT \u2208 RnT and NL \u2208 RnL to the\nnetworks are typically \ufb01xed vectors of random noise. Optionally, the values of these vectors may be\nset as learnable parameters. They can also subsume other problem-speci\ufb01c inputs to the network, e.g.\nwhen one has access to auxiliary images that may guide the network in performing its task. The exact\ndesign of the networks and the loss function is problem-speci\ufb01c.\n\n4.3 Experiments and Results\n\nWe test the CNN-based factorization approach on synthetically generated tasks, where the input\nis a product of a pair of known ground truth matrices. 
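A minimal sketch of this one-off optimization follows, assuming PyTorch. The tiny 1-D convolutional generators below are placeholders for the networks detailed in the supplemental appendix, and the plain L1 loss stands in for the problem-specific d:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
h, w, q = 24, 24, 12                 # matrix sizes and inner dimension q

def generator(rows, length):
    # Maps a fixed noise tensor of shape (1, 8, length) to a rows x length matrix.
    return nn.Sequential(
        nn.Conv1d(8, 32, 3, padding=1), nn.ReLU(),
        nn.Conv1d(32, rows, 3, padding=1),
    )

net_T, net_L = generator(h, q), generator(q, w)
N_T, N_L = torch.randn(1, 8, q), torch.randn(1, 8, w)   # fixed random inputs
Z = torch.rand(h, w)                                     # target matrix to factorize

opt = torch.optim.Adam(
    list(net_T.parameters()) + list(net_L.parameters()), lr=1e-2)
losses = []
for _ in range(300):
    T = torch.exp(net_T(N_T)[0])     # (h, q); exp output keeps the factor non-negative
    L = torch.exp(net_L(N_L)[0])     # (q, w)
    loss = (T @ L - Z).abs().mean()  # stand-in for the problem-specific loss d
    opt.zero_grad(); loss.backward(); opt.step()
    losses.append(loss.item())
```

Note that only the network weights are trained; the noise inputs stay fixed, so the factors are entirely encoded in the weights.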
We use both toy data that simulates the\ncharacteristics of light transport and video matrices, and general natural images.\nWe design the generator neural networks T and L as identically structured sequences of convolutions,\nnonlinearities and upsampling layers, detailed in the supplemental appendix. To ensure non-negativity,\nthe \ufb01nal layer activations are exponential functions. Inspired by [22], we found it useful\nto inject the pixel coordinates as auxiliary feature channels on every layer. Our loss function is\nd(x, y) = ||\u2207(x \u2212 y)||1 + w||x \u2212 y||1 where \u2207 is the \ufb01nite difference operator along the spatial\ndimensions, and w is a small weight; the heavy emphasis on local image gradients appears to aid\nthe optimization. We use Adam [14] as the optimization algorithm. The details can be found in the\nsupplemental appendix.\nFigure 4 shows a pair of factorization results, demonstrating that our method is able to extract\nimages similar to the original factors. We are not aware of similar results in the literature; as a\nbaseline we attempt these factorizations with the DIP disabled, and with standard non-negative matrix\nfactorization. These methods fail to produce meaningful results.\n\n\fFigure 4: Matrix factorization results. The input to the method is a product of the two leftmost\nmatrices. Our method \ufb01nds visually readable factors, and e.g. recovers all three of the faint curves\non the \ufb01rst example. On the right, we show two different baselines: one computed with Matlab\u2019s\nnon-negative matrix factorization (in alternating least squares mode), and one using our code but\noptimizing directly on matrix entries instead of using the CNN, with an L1 smoothness prior.\n\n4.4 Distortions and Failure Modes\n\nThe factor matrices are often warped or \ufb02ipped. 
This stems from ambiguities in the factorization task,\nas the factor matrices can express mutually cancelling distortions. However, the DIP tends to strongly\ndiscourage distortions that break spatial continuity and scramble the images.\nMore speci\ufb01cally, the space of ambiguities and possible distortions can be characterized as follows\n[16]. Let T0 and L0 be the true underlying factors, the observed video thus being Z = T0L0. All\nvalid factorizations are then of the form T = T0A\u2020 and L = AL0, where A is an arbitrary full-rank\nmatrix, and A\u2020 is its inverse. This can be seen by substituting T L = (T0A\u2020)(AL0) = T0(A\u2020A)L0 =\nT0L0 = Z.\nThe result of any factorization implicitly corresponds to some choice of A (and A\u2020). In simple\nand relatively harmless cases (in that they do not destroy the legibility of the images), the matrix\nA can represent e.g. a permutation that \ufb02ips the image, whence A\u2020 is a \ufb02ip that restores the\noriginal orientation. They can also represent reciprocal intensity modulations, meaning that there\nis a fundamental ambiguity about the intensity of the factors. However, for classical factorization\nmethods, the matrices tend to consist of unstructured \u201cnoise\u201d that scrambles the image-like structure\nin T0 and L0 beyond recognition. 
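This ambiguity class is easy to verify numerically: any full-rank A produces an equally valid factorization, whether it is a harmless permutation or a generic matrix that scrambles the structure. A short sketch with arbitrary toy sizes:

```python
import numpy as np

rng = np.random.default_rng(3)
# Ground-truth factors (sizes arbitrary for illustration).
T0 = rng.random((20, 8))
L0 = rng.random((8, 15))
Z = T0 @ L0

# A flip permutation: one of the "harmless" distortions discussed above.
A = np.eye(8)[::-1]                  # reverses the hidden-pixel ordering
T, L = T0 @ np.linalg.inv(A), A @ L0
assert np.allclose(T @ L, Z)

# A generic random A is equally valid, but scrambles the image-like structure.
A = rng.standard_normal((8, 8))
T, L = T0 @ np.linalg.inv(A), A @ L0
assert np.allclose(T @ L, Z)
```

Both factorizations reproduce Z exactly; only priors on the factors themselves can distinguish between them.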
Our \ufb01nding is that the use of DIP discourages such factorizations.\n\n5 Blind Light Transport Factorization\n\nWe now combine the ideas from the previous two sections and present a method for tackling the\ninverse light transport problem blindly, when we have no access to a measured light transport matrix.\nWe show results on both synthetic and real data, and study the behavior of the method experimentally.\n\n5.1 Method\nSetup Continuing from Section 3, our goal is to factor the observed video Z \u2208 RIJ\u00d7t of I \u2217 J\npixels and t frames, into a product of two matrices: the light transport T \u2208 RIJ\u00d7ij, and the hidden\nvideo L \u2208 Rij\u00d7t. The hidden video is of resolution i \u2217 j, with i = j = 16. Most of our input videos\nare of size I = 96 (height), J = 128 (width), and t ranging from roughly 500 to 1500 frames.\nFollowing our approach in Section 4, the task calls for designing two convolutional neural networks\nthat generate the respective matrices. Note that T can be viewed as a 4-dimensional I \u00d7 J \u00d7 i \u00d7 j\ntensor, and likewise L can be seen as a 3-dimensional i\u00d7j\u00d7t tensor. We design the CNNs to generate\nthe tensors in these shapes, and in a subsequent network operation reshape the results into the stacked\nmatrix representation, so as to evaluate the matrix product. The dimensionality of the convolutional\n\ufb01lters determines which dimensions in the result are bound together with image structure. In the\nfollowing, we describe the networks generating the factors. An overview of our architecture is shown\nin Figure 5.\n\nHidden Video Generator Network The hidden video tensor L should exhibit image-like structure\nalong all of its three dimensions. Therefore a natural model is to use 3D convolutional kernels in the\nnetwork L that generates it. 
Aside from its dimensionality, the network follows a similar sequential up-scaling design as that discussed in Section 4. It is illustrated in Figure 5 and detailed in the\nsupplemental appendix.\n\n\fFigure 5: An overview of the architecture and data \ufb02ow of our blind inverse light transport method.\nAlso shown (bottom left) are examples of the left singular vectors stored in U. L and Q are\nconvolutional neural networks, and the remainder of the blocks are either multidimensional tensors\nor matrices, with dimensions shown at the edges. The matrices in the shaded region are computed\nonce during initialization. The input Z to the method is shown in the lower right corner.\n\nLight Transport Generator Network The light transport tensor T , likewise, exhibits image\nstructure between all its dimensions, which in principle would call for the use of 4D convolutions.\nUnfortunately these are very slow to evaluate, and not implemented in most CNN frameworks. We\nalso initially experimented with alternating between 2D convolutions along I, J dimensions and i, j\ndimensions, and otherwise following the same sequential up-scaling design. While we reached some\nsuccess with this design, we found a markedly different architecture to work better.\nThe idea is to express the slices of T as linear combinations of basis images obtained from the\nsingular value decomposition (SVD) of the input video. This is both computationally ef\ufb01cient and\nguides the optimization by constraining the iterates and the solution to lie in the subspace of valid\nfactorizations. 
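The subspace argument behind this design can be checked in a few lines. The sketch below uses toy sizes and keeps as many singular vectors as the rank of Z (the paper truncates to s = 32); because the column space of Z coincides with that of the true transport, projecting the transport onto the span of the left singular vectors recovers it up to numerical precision:

```python
import numpy as np

rng = np.random.default_rng(4)
IJ, ij, t = 200, 16, 120
T0 = rng.random((IJ, ij))           # ground-truth transport
L0 = rng.random((ij, t))            # ground-truth hidden video
Z = T0 @ L0                         # observed video, rank <= ij

U, S, Vt = np.linalg.svd(Z, full_matrices=False)
s = ij                              # keep as many components as the rank of Z
U = U[:, :s]

# Projection onto span(U) recovers the transport: T0 = U Q with Q = U^T T0.
Q = U.T @ T0
rel_err = np.linalg.norm(U @ Q - T0) / np.linalg.norm(T0)
```

Here Q is solved by direct projection only for verification; in the method, Q is produced by a CNN and optimized jointly with the video generator.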
Intuitively, the basis expresses a frequency-like decomposition of shadow motions and\nother effects in the video, as shown in Figure 5.\nWe begin by precomputing the truncated singular value decomposition U \u03a3V T of the input video Z\n(with the highest s = 32 singular values), and aim to express the columns of T as linear combinations\nof the left singular vectors U \u2208 RIJ\u00d7s. The individual singular vectors have the dimensions I \u00d7 J\nof the input video. These vectors form an appropriate basis for constructing the physical impulse\nresponse images in T , as the column space of Z coincides with that of T due to them being\nrelated by right-multiplication. 2\nWe denote the linear combination by a matrix Q \u2208 Rs\u00d7ij. The task boils down to \ufb01nding Q such that\n(U Q)L \u2248 Z. Here L comes from the DIP-CNN described earlier. While one could optimize for the\nentries of Q directly, we again found that generating Q using a CNN produced signi\ufb01cantly improved\nresults. For this purpose, we use a CNN that performs 2D convolutions in the ij-dimension, but not\nacross s, as only the former dimension is image-valued. In summary, the full minimization problem\nbecomes a variant of Eq. 2:\n\nargmin_{\u03b8,\u03c6} d(U\u221a\u03a3 Q(NQ; \u03b8)L(NL; \u03c6), Z)\n\n(3)\n\n2Strictly speaking, some dimensions of the true T may be lost in the numerical null space of Z (or to the\ntruncated singular vectors) if the light transport \u201cblurs\u201d the image suf\ufb01ciently, making it impossible to exactly\nreproduce the T from U. In practice we \ufb01nd that at the resolutions we realistically target, this does not prevent\nus from obtaining meaningful factorizations.\n\n\fFigure 6: Blind light transport factorization using our method. 
The \ufb01rst three sequences are projected\nonto a wall behind the camera. The Lego sequence is performed live in front of the illuminated wall.\n\nHere Q implements the CNN just described. The somewhat inconsequential additional scaling term \u221a\u03a3\noriginates from our choice to distribute the singular value magnitudes equally between the left and\nright singular vectors.\n\nImplementation Details The optimization is run using the Adam [14] algorithm, simultaneously\noptimizing over the parameters of Q and L. The loss function is a sum of a pointwise difference and a\nheavily weighted temporal gradient difference between Z and the reconstruction T L. Details are in the\nsupplemental appendix. We extend the method to color by effectively treating it as three separate\nproblems for R, G, and B; however, they become closely tied as the channels are generated by the\nsame neural network as a 3-dimensional output. We also penalize color saturation in the transport\nmatrix to encourage the network to explain colors with the hidden video. To ensure non-negativity,\nwe use a combination of exponentiations and tanh functions as output activations for the network\nL. For T we penalize negative values with a prior. We also found it useful to inject pixel and time\ncoordinates as auxiliary feature maps, and to multiply a Hann window function onto the feature\nmaps at certain intermediate layers. These introduce forced spatial variation into the feature\nmaps and help the networks to rapidly break symmetry in early iterations.\n\n5.2 Experiments and Results\n\nWe test our method with multiple video datasets collected using a projector setup (as described\nin Section 3) recorded in different scenes with different hidden projected videos (Figure 6). 
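The ingredients described above (the Eq. 3 objective, a non-negative video generator, and a temporally weighted loss) can be condensed into a toy sketch, again assuming PyTorch; the small conv nets are placeholders for the paper's generators Q and L, and a low-rank synthetic video stands in for a real capture:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
IJ, ij, t, s = 64, 9, 40, 6

# Synthetic low-rank "observed video" so this toy problem is actually solvable.
Z = torch.rand(IJ, s) @ torch.rand(s, t)

U, S, _ = torch.linalg.svd(Z, full_matrices=False)
U, sqrtS = U[:, :s], S[:s].sqrt()         # truncated SVD basis, sqrt(Sigma) split

net_Q = nn.Sequential(nn.Conv1d(4, 16, 3, padding=1), nn.ReLU(),
                      nn.Conv1d(16, s, 3, padding=1))
net_L = nn.Sequential(nn.Conv1d(4, 16, 3, padding=1), nn.ReLU(),
                      nn.Conv1d(16, ij, 3, padding=1))
N_Q, N_L = torch.randn(1, 4, ij), torch.randn(1, 4, t)  # fixed noise inputs

opt = torch.optim.Adam(
    list(net_Q.parameters()) + list(net_L.parameters()), lr=1e-2)
losses = []
for _ in range(200):
    Q = net_Q(N_Q)[0]                     # (s, ij) coefficient matrix
    L = torch.exp(net_L(N_L)[0])          # (ij, t), non-negative hidden video
    recon = (U * sqrtS) @ Q @ L           # U sqrt(Sigma) Q L, i.e. T L
    diff = recon - Z
    # Pointwise difference plus a heavily weighted temporal gradient difference.
    loss = diff.abs().mean() + 10.0 * (diff[:, 1:] - diff[:, :-1]).abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()
    losses.append(loss.item())
```

The color handling, saturation penalty, coordinate injection and Hann windowing described above are omitted here for brevity.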
We\nencourage the reader to view the supplemental video, as motion is the main focus of this work.\nThe results demonstrate that our method is capable of disentangling the light transport from the\ncontent of the hidden video to produce a readable estimate of the latter. The disk dataset is a controlled\nvideo showing variously complex motions of multiple bright spots. The number of the disks and their\nrelative positions are resolved correctly, up to a spatial warp ambiguity similar to the one discussed in\nSection 4.4. The space of ambiguities in this full 2-dimensional scenario is signi\ufb01cantly larger than\nin the 1-D factorization: the videos can be arbitrarily rotated, \ufb02ipped, shifted and often exhibit some\ndegree of non-linear warping. The color balance between the factors is also ambiguous. As a control\nfor possible unforeseen nonlinearities in the experimental imaging pipeline, we also tested the method\non a semi-synthetic dataset that was generated by explicitly multiplying a measured light transport\nmatrix with the disk video; the results from this synthetic experiment were essentially identical to our\nexperimental results.\nThe other hidden videos in our test set exhibit various degrees of complexity. For example, in hands,\nwe wave a pair of hands back and forth; watching our solved video, the motions and hand gestures are\nclearly recognizable. However, as the scenes become more complex, such as in a long fast-forwarded\nvideo showing colored blocks being variously manipulated (play), the recovered video does show\nsimilarly colored elements moving in correlation with the ground truth, but the overall action is less\nintelligible.\nWe also test our method on the live-action sequence introduced in Section 3. Note that in this scenario\nthe projector plays no role, other than acting as a lamp illuminating the scene. 
While less clear than\nthe baseline solution with a measured transport matrix, our blindly factored solution does still resolve\nthe large-scale movements of the person, including movements of limbs as he waves his hands and\nrotates the Lego blocks.\n\nComparison with Existing Approaches We compare our method to an extension of the deblurring\napproach by Levin et al. in [21]. We believe that blind deconvolution is the closest problem to ours,\nsince it can be seen as a matrix factorization between a convolution matrix and a latent sharp image.\nWe extended their marginalization method to handle general matrices and not just convolution, and\nuse the same sparse derivative prior as them (see the supplementary materials for more details on how\nwe adapted the approach). Figure 6 and the supplementary video show that this approach produces\nvastly inferior reconstructions.\n\n6 Discussion and Conclusions\n\nWe have shown that cluttered scenes can be computationally turned into low-resolution mirrors\nwithout prior calibration. Given a single input video of the visible scene, we can recover a latent\nvideo of the hidden scene as well as a light transport matrix. We have expressed the problem as a\nfactorization of the input video into a transport matrix and a lighting video, and used a deep prior\nconsisting of convolutional neural networks trained in a one-off fashion. We \ufb01nd it remarkable\nthat merely asking for latent factors easily expressible by a CNN is suf\ufb01cient to solve our problem,\nallowing us to entirely bypass challenges such as the estimation of the geometry and re\ufb02ectance\nproperties of the scene.\nBlind inverse light transport is an instance of a more general pattern, where the latent variables of\ninterest (the video) are tangled with another set of latent variables (the light transport), and to get one,\nwe must simultaneously estimate both [16]. 
Our approach shows that, when applicable, identifying and enforcing natural image structure in both terms is a powerful tool. We hope that our method can inspire novel approaches to a wide range of other apparently hopelessly ill-posed problems.

Acknowledgements

This work was supported, in part, by DARPA under Contract No. HR0011-16-C-0030, and by NSF under Grant No. CCF-1816209. The authors wish to thank Luke Anderson for proofreading and helping with the manuscript.

References

[1] Muhammad Asim, Fahad Shamshad, and Ali Ahmed. Blind image deconvolution using deep generative priors. arXiv preprint arXiv:1802.04073, 2018.

[2] Manel Baradad, Vickie Ye, Adam B Yedidia, Frédo Durand, William T Freeman, Gregory W Wornell, and Antonio Torralba. Inferring light fields from shadows. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6267–6275, 2018.

[3] Katherine L Bouman, Vickie Ye, Adam B Yedidia, Frédo Durand, Gregory W Wornell, Antonio Torralba, and William T Freeman. Turning corners into cameras: Principles and methods. In International Conference on Computer Vision, volume 1, page 8, 2017.

[4] Zezhou Cheng, Matheus Gadelha, Subhransu Maji, and Daniel Sheldon. A Bayesian perspective on the Deep Image Prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5443–5451, 2019.

[5] Rob Fergus, Barun Singh, Aaron Hertzmann, Sam T Roweis, and William T Freeman. Removing camera shake from a single photograph. In ACM Transactions on Graphics (TOG), volume 25, pages 787–794. ACM, 2006.

[6] Yossi Gandelsman, Assaf Shocher, and Michal Irani. "Double-DIP": Unsupervised image decomposition via coupled deep-image-priors. In Computer Vision and Pattern Recognition (CVPR), 2019 IEEE Conference on, 2019.

[7] Gaurav Garg, Eino-Ville Talvala, Marc Levoy, and Hendrik PA Lensch.
Symmetric photography: Exploiting data-sparseness in reflectance fields. In Rendering Techniques, pages 251–262, 2006.

[8] Genevieve Gariepy, Francesco Tonolini, Robert Henderson, Jonathan Leach, and Daniele Faccio. Detection and tracking of moving objects hidden from view. Nature Photonics, 10(1):23–26, 2016.

[9] Reinhard Heckel and Paul Hand. Deep decoder: Concise image representations from untrained non-convolutional networks. In International Conference on Learning Representations, 2019.

[10] Felix Heide, Lei Xiao, Wolfgang Heidrich, and Matthias B Hullin. Diffuse mirrors: 3D reconstruction from diffuse indirect illumination using inexpensive time-of-flight sensors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3222–3229, 2014.

[11] Patrik O Hoyer. Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research, 5(Nov):1457–1469, 2004.

[12] Stuart M Jefferies and Julian C Christou. Restoration of astronomical images by iterative blind deconvolution. The Astrophysical Journal, 415:862, 1993.

[13] James T Kajiya. The rendering equation. In ACM SIGGRAPH Computer Graphics, volume 20, pages 143–150. ACM, 1986.

[14] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

[15] Jonathan Klein, Christoph Peters, Jaime Martín, Martin Laurenzis, and Matthias B Hullin. Tracking objects outside the line of sight using 2D intensity images. Scientific Reports, 6:32491, 2016.

[16] Jan J. Koenderink and Andrea J. Van Doorn. The generic bilinear calibration-estimation problem. Int. J. Comput. Vision, 23(3):217–234, June 1997.

[17] Raul Kompass. A generalized divergence measure for nonnegative matrix factorization.
Neural Computation, 19(3):780–791, 2007.

[18] Dilip Krishnan, Terence Tay, and Rob Fergus. Blind deconvolution using a normalized sparsity measure. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 233–240. IEEE, 2011.

[19] Daniel D Lee and H Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788, 1999.

[20] Anat Levin, Yair Weiss, Fredo Durand, and William Freeman. Understanding and evaluating blind deconvolution algorithms. In Computer Vision and Pattern Recognition (CVPR), 2009 IEEE Conference on, 2009.

[21] Anat Levin, Yair Weiss, Fredo Durand, and William T Freeman. Efficient marginal likelihood optimization in blind deconvolution. In CVPR 2011, pages 2657–2664. IEEE, 2011.

[22] Rosanne Liu, Joel Lehman, Piero Molino, Felipe Petroski Such, Eric Frank, Alex Sergeev, and Jason Yosinski. An intriguing failing of convolutional neural networks and the CoordConv solution. In Advances in Neural Information Processing Systems, pages 9605–9616, 2018.

[23] Dhruv Mahajan, Ira Kemelmacher Shlizerman, Ravi Ramamoorthi, and Peter Belhumeur. A theory of locally low-dimensional light transport. ACM Trans. Graph., 26(3), July 2007.

[24] Rohit Pandharkar, Andreas Velten, Andrew Bardagjy, Everett Lawson, Moungi Bawendi, and Ramesh Raskar. Estimating motion and size of moving non-line-of-sight objects in cluttered environments. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 265–272. IEEE, 2011.

[25] Pieter Peers, Dhruv K Mahajan, Bruce Lamond, Abhijeet Ghosh, Wojciech Matusik, Ravi Ramamoorthi, and Paul Debevec. Compressive light transport sensing. ACM Transactions on Graphics (TOG), 28(1):3, 2009.

[26] Matt Pharr, Wenzel Jakob, and Greg Humphreys. Physically Based Rendering: From Theory to Implementation. Morgan Kaufmann, 2016.

[27] Dongwei Ren, Kai Zhang, Qilong Wang, Qinghua Hu, and Wangmeng Zuo.
Neural blind deconvolution using deep priors. arXiv preprint arXiv:1908.02197, 2019.

[28] Tara N Sainath, Brian Kingsbury, Vikas Sindhwani, Ebru Arisoy, and Bhuvana Ramabhadran. Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6655–6659. IEEE, 2013.

[29] Ruslan Salakhutdinov and Andriy Mnih. Bayesian probabilistic matrix factorization using Markov Chain Monte Carlo. In Proceedings of the 25th International Conference on Machine Learning, pages 880–887. ACM, 2008.

[30] Charles Saunders, John Murray-Bruce, and Vivek K Goyal. Computational periscopy with an ordinary digital camera. Nature, 565(7740):472, 2019.

[31] Timothy J Schulz. Multiframe blind deconvolution of astronomical images. JOSA A, 10(5):1064–1073, 1993.

[32] Pradeep Sen, Billy Chen, Gaurav Garg, Stephen R Marschner, Mark Horowitz, Marc Levoy, and Hendrik Lensch. Dual photography. ACM Transactions on Graphics (TOG), 24(3):745–755, 2005.

[33] SK Sharma and Srilekha Banerjee. Role of approximate phase functions in Monte Carlo simulation of light propagation in tissues. Journal of Optics A: Pure and Applied Optics, 5(3):294, 2003.

[34] Shikhar Shrestha, Felix Heide, Wolfgang Heidrich, and Gordon Wetzstein. Computational imaging with multi-camera time-of-flight systems. ACM Transactions on Graphics (TOG), 35(4):33, 2016.

[35] Antonio Torralba and William T. Freeman. Accidental pinhole and pinspeck cameras. International Journal of Computer Vision, 110(2):92–112, Nov 2014.

[36] George Trigeorgis, Konstantinos Bousmalis, Stefanos Zafeiriou, and Björn W Schuller. A deep matrix factorization method for learning attribute representations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(3):417–429, 2016.

[37] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky.
Deep image prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9446–9454, 2018.

[38] David Van Veen, Ajil Jalal, Eric Price, Sriram Vishwanath, and Alexandros G Dimakis. Compressed sensing with deep image prior and learned regularization. arXiv preprint arXiv:1806.06438, 2018.

[39] Andreas Velten, Thomas Willwacher, Otkrist Gupta, Ashok Veeraraghavan, Moungi G Bawendi, and Ramesh Raskar. Recovering three-dimensional shape around a corner using ultrafast time-of-flight imaging. Nature Communications, 3:745, 2012.

[40] Tuomas Virtanen. Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria. IEEE Transactions on Audio, Speech, and Language Processing, 15(3):1066–1074, 2007.

[41] Jiaping Wang, Yue Dong, Xin Tong, Zhouchen Lin, and Baining Guo. Kernel Nyström method for light transport. ACM Transactions on Graphics (TOG), 28(3):29, 2009.

[42] Lu Xia, Chia-Chih Chen, and Jake K Aggarwal. Human detection using depth information by Kinect. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2011 IEEE Computer Society Conference on, pages 15–22. IEEE, 2011.

[43] Feihu Xu, Dongeek Shin, Dheera Venkatraman, Rudi Lussana, Federica Villa, Franco Zappa, Vivek K Goyal, Franco Wong, and Jeffrey Shapiro. Photon-efficient computational imaging with a single-photon camera. In Computational Optical Sensing and Imaging, pages CW5D–4. Optical Society of America, 2016.

[44] Hong-Jian Xue, Xinyu Dai, Jianbing Zhang, Shujian Huang, and Jiajun Chen. Deep matrix factorization models for recommender systems.
In IJCAI, pages 3203–3209, 2017.