{"title": "Grouping-Based Low-Rank Trajectory Completion and 3D Reconstruction", "book": "Advances in Neural Information Processing Systems", "page_first": 55, "page_last": 63, "abstract": "Extracting 3D shape of deforming objects in monocular videos, a task known as non-rigid structure-from-motion (NRSfM), has so far been studied only on synthetic datasets and controlled environments. Typically, the objects to reconstruct are pre-segmented, they exhibit limited rotations and occlusions, or full-length trajectories are assumed. In order to integrate NRSfM into current video analysis pipelines, one needs to consider as input realistic -thus incomplete- tracking, and perform spatio-temporal grouping to segment the objects from their surroundings. Furthermore, NRSfM needs to be robust to noise in both segmentation and tracking, e.g., drifting, segmentation ``leaking'', optical flow ``bleeding'' etc. In this paper, we make a first attempt towards this goal, and propose a method that combines dense optical flow tracking, motion trajectory clustering and NRSfM for 3D reconstruction of objects in videos. For each trajectory cluster, we compute multiple reconstructions by minimizing the reprojection error and the rank of the 3D shape under different rank bounds of the trajectory matrix. We show that dense 3D shape is extracted and trajectories are completed across occlusions and low textured regions, even under mild relative motion between the object and the camera. We achieve competitive results on a public NRSfM benchmark while using fixed parameters across all sequences and handling incomplete trajectories, in contrast to existing approaches. We further test our approach on popular video segmentation datasets. 
To the best of our knowledge, our method is the first to extract dense object models from realistic videos, such as those found on YouTube or in Hollywood movies, without object-specific priors.", "full_text": "Grouping-Based Low-Rank Trajectory Completion and 3D Reconstruction

Katerina Fragkiadaki, EECS, University of California, Berkeley, CA 94720, katef@berkeley.edu
Marta Salas, Universidad de Zaragoza, Zaragoza, Spain, msalasg@unizar.es
Pablo Arbeláez, Universidad de los Andes, Bogotá, Colombia, pa.arbelaez@uniandes.edu.co
Jitendra Malik, EECS, University of California, Berkeley, CA 94720, malik@eecs.berkeley.edu

Abstract

Extracting the 3D shape of deforming objects in monocular videos, a task known as non-rigid structure-from-motion (NRSfM), has so far been studied only on synthetic datasets and in controlled environments. Typically, the objects to reconstruct are pre-segmented, they exhibit limited rotations and occlusions, or full-length trajectories are assumed. In order to integrate NRSfM into current video analysis pipelines, one needs to consider as input realistic, and thus incomplete, tracking, and to perform spatio-temporal grouping to segment the objects from their surroundings. Furthermore, NRSfM needs to be robust to noise in both segmentation and tracking, e.g., drifting, segmentation “leaking”, optical flow “bleeding”, etc. In this paper, we make a first attempt towards this goal, and propose a method that combines dense optical flow tracking, motion trajectory clustering and NRSfM for 3D reconstruction of objects in videos. For each trajectory cluster, we compute multiple reconstructions by minimizing the reprojection error and the rank of the 3D shape under different rank bounds of the trajectory matrix.
We show that dense 3D shape is extracted and trajectories are completed across occlusions and low-textured regions, even under mild relative motion between the object and the camera. We achieve competitive results on a public NRSfM benchmark while using fixed parameters across all sequences and handling incomplete trajectories, in contrast to existing approaches. We further test our approach on popular video segmentation datasets. To the best of our knowledge, our method is the first to extract dense object models from realistic videos, such as those found on YouTube or in Hollywood movies, without object-specific priors.

1 Introduction

Structure-from-motion is the ability to perceive the 3D shape of objects solely from motion cues. It is considered the earliest form of depth perception in primates, and is believed to be used by animals that lack stereopsis, such as insects and fish [1].

In computer vision, non-rigid structure-from-motion (NRSfM) is the extraction of a time-varying 3D point cloud from its 2D point trajectories. The problem is under-constrained, since many 3D time-varying shapes and camera poses give rise to the same 2D image projections. To tackle this ambiguity, the early work of Bregler et al. [2] assumes the per-frame 3D shapes lie in a low-dimensional subspace. They recover the 3D shape basis and coefficients, along with the camera rotations, using a 3K factorization of the 2D trajectory matrix, where K is the dimension of the shape subspace, extending the rank-3 factorization method for rigid SfM of Tomasi and Kanade [3]. Akhter et al. [4] observe that the 3D point trajectories admit a similar low-rank decomposition: they can be written as linear combinations over a 3D trajectory basis. This essentially reflects that 3D (and 2D) point trajectories are temporally smooth. Temporal smoothness is directly imposed using differentials over the 3D shape matrix in Dai et al. [5]. Further, rather than recovering the shape or trajectory basis and coefficients, the authors propose a direct rank minimization of the 3D shape matrix, and show superior reconstruction results.

Figure 1: Overview. Given a monocular video, we cluster dense flow trajectories using 2D motion similarities. Each trajectory cluster results in an incomplete trajectory matrix that is the input to our NRSfM algorithm. Present and missing trajectory entries for the chosen frames are shown in green and red, respectively. The color of the points in the rightmost column represents depth values (red is close, blue is far). Notice the completion of the occluded trajectories on the belly dancer, which reside beyond the image border.

Despite such progress, NRSfM has so far been demonstrated only on a limited number of synthetic or lab-acquired video sequences. Factors that limit the application of current approaches to real-world scenarios include:

(i) Missing trajectory data. The aforementioned state-of-the-art NRSfM algorithms assume complete trajectories. This is an unrealistic assumption under object rotations, deformations or occlusions. The work of Torresani et al. [6] relaxes the full-length trajectory assumption. They impose a Gaussian prior over the 3D shape and use probabilistic PCA within a linear dynamical system for extracting 3D deformation modes and camera poses; however, their method is sensitive to initialization and degrades with the amount of missing data. Gotardo and Martinez [7] combine the shape and trajectory low-rank decompositions and can handle missing data; their method is one of our baselines in Section 3. Park et al. [8] use static background structure to estimate camera poses and handle missing data using a linear formulation over a predefined trajectory basis. Simon et al. [9] consider a probabilistic formulation of the bilinear basis model of Akhter et al.
[10] over the non-rigid 3D shape deformations. This results in a matrix normal distribution for the time-varying 3D shape, with a Kronecker-structured covariance matrix whose column and row covariances describe shape and temporal correlations, respectively. Our work makes no assumptions regarding temporal smoothness, in contrast to [8, 7, 9].

(ii) Requirement of accurate video segmentation. The low-rank priors typically used in NRSfM require the object to be segmented from its surroundings. The work of [11] is the only approach that attempts to combine video segmentation and reconstruction, rather than considering pre-segmented objects. The authors projectively reconstruct small trajectory clusters, assuming they capture rigidly moving object parts. Reconstruction results are shown on three videos only, making it hard to judge the success of this locally rigid model.

This paper aims at closing the gap between theory and application in object-agnostic NRSfM from realistic monocular videos. We build upon recent advances in tracking, video segmentation and low-rank matrix completion to extract 3D shapes of objects in videos under rigid and non-rigid motion. We assume a scaled orthographic camera model, as is standard in the literature [12, 13], and low-rank, object-independent shape priors for the moving objects. Our goal is a richer representation of the video segments in terms of rotations and 3D deformations, and temporal completion of their trajectories through occlusion gaps or tracking failures.

An overview of our approach is presented in Figure 1. Given a video sequence, we compute dense point trajectories and cluster them using 2D motion similarities. For each trajectory cluster, we first complete the 2D trajectory matrix using standard low-rank matrix completion.
We then recover the camera poses through a rank-3 truncation of the trajectory matrix and a Euclidean upgrade. Last, keeping the camera poses fixed, we minimize the reprojection error of the observed trajectory entries along with the nuclear norm of the 3D shape. A byproduct of affine NRSfM is trajectory completion: the recovered 3D time-varying shape is backprojected into the image, and the resulting 2D trajectories are completed through deformations, occlusions or other tracking ambiguities, such as lack of texture. In summary, our contributions are:

(i) Joint study of motion segmentation and structure-from-motion. We use as input to reconstruction dense trajectories from optical flow linking [14], as opposed to a) sparse corner trajectories [15], used in previous NRSfM works [4, 5], or b) the subspace trajectories of [16, 17], which are full-length but cannot tolerate object occlusions. Reconstruction needs to be robust to segmentation mistakes. Motion trajectory clusters are inevitably polluted with “bleeding” trajectories that, although they reside on the background, anchor on occluding contours. We use morphological operations to discard such trajectories, which do not belong to the shape subspace and confuse reconstruction.

(ii) Multiple-hypothesis 3D reconstruction through trajectory matrix completion under various rank bounds, for tackling the rank ambiguity.

(iii) We show that, under high trajectory density, a rank-3 factorization of the trajectory matrix, as opposed to 3K, is sufficient to recover the camera rotations in NRSfM. This allows the use of an easy, well-studied Euclidean upgrade for the camera rotations, similar to the one proposed for rigid SfM [3].

We present competitive results of our method on the recently proposed NRSfM benchmark of [17], under a fixed set of parameters and while handling incomplete trajectories, in contrast to existing approaches.
Further, we present extensive reconstruction results on videos from two popular video segmentation benchmarks, VSB100 [18] and Moseg [19], which contain videos from Hollywood movies and YouTube. To the best of our knowledge, we are the first to show dense non-rigid reconstructions of objects from real videos, without employing object-specific shape priors [10, 20]. Our code is available at: www.eecs.berkeley.edu/~katef/nrsfm.

2 Low-rank 3D video reconstruction

2.1 Video segmentation by multiscale trajectory clustering

Given a video sequence, we want to segment the moving objects in the scene. Brox and Malik [19] propose spectral clustering of dense point trajectories from 2D motion similarities and achieve state-of-the-art performance on video segmentation benchmarks. We extend their method to produce a multiscale (rather than single-scale) trajectory clustering, to deal with segmentation ambiguities caused by scale and motion variations of the objects in the video scene. Specifically, we first compute a spectral embedding from the top eigenvectors of the normalized trajectory affinity matrix. We then obtain discrete trajectory clusterings using the discretization method of [21], while varying the number of eigenvectors over 10, 20, 30 and 40 in each video sequence.

Ideally, each point trajectory corresponds to a sequence of 2D projections of a 3D physical point. However, each trajectory cluster is spatially surrounded by a thin layer of trajectories that reside outside the true object mask and do not represent projections of 3D physical points. They are the result of optical flow “bleeding” into untextured surroundings [22], and anchor themselves on occluding contours of the object. Although “bleeding” trajectories do not drift across objects, they are a source of noise for reconstruction, since they do not belong to the subspace spanned by the true object trajectories.
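Such a boundary layer can be stripped with standard morphological operations on the per-frame cluster mask. A minimal sketch using scipy (the function name, the 3×3 structuring element and the boolean occupancy-mask representation are our own assumptions, not details specified in the paper):

```python
import numpy as np
from scipy.ndimage import binary_erosion, binary_opening

def prune_cluster_mask(mask):
    """Opening (erosion then dilation) followed by one extra erosion of a
    per-frame trajectory-cluster occupancy mask, so that points in a thin
    boundary layer fall outside the returned mask and can be dropped.

    mask: boolean H x W array marking pixels covered by cluster trajectories.
    """
    struct = np.ones((3, 3), dtype=bool)  # assumed structuring element
    opened = binary_opening(mask, structure=struct)   # removes thin protrusions
    return binary_erosion(opened, structure=struct)   # shrinks the boundary layer
```

Trajectories whose points fall outside the pruned mask would then be dropped from the cluster before reconstruction.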
We discard them by computing an opening operation (erosion followed by dilation) and an additional erosion of the trajectory cluster mask in each frame.

2.2 Non-rigid structure-from-motion

Given a trajectory cluster that captures an object in space and time, let X^t_k ∈ R^{3×1} denote the 3D coordinates [X Y Z]^T of the kth object point at the tth frame. We represent 3D object shape with a matrix S that contains the time-varying coordinates of the P object surface points in F frames:

$$ S_{3F\times P} = \begin{bmatrix} S^1 \\ \vdots \\ S^F \end{bmatrix} = \begin{bmatrix} X^1_1 & \cdots & X^1_P \\ \vdots & & \vdots \\ X^F_1 & \cdots & X^F_P \end{bmatrix}. $$

For the special case of rigid objects, the shape coordinates are constant and the shape matrix takes the simplified form S_{3×P} = [X_1 X_2 ··· X_P].

We adopt a scaled orthographic camera model for reconstruction [3]. Under orthography, the projection rays are perpendicular to the image plane and the projection equation takes the form x = RX + t, where x = [x y]^T is the vector of 2D pixel coordinates, R_{2×3} is a scaled truncated rotation matrix and t_{2×1} is the camera translation. Combining the projection equations for all object points in all frames, we obtain:

$$ \begin{bmatrix} x^1_1 & \cdots & x^1_P \\ \vdots & & \vdots \\ x^F_1 & \cdots & x^F_P \end{bmatrix} = R \cdot S + \begin{bmatrix} t^1 \\ \vdots \\ t^F \end{bmatrix} \mathbf{1}_P^T, \qquad (1) $$

where the camera pose matrix R takes the form:

$$ R^{\mathrm{rigid}}_{2F\times 3} = \begin{bmatrix} R^1 \\ \vdots \\ R^F \end{bmatrix}, \qquad R^{\mathrm{nonrigid}}_{2F\times 3F} = \begin{bmatrix} R^1 & & 0 \\ & \ddots & \\ 0 & & R^F \end{bmatrix}. \qquad (2) $$

We subtract the camera translation t^t from the pixel coordinates x^t, t = 1···F, fixing the origin of the coordinate system at the object's center of mass in each frame, and obtain the centered trajectory matrix W_{2F×P}, for which W = R · S.

Let W̃ denote an incomplete trajectory matrix of a cluster obtained from our multiscale trajectory clustering, and let H ∈ {0,1}^{2F×P} be a binary matrix that indicates the presence or absence of entries in W̃. Given W̃ and H, we solve for the complete trajectories W, shape S and camera pose R by minimizing the camera reprojection error and the 3D shape rank, under various rank bounds for the trajectory matrix. Rather than minimizing the matrix rank, which is intractable, we minimize the matrix nuclear norm (denoted by ‖·‖_*), which yields the best convex approximation of the matrix rank over the unit ball of matrices. Let ⊙ denote the Hadamard product and ‖·‖_F the Frobenius matrix norm. Our cost function reads:

$$ \mathrm{NRSfM}(K): \quad \min_{W,R,S} \; \|H \odot (W - \tilde{W})\|_F^2 + \|W - R \cdot S\|_F^2 + \mathbf{1}_{K>1}\,\mu \|S^v\|_* \quad \text{s.t.} \quad \mathrm{Rank}(W) \le 3K, \; \exists \alpha^t \; \text{s.t.} \; R^t (R^t)^T = \alpha^t I_{2\times 2}, \; t = 1\cdots F. \qquad (3) $$

We compute multiple reconstructions with K ∈ {1···9}. S^v denotes the re-arranged shape matrix in which each row contains the vectorized 3D shape of one frame:

$$ S^v_{F\times 3P} = \begin{bmatrix} X^1_1 & Y^1_1 & Z^1_1 & \cdots & X^1_P & Y^1_P & Z^1_P \\ \vdots & & & & & & \vdots \\ X^F_1 & Y^F_1 & Z^F_1 & \cdots & X^F_P & Y^F_P & Z^F_P \end{bmatrix} = [P_X \; P_Y \; P_Z]\,(I_3 \otimes S), \qquad (4) $$

where P_X, P_Y, P_Z are appropriate row-selection matrices. Dai et al. [5] observe that S^v_{F×3P} has lower rank than the original S_{3F×P}, since it admits a rank-K decomposition instead of a rank-3K one, assuming the per-frame 3D shapes span a K-dimensional subspace. Though S facilitates writing the projection equations, minimizing the rank of the re-arranged matrix S^v avoids spurious degrees of freedom. Minimization of the nuclear norm of S^v is used only in the non-rigid case (K > 1); in the rigid case, the shape does not change over time and S^v_{1×3P} has rank 1 by construction. We approximately solve Eq. 3 in three steps.

Low-rank trajectory matrix completion. We want to complete the 2D trajectory matrix under a rank bound constraint:

$$ \min_W \; \|H \odot (W - \tilde{W})\|_F^2 \quad \text{s.t.} \quad \mathrm{Rank}(W) \le 3K. \qquad (5) $$

Due to its intractability, the rank bound constraint is typically imposed through a factorization W = UV^T, with U_{2F×r} and V_{P×r}; in our case, r = 3K.
The work of [23] empirically shows that the following regularized problem is less prone to local minima than its non-regularized counterpart (λ = 0):

$$ \min_{W,\, U_{2F\times 3K},\, V_{P\times 3K}} \; \|H \odot (W - \tilde{W})\|_F^2 + \frac{\lambda}{2}\left(\|U\|_F^2 + \|V\|_F^2\right) \quad \text{s.t.} \quad W = UV^T. \qquad (6) $$

We solve Eq. 6 using the method of Augmented Lagrange multipliers. We want to search explicitly over different rank bounds for the trajectory matrix W as we vary K. We choose not to minimize the nuclear norm instead, despite its convexity, since different weights for the nuclear term result in matrices of different ranks, making it harder to control the rank bound explicitly. Prior work [24, 23] shows that the bilinear formulation of Eq. 6, despite being non-convex, returns the same optimum as the nuclear-regularized objective (‖H ⊙ (W − W̃)‖_F^2 + ‖W‖_*) whenever r ≥ r*, where r* denotes the rank obtained by the unconstrained minimization of the nuclear-regularized objective. We use the continuation strategy proposed in [23] over r to avoid local minima for r < r*: starting from large values of r, we iteratively reduce it until the desired rank bound 3K is achieved. For details, please see [23, 24].

Euclidean upgrade. Given a complete trajectory matrix, minimization of the reprojection error term of Eq. 3 under the orthonormality constraints is equivalent to an SfM or NRSfM problem in its standard form, previously studied in the seminal works of [3, 2]:

$$ \min_{R,S} \; \|W - R \cdot S\|_F^2 \quad \text{s.t.} \quad \exists \alpha^t \; \text{s.t.} \; R^t (R^t)^T = \alpha^t I_{2\times 2}, \; t = 1\cdots F. \qquad (7) $$

For rigid objects, Tomasi and Kanade [3] recover the camera pose and shape matrix via the SVD of W truncated to rank 3: W = UDV^T = (UD^{1/2})(D^{1/2}V^T) = R̂ · Ŝ.
The factorization is not unique, since for any invertible matrix G_{3×3}: R̂ · Ŝ = R̂ G G^{−1} Ŝ. We estimate G so that R̂G satisfies the orthonormality constraints:

$$ \text{orthogonality:} \quad \hat{R}_{2t-1} G G^T \hat{R}_{2t}^T = 0, \quad t = 1\cdots F; \qquad \text{same norm:} \quad \hat{R}_{2t-1} G G^T \hat{R}_{2t-1}^T = \hat{R}_{2t} G G^T \hat{R}_{2t}^T, \quad t = 1\cdots F. \qquad (8) $$

The constraints of Eq. 8 form an overdetermined homogeneous linear system with respect to the elements of the Gram matrix Q = GG^T. We estimate Q using least squares and factorize it using SVD to obtain G, up to an arbitrary scaling and rotation of its row space [25]. Then, the rigid object shape is obtained as S_{3×P} = G^{−1} Ŝ.

For non-rigid objects, a similar Euclidean upgrade of the rotation matrices has been attempted using a rank-3K (rather than rank-3) decomposition of W [26]. In the non-rigid case, the corrective transformation G has size 3K × 3K. Each 3K × 3 column triplet G_k is recovered independently, since it contains the rotation information from all frames. For a long time, an overlooked rank-3 constraint on the Gram matrix Q_k = G_k G_k^T spurred conjectures regarding the ambiguity of shape recovery under non-rigid motion [26]. This led researchers to introduce additional priors to further constrain the problem, such as temporal smoothness [27]. Finally, the work of [4] showed that the orthonormality constraints are sufficient to recover a unique non-rigid 3D shape. Dai et al. [5] proposed a practical algorithm for the Euclidean upgrade using a rank-3K decomposition of W that minimizes the nuclear norm of Q_k under the orthonormality constraints.

Surprisingly, we have found that in practice it is not necessary to go beyond a rank-3 truncation of W to obtain the rotation matrices in the case of dense NRSfM.
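For the rigid, rank-3 case, the whole upgrade (truncated SVD, least-squares solve of Eq. 8 for Q = GG^T, factorization of Q) fits in a short routine. The sketch below is our own minimal rendition, not the paper's code; the global scale is fixed by normalizing the first frame's row, and the remaining rotation ambiguity of G is left unresolved:

```python
import numpy as np

def euclidean_upgrade_rigid(W):
    """Tomasi-Kanade-style metric upgrade from a centered 2F x P trajectory
    matrix: rank-3 factorization, then a linear least-squares solve of the
    orthonormality constraints for the Gram matrix Q = G G^T."""
    F = W.shape[0] // 2
    U, d, Vt = np.linalg.svd(W, full_matrices=False)
    Rh = U[:, :3] * np.sqrt(d[:3])                 # 2F x 3 affine "rotations"
    Sh = np.sqrt(d[:3])[:, None] * Vt[:3]          # 3 x P affine shape

    def sym_row(a, b):
        # coefficients of the 6 unique entries of symmetric Q in a Q b^T
        return np.array([a[0]*b[0], a[0]*b[1] + a[1]*b[0], a[0]*b[2] + a[2]*b[0],
                         a[1]*b[1], a[1]*b[2] + a[2]*b[1], a[2]*b[2]])

    A, c = [], []
    for t in range(F):
        r1, r2 = Rh[2*t], Rh[2*t + 1]
        A.append(sym_row(r1, r2)); c.append(0.0)                     # orthogonality
        A.append(sym_row(r1, r1) - sym_row(r2, r2)); c.append(0.0)   # equal norms
    A.append(sym_row(Rh[0], Rh[0])); c.append(1.0)                   # fix the scale
    q = np.linalg.lstsq(np.array(A), np.array(c), rcond=None)[0]
    Q = np.array([[q[0], q[1], q[2]],
                  [q[1], q[3], q[4]],
                  [q[2], q[4], q[5]]])
    w, E = np.linalg.eigh(Q)                       # factor Q = G G^T
    G = E @ np.diag(np.sqrt(np.clip(w, 0.0, None)))
    return Rh @ G, np.linalg.pinv(G) @ Sh          # metric rotations and shape
```

For noise-free rigid input, the recovered rows satisfy the per-frame orthogonality and equal-norm constraints, and R · S reproduces W exactly.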
The large majority of trajectories span the rigid component of the object, and their information suffices to compute the object's rotations. This is not the case for synthetic NRSfM datasets, where the number of tracked points on the articulating links is similar to the number of points spanning the “torso-like” component, as in the famous “Dance” sequence [12]. In Section 3, we show dense face reconstruction results while varying the truncation rank κ_r of W for the Euclidean upgrade step, and verify that κ_r = 3 is more stable than κ_r > 3 for NRSfM of faces.

Rank-regularized least squares for 3D shape recovery. In the non-rigid case, given the recovered camera poses R, we minimize the reprojection error of the observed trajectory entries and the 3D shape nuclear norm:

$$ \min_S \; \tfrac{1}{2}\|H \odot (\tilde{W} - R \cdot S)\|_F^2 + \mu \|S^v\|_* \quad \text{s.t.} \quad S^v = [P_X \; P_Y \; P_Z]\,(I_3 \otimes S). \qquad (9) $$

Figure 2: Qualitative results on the synthetic benchmark of [17]. High-quality reconstructions are obtained with oracle (full-length) trajectories for both abrupt and smooth motion. For incomplete trajectories, the 3rd column shows missing trajectory entries in red and present ones in green. The reconstruction result for the 2nd video sequence, which has 30% missing data, though worse, is still recognizable.

Notice that we consider only the observed entries in W̃ to constrain the 3D shape estimation; however, information from the complete W has been used for extracting the rotation matrices R. We solve the convex, non-smooth problem of Eq. 9 using the nuclear norm minimization algorithm proposed in [28]. It generalizes the accelerated proximal gradient method of [29] from l1-regularized least squares on vectors to nuclear-norm-regularized least squares on matrices.
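The key building block of such solvers is the proximal operator of the nuclear norm, i.e., singular value thresholding. Below is a generic, non-accelerated proximal-gradient sketch on a masked least-squares objective; it is a stand-in for the accelerated method of [28], and it deliberately ignores the S^v re-arrangement and the fixed camera poses of Eq. 9:

```python
import numpy as np

def svt(X, tau):
    """Singular value thresholding: the proximal operator of tau * ||.||_*."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def nuclear_reg_ls(W_obs, H, mu, iters=300):
    """Plain proximal gradient for
        min_X 0.5 ||H ⊙ (X − W_obs)||_F^2 + mu ||X||_*
    with step size 1, valid since the masked quadratic has Lipschitz constant 1."""
    X = np.zeros_like(W_obs)
    for _ in range(iters):
        X = svt(X - H * (X - W_obs), mu)   # gradient step, then nuclear-norm prox
    return X
```

The accelerated variant adds a momentum extrapolation step between iterations, which is what gives [28, 29] their O(1/k^2) convergence rate.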
It has a better iteration complexity than the Fixed Point Continuation (FPC) method of [30] and the Singular Value Thresholding (SVT) method of [31].

Given the camera pose R and shape S, we backproject to obtain the complete centered trajectory matrix W = R · S. Though we could in principle iterate over the extraction of camera pose and 3D shape, we observed benefits from such iteration only in the rigid case. This observation agrees with the results of Marques and Costeira [32] for rigid SfM from incomplete trajectories.

3 Experiments

The only available dense NRSfM benchmark has recently been introduced by Garg et al. [17]. They propose a dense NRSfM method that minimizes a robust discontinuity term over the recovered 3D depth along with the 3D shape rank. However, their method assumes as input full-length trajectories obtained via the subspace flow tracking method of [16]. Unfortunately, the tracker of [16] can tolerate only very mild out-of-plane rotations or occlusions, which is a serious limitation for tracking in real videos. Our method does not impose the full-length trajectory requirement. Also, we show that the robust discontinuity term of [17] may not be necessary for high-quality reconstructions.

The benchmark contains four synthetic video sequences that depict a deforming face, and three real sequences that depict a deforming back, face and heart, respectively. Only the synthetic sequences have ground-truth 3D shapes available, since it is considerably more difficult to obtain ground truth for NRSfM in non-synthetic environments. Dense, full-length ground-truth 2D trajectories are provided for all sequences. For evaluation, we use the code supplied with the benchmark, which performs a pre-alignment step at each frame between S^t and S^t_GT using Procrustes analysis.
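This evaluation protocol (per-frame Procrustes alignment, then a relative RMS error averaged over frames) can be sketched as follows. This is our own rotation-only rendition; the benchmark's Procrustes code may additionally normalize translation and scale:

```python
import numpy as np

def mean_rms_error(S, S_gt):
    """Mean over frames of ||R S^t − S_gt^t||_F / ||S_gt^t||_F, where R is the
    best per-frame alignment found by orthogonal Procrustes (reflections are
    allowed here for brevity). S and S_gt are arrays of shape F x 3 x P."""
    errs = []
    for St, Gt in zip(S, S_gt):
        U, _, Vt = np.linalg.svd(Gt @ St.T)   # R = U Vt maximizes trace(R St Gt^T)
        R = U @ Vt
        errs.append(np.linalg.norm(R @ St - Gt) / np.linalg.norm(Gt))
    return float(np.mean(errs))
```

A rigid rotation of the ground truth thus scores zero error, while a pure scale difference is not absorbed by the alignment.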
Reconstruction performance is measured by the mean RMS error across all frames, where the per-frame RMS error of a shape S^t with respect to the ground-truth shape S^t_GT is defined as ‖S^t − S^t_GT‖_F / ‖S^t_GT‖_F.

Figure 2 presents our qualitative results, and Table 1 compares our performance against previous state-of-the-art NRSfM methods: Trajectory Basis (TB) [12], Metric Projections (MP) [33], Variational Reconstruction (VR) [17] and CSF [7]. For CSF, we were not able to complete the experiment for sequences 3 and 4, due to the non-scalable nature of the algorithm. Next to the error of each method, we show in parentheses the rank used, that is, the rank that gave the best error. Our method uses exactly the same parameters and K = 9 for all four sequences, whereas baseline VR [17] adapts the weight for the nuclear norm of S to each sequence. This shows the robustness of our method under varying object deformations. κ_r is the truncation rank of W used for the Euclidean upgrade step; when κ_r > 3, we use the Euclidean upgrade proposed in [5]. κ_r = 3 gives the most stable face reconstruction results.

Figure 3: Reconstruction results on the “Back”, “Face” and “Heart” sequences of [17]. We show present and missing trajectory entries, per-frame depth maps and retextured depth maps.

Next, to imitate a more realistic setup, we introduce missing entries into the ground-truth 2D tracks by “hiding” trajectory entries that are occluded due to face rotations. The occluded points are shown in red in the 3rd column of Figure 2.
From the “incomplete trajectories” section of Table 1, we see that the error increase for our method is small in comparison to the full-length trajectory case.

In the real “Back”, “Face” and “Heart” sequences of the benchmark, the objects are pre-segmented. We keep all trajectories that are at least five frames long. This results in 29.29%, 30.54% and 52.71% missing data in the corresponding trajectory matrices W̃. We used K = 8 for all sequences. We show qualitative results in Figure 3. Present and missing entries are shown in green and red, respectively. The missing points occupy either occluded regions, or regions with ambiguous correspondence, e.g., under specularities in the Heart sequence.

Next, we test our method on reconstructing objects from videos of two popular video segmentation datasets: VSB100 [18], which contains videos uploaded to YouTube, and Moseg [19], which contains videos from Hollywood movies. Each video is between 19 and 121 frames long. For all videos we use K ∈ {1···5}. We keep all trajectories longer than five frames. This results in missing data varying from 20% to 70% across videos, with an average of 45% missing trajectory entries. We visualize reconstructions for the best trajectory clusters (the ones closest to the ground-truth segmentations supplied with the datasets) in Figure 4.

Discussion. Our 3D reconstruction results on real videos show that, under high trajectory density, small object rotations suffice to create the perception of depth. We also observe that tracking quality is crucial for reconstruction. Optical flow deteriorates as the spatial resolution decreases, and thus high video resolution is currently important for our method.
The most important failure cases for our method are highly articulated objects, which violate the low-rank assumptions. 3D reconstruction of articulated bodies is the focus of our current work.

           |              ground-truth full trajectories                     |  incomplete trajectories
           | TB [12]    MP [33]    VR [17]   ours κr=3  ours κr=6  ours κr=9 | ours κr=3            CSF
Seq.1 (10) | 18.38 (2)  19.44 (3)  4.01 (9)  5.16       6.69       21.02     | 4.92 (8.93% occl.)   15.6
Seq.2 (10) | 7.47 (2)   4.87 (3)   3.45 (9)  3.71       5.20       25.6      | 9.44 (31.60% occl.)  36.8
Seq.3 (99) | 4.50 (4)   5.13 (6)   2.60 (9)  2.81       2.88       3.00      | 3.40 (14.07% occl.)  ——
Seq.4 (99) | 6.61 (4)   5.81 (4)   2.81 (9)  3.19       3.08       3.54      | 5.53 (13.63% occl.)  ——

Table 1: Reconstruction results on the NRSfM benchmark of [17]. We show the mean RMS error in percent (%). Numbers for the TB, MP and VR baselines are from [17]. In the first column, we show in parentheses the number of frames. κ_r is the rank of W used for the Euclidean upgrade. The last two columns show the performance of our algorithm and of the CSF baseline when occluded points in the ground-truth tracks are hidden.

Figure 4: Reconstruction results on the VSB and Moseg video segmentation datasets. For each example we show a) the trajectory cluster, b) the present and missing entries, and c) the depths of the visible points (as estimated from ray casting), where red and blue denote close and far, respectively.

4 Conclusion

We have presented a practical method for extracting dense 3D object models from monocular uncalibrated video without object-specific priors. Our method takes as input trajectory motion clusters obtained from automatic video segmentation, which contain large amounts of missing data due to object occlusions and rotations. We have applied our NRSfM method to synthetic dense reconstruction benchmarks and to numerous videos from YouTube and Hollywood movies.
We have shown that a richer object representation is achievable from video under mild conditions of camera motion and object deformation: small object rotations are sufficient to recover 3D shape. “We see because we move, we move because we see”, said Gibson in his “Perception of the Visual World” [34]. We believe this paper has made a step towards encompassing 3D perception from motion into general video analysis.

Acknowledgments

The authors would like to thank Philippos Mordohai for useful discussions. M.S. acknowledges funding from the Dirección General de Investigación of Spain under project DPI2012-32168 and the Ministerio de Educación (scholarship FPU-AP2010-2906).

References

[1] Andersen, R.A., Bradley, D.C.: Perception of three-dimensional structure from motion. Trends in Cognitive Sciences 2 (1998) 222–228
[2] Bregler, C., Hertzmann, A., Biermann, H.: Recovering non-rigid 3d shape from image streams. In: CVPR. (2000)
[3] Tomasi, C., Kanade, T.: Shape and motion from image streams: a factorization method. Technical report, IJCV (1991)
[4] Akhter, I., Sheikh, Y., Khan, S., Kanade, T.: Trajectory space: A dual representation for nonrigid structure from motion. IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (2011) 1442–1456
[5] Dai, Y.: A simple prior-free method for non-rigid structure-from-motion factorization. IJCV (2012)
[6] Torresani, L., Hertzmann, A., Bregler, C.: Nonrigid structure-from-motion: Estimating shape and motion with hierarchical priors. TPAMI 30 (2008)
[7] Gotardo, P.F.U., Martinez, A.M.: Computing smooth time trajectories for camera and deformable shape in structure from motion with occlusion. TPAMI 33 (2011)
[8] Park, H.S., Shiratori, T., Matthews, I., Sheikh, Y.: 3d reconstruction of a moving point from a series of 2d projections.
In: ECCV. (2010)
[9] Simon, T., Valmadre, J., Matthews, I., Sheikh, Y.: Separable spatiotemporal priors for convex reconstruction of time-varying 3D point clouds. In: ECCV. (2014)
[10] Akhter, I., Simon, T., Khan, S., Matthews, I., Sheikh, Y.: Bilinear spatiotemporal basis models. ACM Transactions on Graphics (2011)
[11] Russell, C., Yu, R., Agapito, L.: Video pop-up: Monocular 3D reconstruction of dynamic scenes. In: ECCV. (2014)
[12] Akhter, I., Sheikh, Y., Khan, S., Kanade, T.: Trajectory space: A dual representation for nonrigid structure from motion. IEEE Trans. Pattern Anal. Mach. Intell. 33 (2011)
[13] Torresani, L., Bregler, C.: Space-time tracking. In: ECCV. (2002)
[14] Sundaram, N., Brox, T., Keutzer, K.: Dense point trajectories by GPU-accelerated large displacement optical flow. In: ECCV. (2010)
[15] Lucas, B.D., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: Proceedings of the 1981 DARPA Image Understanding Workshop. (1981)
[16] Garg, R., Roussos, A., de Agapito, L.: A variational approach to video registration with subspace constraints. International Journal of Computer Vision 104 (2013) 286–314
[17] Garg, R., Roussos, A., Agapito, L.: Dense variational reconstruction of non-rigid surfaces from monocular video. In: CVPR. (2013)
[18] Galasso, F., Nagaraja, N.S., Cardenas, T.J., Brox, T., Schiele, B.: A unified video segmentation benchmark: Annotation, metrics and analysis. In: ICCV. (2013)
[19] Brox, T., Malik, J.: Object segmentation by long term analysis of point trajectories. In: ECCV. (2010)
[20] Bao, S.Y.Z., Chandraker, M., Lin, Y., Savarese, S.: Dense object reconstruction with semantic priors. In: CVPR. (2013) 1264–1271
[21] Yu, S., Shi, J.: Multiclass spectral clustering. In: ICCV. (2003)
[22] Thompson, W.B.: Exploiting discontinuities in optical flow.
International Journal of Computer Vision 30 (1998)
[23] Cabral, R., de la Torre, F., Costeira, J., Bernardino, A.: Unifying nuclear norm and bilinear factorization approaches for low-rank matrix decomposition. In: ICCV. (2013)
[24] Burer, S., Monteiro, R.D.C.: Local minima and convergence in low-rank semidefinite programming. Math. Program. 103 (2005) 427–444
[25] Brand, M.: A direct method for 3D factorization of nonrigid motion observed in 2D. In: CVPR. (2005)
[26] Xiao, J., Chai, J., Kanade, T.: A closed-form solution to nonrigid shape and motion recovery. Technical Report CMU-RI-TR-03-16, Robotics Institute, Pittsburgh, PA (2003)
[27] Torresani, L., Yang, D.B., Alexander, E.J., Bregler, C.: Tracking and modeling non-rigid objects with rank constraints. In: CVPR. (2001)
[28] Toh, K.C., Yun, S.: An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems. Pacific Journal of Optimization (2010)
[29] Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences 2 (2009) 183–202
[30] Ma, S., Goldfarb, D., Chen, L.: Fixed point and Bregman iterative methods for matrix rank minimization. Math. Program. 128 (2011) 321–353
[31] Cai, J.F., Candès, E.J., Shen, Z.: A singular value thresholding algorithm for matrix completion. SIAM J. on Optimization 20 (2010) 1956–1982
[32] Marques, M., Costeira, J.: Estimating 3D shape from degenerate sequences with missing data. Computer Vision and Image Understanding 113 (2009) 261–272
[33] Paladini, M., Del Bue, A., Stošić, M., Dodig, M., Xavier, J., Agapito, L.: Optimal metric projections for deformable and articulated structure-from-motion. IJCV 96 (2012)
[34] Gibson, J.J.: The perception of the visual world.
The American Journal of Psychology 64 (1951) 622–625