{"title": "Correlated Uncertainty for Learning Dense Correspondences from Noisy Labels", "book": "Advances in Neural Information Processing Systems", "page_first": 920, "page_last": 928, "abstract": "Many machine learning methods depend on human supervision to achieve optimal performance. However, in tasks such as DensePose, where the goal is to establish dense visual correspondences between images, the quality of manual annotations is intrinsically limited. We address this issue by augmenting neural network predictors with the ability to output a distribution over labels, thus explicitly and introspectively capturing the aleatoric uncertainty in the annotations.\nCompared to previous works, we show that correlated error fields arise naturally in applications such as DensePose and these fields can be modeled by deep networks, leading to a better understanding of the annotation errors.\nWe show that these models, by understanding uncertainty better, can solve the original DensePose task more accurately, thus setting the new state-of-the-art accuracy in this benchmark.\nFinally, we demonstrate the utility of the uncertainty estimates in fusing the predictions of produced by multiple models, resulting in a better and more principled approach to model ensembling which can further improve accuracy.", "full_text": "Correlated Uncertainty for Learning\n\nDense Correspondences from Noisy Labels\n\nNatalia Neverova, David Novotny, Andrea Vedaldi\n\nFacebook AI Research\n\n{nneverova, dnovotny, vedaldi}@fb.com\n\nAbstract\n\nMany machine learning methods depend on human supervision to achieve optimal\nperformance. However, in tasks such as DensePose, where the goal is to establish\ndense visual correspondences between images, the quality of manual annotations\nis intrinsically limited. 
We address this issue by augmenting neural network predictors with the ability to output a distribution over labels, thus explicitly and introspectively capturing the aleatoric uncertainty in the annotations. Compared to previous works, we show that correlated error fields arise naturally in applications such as DensePose and these fields can be modelled by deep networks, leading to a better understanding of the annotation errors. We show that these models, by understanding uncertainty better, can solve the original DensePose task more accurately, thus setting the new state-of-the-art accuracy in this benchmark. Finally, we demonstrate the utility of the uncertainty estimates in fusing the predictions produced by multiple models, resulting in a better and more principled approach to model ensembling which can further improve accuracy.\n\n1 Introduction\n\nDeep neural networks achieve state-of-the-art performance in many applications, but at the cost of collecting large quantities of annotated training data. Manual annotations are time consuming and, in some cases, of limited quality. This is particularly true for quantitative labels such as the 3D shape of objects in images or dense correspondence fields between objects. In these cases, one should consider manual labels as a form of weak supervision and design learning algorithms accordingly.\nAn emerging approach to handle annotation noise is to task the network with predicting the aleatoric uncertainty in the labels. Consider a predictor ŷ = Φ(x) mapping a data point x to an estimate ŷ of its label. Given the “ground-truth” label y, the standard approach is to minimize a loss of the type ℓ(y, ŷ) so that ŷ approaches y as much as possible. However, if the “ground-truth” value y is affected by noise, then naively minimizing this quantity may be undesirable. 
An alternative approach is to predict instead a distribution p(ŷ|x) = Φŷ(x) over possible values of the annotation y. This has several advantages: (1) it can model the distribution of annotation errors specific to each data point x (as not all data points are equally difficult to annotate), (2) it can model the prediction uncertainty (as knowing x may not be sufficient to fully determine y), and (3) it allows the model to account for its own limitations (by assessing the difficulty of the prediction task). Under such a model, the point-wise loss is replaced by the negative log-likelihood ℓ(y, Φ(x)) = −log p(y|x) = −log Φy(x).\nApproaches using these ideas have demonstrated their power in a number of applications. However, most methods have adopted simplistic uncertainty models. In particular, when the goal is to predict a label vector y ∈ R^n, errors have been assumed to be uncorrelated, so that −log p(y|x) = −∑_{i=1}^n log p(yi|x). However, this is very seldom the case. For example, if yi is a label associated to a particular pixel i in an image x, we can expect annotation and prediction errors to be very strongly correlated for pixels in the same neighborhood.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFigure 1: Systematic errors in manual dense correspondences (DensePose [5]). The annotators are shown a set of points sampled randomly and uniformly over one of the predefined body parts of a person in an image. Their task is to click on corresponding pixels in another image, obtained by rendering a canonical 3D model of a human body and therefore providing ground truth correspondences. 
Due to self-occlusions and ambiguities, errors made by the annotators tend to be correlated within each given body part and can be partially described by global affine transforms (translation, rotation, scaling) w.r.t. the true locations. We model the structure of these errors by learning a neural network that estimates a distribution over the correlated annotation field.\n\nIn this paper, we investigate richer uncertainty models that can better exploit the data structure. As a running example, we consider the task of dense pose estimation, or DensePose (fig. 1): namely, given an image x of a human, the goal is to associate to every pixel i a coordinate chart (pi, ui), where pi ∈ {0, 1, . . . , P} is a part label (where 0 is the background) and ui ∈ [0, 1]^2 is a chart coordinate, specific to part pi (fig. 1). This is an interesting test case for three reasons: (1) the task consists of pixel-wise, and hence correlated, predictions; (2) the quality of human annotations is uneven and data dependent; (3) there is structure in the data (body parts) that may align with the structure of the annotation errors.\nWe make three contributions. The first is to apply a standard uncertainty model to this task of learning dense correspondences; this was not done before and alone achieves a significant improvement over strong baselines trained with regression losses.\nOur second, and more significant, contribution is to propose more sophisticated uncertainty models. These models allow predicting, for each pixel i, a direction for the error in labelling the chart coordinate vectors. This can, for example, be used to express different degrees of uncertainty due to the foreshortening of a limb. They also allow expressing a degree of correlation between all vectors that belong to a common region, in this case identified as a human body part. 
While richer, our models can still be integrated efficiently in deep neural networks for end-to-end training.\nOur third contribution is a deeper departure from prior work. Instead of just modelling uncertainty as a distribution p(ŷ|x) = Φŷ(x) conditioned on the input data x, we consider the possibility of conditioning the uncertainty on the annotation y directly. For example, in the DensePose task, the annotation y can be used to predict image regions where uncertainty is likely to be higher even before observing the image x.\n\n2 Related work\n\nUncertainty in machine learning is usually decomposed into three types [9]: approximation, due to the fact that the model is not sufficiently expressive to model the data-to-label association, aleatoric, due to the intrinsic stochastic nature of the association, and epistemic, due to the model’s limited knowledge about this association, which prevents it from determining it uniquely.\nUncertainty in deep neural networks has been modelled using approximate Bayesian approaches [6, 1, 7, 8], using dropout as a way to generate the necessary samples [4, 7]. Ensembling [3], which combines multiple models, has been explored in [15, 10, 16]. The recent method of [18] proposes a frequentist method that can estimate both aleatoric and epistemic uncertainty.\nThe work of [7, 14] is probably the most related to ours. They also model approximation and aleatoric uncertainty by configuring a deep neural network to produce an explicit estimate of the latter, in the form of a parametric posterior distribution. In this paper, we build a similar model, but apply it to a dense, structured image labelling task. We thus extend the model to express structured uncertainty, where errors are highly correlated in a way which depends on the input image and annotation.\nWe apply our approach to the DensePose problem, originally introduced in [5]. 
We not only show that we can accurately model uncertainty in the annotation process, but also learn a better overall DensePose regressor, outperforming the current state-of-the-art results of [12], from which we borrow our experimental setup.\n\n3 Method\n\nWe consider the problem of predicting, given a data point x, a label vector y ∈ R^{dn} formed by n subvectors yi ∈ R^d, i = 1, . . . , n of dimension d. In our test application, namely DensePose, these subvectors are the chart coordinates yi = ui ∈ [0, 1]^2 that associate to pixels i of image x a particular point on the human body. However, there are many other problems, including colorization, depth prediction and inpainting, that can be modelled in a similar manner.\nWe denote by δ = ŷ − y the error between the predicted value ŷ of the label and the annotated value y. In order to model uncertainty in the system, we train a predictive model Φ(x) that outputs not only the point estimate ŷ ≈ y, but also a distribution p(y|x) over possible values. For simplicity, we express the latter as\n\np(y|x) = q(ŷ − y|x),   (1)\n\nwhere q(δ|x) is an unbiased distribution of the residual (i.e. E[q(δ|x)] = 0 and argmax_δ q(δ|x) = 0). Hence, the output of the neural network is a pair (ŷ, q) = Φ(x) comprising a point estimate ŷ and the distribution of the residual q. This model can be trained by optimizing the negative log-likelihood ℓ(y, Φ(x)) = −log q(y − ŷ).\nNext, we discuss possible variants of the model with different complexity and expressive power.\n\n3.1 Elementary uncertainty model\n\nIn the simplest case, we let y ∈ R^n and assume that the subvectors yi are single scalars. The uncertainty model is then given by q(δ|x) = ∏_{i=1}^n q(δi|x), which amounts to assuming that the residuals are statistically independent. 
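To make the factorized model concrete, its negative log-likelihood reduces to a sum of per-pixel Gaussian terms; the following is our own minimal illustrative sketch (function and variable names are ours, not the paper's implementation), assuming the network outputs a prediction and a variance per pixel:

```python
import numpy as np

def gaussian_nll(y_hat, var, y):
    """Negative log-likelihood of independent per-pixel Gaussian residuals.

    y_hat : predicted labels, shape (n,)
    var   : predicted variances sigma_i^2, shape (n,)
            (in practice one would predict log-variances for positivity)
    y     : annotated labels, shape (n,)
    """
    # 0.5 * sum_i [ log(2*pi) + log(sigma_i^2) + (y_hat_i - y_i)^2 / sigma_i^2 ]
    return 0.5 * np.sum(np.log(2.0 * np.pi) + np.log(var) + (y_hat - y) ** 2 / var)
```

Note that the loss is minimized over the variance when sigma_i matches the actual per-pixel error, which is what encourages the introspective behaviour described below.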
The simplest choice for q(δi|x) is a Gaussian N(0, σi²) with variance σi². Hence, the neural network (ŷ, σ) = Φ(x) outputs for each pixel i the prediction ŷi as well as an estimate σi² of the prediction uncertainty. In this case, the training loss expands as:\n\nℓ(y, Φ(x)) = (n/2) log 2π + (1/2) ∑_{i=1}^n ( log σi² + (ŷi − yi)²/σi² ).   (2)\n\nNote that the minimum of the r.h.s. w.r.t. σi² is obtained by setting σi = |ŷi − yi|. Hence, the predictor Φ will try to output a value of σi which is equal to the error actually incurred at that pixel; crucially, however, the model Φ cannot measure this error directly, as it only receives the data point x (and not the annotation y) as input. Hence, the model is encouraged to perform introspection.\n\n3.2 Higher-order uncertainty models\n\nThe model of section 3.1 is simplistic as it assumes that errors are statistically independent, which is seldom the case in applications. In order to address this limitation, we assume that residuals are instead generated by the model\n\nδi = ε + ηi + ξi wi,   (3)\n\nwhere ε ∼ N(0, σ1² Id) is an overall isotropic offset, ηi ∼ N(0, σ2i² Id) is a subvector-specific isotropic offset, and ξi ∼ N(0, σ3i²) is a subvector-specific directional offset along the unit vector wi. Here Id denotes the d × d identity matrix, where d is the dimensionality of the subvectors yi (d = 2 in the DensePose application).\nThis model extends (2) in several ways. First, the term ε indicates that errors are overall correlated. For instance, in DensePose it is likely that all points annotated for a given human body part would be affected by a similar annotation shift compared to the “correct” annotation. This is because humans are better at relative rather than absolute judgments when it comes to establishing ambiguous visual correspondences. Second, the term ηi expresses local isotropic uncertainty, similar to (2). Third, the term ξi wi expresses local directional uncertainty. This can be used to capture any expected directionality in the error. For instance, in DensePose we expect errors to be larger in the direction of visual foreshortening of a limb.\nNext, we calculate the negative log-likelihood −log q(δ|x) under this model in order to evaluate and learn it. The collection of residuals δ is a Gaussian vector with covariance matrix\n\nΣ = JJ^T + diag(Σ1, . . . , Σn),   Σi = σ2i² Id + σ3i² wi wi^T,   where J = σ1 · [Id · · · Id]^T ∈ R^{dn×d}.   (4)\n\nSome algebra shows that the determinant of the covariance matrix and the concentration matrix are given by\n\ndet Σ = det K · ∏_{i=1}^n det Σi,   C = Σ^{−1} = C̄ − C̄ J K^{−1} J^T C̄,   K = Id + σ1² ∑_{i=1}^n Σi^{−1}.   (5)\n\nHere C̄ = diag(C1, . . . , Cn) is the block-diagonal matrix containing the subvector-specific concentration matrices Ci = Σi^{−1}. We can expand this further by noting that:\n\nCi = (1/σ2i²) · Πi,   Πi = Id − ρi wi wi^T,   det Σi = σ2i^{2d}/(1 − ρi),   ρi = σ3i²/(σ2i² + σ3i²),   (6)\n\nwhere Πi can be interpreted as a projection operator and ρi as a correlation coefficient. Then:¹\n\n−log q(δ|x) = (nd/2) log 2π + (1/2) ∑_{i=1}^n log( σ2i^{2d}/(1 − ρi) ) + (1/2) log det K + (1/2) ∑_{i=1}^n δi^T Πi δi/σ2i² − (σ1²/2) ( ∑_{i=1}^n Πi δi/σ2i² )^T K^{−1} ( ∑_{i=1}^n Πi δi/σ2i² ),   where K = Id + σ1² ∑_{i=1}^n Πi/σ2i².   (8)\n\nSpatially-independent model. Model (8) correlates the errors of all subvectors. We can relax this condition by setting σ1 = 0. In this case K = Id and J = 0, and the model reduces to:\n\n−log q(δ|x) = (nd/2) log 2π + (1/2) ∑_{i=1}^n ( log( σ2i^{2d}/(1 − ρi) ) + δi^T Πi δi/σ2i² ).   (9)\n\nThis requires the model to estimate for each pixel the value of the variance σ2i² as well as of the directional correlation parameters ri ∈ R^2.\n\nFully-independent model. The elementary model corresponding to eq. (2) is obtained by further setting the parameter ri = 0 in eq. (9).\n\n¹ In practice, it is easier for a network to predict ri = σ3i wi ∈ R^2 instead of ρi and wi separately, so we use the parametrization\n\nΠi = Id − ri ri^T/(σ2i² + ‖ri‖²),   ρi = ‖ri‖²/(σ2i² + ‖ri‖²),   σ2i^{2d}/(1 − ρi) = σ2i^{2(d−1)} (σ2i² + ‖ri‖²).   (7)\n\nFurthermore, the negative sign in eq. (8) can lead to instabilities. We thus rewrite the equation as a sum of squares:\n\n(1/2) ∑_{i=1}^n δi^T Πi δi/σ2i² − (σ1²/2) ( ∑_{i=1}^n Πi δi/σ2i² )^T K^{−1} ( ∑_{i=1}^n Πi δi/σ2i² ) = (1/2) ∑_{i=1}^n (δi − μ)^T (Πi/σ2i²) (δi − μ) + μ^T μ/(2σ1²),\n\nwhere we have introduced the ‘mean’ vector:\n\nμ = σ1² K^{−1} ∑_{i=1}^n (Πi/σ2i²) δi.\n\nFigure 2: Uncertainty-aware training pipeline. We extend a standard predictor based on an Hourglass architecture [13, 12] with an additional uncertainty head to estimate uncertainty parameters.\n\n3.3 Label-conditioned uncertainty\n\nNext, we consider modifying the approach so that uncertainty is predicted not only from the input data x, but also based on the label y itself.\nAt first glance, this may look as simple as adding the argument y to the predictor Φ(x) to obtain a new predictor Φ(x, y). This is however nonsensical, as (ŷ, q) = Φ(x, y) is tasked with producing an estimate of ŷ itself, so this would immediately lead to a degenerate solution.\nInstead, we consider two separate networks. 
The first, ŷ = Φ1(x), is tasked with predicting only the label y. The second, q = Φ2(x, y), is tasked with predicting only the uncertainty distribution q. Without any constraint, this scheme still does not work, as the distribution q can shift the prediction ŷ arbitrarily. This is prevented by the fact that E[q(δ)] = 0; in fact, in practice we require q(y) to be a simple uni-modal distribution. In this manner, q can effectively only predict the data uncertainty, but ŷ must still try to predict the label correctly in order to minimize the log-likelihood loss.\n\n3.4 Introspective ensemble\n\nAssume that we have densities qk(δ|x), k = 1, . . . , K generated from an ensemble of K models and let ŷ^(k) be the corresponding label estimates (we use the superscript to indicate that we index different estimates instead of different components of a single vector estimate). We can fuse the estimates by finding y = argmax_y ∑_{k=1}^K log qk(y − ŷ^(k)|x) = argmax_y −∑_{k=1}^K (y − ŷ^(k))^T C^(k) (y − ŷ^(k)). The maximizer is then y = ( ∑_{k=1}^K C^(k) )^{−1} ∑_{k=1}^K C^(k) ŷ^(k). Note that, while the section above gives us the concentration matrices of each model, in the case of the probabilistic model of the highest order, eq. (8), we require the inverse of their sum, which must be obtained numerically. In time-constrained (e.g. real-time) applications, rather than solving such a large system of equations, we can utilize conjugate gradient descent to obtain an approximate solution starting from an initial guess (which can be obtained as the average of the individual models’ predictions). For the simpler cases of ensembles of spatially-independent models and fully-independent models, the fused estimates can be computed in closed form. For the spatially independent model (eq. (9)), it follows that the fused uv predictions at position i are defined as yi^spa = ( ∑_{k=1}^K Ci^(k) )^{−1} ( ∑_{k=1}^K Ci^(k) ŷi^(k) ), which only requires inverting a small 2×2 matrix formed by accumulating the Ci^(k) at each pixel. In the case of the fully independent model with isotropic covariance matrices (eq. (2)), the ensembled prediction yi^iso = ( ∑_{k=1}^K σi^{−2(k)} ŷi^(k) ) / ( ∑_{k=1}^K σi^{−2(k)} ) is a mere weighted sum of the ŷi^(k).\n\nFigure 3: Example of predictions produced by our model. Ground truth locations (in red) and predicted locations (in green) are shown together with learned isotropic offsets described by σ².\n\n4 Application to DensePose\n\nIn this section we show in more detail how the ideas explained above can be applied to the DensePose problem [5]. In this work, we adapt the DensePose setting of [12], where the input is an image x ∈ R^{3×H×W} tightly containing a person (DensePose can also be applied to full images in combination with an object detector, but we are not concerned with that here). DensePose then trains a network Φ to predict a label y ∈ R^{C×H×W}, where the C = 3 · P channels of the vector yi for pixel i comprise a P-dimensional indicator vector for the part that contains pixel i (e.g. left forearm) and the 2D location in the chart of each part, accounting for 2P dimensions. Note that only one of the 2P predicted locations is used at pixel i, as indicated by the part selector, but all of them are still computed.\nWe extend the basic architecture with an additional uncertainty head that estimates the prediction confidences (see fig. 2). Depending on which specific model is implemented, the output dimensionality of this branch may differ. 
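For the fully-independent case of section 3.4, the closed-form fusion is just a precision-weighted average of the per-model predictions; the following is our own minimal illustrative sketch (names are ours, not the paper's implementation):

```python
import numpy as np

def fuse_independent(preds, sigmas):
    """Precision-weighted fusion of K independent Gaussian predictors.

    preds  : array (K, n) of per-model predictions y_hat^(k)
    sigmas : array (K, n) of per-model standard deviations sigma^(k)
    Returns the (n,) maximizer of the summed Gaussian log-likelihoods,
    i.e. the predictions averaged with weights sigma^(-2).
    """
    w = sigmas ** -2.0                      # per-pixel precision of each model
    return (w * preds).sum(axis=0) / w.sum(axis=0)
```

Each model thus contributes in proportion to its predicted precision σ^(−2), so confident models dominate the fused estimate.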
However, each variant amounts to predicting a certain number of additional channels per part, estimating part-specific uncertainty values, and can be generally expressed as Nu × P, where Nu is the number of uncertainty parameters. Note that, depending on the application, at test time the uncertainty head may be either utilized to get confidence estimates, or ignored. In the latter case, the uncertainty-aware training results in a boost of model performance at no extra computational cost during inference.\nAll uncertainty parameters are predicted by applying a set of two convolutional blocks to an intermediate feature level F, produced by the main network Φ1. As mentioned in section 3.3, in addition, we explore a variant of the model where the sparse ground truth annotations are passed directly to the uncertainty head as an additional input. The annotated points are first mapped onto the image space, preprocessed by a set of partial convolutional layers [11] and then concatenated with the features F. This process is illustrated in fig. 2. An example of model predictions is shown in fig. 3.\n\n5 Experiments\n\nDatasets. To gain deeper insights into the nature of human annotator errors on the dense labeling task, we first analyzed DensePose annotations obtained on a set of 88 synthetic images, where the ground truth UV mapping is known by design (analogously to Section 2.2 of [5]). These images were rendered using the SMPL body model [2] and the rendering pipeline of [19].\nWe have empirically observed that, by taking all annotated points covering one body part in one image and applying a simple global affine transformation to them (such as translation or scaling), the mean error over the whole image set can be reduced by half. 
This confirms our hypothesis of an existing strong correlation between individual errors.\n\nTable 1: Performance of uncertainty-based models on the DensePose-COCO dataset [5]. Our models significantly outperform the baseline variants, with no extra computational cost at inference (when uncertainty estimates are not required by the application).\n\nModel | uv-loss | 1 cm | 2 cm | 3 cm | 5 cm | 10 cm | 20 cm\nDensePose-RCNN (R50) [5] | MSE | 5.21 | 18.17 | 31.01 | 51.16 | 68.21 | 78.37\nDensePose-RCNN (R50) [5] | full (ours) | 5.67 | 18.67 | 32.70 | 53.14 | 71.25 | 80.47\nHRNetV2-W48 [17] | MSE | 4.31 | 15.19 | 27.14 | 47.07 | 69.76 | 78.66\nHRNetV2-W48 [17] | full (ours) | 5.70 | 18.81 | 31.88 | 52.20 | 74.21 | 82.12\nHG, 1 stack (Slim DensePose [12]) | MSE | 4.31 | 15.62 | 28.30 | 49.92 | 74.15 | 83.01\nHG, 1 stack (Slim DensePose [12]) | full (ours) | 5.34 | 18.23 | 31.51 | 52.40 | 74.69 | 82.94\nHG, 2 stacks (Slim DensePose [12]) | MSE | 4.44 | 16.21 | 29.64 | 52.23 | 76.50 | 85.99\nHG, 2 stacks (Slim DensePose [12]) | full (ours) | 5.99 | 19.97 | 34.16 | 55.68 | 77.76 | 85.58\nHG, 8 stacks (Slim DensePose [12]) | MSE | 6.04 | 20.25 | 35.10 | 56.04 | 79.63 | 87.55\nHG, 8 stacks (Slim DensePose [12]) | full (ours) | 6.41 | 20.98 | 35.17 | 56.48 | 80.02 | 87.96\n\nTable 2: Negative log-likelihood of human annotations under different models with uncertainty (MAP values in parentheses). More advanced models show a monotonic increase in this metric w.r.t. the ground truth locations. gt-human stands for human annotations on synthetic data, gt-real for the ground truth UV maps. simple-2D assumes independent (but not isotropically nor identically distributed) per-pixel errors.\n\nModel | SMPL renderings (synthetic): gt-real (MAP) | SMPL renderings (synthetic): gt-human (MAP) | DensePose-COCO (real): gt-human\nsimple | 1.0816 (2.9785) | 1.1716 (3.0797) | 1.3159\nsimple-2D | 1.4246 (2.5748) | 1.4825 (2.5892) | 1.3651\niid | 1.7026 (3.2285) | 1.8937 (3.2038) | 1.4383\nfull | 2.3448 (3.0683) | 2.3574 (2.9847) | 2.14057\n\nThe majority of experiments in this work were conducted on the DensePose-COCO dataset [5], containing 48k densely annotated people in the training set and 2.3k in the 
validation set. We follow the single-person protocol for this task and use ground truth bounding box annotations to crop images around each person.\n\nMetrics. For evaluation, we adapt the standard per-pixel metric used in [5] and [12] and report the percentage of points predicted with an error lower than a set of predefined thresholds, where the error is expressed in geodesic distances measured on the surface of the 3D model. Since, in this work, we focus specifically on the UV-regression part of the DensePose task while keeping the segmentation pipeline standard, we additionally report performance w.r.t. stricter geodesic thresholds and in two settings, where the segmentation is either predicted by the same network or assumed to be perfect and to correspond to the ground truth at test time.\n\nImplementation details. The architecture of the main DensePose predictor Φ1 is based on the Hourglass network [13], adapted to this task by [12]. We benchmark performance on 1, 2 and 8 stacks, but conduct most of the ablation studies on a 1-stack network for speed. All networks are trained for 300 epochs with SGD, batch size 16 and a learning rate of 0.1, decreasing by a factor of 10 after 180 and 270 epochs. Input images are normalized to the resolution of 256 × 256.\n\nUncertainty models. In our experiments, we analyze several modifications of networks with uncertainty heads, which we denote as follows: MSE stands for the uncertainty-free baseline of [12] trained with the MSE regression loss; simple corresponds to the elementary uncertainty model given by (2); simple-2D is a variant of (2) with two distinct σu and σv learned separately for the u- and v-dimensions; iid denotes the spatially-independent model of (9); and full stands for the complete model defined by (8).\nAs shown in Table 3, introducing uncertainty into the DensePose training brings significant gains in performance over the considered baseline. 
Our 2-stack architecture significantly outperforms much more powerful 8-stack models, especially when measured on tighter geodesic thresholds.\n\nTable 3: Ablation on uncertainty terms. The left part of the table reports upper-bound results under the assumption of perfect body parsing at test time.\n\nModel | UV only (GT body parsing): 1 cm, 2 cm, 3 cm, 5 cm, 10 cm | overall performance: 1 cm, 2 cm, 3 cm, 5 cm, 10 cm\nMSE | 5.57, 19.92, 35.74, 61.75, 89.53 | 4.31, 15.62, 28.30, 49.92, 74.15\nsimple | 6.71, 22.80, 38.97, 64.30, 90.04 | 5.27, 17.96, 31.08, 52.04, 74.39\nsimple-2D | 7.02, 23.40, 39.76, 64.89, 90.25 | 5.54, 18.53, 31.80, 52.67, 74.98\niid | 6.95, 23.39, 39.70, 64.95, 90.25 | 5.44, 18.46, 31.70, 52.66, 74.84\nfull | 6.78, 23.08, 39.42, 64.62, 90.14 | 5.34, 18.23, 31.51, 52.40, 74.69\n\nTable 4: Label-conditioned uncertainty. Exploiting ground truth labels as an additional direct cue accelerates learning and results in higher performance.\n\nModel | UV only (GT body parsing): 1 cm, 2 cm, 3 cm, 5 cm, 10 cm | overall performance: 1 cm, 2 cm, 3 cm, 5 cm, 10 cm\nfull | 6.78, 23.08, 39.42, 64.62, 90.14 | 5.34, 18.23, 31.51, 52.40, 74.69\n+gt (train) | 7.18, 23.84, 40.33, 65.36, 90.38 | 5.60, 18.68, 31.97, 52.57, 74.28\n\nTable 5: Introspective ensemble. We collect a number of models with identical architectures but trained with different hyperparameters (weights on the UV-term: 0.1, 0.2 and 0.5). More diverse ensembles are expected to deliver higher gains in performance in all settings.\n\nModel | Best model: 5 cm, 10 cm, 20 cm | Average: 5 cm, 10 cm, 20 cm | Ours: 5 cm, 10 cm, 20 cm\nMSE | 49.92, 74.15, 83.01 | 50.49, 75.09, 83.82 | –, –, –\nsimple | 52.04, 74.39, 82.74 | 54.26, 75.49, 82.84 | 54.46, 75.55, 82.86\niid | 52.66, 74.84, 83.01 | 54.15, 75.29, 82.56 | 54.55, 75.59, 82.77\n\n
This provides additional evidence for the hypothesis that uncertainty-based training facilitates learning with noisy annotations and allows the model to decrease the associated jitter in predictions.\nThe ablation on different variants of the uncertainty models is given in Table 3. In terms of the accuracy of UV-predictions, the more advanced models (iid and full) perform on par with their simpler counterparts (simple and simple-2D) (note that the test complexity of the main predictor is identical for all models). Their advantage, however, becomes apparent when looking at the log-likelihood of the ground truth labels evaluated by each of the models (see Table 2). In this setting, the full model clearly provides a more meaningful representation of the learned distribution, which is critical for numerous downstream tasks.\n\nLabel-conditioned uncertainty. Following the discussion of Section 3.3, we ablated the effect of using ground truth annotations as a direct cue for learning uncertainty. Table 4 shows immediate benefits of doing so in terms of the target UV-metrics, but we also observed a significant increase in the log-likelihoods of the true labels evaluated by the gt-based model. Note that no ground truth information is required by the model at test time, as long as uncertainty estimates are not utilized.\n\nIntrospective ensembles. Finally, we benchmark the performance of the proposed ensembling techniques for several variants of the models with uncertainty. Exploiting uncertainty parameters for finding the right balance consistently outperforms the averaging late-fusion baseline in all tested scenarios over a range of models (see Table 5).\n\n6 Conclusions\n\nIn this paper we have investigated the use of introspective uncertainty prediction models to improve the robustness and expressiveness of models for dense structured label prediction. 
We have introduced a method to estimate, using a convolutional neural network, an uncertainty model which potentially correlates the errors of all the pixels in an image. We have applied these ideas to the DensePose task, showing how these approaches can result in significant performance improvements compared to the current state of the art. Since the structure of the regressor is unchanged compared to the latter approaches, these improvements are solely attributable to the models’ better understanding of the uncertainty in the data. This is particularly beneficial for problems, such as DensePose, where the quality of manual labels is intrinsically limited.\n\nReferences\n\n[1] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight uncertainty in neural networks. In ICML, 2015.\n[2] Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J. Black. Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. In ECCV, 2016.\n[3] L. Breiman. Bagging predictors. Machine Learning, 24(2), 1996.\n[4] Y. Gal and Z. Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In ICML, 2016.\n[5] Rıza Alp Güler, Natalia Neverova, and Iasonas Kokkinos. DensePose: Dense human pose estimation in the wild. In CVPR, 2018.\n[6] J. M. Hernández-Lobato and R. P. Adams. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In ICML, 2015.\n[7] A. Kendall and Y. Gal. What uncertainties do we need in Bayesian deep learning for computer vision? In NIPS, 2017.\n[8] M. E. Khan, D. Nielsen, V. Tangkaratt, W. Lin, Y. Gal, and A. Srivastava. Fast and scalable Bayesian deep learning by weight-perturbation in Adam. In ICML, 2018.\n[9] A. Der Kiureghian and O. Ditlevsen. Aleatory or epistemic? Does it matter? In Special Workshop on Risk Acceptance and Risk Communication, 2007.\n[10] B. 
Lakshminarayanan, A. Pritzel, and C. Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In NIPS, 2017.\n[11] Guilin Liu, Fitsum A. Reda, Kevin J. Shih, Ting-Chun Wang, Andrew Tao, and Bryan Catanzaro. Image inpainting for irregular holes using partial convolutions. In ECCV, 2018.\n[12] Natalia Neverova, James Thewlis, Rıza Alp Güler, Andrea Vedaldi, and Iasonas Kokkinos. Slim DensePose: Thrifty learning from sparse annotations and motion cues. In CVPR, 2019.\n[13] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016.\n[14] D. Novotny, D. Larlus, and A. Vedaldi. Learning 3D object categories by looking around them. In ICCV, 2017.\n[15] I. Osband, C. Blundell, A. Pritzel, and B. Van Roy. Deep exploration via bootstrapped DQN. In NIPS, 2016.\n[16] T. Pearce, M. Zaki, A. Brintrup, and A. Neely. High-quality prediction intervals for deep learning: A distribution-free, ensembled approach. In ICML, 2018.\n[17] Ke Sun, Yang Zhao, Borui Jiang, Tianheng Cheng, Bin Xiao, Dong Liu, Yadong Mu, Xinggang Wang, Wenyu Liu, and Jingdong Wang. High-resolution representations for labeling pixels and regions. arXiv:1904.04514, 2019.\n[18] N. Tagasovska and D. Lopez-Paz. Frequentist uncertainty estimates for deep learning. arXiv:1811.00908, 2019.\n[19] Gül Varol, Javier Romero, Xavier Martin, Naureen Mahmood, Michael J. Black, Ivan Laptev, and Cordelia Schmid. Learning from synthetic humans. In CVPR, 2017.", "award": [], "sourceid": 522, "authors": [{"given_name": "Natalia", "family_name": "Neverova", "institution": "Facebook AI Research"}, {"given_name": "David", "family_name": "Novotny", "institution": "Facebook AI Research"}, {"given_name": "Andrea", "family_name": "Vedaldi", "institution": "University of Oxford / Facebook AI Research"}]}