Mon Dec 4th through Sat the 9th, 2017 at Long Beach Convention Center
The paper proposes a novel neural network-based method for intrinsic image decomposition that can benefit from training on both labeled images (i.e., with ground truth shape / lighting / reflectance maps available) and unlabeled images (where only the image itself is provided). Specifically, it trains a network that takes an image as input and produces shape / lighting / reflectance maps, using a loss on these output maps when ground truth is available, and also an "autoencoder" loss obtained by rendering / recombining the predicted maps to produce an image and comparing it to the input.
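A minimal sketch of how I read this objective, under simplifying assumptions that are mine and not the paper's: images and maps are flat lists of floats, the shape and lighting outputs are collapsed into a single shading map, the renderer is a plain per-pixel product, and `recon_weight` is a hypothetical balancing hyperparameter.

```python
# Hypothetical sketch of the combined objective; all names are illustrative.

def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def recombine(reflectance, shading):
    """Stand-in renderer: per-pixel product of reflectance and shading."""
    return [r * s for r, s in zip(reflectance, shading)]

def total_loss(image, pred, truth=None, recon_weight=1.0):
    """Reconstruction loss always; supervised loss only when truth is given.

    pred and truth are dicts with keys "reflectance" and "shading".
    """
    recon = mse(recombine(pred["reflectance"], pred["shading"]), image)
    loss = recon_weight * recon
    if truth is not None:
        # Labeled example: also penalize errors on the intermediate maps.
        loss += sum(mse(pred[k], truth[k]) for k in ("reflectance", "shading"))
    return loss
```

Calling `total_loss(image, pred)` covers the unlabeled case, and passing `truth` adds the supervised terms for labeled images.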
While the autoencoder loss by itself would clearly be insufficient (a network that learned to produce flat shading and reflectance = input image would minimize this loss perfectly), the paper shows that, when it is used to augment a supervised training set with a broader set of unlabeled images, it can lead to improved performance and the ability to deal with more diverse scene content.
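The degenerate solution mentioned above can be checked with a small sketch. It assumes, for illustration only, that the image is recombined as a per-pixel product of reflectance and shading; `reconstruction_loss` is a hypothetical helper, not the paper's code.

```python
def reconstruction_loss(image, reflectance, shading):
    """Mean squared error between reflectance * shading and the input image."""
    recon = [r * s for r, s in zip(reflectance, shading)]
    return sum((x - y) ** 2 for x, y in zip(recon, image)) / len(image)

image = [0.2, 0.8, 0.5, 0.1]

# Degenerate decomposition: constant shading, reflectance copies the image.
flat_shading = [1.0] * len(image)
trivial_reflectance = list(image)

trivial_loss = reconstruction_loss(image, trivial_reflectance, flat_shading)
print(trivial_loss)  # 0.0
```

The loss is exactly zero even though the decomposition carries no information about shape or lighting, which is why the supervised term is needed to pin down the intermediate maps.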
Overall, the paper demonstrates a new and interesting learning-based approach to intrinsic image decomposition. My only comment about the paper is that it seems to lack a more standard evaluation/comparison to existing algorithms. While the comparisons to SIRFS in Table 1 / Fig 5 are useful, results on a more standard dataset (like the MIT intrinsic image set) would allow the reader to gauge how the method performs in comparison to the larger set of intrinsic image algorithms out there.
Paper describes a method to recover intrinsic images (albedo, shading) using a clever autoencoder formulation.
One autoencoder produces an albedo, a shape model, and a lighting representation. Another combines the shape model
and lighting representation to produce a shaded shape (it's essentially a learned renderer). This shaded shape is then combined
with the albedo to produce a predicted image. This collection of autoencoders is trained with a set of synthetic
example albedos and shading representations, together with their renderings, and with images. The synthetic data
encourages the intermediate representations to take the right form; the real data causes the autoencoders to behave
The idea is novel, and the approach seems to work well and to generalize. Comparison to SIRFS is successful, but
likely somewhat unfair. SIRFS isn't really trained (there's some adjustment of prior parameters), and doesn't have
much access to a shape prior. This process has a strong and important shape prior, though transfer to new
shapes is quite successful.
The final system is somewhat conditioned on category, and must be updated when the category changes (l222 et seq.).
This is a problem, but likely opens room for new ideas: for example, one might set recognition before intrinsic image
recovery rather than after.
In summary, paper contains a novel method for an established problem; offers results that suggest the method is
competitive with the best; and has lots of room for expansion. It should be accepted.
The paper presents an interesting approach to the intrinsic image decomposition problem: given an input RGB image, it first decomposes it into shape (normals), reflectance (albedo) and illumination (point light) using an encoder-decoder deep architecture with 3 outputs. A second encoder-decoder then takes the predicted normals and light and outputs the shading of the shape. Finally, the result is obtained by multiplying the estimated reflectance (from the 1st encoder-decoder) by the estimated shading.
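To make the data flow concrete, here is a rough sketch of the last two stages. Note the substitution: the paper uses a learned shader, whereas this sketch swaps in analytic Lambertian shading with a single directional light. Grayscale albedo, flat per-pixel lists, and all function names are assumptions for illustration.

```python
import math

def normalize(v):
    """Scale a 3-vector to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

def lambertian_shading(normals, light_dir):
    """Per-pixel shading max(0, n . l); an analytic stand-in for the learned shader."""
    l = normalize(light_dir)
    return [max(0.0, sum(a * b for a, b in zip(normalize(n), l)))
            for n in normals]

def compose(albedo, shading):
    """Final image: per-pixel product of albedo and shading."""
    return [a * s for a, s in zip(albedo, shading)]

# Three pixels: facing the light, perpendicular to it, and facing away.
normals = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 0.0, -1.0]]
albedo = [0.8, 0.5, 0.9]
light = [0.0, 0.0, 2.0]

shading = lambertian_shading(normals, light)
image = compose(albedo, shading)
```

Only the pixel whose normal faces the light receives non-zero shading here; a learned shader could in principle capture effects that this simple model cannot.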
The idea of having a reconstruction loss to recover the input image is interesting, but I believe it is only partially exploited in the paper. The network architecture still needs labeled data for the initial training. Also, in lines 164-166 the authors mention that the unlabeled data are used together with labeled data so the "representations do not shift too far from those learnt". Does this mean that the unlabeled data corrupt the network in such a way that the reconstruction is valid but the intermediate results are not correct? That would imply that such data-driven inference, which does not take into account the physical properties of light/material/geometry, is not optimal.
I found it difficult to parse the experiments and how the network is being deployed. How are the train/validation/test sets created? For example, in 4.1, do the motorbike numbers correspond to a test set of motorbikes that were not seen during the initial training (it is mentioned that the held-out dataset is airplanes)? And when decomposing a new shape into its intrinsic properties, do you directly take the output of the 3 decoders, or do you run another round of training that includes the test data?
Another important limitation is that the illumination model is just a single point light source. Such a simple illumination model (even a spotlight) cannot be applied to real images.
Relatedly, the experiments are only on the synthetic dataset proposed in the paper, and there is no comparison with [13, 15] (or a discussion of why it is missing). Without testing on real images or on a benchmark such as MIT intrinsics, it is difficult to argue for the effectiveness of the method outside of synthetic images.
Also, is Figure 3 the output of the network, or is it an illustration of what is not captured by simple Lambertian shading?
Some related work is missing, both for radiometric and for learned decomposition into intrinsic properties.
Decomposition into shape/illumination/reflectance:
Single Image Multimaterial Estimation, Lombardi et al, CVPR 12
Shape and Reflectance Estimation in the Wild, Oxholm et al, PAMI 2016, ECCV 12
Reflectance and Illumination Recovery in the Wild, Lombardi et al, PAMI 2016, ECCV 12
Deep Reflectance Maps, Rematas et al, CVPR 2016
Decomposing Single Images for Layered Photo Retouching, Innamorati et al, CGF 2017 (arxiv 16)
Deep Outdoor Illumination Estimation, Hold-Geoffroy et al, CVPR 2017 (arxiv 16)
Learning shading using deep learning:
Deep Shading: Convolutional Neural Networks for Screen-Space Shading, Nalbach et al, CGF 2017 (arxiv 16)