NeurIPS 2020

MuSCLE: Multi Sweep Compression of LiDAR using Deep Entropy Models


Review 1

Summary and Contributions: This paper proposes a new compression method for point cloud data, focusing on point cloud time series obtained by LiDAR. The proposed method builds a neural network that models point occupancy on an octree. The model explicitly considers dependencies among tree nodes as well as the temporal dependency between two temporally adjacent point clouds; the latter is modeled as a conditional probabilistic model under a first-order Markov assumption. The experimental results show that the proposed techniques achieve a better compression-accuracy trade-off than the baselines.
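To make the entropy-modeling idea concrete, here is a minimal sketch (my own illustration, not the authors' code) of how a learned occupancy model over octree nodes yields a bit cost: each node's 8-bit child-occupancy pattern is one of 256 symbols, a network predicts a conditional distribution over those symbols from context features, and the negative log-probability of the true symbol is approximately what an arithmetic or range coder would spend (it is also the training loss). All names and dimensions below are assumptions.

```python
# Hypothetical sketch, not the authors' code: a learned entropy model over
# octree child-occupancy symbols, conditioned on per-node context features
# (e.g. from the parent node and, in MuSCLE's setting, the previous sweep).
import torch
import torch.nn as nn
import torch.nn.functional as F

class OccupancyEntropyModel(nn.Module):
    def __init__(self, context_dim=128, num_symbols=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(context_dim, 256), nn.ReLU(),
            nn.Linear(256, num_symbols),
        )

    def forward(self, context):           # context: (N, context_dim)
        return self.mlp(context)          # logits over 256 occupancy symbols

model = OccupancyEntropyModel()
context = torch.randn(1024, 128)          # per-node context features (assumed)
symbols = torch.randint(0, 256, (1024,))  # ground-truth occupancy bytes

log_probs = F.log_softmax(model(context), dim=-1)
# Expected code length in bits under the model: what an entropy coder would
# approximately spend on these 1024 nodes.
bits = -log_probs.gather(1, symbols[:, None]).sum() / torch.log(torch.tensor(2.0))
print(f"estimated bitstream size: {bits.item():.0f} bits for 1024 nodes")
```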

Strengths:
- Modeling temporal dependency for LiDAR stream data is a simple but promising idea for this task.
- The experiments show a substantial improvement over baseline methods.

Weaknesses: Although the proposed method is a good application of neural networks, it is a combination of techniques that are well known in the NeurIPS community. In this sense, the technical novelty and depth are limited. [AFTER REBUTTAL] Based on the other reviewers' comments and the discussion of the contribution, I have changed my score.

Correctness:
- The proposed method correctly models what the authors argued in the introduction.
- The experiments were mostly conducted correctly except for some unclear points (see comments below).

Clarity: The paper is basically well-written and easy to follow, although there are some unclear points. Please see my comments below.

Relation to Prior Work: Differences from prior work are clearly stated: (1) consideration of temporal dependency, and (2) intensity compression.

Reproducibility: Yes

Additional Feedback:
- Is \sigma at Line 138 an MLP? Or does it involve window functions?
- Please clarify the definition of bitrate. I believe the authors do not count the storage for the parameters of the trained neural network; is this right? (A toy illustration of the convention I have in mind follows below.)
- Also, if I understand correctly, all the evaluations of compression performance include intensities. What is the compression performance if intensities are excluded from the evaluation? (zlib is a general-purpose compressor, so it is not a very strong baseline here.)
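For concreteness, this is the convention the bitrate question assumes (illustrative only; the paper would need to confirm it): the entropy-coded payload is divided by the number of points, and the trained network's weights are treated as shared side information that is not counted.

```python
# Assumed convention, not confirmed by the paper: bitrate = payload bits per
# point, with model weights excluded as shared side information.
def bits_per_point(payload_bytes: int, num_points: int) -> float:
    return 8.0 * payload_bytes / num_points

# e.g. a 140 kB compressed sweep of ~120k LiDAR points:
print(bits_per_point(140_000, 120_000))  # ~9.33 bits per point
```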


Review 2

Summary and Contributions: In this paper, a point cloud compression approach is proposed that exploits spatial-temporal information. The main contribution lies in the investigation of temporal information.

Strengths:
- The proposed approach is straightforward and effective.
- The experimental results are comprehensive.
- Demonstrating the effectiveness of compression on downstream tasks is very interesting.

Weaknesses:
- The paper builds upon the previous work OctSqueeze, so the novelty may be limited to adding temporal information. I am not sure whether this is sufficient for NeurIPS.
- Considering that this paper focuses on the compression of multiple sweeps of the point cloud, it is recommended to compare with V-PCC, i.e., TMC2, in the experimental part.
- Besides, I am curious why TMC13 is not included as a baseline method. In my opinion, TMC13 is a strong baseline and should be included in the comparison.

Correctness: The claims are correct and the evaluation results are extensive.

Clarity: This paper is well written and is easy to follow.

Relation to Prior Work: The paper provides sufficient discussion of previous work.

Reproducibility: Yes

Additional Feedback:
- Computational complexity: please provide the running time on GPU and CPU platforms, respectively. It is also necessary to provide the running time of the traditional codecs.
- The proposed model is trained using 16 GPUs, so please provide the memory consumption of a single batch. Also, is it possible to train the whole model with fewer GPUs?
- Please provide the average number of points for the testing datasets.
- It seems that the improvement over the previous OctSqueeze baseline drops at the low-bitrate setting in Fig. 2.


Review 3

Summary and Contributions: Edit: I read the rebuttal and continue to support my positive review (7: good submission; accept). This paper presents a method for compressing LiDAR data, including both 3D locations and intensity values. The approach builds on existing learning-based LiDAR compression models that use an octree representation. It expands on those methods by conditioning on the previous LiDAR sweep, by using "continuous convolution" to better model data that doesn't fall on a grid, and by compressing intensity values. These enhancements lead to better compression rates (up to 35%).
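For readers unfamiliar with "continuous convolution", here is a minimal sketch of the general idea (in the spirit of parametric continuous convolutions; the names and dimensions are illustrative, not the paper's actual architecture): kernel weights are generated by an MLP on continuous relative offsets, so features can be aggregated from neighbors that do not lie on a regular grid.

```python
# Illustrative sketch of continuous convolution, not the paper's exact model:
# an MLP maps each continuous 3D offset to a kernel matrix that weights the
# corresponding neighbor's features.
import torch
import torch.nn as nn

class ContinuousConv(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        # Maps a 3D offset to an (in_dim x out_dim) kernel matrix.
        self.kernel_net = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, in_dim * out_dim),
        )
        self.in_dim, self.out_dim = in_dim, out_dim

    def forward(self, offsets, neighbor_feats):
        # offsets:        (N, K, 3)  continuous displacement to each neighbor
        # neighbor_feats: (N, K, in_dim)
        W = self.kernel_net(offsets).view(*offsets.shape[:2], self.in_dim, self.out_dim)
        # Weight each neighbor's features by its offset-dependent kernel, sum over neighbors.
        return torch.einsum('nki,nkio->no', neighbor_feats, W)

conv = ContinuousConv(in_dim=16, out_dim=32)
out = conv(torch.randn(100, 8, 3), torch.randn(100, 8, 16))  # -> (100, 32)
```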

Strengths:
1) LiDAR compression is an important problem for the ML community that is interesting both academically and for real-world applications (self-driving cars being a primary example).
2) The compression gains are impressive. On standard datasets, the proposed method saves between 7% and 35% at equal quality levels. For reference, in video compression, papers are often published that provide a 2% gain, albeit over much more mature baselines, and new standards are typically built when gains reach 25-30%.
3) Analysis and feature aggregation when the data is irregular and doesn't fall on a grid is difficult to do well, since the assumptions of standard approaches (conv nets and MLPs) don't fit. Although this paper builds on existing methods ("continuous convolution" and "deep sets"), it adapts them to the 3D octree structure used to represent LiDAR data.
4) Evaluation is performed both on direct metrics (F1, PSNR, etc.) and on real downstream tasks.

Weaknesses: The comments about the number of GPU passes needed for the occupancy model were not clear to me (Section 2.5). Some extra information would be helpful. Runtime numbers would also strengthen the paper, since real-time encoding is needed for live applications.

Correctness: yes

Clarity: yes

Relation to Prior Work: yes

Reproducibility: No

Additional Feedback: Regarding reproducibility, I think additional implementation details would be needed to reproduce the results, but I do *not* fault the authors for not including them in the paper. I think the structure and top priorities (background, related work, the new method, and evaluation) are very appropriate for an 8-page paper. Additional information is also given in the supplementary material.


Review 4

Summary and Contributions: The paper proposes a principled approach to modeling point cloud streams and uses the model for entropy coding of such streams. The resulting compression rates appear to beat the state of the art. The paper is mostly easy to read, well motivated, and uses reasonable and elegant formulations.

Strengths: First off, the results beat the state of the art. (I have not read other work on LiDAR stream compression, so I have to go by the comparisons in the paper.) Second, the paper is of the type that inspires further thought, which means the impact may be broader than it seems. Good probability density models are useful for compression, but they are also useful for inference: could these models be used directly for downstream tasks like tracking or segmentation? Also, the way the model is set up, it does not seem to preclude the possibility that the model learns scale-invariant features, in an almost fractal view of the world, where the dependencies at different depths depend not on the depth so much as on the occupancies of nearby nodes, thus allowing the model to generalize what is learned about large objects to smaller objects. A range of experiments could be done along these lines, from investigating the learned parameters to ablation studies: can the model's depth be directly increased without retraining the network by duplicating layers from the coarser levels? Can the same model be retrofitted to work on higher-density clouds? And, interestingly, how invariant is it to scaling (shrinking or expanding) the point cloud? (A toy sketch of the weight-sharing idea follows below.)
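To illustrate the weight-sharing variant suggested above (purely my construction, not the paper's model): if the per-depth predictor shares its parameters across octree levels, then nothing ties the model to a particular depth, and the tree could in principle be deepened at test time without retraining.

```python
# Toy illustration of weight sharing across octree depths; hypothetical,
# not the paper's architecture.
import torch
import torch.nn as nn

shared_predictor = nn.Sequential(   # one module reused at every depth
    nn.Linear(64, 128), nn.ReLU(),
    nn.Linear(128, 256),            # logits over 256 occupancy symbols
)

def predict(level_feats_by_depth):
    # Works for any depth, including depths unseen during training.
    return [shared_predictor(f) for f in level_feats_by_depth]

feats = [torch.randn(2 ** d, 64) for d in range(1, 6)]  # growing node counts
logits = predict(feats)
```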

Weaknesses: A couple of small issues. This sentence is unclear: "We obtain intensities in the quantized point cloud by taking the nearest neighbor in the source point cloud." Figure 1 contains the type of illustration that is typical of modern ML papers (and often required by reviewers), but I find these illustrations hand-wavy, and I (think I) understood the method from the writing, not from the figure; that may be a matter of taste. It seems that the variable 't' was never defined (though one could infer its meaning). The context variable 'c' would also be better defined before it is used, rather than after. The range encoding reference format seems to be off. At a higher level, while the paper does compare with the state of the art, the formulation, before getting into neural nets, suggests the possibility of using simpler, more direct statistical models (a sketch of what I have in mind follows below). The paper would be stronger if the neural networks could be shown to beat these, and how they do it. And, as I mentioned in the strengths, discussing the probability model and its strengths could lead to a discussion of the model's uses beyond compression. Finally, some of the analysis I mentioned above in the strengths would make the paper even more exciting.
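The kind of "simpler, more direct statistical model" meant here could be as plain as a count-based conditional distribution over a node's occupancy symbol given its parent's symbol, with Laplace smoothing, used directly by a range coder. A sketch (my construction, not from the paper):

```python
# Count-based first-order baseline for octree occupancy entropy coding;
# hypothetical baseline suggested by the review, not from the paper.
import numpy as np

class MarkovOccupancyBaseline:
    def __init__(self, num_symbols=256, alpha=1.0):
        # Laplace-smoothed (parent, child) co-occurrence counts.
        self.counts = np.full((num_symbols, num_symbols), alpha)

    def fit(self, parent_symbols, child_symbols):
        np.add.at(self.counts, (parent_symbols, child_symbols), 1)

    def bits(self, parent_symbols, child_symbols):
        # Total code length a range coder would approximately spend.
        rows = self.counts[parent_symbols]
        p = rows[np.arange(len(child_symbols)), child_symbols] / rows.sum(axis=1)
        return float(-np.log2(p).sum())

model = MarkovOccupancyBaseline()
parents = np.random.randint(0, 256, 10000)
children = np.random.randint(0, 256, 10000)
model.fit(parents, children)
print(model.bits(parents, children) / len(children), "bits/symbol")
```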

Correctness: Seems right.

Clarity: The paper was a joy to read. It harkens back to early NIPS, when information theory and ML were more closely related.

Relation to Prior Work: I think the paper is well positioned in the background literature.

Reproducibility: Yes

Additional Feedback: