Review for NeurIPS paper: CoSE: Compositional Stroke Embeddings

NeurIPS 2020

CoSE: Compositional Stroke Embeddings

Review 1

Summary and Contributions: The paper describes a network that can generate or extend 2D drawings, without making the sequential structure assumptions of sketch2rnn. There is a sensible encoder/decoder architecture and relational network.

Strengths: The paper gets good results, with some reasonable ideas, on a significant problem, i.e., modeling unstructured sketches. The qualitative results shown are compelling.

Weaknesses: A user interface is described, but no video is provided to give the reader a sense of how it works and what the interaction is like.

Correctness: Yes, as far as I could tell.

Clarity: Yes.

Relation to Prior Work: There is a considerable amount of related work on generating unstructured layouts in other domains that arguably should be cited. See this paper for a survey: "Learning Generative Models of 3D Structures," Siddhartha Chaudhuri, Daniel Ritchie, Jiajun Wu, Kai Xu, and Hao Zhang, Eurographics 2020 STAR; as well as: "PlanIT: Planning and Instantiating Indoor Scenes with Relation Graph and Spatial Prior Networks," Kai Wang, Yu-an Lin, Ben Weissmann, Manolis Savva, Angel X. Chang, and Daniel Ritchie, SIGGRAPH 2019; and "LayoutGAN: Synthesizing Graphic Layouts with Vector-Wireframe Adversarial Networks," J. Li, J. Yang, A. Hertzmann, J. Zhang, and T. Xu, IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), 2020.

Reproducibility: Yes

Additional Feedback: I remain positive that this is a useful incremental advance in modeling collections of strokes. I can't comment on the novelty of the techniques to the world of machine learning as a whole, which was a concern for some other reviewers. I have lowered my score slightly for this reason, but I am in favor of acceptance nonetheless.

Review 2

Summary and Contributions: Post-rebuttal comments: The authors' rebuttal addressed some of my concerns, and I have updated my score accordingly. The paper introduces a generative model for stroke-based drawing using stroke embeddings. The authors design an auto-encoder network as the stroke embedding and a relational module that predicts strokes from the input strokes. The key idea of this paper is to design an auto-encoder that embeds a variable-length stroke as its starting position plus a latent feature of fixed dimension. This stroke representation space also enables a relational model in latent feature space to model the relationships between strokes and to predict subsequent strokes. In this paper, the authors treat drawings as unordered stroke collections, and each stroke is regarded as a chronologically ordered sequence of 2D positions.
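As a rough illustration of the representation summarized above (not the authors' code), a drawing can be held as a collection of strokes, each reduced to a starting position plus a fixed-size code. The learned auto-encoder is replaced here by a hypothetical stand-in, `embed_stroke`, which only shifts the stroke to the origin and resamples it to a fixed number of points; every name and parameter below is an assumption made for the sketch.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class StrokeEmbedding:
    start: Point              # absolute starting position of the stroke
    code: Tuple[float, ...]   # fixed-size description, independent of position


def embed_stroke(points: List[Point], n_samples: int = 4) -> StrokeEmbedding:
    """Hypothetical stand-in for a learned encoder.

    Maps a variable-length stroke to (start, fixed-size code) by translating
    the stroke so it begins at the origin and linearly interpolating it at
    n_samples evenly spaced fractional indices.
    """
    if not points:
        raise ValueError("a stroke needs at least one point")
    if n_samples < 2:
        raise ValueError("n_samples must be at least 2")
    x0, y0 = points[0]
    rel = [(x - x0, y - y0) for x, y in points]
    last = len(rel) - 1
    code: List[float] = []
    for i in range(n_samples):
        t = i * last / (n_samples - 1)   # fractional index into the stroke
        lo = int(t)
        hi = min(lo + 1, last)
        frac = t - lo
        code.append(rel[lo][0] + frac * (rel[hi][0] - rel[lo][0]))
        code.append(rel[lo][1] + frac * (rel[hi][1] - rel[lo][1]))
    return StrokeEmbedding(start=(x0, y0), code=tuple(code))


def embed_drawing(strokes: List[List[Point]]) -> List[StrokeEmbedding]:
    """Embed each stroke independently; the order of strokes carries no meaning."""
    return [embed_stroke(s) for s in strokes]
```

In this toy version, two strokes that trace the same shape with different numbers of points, or at different positions on the canvas, receive (nearly) the same code and differ only in `start`, which is the local/global separation the summary describes.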

Strengths: (1) The idea of factoring local appearance of a stroke from the global structure of the drawing is novel. (2) A stroke embedding module and a relational model are designed for capturing local information and relationship of strokes, respectively. Extensive experiments have been conducted to show the efficiency of this architecture. (3) This paper is well written.

Weaknesses: (1) My major concern about this work is the diversity of the predicted future strokes for given input strokes, because there exist many combinations of the predicted strokes that are hard to generate. Although the paper adopts a Gaussian mixture for the starting position and the latent code prediction (lines 169-177), the random dynamics of strokes are surely still hard to model using the proposed method. (2) Due to the random dynamics of strokes, it is hard to define whether a specific prediction of the next stroke of a drawing is good or bad, especially for flowcharts. (3) The proposed method can only obtain sampled positions of strokes, so the continuity between strokes may be destroyed. For example, in the right part of Fig. 5, the ears and face of the cat head are disconnected. (4) Since one of the major differences between the proposed model and Sketch-RNN is that this model predicts the next starting point, the performance of the model on the starting point and on the next stroke should be compared separately. (5) Other comments: - An ablation study of the relational model (with/without the relational model) is lacking. - Comparisons of the performance of the starting position and latent code prediction with Gaussian mixtures of different dimensions are lacking. - Training details such as the training time are lacking.
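To make the diversity point in (1) and the continuity point in (3) concrete, the following minimal sketch samples next-stroke starting positions from a 2D Gaussian mixture. It is not the paper's model: the mixture is isotropic, and the weights, means, and standard deviations are hypothetical values chosen only for illustration.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def sample_gmm_2d(
    weights: Sequence[float],
    means: Sequence[Point],
    sigmas: Sequence[float],
    rng: random.Random,
) -> Point:
    """Draw one 2D point from an isotropic Gaussian mixture."""
    if not (len(weights) == len(means) == len(sigmas)):
        raise ValueError("weights, means and sigmas must have equal length")
    # Pick a component according to the mixture weights, then sample from it.
    k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    mx, my = means[k]
    return (rng.gauss(mx, sigmas[k]), rng.gauss(my, sigmas[k]))


# Hypothetical two-mode prediction: the next stroke could plausibly start
# at either of two well-separated places in the drawing.
rng = random.Random(0)
weights = [0.5, 0.5]
means = [(0.0, 0.0), (10.0, 0.0)]
sigmas = [0.1, 0.1]

samples: List[Point] = [sample_gmm_2d(weights, means, sigmas, rng) for _ in range(200)]
near_left = sum(1 for x, _ in samples if x < 5.0)
print(f"{near_left} of {len(samples)} samples fall near the first mode")
```

Each draw commits to one mode, so repeated sampling yields different but individually plausible starting points. A sampled start is also only close to, not exactly on, a mode, which is one way the gaps between connected strokes noted in (3) can arise.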

Correctness: Yes

Clarity: The writing of this paper is good.

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: Due to the randomness of stroke sketching, it seems difficult to predict the strokes in the next step. A good experimental design that clearly demonstrates the advantage of the stroke prediction function in this paper would help to increase its impact.

Review 3

Summary and Contributions: The paper introduces an autoregressive generative model based on an autoencoder for penstrokes, considered as sequences of keypoints. The paper particularly focuses on disentangling local aspects of individual strokes from the global, relational way in which strokes are composed to form a whole sketch.

Strengths: The paper makes strong contributions in representation learning for the strokes, and succeeds at reconstructing and predicting interesting diagrams and drawings. Despite an autoencoder setup, the paper focuses its experiments and evaluation on predicting the completion of a partial drawing.

Weaknesses: The paper could have been just a little more impressive by using some sort of neural or stochastic renderer to map from the stroke sequences to images and vice versa. I would like to see comparisons to seq2seq models with embedding dimensionality equal to that of the CoSE model; this comparison appears in the paper only for D=8.

Correctness: The claims and methods appear to be accurate, as far as I can tell. The empirical methodology is NeurIPS-quality.

Clarity: The paper is not only clear, but manages to nicely pass from the intuitive description of its target problem at the beginning to the technical methodology. The figure captions could have been somewhat clearer.

Relation to Prior Work: The paper neatly situates itself with respect to image-based models, generative ink models, and program synthesis methods.

Reproducibility: Yes

Additional Feedback: The authors have addressed my worry about what happens if gradients propagate through the relational model.

Review 4

Summary and Contributions: This work presents a generative model for compositional sketches. The proposed model auto-regressively models strokes with a transformer to learn relations between the strokes. The proposed model is evaluated on the DiDi and Quick, Draw! datasets.

Strengths: The main strengths are: + the paper is well written and easy to understand. + the novelty with respect to prior work, esp. the Sketch-RNN Decoder [6], is clearly explained. + the qualitative examples in Fig. 7 clearly show the benefit with respect to the Sketch-RNN Decoder [6]. + the paper includes enough ablations to demonstrate the effectiveness of its various components.

Weaknesses: The rebuttal and discussion clarified my concerns about [1,2] (although I would highly encourage that these works be cited for a more complete related work section). However, I remain unconvinced by the novelty of the approach -- the fact that transformer-based models work better than simple VAE-based models is not surprising to the general NeurIPS audience. However, I do agree that from the point of view of stroke-based generative models the work is novel and makes a good contribution to this specific field.
-------------------------------------------------------------------------------
The main weaknesses are: - Limited novelty -- very similar models have been proposed in prior work [1]. Novelty with respect to [1] is not clear -- both methods use a transformer-based architecture to model long-range dependencies in strokes. The advantage of an autoregressive structure along with transformers is not clear, as transformers contain self-attention layers to capture long-range dependencies. Even though the work [1] appears at CVPR 2020 and does not influence the rating of the proposed method, evaluation metrics like mAP% [1] for evaluating the quality of the latent spaces should have been considered. - Limited baselines. The work should compare with recent works [1,2] and not be limited to the older SketchRNN baseline. An ideal dataset for comparison would be QuickDraw50M or Sketchy. Only qualitative results are provided with QuickDraw, and a more detailed evaluation against the prior state of the art would be beneficial and would highlight the advantages of the proposed method across datasets.
[1] Sketchformer: Transformer-based Representation for Sketched Structure, CVPR 2020.
[2] Synthesizing human-like sketches from natural images using a conditional convolutional decoder, WACV 2020.

Correctness: Yes.

Clarity: Yes.

Relation to Prior Work: No.

Reproducibility: Yes

Additional Feedback: