Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
The paper proposes a tensorized version of the transformer to reduce the number of parameters in the neural network architecture. The idea is original, and the empirical results are convincing on language modeling as well as machine translation. The writing, however, is not very clear (e.g. paragraph 2 of the introduction, section 2.2); please have your paper proofread for language and grammar. Nonetheless, the results are quite significant and will help scale the transformer to larger problems.

Comments:
1. In Theorem 3.1, the relation of the basis vectors to Q, K, V is confusing in the current notation. It would be useful to specify it clearly.
2. Did you try using more than 2 cores, or a bigger model with more layers (especially since the original transformer was limited by GPU memory)?
3. The paper is missing discussion of some closely related work: a) "Generating Long Sequences with Sparse Transformers", Rewon Child, Scott Gray, Alec Radford, Ilya Sutskever.

**Edit**: Thanks to the authors for their response. Please add the discussion of the sparse transformer to the related work, and add important details such as your definition of basis vectors to the main paper, to improve clarity.
The authors propose a new attention mechanism in which the large number of parameters required to compute the affinities between the states and input units is reduced by sharing the parameters used to encode the different align-able units. As a result, the number of parameters needed to incorporate the attention mechanism drops substantially, and results on language modelling and translation seem to indicate that the proposed mechanism generalizes better with fewer parameters. The paper provides a good self-contained extension of the transformer class of models, which is worth publishing.
# Clarity

I had difficulty understanding how the authors implemented their solution, so I looked at the code in the supplementary material. This code failed to compile and had numerous confusing aspects, and the authors did not link to the actual code used in training the model.

*Confusing aspect of code #1*

Some parts of the code are not valid Python. For example:

```
for i in range():
    cores_tensor[i][i][i] = self.core[i]
```

*Confusing aspect of code #2*

Within the MultiLinearAttention class, the authors provide the following branching logic:

```
if n_head > 1:
    output_1 = self.Tattention(q_s, k_s, v_s, mb_size,d_v)
    output_2 = self.Tattention(q_s, k_s, v_s, mb_size,d_v)
    output = (output_1+output_2)*0.5
else:
    ouput = self.Tattention(q_s, k_s, v_s, mb_size,d_v)
```

My understanding is that `self.Tattention` is a pure function, so the True branch should always be numerically identical to the False branch, but at 2x the cost.

===

# Originality

This result seems motivated by the same problem as "Generating Long Sequences With Sparse Transformers" (https://arxiv.org/abs/1904.10509, April 2019), but I could find no comparison with that work.

===

# Significance

I think the result of using 50% fewer parameters is quite significant. However, I would also like to see the total FLOPs usage compared to the baseline, as FLOPs are frequently the limiting factor for training and deployment of models.
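
The concern about parameters versus FLOPs can be made concrete with a small sketch. This is not taken from the paper or its supplementary code: it assumes a hypothetical stack of square, bias-free dense layers and the common approximation of two FLOPs (one multiply, one add) per weight per token. All function names and sizes are illustrative. In this toy setting, sharing one weight matrix across layers cuts the parameter count by the number of layers while leaving the per-token FLOPs unchanged.

```python
from typing import Tuple

# Illustrative only: names and sizes are hypothetical, not from the paper.


def dense_params(d_in: int, d_out: int) -> int:
    """Number of weights in a bias-free dense layer."""
    return d_in * d_out


def dense_flops_per_token(d_in: int, d_out: int) -> int:
    """Approximate FLOPs for one token: one multiply and one add per weight."""
    return 2 * d_in * d_out


def stack_cost(n_layers: int, d_model: int, share_weights: bool) -> Tuple[int, int]:
    """Return (parameters, FLOPs per token) for a stack of square dense layers."""
    per_layer_params = dense_params(d_model, d_model)
    # Sharing stores a single matrix, but every layer still multiplies by it.
    params = per_layer_params if share_weights else n_layers * per_layer_params
    flops = n_layers * dense_flops_per_token(d_model, d_model)
    return params, flops


baseline = stack_cost(n_layers=6, d_model=512, share_weights=False)
shared = stack_cost(n_layers=6, d_model=512, share_weights=True)
print("baseline (params, flops/token):", baseline)
print("shared   (params, flops/token):", shared)
```

Whether the proposed tensorized attention behaves like this toy case is exactly what a reported FLOPs comparison against the baseline would settle.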