Sun Dec 8 through Sat Dec 14, 2019, at the Vancouver Convention Center
The paper proposes to regularize the training of embedding layers using a knowledge graph that captures known relationships between entities. Instead of a straightforward approach using graph Laplacian embeddings, the paper proposes stochastic swapping of embeddings that should result in a similar loss function. The paper is well written, and the method seems relevant. The quantitative results are thorough, comparing the method with others across six example cases. The method is a contribution worthwhile to be presented at NeurIPS and is very likely relevant to the audience.
The proposed technique (named SSE) applies to neural networks (NNs) with embeddings of categorical inputs. Under SSE, some random noise is introduced into the inputs: each categorical input is changed with some (low) probability. This change can be done with (SSE-Graph) or without (SSE-SE) prior knowledge. SSE comes with a theoretical analysis: training an NN with SSE corresponds to a change of loss, and this loss change is evaluated theoretically. The paper is well written and clear. Just one thing: theta^* should have been defined in Equation 7. Although natural and simple, the main idea seems to be original. The experimental results in the SSE-SE framework show that the proposed method works consistently. Moreover, replacing dropout with SSE-SE appears to improve the results. The fact that the authors are able to exhibit the optimized loss in their setup and compare it to the "true" loss allows further development and study of SSE. Question: since SSE is close to dropout, and dropout can be seen as adding Gaussian noise, is there a similar interpretation for SSE?
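To make the mechanism the reviews describe concrete, here is a minimal sketch of the knowledge-graph-free variant (SSE-SE): with some small probability, each categorical index is replaced by one drawn uniformly from the vocabulary before the embedding lookup, so embeddings are effectively swapped during training. The function name and signature are my own illustration, not the authors' code.

```python
import random

def sse_se(indices, vocab_size, p=0.01, rng=random):
    """SSE-SE noising sketch: with probability p, replace each
    categorical index with one drawn uniformly from the vocabulary.
    Apply to a batch of indices before the embedding lookup."""
    out = []
    for idx in indices:
        if rng.random() < p:
            out.append(rng.randrange(vocab_size))  # swap in a random index
        else:
            out.append(idx)  # keep the original index
    return out

# Usage: noise a batch of token indices during training only.
batch = [3, 17, 42, 8]
noisy = sse_se(batch, vocab_size=100, p=0.1, rng=random.Random(0))
```

The SSE-Graph variant would instead sample the replacement index from the neighbors of `idx` in a knowledge graph rather than from the full vocabulary.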
The paper presents a novel and interesting regularization method, a theoretical analysis, and good results, yet I fear its main contributions might be limited to recommendation systems or other fields where knowledge graphs are available, easily constructed, or, in their absence, intuitively reasonable to replace with a complete graph. Outside those types of tasks, I did not find its arguments compelling as to why other fields or tasks would significantly benefit from such a method, despite the improved results on some NLP tasks. The simpler version of the regularizer, which in the absence of a knowledge graph assumes a complete graph, replaces each embedding index with a fixed probability by one drawn uniformly from {1, ..., N}. Despite its appealing theoretical properties, it also poses a risk of introducing a bias of its own. The results on NLP tasks did not show major improvements and lacked an explanation of why this type of regularizer would be beneficial and effective across different NLP tasks. As mentioned by the authors, language modeling is an instance of an NLP task for which this regularizer fits well, since it aims to estimate a conditional probability distribution over sequences of words and benefits from word augmentation due to the large vocabulary size. Yet many tasks lack these properties, and I fear that a task like NER might not benefit from this type of method in the absence of a knowledge graph. The paper did a fairly good job of conveying the main ideas clearly, yet it contains some minor inaccuracies, mainly in the experiments section.