Tao Jin, Zhou Zhao
The majority of existing multimodal sequential learning methods focus on how to obtain effective representations and ignore the importance of multimodal fusion. Bilinear attention network (BAN) is a commonly used fusion method, which leverages tensor operations to associate the features of different modalities. However, BAN has a poor compatibility for more modalities, since the computational complexity of the attention map increases exponentially with the number of modalities. Based on this concern, we propose a new method called generalizable multi-linear attention network (MAN), which can associate as many modalities as possible in linear complexity with hierarchical approximation decomposition (HAD). Besides, considering the fact that softmax attention kernels cannot be decomposed as linear operation directly, we adopt the addition random features (ARF) mechanism to approximate the non-linear softmax functions with enough theoretical analysis. We conduct extensive experiments on four datasets of three tasks (multimodal sentiment analysis, multimodal speaker traits recognition, and video retrieval), the experimental results show that MAN could achieve competitive results compared with the state-of-the-art methods, showcasing the effectiveness of the approximation decomposition and addition random features mechanism.