In this paper, we developed a computational hierarchical network model to understand the spatiotemporal sequence learning effects observed in the primate visual cortex. The model is a hierarchical recurrent neural network that learns to predict video sequences, using the incoming video signals as teaching signals. It performs fast feedforward analysis using a deep convolutional neural network with sparse convolution, and feedback synthesis using a stack of LSTM modules. The network learns a representational hierarchy by minimizing its prediction errors on the incoming signals at each level of the hierarchy. We found that recurrent feedback in this network leads to the development of semantic clusters of global movement patterns in the population codes of the units at the lower levels of the hierarchy. These representations facilitate the learning of relationships among movement patterns, yielding state-of-the-art performance in long-range video sequence prediction on benchmark datasets. Without further tuning, the model automatically exhibits the neurophysiological correlates of visual sequence memories that we observed in the early visual cortex of awake monkeys, suggesting that the principle of self-supervised prediction learning might be relevant to understanding the cortical mechanisms of representational learning.