Yoshua Bengio, Olivier Delalleau, Nicolas Roux
We present a series of theoretical arguments supporting the claim that a large class of modern learning algorithms that rely solely on the smoothness prior with similarity between examples expressed with a local kernel are sensitive to the curse of dimensionality, or more precisely to the variability of the target. Our discussion covers supervised, semisupervised and unsupervised learning algorithms. These algorithms are found to be local in the sense that crucial properties of the learned function at x depend mostly on the neighbors of x in the training set. This makes them sensitive to the curse of dimensionality, well studied for classical non-parametric statistical learning. We show in the case of the Gaussian kernel that when the function to be learned has many variations, these algorithms require a number of training examples proportional to the number of variations, which could be large even though there may exist short descriptions of the target function, i.e. their Kolmogorov complexity may be low. This suggests that there exist non-local learning algorithms that at least have the potential to learn about such structured but apparently complex functions (because locally they have many variations), while not using very specific prior domain knowledge.