Globally Optimal Learning for Structured Elliptical Losses

Part of Advances in Neural Information Processing Systems 32 (NeurIPS 2019)


Authors

Yoav Wald, Nofar Noy, Gal Elidan, Ami Wiesel

Abstract

Heavy-tailed and contaminated data are common in various applications of machine learning. A standard technique for handling regression tasks that involve such data is to use robust losses, e.g., the popular Huber loss. In structured problems, however, where there are multiple labels and structural constraints on the labels are imposed (or learned), robust optimization is challenging, and more often than not the loss used is simply the negative log-likelihood of a Gaussian Markov random field. In this work, we analyze robust alternatives. Theoretical understanding of such problems is quite limited, with guarantees on optimization given only for special cases and non-structured settings. The core of the difficulty is the non-convexity of the objective function, which implies that standard optimization algorithms may converge to sub-optimal critical points. Our analysis focuses on loss functions that arise from elliptical distributions, which appealingly include most loss functions proposed in the literature as special cases. We show that, even though these problems are non-convex, they can be optimized efficiently. Concretely, we prove that in the limit of infinite training data, due to algebraic properties of the problem, all stationary points are globally optimal. Finally, we demonstrate the empirical appeal of using these losses for regression on synthetic and real-life data.
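
To make the family of losses concrete: the negative log-likelihood of a zero-mean elliptical distribution with precision (inverse covariance) matrix Theta has, up to constants, the form L(Theta) = (1/n) * sum_i rho(x_i' Theta x_i) - (1/2) log det Theta, where rho(t) = t/2 recovers the Gaussian Markov random field loss and rho(t) = ((nu + d)/2) * log(1 + t/nu) gives the heavy-tailed multivariate Student-t loss. The sketch below is not the authors' code; the function names, the Cholesky parameterization, and the synthetic data are illustrative choices, and the structural (sparsity) constraints of a Markov random field are omitted for brevity. It simply contrasts the Gaussian and Student-t losses on contaminated data using an off-the-shelf optimizer.

import numpy as np
from scipy.optimize import minimize

def to_precision(chol_flat, d):
    # Rebuild Theta = L @ L.T from a flattened lower-triangular factor L,
    # which keeps Theta positive semidefinite during unconstrained optimization.
    L = np.zeros((d, d))
    L[np.tril_indices(d)] = chol_flat
    return L @ L.T, 2.0 * np.sum(np.log(np.abs(np.diag(L))))  # Theta, log det Theta

def gaussian_nll(chol_flat, X):
    # Average Gaussian MRF negative log-likelihood: rho(t) = t / 2.
    n, d = X.shape
    theta, logdet = to_precision(chol_flat, d)
    quad = np.einsum('ij,jk,ik->i', X, theta, X)  # x_i^T Theta x_i per sample
    return np.mean(0.5 * quad) - 0.5 * logdet

def student_t_nll(chol_flat, X, nu=3.0):
    # Heavy-tailed elliptical alternative: rho(t) = ((nu + d) / 2) * log(1 + t / nu).
    n, d = X.shape
    theta, logdet = to_precision(chol_flat, d)
    quad = np.einsum('ij,jk,ik->i', X, theta, X)
    return np.mean(0.5 * (nu + d) * np.log1p(quad / nu)) - 0.5 * logdet

# Synthetic heavy-tailed data: standard Gaussian samples (true precision = identity)
# with a small fraction of gross outliers.
rng = np.random.default_rng(0)
d, n = 5, 2000
X = rng.normal(size=(n, d))
X[:50] *= 25.0  # contamination

x0 = np.eye(d)[np.tril_indices(d)]  # start from the identity precision matrix
for loss in (gaussian_nll, student_t_nll):
    res = minimize(loss, x0, args=(X,), method='L-BFGS-B')
    theta_hat, _ = to_precision(res.x, d)
    print(loss.__name__, 'error vs. true precision:',
          round(np.linalg.norm(theta_hat - np.eye(d)), 3))

On this toy data the Gaussian fit is pulled far from the true precision by the outliers, while the Student-t fit stays much closer, mirroring the robustness motivation above. Note that both objectives are non-convex in the Cholesky factor; the paper's guarantee concerns the behavior of stationary points of such elliptical losses in the infinite-data limit.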