Part of Advances in Neural Information Processing Systems 37 (NeurIPS 2024) Main Conference Track
Fang Kong, Zilong Wang, Shuai Li
A rich line of work studies the bandit learning problem in two-sided matching markets, where one side of the market participants (players) are uncertain about their preferences and hope to find a stable matching through iterative matchings with the other side (arms). The state-of-the-art analysis shows that the player-optimal stable regret is of order $O(K \log T/\Delta^2)$, where $K$ is the number of arms, $T$ is the horizon, and $\Delta$ is the players' minimum preference gap. However, this result may be far from the lower bound Ω(max, since the number $K$ of arms (workers, publisher slots) may be much larger than the number $N$ of players (employers in labor markets, advertisers in online advertising, respectively). In this paper, we propose a new algorithm and show that the regret can be upper bounded by $O(N^2 \log T/\Delta^2 + K \log T/\Delta)$. This result removes the dependence on $K$ in the leading-order term and improves on the state-of-the-art guarantee in the common case where $N$ is much smaller than $K$. This advantage is also verified in experiments. In addition, we provide a refined analysis of the existing centralized UCB algorithm and show that, under the $\alpha$-condition, it achieves an improved $O(N \log T/\Delta^2 + K \log T/\Delta)$ regret.
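A quick numeric sketch of the comparison between the two upper bounds; the market sizes ($N = 5$, $K = 100$, $T = 10^6$, $\Delta = 0.1$) are illustrative values chosen here, not from the paper, and the functions evaluate only the leading terms of each bound with all constants set to one:

```python
import math

def prior_bound(K, T, delta):
    # Previous player-optimal stable regret bound: O(K log T / Delta^2)
    return K * math.log(T) / delta**2

def new_bound(N, K, T, delta):
    # Bound from this paper: O(N^2 log T / Delta^2 + K log T / Delta)
    return N**2 * math.log(T) / delta**2 + K * math.log(T) / delta

# Illustrative market with far fewer players than arms (N << K),
# e.g. a few employers facing many workers.
N, K, T, delta = 5, 100, 10**6, 0.1
print(f"prior bound: {prior_bound(K, T, delta):.0f}")
print(f"new bound:   {new_bound(N, K, T, delta):.0f}")
# Ignoring constants, the new bound is smaller exactly when
# N^2 / delta^2 + K / delta < K / delta^2, i.e. N^2 < K * (1 - delta).
```

With these values the comparison favors the new bound, matching the abstract's claim that removing $K$ from the $1/\Delta^2$ term helps when $N$ is much smaller than $K$.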