COHESIV: Contrastive Object and Hand Embedding Segmentation In Video

Dandan Shan, Richard Higgins, David Fouhey

Advances in Neural Information Processing Systems 34 (NeurIPS 2021)

In this paper we learn to segment hands and hand-held objects from motion. Our system takes a single RGB image and a hand location as input and segments the hand and the hand-held object. For learning, we generate responsibility maps that show how well a hand's motion explains the motion of other pixels in video. We use these responsibility maps as pseudo-labels to train a weakly supervised neural network with an attention-based similarity loss and a contrastive loss. Our system outperforms alternative methods, achieving good performance on the 100DOH, EPIC-KITCHENS, and HO3D datasets.
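
To make the idea of a responsibility map concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes that a dense optical flow field and a rough hand mask are already available, that the hand's motion can be summarized by its mean flow vector (a pure translation), and that agreement is scored with a Gaussian kernel. The function name `responsibility_map` and the parameter `sigma` are hypothetical and chosen only for this example.

```python
import numpy as np


def responsibility_map(flow, hand_mask, sigma=1.0):
    """Score how well the hand's motion explains each pixel's motion.

    flow:      (H, W, 2) array of per-pixel optical flow vectors.
    hand_mask: (H, W) boolean array marking hand pixels.
    Returns an (H, W) array with values in (0, 1]; 1 means the pixel
    moves exactly like the hand.
    """
    flow = np.asarray(flow, dtype=np.float64)
    hand_mask = np.asarray(hand_mask, dtype=bool)
    if not hand_mask.any():
        raise ValueError("hand_mask must contain at least one pixel")

    # Assumption: summarize the hand's motion as its mean flow vector.
    hand_motion = flow[hand_mask].mean(axis=0)

    # Residual between each pixel's flow and the hand's motion.
    residual = np.linalg.norm(flow - hand_motion, axis=-1)

    # Assumption: a Gaussian kernel turns the residual into a soft score.
    return np.exp(-(residual ** 2) / (2.0 * sigma ** 2))


# Toy scene: the hand (top-left) and a held object (top-right) both move
# 2 pixels to the right, while the background (bottom rows) is static.
flow = np.zeros((4, 4, 2))
flow[0:2, :, 0] = 2.0
hand_mask = np.zeros((4, 4), dtype=bool)
hand_mask[0:2, 0:2] = True

resp = responsibility_map(flow, hand_mask)
```

In this toy scene the held-object pixels receive a responsibility of 1.0, the same as the hand, while the static background receives exp(-2), roughly 0.135. Thresholding or directly using such a map as a soft pseudo-label is one plausible way to supervise a segmentation network without manual object masks.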