CogVLM: Visual Expert for Pretrained Language Models

Wang, Weihan; Lv, Qingsong; Yu, Wenmeng; Hong, Wenyi; Qi, Ji; Wang, Yan; Ji, Junhui; Yang, Zhuoyi; Zhao, Lei; Song, Xixuan; Xu, Jiazheng; Chen, Keqin; Xu, Bin; Li, Juanzi; Dong, Yuxiao; Ding, Ming; Tang, Jie

doi:10.52202/079017-3860

CogVLM: Visual Expert for Pretrained Language Models

Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Keqin Chen, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, Jie Tang

Advances in Neural Information Processing Systems 37 (NeurIPS 2024) Main Conference Track

Bibtex Paper

Abstract

We introduce CogVLM, a powerful open-source visual language foundation model. Different from the popular \emph{shallow alignment} method which maps image features into the input space of language model, CogVLM bridges the gap between the frozen pretrained language model and image encoder by a trainable visual expert module in the attention and FFN layers. As a result, CogVLM enables a deep fusion of vision language features without sacrificing any performance on NLP tasks. CogVLM-17B achieves state-of-the-art performance on 17 classic cross-modal benchmarks, including 1) image captioning datasets: NoCaps, Flicker30k, 2) VQA datasets: OKVQA, TextVQA, OCRVQA, ScienceQA, 3) LVLM benchmarks: MM-Vet, MMBench, SEED-Bench, LLaVABench, POPE, MMMU, MathVista, 4) visual grounding datasets: RefCOCO, RefCOCO+, RefCOCOg, Visual7W. Codes and checkpoints are available at Github.

DOI

10.52202/079017-3860

Abstract

DOI

Name Change Policy