InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Aug 25, 2025
Weiyun Wang
Equal contribution
,
Zhangwei Gao
Equal contribution
,
Lixin Gu
Equal contribution
,
Hengjun Pu
Equal contribution
,
Long Cui
Equal contribution
,
Xingguang Wei
Equal contribution
,
Zhaoyang Liu
Equal contribution
,
Linglin Jing
Equal contribution
,
Shenglong Ye
Equal contribution
,
Jie Shao
Equal contribution
,
Zhaokai Wang
Equal contribution
,
Zhe Chen
Equal contribution
,
Hongjie Zhang
,
Ganlin Yang
,
Haomin Wang
,
Qi Wei
,
Jinhui Yin
,
Wenhao Li
,
Erfei Cui
,
Guanzhou Chen
,
Zichen Ding
,
Changyao Tian
,
Zhenyu Wu
,
Jingjing Xie
,
Zehao Li
,
Bowen Yang
,
Yuchen Duan
,
Xuehui Wang
,
Zhi Hou
,
Haoran Hao
,
Tianyi Zhang
,
Songze Li
,
Xiangyu Zhao
,
Haodong Duan
,
Nianchen Deng
,
Bin Fu
,
Yinan He
,
Yi Wang
,
Conghui He
,
Botian Shi
,
Junjun He
,
Yingtong Xiong
,
Han Lv
,
Lijun Wu
,
Wenqi Shao
,
Kaipeng Zhang
,
Huipeng Deng
,
Biqing Qi
,
Jiaye Ge
,
Qipeng Guo
,
Wenwei Zhang
,
Songyang Zhang
,
Maosong Cao
,
Junyao Lin
,
Kexian Tang
,
Jianfei Gao
,
Haian Huang
,
Yuzhe Gu
,
Chengqi Lyu
,
Huanze Tang
,
Rui Wang
,
Haijun Lv
,
Wanli Ouyang
,
Limin Wang
,
Min Dou
,
Xizhou Zhu
,
Tong Lu
,
Dahua Lin
,
Jifeng Dai
,
Weijie Su
,
Bowen Zhou
,
Kai Chen
,
Yu Qiao
,
Wenhai Wang
,
Gen Luo
Abstract
We introduce InternVL3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency within the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy places the vision encoder and the language model on different GPUs, effectively balancing computational load. Together, these contributions enable InternVL3.5 to achieve up to a +16.0% gain in overall reasoning performance and a 4.05× inference speedup compared to its predecessor, InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. Notably, our largest model, InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks, narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.
Type
Publication
In arXiv