Yi-Fan Zhang | PhD Student

About Me

I am a final-year PhD student at the State Key Laboratory of Pattern Recognition, University of Chinese Academy of Sciences. My research focuses on the training and evaluation of multimodal large language models (MLLMs). I have recently become particularly interested in building agentic MLLMs with infinite context length and unlimited exploration space, as well as developing advanced memory management mechanisms to enhance the perception capabilities of MLLMs.

I have published 15+ papers as first author / co-first author / corresponding author at top-tier venues, with 4,500+ citations in total and 2,100+ citations for my most cited first-author work.

Previously, I was fortunate to work with Prof. Jingdong Wang at Microsoft Research Asia and Prof. Rong Jin at Alibaba DAMO Academy. I have also interned at ByteDance, Kuaishou, Skywork, and Squirrel AI.

I am actively seeking research positions in both industry and academia. If you are interested in collaboration, internship opportunities, or research discussions, please feel free to reach out.

Research Interests

Multimodal Model Training

Developing vision-language models (SliME (150+ ⭐), Keye-VL (700+ ⭐)), omni-modal MLLMs (VITA (3k+ ⭐)), and agentic systems (Thyme (500+ ⭐), Skywork R1V4 (3k+ ⭐)). I believe that the agentic capabilities of MLLMs are directly tied to their perception abilities.

Model Evaluation

Building comprehensive evaluation frameworks including MME-RealWorld (30k+ downloads), MME-Unify (5k+ downloads), MME-VideoOCR, MME-Survey, and VLMEvalKit (3k+ ⭐). I am always pursuing benchmarks that truly align with human preferences and reflect real-world needs.

Post-Training & Reward Modeling

Developing alignment techniques for MLLMs through MM-RLHF (200 ⭐), R1-Reward (250+ ⭐), and BaseReward, and contributing to the MLLM Alignment Survey. I have recently become more interested in rubric-based rewards and self-evolving reward systems.

Applications & ML Systems

Applying MLLMs to practical domains including time-series forecasting, AI for education, and content moderation. I am also interested in continual learning, out-of-distribution generalization, and other ML system challenges.

News

Dec 2025 🎉 Two papers accepted by IEEE T-PAMI (IF: 18.6)!

Oct 2025 🎉 VITA 1.5 (Spotlight) and MME-VideoOCR accepted by NeurIPS 2025!

Sep 2025 🚀 Released Thyme - thinking beyond images with executable code generation.

Jul 2025 🚀 Released Kwai Keye-VL, a cutting-edge MLLM by Kuaishou.

May 2025 🎉 MM-RLHF and DAMO accepted by ICML 2025!

May 2025 🚀 Released R1-Reward for multimodal reward modeling.

Apr 2025 🚀 Released MME-Unify benchmark for unified multimodal models.

Feb 2025 🚀 Released MM-RLHF dataset with 120K human preference annotations.

Jan 2025 🎉 MME-RealWorld accepted by ICLR 2025!

Jun 2024 🚀 Released SliME - Beyond LLaVA-HD for High-Resolution MLLMs.

Mar 2024 🎉 Two papers on ICL and symbolic reasoning accepted by NAACL 2024!

Oct 2023 🎉 OneNet accepted by NeurIPS 2023.

May 2023 🎉 AdaNPC accepted by ICML 2023, DRM accepted by KDD 2023.

Jan 2023 🎉 Environment Label Smoothing accepted by ICLR 2023.

Apr 2022 🎉 DDG selected for CVPR 2022 Oral presentation.

Selected Publications

Full list available on Google Scholar. (* denotes equal contribution, † denotes corresponding author)

First Author Papers

MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? First Author

Yi-Fan Zhang, et al.

ICLR 2025

Paper Code Project

Skywork-R1V4: Toward Agentic Multimodal Intelligence through Interleaved Thinking with Images and DeepResearch First Author 3k+

Yi-Fan Zhang, et al.

Technical Report

Paper Project

Thyme: Think Beyond Images First Author 550+

Yi-Fan Zhang, et al.

Technical Report

Paper Project

R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning First Author 250+

Yi-Fan Zhang, et al.

Under review at NeurIPS 2025

Paper Code

Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models First Author

Yi-Fan Zhang, et al.

IEEE T-PAMI 2025

Paper Code

Debiasing Large Visual Language Models First Author

Yi-Fan Zhang, et al.

ACM MM 2025

Paper

BaseReward: A Strong Baseline for Multimodal Reward Model First Author

Yi-Fan Zhang, et al.

Preprint

Paper

MM-RLHF: The Next Step Forward in Multimodal LLM Alignment First Author

Yi-Fan Zhang, et al.

ICML 2025

Paper Code Project

Core Contributor & Corresponding Author Papers

Kwai Keye-VL 1.5 Technical Report Main Contributor 700+

Keye Team, Yi-Fan Zhang (Main Contributor), et al.

Technical Report

Paper Code

MME-VideoOCR: Evaluating OCR-Based Capabilities of Multimodal LLMs in Video Scenarios Corresponding

Yang Shi, Huanqian Wang, Wulin Xie, Huanyao Zhang, Lijie Zhao, Yi-Fan Zhang†, et al.

NeurIPS 2025

Paper Code

Aligning Multimodal LLM with Human Preference: A Survey Corresponding

Tao Yu, Yi-Fan Zhang†, et al.

Under review at EMNLP 2025

Paper

Experience

Kuaishou Technology

Research Intern · Multimodal Large Language Models

Skywork AI

Research Intern · Agentic Multimodal Systems

ByteDance

Research Intern

Squirrel AI

Research Intern · LLMs for Education

Alibaba DAMO Academy

Research Intern · Advised by Prof. Rong Jin

Microsoft Research Asia

Research Intern · Advised by Prof. Jingdong Wang

Selected Awards

2025 Best Paper Nomination Award, ADS Track at KDD 2025

2025 AAAI Innovative Applications Award

2023 Top Cited Paper, Neurocomputing

2023 National Scholarship & Outstanding Student, University of Chinese Academy of Sciences

2020 Top Ten Model Students, South China University of Technology (Summa Cum Laude)

2020 Jingtang He Technology Innovation Scholarship (Top 1‰, 5 out of 10,000+)

2019 CUMCM National First Prize (Top 1% globally)

Professional Service

Conference Reviewer

ML/AI: ICML (2022-2026), NeurIPS (2022-2026), ICLR (2023-2026), AISTATS (2025), AAAI (2023-2024)
Vision: CVPR (2022-2024), ICCV (2023, 2025), ECCV (2022, 2024)
NLP: ACL (2025), EMNLP (2023-2024), NAACL (2024)

Journal Reviewer

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
IEEE Transactions on Image Processing (TIP)
International Journal of Computer Vision (IJCV)
Transactions on Machine Learning Research (TMLR)
IEEE Transactions on Information Forensics & Security (T-IFS)

Workshop Organizer

PC Member for MILETS@PAKDD'23, DMLR@ICML'23