Lijun Wu is a Researcher at Shanghai AI Laboratory and an adjunct Ph.D. supervisor at Shanghai Jiao Tong University, Fudan University, and Zhongguancun Academy. Previously, he was a Research Scientist at ByteDance Seed and a Senior Researcher at Microsoft Research. He was a member of the joint Ph.D. program between Sun Yat-sen University and MSRA, advised by Dr. Tie-Yan Liu and Prof. Jianhuang Lai.
His research interests cover LLMs/MLLMs and AI4Science. His work has been published in Nature Communications, Nature Machine Intelligence, TPAMI, NeurIPS, ICML, ICLR, ACL, KDD and other venues, with nearly 10,000 citations. He served as Evaluations and Datasets Track Chair for NeurIPS 2026. He has also served as AC for ICML, ICLR, NeurIPS, KDD, ACL, EMNLP, NAACL and other conferences. He has received numerous prestigious awards, including the 2018 MSRA Ph.D. Fellowship. He secured 8 championships in the WMT2019 Competition. He led the team that created the BioT5 series of multimodal biomolecular models, which have more than 300k downloads and won 1st and 2nd place in the ACL 2024 Language+Molecule Shared Task. He directed the development of OpenDataArena, the first LLM post-training data value benchmarking platform in the world. His team secured 2nd place in the 2025 NeurIPS CURE-Bench Internal Reasoning Competition. His research innovations have been translated into practical products. Notably, R-Drop was deployed in Microsoft Translator across over 20 translation tasks and has been widely used in business scenarios at companies such as Tencent, Baidu, and Meituan. Furthermore, he participated in the development of the world's first Chinese-English translation system to achieve human parity in 2018.
We are hiring AI researchers working on LLM/MLLM and AI4Science. Contact me if you are interested!
News
- 2026.3: We release the second version of OpenDataArena-Scored-Data for researchers working on data-centric research.
- 2026.2: Invited to serve as Evaluations and Datasets Track Chair for NeurIPS-2026!
- 2026.2: We release MMFineReason! Our 4B VLM achieves the performance of 30B models! The superior reasoning dataset MMFineReason-1.8M is also released and has reached Top 2 on HuggingFace Datasets Trending! See the tech report.
- 2026.2: Invited to serve as Area Chair for KDD-2026.
- 2026.1: We introduce SciGenBench and ImgCoder for scientific image synthesis, aiming to accelerate the understanding and reasoning of VLMs on scientific visual tasks.
- 2026.1: 3 papers are accepted by ICLR-2026, covering long context modeling, mixture of SFT data composition, and LLM for scientific reasoning.
- 2026.1: We release the ChartVerse-1.8M dataset for strong chart reasoning, which has reached Top 1 on HuggingFace Datasets Trending! Also see the tech report.
- 2026.1: The datasets ODA-Math-460k and ODA-Mixture-100k/500k have reached Top 2 on HuggingFace Datasets Trending!
- 2026.1: We release the superior datasets ODA-Math-460k and ODA-Mixture-100k/500k, created with guidance from OpenDataArena. See the report for details.
- 2026.1: We have updated OpenDataArena-Tool with multimodal data training and evaluation support.
- 2025.12: We release OpenDataArena, the first benchmarking platform for post-training data! See our tech report.
- 2025.11: Congratulations to my students, Qizhi Pei, Yi Duan, Honglin Lin, Yu Li and Xin Gao, who achieved 2nd place in the Internal Reasoning Track of CURE-Bench@NeurIPS2025!
Projects & Models & Datasets
- BioT5/BioT5+, multimodal biomolecular foundation models, HF models > 300k downloads!
- OpenDataArena, the first arena for post-training data value benchmarking in the world. The OpenDataArena-Scored-Data has >20k downloads!
- MMFineReason-2.3M/1.8M, high-quality reasoning dataset for VLM, HF datasets Trending Top 2, >20k downloads!
- ChartVerse-SFT-1.8M/600k, synthetic chart QA dataset for chart reasoning, HF datasets Trending Top 1, 10k downloads!
- ODA-Math-460k, strong math reasoning dataset for LLM, HF datasets Trending Top 2, >10k downloads!
- ODA-Mixture-100k/500k, strong general reasoning dataset for LLM, HF datasets Trending Top 2, >15k downloads!
- InternVL, a series of leading VLM models developed by Shanghai AI Laboratory.
Repos
- Awesome-AI-Research-Writing
- Awesome-Biomolecule-Language-Cross-Modeling
- Awesome-Bio-Foundation-Models
- Awesome-Docking
Selected Publications
LLM/VLM
- arXiv 2026: MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods, Honglin Lin, Zheng Liu, Yun Zhu, Chonghan Qin, Juekai Lin, Xiaoran Shang, Conghui He, Wentao Zhang, Lijun Wu | Project Page | (MMFineReason datasets on HF Trending Top 2, >20k downloads)
- arXiv 2025: Closing the Data Loop: Using OpenDataArena to Engineer Superior Training Datasets, Xin Gao, Xiaoyang Wang, Yun Zhu, Mengzhang Cai, Conghui He, Lijun Wu | Project Page | (Technical report for ODA-Math and ODA-Mixture datasets, >25k downloads)
- arXiv 2025: OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value, Mengzhang Cai, Xin Gao, Yu Li, Honglin Lin, Zheng Liu, Zhuoshi Pan, Qizhi Pei, Xiaoran Shang, Mengyuan Sun, Zinan Tang, Xiaoyang Wang, Zhanping Zhong, Yun Zhu, Dahua Lin, Conghui He, Lijun Wu | Project Page | (The first post-training data value benchmarking platform in the world)
AI4Science
- ACL 2024: BioT5+: Towards Generalized Biological Understanding with IUPAC Integration and Multi-task Tuning, Qizhi Pei, Lijun Wu, Kaiyuan Gao, Xiaozhuan Liang, Yin Fang, Jinhua Zhu, Shufang Xie, Tao Qin, Rui Yan (Won 1st/2nd place in the ACL 2024 workshop shared tasks)
- EMNLP 2023: BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations, Qizhi Pei, Wei Zhang, Jinhua Zhu, Kehan Wu, Kaiyuan Gao, Lijun Wu, Yingce Xia, Rui Yan (>300k downloads)
- NeurIPS 2023: FABind: Fast and Accurate Protein-Ligand Binding, Qizhi Pei, Kaiyuan Gao, Lijun Wu, Jinhua Zhu, Yingce Xia, Shufang Xie, Tao Qin, Kun He, Tie-Yan Liu, Rui Yan | Project Page |
AI
- FL@FM NeurIPS 2024: Hot Pluggable Federated Learning, Lei Shen, Zhenheng Tang, Lijun Wu, Yonggang Zhang, Xiaowen Chu, Tao Qin, Bo Han (Outstanding Student Paper Award, Oral)
- NeurIPS 2020: R-Drop: Regularized Dropout for Neural Networks, Xiaobo Liang, Lijun Wu, Juntao Li, Yue Wang, Qi Meng, Tao Qin, Wei Chen, Min Zhang, Tie-Yan Liu (R-Drop has been shipped in Microsoft Translator for 20+ language translations)
- EMNLP 2019: Exploiting Monolingual Data at Scale for Neural Machine Translation, Lijun Wu, Yiren Wang, Yingce Xia, Tao Qin, Jianhuang Lai, and Tie-Yan Liu (Won the WMT-19 championship)
Selected Honors and Awards
- 2nd place in Internal Reasoning Track of CURE-Bench@NeurIPS2025, 2025
- Top Area Chair for NeurIPS-2025, 2025
- 1st place in Text2Molecule and 2nd place in Molecule2Text on Language+Molecule@ACL2024 shared task, 2024
- Runner-up of OGB-LSC @ KDD Cup, Solution, 2021
- 1st Place of WMT 2019 in 5 translation directions: En->De, De->En, De->Fr, Fr->De and Ru->En, 2019
- Microsoft Research Asia Ph.D. Fellowship, 2018
- Stars of Tomorrow Internship Award of Microsoft Research Asia, 2018
- 1st Place of Global IBM/IEEE Smarter Planet Challenge, 2013
Experience
- 2024.08-Now, Young Scientist, Shanghai Artificial Intelligence Laboratory
- 2024.05-2024.08, Research Scientist, ByteDance Seed
- 2022.07-2024.05, Senior Researcher, MSR AI4Science
- 2020.06-2022.07, Senior Researcher, MSRA
- 2014.07-2020.06, Research Intern, MSRA
Academic Services
- PC: Evaluations and Datasets Track Chair for NeurIPS-26
- AC: KDD-26, ICML-26, ICLR-26, NeurIPS-25, ACL-21/22/23/24/25, EMNLP-23/24/25, NAACL-22/23/24/25, EACL-24, COLING-23, ARR-21/22/23/24/25
- SPC: AAAI-22/23/24/25/26, IJCAI-21
- Conference reviewers: ICLR, ICML, NeurIPS, AAAI, IJCAI, ACL, CVPR, EMNLP, KDD, NAACL, COLING, EACL, AACL
- Journal reviewers: TPAMI, TASLP, KBS, Neurocomputing, CSL