Lijun Wu is a Researcher at Shanghai AI Laboratory. Previously, he was a Research Scientist at ByteDance and a Senior Researcher at Microsoft Research. He received his Ph.D. from Sun Yat-sen University (SYSU), where he was a member of the joint Ph.D. program between SYSU and MSRA, advised by Dr. Tie-Yan Liu and Prof. Jianhuang Lai.
His research interests cover LLMs/MLLMs (e.g., data-centric AI, SFT/RL) and AI4Science (e.g., LLM4Science, scientific reasoning). His work has been published in top conferences and journals, including Nature Communications, Nature Machine Intelligence, TPAMI, NeurIPS, ICML, ICLR, ACL, and KDD, with more than 8,500 citations. He has served as AC/SPC for top conferences, including ICML, ICLR, NeurIPS, ACL, EMNLP, NAACL, AAAI, and IJCAI. He has received numerous prestigious awards, including the 2018 MSRA Ph.D. Fellowship. He secured 8 championships in the WMT2019 Competition. He led the team that developed the BioT5 series of multimodal biomolecular models, which have more than 240k downloads and won 1st and 2nd place in the ACL 2024 Language+Molecule Shared Task. He guided students to 2nd place in the 2025 NeurIPS CURE-Bench Internal Reasoning Competition. His research innovations have been translated into practical products. Notably, R-Drop was deployed in Microsoft Translator across over 20 translation tasks and has been widely used in business scenarios at companies such as Meituan. Furthermore, he participated in the development of the world's first Chinese-English translation system to achieve human parity in 2018.
We are hiring AI researchers working on LLM/MLLM and AI4Science. Contact me if you are interested!
News
- 2026.1: We introduce SciGenBench and ImgCoder for scientific image synthesis, aiming to accelerate the understanding and reasoning of VLMs on scientific visual tasks.
- 2026.1: We released the ChartVerse-1.8M dataset for strong chart reasoning, which reached Top 1 on HuggingFace Datasets Trending. Also see the tech report.
- 2026.1: The datasets ODA-Math-460k and ODA-Mixture-100k/500k reached Top 2 on HuggingFace Datasets Trending.
- 2026.1: We release the superior datasets ODA-Math-460k and ODA-Mixture-100k/500k, created under the guidance of OpenDataArena. See the report for details.
- 2026.1: We have updated OpenDataArena-Tool with multimodal data training and evaluation support, so you can easily benchmark your multimodal datasets with VLMs.
- 2025.12: We release OpenDataArena, the first open, fair, transparent benchmarking platform for post-training data! Also see our tech report.
- 2025.11: Congratulations to my students on achieving 2nd place in the Internal Reasoning Track of CURE-Bench@NeurIPS2025!
- 2025.11: Invited to serve as Area Chair for ICML-2026.
- 2025.11: Our Mol-StruTok, a novel tokenization framework for 3D molecule structures, is accepted by KDD-2026.
- 2025.9: Our Caco is accepted by NeurIPS-2025, which aims to scale reasoning data through code-assisted verification.
- 2025.8: 3 papers are accepted by EMNLP-2025, covering math reasoning and advanced data synthesis. Check CFT, MetaLadder, Middo.
- 2025.8: Invited to serve as Area Chair for ICLR-2026.
Open-source Projects
- OpenDataArena, a fair, open, and transparent Arena for data value benchmarking.
- InternVL, a series of leading VLMs developed by Shanghai AI Laboratory.
Surveys/Repos
- 2024.3: We have updated the comprehensive survey Leveraging Biomolecule and Natural Language through Multi-Modal Learning: A Survey. Check it!
- 2023.11: We have released a comprehensive report on Large Language Models (GPT-4) on Scientific Discovery. Check it!
- 2022.4: We have released a comprehensive survey about Non-Autoregressive Generation for Neural Machine Translation and Beyond. Check it!
- Awesome-LLM-Ready-Datasets
- Awesome-Biomolecule-Language-Cross-Modeling
- Awesome-Bio-Foundation-Models
- Awesome-Docking
Selected Publications
LLM/MLLMs
- NeurIPS 2025: Scaling Code-Assisted Chain-of-Thoughts and Instructions for Model Reasoning, Honglin Lin, Qizhi Pei, Xin Gao, Zhuoshi Pan, Yu Li, Juntao Li, Conghui He, Lijun Wu
- arXiv 2025: Closing the Data Loop: Using OpenDataArena to Engineer Superior Training Datasets, Xin Gao, Xiaoyang Wang, Yun Zhu, Mengzhang Cai, Conghui He, Lijun Wu | Project Page | (Technical report for ODA-Math and ODA-Mixture datasets, >20k downloads)
- arXiv 2025: OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value, Mengzhang Cai, Xin Gao, Yu Li, Honglin Lin, Zheng Liu, Zhuoshi Pan, Qizhi Pei, Xiaoran Shang, Mengyuan Sun, Zinan Tang, Xiaoyang Wang, Zhanping Zhong, Yun Zhu, Dahua Lin, Conghui He, Lijun Wu | Project Page | (Technical report for OpenDataArena)
AI4Science
- EMNLP 2023: BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations, Qizhi Pei, Wei Zhang, Jinhua Zhu, Kehan Wu, Kaiyuan Gao, Lijun Wu, Yingce Xia, Rui Yan (>240k downloads)
- ACL 2024: BioT5+: Towards Generalized Biological Understanding with IUPAC Integration and Multi-task Tuning, Qizhi Pei, Lijun Wu, Kaiyuan Gao, Xiaozhuan Liang, Yin Fang, Jinhua Zhu, Shufang Xie, Tao Qin, Rui Yan (Won 1st/2nd place in the ACL 2024 workshop shared tasks)
- NeurIPS 2023: FABind: Fast and Accurate Protein-Ligand Binding, Qizhi Pei, Kaiyuan Gao, Lijun Wu, Jinhua Zhu, Yingce Xia, Shufang Xie, Tao Qin, Kun He, Tie-Yan Liu, Rui Yan | Project Page
AI
- FL@FM NeurIPS 2024: Hot Pluggable Federated Learning, Lei Shen, Zhenheng Tang, Lijun Wu, Yonggang Zhang, Xiaowen Chu, Tao Qin, Bo Han (Outstanding Student Paper Award, Oral)
- NeurIPS 2021: R-Drop: Regularized Dropout for Neural Networks, Xiaobo Liang, Lijun Wu, Juntao Li, Yue Wang, Qi Meng, Tao Qin, Wei Chen, Min Zhang, Tie-Yan Liu (R-Drop has been deployed in Microsoft Translator for 20+ language translations!)
- EMNLP 2019: Exploiting Monolingual Data at Scale for Neural Machine Translation, Lijun Wu, Yiren Wang, Yingce Xia, Tao Qin, Jianhuang Lai, and Tie-Yan Liu (Won the WMT-19 championship!)
Honors and Awards
- 2nd place in Internal Reasoning Track of CURE-Bench@NeurIPS2025, 2025
- Selected as a Top Area Chair for NeurIPS-2025, 2025
- 1st place in Text2Molecule and 2nd place in Molecule2Text on the Language+Molecule@ACL2024 shared task, 2024
- Runner-up of OGB-LSC @ KDD Cup, 2021, Solution
- Outstanding Graduate Awards of SYSU, 2020
- Outstanding Reviewer of EMNLP, 2019
- 1st Place of WMT 2019 in 5 translation directions: En->De, De->En, De->Fr, Fr->De and Ru->En, 2019
- Microsoft Research Asia Ph.D. Fellowship, 2018
- Graduate Student National Scholarship, 2018
- Stars of Tomorrow Internship Award of Microsoft Research Asia, 2018
- Outstanding Undergraduate Awards of SYSU, 2015
- 1st Place of Global IBM/IEEE Smarter Planet Challenge, 2013
- Undergraduate Student National Scholarship, 2012, 2013
- First Class Scholarship of SYSU, 2012, 2013, 2014
Experience
- 2025.08-Now, Young Scientist, Shanghai Artificial Intelligence Laboratory
- 2024.05-2024.08, Research Scientist, ByteDance
- 2022.07-2024.05, Senior Researcher, MSR AI4Science
- 2020.06-2022.07, Senior Researcher, MSRA
- 2014.07-2020.06, Research Intern, MSRA
Academic Services
- AC: ICML-26, ICLR-26, NeurIPS-25, ACL-21/22/23/24/25, EMNLP-23/24/25, NAACL-22/23/24/25, EACL-24, COLING-23, ARR-21/22/23/24/25
- SPC: AAAI-22/23/24/25/26, IJCAI-21
- Conference reviewers: ICLR, ICML, NeurIPS, AAAI, IJCAI, ACL, CVPR, EMNLP, KDD, NAACL, COLING, EACL, AACL
- Journal reviewers: TPAMI, TASLP, KBS, Neurocomputing, CSL