Qidong Huang (黄启栋)

Ph.D., University of Science and Technology of China (USTC)

I am currently a researcher at Alibaba Group. Before that, I received my Ph.D. and bachelor's degrees from the University of Science and Technology of China, supervised by Prof. Weiming Zhang and Prof. Nenghai Yu. My current work and research focus on large vision-language models.

Email: hqd0037[AT]mail.ustc.edu.cn

This homepage may not be up to date. Please refer to my CV for the latest information.

[Google Scholar] [GitHub] [CV] [Twitter]

News

06/2025: Check out our ScaleCap! A pipeline for scalable image captioning, together with a dataset of 450k high-quality long image captions!
06/2025: Two papers are accepted by ICCV 2025! See you in Honolulu, Hawaii!
06/2025: One paper is accepted by ACL 2025! Congrats!
02/2025: PyramidDrop is accepted by CVPR 2025! Congrats!
02/2025: Check out our MMRC! A large-scale real-world conversation benchmark for MLLMs!
02/2025: We propose Light-A-Video! A training-free relighting method for video generation!
10/2024: Check out our PyramidDrop, accelerating your LVLM with over 1.7X training speed and 2.0X inference speed!
10/2024: We introduce MIR&MoCa, an LVLM pre-training indicator and a lightweight modality calibration module!
04/2024: OPERA is selected as a Highlight at CVPR 2024!
02/2024: Two papers are accepted by CVPR 2024. See you in Seattle!
02/2024: I have one paper accepted by IEEE TIP 2024.
12/2023: Please check out our new work OPERA for mitigating MLLM's hallucination!
07/2023: I have one paper accepted by ACM MM 2023.
07/2023: I have one paper accepted by ICCV 2023. See you in Paris!
07/2023: I have a new homepage.

Biography

2023.08 - 2025.04, Research Intern at Shanghai AI Lab, supervised by Jiaqi Wang and Xiaoyi Dong.

2020.09 - 2025.06, Ph.D. in the School of Cyber Science and Technology, University of Science and Technology of China.

2016.09 - 2020.06, B.Eng. in the School of Information Engineering, University of Science and Technology of China.

Preprints

ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing

Long Xing*, Qidong Huang*, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Jinsong Li, Shuangrui Ding, Weiming Zhang, Nenghai Yu, Jiaqi Wang, Feng Wu, Dahua Lin (*Equal contribution)

[arXiv] [Code] [Dataset]

Publications

Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate
Qidong Huang, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Jiaqi Wang, Dahua Lin, Weiming Zhang, Nenghai Yu
IEEE/CVF International Conference on Computer Vision (ICCV), 2025. [arXiv] [Code]

OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation
Qidong Huang, Xiaoyi Dong, Pan Zhang, Bin Wang, Conghui He, Jiaqi Wang, Dahua Lin, Weiming Zhang, Nenghai Yu
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. (Highlight, 2.8% of submissions) [arXiv] [Code]

PointCAT: Contrastive Adversarial Training for Robust Point Cloud Recognition
Qidong Huang, Xiaoyi Dong, Dongdong Chen, Hang Zhou, Weiming Zhang, Kui Zhang, Gang Hua, Nenghai Yu
IEEE Transactions on Image Processing (TIP), 2024. [arXiv] [Code]

Improving Adversarial Robustness of Masked Autoencoders via Test-time Frequency-domain Prompting
Qidong Huang, Xiaoyi Dong, Dongdong Chen, Yinpeng Chen, Lu Yuan, Gang Hua, Weiming Zhang, Nenghai Yu
IEEE/CVF International Conference on Computer Vision (ICCV), 2023. [arXiv] [Code]

Diversity-Aware Meta Visual Prompting
Qidong Huang, Xiaoyi Dong, Dongdong Chen, Weiming Zhang, Feifei Wang, Gang Hua, Nenghai Yu
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. [arXiv] [Code]

Shape-invariant 3D Adversarial Point Clouds
Qidong Huang, Xiaoyi Dong, Dongdong Chen, Hang Zhou, Weiming Zhang, Nenghai Yu
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. [arXiv] [Code]

Initiative Defense against Facial Manipulation
Qidong Huang, Jie Zhang, Wenbo Zhou, Weiming Zhang, Nenghai Yu (Equal contribution)
AAAI Conference on Artificial Intelligence (AAAI), 2021. [arXiv] [Code]

Light-A-Video: Training-free Video Relighting via Progressive Light Fusion
Yujie Zhou, Jiazi Bu, Pengyang Ling, Pan Zhang, Tong Wu, Qidong Huang, Jinsong Li, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Anyi Rao, Jiaqi Wang, Li Niu
IEEE/CVF International Conference on Computer Vision (ICCV), 2025. [arXiv] [Project]

MMRC: A Large-Scale Benchmark for Understanding Multimodal Large Language Model in Real-World Conversation
Haochen Xue, Feilong Tang, Ming Hu, Yexin Liu, Qidong Huang, Yulong Li, Chengzhi Liu, Zhongxing Xu, Chong Zhang, Chun-Mei Feng, Yutong Xie, Imran Razzak, Zongyuan Ge, Jionglong Su, Junjun He, Yu Qiao
The 63rd Annual Meeting of the Association for Computational Linguistics (ACL), 2025. [arXiv] [Code]

PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
Long Xing, Qidong Huang, Xiaoyi Dong, Jiajie Lu, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang, Feng Wu, Dahua Lin
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. [arXiv] [Code]

SimAC: A Simple Anti-Customization Method against Text-to-Image Synthesis of Diffusion Models
Feifei Wang, Zhentao Tan, Tianyi Wei, Yue Wu, Qidong Huang* (*Corresponding author)
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. [arXiv]

Ada3Diff: Defending against 3D Adversarial Point Clouds via Adaptive Diffusion
Kui Zhang, Hang Zhou, Jie Zhang, Qidong Huang, Weiming Zhang, Nenghai Yu
ACM International Conference on Multimedia (MM), 2023. [arXiv]

Poison Ink: Robust and Invisible Backdoor Attack
Jie Zhang, Dongdong Chen, Qidong Huang, Jing Liao, Weiming Zhang, Huamin Feng, Gang Hua, Nenghai Yu
IEEE Transactions on Image Processing (TIP), 2022. [arXiv] [Code]

Deep Template-based Watermarking
Han Fang, Dongdong Chen, Qidong Huang, Jie Zhang, Zehua Ma, Weiming Zhang* and Nenghai Yu
IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2020. [Paper]