I am a Research Scientist at the Beijing Institute for General Artificial Intelligence (BIGAI). I received my Ph.D. from the Department of Statistics at the University of California, Los Angeles (UCLA), advised by Professor Song-Chun Zhu. During my Ph.D., I interned at Google Research, Microsoft Azure AI, and Amazon Alexa. Before UCLA, I obtained my Bachelor's and Master's degrees from the University of Science and Technology of China (USTC). My research interests lie in machine learning, computer vision, and multimodal learning.
2024.02 One paper is accepted by CVPR 2024.
2024.01 Two papers are accepted by ICLR 2024.
2023.06 One paper is accepted by ICCV 2023.
Ph.D., University of California, Los Angeles (UCLA), 2018.09 - Present
Major: Statistics, Advisor: Prof. Song-Chun Zhu
Master's, University of Science and Technology of China (USTC), 2015.08 - 2018.07
Major: Information and Communication Engineering, Advisor: Prof. Jiebo Luo
Bachelor's, University of Science and Technology of China (USTC), 2011.09 - 2015.07
Awarded the Guo Moruo Scholarship (郭沫若奖学金), given to the best graduate of the Department of Automation.
YouRefIt: Embodied Reference Understanding with Language and Gesture
Yixin Chen, Qing Li, Deqian Kong, Yik Lun Kei, Tao Gao, Yixin Zhu, Song-Chun Zhu, Siyuan Huang
ICLR 2021 Embodied Multimodal Learning Workshop (Short Version).
PDF Project Code
A HINT from Arithmetic: On Systematic Generalization of Perception, Syntax, and Semantics
Qing Li, Siyuan Huang, Yining Hong, Yixin Zhu, Ying Nian Wu, Song-Chun Zhu
ICLR 2021 The Role of Mathematical Reasoning in General Artificial Intelligence Workshop (Short Version)
PDF Project Code
Closed Loop Neural-Symbolic Learning via Integrating Neural Perception, Grammar Parsing, and Symbolic Reasoning
Qing Li, Siyuan Huang, Yining Hong, Yixin Chen, Ying Nian Wu, Song-Chun Zhu.
International Conference on Machine Learning (ICML), 2020
Best Paper Award in ICML 2020 Workshop on Bridge Between Perception and Reasoning: Graph Neural Networks & Beyond.
PDF Supplementary Project Code
VizWiz-Priv: A Dataset for Recognizing the Presence and Purpose of Private Visual Information in Images Taken by Blind People.
Danna Gurari, Qing Li, Chi Lin, Yinan Zhao, Anhong Guo, Abigale J. Stangl, Jeffrey P. Bigham.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019
PDF Project
VIREO @ TRECVID 2017: Video-to-Text, Ad-hoc Video Search and Video Hyperlinking
Phuong Anh Nguyen, Qing Li, Zhi-Qi Cheng, Yi-Jie Lu, Hao Zhang, Xiao Wu, Chong-Wah Ngo.
NIST TRECVID Workshop (TRECVID'17), Gaithersburg, USA, Nov 2017
PDF BibTex Challenge Homepage
MSR Asia MSM at THUMOS Challenge 2015
Zhaofan Qiu, Qing Li, Ting Yao, Tao Mei, Yong Rui
In CVPR THUMOS Challenge Workshop, 2015 (2nd place in Action Classification task)
PDF BibTex Challenge Homepage
National Scholarship, 2017
ICMR’16 Student Travel Grants, 2016
Best Paper Finalist in ICMR'16, 2016
Outstanding Graduate in Anhui Province, China, 2015
Guo Moruo Scholarship (郭沫若奖学金), 2014
National Scholarship, 2013
Outstanding Student Scholarship (Gold Award), 2012
Visual Privacy Recognition (VizWiz-Privacy), 2018.07 - 2018.09
Supervised by Prof. Danna Gurari at the University of Texas at Austin
- Introduced the first visual privacy dataset originating from blind people. For each image, we manually annotated private regions according to a taxonomy of privacy concerns relevant to their images. We also annotated whether the private visual information is needed to answer questions asked about the private images.
- Proposed two tasks: identifying (1) whether private information appears in an image and (2) whether a question about an image asks about the image's private content.
Visual Question Answering with Explanation, 2018.01 - 2018.06
Supervised by Prof. Jianfei Cai at NTU, Singapore, and Prof. Jiebo Luo
- Constructed a new dataset of VQA with Explanation (VQA-E), which consists of 181,298 visual questions, answers, and explanations.
- Proposed a novel multi-task learning architecture to jointly predict an answer and generate an explanation for the answer.
Visual Question Answering for Blind People, 2017.10 - 2018.01
Supervised by Prof. Danna Gurari at UT Austin and Prof. Jiebo Luo
- Proposed VizWiz, the first goal-oriented VQA dataset arising from a natural setting. VizWiz consists of 31,000 visual questions originating from blind people.
- Analyzed the image-question relevance of VizWiz and benchmarked state-of-the-art VQA algorithms, revealing that VizWiz is a challenging dataset that can spur research on assistive technologies to eliminate accessibility barriers for blind people.
Video Captioning and Ad-hoc Video Search, 2017.02 - 2017.10
Supervised by Prof. Chong-Wah Ngo at City University of Hong Kong
- Proposed a novel framework that matches videos and text and generates descriptions for videos using spatio-temporal attention, and applied it to the Video-to-Text task of the TRECVID 2017 competition.
- Revised the framework to search for relevant videos given a text query and won 3rd place in the Ad-hoc Video Search task. Our notebook paper was accepted by the NIST TRECVID Workshop 2017.
- Devised a hierarchical co-attention network to improve the AVS system’s adaptability to queries of variable length.
Explainable Visual Question Answering, 2016.08 - 2017.02
Supervised by Dr. Tao Mei at Microsoft Research Asia and Prof. Jiebo Luo
- Proposed a novel framework for explainable VQA. Our framework generates attributes and captions for images to explain why the system predicts a specific answer to the question.
- Defined four measurements of explanation quality and demonstrated a strong relationship between explanation quality and VQA accuracy. Our system achieves performance comparable to the state of the art and improves as explanation quality increases.
Action and Activity Recognition in Video, 2014.12 - 2015.07
Supervised by Dr. Ting Yao and Dr. Tao Mei at Microsoft Research Asia and Prof. Jiebo Luo
- Proposed a hybrid framework that learns a deep multi-granular spatio-temporal representation for video action recognition using 2D/3D CNNs and LSTMs. Our paper was accepted and selected as a Best Paper Finalist at ICMR 2016 (acceptance rate: 17%, Best Paper Finalist rate: 1%). An improved version of the conference paper was accepted by IJMIR 2017.
- Won 2nd place in the Action Classification task of the THUMOS Challenge and presented our work at the CVPR THUMOS Workshop in Boston, June 2015. The challenge covers over 430 hours of video data and 45 million frames across 101 action classes.
Highlight Detection for First-Person Video Summarization, 2014.07 - 2014.12
Supervised by Dr. Ting Yao and Dr. Tao Mei at Microsoft Research Asia
- Collected a new large-scale dataset from YouTube for first-person video highlight detection. The dataset consists of 100 hours of video, mainly captured by GoPro cameras, across 15 sports-related categories.
- Proposed a pairwise deep ranking model to detect highlight segments in videos. My contribution was devising a two-stream CNN (frame and optical flow) to extract features for video segments.