Research
I'm interested in computer vision, machine learning, and multimedia. Much of my research focuses on vision-and-language pre-training. Representative papers are highlighted. (* indicates equal contribution)
|
|
Stare at What You See: Masked Image Modeling without Reconstruction
Hongwei Xue,
Peng Gao,
Hongyang Li,
Yu Qiao,
Hao Sun,
Houqiang Li,
Jiebo Luo
CVPR, 2023
[PDF]
[arXiv]
[Code]
We propose an efficient MIM paradigm named MaskAlign. MaskAlign simply learns to align the visible-patch features extracted by the student model with the intact-image features extracted by the teacher model.
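As a rough illustration of the idea (not the released code), the alignment objective might look like the sketch below; the student interface and masking ratio are placeholders.
```python
# Minimal sketch of feature alignment without pixel reconstruction.
# `student(images, visible_ids=...)` is a hypothetical interface, not the MaskAlign release.
import torch
import torch.nn.functional as F

def maskalign_loss(student, teacher, images, mask_ratio=0.75):
    B, N = images.shape[0], 196                      # e.g. 14x14 patches for a 224x224 image
    n_vis = int(N * (1 - mask_ratio))
    # Randomly keep a subset of patch indices for the student.
    ids = torch.rand(B, N, device=images.device).argsort(dim=1)[:, :n_vis]

    with torch.no_grad():                            # frozen teacher sees the intact image
        t_feats = teacher(images)                    # (B, N, D) patch features
    s_feats = student(images, visible_ids=ids)       # (B, n_vis, D), visible patches only

    # Gather teacher features at the same visible positions and align them.
    t_vis = torch.gather(t_feats, 1, ids.unsqueeze(-1).expand(-1, -1, t_feats.size(-1)))
    return 1 - F.cosine_similarity(s_feats, t_vis, dim=-1).mean()
```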
|
|
CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment
Hongwei Xue*,
Yuchong Sun*,
Bei Liu,
Jianlong Fu,
Ruihua Song,
Houqiang Li,
Jiebo Luo
ICLR, 2023
[PDF]
[arXiv]
[Code]
[PaperWithCode]
We adapt image-text pre-trained models to video-text pre-training (i.e., post-pretraining). We propose CLIP-ViP, an Omnisource Cross-modal Learning method equipped with a Video Proxy mechanism, built on top of CLIP.
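A generic sketch of a "video proxy"-style mechanism is shown below: learnable proxy tokens are prepended to per-frame patch tokens so an image backbone can aggregate information across frames. Module names are illustrative, not the CLIP-ViP implementation.
```python
import torch
import torch.nn as nn

class VideoProxyWrapper(nn.Module):
    def __init__(self, image_encoder, dim=768, num_proxies=4):
        super().__init__()
        self.image_encoder = image_encoder              # a ViT that returns patch tokens
        self.proxies = nn.Parameter(torch.zeros(1, num_proxies, dim))

    def forward(self, video):                           # video: (B, T, C, H, W)
        B, T = video.shape[:2]
        frames = video.flatten(0, 1)                    # (B*T, C, H, W)
        tokens = self.image_encoder(frames)             # (B*T, N, D) patch tokens
        tokens = tokens.reshape(B, T * tokens.shape[1], -1)
        # Learnable proxy tokens join the full spatio-temporal token sequence.
        return torch.cat([self.proxies.expand(B, -1, -1), tokens], dim=1)
```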
|
|
Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning
Yuchong Sun,
Hongwei Xue,
Ruihua Song,
Bei Liu,
Huan Yang,
Jianlong Fu
NeurIPS, 2022
[PDF]
[arXiv]
[Code]
We introduce a Long-Form VIdeo-LAnguage pre-training model (LF-VILA) and train it on a large-scale dataset of long-form videos and paragraphs constructed from HD-VILA-100M.
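As a minimal sketch of a temporal contrastive objective (shapes and temperature are illustrative, not the LF-VILA implementation), clip embeddings from a long video can be matched against sentence embeddings from the paired paragraph:
```python
import torch
import torch.nn.functional as F

def temporal_contrastive_loss(clip_emb, sent_emb, temperature=0.07):
    # clip_emb, sent_emb: (B, S, D) with S aligned clip/sentence segments per video.
    B, S, D = clip_emb.shape
    v = F.normalize(clip_emb, dim=-1).reshape(B * S, D)
    t = F.normalize(sent_emb, dim=-1).reshape(B * S, D)
    logits = v @ t.t() / temperature                     # (B*S, B*S) similarity matrix
    targets = torch.arange(B * S, device=logits.device)  # matched pairs on the diagonal
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```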
|
|
Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions
Hongwei Xue*,
Tiankai Hang*,
Yanhong Zeng*,
Yuchong Sun*,
Bei Liu,
Huan Yang,
Jianlong Fu,
Baining Guo
CVPR, 2022
[PDF]
[arXiv]
[Code]
We collect HD-VILA-100M, the first high-resolution video-language dataset, comprising 371.5k hours of 720p videos, and the most diversified one to date, covering 15 popular YouTube categories.
|
|
Probing Inter-modality: Visual Parsing with Self-Attention for Vision-and-Language Pre-training
Hongwei Xue,
Yupan Huang,
Bei Liu,
Houwen Peng,
Jianlong Fu,
Houqiang Li,
Jiebo Luo
NeurIPS, 2021
[PDF]
[arXiv]
[Supp]
[Presentation]
We propose a fully Transformer-based model for vision-and-language pre-training and study inter-modal interaction through self-attention.
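One simple way to probe inter-modal interaction, sketched below, is to measure how much attention mass flows across modalities in a concatenated [text; visual] token sequence. This is a simplification for illustration, not the probing protocol from the paper.
```python
import torch

def inter_modal_attention_ratio(attn, n_text):
    # attn: (B, heads, L, L) softmax attention; the first n_text tokens are text.
    text_to_visual = attn[..., :n_text, n_text:].sum(-1).mean()   # fraction per text query
    visual_to_text = attn[..., n_text:, :n_text].sum(-1).mean()   # fraction per visual query
    return (text_to_visual + visual_to_text) / 2                  # average cross-modal attention
```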
|
|
Learning Fine-Grained Motion Embedding for Landscape Animation
Hongwei Xue,
Yupan Huang,
Bei Liu,
Huan Yang,
Jianlong Fu,
Houqiang Li,
Jiebo Luo
ACM MM Oral, 2021
[PDF]
[arXiv]
We propose FGLA, a model that generates high-quality and realistic videos by learning fine-grained motion embeddings for landscape animation.
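A very generic sketch of conditioning frame synthesis on a learned motion embedding is given below; all module names and shapes are hypothetical and only illustrate the general idea, not FGLA itself.
```python
import torch
import torch.nn as nn

class MotionConditionedGenerator(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.motion_encoder = nn.Sequential(            # encodes a frame pair into a motion vector
            nn.Conv2d(6, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim))
        self.generator = nn.Sequential(                 # toy decoder for 3x64x64 frames
            nn.Linear(dim + 3 * 64 * 64, 3 * 64 * 64), nn.Tanh())

    def forward(self, frame_t, frame_t1):               # frames: (B, 3, 64, 64)
        motion = self.motion_encoder(torch.cat([frame_t, frame_t1], dim=1))   # (B, dim)
        cond = torch.cat([motion, frame_t.flatten(1)], dim=1)
        return self.generator(cond).view_as(frame_t)    # predicted next frame
```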
|
|
Unifying Multimodal Transformer for Bi-directional Image and Text Generation
Yupan Huang,
Hongwei Xue,
Bei Liu,
Yutong Lu
ACM MM, 2021
[PDF]
[arXiv]
[Code]
In this work, we propose a unified image-and-text generative framework based on a single multimodal model that jointly handles image-to-text and text-to-image generation.
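A rough sketch of the bi-directional idea is below: one transformer over a shared vocabulary of text tokens and discrete image tokens, with the task direction set by which modality comes first in the sequence. Names and sizes are placeholders, not the released model.
```python
import torch
import torch.nn as nn

class BiDirectionalGenerator(nn.Module):
    def __init__(self, vocab_size, dim=512, layers=6, heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)       # shared text + image-token vocabulary
        block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens):                           # tokens: (B, L) mixed text/image ids
        L = tokens.size(1)
        causal = torch.triu(torch.ones(L, L, dtype=torch.bool, device=tokens.device), 1)
        h = self.transformer(self.embed(tokens), mask=causal)
        return self.head(h)                              # next-token logits for either modality
```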
|
|
Semantic Tag Augmented XlanV Model for Video Captioning
Yiqing Huang*,
Hongwei Xue*,
Jiansheng Chen,
Huimin Ma,
Hongbing Ma
ACM MM, 2021
We propose to leverage semantic tags to bridge the gap between the vision and language modalities, rather than directly concatenating or attending over visual and linguistic features.
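A generic sketch of the tag-as-bridge idea is shown below: detected semantic tags are embedded and fused with visual features before captioning. Module names are illustrative, not the paper's code.
```python
import torch
import torch.nn as nn

class TagBridge(nn.Module):
    def __init__(self, num_tags, dim=512):
        super().__init__()
        self.tag_embed = nn.Embedding(num_tags, dim)
        self.fuse = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, visual_feats, tag_ids):
        # visual_feats: (B, N, D) frame/region features; tag_ids: (B, K) detected tags.
        tags = self.tag_embed(tag_ids)                        # (B, K, D)
        fused, _ = self.fuse(query=visual_feats, key=tags, value=tags)
        return fused + visual_feats      # tag-aware visual features for the captioner
```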
|
|
Sed-Net: Detecting Multi-Type Edits Of Images
Hongwei Xue,
Haomiao Liu,
Jun Li,
Houqiang Li,
Jiebo Luo
ICME, 2020
We propose a deep Siamese network model to classify different types of image edits between an original image and an edited image.
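A minimal sketch of such a Siamese classifier is below: one shared backbone encodes both images, and a head classifies the edit type from the pair of features. The backbone and feature combination here are illustrative choices, not the Sed-Net architecture.
```python
import torch
import torch.nn as nn
import torchvision.models as models

class SiameseEditClassifier(nn.Module):
    def __init__(self, num_edit_types):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()                     # shared feature extractor
        self.backbone = backbone
        self.head = nn.Linear(512 * 2, num_edit_types)  # classify from concatenated features

    def forward(self, original, edited):
        f_o = self.backbone(original)                   # (B, 512)
        f_e = self.backbone(edited)
        return self.head(torch.cat([f_o, f_e], dim=1))  # logits over edit types
```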
|
- Reviewer for ACM MM 2021, ICMR 2021, ICME 2022, TMM 2022, CVPR 2023, ICCV 2023.
- National Scholarship in 2021.
- Scholarship for Excellent Students in 2016, 2017, and 2018; Freshman Scholarship in 2015.
|
|