Hongwei Xue

I was a researcher at ByteDance Research. Before that, I obtained my Ph.D. from the University of Science and Technology of China (USTC), where I was advised by Jiebo Luo and Houqiang Li. I received my B.S. degree from the School of the Gifted Young, USTC.

I spent joyful and fulfilling times at NUS, Tencent WeChat, Shanghai AI Lab, and Microsoft Research Asia (MSRA).

Email  /  Google Scholar  /  GitHub

Projects

  • Main contributor to WeCLIP, a powerful multi-modal foundation model developed for various applications across WeChat, including Channels, Official Accounts, and more. Collaborated on constructing data and on designing and optimizing the model to enhance cross-modal alignment.

  • Contributor to PixelDance (version released in September 2024), a powerful video generation model. Collaborated on advancing its instruction-following control.

Papers

    I'm interested in Multi-Modal Learning, Computer Vision, and Machine Learning. Much of my research focuses on Vision-and-Language Pre-training. Representative papers are highlighted. (* indicates equal contribution)

    Visual Perception by Large Language Model's Weights
    Feipeng Ma, Hongwei Xue (project lead), Yizhou Zhou, Guangting Wang, Fengyun Rao, Shilin Yan, Yueyi Zhang, Siying Wu, Mike Zheng Shou, Xiaoyan Sun
    NeurIPS, 2024
    [PDF] [Project Page] [Code]

    We propose a novel parameter-space alignment paradigm for MLLMs that addresses the inefficiency of the input-space alignment paradigm in visual perception. Our model, VLoRA, converts visual features into LoRA weights, achieving comparable performance on various benchmarks while significantly reducing the computational cost of training and inference.

    Multi-Modal Generative Embedding Model
    Feipeng Ma, Hongwei Xue (project lead), Guangting Wang, Yizhou Zhou, Fengyun Rao, Shilin Yan, Yueyi Zhang, Siying Wu, Mike Zheng Shou, Xiaoyan Sun
    arXiv
    [PDF] [arXiv]

    We propose a Multi-Modal Generative Embedding Model (MM-GEM) that encapsulates the generative and embedding objectives in one Large Language Model. In this work, we explore a minimalist multi-modal paradigm.

    Stare at What You See: Masked Image Modeling without Reconstruction
    Hongwei Xue, Peng Gao, Hongyang Li, Yu Qiao, Hao Sun, Houqiang Li, Jiebo Luo
    CVPR, 2023
    [PDF] [arXiv] [Code]

    We propose an efficient MIM paradigm named MaskAlign. MaskAlign simply learns the consistency between visible patch features extracted by the student model and intact image features extracted by the teacher model.

    CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment
    Hongwei Xue*, Yuchong Sun*, Bei Liu, Jianlong Fu, Ruihua Song, Houqiang Li, Jiebo Luo
    ICLR, 2023
    [PDF] [arXiv] [Code] [Papers with Code]

    We adapt image-text pre-trained models to video-text pre-training (i.e., post-pretraining). Building on CLIP, we propose an Omnisource Cross-modal Learning method equipped with a Video Proxy mechanism, namely CLIP-ViP.

    Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning
    Yuchong Sun, Hongwei Xue, Ruihua Song, Bei Liu, Huan Yang, Jianlong Fu
    NeurIPS, 2022
    [PDF] [arXiv] [Code]

    We introduce a Long-Form VIdeo-LAnguage pre-training model (LF-VILA) and train it on a large-scale long-form video and paragraph dataset constructed from HD-VILA-100M.

    Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions
    Hongwei Xue*, Tiankai Hang*, Yanhong Zeng*, Yuchong Sun*, Bei Liu, Huan Yang, Jianlong Fu, Baining Guo
    CVPR, 2022
    [PDF] [arXiv] [Code]

    We collect a large-scale dataset that is both the first high-resolution one, including 371.5k hours of 720p videos, and the most diversified one, covering 15 popular YouTube categories.

    Probing Inter-modality: Visual Parsing with Self-Attention for Vision-and-Language Pre-training
    Hongwei Xue, Yupan Huang, Bei Liu, Houwen Peng, Jianlong Fu, Houqiang Li, Jiebo Luo
    NeurIPS, 2021
    [PDF] [arXiv] [Supp] [Presentation]

    We propose a fully Transformer-based model for Vision-and-Language pre-training and study inter-modal interaction.

    Learning Fine-Grained Motion Embedding for Landscape Animation
    Hongwei Xue, Yupan Huang, Bei Liu, Huan Yang, Jianlong Fu, Houqiang Li, Jiebo Luo
    ACM MM Oral, 2021
    [PDF] [arXiv]

    We propose a model named FGLA that generates high-quality and realistic videos by learning Fine-Grained motion embeddings for Landscape Animation.

    Unifying Multimodal Transformer for Bi-directional Image and Text Generation
    Yupan Huang, Hongwei Xue, Bei Liu, Yutong Lu
    ACM MM, 2021
    [PDF] [arXiv] [Code]

    In this work, we propose a unified image-and-text generative framework based on a single multimodal model to jointly study the bi-directional tasks.

    Semantic Tag Augmented XlanV Model for Video Captioning
    Yiqing Huang*, Hongwei Xue*, Jiansheng Chen, Huimin Ma, Hongbing Ma
    ACM MM, 2021

    We propose to leverage semantic tags to bridge the gap between the vision and language modalities, rather than directly concatenating or attending to visual and linguistic features.

    Sed-Net: Detecting Multi-Type Edits of Images
    Hongwei Xue, Haomiao Liu, Jun Li, Houqiang Li, Jiebo Luo
    ICME, 2020

    We propose a deep Siamese network model to classify different types of image edits between an original image and an edited image.

    Misc
    • Reviewer for ACM MM, ICMR, ICME, TMM, CVPR, and ICCV.
    • National Scholarship at USTC.
    • Scholarship for Excellent Students and Freshmen Scholarship at USTC.

    Based on Jon Barron's website.