Exploring Vision-Language Pre-Training to Enable Multimodal Tasks

Category Science

tldr #

This paper surveys recent advances and new frontiers in vision-language pre-training (VLP), including image-text and video-text pre-training. The objective of VLP is to enable a model to learn the semantic correspondence between different modalities by pre-training on large-scale data. The paper examines five significant aspects: feature extraction, model architecture, pre-training objectives, pre-training datasets, and downstream tasks. Finally, the researchers discuss the new frontiers of VLP.


content #

In a paper published in Machine Intelligence Research, a team of researchers explored the problem of whether pre-trained models can be applied to multi-modal tasks and made significant progress. This paper surveys recent advances and new frontiers in vision-language pre-training (VLP), including image-text and video-text pre-training. To give readers a better overall grasp of VLP, researchers first review its recent advances in five aspects: feature extraction, model architecture, pre-training objectives, pre-training datasets, and downstream tasks. Then, they summarize the specific VLP models in detail. Finally, they discuss the new frontiers in VLP.

VLP covers vision-language tasks such as image-text and video-text pre-training

Making machines respond in ways similar to humans has been a long-standing goal of AI researchers. To enable machines to perceive and think, researchers have proposed a series of tasks, such as face recognition, reading comprehension, and human-machine dialog, to train and evaluate machine intelligence in particular aspects. Specifically, domain experts manually construct standard datasets and then train and evaluate relevant models on them.

BERT has been a key milestone in NLP and has numerous applications in unimodal tasks

However, due to the limitations of earlier techniques, a large amount of labeled data was often necessary to obtain a capable model. The recent emergence of pre-training models based on the transformer architecture has alleviated this problem. These models are first pre-trained via self-supervised learning, which typically exploits auxiliary tasks (pre-training objectives) to mine supervision signals from large-scale unlabeled data, thereby learning universal representations. They can then achieve surprising effectiveness on downstream tasks by fine-tuning with only a tiny amount of manually labeled data. Since the advent of BERT in natural language processing (NLP), various pre-training models have sprung up in the uni-modal field. A substantial body of work has shown that they benefit downstream uni-modal tasks and avoid the need to train a new model from scratch.
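As a concrete illustration of this two-stage recipe, the toy sketch below pre-trains a small transformer encoder with a masked-token objective on unlabeled token sequences and then fine-tunes it with a classification head on a tiny labeled set. The encoder, vocabulary size, and data are hypothetical placeholders for illustration only, not any specific model from the survey.

```python
# Minimal sketch of the pre-train-then-fine-tune pattern.
# All sizes and data below are illustrative placeholders (assumptions).
import torch
import torch.nn as nn

VOCAB, DIM, MASK_ID = 1000, 128, 0   # hypothetical vocabulary size and mask token id

class TinyEncoder(nn.Module):
    """A small transformer encoder standing in for a BERT-style backbone."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, tokens):                     # tokens: (batch, seq_len)
        return self.encoder(self.embed(tokens))    # (batch, seq_len, DIM)

# Stage 1: self-supervised pre-training with a masked-token objective.
# Supervision comes from the unlabeled text itself: mask some tokens and
# train the model to reconstruct them.
backbone = TinyEncoder()
mlm_head = nn.Linear(DIM, VOCAB)
optimizer = torch.optim.Adam(list(backbone.parameters()) + list(mlm_head.parameters()))

tokens = torch.randint(1, VOCAB, (8, 32))          # stand-in for unlabeled text
mask = torch.rand(tokens.shape) < 0.15             # mask roughly 15% of positions
corrupted = tokens.masked_fill(mask, MASK_ID)
logits = mlm_head(backbone(corrupted))
loss = nn.functional.cross_entropy(logits[mask], tokens[mask])
loss.backward(); optimizer.step()

# Stage 2: fine-tuning on a small labeled downstream task (here, 2-way
# classification) by reusing the pre-trained backbone with a new head.
cls_head = nn.Linear(DIM, 2)
ft_optimizer = torch.optim.Adam(
    list(backbone.parameters()) + list(cls_head.parameters()), lr=1e-5)

labeled_tokens = torch.randint(1, VOCAB, (4, 32))  # tiny labeled set
labels = torch.randint(0, 2, (4,))
features = backbone(labeled_tokens).mean(dim=1)    # pool token features
ft_loss = nn.functional.cross_entropy(cls_head(features), labels)
ft_loss.backward(); ft_optimizer.step()
```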

Image-text pre-training focuses on associating text with what the image looks like

Similar to the uni-modal field, high-quality labeled data are also scarce in the multi-modal field. A natural question is whether the above pre-training method can be applied to multi-modal tasks. Researchers have explored this problem and made significant progress.

In this paper, researchers focus on mainstream vision-language pre-training (VLP), including image-text and video-text pre-training. VLP mainly learns the semantic correspondence between different modalities by pre-training on large-scale data. For example, in image-text pre-training, researchers expect the model to associate "dog" in the text with what a "dog" looks like in images. In video-text pre-training, they expect the model to map objects/actions in the text to objects/actions in the video. To achieve this goal, the VLP objectives and model architecture need to be carefully designed so that the model can mine the associations between different modalities.
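One widely used way to learn such cross-modal correspondence is an image-text contrastive objective, which pulls the embeddings of matching image-caption pairs together and pushes mismatched pairs apart. The sketch below illustrates the idea with hypothetical toy encoders (ToyImageEncoder, ToyTextEncoder) and placeholder data; it is one common VLP objective in general, not the specific architecture of any model surveyed in the paper.

```python
# Minimal sketch of an image-text contrastive pre-training objective.
# Encoders, dimensions, and data are hypothetical placeholders (assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyImageEncoder(nn.Module):
    """Stand-in visual backbone: flattens the image and projects it."""
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(3 * 224 * 224, dim)

    def forward(self, images):                       # (batch, 3, 224, 224)
        return self.proj(images.flatten(1))

class ToyTextEncoder(nn.Module):
    """Stand-in text backbone: embeds tokens and mean-pools them."""
    def __init__(self, vocab=1000, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)

    def forward(self, tokens):                       # (batch, seq_len)
        return self.embed(tokens).mean(dim=1)

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Pull matching image-text pairs together, push mismatched pairs apart."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature     # (batch, batch) similarities
    targets = torch.arange(logits.size(0))           # i-th image matches i-th caption
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

images = torch.randn(8, 3, 224, 224)                 # toy batch of images
captions = torch.randint(0, 1000, (8, 16))           # toy batch of paired captions
loss = contrastive_loss(ToyImageEncoder()(images), ToyTextEncoder()(captions))
loss.backward()
```

Trained on large collections of paired images and captions, an objective of this kind is what lets the text embedding of "dog" land near the visual embeddings of dog images.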

Video-text pre-training focuses on mapping objects/actions in the text to objects/actions in the video

To give readers a better global grasp of VLP, researchers first comprehensively review its recent advances, focusing on five significant aspects: feature extraction, model architecture, pre-training objectives, pre-training datasets, and downstream tasks. Then, they summarize the specific VLP models in detail. Finally, they discuss the new frontiers of VLP.

