COPA : Efficient Vision-Language Pre-training Through Collaborative Object- and Patch-Text Alignment

Chaoya Jiang (jiangchaoya@pku.edu.cn), National Engineering Research Center for Software Engineering, Peking University, Beijing, China; Haiyang Xu (shuofeng.xhy@alibaba-inc.com), DAMO Academy, Alibaba Group, Hangzhou, China; Wei Ye (wye@pku.edu.cn), National Engineering Research Center for Software Engineering, Peking University, Beijing, China; Qinghao Ye (yeqinghao.yqh@alibaba-inc.com), DAMO Academy, Alibaba Group, Hangzhou, China; Chenliang Li (lcl193798@alibaba-inc.com), DAMO Academy, Alibaba Group, Hangzhou, China; Ming Yan (ym119608@alibaba-inc.com), DAMO Academy, Alibaba Group, Hangzhou, China; Bin Bi (b.bi@alibaba-inc.com), DAMO Academy, Alibaba Group, Hangzhou, China; Shikun Zhang (zhangsk@pku.edu.cn), National Engineering Research Center for Software Engineering, Peking University, Beijing, China; Ji Zhang (zj122146@alibaba-inc.com), DAMO Academy, Alibaba Group, Hangzhou, China; and Fei Huang (f.huang@alibaba-inc.com), DAMO Academy, Alibaba Group, Hangzhou, China

(2023)

Abstract.

Vision-Language Pre-training (VLP) methods based on object detection enjoy the rich knowledge of fine-grained object-text alignment but at the cost of computationally expensive inference. Recent Vision Transformer (ViT)-based approaches circumvent this issue but struggle with long visual sequences that lack detailed cross-modal alignment information. This paper introduces a ViT-based VLP technique that efficiently incorporates object information through a novel patch-text alignment mechanism. Specifically, we convert object-level signals into patch-level ones and devise a Patch-Text Alignment pre-training task (PTA) to learn a text-aware patch detector. By using off-the-shelf fine-grained object annotations for 5% of the training images, we jointly train PTA with other conventional VLP objectives in an end-to-end manner, bypassing the high computational cost of object detection and yielding an effective patch detector that accurately detects text-relevant patches, thus considerably shortening patch sequences and accelerating computation within the ViT backbone. Our experiments on a variety of widely used benchmarks reveal that our method achieves a speedup of nearly 88% compared to prior VLP models while maintaining competitive or superior performance on downstream tasks with similar model size and data scale.

Vision-Language Pretraining; Efficiency; Patch-Text Alignment; Detection

Published in: Proceedings of the 31st ACM International Conference on Multimedia (MM ’23), October 29-November 3, 2023, Ottawa, ON, Canada. DOI: 10.1145/3581783.3611826. ISBN: 979-8-4007-0108-5/23/10.

CCS Concepts: Information systems → Multimedia and multimodal retrieval; Computing methodologies → Image representations; Computing methodologies → Object detection.

1. Introduction

Figure 1. Subfigure (a) illustrates the impact of the Text-aware Patch Detector (TPD) in the VQA scenario under various keeping ratios, where the keeping ratio is a hyperparameter determining the proportion of retained visual tokens among all tokens. Subfigure (b) demonstrates how Patch-Text Alignment converts object-level annotations to patch-level annotations and optimizes the TPD based on the obtained supervision signals. Subfigure (c) presents the VQA accuracy and throughput results for our VLP model and the baseline.

Recently, Vision-Language Pre-training (VLP) (Tan and Bansal, 2019; Chen et al., 2020; Lu et al., 2019; Huang et al., 2020; Su et al., 2020; Li et al., 2020; Chen et al., 2020; Zhou et al., 2020; Li et al., 2021; Yu et al., 2021; Li et al., 2022b) has achieved remarkable success across a wide range of Vision-Language (VL) tasks, establishing itself as a dominant paradigm. Existing VLP methods can be broadly classified into two categories based on their approach to image feature extraction.

Detection-based Models (Tan and Bansal, 2019; Chen et al., 2020; Lu et al., 2019; Li et al., 2020) identify objects/regions within images by employing pre-trained object detectors (Ren et al., 2015; Redmon et al., 2016; He et al., 2017) and learn to align detected objects/regions with text. Although these methods benefit from the valuable fine-grained cross-modal alignment knowledge, they are burdened by the considerable computational cost of object/region detection and the error propagation issues arising from the two-step pre-training strategy.

ViT-based Models refer to more recent efforts (Li et al., 2021; Radford et al., 2021; Kim et al., 2021; Wang et al., 2021a; Singh et al., 2021; Li et al., 2022b; Jiang et al., 2022, 2023a, b, c, d) built on the Vision Transformer (ViT) (Dosovitskiy et al., 2021). They utilize ViT as the visual encoder or cross-modal fusion encoder due to its capacity to handle lengthy visual sequences derived from image grids/patches. Although these approaches sidestep the high computational cost and error propagation associated with object detection, they still face two challenges. First, they can only learn coarse-grained alignments between the entire image and text, owing to the absence of fine-grained alignment annotations (e.g., between patches and text), even though acquiring fine-grained alignment in VLP is essential for numerous cross-modal understanding and reasoning tasks (e.g., Visual Question Answering and Visual Grounding). Second, many redundant patches in the image do not align with the input text, resulting in extended visual sequences, particularly for high-resolution images. For instance, as illustrated in Figure 1(a), during VQA inference with an image size of 386 $\times$ 386 and the question "What is the weather like?", approximately 70% of the patches in the image depict a bird and sand, which do not correspond to the question. Eliminating these text-irrelevant patches will not impact the model output but can significantly reduce inference time.

To harness the strengths of both model types while mitigating their shortcomings, we propose a novel VLP method named COPA, which incorporates fine-grained object-text alignment knowledge into a ViT-based model in a lightweight and efficient manner. COPA stands for Collaboration between Object- and Patch-Text Alignment. Specifically, by transforming object/region annotations into patch-level annotations, we design an auxiliary pre-training task, Patch-Text Alignment (PTA), to capture fine-grained cross-modal alignment knowledge and train an accurate Text-aware Patch Detector (TPD). Our method offers two key benefits. First, PTA is jointly trained with other traditional VLP pre-training tasks using ready-made object annotations from just 5% of training images, resulting in an end-to-end architecture that needs no additional computation such as the object detection step of prior detection-based models. Second, guided by PTA, the TPD, which is integrated into the ViT-based visual backbone, can accurately identify text-related patches, thereby reducing the length of the visual sequence and further decreasing the overall computational cost. Notably, we employ a relatively small number of object-level labels from the COCO (Lin et al., 2014) (0.11M) and VG (Krishna et al., 2016) (0.10M) datasets, yet succeed in training a robust and generalizable patch detector that can effectively identify text-relevant patches in the large-scale and possibly out-of-domain pre-training data of the CC (Sharma et al., 2018) (3M) dataset (see Figure 5).

We evaluate COPA on various representative VL understanding and generation tasks, including visual question answering, cross-modal retrieval, and image captioning. Our findings show that by retaining only the 50% of image patches that are most text-relevant in the visual backbone, we can achieve competitive or even superior downstream task performance. Simultaneously, the efficiency (more specifically, the throughput) of COPA increases by 88% compared to previous similar VLP models. For example, as illustrated in Figure 1(c) and Table 4, COPA boosts the throughput of the baseline from 186.42 to 349.71 and even improves accuracy by about 0.3 on VQA test-dev under the same experimental settings. Moreover, by increasing the input image resolution, COPA attains state-of-the-art downstream task performance (e.g., 78.25 on VQA test-dev, see Table 5), as it benefits from incorporating more image tokens without raising computational costs compared to other baselines.
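As a quick check of how the 88% figure relates to the two throughput values quoted above, the relative gain can be recomputed directly. This is only a worked arithmetic sketch; the variable names are illustrative.

```python
# Throughput values quoted in the text (baseline vs. COPA).
baseline_throughput = 186.42
copa_throughput = 349.71

# Relative increase over the baseline and the equivalent speedup factor.
relative_gain = (copa_throughput - baseline_throughput) / baseline_throughput
speedup_factor = copa_throughput / baseline_throughput

print(f"relative gain: {relative_gain:.1%}")
print(f"speedup factor: {speedup_factor:.2f}x")
```

The relative gain comes out at roughly 87.6%, which rounds to the 88% quoted in the text.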

2. Related Work

2.1. Vision-Language Pre-training

Existing work on vision-language pre-training typically falls into two categories: detector-based VLP models and CNN/ViT-based VLP models. Previous detector-based VLP methods (Lu et al., 2019; Li et al., 2019; Tan and Bansal, 2019; Li et al., 2020; Chen et al., 2020; Yu et al., 2021) mainly adopt a two-step training pipeline, which first extracts visual features with a pre-trained object detector and then trains the cross-modal pre-training model to align text and visual features. Even though some region-based methods reduce the computation cost with a lightweight model architecture (Wang et al., 2020a), they still suffer from the following weaknesses: (1) the expensive computational cost and time consumption of detecting objects/regions; (2) the error propagation caused by the two-step pre-training strategy. More recently, ViT-based methods (Li et al., 2021; Kim et al., 2021; Radford et al., 2021; Wang et al., 2021a; Li et al., 2022a, b; Wang et al., 2021c), especially patch-based ViTs, remove the complicated object detector from feature extraction to conduct end-to-end VL learning. These methods avoid the drawbacks of object detectors but face excessively long visual sequences without fine-grained cross-modal alignment information. Such long visual sequences also incur expensive computation, yet little work focuses on reducing this cost. In this work, we propose a novel method that detects the text-relevant patches and fuses the redundant undetected patches into a single one in the backbone, thereby decreasing the computation cost of VLP models.

2.2. ViTs Acceleration

To accelerate the computation of Transformer-based (Vaswani et al., 2017) models, many studies propose more efficient attention mechanisms (Wang et al., 2020b; Kitaev et al., 2020; Choromanski et al., 2021) or compress Transformer structures (Liu et al., 2021; Heo et al., 2021; Wang et al., 2021b). Recently, some approaches have focused on accelerating ViTs by reducing the number of tokens involved in ViT inference. For example, Ryoo et al. (2021) proposed TokenLearner, in which a relatively small number of tokens are learned by aggregating the entire feature map weighted by dynamic attention. Rao et al. (2021) introduce a method to reduce tokens for a fully trained ViT, where an extra learnable neural network is added to the ViT to select a subset of tokens. Liang et al. (2022) reduce the computational overhead of inference with a token reorganization method that progressively reduces and reorganizes image tokens. However, those methods are unsuitable for VLP because they reduce the image tokens without considering the text context. Recent VLP work (Jiang et al., 2022) has noticed this issue and attempted to reduce the redundant visual tokens based on coarse-grained cross-modal semantic alignment. However, due to the absence of fine-grained visual-text supervision, it struggles to accurately select text-relevant patches, which degrades model performance.

3. Method

In this section, we will first give an overview of our model architecture and then introduce the Text-aware Patch Detector (TPD) in the Vision Transformer (ViT) backbone. Finally, we give the details about the pre-training task of Patch-Text Alignment (PTA).

3.1. Model Architecture

As shown in Figure 2(a), COPA contains a visual encoder, a text encoder, a multimodal fusion encoder for performing cross-modal interaction, and a multimodal decoder for text generation (Note that we implement our method based on Li et al. (2022b) which provides details about the model architecture). The visual encoder named ViT-TPD is a ViT-based network that consists of multiple Transformer layers and a Text-aware Patch Detector (TPD) utilized to detect text-relevant patches.

Formally, suppose we have an input image-text pair denoted as $\left(I,T\right)$. We feed the input text to the text encoder and obtain the text representation $T=\{t_{cls},t_{1},t_{2},\cdots,t_{m}\}$, where $t_{cls}$ is the embedding of the text [CLS] token, which summarizes the global semantic information of the text. We divide the input image into $n$ non-overlapping patches $P=\{p_{cls},p_{1},p_{2},\cdots,p_{n}\}$. Then we feed the patch sequence to the visual encoder ViT-TPD and obtain the patch sequence representation $V=\{v_{cls},v_{1},v_{2},\cdots,v_{u}\}$, $u<n$. In the ViT-TPD, we apply the text-aware patch detector to detect the text-relevant image patches and fuse the other, undetected redundant patches into a single token, which reduces the visual sequence length and improves training and inference efficiency. After that, the image and text representations are fed into the cross-modal encoder, which produces the cross-modal representations $\{c_{cls},c_{1},c_{2},\dots,c_{l}\}$, where $l=u+m$. The cross-modal representations can be used to fine-tune downstream multi-modal understanding tasks. In addition, the output cross-modal representations $\{c_{cls},c_{1},c_{2},\dots,c_{l}\}$ of the multi-modal encoder are fed into a Transformer decoder for sequence-to-sequence learning.

3.2. Text-aware Patch Detection

As shown in Figure 2, our ViT-based visual backbone contains $N$ standard Transformer layers and a plug-and-play Text-aware Patch Detector (TPD), which is the only difference from the standard ViT. The TPD dynamically detects the image patches under the guidance of the textual input. Specifically, suppose the TPD is plugged in between the $k_{th}$ ($1\leq k<N$) Transformer layer and the $(k+1)_{th}$ Transformer layer, and that the output patch sequence features of the $k_{th}$ Transformer layer are $v^{k}=\{v^{k}_{cls},v^{k}_{1},\cdots,v^{k}_{n}\}$. We exclude the image [CLS] token $v^{k}_{cls}$ and feed the remaining patch tokens $\{v^{k}_{1},\cdots,v^{k}_{n}\}$ together with the text [CLS] feature $t_{cls}$ to the TPD. The text [CLS] feature is output by the text encoder and represents global information of the input text $T$.

In the TPD, we first concatenate the text [CLS] feature with each image patch token as follows:

\dot{v}^{k}_{i}=concat(v^{k}_{i},t_{cls})

where $v^{k}_{i}\in R^{d},t_{cls}\in R^{d},\dot{v}^{k}_{i}\in R^{2d},i\in\{1,2,\dots,n\}$. Then the concatenated patch features $\{\dot{v}^{k}_{i}\}$ are fed to the patch detector. The patch detector is an MLP that contains three linear layers and predicts the alignment score between each patch and the input text $T$. The first two linear layers project the concatenated patch features $\{\dot{v}^{k}_{i}\}$ to the hidden representations $\{h^{k}_{i}\}$, and the hidden representations $\{h^{k}_{i}\}$ are then fed to the last linear layer, denoted as $\mathbf{F}_{\theta}$, which can be seen as a classifier that predicts whether a patch is relevant to the input text. The output of the last linear layer has only one dimension and is passed through a Sigmoid activation function. Formally, the alignment score $a_{i}$ between the $i_{th}$ image patch and the input text $T$ is calculated as follows:

a_{i}=Sigmoid(\mathbf{F}_{\theta}(h^{k}_{i})),i\in\{1,2,\dots,n\}
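The scoring step above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the hidden width, the random initialisation, and the helper names (`linear`, `init_linear`, `alignment_scores`) are hypothetical, and no nonlinearity is placed between the first two layers because the text only describes them as linear projections.

```python
import math
import random

def linear(x, weight, bias):
    """Apply y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def init_linear(rng, in_dim, out_dim):
    """Random weights for illustration only; in practice these are learned."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def alignment_scores(patch_tokens, t_cls, layers):
    """Score every patch token against the text [CLS] feature."""
    (w1, b1), (w2, b2), (w3, b3) = layers
    scores = []
    for v in patch_tokens:
        v_dot = v + t_cls              # concat(v_i^k, t_cls), dimension 2d
        h = linear(v_dot, w1, b1)      # first linear layer
        h = linear(h, w2, b2)          # second linear layer, giving h_i^k
        logit = linear(h, w3, b3)[0]   # F_theta: one-dimensional output
        scores.append(sigmoid(logit))  # a_i in (0, 1)
    return scores

# Toy sizes (assumed): feature dimension d, hidden width, and number of patches n.
rng = random.Random(0)
d, hidden, n = 4, 8, 6
layers = [init_linear(rng, 2 * d, hidden), init_linear(rng, hidden, hidden), init_linear(rng, hidden, 1)]
patches = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
t_cls = [rng.gauss(0, 1) for _ in range(d)]
a = alignment_scores(patches, t_cls, layers)
```

Each entry of `a` plays the role of one alignment score $a_{i}$; a real model would compute the same quantities with batched tensor operations.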

Then, we identify and preserve the image tokens with the highest alignment scores, i.e., those corresponding to the $K$ largest elements of the alignment score sequence $\{a_{1},\dots,a_{n}\}$, where $K=n\times\alpha$ and $\alpha$ is a hyper-parameter called the Keeping Ratio, which controls the proportion of detected patches among all patches. The detected top-$K$ image patch tokens are kept, while the undetected patch tokens $\{v_{z_{1}},v_{z_{2}},\cdots,v_{z_{n-K}}\}$, which generally have lower alignment scores with the text, are treated as text-irrelevant tokens and merged by a token fusion operation. We fuse the undetected tokens into one supplementary token $v_{f}$ by a weighted sum as follows:

(1)

\left[\hat{a}_{z_{1}},\cdots,\hat{a}_{z_{n-K}}\right]=Softmax(\left[a_{z_{1}},\cdots,a_{z_{n-K}}\right])

(2)

v_{f}=\sum_{i=1}^{n-K}\hat{a}_{z_{i}}\cdot v_{z_{i}}

After fusing the undetected patch tokens into a single token $v_{f}$, we reconstruct the $k_{th}$ visual sequence as $v^{k}=\left[v^{k}_{cls},v^{k}_{1},\cdots,v^{k}_{u},{v}^{k}_{f}\right]$, which consists of the image [CLS] token embedding, the detected text-relevant image patch embeddings, and the fused patch embedding. The reduced visual sequence is then fed to the next, $(k+1)_{th}$, Transformer layer. The Text-aware Patch Detector (TPD) operates during both pre-training and fine-tuning, and it is optimized during pre-training by the Patch-Text Alignment objective introduced next.
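The selection and fusion steps described above can be illustrated as follows. This is a hedged sketch rather than the authors' code: the function name `detect_and_fuse` is hypothetical, $K$ is obtained by flooring $n\times\alpha$ (the rounding rule is an assumption), and kept tokens are returned in their original order (also an assumption).

```python
import math

def detect_and_fuse(patch_tokens, scores, keep_ratio):
    """Keep the top-K tokens by alignment score and fuse the rest into one token."""
    n = len(patch_tokens)
    k_keep = int(n * keep_ratio)                      # K = n * alpha (floored, assumed)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    kept_idx = sorted(order[:k_keep])                 # preserve original patch order
    dropped_idx = order[k_keep:]
    kept = [patch_tokens[i] for i in kept_idx]
    if not dropped_idx:
        return kept, None                             # nothing to fuse
    # Softmax over the scores of the undetected tokens (Eq. 1).
    top = max(scores[i] for i in dropped_idx)
    exps = [math.exp(scores[i] - top) for i in dropped_idx]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the undetected tokens (Eq. 2).
    dim = len(patch_tokens[0])
    fused = [sum(w * patch_tokens[i][j] for w, i in zip(weights, dropped_idx)) for j in range(dim)]
    return kept, fused

# Toy example: four 2-d patch tokens and a keeping ratio of 50%.
tokens = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0], [0.0, 4.0]]
scores = [0.9, 0.1, 0.8, 0.1]
kept, fused = detect_and_fuse(tokens, scores, keep_ratio=0.5)
```

In this toy case the two dropped tokens have equal scores, so the fused token is simply their average; the reduced sequence passed to the next layer would be the image [CLS] token, the kept tokens, and the fused token.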

3.3. Patch-Text Alignment

The key component of COPA is the Text-aware Patch Detector, which needs to detect text-relevant patches according to the fine-grained alignment scores between the image patches and the input text. However, the fine-grained patch-text alignment capabilities of traditional ViT-based models are weak owing to the lack of fine-grained patch-text labels. To address this difficulty, in this sub-section we introduce a novel pre-training task named Patch-Text Alignment, which facilitates the training of the patch detector and drives our model to learn fine-grained patch-text alignment.

We find that in most object detection and visual grounding datasets, each object or region is generally paired with a class label or a text description. Therefore, we can convert every object class label to a text description based on a text template such as "This is a [class label].". Thus, for each (object/region) bounding box in an image, we can obtain a text description. Then, we transform the bounding box annotations into patch-level labels by following this rule: given an image and a bounding box annotation, an image patch is assigned label 1 if it overlaps with the bounding box, and label 0 otherwise. For different text descriptions and bounding boxes, the labels of a patch differ. In this way, we can generate fine-grained patch-text labels that serve as the supervisory signal to pre-train our model.
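The conversion rule above can be sketched as a small routine. The names `bbox_to_patch_labels` and `class_label_to_text` are hypothetical, and the sketch assumes pixel-coordinate boxes of the form (x_min, y_min, x_max, y_max), image dimensions divisible by the patch size, and that "overlap" means a strictly positive overlapping area.

```python
def class_label_to_text(class_label):
    """Fill the text template quoted in the paragraph above."""
    return f"This is a {class_label}."

def bbox_to_patch_labels(image_size, patch_size, bbox):
    """Return row-major 0/1 labels: 1 if a patch overlaps the bounding box."""
    width, height = image_size
    x_min, y_min, x_max, y_max = bbox
    labels = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            right, bottom = left + patch_size, top + patch_size
            # Two axis-aligned rectangles overlap iff they overlap on both axes.
            overlaps = left < x_max and right > x_min and top < y_max and bottom > y_min
            labels.append(1 if overlaps else 0)
    return labels

# Toy example: a 64x64 image split into a 4x4 grid of 16x16 patches.
labels = bbox_to_patch_labels((64, 64), 16, (10, 10, 30, 20))
text = class_label_to_text("dog")
```

Here the box touches the first two columns of the first two rows, so four of the sixteen patches receive label 1.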

After that, in each step of pre-training, we randomly sample a mini-batch of images from the object detection/visual grounding datasets (e.g., COCO (Lin et al., 2014) or VG (Krishna et al., 2016)). For each image, we randomly select an object/region bounding box and translate the bounding box annotation into the image patch label sequence following the transformation rule described above. Then, we feed the batch of bounding box text descriptions together with the images to our VLP model. In the ViT-TPD backbone, we expect the Text-aware Patch Detector to detect all patches that overlap with the region in the bounding box, guided by the text description of the bounding box. Once the Text-aware Patch Detector has predicted the alignment scores between image patches and text, we calculate the binary cross-entropy loss between the alignment scores and the patch labels as follows:

(3)

\mathbf{L}_{PTA}=-\frac{1}{e}\sum_{i=1}^{e}\left[Y_{i}\log\left(a_{i}\right)+\left(1-Y_{i}\right)\log\left(1-a_{i}\right)\right]

where $a_{i}$ is the alignment score between the $i_{th}$ patch in the image and the input text, $Y_{i}$ is the patch label of the $i_{th}$ patch, and $e$ is the number of patches over which the loss is averaged. After calculating the Patch-Text Alignment loss $\mathbf{L}_{PTA}$, we randomly sample a mini-batch of normal image-text pairs from the dataset of 4M images (refer to subsection 4.1) and calculate the Image-Text Contrastive (ITC) loss $\mathbf{L}_{ITC}$, the Image-Text Matching (ITM) loss $\mathbf{L}_{ITM}$, the Masked Language Modeling (MLM) loss $\mathbf{L}_{MLM}$, and the Prefix Language Modeling (PrefixLM) loss $\mathbf{L}_{Prefix}$ based on the other four pre-training objectives (for more details about these objectives, please refer to Appendix B). We assign equal loss weights to each pre-training loss, and thus the full pre-training loss is:

(4)

\mathbf{L}=\mathbf{L}_{ITC}+\mathbf{L}_{ITM}+\mathbf{L}_{MLM}+\mathbf{L}_{Prefix}+\mathbf{L}_{PTA}
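For concreteness, the patch-level loss and the equal-weight total can be sketched as below. The binary cross-entropy is written in the standard negative-log-likelihood form, so lower values are better. The function names and the clamping constant `eps` are assumptions introduced for numerical safety and are not taken from the paper.

```python
import math

def pta_loss(scores, labels, eps=1e-7):
    """Mean binary cross-entropy between alignment scores a_i and patch labels Y_i."""
    total = 0.0
    for a, y in zip(scores, labels):
        a = min(max(a, eps), 1.0 - eps)  # clamp to avoid log(0); eps is an assumed constant
        total += y * math.log(a) + (1 - y) * math.log(1.0 - a)
    return -total / len(scores)

def full_pretraining_loss(l_itc, l_itm, l_mlm, l_prefix, l_pta):
    """Equal-weight sum of the five pre-training losses (Eq. 4)."""
    return l_itc + l_itm + l_mlm + l_prefix + l_pta

# Toy example: confident, correct predictions give a small loss.
example_loss = pta_loss([0.9, 0.1], [1, 0])
```

With uninformative scores of 0.5 for every patch the loss equals ln 2, and it shrinks towards zero as the scores approach the labels.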

We also provide pseudocode in Algorithm 1 to further elaborate on our pre-training schedule. In addition, at the beginning of pre-training the PTA loss has not yet converged, so the performance of the patch detector is not ideal; during this phase, we detect the image patches directly based on the attention weights of the image [CLS] token to the other patch tokens. As the PTA loss gradually converges, we switch to using the patch detector to detect the text-relevant patches in the TPD module.

Input: Large-scale pre-training dataset $\mathcal{D}$, object/region dataset $\mathcal{O}$, the number of pre-training epochs $T$, the pre-training learning rate $\alpha$, the batch size $B_{D}$ of dataset $\mathcal{D}$, the batch size $B_{O}$ of dataset $\mathcal{O}$

1 Initialize the parameters $\theta$ of our model $M$;
2 for $t=1$ to $T$ do
3     Randomly sample a mini-batch of $B_{O}$ images $\{\hat{v}_{1},\hat{v}_{2},\dots,\hat{v}_{B_{O}}\}$ from $\mathcal{O}$;
4     for $i=1$ to $B_{O}$ do
5         Select an object or region $r_{i}$ from image $\hat{v}_{i}$;
6         Convert the object class label $\hat{y}_{i}$ to the text description $\hat{t}_{i}$;
7         Convert the bounding box annotation of $r_{i}$ to patch annotations $Y^{i}=\{y^{i}_{1},y^{i}_{2},\dots,y^{i}_{n}\}$;
8     end
9     Run forward of $M$ on the mini-batch of image-text pairs $\{\{\hat{v}_{1},\hat{t}_{1}\},\{\hat{v}_{2},\hat{t}_{2}\},\dots,\{\hat{v}_{B_{O}},\hat{t}_{B_{O}}\}\}$ and $\{Y^{1},Y^{2},\dots,Y^{B_{O}}\}$ to obtain the loss $\mathbf{L}_{PTA}$;
10     Randomly sample a mini-batch of $B_{D}$ image-text pairs $\{\{v_{1},t_{1}\},\{v_{2},t_{2}\},\ldots,\{v_{B_{D}},t_{B_{D}}\}\}$ from $\mathcal{D}$;
11     Run forward of $M$ on the mini-batch of image-text pairs $\{\{v_{1},t_{1}\},\{v_{2},t_{2}\},\ldots,\{v_{B_{D}},t_{B_{D}}\}\}$ to obtain the losses $\mathbf{L}_{ITC}$, $\mathbf{L}_{ITM}$, $\mathbf{L}_{MLM}$, $\mathbf{L}_{Prefix}$;
12     Calculate the overall loss: $\mathbf{L}=\mathbf{L}_{ITC}+\mathbf{L}_{ITM}+\mathbf{L}_{MLM}+\mathbf{L}_{Prefix}+\mathbf{L}_{PTA}$;
13     Backpropagate the overall loss $\mathbf{L}$ and update the parameters of $M$ by gradient descent with learning rate $\alpha$, using the loss averaged over the mini-batch: $\theta\leftarrow\theta-\alpha\frac{1}{B}\sum_{i=1}^{B}\nabla_{\theta}\mathbf{L}(\theta;s_{i})$;
14 end
15 return $M$ with pre-trained parameters $\theta$;

Algorithm 1 Pre-training algorithm of COPA.
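The alternating structure of one iteration of Algorithm 1 (an object/region batch for PTA followed by an ordinary image-text batch for the other four objectives) can be sketched with stub components. Everything below is illustrative: `pretrain_step`, the dictionary layout of the toy datasets, and the constant stub losses are hypothetical stand-ins for the real model and data loaders, and the parameter update of step 13 is omitted.

```python
import random

def pretrain_step(losses, object_data, pair_data, batch_o, batch_d, rng):
    """One iteration of the alternating schedule; returns the summed loss."""
    # Steps 3-7: sample annotated images, pick one region each, build text and patch labels.
    pta_batch = []
    for sample in rng.sample(object_data, batch_o):
        region = rng.choice(sample["regions"])
        text = f"This is a {region['label']}."
        pta_batch.append((sample["image"], text, region["patch_labels"]))
    # Step 9: forward pass on the object/region batch gives the PTA loss.
    l_pta = losses["pta"](pta_batch)
    # Steps 10-11: forward pass on ordinary image-text pairs gives the other four losses.
    pair_batch = rng.sample(pair_data, batch_d)
    others = [losses[name](pair_batch) for name in ("itc", "itm", "mlm", "prefix")]
    # Step 12: equal-weight sum of all five losses.
    return l_pta + sum(others)

# Toy data and constant stub losses, purely to make the sketch executable.
object_data = [
    {"image": f"img{i}", "regions": [{"label": "dog", "patch_labels": [1, 0, 0, 1]}]}
    for i in range(4)
]
pair_data = [(f"img{i}", f"caption {i}") for i in range(8)]
stub_losses = {name: (lambda batch: 1.0) for name in ("pta", "itc", "itm", "mlm", "prefix")}

step_loss = pretrain_step(stub_losses, object_data, pair_data, 2, 4, random.Random(0))
print(step_loss)
```

In a real training loop the stub losses would be replaced by forward passes of the model, followed by backpropagation and an optimizer step on the summed loss.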

4. Experiments

4.1. Data & Setup

Following previous work (Li et al., 2021), we use the same pre-training dataset of 4M images with texts, which includes two in-domain datasets (MS COCO (Lin et al., 2014) and Visual Genome (Krishna et al., 2016)) and two web out-of-domain datasets (Conceptual Captions (Sharma et al., 2018) and SBU Captions (Ordonez et al., 2011)). See Appendix A for more details on the pre-training datasets.

We pre-train the model for 30 epochs with a total batch size of 1024 on 16 NVIDIA A100 GPUs. We use a 6-layer Transformer for both the text encoder and the cross-modal skip-connected network, and a 12-layer Transformer for the decoder. The text encoder is initialized using the first 6 layers of the BERT$_{base}$ (Devlin et al., 2019) model and the skip-connected network is initialized using the last 6 layers of BERT$_{base}$. Please see Appendix D for more details of the pre-training setting of our model.

Models	# Pretrain Data	VQA		COCO Caption								NoCaps	
				Cross-entropy Optimization				CIDEr Optimization					
		Test-dev	Test-std	B@4	M	C	S	B@4	M	C	S	C	S
E2E-VLP (Xu et al., 2021)	4M	73.25	73.67	36.2	-	117.3	-	-	-	-	-	-	-
OSCAR (Li et al., 2020)	6.5M	73.16	73.44	-	-	-	-	41.7	30.6	140.0	24.5	83.4	11.4
VinVL (Zhang et al., 2021)	5.65M	76.52	76.60	38.5	30.4	130.8	23.4	41.0	31.1	140.9	25.2	97.3	13.8
METER (Dou et al., 2021)	4M	77.68	77.64	-	-	-	-	-	-	-	-	-	-
BLIP (Li et al., 2022a)	14M	77.54	77.62	38.6	-	129.7	-	-	-	-	-	105.1	14.4
VLMo (Wang et al., 2021a)	4M	76.64	76.89	-	-	-	-	-	-	-	-	-	-
ViLBERT (Lu et al., 2019)	3.3M	70.63	70.92	-	-	-	-	-	-	-	-	-	-
VisualBERT (Li et al., 2019)	180K	70.80	71.00	-	-	-	-	-	-	-	-	-	-
SimVLM (Wang et al., 2021c)	1.8B	77.87	78.14	39.0	32.9	134.8	24.0	-	-	-	-	-	-
ALBEF (Li et al., 2021)	14M	75.84	76.04	-	-	-	-	-	-	-	-	-	-
TRIPS (Jiang et al., 2022)	4M	76.23	76.48	-	-	-	-	-	-	-	-	-	-
mPLUG (Li et al., 2022b)	4M	77.55	77.73	39.3	30.1	132.4	23.34	41.2	30.8	140.2	25.2	98.3	12.9
COPA	4M	77.84	77.91	39.5	32.8	133.8	24.12	41.5	31.0	140.4	25.1	98.9	13.1

Table 1. Evaluation results on VQA, the COCO Caption "Karpathy" test split, and the NoCaps validation set. B@4: BLEU@4, M: METEOR, C: CIDEr, S: SPICE. More details about the comparison models are given in the Appendix.

Models	# Pretrain Data	MSCOCO (5K test set)						Flickr30K (1K test set)
		TR			IR			TR			IR
		R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10
ALIGN (Jia et al., 2021)	1.8B	77.0	93.5	96.9	59.9	83.3	89.8	95.3	99.8	100.0	84.9	97.4	98.6
OSCAR (Li et al., 2020)	4M	70.0	91.1	95.5	54.0	80.8	88.5	-	-	-	-	-	-
E2E-VLP (Xu et al., 2021)	4M	-	-	-	-	-	-	86.2	97.5	98.92	73.6	92.4	96.0
UNITER (Chen et al., 2020)	4M	65.7	88.6	93.8	52.9	79.9	88.0	87.3	98.0	99.2	75.6	94.1	96.8
VLMo (Wang et al., 2021a)	4M	78.2	94.4	97.4	60.6	84.4	91.0	95.3	99.9	100.0	84.5	97.3	98.6
ALBEF (Li et al., 2021)	14M	77.6	94.3	97.2	60.7	84.3	90.5	95.9	99.8	100.0	85.6	97.5	98.9
BLIP (Li et al., 2022a)	14M	80.6	95.2	97.6	63.1	85.3	91.1	96.6	99.8	100.0	87.2	97.5	98.8
TRIPS (Jiang et al., 2022)	14M	78.1	94.8	97.6	61.3	84.3	91.4	96.3	99.8	100.0	85.8	98.1	99.0
mPLUG (Li et al., 2022b)	4M	80.2	95.1	97.7	62.5	84.8	90.9	96.4	99.8	100.0	86.5	97.5	98.8
COPA	4M	80.8	95.6	98.1	63.6	85.6	91.6	96.8	99.8	100.0	87.3	97.9	98.9

Table 2. Image-text retrieval results on Flickr30K and COCO datasets.

4.2. Main Result

We evaluate our model COPA on four widely explored vision-language downstream tasks: Visual Question Answering (VQA), Cross-modal Retrieval, Image Captioning, and Visual Grounding (VG). We plug the Text-aware Patch Detector in before the 6th Transformer layer of the ViT encoder and set the keeping ratio to 50%, achieving the desired trade-off between downstream task performance and model inference speed. The fine-tuning hyperparameters are described in Appendix E. Details of the comparison methods are given in the Appendix.

4.2.1. Visual Question Answering

The VQA task (Agrawal et al., 2015) requires the model to answer natural language questions about a given image. During fine-tuning and inference for VQA, we feed the [CLS] token of the question to the TPD to detect the question-relevant patch tokens (Figure 4 visualizes the detected tokens, which indicates the effectiveness and generalization of the TPD). We follow Li et al. (2021) and treat VQA as an answer-generation problem. We report test-dev and test-std scores in Table 1, obtained by submitting our results to the evaluation server (https://eval.ai/web/challenges/challenge-page/830/overview). Compared with the state-of-the-art VLP baselines, COPA achieves better performance (e.g., 77.84 on VQA test-dev) at the same image resolution (384 $\times$ 384) while speeding up model inference by about 88% (see the results reported in Table 4 and Table 5). Furthermore, when we increase the image resolution to 512 $\times$ 512 (as shown in Table 5), we achieve better performance (e.g., 78.25 on VQA test-dev) while keeping an inference cost similar to that of the baseline mPLUG (Li et al., 2022b) (e.g., a throughput of 65.23 for COPA$_{512\times 512}$ versus 63.57 for mPLUG). These results demonstrate the effectiveness and efficiency of COPA.

4.2.2. Image Captioning

As there is no textual input in the image captioning task, we detect the patches directly from visual information: we use the attention weights of the image [CLS] token to the other image tokens as the detection scores and fuse the image tokens that receive low attention weights from the image [CLS] token. Following Li et al. (2020), we first fine-tune COPA with cross-entropy loss and then with CIDEr optimization (Rennie et al., 2017) for 5 extra epochs. As shown in Table 1, COPA achieves results comparable to or better than SOTA models on both the COCO Caption (Lin et al., 2014) and NoCaps (Agrawal et al., 2018) datasets.

Model	RefCOCO+
	val	testA	testB
UNITER (Chen et al., 2020)	75.90	81.45	66.70
VL-BERT (Su et al., 2020)	72.59	78.57	62.30
ViLBERT(Lu et al., 2019)	72.34	78.52	62.61
VILLA (Gan et al., 2020)	76.17	81.54	66.84
MDETR (Kamath et al., 2021)	79.52	84.09	70.62
UNICORN (Yang et al., 2021)	80.30	85.05	71.88
mPLUG (Li et al., 2022b)	80.07	85.21	71.03
COPA	80.37	86.03	71.81

Table 3. Evaluation results of visual grounding on RefCOCO+. We report accuracy at an IoU threshold of 0.5 (a prediction is correct if the IoU between the ground-truth box and the predicted bounding box is larger than 0.5).
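The accuracy criterion in the caption of Table 3 can be made concrete with a short sketch. The function names are hypothetical and boxes are assumed to be given as (x_min, y_min, x_max, y_max) tuples.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def grounding_accuracy(predicted_boxes, ground_truth_boxes, threshold=0.5):
    """Fraction of predictions whose IoU with the ground truth exceeds the threshold."""
    hits = sum(1 for p, g in zip(predicted_boxes, ground_truth_boxes) if iou(p, g) > threshold)
    return hits / len(ground_truth_boxes)

# Toy example: one exact match and one box shifted by half its width.
preds = [(0, 0, 10, 10), (5, 0, 15, 10)]
gts = [(0, 0, 10, 10), (0, 0, 10, 10)]
accuracy = grounding_accuracy(preds, gts)
```

The shifted box has an IoU of one third with its ground truth, so only the first prediction counts as correct in this toy case.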

Models	VQA	FLOPs	Throughput	Latency
UNITER (Chen et al., 2020)	72.70	949.90	6.42	870ms
OSCAR (Li et al., 2020)	73.16	956.40	6.35	860ms
VinVL (Zhang et al., 2021)	76.52	1023.30	7.32	640ms
E2E-VLP (Xu et al., 2021)	73.25	144.3	80.23	70ms
ViLT (Kim et al., 2021)	71.26	55.40	247.530	19ms
ALBEF (Li et al., 2021)	74.54	33.42	197.52	22ms
TRIPS (Jiang et al., 2022)	76.23	20.89	343.05	11ms
mPLUG (Li et al., 2022b)	77.55	36.63	186.42	24ms
COPA	77.84	19.84	349.71	10ms

Table 4. Efficiency comparison of different models. We report the VQA test-dev result together with FLOPs, throughput, and latency. The FLOPs results of the baselines come from Kim et al. (2021). Since FLOPs are proportional to input size, for a fair comparison we keep the same input size as Kim et al. (2021): a sequence length of 197 for image patches (the image resolution is $224\times 224$) and 40 for text tokens. We keep the same setting when calculating throughput and latency.
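The visual sequence length of 197 quoted in the caption of Table 4 is consistent with a $224\times 224$ image split into $16\times 16$ patches plus one [CLS] token. The sketch below assumes that patch size (it is implied by the 197 figure but not stated here) and also shows the sequence length after patch detection, with $K$ floored; the function names are hypothetical.

```python
def visual_sequence_length(image_size, patch_size=16):
    """Number of visual tokens before detection: all patches plus the image [CLS] token."""
    per_side = image_size // patch_size
    return per_side * per_side + 1

def reduced_sequence_length(image_size, keep_ratio, patch_size=16):
    """Tokens after the TPD: [CLS] + kept patches + one fused token."""
    n = (image_size // patch_size) ** 2
    kept = int(n * keep_ratio)  # K = n * alpha, floored (assumed rounding)
    return 1 + kept + 1

for size in (224, 384, 512):
    print(size, visual_sequence_length(size), reduced_sequence_length(size, 0.5))
```

With a keeping ratio of 50%, a $224\times 224$ input shrinks from 197 to 100 visual tokens in the layers after the detector, which is the source of the reduced computation in those layers.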

4.2.3. Image-Text Retrieval

We conduct experiments for both image-to-text retrieval (TR) and text-to-image retrieval (IR) on the MSCOCO (Lin et al., 2014) and Flickr30K (Plummer et al., 2015) datasets. We jointly optimize the ITC loss and the ITM loss during fine-tuning. As shown in Table 2, our model achieves performance comparable to that of other VLP baselines. For more details, please refer to Appendix E.

4.2.4. Visual Grounding

Following the setting of mPLUG (Li et al., 2022b), we also evaluate COPA on the visual grounding task, which requires models to localize the referred object in the image based on a given text query. In this task, we feed the text query to the TPD to detect the query-relevant patch tokens. Then, instead of directly regressing the bounding boxes, we concatenate the detected patch features and the attended textual features and feed them into the multi-modal decoder to predict the coordinates. Table 3 reports the performance of COPA on the visual grounding task. COPA achieves results comparable to those of competitive baseline methods.

Detection Location	Keeping ratio	image size	VQA test-dev	FLOPs(G)	Throughput
-	-	$384\times 384$	77.55	84.872	63.57
[6]	50%	$224\times 224$	76.83	19.84	349.71
[6]	50%	$256\times 256$	77.11	23.04	303.03
[6]	50%	$304\times 304$	77.32	32.97	247.62
[6]	50%	$384\times 384$	77.84	47.56	144.38
[6]	50%	$464\times 464$	78.03	75.99	81.02
[6]	50%	$512\times 512$	78.25	83.24	65.23

Table 5. Results of fine-tuning COPA on the VQA task with images of different resolutions. The settings for calculating FLOPs and throughput are the same as in Table 4 except for the image resolution. The first row of the table reports the result of our baseline mPLUG.

4.3. Efficiency of Text-aware Patch Detection

To assess the efficiency of the Text-aware Patch Detection mechanism, we first compare the computational complexity of various models. We report the number of floating-point operations (FLOPs), a widely used evaluation metric for model computational complexity. Additionally, to evaluate the computational speed of our model, we compare the throughput and latency of different models. We use a Xeon Platinum 8163 CPU and an NVIDIA V100 GPU to measure latency and throughput.

As illustrated in Table 4, COPA exhibits not only the lowest computational complexity (19.84 GFLOPs) but also the fastest computational speed (a throughput of 349.71 and a latency of 10ms). Moreover, as shown in Figure 3, we evaluate the single-GPU memory cost under varying keeping ratios and detection locations, where the detection location is the position at which the TPD is integrated into the ViT backbone. For example, Detection Location=6 indicates that the TPD is inserted before the 6th transformer layer in the ViT backbone. The results demonstrate that Text-aware Patch Detection significantly reduces GPU memory usage compared to the baseline model. We also observe that the keeping ratio influences GPU memory consumption, while the detection location of the TPD does not.
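As a quick arithmetic check on the efficiency numbers, the sketch below computes the relative throughput gain and FLOPs saving of COPA over its baseline mPLUG. The values are copied from Table 4; the dictionary layout and function names are illustrative only.

```python
# Throughput and FLOPs (G) copied from Table 4.
TABLE4 = {
    "mPLUG": {"gflops": 36.63, "throughput": 186.42},
    "COPA": {"gflops": 19.84, "throughput": 349.71},
}

def relative_speedup(model: str, baseline: str) -> float:
    """Fractional throughput gain of `model` over `baseline`."""
    return TABLE4[model]["throughput"] / TABLE4[baseline]["throughput"] - 1.0

def flops_reduction(model: str, baseline: str) -> float:
    """Fraction of the baseline's FLOPs that `model` saves."""
    return 1.0 - TABLE4[model]["gflops"] / TABLE4[baseline]["gflops"]

speedup = relative_speedup("COPA", "mPLUG")   # about 0.876
saving = flops_reduction("COPA", "mPLUG")     # about 0.458
print(f"speedup: {speedup:.1%}, FLOPs saved: {saving:.1%}")
```

Under these numbers the throughput gain is roughly 88% and the FLOPs saving roughly 46% relative to mPLUG.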

4.4. The Impact of Detection Location and Keeping Ratio

To investigate the influence of the detection location in the ViT backbone and the keeping ratio on the efficiency and effectiveness of COPA, we train COPA using different detection locations and keeping ratios. Note that when calculating FLOPs and throughput, we set the input image size to $224\times 224$ and the input text length to 40. As depicted in Table 9, two main conclusions can be drawn:

First, incorporating the Text-aware Patch Detector (TPD) before shallower layers can reduce computational complexity but at the cost of accuracy. For instance, when TPD is placed before the 4th layer, there is a significant increase in throughput, but the accuracy drops considerably. A possible explanation is that the patch embedding in shallow layers may not adequately represent visual semantics, making it challenging to learn the fine-grained patch-text alignment and subsequently leading to a decline in accuracy.

Second, discarding too many image tokens in the TPD module can significantly impair downstream task performance. For example, when positioning the TPD module before the 4th layer in ViT and setting the keeping ratio to 10%, the performance on the VQA task decreases to 76.48, compared to 77.84 achieved by the model with a 50% keeping ratio in the 6th layer.
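The trade-off between detection location and keeping ratio can be illustrated with a first-order cost model. The sketch below counts token-layer pairs in the ViT backbone only; it assumes a 12-layer backbone and ignores the quadratic attention term, the text encoder, and the fusion layers, so it indicates trends rather than reproducing the FLOPs in Table 9. All function names are hypothetical.

```python
import math

def kept_tokens(n_tokens: int, keep_ratio: float) -> int:
    """Tokens left after detection: the [CLS] token plus a fraction of the patches."""
    return 1 + math.floor(keep_ratio * (n_tokens - 1))

def relative_token_cost(location: int, keep_ratio: float,
                        n_layers: int = 12, n_tokens: int = 197) -> float:
    """Token-layer count relative to the unpruned backbone.

    Layers before `location` (1-indexed) see all tokens; layers from
    `location` onward see only the kept ones. n_layers=12 is an assumption.
    """
    full_layers = location - 1
    pruned_layers = n_layers - full_layers
    pruned = (full_layers * n_tokens
              + pruned_layers * kept_tokens(n_tokens, keep_ratio))
    return pruned / (n_layers * n_tokens)

for loc in (4, 6, 8):
    print(loc, [round(relative_token_cost(loc, r), 3) for r in (0.1, 0.3, 0.5, 0.7)])
```

Even in this simplified model, an earlier location or a lower keeping ratio always lowers the cost, which matches the direction of the FLOPs columns in Table 9.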

4.5. Finetuning on Higher Resolution Images

We can regulate the computational cost by fusing different numbers of inattentive tokens. To this end, we fine-tune COPA on the VQA task, using images with varying resolutions as input. The results are reported in Table 5. The experimental findings indicate that by increasing the input image resolution, the model benefits from processing more image tokens, resulting in improved performance. For instance, by fine-tuning COPA with 512 $\times$ 512 resolution images, we can achieve a score of 78.25 on VQA, surpassing the baseline fine-tuned with 384 $\times$ 384 images while maintaining a similar computational complexity.
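To make the resolution argument concrete, the sketch below counts ViT input tokens for the resolutions in Table 5, assuming a patch size of 16 (as in ViT-B/16) and one [CLS] token. The helper names are illustrative and not taken from the paper's code.

```python
def vit_tokens(image_size: int, patch_size: int = 16) -> int:
    """Number of ViT input tokens for a square image: patches plus one [CLS]."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side * per_side + 1

def tokens_after_detection(image_size: int, keep_ratio: float = 0.5) -> int:
    """Tokens remaining once the detector keeps `keep_ratio` of the patches."""
    patches = vit_tokens(image_size) - 1
    return 1 + int(keep_ratio * patches)

for size in (224, 256, 304, 384, 464, 512):
    print(size, vit_tokens(size), tokens_after_detection(size))
```

At $224\times 224$ this gives the 197 tokens used in Table 4. At $512\times 512$ with a 50% keeping ratio, the layers after detection see 513 tokens, fewer than the 577 tokens an unpruned backbone processes at $384\times 384$, which is one plausible reason the two settings have similar cost.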

4.6. Ablation Study

4.6.1. Effectiveness of Text-aware Patch Detector

We also perform ablation studies to investigate the effects of our proposed pre-training task, Patch-Text Alignment (PTA), and the Text-aware Patch Detector (TPD). In Table 10, w/o PTA indicates that we remove the PTA task but keep the TPD in the visual backbone. However, without the PTA task the TPD cannot be optimized and is therefore ineffective. We thus replace the TPD with a simple strategy in which we directly detect patches in a transformer layer based on the self-attention weights of the image [CLS] token to the other patch tokens. As shown in Table 10, without text guidance, w/o PTA, which detects patches based on visual information alone, suffers a significant accuracy drop on VQA compared with the baseline model mPLUG. In contrast, COPA detects the text-relevant patches with the TPD and even obtains a slight improvement over mPLUG, which indicates the effectiveness of the TPD and PTA.
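The text-agnostic strategy used for the w/o PTA variant can be sketched as a top-k selection over [CLS] attention weights. This is a minimal illustration that assumes the weights have already been averaged over heads; the function name and the toy values are hypothetical.

```python
def select_by_cls_attention(cls_attention: list[float], keep_ratio: float) -> list[int]:
    """Return indices of the patches with the highest [CLS] attention.

    `cls_attention[i]` is the (head-averaged) attention weight from the image
    [CLS] token to patch i. Indices are returned in their original order so
    the spatial ordering of the kept patches is preserved.
    """
    n_keep = max(1, int(len(cls_attention) * keep_ratio))
    ranked = sorted(range(len(cls_attention)),
                    key=lambda i: cls_attention[i], reverse=True)
    return sorted(ranked[:n_keep])

# Toy example with 8 patches and a 50% keeping ratio.
weights = [0.02, 0.30, 0.05, 0.25, 0.01, 0.20, 0.07, 0.10]
kept = select_by_cls_attention(weights, keep_ratio=0.5)
print(kept)  # [1, 3, 5, 7]
```

Because the selection never looks at the text, the same patches are kept for every question about an image, which is the limitation the ablation exposes.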

4.6.2. Effectiveness of Patch-Text Alignment

w/o TPD means we remove the TPD but keep the PTA pre-training task. Compared with the baseline model mPLUG (Li et al., 2022b) (w/o TPD & w/o PTA), the efficiency of the model is unchanged, but the VQA score improves clearly (77.98 vs. 77.55 in Table 10). These results not only show the efficiency of the Text-aware Patch Detection mechanism but also indicate the effectiveness of PTA, which enables our model to learn fine-grained cross-modal alignment and thus improves performance on the VQA task.

4.7. Extension to Single-stream Model

Models	EL	KR	FLOPs(G)	Throughput	VQA test-dev
COPA -S	6	50%	26.23	436.74	71.51
ViLT	-	-	55.40	247.53	71.26
COPA	6	50%	19.84	349.71	77.84
mPLUG	-	-	36.63	186.42	77.55

Table 6. Evaluation results of COPA, COPA-S, and their baselines mPLUG and ViLT on VQA test-dev. The setting for calculating FLOPs and throughput is the same as in Table 4. KR refers to Keeping Ratio, EL refers to Detection Location.

The proposed Text-aware Patch Detection mechanism can also be extended to single-stream models by incorporating the TPD into the multimodal encoder and pre-training the model with the Patch-Text Alignment task. To verify the effectiveness of the TPD and the Patch-Text Alignment task in single-stream models, we first implement a single-stream model (COPA-S) based on the ViLT (Kim et al., 2021) framework, which employs a visual transformer as the cross-modal encoder. Next, we insert the TPD before the 6th layer of the cross-modal encoder and pre-train it with the PTA task. (For the single-stream model, the TPD can be directly integrated into the cross-modal encoder and detects patches based on the overall [CLS] token’s attention values.) We then evaluate the downstream task performance, computational complexity, and inference speed of COPA and COPA-S (both with and without Text-aware Patch Detection). The results are shown in Table 6. We observe consistent improvements in inference speed and downstream task performance for both COPA and COPA-S when incorporating Text-aware Patch Detection and Patch-Text Alignment. These results indicate that the proposed image patch selection mechanism is not only efficient but also effective. Notably, compared to the dual-stream model COPA, COPA-S has faster inference due to the parameter efficiency of the single-stream model. However, its performance falls short of the state of the art on downstream VL tasks.

4.8. Case Study

The proposed Text-aware Patch Detector (TPD) identifies text-consistent image tokens in the vision backbone and retains the detected image patches. To further investigate the effectiveness of the TPD, we visualize VQA cases and the detected text-relevant image patches in Figure 4. Based on the text questions, the TPD module effectively detects relevant patches, and even when preserving only 10% of the detected patches, our model can still produce correct answers. For instance, in the first case, the question is “Are they having breakfast?”, and the TPD effectively detects patches of food and the girl’s mouth in the image, which are highly relevant to the question. In the second case, the question is “What color is the suitcase?”, indicating that the red suitcase in the image is text-relevant, while other visual objects like the black cat are text-irrelevant. As illustrated in Figure 4, the TPD effectively detects text-relevant patches, which help our model predict the correct answer. It is worth noting that these examples are not cherry-picked, and this phenomenon is commonly observed in other samples.

5. Conclusion

We have presented COPA, an efficient and effective VLP model that successfully learns fine-grained patch-text alignment and reduces lengthy visual sequences for streamlined training. Specifically, we devise a Patch-Text Alignment pre-training task (PTA) based on a Text-aware Patch Detector (TPD). The TPD is incorporated into the ViT backbone to identify text-relevant patches and eliminate redundant ones. PTA allows our model to learn fine-grained patch-text alignment end-to-end by jointly optimizing with other pre-training tasks. Experiments demonstrate that our method enhances efficiency through the reduction of visual sequences while maintaining or even improving the performance of downstream tasks.

References

Agrawal et al. (2015) Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Devi Parikh, and Dhruv Batra. 2015. VQA: Visual Question Answering. International Journal of Computer Vision 123 (2015), 4–31.
Agrawal et al. (2018) Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. 2018. nocaps: novel object captioning at scale. CoRR abs/1812.08658 (2018). arXiv:1812.08658 http://arxiv.org/abs/1812.08658
Bi et al. (2020) Bin Bi, Chenliang Li, Chen Wu, Ming Yan, Wei Wang, Songfang Huang, Fei Huang, and Luo Si. 2020. Palm: Pre-training an autoencoding&autoregressive language model for context-conditioned generation. arXiv preprint arXiv:2004.07159 (2020).
Chen et al. (2020) Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2020. UNITER: UNiversal Image-TExt Representation Learning. In ECCV.
Choromanski et al. (2021) Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamás Sarlós, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy J. Colwell, and Adrian Weller. 2021. Rethinking Attention with Performers. ArXiv abs/2009.14794 (2021).
Cubuk et al. (2020) Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. 2020. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 702–703.
Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. ArXiv abs/1810.04805 (2019).
Dosovitskiy et al. (2021) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ArXiv abs/2010.11929 (2021).
Dou et al. (2021) Zi-Yi Dou, Yichong Xu, Zhe Gan, Jianfeng Wang, Shuohang Wang, Lijuan Wang, Chenguang Zhu, Zicheng Liu, Michael Zeng, et al. 2021. An Empirical Study of Training End-to-End Vision-and-Language Transformers. arXiv preprint arXiv:2111.02387 (2021).
Gan et al. (2020) Zhe Gan, Yen-Chun Chen, Linjie Li, Chen Zhu, Yu Cheng, and Jingjing Liu. 2020. Large-Scale Adversarial Training for Vision-and-Language Representation Learning. In NeurIPS.
Goyal et al. (2017) Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition. 6904–6913.
He et al. (2017) Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross B. Girshick. 2017. Mask R-CNN. 2017 IEEE International Conference on Computer Vision (ICCV) (2017), 2980–2988.
Heo et al. (2021) Byeongho Heo, Sangdoo Yun, Dongyoon Han, Sanghyuk Chun, Junsuk Choe, and Seong Joon Oh. 2021. Rethinking Spatial Dimensions of Vision Transformers. 2021 IEEE/CVF International Conference on Computer Vision (ICCV) (2021), 11916–11925.
Huang et al. (2020) Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, and Jianlong Fu. 2020. Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers. ArXiv abs/2004.00849 (2020).
Jia et al. (2021) Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V Le, Yunhsuan Sung, Zhen Li, and Tom Duerig. 2021. Scaling up visual and vision-language representation learning with noisy text supervision. arXiv preprint arXiv:2102.05918 (2021).
Jiang et al. (2023a) Chaoya Jiang, Rui Xie, Wei Ye, Jinan Sun, and Shikun Zhang. 2023a. Exploiting Pseudo Image Captions for Multimodal Summarization. In Annual Meeting of the Association for Computational Linguistics. https://api.semanticscholar.org/CorpusID:258564588
Jiang et al. (2022) Chaoya Jiang, Haiyang Xu, Chenliang Li, Ming Yan, Wei Ye, Shikun Zhang, Bin Bi, and Songfang Huang. 2022. TRIPS: Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection. In Conference on Empirical Methods in Natural Language Processing.
Jiang et al. (2023b) Chaoya Jiang, Haiyang Xu, Wei Ye, Qinghao Ye, Chenliang Li, Mingshi Yan, Bin Bi, Shikun Zhang, Fei Huang, and Songfang Huang. 2023b. BUS : Efficient and Effective Vision-language Pre-training with Bottom-Up Patch Summarization. 2023 IEEE/CVF International Conference on Computer Vision (ICCV) (2023), 2888–2898. https://api.semanticscholar.org/CorpusID:259937725
Jiang et al. (2023c) Chaoya Jiang, Wei Ye, Haiyang Xu, Miang yan, Shikun Zhang, Jie Zhang, and Fei Huang. 2023c. Vision Language Pre-training by Contrastive Learning with Cross-Modal Similarity Regulation. In Annual Meeting of the Association for Computational Linguistics. https://api.semanticscholar.org/CorpusID:258557659
Jiang et al. (2023d) Chaoya Jiang, Wei Ye, Haiyang Xu, Qinghao Ye, Mingshi Yan, Ji Zhang, and Shikun Zhang. 2023d. TiMix: Text-aware Image Mixing for Effective Vision-Language Pre-training. ArXiv abs/2312.08846 (2023). https://api.semanticscholar.org/CorpusID:266209702
Kamath et al. (2021) Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, and Nicolas Carion. 2021. MDETR-modulated detection for end-to-end multi-modal understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1780–1790.
Karpathy and Fei-Fei (2015) Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition. 3128–3137.
Kim et al. (2021) Wonjae Kim, Bokyung Son, and Ildoo Kim. 2021. ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision. In ICML.
Kitaev et al. (2020) Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. 2020. Reformer: The Efficient Transformer. ArXiv abs/2001.04451 (2020).
Krishna et al. (2016) Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. 2016. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. International Journal of Computer Vision 123 (2016), 32–73.
Li et al. (2022b) Chenliang Li, Haiyang Xu, Junfeng Tian, Wei Wang, Ming Yan, Bin Bi, Jiabo Ye, Hehong Chen, Guohai Xu, Zheng Cao, Ji Zhang, Songfang Huang, Fei Huang, Jingren Zhou, and Luo Si. 2022b. mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections.
Li et al. (2022a) Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022a. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. arXiv preprint arXiv:2201.12086 (2022).
Li et al. (2021) Junnan Li, Ramprasaath R. Selvaraju, Akhilesh Deepak Gotmare, Shafiq R. Joty, Caiming Xiong, and Steven C. H. Hoi. 2021. Align before Fuse: Vision and Language Representation Learning with Momentum Distillation. In NeurIPS.
Li et al. (2019) Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2019. VisualBERT: A Simple and Performant Baseline for Vision and Language. ArXiv abs/1908.03557 (2019).
Li et al. (2020) Xiujun Li, Xi Yin, Chunyuan Li, Xiaowei Hu, Pengchuan Zhang, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, and Jianfeng Gao. 2020. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks. In ECCV.
Liang et al. (2022) Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, and Pengtao Xie. 2022. Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations. ArXiv abs/2202.07800 (2022).
Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. In ECCV.
Liu et al. (2023) Siyi Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chun yue Li, Jianwei Yang, Hang Su, Jun-Juan Zhu, and Lei Zhang. 2023. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. ArXiv abs/2303.05499 (2023).
Liu et al. (2021) Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. 2021 IEEE/CVF International Conference on Computer Vision (ICCV) (2021), 9992–10002.
Loshchilov and Hutter (2019) Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. In ICLR.
Lu et al. (2019) Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. NeurIPS (2019).
Ordonez et al. (2011) Vicente Ordonez, Girish Kulkarni, and Tamara L. Berg. 2011. Im2Text: Describing Images Using 1 Million Captioned Photographs. In NIPS.
Plummer et al. (2015) Bryan A. Plummer, Liwei Wang, Christopher M. Cervantes, Juan C. Caicedo, J. Hockenmaier, and Svetlana Lazebnik. 2015. Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models. International Journal of Computer Vision 123 (2015), 74–93.
Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. In ICML.
Rao et al. (2021) Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. 2021. DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification. In NeurIPS.
Redmon et al. (2016) Joseph Redmon, Santosh Kumar Divvala, Ross B. Girshick, and Ali Farhadi. 2016. You Only Look Once: Unified, Real-Time Object Detection. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016), 779–788.
Ren et al. (2015) Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (2015), 1137–1149.
Rennie et al. (2017) Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, and Vaibhava Goel. 2017. Self-Critical Sequence Training for Image Captioning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1179–1195. https://doi.org/10.1109/CVPR.2017.131
Ryoo et al. (2021) Michael S. Ryoo, A. J. Piergiovanni, Anurag Arnab, Mostafa Dehghani, and Anelia Angelova. 2021. TokenLearner: Adaptive Space-Time Tokenization for Videos. In NeurIPS.
Sharma et al. (2018) Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning. In Annual Meeting of the Association for Computational Linguistics.
Singh et al. (2021) Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela. 2021. FLAVA: A Foundational Language And Vision Alignment Model. ArXiv abs/2112.04482 (2021).
Su et al. (2020) Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. 2020. VL-BERT: Pre-training of Generic Visual-Linguistic Representations. ArXiv abs/1908.08530 (2020).
Tan and Bansal (2019) Hao Hao Tan and Mohit Bansal. 2019. LXMERT: Learning Cross-Modality Encoder Representations from Transformers. arXiv preprint arXiv:1908.07490 (2019).
Vaswani et al. (2017) Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. ArXiv abs/1706.03762 (2017).
Wang et al. (2020a) Jianfeng Wang, Xiaowei Hu, Pengchuan Zhang, Xiujun Li, Lijuan Wang, L. Zhang, Jianfeng Gao, and Zicheng Liu. 2020a. MiniVLM: A Smaller and Faster Vision-Language Model. ArXiv abs/2012.06946 (2020).
Wang et al. (2020b) Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. 2020b. Linformer: Self-Attention with Linear Complexity. ArXiv abs/2006.04768 (2020).
Wang et al. (2021a) Wenhui Wang, Hangbo Bao, Li Dong, and Furu Wei. 2021a. VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts. ArXiv abs/2111.02358 (2021).
Wang et al. (2021b) Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. 2021b. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. 2021 IEEE/CVF International Conference on Computer Vision (ICCV) (2021), 548–558.
Wang et al. (2021c) Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. 2021c. SimVLM: Simple Visual Language Model Pretraining with Weak Supervision. ArXiv abs/2108.10904 (2021).
Xu et al. (2021) Haiyang Xu, Ming Yan, Chenliang Li, Bin Bi, Songfang Huang, Wenming Xiao, and Fei Huang. 2021. E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning. ArXiv abs/2106.01804 (2021).
Yang et al. (2021) Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Faisal Ahmed, Zicheng Liu, Yumao Lu, and Lijuan Wang. 2021. Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling. CoRR abs/2111.12085 (2021). arXiv:2111.12085 https://arxiv.org/abs/2111.12085
Yu et al. (2021) Fei Yu, Jiji Tang, Weichong Yin, Yu Sun, Hao Tian, Hua Wu, and Haifeng Wang. 2021. ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph. In AAAI.
Yu et al. (2016) Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. 2016. Modeling context in referring expressions. In European Conference on Computer Vision. Springer, 69–85.
Zeng et al. (2021) Yan Zeng, Xinsong Zhang, and Hang Li. 2021. Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts. ArXiv abs/2111.08276 (2021).
Zhang et al. (2021) P. Zhang, X. Li, X. Hu, J. Yang, L. Zhang, L. Wang, Y. Choi, and J. Gao. 2021. VinVL: Making Visual Representations Matter in Vision-Language Models. (2021).
Zhou et al. (2020) Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason J. Corso, and Jianfeng Gao. 2020. Unified Vision-Language Pre-Training for Image Captioning and VQA. ArXiv abs/1909.11059 (2020).

Appendix A Datasets

Following previous work (Li et al., 2021), we use the same pre-training dataset of 4M images with texts, which includes two in-domain datasets (MS COCO (Lin et al., 2014) and Visual Genome (Krishna et al., 2016)) and two out-of-domain web datasets (Conceptual Captions (Sharma et al., 2018) and SBU Captions (Ordonez et al., 2011)).

	COCO	VG	SBU	CC3M
image	113K	100K	860K	3M
text	567K	769K	860K	3M

Table 7. Statistics of the pre-training datasets.

	image	Captions	Objects	Regions
COCO	0.11M	0.55M	0.45M	-
VG	0.10M	-	2.0M	3.7M

Table 8. Statistics of objects/regions annotations used in the pre-training.

Table 7 shows the statistics of the 4M images with texts used in the pre-training stage. In addition, as shown in Table 8, we also use the object/region annotations from the COCO (Lin et al., 2014) and VG (Krishna et al., 2016) datasets, and we give statistics of the object and region annotations of each dataset. Note that we use the object/region annotations provided by Zeng et al. (2021) and therefore follow their setting, which filters out some samples because of: 1) invalid annotations (e.g., negative values for bounding boxes or boxes lying outside of the images); 2) boxes being too small (< 1%); 3) highly overlapped textual descriptions of regions (> 75%), etc. After pre-processing, we keep 446,873 COCO objects (from 859,999), 2,043,927 VG objects (from 3,802,349), and 3,699,598 VG regions (from 5,402,953).

Appendix B Pre-training Objectives

We pre-train our model with five standard objectives: Patch-Text Alignment (PTA), Image-Text Contrastive learning (ITC), Image-Text Matching (ITM), Masked Language Modeling (MLM), and Prefix Language Modeling (PrefixLM). These pre-training tasks are optimized jointly. As Patch-Text Alignment has already been described, in this section we only introduce the other four pre-training tasks.

Image-Text Contrastive (ITC) For COPA, we follow (Li et al., 2021) and apply ITC to align the image representation and the text representation from the unimodal encoders. For the image, the image feature corresponding to the image [CLS] token is chosen as the image representation. For the text, the text token feature corresponding to the text [CLS] token is the text representation.
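For intuition, the sketch below computes a symmetric contrastive loss over a small batch of [CLS] features. It is a simplified in-batch version written for illustration: the temperature value of 0.07 and the use of in-batch negatives only are assumptions made here, and the formulation of (Li et al., 2021) additionally relies on a momentum encoder and a feature queue.

```python
import math

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def _cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def itc_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric image-text contrastive loss for a batch of matched pairs.

    `temperature` is a hypothetical default, not a value from the paper.
    """
    sim = [[_dot(i, t) / temperature for t in text_feats] for i in image_feats]
    i2t = _cross_entropy_diagonal(sim)
    t2i = _cross_entropy_diagonal([list(col) for col in zip(*sim)])
    return 0.5 * (i2t + t2i)

aligned = itc_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = itc_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned, shuffled)
```

The loss is close to zero when each image feature matches its own text feature and becomes large when the pairs are shuffled.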

Image-Text Matching (ITM) The goal of image-text matching is to predict whether the input image and text are matched. We follow the design of (Li et al., 2021) and select hard negative image-text pairs based on the contrastive text-image similarity. We take the text [CLS] embedding of the multimodal encoder’s output as the joint representation, followed by a Multi-Layer Perceptron (MLP) layer for prediction.

Masked Language Modeling (MLM) The task setup is basically the same as in BERT (Devlin et al., 2019): we randomly mask 15% of the tokens in the text, and the model is asked to predict the masked words from the cross-modal representations.
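The masking step can be sketched as follows. This is a minimal illustration that only replaces selected tokens with a mask symbol; it omits details of the BERT recipe, such as occasionally keeping or randomizing a selected token, and does not treat special tokens separately. The function name and the example sentence are hypothetical.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=None):
    """Replace roughly `mask_prob` of the tokens with `mask_token`.

    Returns the masked sequence and the positions that were masked, which are
    the positions the model is asked to predict.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_prob))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    for pos in positions:
        masked[pos] = mask_token
    return masked, positions

tokens = "a girl is eating breakfast at a wooden table near the window today".split()
masked, positions = mask_tokens(tokens, seed=0)
print(masked, positions)
```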

Prefix Language Modeling (PrefixLM) This task aims to generate the caption given an image and to predict the text segment subsequent to the cross-modal context, as in (Bi et al., 2020). It optimizes a cross-entropy loss by maximizing the likelihood of the text in an autoregressive manner.

Appendix C More Experimental Results

C.1. Generalization of Text-aware Patch Detector

The Text-aware Patch Detector (TPD) is pre-trained based on the Patch-Text Alignment task, utilizing object/region annotations from the COCO (Lin et al., 2014) and VG (Krishna et al., 2016) datasets as supervised signals. Our subsequent experiments demonstrate that using these two relatively small-scale datasets alone is sufficient to train a robust and generalizable TPD without requiring additional object/region annotations while maintaining the scalability of web-based pre-training data.

In detail, we sample 10K image-text pairs from CC (Sharma et al., 2018) dataset, which is crawled from the web and potentially contains out-of-domain data. We then employ the state-of-the-art open-set object detection model, Grounding DINO (Liu et al., 2023), to detect regions in each image corresponding to the text. Following the approach described in subsection 3.3, we convert these regions into patch-level labels. We subsequently evaluate the accuracy and recall of TPD in detecting patches on this test dataset.
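The conversion from a detected region to patch-level labels can be sketched as below. The rule used here, labelling a patch as positive whenever it overlaps the box at all, is an assumption made for illustration; subsection 3.3 defines the rule actually used. The image size, patch size, and function name are likewise illustrative.

```python
def box_to_patch_labels(box, image_size=224, patch_size=16):
    """Label each patch 1 if it overlaps the box (x0, y0, x1, y1), else 0.

    Labels are returned row by row as a flat list of length
    (image_size // patch_size) ** 2. The any-overlap rule is an assumption.
    """
    x0, y0, x1, y1 = box
    per_side = image_size // patch_size
    labels = []
    for row in range(per_side):
        for col in range(per_side):
            px0, py0 = col * patch_size, row * patch_size
            px1, py1 = px0 + patch_size, py0 + patch_size
            overlaps = px0 < x1 and x0 < px1 and py0 < y1 and y0 < py1
            labels.append(1 if overlaps else 0)
    return labels

labels = box_to_patch_labels((20, 20, 60, 40))
print(len(labels), sum(labels))  # 196 patches, 6 of them positive
```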

As depicted in Figure 5, the accuracy and recall of TPD’s patch predictions gradually improve with each training epoch. By the 30th epoch, TPD’s accuracy reaches 0.87, and its recall approaches 0.74. These results suggest that TPD is effective and robust in detecting text-related image patches, regardless of whether the image-text pairs originate from manual annotations or are crawled from the internet. This showcases TPD’s remarkable generalizability and effectiveness, allowing us to confidently scale up the size of the web-based pre-training dataset without compromising TPD’s ability to handle out-of-domain data from the web.
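The two metrics can be computed from binary patch labels as sketched below. The paper does not spell out the exact definitions, so this sketch assumes the standard ones: accuracy is the fraction of patches labelled correctly, and recall is the fraction of text-relevant patches that the detector recovers.

```python
def patch_accuracy_recall(predicted, target):
    """Accuracy and recall for binary patch labels (1 = text-relevant)."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    correct = sum(p == t for p, t in zip(predicted, target))
    true_pos = sum(p == 1 and t == 1 for p, t in zip(predicted, target))
    positives = sum(target)
    accuracy = correct / len(target)
    recall = true_pos / positives if positives else 0.0
    return accuracy, recall

pred = [1, 0, 1, 0, 0, 1]
gold = [1, 1, 1, 0, 0, 0]
acc, rec = patch_accuracy_recall(pred, gold)
print(acc, rec)
```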

Location	Keeping ratio	VQA	FLOPs (G)	Throughput
[4]	10%	76.48	10.15	514.26
[4]	30%	77.02	14.84	443.48
[4]	50%	77.33	16.62	363.65
[4]	70%	77.52	23.32	337.45
[6]	10%	76.89	16.51	418.95
[6]	30%	77.53	17.84	361.05
[6]	50%	77.84	19.84	349.71
[6]	70%	77.89	26.77	306.11
[8]	10%	76.93	22.45	275.13
[8]	30%	77.62	23.36	310.56
[8]	50%	77.89	25.66	273.06
[8]	70%	77.93	30.22	221.88
-	100%	77.98	36.63	186.42

Table 9. Results of pre-training and fine-tuning COPA with different detection locations and keeping ratios. We report the VQA test-dev score, FLOPs, and throughput. The throughput (image-text/s) is measured on an NVIDIA V100 GPU using the largest possible batch size for our model.

model	VQA	FLOPs(G)	Throughput
COPA	77.84	19.84	349.71
-w/o PTA	77.12	19.43	352.56
-w/o TPD	77.98	36.63	186.42
mPLUG	77.55	36.63	186.42

Table 10. The result of ablations. We finetune COPA on VQA and report test-dev results. The setting for calculating FLOPs and throughput is the same as Table 4. For -w/o PTA, we keep the same setting with COPA (detection location = 6 and keeping ratio = 50%).

Appendix D Pre-training Details

We use the AdamW (Loshchilov and Hutter, 2019) optimizer with a weight decay of 0.02. The learning rate is warmed up to 1e-5 (ViT-B/16) and 1e-4 (BERT ${}_{base}$ ) in the first 1000 iterations and then decayed to 1e-6 following a cosine schedule. COPA is pre-trained for about 20 epochs on 8 A100-80G GPUs on the 4M pre-training dataset, which takes 41 hours. During pre-training, the batch size on a single GPU is 512, and the overall batch size is 1024.
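The learning-rate schedule described above can be sketched as a single function. The warm-up shape is assumed to be linear, and the total number of steps is a hypothetical value chosen for the example; only the peak rate, the final rate, and the 1000 warm-up iterations come from the text.

```python
import math

def learning_rate(step, total_steps, peak_lr=1e-5, final_lr=1e-6, warmup_steps=1000):
    """Warm up to `peak_lr` (linearly, by assumption), then cosine-decay to `final_lr`."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))

TOTAL = 80_000  # hypothetical step count, not taken from the paper
for s in (0, 999, 1000, 40_500, 80_000):
    print(s, f"{learning_rate(s, TOTAL):.2e}")
```

Passing `peak_lr=1e-4` gives the corresponding schedule for the text encoder.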

During pre-training, we take random image crops of resolution 256 $\times$ 256 as input and also apply RandAugment (Cubuk et al., 2020) to improve the generalization of vision encoders. For VQA and image captioning tasks, we increase the image resolution during finetuning. For image-text contrastive learning, the queue size is set as 65,536, and the momentum coefficient is set as 0.995.
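The two ITC hyperparameters can be illustrated with a short sketch. It assumes, following the usual momentum-contrast design, that the momentum coefficient drives an exponential moving average of encoder parameters and that the queue is a fixed-size FIFO buffer of features; parameters are shown as plain lists of floats for simplicity, and the function names are hypothetical.

```python
from collections import deque

MOMENTUM = 0.995      # momentum coefficient from the text
QUEUE_SIZE = 65_536   # queue size from the text

def momentum_update(momentum_params, online_params, m=MOMENTUM):
    """Exponential moving average: p_m <- m * p_m + (1 - m) * p."""
    return [m * pm + (1.0 - m) * p for pm, p in zip(momentum_params, online_params)]

# A bounded FIFO queue: once full, appending drops the oldest features.
feature_queue = deque(maxlen=QUEUE_SIZE)

def enqueue(batch_features):
    """Add the newest batch of features to the queue."""
    feature_queue.extend(batch_features)

updated = momentum_update([1.0, 0.0], [0.0, 1.0])
print(updated)  # approximately [0.995, 0.005]
```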

Appendix E Downstream Task Details

We evaluate COPA on the four downstream vision-language tasks. The hyperparameters that we use for finetuning on the downstream tasks are listed in Table 11. Following (Li et al., 2021), all tasks adopt RandAugment, AdamW optimizer with a weight decay of 0.05 and a cosine learning rate schedule. Next, we introduce the dataset settings in detail.

VQA.

The VQA task (Agrawal et al., 2015) requires the model to answer natural language questions given an image. Most methods (Tan and Bansal, 2019; Wang et al., 2021a; Li et al., 2020; Wang et al., 2021c) deal with visual question-answering tasks as multi-label classification on pre-defined answer sets. This strategy achieves strong performance, but it is not suitable for real-world open scenarios. We conduct an experiment on the VQA2.0 dataset (Goyal et al., 2017), which contains 83k/41k/81k images for training/validation/test. Following (Li et al., 2021), we use both training and validation splits for training, and incorporate additional training data from Visual Genome (Krishna et al., 2016). Besides, we concatenate the question with the object labels and OCR tokens extracted from the image.

Task	LR (ViT-L/BERT ${}_{base}$ )	batch size	epochs
VQA	2e-5/5e-6	1024	8
Captioning $\dagger$	1e-5&8e-7	256	5
Retrieval	1e-5/2e-6	256	5
Visual Grounding	2e-5/2e-6	512	120

Table 11. Finetuning hyperparameters for downstream tasks. $\dagger$ denotes two-stage fine-tuning.

Image Captioning.

The image captioning task requires a model to generate an appropriate and fluent caption for a given image. We evaluate image captioning on two datasets: COCO Caption (Lin et al., 2014) and NoCaps (Agrawal et al., 2018). COPA is finetuned on the COCO Caption training data and tested on both the Karpathy split used in prior work (Li et al., 2020; Wang et al., 2021c) and the NoCaps validation set. Following Li et al. (2020), we first fine-tune COPA with the cross-entropy loss for 5 epochs with a learning rate of 1e-5 and a batch size of 256. Starting from this fine-tuned model, we then apply CIDEr optimization (Rennie et al., 2017) for 5 more epochs with a smaller learning rate of 8e-7. We select the best checkpoint on COCO Caption and use it to predict on the NoCaps validation set directly. During inference, we use beam search with a beam size of 10 and set the maximum generation length to 20.

Image-Text Retrieval.

We conduct experiments on both image-to-text retrieval (TR) and text-to-image retrieval (IR) on the COCO (Lin et al., 2014) and Flickr30K (Plummer et al., 2015) datasets, adopting the widely used Karpathy split (Karpathy and Fei-Fei, 2015) for both. COCO contains 113k/5k/5k images for train/validation/test, and Flickr30K contains 29k/1k/1k images for train/validation/test. Following Li et al. (2021, 2022a), we jointly optimize the ITC loss and the ITM loss during fine-tuning. During inference, we first select the top-k candidates by computing the dot-product similarity between the image and text encoder features. For efficiency in this coarse-grained ranking stage, the image encoder feature is extracted without the TPD: we instead select patches in a transformer layer directly from the self-attention weights of the image [CLS] token to the other patch tokens. We then rerank the selected candidates by their ITM scores. In this fine-grained reranking stage, we re-extract the image encoder features of each image with the TPD, once per text candidate, so that patch selection is guided by the candidate text. We set $k=256$ for COCO and $k=128$ for Flickr30K.
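The two-stage inference procedure can be illustrated with a minimal sketch. It is not the released implementation: `retrieve` and `itm_score` are hypothetical names, `itm_score` stands in for the fusion-encoder ITM head (including any text-guided re-extraction of image features), and plain Python lists stand in for feature tensors.

```python
def dot(u, v):
    """Dot-product similarity between two feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve(query_feat, cand_feats, itm_score, k):
    """Return candidate indices: top-k by dot product, reranked by ITM score.

    itm_score is a hypothetical callable mapping a candidate index to the
    ITM score of the (query, candidate) pair.
    """
    # Stage 1 (coarse): rank all candidates by cheap unimodal similarity.
    order = sorted(
        range(len(cand_feats)),
        key=lambda i: dot(query_feat, cand_feats[i]),
        reverse=True,
    )
    top_k = order[:k]
    # Stage 2 (fine): rerank only the k survivors with the expensive ITM score.
    return sorted(top_k, key=itm_score, reverse=True)


# Toy example: candidate 1 has the highest ITM score but is filtered out
# in stage 1, so it never reaches the reranking stage.
query = [1.0, 0.0]
cands = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]
itm = {0: 0.2, 1: 0.99, 2: 0.7, 3: 0.1}
ranked = retrieve(query, cands, itm_score=lambda i: itm[i], k=2)
```

The design trades a small risk of discarding a good match in stage 1 for a large saving in computation, since the ITM score is evaluated for only k pairs per query rather than for every candidate.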

Visual Grounding.

The task of visual grounding involves localizing the referred object in an image given a plain text query. Instead of directly regressing bounding boxes, our approach concatenates visual features with textual features, which are then fed into the multi-modal decoder to predict the object's coordinates. We evaluate our method on the referring expression grounding dataset RefCOCO+ (Yu et al., 2016), which contains 19K images and 141K queries.