Benchmarks

[Figure] The overview of our model SCALE.

[Figure] The detail of our proposed Inter-Modality Contrastive Learning module.
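
For intuition, below is a minimal sketch of an inter-modality contrastive objective in the InfoNCE style, which is what such a module typically optimizes; the function name, temperature value, and symmetric formulation are illustrative assumptions, not SCALE's exact implementation.

```python
import torch
import torch.nn.functional as F

def inter_modality_contrastive_loss(feat_a, feat_b, temperature=0.07):
    """InfoNCE-style loss between two modalities of the same samples.

    feat_a, feat_b: (batch, dim) embeddings; row i of each tensor comes
    from the same sample, so the diagonal pairs are the positives.
    (Illustrative sketch, not the paper's exact module.)
    """
    # L2-normalize so dot products are cosine similarities.
    feat_a = F.normalize(feat_a, dim=-1)
    feat_b = F.normalize(feat_b, dim=-1)

    # (batch, batch) similarity matrix scaled by temperature.
    logits = feat_a @ feat_b.t() / temperature
    targets = torch.arange(feat_a.size(0), device=feat_a.device)

    # Symmetric cross-entropy: a->b and b->a retrieval directions.
    loss_ab = F.cross_entropy(logits, targets)
    loss_ba = F.cross_entropy(logits.t(), targets)
    return (loss_ab + loss_ba) / 2
```

With five modalities, a term of this form could be applied over each pair of modalities.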

Performance gains from sequentially adding more modalities with SCALE on the subset (top) and the whole dataset (bottom). Retrieval results are reported as pretrain / finetune, i.e., computed on features extracted after the pretraining and fine-tuning stages, respectively.

| Model | Accuracy | mAP@1 | mAP@5 | mAP@10 | Prec@1 | Prec@5 | Prec@10 |
|---|---|---|---|---|---|---|---|
| **Subset** | | | | | | | |
| Text | 77.42 | 47.70 / 65.10 | 53.63 / 68.39 | 51.59 / 66.99 | 47.70 / 65.10 | 30.96 / 44.89 | 24.15 / 33.44 |
| +Image | 79.58 | 51.47 / 67.02 | 56.16 / 69.85 | 54.41 / 68.43 | 51.47 / 67.02 | 33.41 / 46.29 | 25.55 / 34.29 |
| +Table | 82.83 | 57.14 / 67.97 | 61.71 / 70.34 | 59.64 / 69.38 | 57.14 / 67.97 | 38.02 / 46.85 | 28.99 / 34.36 |
| +Video | 84.31 | 58.57 / 69.79 | 63.15 / 72.30 | 61.02 / 70.67 | 58.57 / 69.79 | 39.26 / 47.44 | 29.56 / 34.78 |
| +Audio | 85.50 | 58.72 / 70.62 | 63.17 / 73.02 | 61.05 / 71.50 | 58.72 / 70.62 | 39.66 / 48.20 | 30.32 / 35.35 |
| **Whole dataset** | | | | | | | |
| Text | 81.11 | 55.82 / 69.47 | 60.74 / 72.74 | 59.02 / 71.79 | 55.82 / 69.47 | 36.99 / 48.76 | 28.04 / 35.84 |
| +Image | 83.68 | 59.81 / 71.51 | 64.13 / 74.51 | 62.18 / 73.21 | 59.81 / 71.51 | 38.97 / 49.27 | 30.15 / 36.72 |
| +Table | 84.63 | 61.32 / 72.34 | 65.53 / 74.86 | 63.62 / 73.47 | 61.32 / 72.34 | 40.66 / 49.77 | 30.78 / 36.95 |
| +Video | 84.90 | 62.65 / 72.59 | 65.67 / 75.05 | 63.87 / 73.62 | 62.65 / 72.59 | 41.18 / 49.96 | 31.01 / 37.04 |
| +Audio | 86.57 | 63.56 / 73.77 | 67.51 / 76.17 | 65.39 / 74.73 | 63.56 / 74.01 | 42.68 / 50.78 | 32.17 / 37.42 |
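
For reference, here is a minimal sketch of how Prec@k and a truncated mAP@k can be computed for label-based retrieval, assuming a gallery item counts as relevant when it shares the query's class label; the paper's exact evaluation protocol may differ.

```python
import numpy as np

def retrieval_metrics(sims, query_labels, gallery_labels, k=10):
    """Prec@k and truncated mAP@k for label-based retrieval.

    sims: (num_queries, num_gallery) similarity matrix; a gallery item
    counts as relevant when it shares the query's class label.
    (Illustrative sketch; the paper's protocol may differ.)
    """
    precs, aps = [], []
    for i in range(sims.shape[0]):
        # Top-k gallery indices, most similar first.
        topk = np.argsort(-sims[i])[:k]
        rel = (gallery_labels[topk] == query_labels[i]).astype(float)

        precs.append(rel.mean())  # Prec@k

        if rel.sum() > 0:
            # Precision at each rank, averaged over the relevant hits.
            cum_prec = np.cumsum(rel) / (np.arange(len(rel)) + 1)
            aps.append((cum_prec * rel).sum() / rel.sum())
        else:
            aps.append(0.0)
    return np.mean(precs), np.mean(aps)
```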

The performance of our model SCALE under different modality combinations on the coarse- and fine-grained multi-modal retrieval and classification tasks. Retrieval results are reported as coarse-grained / fine-grained. I, T, Tab, V, and A denote the image, text, table, video, and audio modalities, respectively.

| Modality Combination | Accuracy | mAP@1 | mAP@5 | mAP@10 | Prec@1 | Prec@5 | Prec@10 |
|---|---|---|---|---|---|---|---|
| I+Tab | 62.00 | 44.53 / 45.97 | 49.62 / 51.89 | 48.28 / 50.33 | 44.53 / 45.97 | 30.89 / 34.08 | 23.65 / 28.63 |
| I+V | 34.57 | 20.57 / 36.29 | 26.78 / 42.72 | 26.41 / 41.38 | 20.57 / 36.29 | 14.71 / 26.52 | 11.78 / 22.34 |
| I+A | 27.67 | 15.73 / 35.64 | 20.85 / 42.96 | 20.72 / 41.70 | 15.73 / 35.64 | 11.16 / 27.02 | 9.47 / 22.78 |
| I+T | 79.58 | 67.02 / 62.20 | 69.85 / 66.97 | 68.43 / 64.21 | 67.02 / 62.20 | 46.29 / 49.85 | 34.29 / 42.36 |
| I+T+V | 80.34 | 67.35 / 63.05 | 70.29 / 67.37 | 68.95 / 64.62 | 67.35 / 63.05 | 46.45 / 50.85 | 34.33 / 43.02 |
| I+T+A | 79.73 | 67.19 / 64.21 | 70.15 / 68.25 | 68.64 / 65.35 | 67.19 / 64.21 | 46.33 / 50.42 | 33.32 / 42.93 |
| I+Tab+V | 63.09 | 45.94 / 47.33 | 51.32 / 53.33 | 49.78 / 51.28 | 45.94 / 47.33 | 31.69 / 35.81 | 24.12 / 30.05 |
| I+T+Tab | 82.83 | 67.97 / 68.30 | 70.34 / 72.67 | 69.38 / 70.07 | 67.97 / 68.30 | 46.85 / 57.44 | 34.36 / 50.59 |
| I+T+Tab+V | 84.31 | 69.79 / 68.40 | 72.30 / 72.91 | 70.67 / 70.31 | 69.79 / 68.40 | 47.44 / 57.60 | 34.78 / 51.47 |
| I+Tab+A+V | 63.54 | 47.24 / 48.24 | 52.07 / 53.89 | 50.41 / 51.89 | 47.24 / 48.24 | 32.19 / 36.29 | 24.47 / 30.74 |
| I+T+A+V | 80.36 | 68.80 / 66.43 | 70.84 / 71.12 | 69.71 / 68.16 | 68.80 / 66.43 | 47.24 / 54.03 | 34.57 / 47.53 |
| I+T+Tab+A | 84.33 | 70.23 / 68.97 | 72.59 / 73.07 | 70.94 / 70.77 | 70.23 / 68.97 | 47.58 / 57.89 | 35.33 / 51.60 |
| I+T+Tab+A+V | 85.50 | 70.62 / 69.25 | 73.02 / 74.08 | 71.50 / 71.02 | 70.62 / 69.25 | 48.20 / 58.76 | 35.35 / 52.05 |

Comparison with baseline methods using the image and text modalities on the subset (top) and the whole dataset (bottom).

| Method | mAP@1 | Accuracy | NMI | Purity |
|---|---|---|---|---|
| **Subset** | | | | |
| Image-based | 15.17 | 27.67 | 63.62 | 54.86 |
| BERT | 47.70 | 77.42 | 76.35 | 68.80 |
| VL-BERT | 49.31 | 78.13 | 80.51 | 71.91 |
| ViLBERT | 49.18 | 78.24 | 80.51 | 71.91 |
| VisualBERT | 49.20 | 78.41 | 81.23 | 72.39 |
| CLIP | 49.39 | 78.35 | 81.75 | 72.50 |
| UNITER | 49.87 | 78.54 | 82.71 | 73.58 |
| CAPTURE | 50.30 | 78.69 | 83.06 | 74.14 |
| SCALE (Ours) | 51.47 | 79.58 | 84.23 | 75.81 |
| **Whole dataset** | | | | |
| Image-based | 22.67 | 30.14 | 67.49 | 59.64 |
| BERT | 55.82 | 82.11 | 87.30 | 71.75 |
| CLIP | 57.73 | 82.60 | 90.49 | 76.48 |
| SCALE (Ours) | 59.81 | 83.68 | 92.01 | 78.34 |
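
The NMI and Purity columns above are clustering-quality metrics. For reference, a minimal sketch of how they can be computed over the extracted features, assuming k-means clustering and integer class labels; the exact clustering setup used here may differ.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def clustering_metrics(features, labels, num_classes):
    """NMI and Purity of k-means clusters against ground-truth labels.

    features: (num_samples, dim) array; labels: integer class ids.
    (Illustrative sketch; the evaluated clustering setup may differ.)
    """
    labels = np.asarray(labels)
    preds = KMeans(n_clusters=num_classes, n_init=10).fit_predict(features)

    nmi = normalized_mutual_info_score(labels, preds)

    # Purity: each cluster is credited with its majority class.
    correct = sum(
        np.bincount(labels[preds == c]).max() for c in np.unique(preds)
    )
    return nmi, correct / len(labels)
```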