Alignment efficient image-sentence retrieval considering transferable cross-modal representation learning

doi:10.1007/s11704-023-3186-6

Front. Comput. Sci., 2024, Vol. 18, Issue 1: 181335

Artificial Intelligence

Yang YANG¹,²,³, Jinyi GUO¹, Guangyu LI¹, Lanyu LI⁴, Wenjie LI², Jian YANG¹

¹. School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
². Department of Computing, Hong Kong Polytechnic University, Hong Kong 100872, China
³. State Key Lab. for Novel Software Technology, Nanjing University, Nanjing 210094, China
⁴. 14th Research Institute of China Electronics Technology Group Corporation, Nanjing 210094, China


Abstract

Traditional image-sentence cross-modal retrieval methods usually aim to learn consistent representations of heterogeneous modalities, so that similar instances in one modality can be retrieved given a query from the other. The basic assumption behind these methods is that parallel multi-modal data (i.e., data in which the different modalities of the same example are aligned) is available in advance. In other words, image-sentence cross-modal retrieval is treated as a supervised task with the alignments as ground truth. However, in many real-world applications, it is difficult to align a large amount of parallel data for each new scenario because of the substantial labor cost. The available multi-modal data is therefore non-parallel, and existing methods cannot be applied directly. On the other hand, auxiliary parallel multi-modal data with similar semantics often exists and can help the non-parallel data learn consistent representations. Therefore, in this paper, we study “Alignment Efficient Image-Sentence Retrieval” (AEIR), which uses auxiliary parallel image-sentence data as the source domain and the non-parallel data as the target domain. Unlike single-modal transfer learning, AEIR learns consistent image-sentence cross-modal representations for the target domain by transferring the alignments of existing parallel data. Specifically, AEIR learns consistent image-sentence representations in the source domain from parallel data, while transferring the alignment knowledge across domains by jointly optimizing a newly designed cross-domain cross-modal metric-learning-based constraint and an intra-modal domain adversarial loss. Consequently, consistent representations for the target domain can be learned effectively through both structure and semantic transfer. Extensive experiments on different transfer scenarios validate that AEIR achieves better retrieval results than the baselines.

Keywords: image-sentence retrieval, transfer learning, semantic transfer, structure transfer

Corresponding Author(s): Guangyu LI, Lanyu LI

Issue Date: 15 November 2023

Cite this article:

Yang YANG, Jinyi GUO, Guangyu LI, et al. Alignment efficient image-sentence retrieval considering transferable cross-modal representation learning[J]. Front. Comput. Sci., 2024, 18(1): 181335.

URL:

https://academic.hep.com.cn/fcs/EN/10.1007/s11704-023-3186-6
https://academic.hep.com.cn/fcs/EN/Y2024/V18/I1/181335

Fig.1 Conceptual differences between supervised image-sentence retrieval and transferable image-sentence retrieval. (a) Supervised image-sentence retrieval uses the alignments between modalities as supervision; (b) transferable image-sentence retrieval uses parallel source domain data and non-parallel target domain data

Fig.2 Illustration of the proposed AEIR for transferable image-sentence retrieval. In detail, AEIR learns the consistent representations for target domain from three aspects: 1) Consistent representation learning of source domain, which utilizes the matching loss for supervised learning in source domain; 2) Semantic transfer, which aligns source and target representations for each modality with a domain adversarial loss; and 3) Structure transfer, which further generalizes the consistent representations to target domain using a cross-domain cross-modal metric based constraint. Consequently, AEIR can learn the consistent representations for the target domain without alignments
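The three components named in the Fig.2 caption can be read as three terms of one training objective. The following is a minimal sketch in plain Python, not the authors' implementation. It assumes a hinge-based matching loss with hardest in-batch negatives, a binary cross-entropy domain discriminator for the adversarial term, and a weighted sum of the terms. The function names, the weights `lam` and `mu`, and the way they are assigned to the terms are hypothetical; the structure-transfer constraint is not defined on this page, so it is passed in as a precomputed number.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def matching_loss(img, txt, margin=0.2):
    """Assumed source-domain matching loss: img[i] and txt[i] are an aligned pair.

    For every pair, the hardest negative sentence and the hardest negative
    image in the batch are pushed at least `margin` below the positive pair.
    """
    n = len(img)
    sim = [[dot(img[i], txt[j]) for j in range(n)] for i in range(n)]
    total = 0.0
    for i in range(n):
        pos = sim[i][i]
        neg_txt = [sim[i][j] for j in range(n) if j != i]
        neg_img = [sim[j][i] for j in range(n) if j != i]
        if neg_txt:  # a batch of one has no negatives
            total += max(0.0, margin + max(neg_txt) - pos)
            total += max(0.0, margin + max(neg_img) - pos)
    return total / n

def discriminator_loss(p_src, p_tgt, eps=1e-12):
    """Cross-entropy of a domain discriminator for one modality.

    p_src / p_tgt hold the predicted probability of "source domain" for
    source / target samples. The discriminator minimizes this value.
    """
    src = -sum(math.log(p + eps) for p in p_src) / len(p_src)
    tgt = -sum(math.log(1.0 - p + eps) for p in p_tgt) / len(p_tgt)
    return src + tgt

def confusion_loss(p_tgt, eps=1e-12):
    """Encoder-side adversarial term (assumption: inverted-label form).

    The encoder is rewarded when target samples look like source samples.
    """
    return -sum(math.log(p + eps) for p in p_tgt) / len(p_tgt)

def total_objective(l_match, l_semantic, l_structure, lam=1.0, mu=1.0):
    """Assumed weighted sum of matching, semantic and structure terms."""
    return l_match + lam * l_semantic + mu * l_structure

# Toy batch: two aligned pairs with orthogonal unit embeddings.
images = [[1.0, 0.0], [0.0, 1.0]]
sentences = [[1.0, 0.0], [0.0, 1.0]]
loss = total_objective(
    matching_loss(images, sentences),
    confusion_loss([0.5, 0.5]),
    l_structure=0.0,
    lam=0.1,
    mu=0.1,
)
```

In a real training loop the discriminator and the encoders would be updated in alternation (or through a gradient reversal layer), once per modality; the sketch only shows how the scalar terms could be computed and combined.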

Methods	FLICKR30K-to-MSCOCO (1K)						FLICKR30K-to-MSCOCO (5K)						MSCOCO-to-FLICKR30K
	Image2Text			Text2Image			Image2Text			Text2Image			Image2Text			Text2Image
	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10
CCA	13.1	34.4	47.5	9.3	29.8	43.1	3.6	12.8	19.9	2.8	10.4	17.3	6.6	18.4	25.6	5.6	16.1	22.5
UVCL	37.9	71.5	82.7	27.0	58.9	71.4	16.7	39.3	52.0	11.3	29.2	40.4	18.6	41.0	51.9	11.9	29.2	38.0
DCCA	14.0	35.5	48.3	9.4	30.1	44.5	3.7	12.9	20.1	2.8	11.0	18.4	6.8	19.4	29.1	6.5	20.9	30.9
VSE0	15.5	35.4	47.1	10.6	30.1	42.2	5.2	15.2	22.5	3.5	11.5	18.2	20.6	39.8	54.1	14.9	36.1	47.1
UGACH	13.9	32.0	43.5	9.1	27.1	39.1	4.4	13.2	20.3	3.1	10.4	16.3	14.8	35.1	45.5	11.0	29.5	39.7
VSEPP	17.1	38.0	50.7	12.5	33.9	47.6	7.2	19.7	27.5	4.9	14.6	22.2	24.4	48.9	60.1	16.5	38.2	48.9
SCAN	30.5	58.6	70.8	23.7	52.0	66.2	16.5	35.0	45.7	10.6	26.4	36.2	48.4	75.2	83.9	35.4	62.2	72.3
IMRAM	35.3	62.0	74.4	26.2	53.9	66.9	18.5	40.4	51.9	12.8	29.9	40.0	53.1	80.2	87.9	41.1	66.3	75.2
SGRAF	38.6	66.8	76.7	29.6	56.6	68.2	20.2	42.5	54.4	15.1	33.5	44.3	57.7	82.8	89.4	41.9	68.2	77.4
RACG	39.3	67.2	76.6	29.8	57.0	68.1	20.8	42.7	54.1	15.5	33.5	44.2	58.1	82.5	89.2	41.1	68.5	77.0
A3VSE	27.2	49.5	58.1	15.1	41.5	49.3	12.3	31.2	40.2	3.1	19.3	26.4	34.1	55.4	67.1	28.4	43.4	56.9
DMTL	13.3	34.7	44.2	8.8	27.0	39.5	4.8	14.8	20.9	2.6	9.1	14.9	N/A	N/A	N/A	N/A	N/A	N/A
CAPQ	12.8	30.6	41.5	9.5	27.2	37.9	2.1	3.2	4.3	1.0	1.4	3.2	21.2	46.7	59.0	15.8	37.9	49.4
MME	40.2	66.7	77.7	31.0	59.3	71.0	21.1	44.2	55.6	17.1	36.5	47.1	59.7	84.2	90.7	44.4	70.8	79.4
DMTL-A	41.2	69.4	76.8	29.9	57.6	69.0	21.9	45.9	57.0	14.4	34.4	45.5	43.8	70.8	79.5	31.7	58.8	69.3
CDCMR	35.5	61.5	75.4	25.9	52.5	66.0	11.6	27.4	37.0	7.8	20.3	28.5	48.2	73.7	82.6	37.8	63.1	72.9
AEIR	43.1	69.7	79.6	33.0	61.9	72.5	25.1	47.8	58.5	18.3	37.8	48.8	61.8	86.4	91.7	45.2	71.2	79.7

Tab.1 Performance comparison with baselines and traditional methods. Evaluation criteria are R@K. The best results are highlighted in bold
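R@K (recall at K) is the percentage of queries for which at least one ground-truth match appears among the top K retrieved items. The following is a minimal sketch in plain Python, assuming a query-by-candidate similarity matrix and one set of correct candidate indices per query (in Image2Text an image can have several correct sentences). The names `recall_at_k`, `sim` and `gt` are illustrative and not taken from the paper.

```python
def recall_at_k(sim, gt, k):
    """Percentage of queries with a correct candidate in the top k.

    sim[q][j] is the similarity between query q and candidate j.
    gt[q] is the set of candidate indices that are correct for query q.
    """
    hits = 0
    for q, scores in enumerate(sim):
        # Rank candidates from most to least similar.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if any(j in gt[q] for j in ranked[:k]):
            hits += 1
    return 100.0 * hits / len(sim)

# Three queries, three candidates, one correct candidate each.
sim = [
    [0.9, 0.1, 0.3],  # correct candidate 0 is ranked first
    [0.2, 0.4, 0.8],  # correct candidate 1 is ranked second
    [0.5, 0.6, 0.1],  # correct candidate 2 is ranked third
]
gt = [{0}, {1}, {2}]

r1 = recall_at_k(sim, gt, 1)
r2 = recall_at_k(sim, gt, 2)
r3 = recall_at_k(sim, gt, 3)
```

In this toy case one, two and three of the three queries are answered within the top 1, 2 and 3 results, so the values rise from about 33.3 to about 66.7 to 100.0.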

Fig.3 Retrieval performance with different sizes of source parallel data. (a) FLICKR30K-to-MSCOCO(1K); (b) FLICKR30K-to-MSCOCO(5K); (c) MSCOCO-to-FLICKR30K

Methods	FLICKR30K-to-MSCOCO (1K)						FLICKR30K-to-MSCOCO (5K)						MSCOCO-to-FLICKR30K
	Image2Text			Text2Image			Image2Text			Text2Image			Image2Text			Text2Image
	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10
w/o $L_{DT}$	42.1	68.8	79.0	31.7	60.2	72.4	22.3	45.0	55.6	17.2	36.5	47.2	59.7	85.5	91.0	44.2	71.2	79.2
w/o $L_{ST}$	42.8	69.3	79.1	32.5	61.5	72.3	23.0	46.2	57.4	17.9	37.6	48.4	60.8	84.5	91.5	45.1	71.2	79.7
w/o $L^{v}_{ST}, L^{v}_{DT}$	42.0	69.4	79.5	32.6	60.9	72.2	23.4	46.3	56.4	17.6	37.1	47.8	61.2	84.1	90.1	45.0	70.7	79.1
w/o $L^{w}_{ST}, L^{w}_{DT}$	43.0	68.4	79.1	32.9	61.4	72.2	22.7	45.7	57.0	18.3	37.6	48.4	60.6	84.2	90.6	44.8	71.0	79.3
w/o Pre	41.4	68.2	78.7	32.5	61.5	72.2	23.5	46.1	56.6	17.5	37.5	48.4	59.2	83.2	90.0	43.4	69.6	78.5
AEIR+GR	42.3	67.3	78.0	32.1	60.4	72.2	24.1	46.0	57.2	17.9	37.4	48.3	61.5	85.8	91.1	44.5	70.9	79.6
AEIR+JS	42.6	69.7	79.5	31.8	59.7	70.5	24.4	46.6	57.3	17.2	36.4	46.7	60.6	84.8	90.8	44.6	71.1	79.6
AEIR+GS	29.0	56.1	66.9	21.6	46.9	60.9	14.9	33.1	42.6	8.6	23.8	33.6	42.8	70.6	81.2	31.3	57.7	68.1
AEIR+LS	38.1	64.7	75.6	21.3	50.3	65.8	20.4	40.8	51.3	8.8	24.1	34.1	53.0	81.3	87.9	37.5	64.8	75.0
AEIR	43.1	69.7	79.6	33.0	61.9	72.5	25.1	47.8	58.5	18.3	37.8	48.8	61.8	86.4	91.7	45.2	71.2	79.7
AEIR (Bert)	45.3	73.5	81.7	35.5	64.2	73.9	28.8	48.9	60.1	22.3	39.9	51.2	62.5	88.4	92.8	46.0	72.9	81.1

Tab.2 Ablation study of AEIR. Evaluation criteria are R@K

Methods	FLICKR30K-to-MSCOCO (1K)						FLICKR30K-to-MSCOCO (5K)						MSCOCO-to-FLICKR30K
	Image2Text			Text2Image			Image2Text			Text2Image			Image2Text			Text2Image
	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10
AEIR(M=1)	42.7	66.7	78.3	31.6	60.3	71.9	22.8	45.4	56.8	17.4	37.0	47.5	60.2	84.8	91.1	44.4	70.3	78.7
AEIR(M=3)	43.1	69.7	79.6	33.0	61.9	72.5	25.1	47.8	58.5	18.3	37.8	48.8	61.8	86.4	91.7	45.2	71.2	79.7
AEIR(M=5)	42.9	69.5	79.4	31.5	60.3	70.4	24.0	46.8	57.7	17.3	36.2	46.6	58.6	83.3	89.6	44.9	70.7	79.4

Tab.3 Performance with different numbers of neighbors M. The best results are in bold

Fig.4 Parameter sensitivity of λ and μ for AEIR in Image2Text. (a) R@1-FLICKR30K(I2T); (b) R@5-FLICKR30K(I2T); (c) R@10-FLICKR30K(I2T); (d) R@1-MSCOCO1K(I2T); (e) R@5-MSCOCO1K(I2T); (f) R@10-MSCOCO1K(I2T); (g) R@1-MSCOCO5K(I2T); (h) R@5-MSCOCO5K(I2T); (i) R@10-MSCOCO5K(I2T)

Fig.5 Parameter sensitivity of λ and μ for AEIR in Text2Image. (a) R@1-FLICKR30K(T2I); (b) R@5-FLICKR30K(T2I); (c) R@10-FLICKR30K(T2I); (d) R@1-MSCOCO1K(T2I); (e) R@5-MSCOCO1K(T2I); (f) R@10-MSCOCO1K(T2I); (g) R@1-MSCOCO5K(T2I); (h) R@5-MSCOCO5K(T2I); (i) R@10-MSCOCO5K(T2I)

Methods	FLICKR30K-to-MSCOCO (1K)						FLICKR30K-to-MSCOCO (5K)						MSCOCO-to-FLICKR30K
	Image2Text			Text2Image			Image2Text			Text2Image			Image2Text			Text2Image
	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10
VSEPP	17.1	38.0	50.7	12.5	33.9	47.6	7.2	19.7	27.5	4.9	14.6	22.2	24.4	48.9	60.1	16.5	38.2	48.9
VSEPP+	17.9	39.5	53.3	12.8	34.2	47.7	7.6	20.2	28.2	4.9	14.7	22.6	24.6	49.2	60.6	16.7	38.4	49.1
SCAN	35.6	63.0	73.8	18.7	48.2	62.3	17.6	39.1	48.7	8.1	21.3	31.7	48.4	75.2	83.9	35.4	62.2	72.3
SCAN+	38.1	64.7	75.6	21.3	50.3	65.8	20.4	40.8	51.3	8.8	24.1	34.1	53.0	81.3	87.9	37.5	64.8	75.0
SGRAF	38.6	66.8	76.7	29.6	56.6	68.2	20.2	42.5	54.4	15.1	33.5	44.3	57.7	82.8	89.4	41.9	68.2	77.4
SGRAF+	43.1	69.7	79.6	33.0	61.9	72.5	25.1	47.8	58.5	18.3	37.8	48.8	61.8	86.4	91.7	45.2	71.2	79.7
RACG	39.3	67.2	76.6	29.8	57.0	68.1	20.8	42.7	54.1	15.5	33.5	44.2	58.1	82.5	89.2	41.1	68.5	77.0
RACG+	42.1	68.7	79.0	32.6	61.5	72.2	23.4	45.5	57.3	17.6	36.2	47.3	60.9	85.5	90.3	43.6	70.7	78.4
A3VSE	29.9	58.1	69.9	25.3	52.2	62.1	15.8	34.0	45.6	11.3	27.8	37.1	45.6	71.9	82.1	34.2	60.4	70.8
A3VSE+	33.2	60.1	71.9	26.0	53.8	63.6	17.9	37.6	48.5	12.2	28.6	38.7	49.8	74.3	84.0	36.5	62.8	72.5

Tab.4 Performance of AEIR with different retrieval models based on the same backbones. For example, VSEPP+ means VSEPP+AEIR. Evaluation criteria are R@K

Fig.6 Value of the objective function of AEIR versus the number of training batches. (a) MSCOCO-to-FLICKR30K; (b) FLICKR30K-to-MSCOCO (1K); (c) FLICKR30K-to-MSCOCO (5K)

Fig.7 Value of the objective function of AEIR without pre-training versus the number of training batches. (a) MSCOCO-to-FLICKR30K; (b) FLICKR30K-to-MSCOCO (1K); (c) FLICKR30K-to-MSCOCO (5K)

Tab.5 Performance comparison with baselines and traditional methods on Pascal Sentences. The evaluation criterion is MAP. The best results are highlighted in bold
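MAP (mean average precision) averages, over all queries, the precision measured at each rank where a relevant item is retrieved. The following is a minimal sketch in plain Python, assuming each query comes with a ranked list of binary relevance flags (1 for a relevant item, 0 otherwise). How relevance is decided on Pascal Sentences is not stated on this page, so the flags are taken as given; the function names are illustrative.

```python
def average_precision(relevance):
    """Average precision of one ranked result list.

    relevance[r] is 1 if the item at rank r + 1 is relevant, else 0.
    Returns 0.0 when the list contains no relevant item.
    """
    hits = 0
    score = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            score += hits / rank  # precision at this rank
    return score / hits if hits else 0.0

def mean_average_precision(ranked_relevance):
    """Mean of the per-query average precision values."""
    return sum(average_precision(r) for r in ranked_relevance) / len(ranked_relevance)

# Query 1 finds relevant items at ranks 1 and 3, query 2 at rank 2.
map_score = mean_average_precision([[1, 0, 1, 0], [0, 1]])
```

Here the first query scores (1/1 + 2/3) / 2 and the second scores 1/2, so `map_score` is their mean, 2/3.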

Methods	FLICKR30K-to-MSCOCO (1K)						FLICKR30K-to-MSCOCO (5K)						MSCOCO-to-FLICKR30K
	Image2Text			Text2Image			Image2Text			Text2Image			Image2Text			Text2Image
	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10
source	78.0	95.8	98.2	61.4	89.3	95.4	56.9	82.4	90.5	40.2	68.7	79.8	75.2	93.3	96.6	56.2	81.0	86.5
source+AEIR	78.4	96.1	98.6	61.8	89.3	94.8	56.4	83.0	90.3	40.4	68.8	79.2	75.3	93.1	96.6	56.5	81.3	86.5

Tab.6 Performance on source domain. Evaluation criteria are R@K. The best results are highlighted in bold

Fig.8 Qualitative results of text retrieval given image queries on MS-COCO dataset. For each image query we show the top-5 ranked sentences. We observe that our AEIR retrieves the correct results in the top-ranked sentences

Fig.9 Qualitative results of image retrieval given sentence queries on MS-COCO dataset. For each sentence query, we show the top-3 ranked images, ordered from left to right. True matches are outlined in green boxes and false matches in red boxes

Fig.10 Failure results of text retrieval given image queries on MS-COCO dataset. For each image query we show the top-5 ranked sentences

Fig.11 Failure results of image retrieval given sentence queries on MS-COCO dataset. For each sentence query, we show the top-3 ranked images, ordered from left to right

