COURIER: contrastive user intention reconstruction for large-scale visual recommendation

doi:10.1007/s11704-024-3939-x

Front. Comput. Sci.

2025, Vol. 19

Issue (7) : 197602 https://doi.org/10.1007/s11704-024-3939-x

Information Systems

COURIER: contrastive user intention reconstruction for large-scale visual recommendation

Jia-Qi YANG^1,², Chenglei DAI³, Dan OU³, Dongshuai LI³, Ju HUANG³, De-Chuan ZHAN^1,²(

), Xiaoyi ZENG³, Yang YANG⁴(

)

¹. School of Artificial Intelligence, Nanjing University, Nanjing 210023, China
². State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
³. TaoTian Searching and Ranking Team, Alibaba Group, Hangzhou 311121, China
⁴. School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China

Download: PDF(13518 KB) HTML
Export: BibTeX | EndNote | Reference Manager | ProCite | RefWorks

Abstract

With the advancement of multimedia internet, the impact of visual characteristics on the decision of users to click or not within the online retail industry is increasingly significant. Thus, incorporating visual features is a promising direction for further performance improvements in click-through rate (CTR). However, experiments on our production system revealed that simply injecting the image embeddings trained with established pre-training methods only has marginal improvements. We believe that the main advantage of existing image feature pre-training methods lies in their effectiveness for cross-modal predictions. However, this differs significantly from the task of CTR prediction in recommendation systems. In recommendation systems, other modalities of information (such as text) can be directly used as features in downstream models. Even if the performance of cross-modal prediction tasks is excellent, it is challenging to provide significant information gain for the downstream models. We argue that a visual feature pre-training method tailored for recommendation is necessary for further improvements beyond existing modality features. To this end, we propose an effective user intention reconstruction module to mine visual features related to user interests from behavior histories, which constructs a many-to-one correspondence. We conduct extensive experimental evaluations on public datasets and our production system to verify that our method can learn users’ visual interests. Our method achieves 0.46% improvement in offline AUC and 0.88% improvement in Taobao GMV (Cross Merchandise Volume) with p-value < 0.01.

Keywords user Intention reconstruction contrastive learning personalized searching image features

Corresponding Author(s): De-Chuan ZHAN,Yang YANG

About author:

Just Accepted Date: 13 June 2024 Issue Date: 11 October 2024

Cite this article:

Jia-Qi YANG,Chenglei DAI,Dan OU, et al. COURIER: contrastive user intention reconstruction for large-scale visual recommendation[J]. Front. Comput. Sci., 2025, 19(7): 197602.

URL:

https://academic.hep.com.cn/fcs/EN/10.1007/s11704-024-3939-x
https://academic.hep.com.cn/fcs/EN/Y2025/V19/I7/197602

Fig.1 (a) Existing image feature learning methods are tailored for cross-modal prediction tasks. (b) We propose a user intention reconstruction method to mine potential visual features that cannot be reflected by cross-modal labels. In this example, the user searched for “Coat” and received two recommendations (Page-viewed items). The user clicked on the one on the right. Through our user intention reconstruction, we identified similar items from the user’s click history with larger attention, the reconstructed PV item embeddings are denoted as

R p v j

. Then, we optimize the PV embeddings

E p v j

and reconstructions

R p v j

to be closer if the corresponding item is clicked and more far apart otherwise

Fig.2 The contrastive user intention reconstruction method. The images are fed into the image backbone model to obtain the corresponding embeddings. The embeddings of PV (Page-View) sequences are blue-colored, and the embeddings of click sequences are yellow-colored. The reconstructions are in green. Red boxes denote positive PV items

Tab.1 Statistics of public datasets

Tab.2 Averge Recall and NDCG performance comparison on public datasets

Tab.3 Pre-training dataset collected from the women’s clothing category

Tab.4 Daily average statistics of the downstream dataset

Methods	$Δ A U C$ (Women’s Clothing)	$Δ A U C$	$Δ G A U C$
Baseline	0.00% (0.7785)	0.00% (0.8033)	0.00% (0.7355)
Supervised	+0.06% (0.7790)	?0.14% (0.8018)	?0.06% (0.7349)
CLIP [9]	+0.26% (0.7810)	+0.04% (0.8036)	?0.09% (0.7346)
SimCLR [7]	+0.28% (0.7812)	+0.05% (0.8037)	?0.08% (0.7347)
SimSiam [8]	+0.10% (0.7794)	?0.10% (0.8022)	?0.29% (0.7327)
MaskCLIP [44]	+0.31% (0.7815)	+0.03% (0.8035)	?0.03% (0.7352)
COURIER (ours)	+0.46% (0.7830)	+0.16% (0.8048)	+0.19% (0.7374)

Tab.5 The improvements of AUC (

Δ A U C

) in the women’s clothing category. And performances of

Δ A U C, Δ G A U C

in all categories. We report the relative improvements compared to the Baseline method, and the raw values of the metrics are in parentheses

Fig.3 The impact of different values of

τ

on the performance of downstream CTR tasks. The horizontal axis represents the values of

τ

, while the vertical axis denotes the change (%) in the metrics

	$Δ A U C$ (women’s clothing)	$Δ A U C$	$Δ G A U C$
w/o UCS	0.06%	?0.13%	0.11%
w/o Contrast	0.23%	0.03%	?0.11%
w/o Reconstruction	0.25%	0.02%	?0.11%
w/o Neg PV	0.30%	0.07%	?0.06%
COURIER	0.46%	0.16%	0.19%

Tab.6 Ablation studies of COURIER

Tab.7 Influence of different batch on performance

	$Δ A U C$ (women’s clothing)	$Δ A U C$	$Δ G A U C$
w CLIP	0.26%	0.04%	?0.09%
COURIER	0.46%	0.16%	0.19%

Tab.8 Train with text information

Fig.4 The AUC improvements of COURIER compared to the Baseline on different categories. The x-axis is sorted by the improvements

Fig.5 T-SNE visualization of embeddings in different categories. (a) Dress and Jeans; (b) Shirt and Cheongsam; (c) Skirt and Fur

Fig.6 T-SNE visualization of embeddings with different style tags. We also plot some item images with different tags below the corresponding figures. (a) Cool and Sexy; (b) Mature and Cuties; (c) Grace and Antique

	$Δ$ # Order	$Δ$ CTR	$Δ$ GMV
All categories	+0.1%	+0.18%	+0.66%
Women’s clothing	+0.31%	+0.34%	+0.88%

Tab.9 The A/B testing improvements of COURIER

Fig.A1 The downstream CTR model and image representation

Table A1 Performance of inserting image information with Vector, SimScore, and Cluster ID. Since we performed this comparison in the early stage of our development, the exact configurations of each version are hard to describe in detail. And the different versions may not be comparable to each other (different training data sizes, learning rates, training methods, etc.). We only list the version number for clarity. Results within each row are comparable since they are generated from the same version of embeddings. The Baseline does not use images. “?” denotes that we did not evaluate Cluster-ID of these versions

Table A2 Adding projection to COURIER

	$Δ A U C$ (women’s clothing)	$Δ A U C$	$Δ G A U C$
Hinge loss	0.15%	0.02%	0.02%
COURIER	0.46%	0.16%	0.19%

Table A3 Train with hinge loss

1	W, Zhang J, Qin W, Guo R, Tang X He . Deep learning for click-through rate estimation. In: Proceedings of the 30th International Joint Conference on Artificial Intelligence. 2021, 4695−4703
2	G, Zhou N, Mou Y, Fan Q, Pi W, Bian C, Zhou X, Zhu K Gai . Deep interest evolution network for click-through rate prediction. In: Proceedings of the 33rd AAAI Conference on Artificial Intelligence. 2019, 5941−5948
3	J Q, Yang D C, Zhan L Gan . Beyond probability partitions: Calibrating neural networks with semantic aware grouping. In: Proceedings of the 37th International Conference on Neural Information Processing Systems. 2023, 2547
4	J Q, Yang D C Zhan . Generalized delayed feedback model with post-click information in recommender systems. In: Proceedings of the 36th International Conference on Neural Information Processing Systems. 2022, 1899
5	Z, Yuan F, Yuan Y, Song Y, Li J, Fu F, Yang Y, Pan Y Ni . Where to go next for recommender systems? ID- vs. modality-based recommender models revisited. In: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2023, 2639−2649
6	Z H Zhou . Learnability with time-sharing computational resource concerns. 2023, arXiv preprint arXiv: 2305.02217
7	T, Chen S, Kornblith M, Norouzi G Hinton . A simple framework for contrastive learning of visual representations. In: Proceedings of the 37th International Conference on Machine Learning. 2020, 149
8	X, Chen K He . Exploring simple Siamese representation learning. In: Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021, 15745−15753
9	A, Radford J W, Kim C, Hallacy A, Ramesh G, Goh S, Agarwal G, Sastry A, Askell P, Mishkin J, Clark G, Krueger I Sutskever . Learning transferable visual models from natural language supervision. In: Proceedings of the 38th International Conference on Machine Learning. 2021, 8748−8763
10	J B, Schafer D, Frankowski J, Herlocker S Sen . Collaborative filtering recommender systems. In: Brusilovsky P, Kobsa A, Nejdl W, eds. The Adaptive Web, Methods and Strategies of Web Personalization. Berlin: Springer, 2007, 291−324
11	G, Linden B, Smith J York . Amazon. com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing, 2003, 7( 1): 76–80
12	S, Zhang L, Yao A, Sun Y Tay . Deep learning based recommender system: a survey and new perspectives. ACM Computing Surveys, 2019, 52( 1): 5
13	P S, Huang X, He J, Gao L, Deng A, Acero L Heck . Learning deep structured semantic models for web search using clickthrough data. In: Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. 2013, 2333−2338
14	J Q, Yang X, Li S, Han T, Zhuang D C, Zhan X, Zeng B Tong . Capturing delayed feedback in conversion rate prediction via elapsed-time sampling. In: Proceedings of the 35th AAAI Conference on Artificial Intelligence. 2021, 4582−4589
15	C, Wu F, Wu T, Qi Y Huang . Empowering news recommendation with pre-trained language models. In: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2021, 1652−1656
16	den Oord A, van S, Dieleman B Schrauwen . Deep content-based music recommendation. In: Proceedings of the 26th International Conference on Neural Information Processing Systems. 2013, 2643−2651
17	L, Wu L, Chen R, Hong Y, Fu X, Xie M Wang . A hierarchical attention model for social contextual image recommendation. IEEE Transactions on Knowledge and Data Engineering, 2020, 32( 10): 1854–1867
18	P, Covington J, Adams E Sargin . Deep neural networks for YouTube recommendations. In: Proceedings of the 10th ACM Conference on Recommender Systems. 2016, 191−198
19	R, He J McAuley . Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In: Proceedings of the 25th International Conference on World Wide Web. 2016, 507−517
20	Y, Wei X, Wang L, Nie X, He R, Hong T S Chua . MMGCN: multi-modal graph convolution network for personalized recommendation of micro-video. In: Proceedings of the 27th ACM International Conference on Multimedia. 2019, 1437−1445
21	Q, Wang Y, Wei J, Yin J, Wu X, Song L Nie . DualGNN: Dual graph neural network for multimedia recommendation. IEEE Transactions on Multimedia, 2023, 25: 1074–1084
22	J, Zhang Y, Zhu Q, Liu S, Wu S, Wang L Wang . Mining latent structures for multimedia recommendation. In: Proceedings of the 29th ACM International Conference on Multimedia. 2021, 3872−3880
23	Z, Tao X, Liu Y, Xia X, Wang L, Yang X, Huang T S Chua . Self-supervised learning for multimedia recommendation. IEEE Transactions on Multimedia, 2023, 25: 5107–5116
24	P, Yu Z, Tan G, Lu B K Bao . Multi-view graph convolutional network for multimedia recommendation. In: Proceedings of the 31st ACM International Conference on Multimedia. 2023, 6576−6585
25	X, Zhou H, Zhou Y, Liu Z, Zeng C, Miao P, Wang Y, You F Jiang . Bootstrap latent representations for multi-modal recommendation. In: Proceedings of the ACM Web Conference. 2023, 845−854
26	X, Dong X, Zhan Y, Wu Y, Wei M C, Kampffmeyer X, Wei M, Lu Y, Wang X Liang . M5Product: Self-harmonized contrastive learning for e-commercial multi-modal pretraining. In: Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022, 21220−21230
27	J, Wang F, Yuan M, Cheng J M, Jose C, Yu B, Kong X, He Z, Wang B, Hu Z Li . TransRec: learning transferable recommendation from mixture-of-modality feedback. 2022, arXiv preprint arXiv: 2206.06190
28	J B, Grill F, Strub F, Altché C, Tallec P H, Richemond E, Buchatskaya C, Doersch B A, Pires Z D, Guo M G, Azar B, Piot K, Kavukcuoglu R, Munos M Valko . Bootstrap your own latent A new approach to self-supervised learning. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. 2020, 1786
29	J, Devlin M W, Chang K, Lee K Toutanova . BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2019, 4171−4186
30	T, Yao X, Yi D Z, Cheng F, Yu T, Chen A, Menon L, Hong E H, Chi S, Tjoa J, Kang E Ettinger . Self-supervised learning for large-scale item recommendations. In: Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 2021, 4321−4330
31	R, Qiu Z, Huang H, Yin Z Wang . Contrastive learning for representation degeneration problem in sequential recommendation. In: Proceedings of the 15th ACM International Conference on Web Search and Data Mining. 2022, 813−823
32	Y, Hou S, Mu W X, Zhao Y, Li B, Ding J R Wen . Towards universal sequence representation learning for recommender systems. In: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2022, 585−593
33	J, Wu X, Wang F, Feng X, He L, Chen J, Lian X Xie . Self-supervised graph learning for recommendation. In: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2021, 726−735
34	F, Sun J, Liu J, Wu C, Pei X, Lin W, Ou P Jiang . BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer. In: Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 2019, 1441−1450
35	Y, Wang X, Wang X, Huang Y, Yu H, Li M, Zhang Z, Guo W Wu . Intent-aware recommendation via disentangled graph contrastive learning. In: Proceedings of the 32nd International Joint Conference on Artificial Intelligence. 2023, 260
36	X, Ren L, Xia J, Zhao D, Yin C Huang . Disentangled contrastive collaborative filtering. In: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2023, 1137−1146
37	A, Vaswani N, Shazeer N, Parmar J, Uszkoreit L, Jones A N, Gomez Ł, Kaiser I Polosukhin . Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017, 6000−6010
38	B, Poole S, Ozair den Oord A, van A A, Alemi G Tucker . On variational bounds of mutual information. In: Proceedings of the 36th International Conference on Machine Learning. 2019, 5171−5180
39	B, Mustafa C, Riquelme J, Puigcerver R, Jenatton N Houlsby . Multimodal contrastive learning with LIMoe: the language-image mixture of experts. In: Proceedings of the 36th International Conference on Neural Information Processing Systems. 2022, 695
40	J, McAuley C, Targett Q, Shi den Hengel A van . Image-based recommendations on styles and substitutes. In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2015, 43−52
41	R, He J McAuley . VBPR: visual Bayesian personalized ranking from implicit feedback. In: Proceedings of the 30th AAAI Conference on Artificial Intelligence. 2016, 144−150
42	X Zhou . MMRec: Simplifying multimodal recommendation. In: Proceedings of the 5th ACM International Conference on Multimedia in Asia Workshops. 2023, 6
43	Z, Sun X, Li X, Sun Y, Meng X, Ao Q, He F, Wu J Li . ChineseBERT: Chinese pretraining enhanced by glyph and pinyin information. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing. 2021, 2065−2075
44	X, Dong J, Bao Y, Zheng T, Zhang D, Chen H, Yang M, Zeng W, Zhang L, Yuan D, Chen F, Wen N Yu . MaskCLIP: Masked self-distillation advances contrastive language-image pretraining. In: Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023, 10995−11005
45	Z, Liu Y, Lin Y, Cao H, Hu Y, Wei Z, Zhang S, Lin B Guo . Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. 2021, 9992−10002
46	T, Chen B, Xu C, Zhang C Guestrin . Training deep nets with sublinear memory cost. 2016, arXiv preprint arXiv: 1604.06174
47	P, Micikevicius S, Narang J, Alben G F, Diamos E, Elsen D, García B, Ginsburg M, Houston O, Kuchaiev G, Venkatesh H Wu . Mixed precision training. In: Proceedings of the 6th International Conference on Learning Representations. 2018
48	J, Yi L, Zhang J, Wang R, Jin A K Jain . A single-pass algorithm for efficiently recovering sparse cluster centers of high-dimensional data. In: Proceedings of the 31st International Conference on International Conference on Machine Learning. 2014, II-658−II-666
49	Z Y, Zhang X R, Sheng Y, Zhang B, Jiang S, Han H, Deng B Zheng . Towards understanding the overfitting phenomenon of deep click-through rate models. In: Proceedings of the 31st ACM International Conference on Information & Knowledge Management. 2022, 2671−2680
50	T, Chen S, Kornblith K, Swersky M, Norouzi G Hinton . Big self-supervised models are strong semi-supervised learners. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. 2020, 1865
51	K Sohn . Improved deep metric learning with multi-class n-pair loss objective. In: Proceedings of the 30th International Conference on Neural Information Processing Systems. 2016, 1857−1865
52	H, Xuan A, Stylianou X, Liu R Pless . Hard negative examples are hard, but useful. In: Proceedings of the 16th European Conference on Computer Vision. 2020, 126−142

[1]

FCS-23939-OF-JY_suppl_1

Download

Viewed

Full text

Abstract

Cited

Shared

Discussed