Frontiers of Computer Science

ISSN 2095-2228

ISSN 2095-2236(Online)

CN 10-1014/TP

Postal distribution code: 80-970

2019 Impact Factor: 1.275

Frontiers of Computer Science  2024, Vol. 18 Issue (4): 184320   https://doi.org/10.1007/s11704-023-3131-8
Y-Tuning: an efficient tuning paradigm for large-scale pre-trained models via label representation learning
Yitao LIU, Chenxin AN, Xipeng QIU()
School of Computer Science, Fudan University, Shanghai 200433, China
Full text: PDF (2582 KB)   HTML
Abstract

With the current success of large-scale pre-trained models (PTMs), how to efficiently adapt PTMs to downstream tasks has attracted tremendous attention, especially for PTMs with billions of parameters. Previous work focuses on designing parameter-efficient tuning paradigms but still needs to compute and store gradients over the whole computational graph. In this paper, we propose Y-Tuning, an efficient yet effective paradigm for adapting frozen large-scale PTMs to specific downstream tasks. Y-Tuning learns dense representations for the labels Y defined in a given task and aligns them to the fixed feature representations. Because it does not compute gradients of the text encoder during training, Y-Tuning is not only parameter-efficient but also training-efficient. Experimental results show that for DeBERTaXXL with 1.6 billion parameters, Y-Tuning achieves more than 96% of the performance of full fine-tuning on the GLUE benchmark with only 2% of the tunable parameters and much lower training cost.
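No reference implementation is given on this page, so the following is a minimal PyTorch-style sketch of the paradigm the abstract describes: a frozen text encoder φ, trainable dense label representations ψ(Y), and a small alignment module f that scores each label against the fixed features. All module and variable names (YTuningHead, label_emb, cross_attn, etc.) are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of the Y-Tuning idea described in the abstract.
# Names and hyper-parameters are assumptions for illustration only.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class YTuningHead(nn.Module):
    def __init__(self, d_model: int, num_labels: int, num_heads: int = 8):
        super().__init__()
        # psi(Y): dense, trainable representations for the task labels
        self.label_emb = nn.Embedding(num_labels, d_model)
        # f: a small module aligning label representations with frozen text features
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.score = nn.Linear(d_model, 1)

    def forward(self, frozen_feats: torch.Tensor) -> torch.Tensor:
        # frozen_feats: (batch, seq_len, d_model), produced without gradients
        batch = frozen_feats.size(0)
        labels = self.label_emb.weight.unsqueeze(0).expand(batch, -1, -1)
        aligned, _ = self.cross_attn(labels, frozen_feats, frozen_feats)
        return self.score(aligned).squeeze(-1)  # (batch, num_labels) logits

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
encoder = AutoModel.from_pretrained("roberta-large")
encoder.requires_grad_(False)  # phi is frozen: no encoder gradients are stored
head = YTuningHead(d_model=encoder.config.hidden_size, num_labels=2)

batch = tokenizer(["a delightful film", "a tedious mess"],
                  return_tensors="pt", padding=True)
with torch.no_grad():                      # feature extraction only
    feats = encoder(**batch).last_hidden_state
logits = head(feats)                       # only psi and f receive gradients
loss = nn.functional.cross_entropy(logits, torch.tensor([1, 0]))
loss.backward()
```

Because the encoder runs under torch.no_grad(), backpropagation only touches the label embeddings and the small head, which is where the training-time and memory savings claimed in the abstract come from.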

Key words: pre-trained model; lightweight fine-tuning paradigms; label representation
Received: 2023-02-18      Published: 2023-06-25
Corresponding Author(s): Xipeng QIU   
Cite this article:
Yitao LIU, Chenxin AN, Xipeng QIU. Y-Tuning: an efficient tuning paradigm for large-scale pre-trained models via label representation learning. Front. Comput. Sci., 2024, 18(4): 184320.
Link to this article:
https://academic.hep.com.cn/fcs/CN/10.1007/s11704-023-3131-8
https://academic.hep.com.cn/fcs/CN/Y2024/V18/I4/184320
Fig.1  
Tuning type | Input | Output | Function | Tunable modules | Param efficiency | Training efficiency
Fine-tuning | x | p(y|x) | f_φ(x) | f, φ | × | ×
Feature-based tuning | x | p(y|x) | f_φ(x) | f | ✓ | ✓
Adapter-tuning | x | p(y|x) | f_{φ+δ}(x) | f, δ | ✓ | ×
Prompt-tuning | x | p(y|x) | f_φ([p; x]) | f, p | ✓ | ×
Y-Tuning | x, Y | p(c|x, Y) | f(ψ(Y), φ(x)) | f, ψ | ✓ | ✓
Tab.1  
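Read as formulas, the last row of Tab.1 says that Y-Tuning keeps the feature extractor φ frozen and scores each candidate label c by combining its learned representation ψ(y_c) with the fixed features φ(x); a natural way to turn such scores into the table's output p(c|x, Y) is a softmax over labels. The display below is a hedged reconstruction from the table's columns, not the paper's exact training objective.

```latex
% Hedged reconstruction from the "Function" column of Tab.1.
\begin{aligned}
\text{Fine-tuning:}    &\quad p(y \mid x) = f_{\varphi}(x), && \text{tunable: } f,\ \varphi\\
\text{Adapter-tuning:} &\quad p(y \mid x) = f_{\varphi+\delta}(x), && \text{tunable: } f,\ \delta\\
\text{Prompt-tuning:}  &\quad p(y \mid x) = f_{\varphi}([p;\, x]), && \text{tunable: } f,\ p\\
\text{Y-Tuning:}       &\quad p(c \mid x, Y) =
  \frac{\exp f\!\bigl(\psi(y_c),\ \varphi(x)\bigr)}
       {\sum_{c'} \exp f\!\bigl(\psi(y_{c'}),\ \varphi(x)\bigr)},
  && \text{tunable: } f,\ \psi \quad (\varphi\ \text{frozen})
\end{aligned}
```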
Fig.2  
Method Total params Tunable params CoLA (8.5k) SST-2 (67k) MRPC (3.7k) QQP (364k) MNLIm (393k) MNLImm (393k) QNLI (105k) RTE (2.5k) AVG
BARTen-FT 205M 205M 59.3 95.8 89.2 89.5 92.2 89.3 94.3 77.6 85.2
BARTen-FbT 223M 19M 42.1 93.2 76.0 86.7 81.3 82.4 88.4 60.6 75.6
BARTen-YT 220M 17M 44.4 94.4 79.2 85.5 81.6 83.0 88.2 62.8 76.9
BARTen-FT 205M 205M 51.4 95.6 86.4 73.4 89.4 88.6 94.5 73.9 80.6
BARTen-FbT 223M 19M 41.7 90.3 74.0 65.3 81.8 81.4 88.0 56.6 71.1
BARTen-YT 220M 17M 40.9 95.6 76.8 64.2 82.5 82.4 88.1 57.4 72.2
Tab.2  
Method Total params Tunable params Training speedup Memory usage (%) CoLA SST-2 MRPC QQP MNLI QNLI RTE AVG
RoBERTa-AT? 355M 3M 0.6x 88.7 67.4 96.3 92.9 88.5 90.4 94.7 83.4 87.7
RoBERTa-LoRA? 355M 0.8M 1.4x 41.9 68.2 96.2 90.9 91.6 90.6 94.9 87.4 88.5
RoBERTa-WARP? 355M 1M 1.8x 71.6 60.6 96.0 91.2 84.5 88.2 93.5 86.3 85.8
RoBERTa-YT 372M 17M 3.2x 18.1 54.4 94.5 85.0 87.4 83.1 88.2 81.9 82.1
DeBERTa-YT 1.6B 31M 1.6x 26.1 65.8 96.2 90.9 87.8 87.8 93.6 89.2 87.4
RoBERTa-FT? 355M 355M 1x 100 68.0 96.4 90.9 92.2 90.2 96.4 86.6 88.7
DeBERTa-FT§ 1.6B 1.6B ? ? 72.0 97.2 93.1 92.7 91.8 96.0 93.5 90.9
Tab.3  
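As a quick arithmetic check on the abstract's "only 2% tunable parameters" figure, the ratios implied by the "Total params" and "Tunable params" columns of Tab.3 can be computed directly. The short script below is plain arithmetic on the table's rounded numbers, nothing more.

```python
# Tunable-parameter ratios implied by the "Total params" / "Tunable params"
# columns of Tab.3 (pure arithmetic, no modelling assumptions).
configs = {
    "RoBERTa-YT": (372e6, 17e6),
    "DeBERTa-YT": (1.6e9, 31e6),
    "RoBERTa-LoRA": (355e6, 0.8e6),
}
for name, (total, tunable) in configs.items():
    print(f"{name}: {tunable / total:.1%} of parameters are tunable")
# DeBERTa-YT: 31M / 1.6B is about 1.9%, consistent with the "only 2%
# tunable parameters" figure quoted in the abstract.
```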
Method Total params Tunable params RTE (2.5k) BoolQ (9.4k) CB (0.25k)
RoBERTa-FT? 355M 355M 86.6 86.9 98.2
RoBERTa-PT? 355M ? 58.8 62.3 71.4
RoBERTa-FbT 368M 14M 78.3 70.9 89.3
RoBERTa-YT 372M 17M 82.7 75.2 92.3
Tab.4  
Fig.3  
Fig.4  
Method CoNLL03 NER CoNLL03 CHUNK SQuAD 1.0
BART-FT 95.6 91.8 92.0
RoBERTa-PT? 86.1 ? 12.0
BART-FbT 70.9 73.6 73.6
BART-YT 88.2 85.9 82.7
Tab.5  
Initialization SST-2
Random Uniform 93.8
Sampled Vocab 94.4
Class Label 94.2
Opposite Label 93.8
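The four rows above compare initializations of the label representations ψ(Y) on SST-2, with "Sampled Vocab" (94.4) and "Class Label" (94.2) ahead of random and opposite-label initialization. A hedged sketch of what such strategies could look like follows; the helper name, the averaging choices, and the use of the PTM's input-embedding matrix are assumptions for illustration, not the paper's implementation.

```python
# Hedged sketch of the label-embedding initialization strategies compared
# above; helper names and design details are assumptions.
from typing import List
import torch

def init_label_embeddings(strategy: str, embed_matrix: torch.Tensor,
                          label_token_ids: List[List[int]]) -> torch.Tensor:
    """Return one vector per class, following the ablation's four strategies."""
    d_model = embed_matrix.size(1)
    num_labels = len(label_token_ids)
    if strategy == "random_uniform":
        return torch.empty(num_labels, d_model).uniform_(-0.1, 0.1)
    if strategy == "sampled_vocab":      # average a few randomly sampled word embeddings
        idx = torch.randint(0, embed_matrix.size(0), (num_labels, 8))
        return embed_matrix[idx].mean(dim=1)
    if strategy == "class_label":        # embeddings of the label words themselves
        return torch.stack([embed_matrix[ids].mean(dim=0) for ids in label_token_ids])
    if strategy == "opposite_label":     # swap the class-label vectors across classes
        vecs = init_label_embeddings("class_label", embed_matrix, label_token_ids)
        return vecs.flip(dims=[0])
    raise ValueError(strategy)
```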
Method Total params Tunable params SST-2 MNLIm/mm
RoBERTa-FT 355M 355M 96.4 90.4 / 90.1
RoBERTa-FbT 368M 14M 92.4 77.4 / 78.4
RoBERTa-YT1 372M 17M 92.5 76.4 / 77.2
RoBERTa-YT2 372M 17M 93.8 80.7 / 81.0
RoBERTa-YT4 372M 17M 94.5 82.8 / 83.3
1 Devlin J, Chang M W, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2019, 4171−4186
2 Brown T B, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A, Agarwal S, Herbert-Voss A, Krueger G, Henighan T, Child R, Ramesh A, Ziegler D M, Wu J, Winter C, Hesse C, Chen M, Sigler E, Litwin M, Gray S, Chess B, Clark J, Berner C, McCandlish S, Radford A, Sutskever I, Amodei D. Language models are few-shot learners. In: Proceedings of the 34th Conference on Neural Information Processing Systems. 2020, 1877−1901
3 Qiu X, Sun T, Xu Y, Shao Y, Dai N, Huang X. Pre-trained models for natural language processing: a survey. Science China Technological Sciences, 2020, 63(10): 1872−1897
4 Houlsby N, Giurgiu A, Jastrzebski S, Morrone B, De Laroussilhe Q, Gesmundo A, Attariyan M, Gelly S. Parameter-efficient transfer learning for NLP. In: Proceedings of the 36th International Conference on Machine Learning. 2019, 2790−2799
5 Stickland A C, Murray I. BERT and PALs: projected attention layers for efficient adaptation in multi-task learning. In: Proceedings of the 36th International Conference on Machine Learning. 2019, 5986−5995
6 Li X L, Liang P. Prefix-tuning: optimizing continuous prompts for generation. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing. 2021, 4582−4597
7 Lester B, Al-Rfou R, Constant N. The power of scale for parameter-efficient prompt tuning. In: Proceedings of 2021 Conference on Empirical Methods in Natural Language Processing. 2021, 3045−3059
8 Liu X, Zheng Y, Du Z, Ding M, Qian Y, Yang Z, Tang J. GPT understands, too. 2021, arXiv preprint arXiv: 2103.10385
9 Sun Y, Wang S, Feng S, Ding S, Pang C, Shang J, Liu J, Chen X, Zhao Y, Lu Y, Liu W, Wu Z, Gong W, Liang J, Shang Z, Sun P, Liu W, Ouyang X, Yu D, Tian H, Wu H, Wang H. ERNIE 3.0: large-scale knowledge enhanced pre-training for language understanding and generation. 2021, arXiv preprint arXiv: 2107.02137
10 Sun T, Shao Y, Qian H, Huang X, Qiu X. Black-box tuning for language-model-as-a-service. In: Proceedings of the 39th International Conference on Machine Learning. 2022, 20841−20855
11 Pfeiffer J, Rücklé A, Poth C, Kamath A, Vulić I, Ruder S, Cho K, Gurevych I. AdapterHub: a framework for adapting transformers. In: Proceedings of 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 2020, 46−54
12 Le Scao T, Rush A. How many data points is a prompt worth? In: Proceedings of 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2021, 2627−2636
13 Schick T, Schütze H. Exploiting cloze-questions for few-shot text classification and natural language inference. In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. 2021, 255−269
14 Petroni F, Rocktaschel T, Riedel S, Lewis P, Bakhtin A, Wu Y, Miller A. Language models as knowledge bases? In: Proceedings of 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. 2019, 2463−2473
15 Jiang Z, Xu F F, Araki J, Neubig G. How can we know what language models know? Transactions of the Association for Computational Linguistics, 2020, 8: 423−438
16 Aghajanyan A, Gupta S, Zettlemoyer L. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing. 2021, 7319−7328
17 Hu E J, Shen Y, Wallis P, Allen-Zhu Z, Li Y, Wang S, Wang L, Chen W. LoRA: low-rank adaptation of large language models. In: Proceedings of the 10th International Conference on Learning Representations. 2022
18 He J, Zhou C, Ma X, Berg-Kirkpatrick T, Neubig G. Towards a unified view of parameter-efficient transfer learning. In: Proceedings of the 10th International Conference on Learning Representations. 2022
19 Sung Y L, Cho J, Bansal M. LST: ladder side-tuning for parameter and memory efficient transfer learning. 2022, arXiv preprint arXiv: 2206.06522
20 Lewis M, Liu Y, Goyal N, Ghazvininejad M, Mohamed A, Levy O, Stoyanov V, Zettlemoyer L. BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020, 7871−7880
21 Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V. RoBERTa: a robustly optimized BERT pretraining approach. 2019, arXiv preprint arXiv: 1907.11692
22 He P, Liu X, Gao J, Chen W. DeBERTa: decoding-enhanced BERT with disentangled attention. In: Proceedings of the 9th International Conference on Learning Representations. 2021
23 Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, Moi A, Cistac P, Rault T, Louf R, Funtowicz M, Davison J, Shleifer S, von Platen P, Ma C, Jernite Y, Plu J, Xu C, Le Scao T, Gugger S, Drame M, Lhoest Q, Rush A. Transformers: state-of-the-art natural language processing. In: Proceedings of 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 2020, 38−45
24 Lan Z, Chen M, Goodman S, Gimpel K, Sharma P, Soricut R. ALBERT: a lite BERT for self-supervised learning of language representations. In: Proceedings of the 8th International Conference on Learning Representations. 2020
25 Wang A, Pruksachatkun Y, Nangia N, Singh A, Michael J, Hill F, Levy O, Bowman S R. SuperGLUE: a stickier benchmark for general-purpose language understanding systems. In: Proceedings of the 33rd Conference on Neural Information Processing Systems. 2019
26 Hambardzumyan K, Khachatrian H, May J. WARP: word-level adversarial ReProgramming. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing. 2021, 4921−4933
27 Liu X, Ji K, Fu Y, Tam W, Du Z, Yang Z, Tang J. P-tuning: prompt tuning can be comparable to fine-tuning across scales and tasks. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. 2022, 61−68
28 Pfeiffer J, Kamath A, Rücklé A, Cho K, Gurevych I. AdapterFusion: non-destructive task composition for transfer learning. In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. 2021, 487−503
29 Jin D, Jin Z, Zhou J T, Szolovits P. Is BERT really robust? A strong baseline for natural language attack on text classification and entailment. In: Proceedings of the 34th AAAI Conference on Artificial Intelligence. 2020, 8018−8025
30 Li L, Ma R, Guo Q, Xue X, Qiu X. BERT-ATTACK: adversarial attack against BERT using BERT. In: Proceedings of 2020 Conference on Empirical Methods in Natural Language Processing. 2020, 6193−6202
31 Ren S, Deng Y, He K, Che W. Generating natural language adversarial examples through probability weighted word saliency. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019, 1085−1097
32 Gao J, Lanchantin J, Soffa M L, Qi Y. Black-box generation of adversarial text sequences to evade deep learning classifiers. In: Proceedings of 2018 IEEE Security and Privacy Workshops. 2018, 50−56
33 Zeng G, Qi F, Zhou Q, Zhang T, Hou B, Zang Y, Liu Z, Sun M. OpenAttack: an open-source textual adversarial attack toolkit. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations. 2021, 363−371
34 Li X, Sun X, Meng Y, Liang J, Wu F, Li J. Dice loss for data-imbalanced NLP tasks. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020, 465−476
35 Yeh C K, Wu W C, Ko W J, Wang Y C F. Learning deep latent space for multi-label classification. In: Proceedings of the 31st AAAI Conference on Artificial Intelligence. 2017, 2838−2844
36 Sun X, Wei B, Ren X, Ma S. Label embedding network: learning label representation for soft training of deep networks. 2017, arXiv preprint arXiv: 1710.10393
37 Wang H, Chen C, Liu W, Chen K, Hu T, Chen G. Incorporating label embedding and feature augmentation for multi-dimensional classification. In: Proceedings of the 34th AAAI Conference on Artificial Intelligence. 2020, 6178−6185
38 Wang G, Li C, Wang W, Zhang Y, Shen D, Zhang X, Henao R, Carin L. Joint embedding of words and labels for text classification. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. 2018, 2321−2331
39 Zhang H, Xiao L, Chen W, Wang Y, Jin Y. Multi-task label embedding for text classification. In: Proceedings of 2018 Conference on Empirical Methods in Natural Language Processing. 2018, 4545−4553
40 Du C, Chen Z, Feng F, Zhu L, Gan T, Nie L. Explicit interaction model towards text classification. In: Proceedings of the 33rd AAAI Conference on Artificial Intelligence. 2019, 6359−6366
41 Sun C, Huang L, Qiu X. Utilizing BERT for aspect-based sentiment analysis via constructing auxiliary sentence. In: Proceedings of 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2019, 380−385
42 Chai D, Wu W, Han Q, Wu F, Li J. Description based text classification with reinforcement learning. In: Proceedings of the 37th International Conference on Machine Learning. 2020, 1371−1382
43 Wang S, Fang H, Khabsa M, Mao H, Ma H. Entailment as few-shot learner. 2021, arXiv preprint arXiv: 2104.14690
[1] FCS-23131-OF-YL_suppl_1 Download