Frontiers of Computer Science

ISSN 2095-2228

ISSN 2095-2236(Online)

CN 10-1014/TP


Front. Comput. Sci.    2024, Vol. 18 Issue (3) : 183308    https://doi.org/10.1007/s11704-022-2341-9
RESEARCH ARTICLE
Accelerating BERT inference with GPU-efficient exit prediction
Lei LI1, Chengyu WANG2, Minghui QIU2, Cen CHEN1(), Ming GAO1,3, Aoying ZHOU1
1. Shanghai Engineering Research Center of Big Data Management, School of Data Science and Engineering, East China Normal University, Shanghai 200062, China
2. Alibaba Group, Hangzhou 311121, China
3. KLATASDS-MOE, School of Statistics, East China Normal University, Shanghai 200062, China
Abstract

BERT is a representative pre-trained language model that has drawn extensive attention for significant improvements in downstream Natural Language Processing (NLP) tasks. Its complex architecture and massive parameter count give BERT competitive performance but also make inference slow. To speed up BERT inference, FastBERT realizes adaptive inference with an acceptable drop in accuracy, based on knowledge distillation and the early-exit technique. However, several factors may limit the performance of FastBERT, such as a teacher classifier that is not knowledgeable enough, batch size shrinkage, and redundant computation of student classifiers. To overcome these limitations, we propose a new BERT inference method with GPU-Efficient Exit Prediction (GEEP). GEEP leverages a shared exit loss to simplify the two-step training process of FastBERT into a single step, and makes the teacher classifier more knowledgeable by feeding it diverse Transformer outputs. In addition, an exit layer prediction technique is proposed that uses a GPU hash table to store the token-level exit layer distribution and to sort test samples by their predicted exit layers. In this way, GEEP avoids both batch size shrinkage and redundant computation of student classifiers. Experimental results on twelve public English and Chinese NLP datasets demonstrate the effectiveness of the proposed approach. The source code of GEEP will be released to the public upon paper acceptance.
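To make the exit-prediction idea concrete, the following minimal Python sketch shows how test samples could be grouped by a predicted exit layer before inference, so that every sample in a batch stops at the same Transformer layer and the batch never shrinks mid-forward. The dictionary standing in for the GPU hash table, the averaging rule, and the function names (predict_exit_layer, sort_into_exit_batches) are illustrative assumptions, not the authors' implementation.

# Illustrative sketch only: group samples by a predicted exit layer so that
# each batch exits at one layer and never shrinks during the forward pass.
# The aggregation rule (rounded mean of token-level expected exit layers)
# and all names below are assumptions, not the exact GEEP implementation.
from collections import defaultdict
import math

def predict_exit_layer(token_ids, token_exit_table, default_exit=12):
    """Predict a sample's exit layer from per-token expected exit layers."""
    exits = [token_exit_table.get(t, default_exit) for t in token_ids]
    return max(1, min(12, math.ceil(sum(exits) / len(exits))))

def sort_into_exit_batches(samples, token_exit_table, batch_size=32):
    """Group samples by predicted exit layer, then batch within each group."""
    groups = defaultdict(list)
    for sample in samples:
        groups[predict_exit_layer(sample, token_exit_table)].append(sample)
    batches = []
    for exit_layer in sorted(groups):
        group = groups[exit_layer]
        for i in range(0, len(group), batch_size):
            batches.append((exit_layer, group[i:i + batch_size]))
    return batches  # each batch runs only exit_layer Transformer layers

# Toy usage: token 101 tends to exit early, token 999 late.
table = {101: 3, 102: 4, 999: 11}
print(sort_into_exit_batches([[101, 102], [999, 101], [102, 102]], table, 2))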

Keywords BERT      FastBERT      inference acceleration      model distillation      early exit      text classification     
Corresponding Author(s): Cen CHEN   
Just Accepted Date: 02 December 2022   Issue Date: 13 April 2023
 Cite this article:   
Lei LI, Chengyu WANG, Minghui QIU, et al. Accelerating BERT inference with GPU-efficient exit prediction[J]. Front. Comput. Sci., 2024, 18(3): 183308.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-022-2341-9
https://academic.hep.com.cn/fcs/EN/Y2024/V18/I3/183308
Fig.1  (a) FastBERT with 12 Transformer layers and 11 early exits. A batch of four samples is fed into the network. Samples 1 and 2 (easy samples) exit at Layer 3, while Samples 3 and 4 (hard samples) exit at Layer 12. (b) The shared exit layer in our GEEP method
Fig.2  Exit layer distributions for tokens. (a) Exit layer distribution of the token "good"; (b) exit layer distribution of the token "very"; (c) the exit score of the sequence "very good"
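Fig. 2 suggests how a sequence-level exit score can be derived from token-level exit-layer distributions. The sketch below assumes the score is the mean of the tokens' expected exit layers; the exact aggregation used by GEEP may differ, so this is only an illustration of the idea.

# Sketch of how token-level exit-layer distributions (Fig. 2a, 2b) could be
# built and combined into a sequence-level exit score (Fig. 2c). The
# combination rule (averaging the tokens' expected exit layers) is an
# assumption for illustration, not necessarily the paper's exact formula.
import numpy as np

NUM_LAYERS = 12

def exit_distribution(observed_exit_layers):
    """Histogram of the layers at which samples containing a token exited."""
    counts = np.bincount(observed_exit_layers, minlength=NUM_LAYERS + 1)[1:]
    return counts / counts.sum()            # probability over layers 1..12

def sequence_exit_score(token_distributions):
    """Combine per-token distributions into one expected exit layer."""
    layers = np.arange(1, NUM_LAYERS + 1)
    expected = [float(d @ layers) for d in token_distributions]
    return float(np.mean(expected))

good = exit_distribution([3, 3, 4, 3, 5, 4])   # "good" exits early
very = exit_distribution([4, 5, 4, 6, 5, 4])   # "very" exits slightly later
print(sequence_exit_score([very, good]))        # exit score of "very good"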
  
Fig.3  The GPU hash table, with an example of its processing steps in GEEP
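For readers unfamiliar with GPU hashing, the sketch below emulates on the CPU the flat-array, open-addressing layout that makes hash tables GPU-friendly (fixed-size slots that many independent threads can probe in parallel). It is a teaching aid under those assumptions, not the actual GEEP table shown in Fig. 3.

# CPU emulation of a flat-array, open-addressing hash table of the kind that
# maps well onto GPUs: every slot is a fixed-size array cell, so thousands of
# threads can probe independently. Illustrative sketch, not GEEP's kernel.
import numpy as np

class FlatHashTable:
    EMPTY = -1

    def __init__(self, capacity=1 << 12):
        self.capacity = capacity
        self.keys = np.full(capacity, self.EMPTY, dtype=np.int64)
        self.values = np.zeros(capacity, dtype=np.float32)

    def _probe(self, key):
        slot = hash(int(key)) % self.capacity
        for _ in range(self.capacity):           # linear probing
            if self.keys[slot] in (self.EMPTY, key):
                return slot
            slot = (slot + 1) % self.capacity
        raise RuntimeError("hash table is full")

    def insert(self, key, value):                # e.g., token id -> exit statistic
        slot = self._probe(key)
        self.keys[slot], self.values[slot] = key, value

    def lookup(self, key, default=12.0):
        slot = self._probe(key)
        return self.values[slot] if self.keys[slot] == key else default

table = FlatHashTable()
table.insert(101, 3.2)                            # token 101: expected exit layer 3.2
print(table.lookup(101), table.lookup(999))       # -> 3.2 12.0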
Name #Train #Dev. #Test
ChnSentiCorp 9,600 1,200 1,200
Book review 20,000 10,000 10,000
Shopping review 20,000 10,000 10,000
Weibo 99,988 10,000 10,000
THUCNews 50,000 5,000 10,000
LCQMC 238,766 8,802 12,500
Ag.News 120,000 0 7,600
Amz.F 3,000,000 0 650,000
DBpedia 560,000 0 70,000
Yahoo 1,400,000 0 60,000
Yelp.F 650,000 0 50,000
Yelp.P 560,000 0 38,000
Tab.1  Data splits of all the datasets
ChnSentiCorp | Book review | Shopping review | Weibo | THUCNews | LCQMC
(each cell lists A/%, T/s, Speedup)
BERT: 94.50, 3.66, 1x | 87.21, 26.62, 1x | 96.79, 26.52, 1x | 97.75, 26.58, 1x | 96.69, 26.56, 1x | 86.60, 33.10, 1x
FastBERT: 92.00, 1.50, 2.44x | 86.50, 20.18, 1.32x | 96.25, 11.90, 2.23x | 97.79, 15.51, 1.71x | 96.59, 11.97, 2.22x | 83.90, 28.23, 1.17x
FastBERT: 90.58, 0.99, 3.68x | 85.81, 14.40, 1.85x | 96.08, 9.64, 2.75x | 97.80, 9.28, 2.87x | 96.11, 6.61, 4.02x | 79.70, 20.38, 1.62x
FastBERT: 88.92, 0.70, 5.23x | 83.98, 7.99, 3.33x | 95.96, 8.10, 3.27x | 97.74, 3.35, 7.94x | 95.21, 3.90, 6.81x | 73.63, 10.12, 3.27x
GEEP: 92.08, 1.07, 3.42x | 86.52, 19.64, 1.36x | 96.45, 11.12, 2.39x | 97.73, 15.36, 1.73x | 96.55, 13.25, 2.00x | 86.40, 19.29, 1.72x
GEEP: 91.08, 0.81, 4.50x | 86.28, 10.86, 2.45x | 96.47, 8.99, 2.95x | 97.80, 8.94, 2.97x | 96.26, 8.96, 2.96x | 85.26, 11.18, 2.96x
GEEP: 89.17, 0.56, 6.52x | 84.09, 7.02, 3.79x | 96.21, 6.83, 3.88x | 97.75, 4.66, 5.71x | 95.19, 4.73, 5.61x | 80.52, 5.85, 5.66x
Ag.News | Amz.F | DBpedia | Yahoo | Yelp.F | Yelp.P
(each cell lists A/%, T/s, Speedup)
BERT: 94.54, 20.28, 1x | 65.53, 1717.61, 1x | 99.31, 184.52, 1x | 77.34, 158.28, 1x | 65.89, 131.83, 1x | 95.97, 100.14, 1x
FastBERT: 94.38, 16.00, 1.27x | 63.34, 1470.28, 1.17x | 99.29, 43.85, 4.21x | 76.52, 131.98, 1.20x | 63.29, 114.89, 1.15x | 95.69, 71.91, 1.39x
FastBERT: 93.88, 10.19, 1.99x | 62.44, 1032.32, 1.66x | 99.24, 31.20, 5.91x | 75.97, 105.88, 1.49x | 62.06, 91.73, 1.44x | 94.99, 50.94, 1.97x
FastBERT: 93.28, 5.97, 3.40x | 61.80, 598.32, 2.87x | 99.14, 24.23, 7.62x | 75.51, 63.95, 2.48x | 60.88, 61.81, 2.13x | 94.13, 35.16, 2.85x
GEEP: 94.52, 11.69, 1.73x | 63.78, 1428.94, 1.20x | 99.27, 63.13, 2.92x | 76.99, 119.60, 1.32x | 65.34, 109.92, 1.20x | 95.82, 67.15, 1.49x
GEEP: 94.41, 6.77, 2.99x | 63.64, 867.25, 1.98x | 99.24, 47.92, 3.85x | 77.06, 93.40, 1.69x | 65.44, 88.39, 1.49x | 95.44, 42.46, 2.36x
GEEP: 93.72, 3.56, 5.70x | 62.82, 447.34, 3.84x | 99.18, 32.98, 5.59x | 76.69, 54.25, 2.92x | 64.71, 56.07, 2.35x | 94.97, 26.17, 3.83x
Tab.2  Comparison of accuracy (A), time (T), and speedup (S) between GEEP and the baselines over all the 12 datasets
Fig.4  The accuracy-time curves of GEEP and the baselines on all the datasets. Curves are made by connecting points obtained from experiments
Name JSD
RANDOM 0.0083
ChnSentiCorp 0.0064
Book review 0.0066
Shopping review 0.0040
Weibo 0.0023
THUCNews 0.0031
LCQMC 0.0073
Ag.News 0.0042
Amz.F 0.0040
DBpedia 0.0008
Yahoo 0.0051
Yelp.F 0.0035
Yelp.P 0.0054
Tab.3  The average of JSD scores for all tokens in all the datasets
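For reference, the Jensen-Shannon divergence (JSD) values reported in Tab. 3 can be computed as in the snippet below. It only shows the standard base-2 JSD formula between two discrete distributions (so the result lies in [0, 1]); which distributions are compared and averaged per token follows the paper, not this sketch.

# Standard Jensen-Shannon divergence between two discrete distributions,
# using base-2 logarithms so the value is bounded by [0, 1].
import numpy as np

def kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def jsd(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

print(jsd([0.7, 0.2, 0.1], [0.6, 0.3, 0.1]))   # small value: similar distributions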
Name P/s P/% I/s I/%
ChnSentiCorp 0.22 39.22 0.34 60.44
Book review 1.83 21.01 6.85 78.85
Shopping review 1.86 39.48 2.83 60.17
Weibo 1.82 39.13 2.82 60.57
THUCNews 1.85 39.04 2.87 60.62
LCQMC 2.29 39.18 3.54 60.61
Ag.News 1.38 38.83 2.17 60.89
Amz.F 121.70 39.23 186.34 60.07
DBpedia 12.90 39.13 19.87 60.24
Yahoo 11.12 39.02 17.16 60.23
Yelp.F 9.32 39.23 14.26 60.01
Yelp.P 7.05 39.31 10.76 59.94
Tab.4  The time distribution for the inference algorithm of GEEP (each stage is reported in seconds, /s, and as a percentage of the total inference time, /%)
Fig.5  Ablation study of GEEP. Curves are made by connecting points obtained from experiments
Fig.6  Detailed performance of the models mentioned in this paper
Model SST-2/% QNLI/% RTE/%
BERT 90.37 89.60 67.10
CascadeBERT 87.84 84.49 62.09
GEEP (cascade) 86.60 85.74 61.73
Tab.5  Accuracy of GEEP (cascade) and CascadeBERT on GLUE tasks
Dataset BERT/% SEL/% Log10(#Train)
ChnSentiCorp 94.50 93.67 3.98
Book review 87.21 86.63 4.30
Shopping review 96.79 96.27 4.30
Weibo 97.75 97.73 4.99
THUCNews 96.69 96.46 4.69
LCQMC 86.60 86.51 5.37
Ag.News 94.54 94.53 5.07
Amz.F 65.53 63.84 6.47
DBpedia 99.31 99.28 5.74
Yahoo 77.34 76.91 6.14
Yelp.F 65.89 65.43 5.81
Yelp.P 95.97 95.91 5.74
Tab.6  Accuracy (%) of BERT and SEL (BERT with shared exit loss), together with the log10 of each dataset's training set size
Fig.7  Accuracy drop and Log10(training data size)
1 Devlin J, Chang M W, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019, 4171−4186
2 Radford A, Narasimhan K. Improving language understanding by generative pre-training. See cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf website. 2018
3 Yang Z, Dai Z, Yang Y, Carbonell J G, Salakhutdinov R, Le Q. XLNet: generalized autoregressive pretraining for language understanding. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems. 2019, 517
4 Gou J, Yu B, Maybank S J, Tao D. Knowledge distillation: a survey. International Journal of Computer Vision, 2021, 129(6): 1789−1819
5 Laskaridis S, Kouris A, Lane N D. Adaptive inference through early-exit networks: design, challenges and directions. In: Proceedings of the 5th International Workshop on Embedded and Mobile Deep Learning. 2021, 1−6
6 Liu W, Zhou P, Wang Z, Zhao Z, Deng H, Ju Q. FastBERT: a self-distilling BERT with adaptive inference time. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020, 6035−6044
7 Wang C, Qiu M, Zhang T, Liu T, Li L, Wang J, Wang M, Huang J, Lin W. EasyNLP: a comprehensive and easy-to-use toolkit for natural language processing. 2022, arXiv preprint arXiv: 2205.00258
8 Wang C, Qiu M, Huang J. Building natural language processing applications with EasyNLP. In: Proceedings of the 31st ACM International Conference on Information & Knowledge Management. 2022, 5100−5101
9 Buciluă C, Caruana R, Niculescu-Mizil A. Model compression. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2006, 535−541
10 Hinton G, Vinyals O, Dean J. Distilling the knowledge in a neural network. 2015, arXiv preprint arXiv: 1503.02531
11 Sanh V, Debut L, Chaumond J, Wolf T. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. 2019, arXiv preprint arXiv: 1910.01108
12 Zhang L, Song J, Gao A, Chen J, Bao C, Ma K. Be your own teacher: Improve the performance of convolutional neural networks via self distillation. In: Proceedings of 2019 IEEE/CVF International Conference on Computer Vision (ICCV). 2019, 3712−3721
13 Berestizshevsky K, Even G. Dynamically sacrificing accuracy for reduced computation: Cascaded inference based on softmax confidence. In: Proceedings of the Artificial Neural Networks and Machine Learning-ICANN 2019: Deep Learning: the 28th International Conference on Artificial Neural Networks. 2019, 306−320
14 Gormez A, Koyuncu E. Class means as an early exit decision mechanism. 2021, arXiv preprint arXiv: 2103.01148v1
15 Jiang H, Kim B, Guan M Y, Gupta M. To trust or not to trust a classifier. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems. 2018, 5546−5557
16 Zhou W, Xu C, Ge T, McAuley J J, Xu K, Wei F. BERT loses patience: fast and robust inference with early exit. In: Proceedings of the Conference on Neural Information Processing Systems. 2020, 18330−18341
17 Sun T, Liu X, Zhu W, Geng Z, Wu L, He Y, Ni Y, Xie G, Huang X, Qiu X. A simple hash-based early exiting approach for language understanding and generation. In: Proceedings of Findings of the Association for Computational Linguistics: ACL 2022. 2022, 2409−2421
18 Lessley B, Childs H. Data-parallel hashing techniques for GPU architectures. IEEE Transactions on Parallel and Distributed Systems, 2020, 31(1): 237−250
19 Cormen T H, Leiserson C E, Rivest R L, Stein C. Introduction to Algorithms. 3rd ed. Massachusetts: The MIT Press, 2009
20 Bordawekar R. Evaluation of parallel hashing techniques. In: Proceedings (Findings) of the GPU Technology Conference. See on-demand.gputechconf.com/gtc/2014/presentations/S4507-evaluation-of-parallel-hashing-techniques.pdf website. 2014, 1−27
21 Pagh R, Rodler F F. Cuckoo hashing. Journal of Algorithms, 2004, 51(2): 122−144
22 Breslow A D, Jayasena N S. Morton filters: faster, space-efficient cuckoo filters via biasing, compression, and decoupled logical sparsity. Proceedings of the VLDB Endowment, 2018, 11(9): 1041−1055
23 Alipourfard O, Moshref M, Zhou Y, Yang T, Yu M. A comparison of performance and accuracy of measurement algorithms in software. In: Proceedings of the Symposium on SDN Research. 2018, 18
24 Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł, Polosukhin I. Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017, 6000−6010
25 Voita E, Sennrich R, Titov I. The bottom-up evolution of representations in the transformer: A study with machine translation and language modeling objectives. In: Proceedings of 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019, 4396−4406
26 Xiong R, Yang Y, He D, Zheng K, Zheng S, Xing C, Zhang H, Lan Y, Wang L, Liu T. On layer normalization in the transformer architecture. In: Proceedings of the 37th International Conference on Machine Learning. 2020, 10524−10533
27 Cover T M, Thomas J A. Elements of Information Theory. 2nd ed. Hoboken: John Wiley & Sons, Inc., 2006, 57−58
28 Liu X, Chen Q, Deng C, Zeng H, Chen J, Li D, Tang B. LCQMC: a large-scale Chinese question matching corpus. In: Proceedings of the 27th International Conference on Computational Linguistics. 2018, 1952−1962
29 Zhang X, Zhao J, LeCun Y. Character-level convolutional networks for text classification. In: Proceedings of the 28th International Conference on Neural Information Processing Systems. 2015, 649−657
30 Jiao X, Yin Y, Shang L, Jiang X, Chen X, Li L, Wang F, Liu Q. TinyBERT: distilling BERT for natural language understanding. In: Proceedings of Findings of the Association for Computational Linguistics: EMNLP 2020. 2020, 4163−4174
31 Chen X, He B, Hui K, Sun L, Sun Y. Simplified tinyBERT: Knowledge distillation for document retrieval. In: Proceedings of the 43rd European Conference on Information Retrieval. 2021, 241−248
32 Li L, Lin Y, Chen D, Ren S, Li P, Zhou J, Sun X. CascadeBERT: accelerating inference of pre-trained language models via calibrated complete models cascade. In: Proceedings of Findings of the Association for Computational Linguistics: EMNLP 2021. 2021, 475−486
33 Sun T, Zhou Y, Liu X, Zhang X, Jiang H, Cao Z, Huang X, Qiu X. Early exiting with ensemble internal classifiers. 2021, arXiv preprint arXiv: 2105.13792
34 Zhu W. LeeBERT: learned early exit for BERT with cross-level optimization. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021, 2968−2980
35 Xin J, Tang R, Lee J, Yu Y, Lin J. DeeBERT: dynamic early exiting for accelerating BERT inference. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020, 2246−2251
36 Wang A, Singh A, Michael J, Hill F, Levy O, Bowman S. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In: Proceedings of 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. 2018, 353−355