Frontiers of Computer Science


Front. Comput. Sci. 2023, Vol. 17, Issue (1): 171301    https://doi.org/10.1007/s11704-022-0610-2
RESEARCH ARTICLE
Bidirectional Transformer with absolute-position aware relative position encoding for encoding sentences
Le QI, Yu ZHANG, Ting LIU
School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
Abstract

Transformers have been widely studied for many natural language processing (NLP) tasks. Thanks to multi-head attention and the position-wise feed-forward network, they can capture dependencies across the whole sentence with high parallelizability. However, these two components are position-independent, which makes Transformers weak at modeling sentence structure. Existing studies commonly use positional encodings or mask strategies to capture the structural information of sentences. In this paper, we aim to strengthen the ability of Transformers to model the linear structure of sentences from three aspects: the absolute position of tokens, the relative distance between tokens, and the direction between tokens. We propose a novel bidirectional Transformer with absolute-position aware relative position encoding (BiAR-Transformer) that combines positional encoding and the mask strategy. We model the relative distance between tokens together with their absolute positions through a novel absolute-position aware relative position encoding, and we apply a bidirectional mask strategy to model the direction between tokens. Experimental results on natural language inference, paraphrase identification, sentiment classification and machine translation tasks show that BiAR-Transformer outperforms other strong baselines.
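The abstract names two mechanisms: an absolute-position aware relative position encoding (A-RPE) and a bidirectional mask strategy. The PyTorch sketch below is a hypothetical illustration of how such ingredients can fit into a single self-attention layer, not the authors' implementation (whose exact formulation appears only in the full paper): it adds learned absolute position embeddings to the input, injects a clipped relative-distance bias into the attention logits in the spirit of Shaw et al. [20], and fuses a forward-masked and a backward-masked attention view. All names and hyper-parameters (BiDirectionalRelativeAttention, max_rel_dist, the concatenation-based fusion) are illustrative assumptions.

# Illustrative sketch only; the paper's A-RPE formulation is not given in this abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiDirectionalRelativeAttention(nn.Module):
    def __init__(self, d_model=64, max_len=128, max_rel_dist=16):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Assumption: learned absolute position embeddings added to the input.
        self.abs_pos = nn.Embedding(max_len, d_model)
        # Relative-distance bias table, clipped to +/- max_rel_dist (Shaw et al. style).
        self.rel_bias = nn.Embedding(2 * max_rel_dist + 1, 1)
        self.max_rel_dist = max_rel_dist
        # Fuse the forward and backward attention views.
        self.out = nn.Linear(2 * d_model, d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        b, n, d = x.shape
        pos = torch.arange(n, device=x.device)
        x = x + self.abs_pos(pos)                       # absolute position information
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scores = q @ k.transpose(-2, -1) / d ** 0.5     # (b, n, n)
        # Add a bias that depends only on the clipped relative distance i - j.
        rel = (pos[None, :] - pos[:, None]).clamp(-self.max_rel_dist, self.max_rel_dist)
        scores = scores + self.rel_bias(rel + self.max_rel_dist).squeeze(-1)
        # Direction masks: the forward view sees only preceding tokens,
        # the backward view only following tokens (diagonal kept in both).
        fwd_mask = torch.tril(torch.ones(n, n, device=x.device)).bool()
        bwd_mask = torch.triu(torch.ones(n, n, device=x.device)).bool()
        fwd_out = F.softmax(scores.masked_fill(~fwd_mask, float('-inf')), dim=-1) @ v
        bwd_out = F.softmax(scores.masked_fill(~bwd_mask, float('-inf')), dim=-1) @ v
        return self.out(torch.cat([fwd_out, bwd_out], dim=-1))

# Usage example
layer = BiDirectionalRelativeAttention()
tokens = torch.randn(2, 10, 64)
print(layer(tokens).shape)  # torch.Size([2, 10, 64])

The fusion here is a plain concatenation followed by a linear layer; the paper's gated connection and attentive pooling (ablated in Tab.4 as Gate-Conn. and Att-Pooling) are omitted for brevity.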

Keywords: Transformer; relative position encoding; bidirectional mask strategy; sentence encoder
Corresponding Author(s): Yu ZHANG   
Just Accepted Date: 23 September 2021   Issue Date: 01 March 2022
 Cite this article:   
Le QI, Yu ZHANG, Ting LIU. Bidirectional Transformer with absolute-position aware relative position encoding for encoding sentences[J]. Front. Comput. Sci., 2023, 17(1): 171301.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-022-0610-2
https://academic.hep.com.cn/fcs/EN/Y2023/V17/I1/171301
Fig.1  The composition of the sentence linear structure
Fig.2  The architecture of the BiAR-Transformer encoder, where the shaded part of the mask matrices is the masked region
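As a companion to the caption above, the snippet below shows one plausible layout of the two direction masks for a hypothetical five-token sentence, where 1 marks a visible position and 0 the masked (shaded) region; the actual matrices are shown only in Fig.2 itself, so treat this as an assumption.

import numpy as np

seq_len = 5  # hypothetical sentence length
# Forward mask: token i may attend to tokens j <= i (lower triangle visible).
forward_mask = np.tril(np.ones((seq_len, seq_len), dtype=int))
# Backward mask: token i may attend to tokens j >= i (upper triangle visible).
backward_mask = np.triu(np.ones((seq_len, seq_len), dtype=int))
print(forward_mask)
print(backward_mask)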
Fig.3  The architectures for applying BiAR-Transformer to different tasks
Dataset      Labels (ratio)  Train  Dev.    Test
SNLI         3 (1:1:1)       550k   9,842   9,824
MNLI-m       3 (1:1:1)       392k   9,815   9,796
MNLI-mm      3 (1:1:1)       392k   9,832   9,847
QQP          2 (1:1)         380k   10,000  10,000
SST-5        5 (1:2:2:2:1)   8,000  1,000   2,000
WMT14 En-De  −               450k   3,000   3,003
Tab.1  Statistics of the experimental datasets
Model                            Dim   Params (θ)  SNLI   SST-5  En-De
Transformer [14]                 300   −           82.2   50.4   −
Transformer [1]                  −     −           −      −      27.3
DiSAN [6]                        600   2.4m        85.6   51.7   −
TreeLSTM [15]                    150   316K        −      51.0   −
Star-Transformer [14]            300   −           86.0   52.9   −
PSAN [16]                        300   2.0m        86.1   −      −
Distance-based SAN [17]          1200  4.7m        86.3   −      −
DRCN [18]                        −     5.6m        86.5   −      −
HBMP [19]                        600   22m         86.6   −      −
Transformer w/ rpe [20]          −     −           −      −      26.8
Transformer w/ rec. pe [21]      −     −           −      −      28.3
Transformer w/ reorder. pe [22]  −     −           −      −      28.2
SRAR [23]                        300   3.2m        86.8   52.6   28.2
BiAR-Transformer                 600   1.8m        86.9*  53.2*  28.2*
Tab.2  Comparison results with encoders trained from scratch ("−" denotes results not reported)
Model                  SNLI   QQP    MNLI-m/mm  SST-5
BERT-base              85.7   89.6   75.6/75.3  56.1
Transformer_bert       86.4   89.8   76.0/75.7  56.5
BiAR-Transformer_bert  87.5   90.1   76.4/76.5  57.1
Tab.3  Experimental results compared with BERT-base
Model                                    Dev Acc.  Test Acc.
BiAR-Transformer                         87.4      86.9
 − Bi-Mask                               87.1      86.5
 − A-RPE                                 87.0      86.2
 − BiAR-Att. (Bi-Mask & A-RPE)           86.3      85.4
 − Gate-Conn.                            86.7      86.3
 − Att-Pooling                           86.3      85.7
 − Att-Pooling & Gate-Conn.              85.7      85.3
 − Att-Pooling & Gate-Conn. & BiAR-Att.  84.7      84.3
Tab.4  Ablation study
Model                Dev Acc.  Test Acc.
No-directional mask  87.1      86.5
Forward mask         86.9      86.2
Backward mask        87.0      86.4
Bidirectional mask   87.4      86.9
Tab.5  Comparison experiments on direction masks
Model  Dev Acc.  Test Acc.
NPE    87.0      86.2
APE    87.0      86.3
RPE    87.2      86.6
A-RPE  87.4      86.9
Tab.6  Comparison experiments on positional encodings
Fig.4  Heat map of the attention attributions, taking the sentence "Two doctors perform surgery on patient." as an example

References
1 A Vaswani, N Shazeer, N Parmar, J Uszkoreit, L Jones, A N Gomez, Ł Kaiser, I Polosukhin. Attention is all you need. In: Proceedings of the 31st Conference on Neural Information Processing Systems. 2017, 5998−6008
2 M Guo, Y Zhang, T Liu. Gaussian transformer: a lightweight approach for natural language inference. In: Proceedings of the 33rd AAAI Conference on Artificial Intelligence. 2019, 6489−6496
3 A W Yu, D Dohan, M T Luong, R Zhao, K Chen, M Norouzi, Q V Le. QANet: combining local convolution with global self-attention for reading comprehension. In: Proceedings of the 6th International Conference on Learning Representations. 2018
4 Z Dai, Z Yang, Y Yang, J Carbonell, Q V Le, R Salakhutdinov. Transformer-XL: attentive language models beyond a fixed-length context. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019, 2978−2988
5 J Devlin, M W Chang, K Lee, K Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019, 4171−4186
6 T Shen, J Jiang, T Zhou, S Pan, G Long, C Zhang. DiSAN: directional self-attention network for RNN/CNN-free language understanding. In: Proceedings of the 32nd AAAI Conference on Artificial Intelligence. 2018, 5446−5455
7 S R Bowman, G Angeli, C Potts, C D Manning. A large annotated corpus for learning natural language inference. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 2015, 632−642
8 A Williams, N Nangia, S Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 2018, 1112−1122
9 Z Wang, W Hamza, R Florian. Bilateral multi-perspective matching for natural language sentences. In: Proceedings of the 26th International Joint Conference on Artificial Intelligence. 2017, 4144−4150
10 J Pennington, R Socher, C D Manning. GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014, 1532−1543
11 T Mikolov, E Grave, P Bojanowski, C Puhrsch, A Joulin. Advances in pre-training distributed word representations. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). 2018
12 D Hendrycks, K Gimpel. Gaussian error linear units (GELUs). 2016, arXiv preprint arXiv: 1606.08415
13 I Loshchilov, F Hutter. Decoupled weight decay regularization. In: Proceedings of the 7th International Conference on Learning Representations. 2019
14 Q Guo, X Qiu, P Liu, Y Shao, X Xue, Z Zhang. Star-transformer. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019, 1315−1325
15 K S Tai, R Socher, C D Manning. Improved semantic representations from tree-structured long short-term memory networks. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2015, 1556−1566
16 W Wu, H Wang, T Liu, S Ma. Phrase-level self-attention networks for universal sentence encoding. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2018, 3729−3738
17 J Im, S Cho. Distance-based self-attention network for natural language inference. 2017, arXiv preprint arXiv: 1712.02047
18 S Kim, I Kang, N Kwak. Semantic sentence matching with densely-connected recurrent and co-attentive information. In: Proceedings of the 33rd AAAI Conference on Artificial Intelligence. 2019, 6586−6593
19 A Talman, A Yli-Jyrä, J Tiedemann. Sentence embeddings in NLI with iterative refinement encoders. Natural Language Engineering, 2019, 25(4): 467−482
20 P Shaw, J Uszkoreit, A Vaswani. Self-attention with relative position representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). 2018, 464−468
21 K Chen, R Wang, M Utiyama, E Sumita. Recurrent positional embedding for neural machine translation. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019, 1361−1367
22 K Chen, R Wang, M Utiyama, E Sumita. Neural machine translation with reordering embeddings. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019, 1787−1799
23 Z Zheng, S Huang, R Weng, X Y Dai, J Chen. Improving self-attention networks with sequential relations. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020, 28: 1707−1716
24 J Hewitt, C D Manning. A structural probe for finding syntax in word representations. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019, 4129−4138
25 Y Wang, H Y Lee, Y N Chen. Tree transformer: integrating tree structures into self-attention. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019, 1061−1070