Frontiers of Computer Science

Front. Comput. Sci.    2023, Vol. 17 Issue (1) : 171303    https://doi.org/10.1007/s11704-022-1244-0
RESEARCH ARTICLE
Unsupervised statistical text simplification using pre-trained language modeling for initialization
Jipeng QIANG1, Feng ZHANG1, Yun LI1, Yunhao YUAN1, Yi ZHU1, Xindong WU2,3
1. Department of Computer Science, Yangzhou University, Yangzhou 225127, China
2. Key Laboratory of Knowledge Engineering with Big Data (Hefei University of Technology), Ministry of Education, Hefei 23009, China
3. Mininglamp Academy of Sciences, Mininglamp, Beijing 100089, China
Abstract

Unsupervised text simplification has attracted much attention due to the scarcity of high-quality parallel text simplification corpora. A recent unsupervised statistical text simplification method based on a phrase-based machine translation system (UnsupPBMT) achieved good performance by initializing its phrase tables with similar words obtained through word embedding modeling. Since word embedding modeling only considers the relevance between words, the phrase tables in UnsupPBMT contain many dissimilar words. In this paper, we propose an unsupervised statistical text simplification method that uses the pre-trained language model BERT for initialization. Specifically, we use BERT as a general linguistic knowledge base for predicting similar words. Experimental results show that our method outperforms state-of-the-art unsupervised text simplification methods on three benchmarks and even outperforms some supervised baselines.
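As a rough illustration of the idea of treating BERT as a linguistic knowledge base, the sketch below masks a complex word in its sentence and takes BERT's top masked-token predictions as candidate similar words. This is a minimal sketch assuming the HuggingFace transformers library and the bert-base-uncased checkpoint; the paper's exact candidate generation and ranking may differ.

```python
# Minimal sketch: predicting similar words for a complex word with BERT's
# masked language model (assumes the HuggingFace `transformers` package;
# the paper's exact prompt construction and ranking may differ).
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def similar_words(sentence: str, complex_word: str, top_k: int = 15):
    # Replace the complex word with [MASK] and let BERT fill it in.
    masked = sentence.replace(complex_word, tokenizer.mask_token, 1)
    inputs = tokenizer(masked, return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    with torch.no_grad():
        logits = model(**inputs).logits
    # Take the top candidate tokens at the masked position,
    # skipping the original word and sub-word pieces.
    scores = logits[0, mask_pos[0]]
    candidates = []
    for token_id in torch.topk(scores, top_k * 2).indices:
        token = tokenizer.convert_ids_to_tokens(int(token_id))
        if token.isalpha() and token.lower() != complex_word.lower():
            candidates.append(token)
        if len(candidates) == top_k:
            break
    return candidates

print(similar_words("The enormous castle stood on the hill.", "enormous"))
```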

Keywords: text simplification, pre-trained language modeling, BERT, word embeddings
Corresponding Author(s): Jipeng QIANG, Yun LI
Just Accepted Date: 09 September 2021   Issue Date: 01 March 2022
 Cite this article:   
Jipeng QIANG, Feng ZHANG, Yun LI, et al. Unsupervised statistical text simplification using pre-trained language modeling for initialization[J]. Front. Comput. Sci., 2023, 17(1): 171303.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-022-1244-0
https://academic.hep.com.cn/fcs/EN/Y2023/V17/I1/171303
Word Method High-similar words
Enormous EM Immense, huge, massive, gigantic, colossal, considerable, sizable, myriad, mammoth, undoubtedly, staggering, stupendous, significant, towering, seemingly
BE Big, large, immense, huge, giant, vast, ample, enlarged, oversized, gigantic, massive, excessive, overwhelming, tremendous, ferocious
Merciless EM Ruthless, relentless, vicious, brutal, inhuman, heartless, unyielding, callous, ruthlessly, mercilessly, cold-blooded, murderous, savage, punishing, torturous
BE Ruthless, cruel, vicious, mean, cold, severe, relentless, lethal, murderous, fierce, horrible, harsh, violent, fatal, savage
Tab.1  The high-similar words of “enormous” and “merciless” obtained by word embedding modeling (abbr. EM) and BERT (abbr. BE), respectively. For each method, we generate the top 15 high-similar words. Non-similar words and words with inconsistent part-of-speech are shown in bold
Fig.1  Architecture of UnsupPBMT method
Fig.2  
Fig.3  Overview of the process of obtaining similar words using BERT
Method WikiLarge WikiSmall NewSela
SARI FKGL SARI FKGL SARI FKGL
Baselines
Complex 28.70 8.11 4.34 12.40 2.74 8.65
Reference 49.89 8.26 63.62 8.96 70.25 3.48
Supervised methods
PBMT-R (2012) 38.56 8.30 15.97 11.52 15.77 7.95
Hybrid (2014) 31.40 4.70 30.46 9.55 30.00 4.15
EncDecA (2017) 35.66 8.67 13.61 11.41 24.12 5.49
Dress (2017) 37.08 6.79 27.48 7.62 27.37 4.19
Dress-LS (2017) 37.27 6.62 27.24 7.55 26.63 4.21
EntPar (2018) 37.45 7.41 28.24 6.93 32.98 1.38
EditNTS (2019) 38.22 7.30 32.35 5.47 31.41 3.40
ACCESS (2020) 41.87 7.22 ? ? ? ?
Unsupervised methods
UNMT (2019) 33.72 8.23 ? ? ? ?
UNTS (2019) 35.29 7.84 ? ? ? ?
UnsupPBMT (2021) 39.08 8.26 25.12 10.66 23.75 7.36
UnsupPBMT-BERT 40.10 7.52 29.08 7.04 27.36 5.55
Tab.2  Performance of baselines and our method on the three corpora. For SARI, higher is better; for FKGL, lower is better. ? indicates results not reported in the original paper
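For reference, FKGL in Tab.2 is the Flesch-Kincaid grade level, computed from average sentence length and average syllables per word. Below is a rough sketch of the standard formula with a naive vowel-group syllable counter (an assumption for illustration; published evaluations typically rely on dedicated readability tools).

```python
# Rough sketch of the Flesch-Kincaid Grade Level (FKGL) reported in Tab.2,
# using the standard formula with a naive vowel-group syllable counter.
import re

def count_syllables(word: str) -> int:
    # Count vowel groups as a crude syllable estimate (at least one per word).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fkgl(sentences: list[str]) -> float:
    words = [w for s in sentences for w in re.findall(r"[A-Za-z]+", s)]
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)

print(round(fkgl(["The cat sat on the mat.", "It was warm."]), 2))
```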
Method The number of generated words
1−10 1−20 1−30 1−40 1−50
BERT 0.52 0.39 0.33 0.28 0.25
Word2Vec 0.32 0.24 0.20 0.17 0.15
Tab.3  The proportion of generated words judged synonymous in manual evaluation
Word Method The produced similar words
Enchanting BERT enchanting, fascinating, beautiful, magical, delightful, charming, exquisite, haunting, attractive, engaging, exotic, elegant, attracting, romantic, magnificent, enjoyable, wonderful, appealing, exceptional, accompanying, splendid.
Word2Vec enchanting, captivating, charming, delightful, mesmerizing, breathtaking, bewitching, alluring, beguiling, unforgettable, breath-taking, idyllic, wondrous, marvelous, charm, verdant, exquisitely, lovely, quaint, scenery
Speckled BERT speckled, spotted, freckle, fleck, scattered, streaked, splashed, dotted, stained, sprayed, spot, smeared, splotch, burl, strewn, gleaming, spotting, banded, conspicuous, dispersed.
Word2Vec speckled, mottled, flecks, specks, brownish, reddish, greenish, streaked, silvery, pinkish, dusky, striped, iridescent, whitish, bluish, yellowish, multicolored, gray, brown, blue-green.
Gorgeous BERT gorgeous, beautiful, lovely, wonderful, excellent, stunning, handsome, attractive, sexy, graceful, pretty, flawless, magnificent, fine, exquisite, elegant, great, sophisticated, amazing, spectacular.
Word2Vec gorgeous, beautiful, lovely, stunning, fabulous, amazing, cute, charming, breathtaking, stunningly, splendid, chic, lush, alluring, marvelous, glamorous, awesome, classy, sweet, incredible.
Excused BERT excused, dismissed, omitted, exempt, forgiven, allowed, removed, spared, withdrew, forbidden, admitted, explained, extended, introduced, isolated, interrupted, expelled, exhausted, pardoned, authorized.
Word2Vec excused, dismissed, forgiven, removed, admitted, explained, absences, obliged, absent, excuse, excepted, authorized, isolated, expelled, politely, excuses, regretted, objection, objected, punished.
Ferocious BERT ferocious, fierce, ruthless, deadly, furious, formidable, vicious, brutal, savage, aggressive, grim, murderous, lethal, severe, massive, monstrous, territorial, enormous, impressive, violent.
Word2Vec ferocious, fierce, fearsome, vicious, ferocity, merciless, relentless, snarling, bloodthirsty, monstrous, menacing, furious, ravenous, unstoppable, terrifying, fearless, fury, onslaught, unleashed, vengeful.
Tab.4  Similar words for different target words obtained by BERT and Word2Vec, respectively. Non-similar words and words with inconsistent part-of-speech are shown in bold
Fig.4  Evaluation results for different numbers of sentences on the three test sets. (a) WikiLarge; (b) WikiLarge; (c) WikiSmall; (d) WikiSmall; (e) NewSela; (f) NewSela
Fig.5  Evaluation results for different numbers of SimilarWords on the three test sets. (a) WikiLarge; (b) WikiLarge; (c) WikiSmall; (d) WikiSmall; (e) NewSela; (f) NewSela
SARI FKGL
UnsupPBMT-BERT 40.10 7.52
w/o BERT 39.33 7.33
w/o Similarity 39.18 5.57
w/o Frequency 39.37 7.52
Tab.5  Ablation study results of the ranking features. “w/o” denotes “without”
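To illustrate how the three ranking features ablated in Tab.5 (BERT prediction, embedding similarity, and word frequency) could be combined, the sketch below averages per-feature ranks over candidate substitutes. The averaging scheme, the toy feature values, and the helper name rank_candidates are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch of combining the three ranking features ablated in
# Tab.5 (BERT prediction score, embedding similarity, word frequency) by
# averaging per-feature ranks; the paper's exact weighting may differ.
def rank_candidates(candidates, bert_score, similarity, frequency):
    """Each argument maps candidate -> feature value (higher is better)."""
    features = [bert_score, similarity, frequency]
    avg_rank = {}
    for feature in features:
        # Rank 0 = best candidate under this feature.
        ordered = sorted(candidates, key=lambda w: -feature[w])
        for rank, word in enumerate(ordered):
            avg_rank[word] = avg_rank.get(word, 0) + rank / len(features)
    return sorted(candidates, key=lambda w: avg_rank[w])

# Toy values for candidates of "enormous" (illustrative numbers only).
cands = ["huge", "big", "ferocious"]
print(rank_candidates(
    cands,
    bert_score={"huge": 0.32, "big": 0.28, "ferocious": 0.05},
    similarity={"huge": 0.81, "big": 0.74, "ferocious": 0.22},
    frequency={"huge": 5.1, "big": 6.3, "ferocious": 2.0},
))
```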
1 L Martin, É de la Clergerie, B Sagot, A Bordes. Controllable sentence simplification. In: Proceedings of the 12th Conference on Language Resources and Evaluation. 2020, 4689−4698
2 S Nisioi, S Štajner, S P Ponzetto, L P Dinu. Exploring neural text simplification models. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. 2017, 85–91
3 S Wubben, A van den Bosch, E Krahmer. Sentence simplification by monolingual machine translation. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers. 2012, 1015−1024
4 W Xu, C Napoles, E Pavlick, Q Chen, C Callison-Burch. Optimizing statistical machine translation for text simplification. Transactions of the Association for Computational Linguistics, 2016, 4: 401–415
5 X Zhang, M Lapata. Sentence simplification with deep reinforcement learning. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 2017, 584–594
6 Z Zhu, D Bernhard, I Gurevych. A monolingual tree-based translation model for sentence simplification. In: Proceedings of the 23rd International Conference on Computational Linguistics. 2010, 1353−1361
7 W Xu, C Callison-Burch, C Napoles. Problems in current text simplification research: new data can help. Transactions of the Association for Computational Linguistics, 2015, 3: 283–297
8 S Surya, A Mishra, A Laha, P Jain, K Sankaranarayanan. Unsupervised neural text simplification. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019, 2058−2068
9 D Kumar, L Mou, L Golab, O Vechtomova. Iterative edit-based unsupervised sentence simplification. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020, 7918−7928
10 J Qiang, X Wu. Unsupervised statistical text simplification. IEEE Transactions on Knowledge and Data Engineering, 2021, 33(4): 1802–1806
11 Y Meng, Y Zhang, J Huang, C Xiong, H Ji, C Zhang, J Han. Text classification using label names only: a language model self-training approach. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020, 9006−9017
12 F Petroni, T Rocktäschel, P Lewis, A Bakhtin, Y Wu, A H Miller, S Riedel. Language models as knowledge bases?. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. 2019, 2463−2473
13 A Roberts, C Raffel, N Shazeer. How much knowledge can you pack into the parameters of a language model?. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020, 5418−5426
14 H Zhang, D Khashabi, Y Song, D Roth. TransOMCS: from linguistic graphs to commonsense knowledge. In: Proceedings of the 29th International Joint Conference on Artificial Intelligence. 2020, 4004−4010
15 P Koehn, H Hoang, A Birch, C Callison-Burch, M Federico, N Bertoldi, B Cowan, W Shen, C Moran, R Zens, C Dyer, O Bojar, A Constantin, E Herbst. Moses: open source toolkit for statistical machine translation. In: Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions. 2007, 177−180
16 M Artetxe, G Labaka, E Agirre, K Cho. Unsupervised neural machine translation. In: Proceedings of the 6th International Conference on Learning Representations. 2018
17 J Pennington, R Socher, C Manning. GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014, 1532−1543
18 J N Farr, J J Jenkins, D G Paterson. Simplification of Flesch reading ease formula. Journal of Applied Psychology, 1951, 35(5): 333–337
19 K Heafield. KenLM: faster and smaller language model queries. In: Proceedings of the 6th Workshop on Statistical Machine Translation. 2011, 187−197
20 G Lample, M Ott, A Conneau, L Denoyer, M Ranzato. Phrase-based & neural unsupervised machine translation. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2018, 5039−5049
21 D Li, Y Zhang, H Peng, L Chen, C Brockett, M T Sun, B Dolan. Contextualized perturbation for textual adversarial attack. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2021, 5053−5069
22 G Glavaš, S Štajner. Simplifying lexical simplification: do we need simplified corpora?. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing. 2015, 63−68
23 M Brysbaert, B New. Moving beyond Kučera and Francis: a critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 2009, 41(4): 977–990
24 J Qiang, Y Li, Y Zhu, Y Yuan, X Wu. Lexical simplification with pretrained encoders. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(5): 8649–8656
25 J Qiang, X Lv, Y Li, Y Yuan, X Wu. Chinese lexical simplification. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 1819–1828
26 S Zhao, R Meng, D He, S Andi, P Bambang. Integrating transformer and paraphrase rules for sentence simplification. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2018, 3164−3173
27 S Narayan, C Gardent. Hybrid simplification using deep semantics and machine translation. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. 2014, 435−445
28 H Guo, R Pasunuru, M Bansal. Dynamic multi-level multi-task learning for sentence simplification. In: Proceedings of the 27th International Conference on Computational Linguistics. 2018, 462−476
29 Y Dong, Z Li, M Rezagholizadeh, J C K Cheung. EditNTS: a neural programmer-interpreter model for sentence simplification through explicit editing. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019, 3393−3402
30 A Radford, J Wu, R Child, D Luan, D Amodei, I Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 2019, 1(8): 9
31 Z Yang, Z Dai, Y Yang, J Carbonell, R Salakhutdinov, Q V Le. XLNet: generalized autoregressive pretraining for language understanding. In: Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019). 2019, 5754−5764
32 J Devlin, M W Chang, K Lee, K Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1. 2019, 4171−4186
33 Z Lan, M Chen, S Goodman, K Gimpel, P Sharma, R Soricut. ALBERT: a lite BERT for self-supervised learning of language representations. In: Proceedings of the 8th International Conference on Learning Representations. 2020
34 M Lewis, Y Liu, N Goyal, M Ghazvininejad, A Mohamed, O Levy, V Stoyanov, L Zettlemoyer. BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020, 7871−7880
35 C Scarton, L Specia. Learning simplifications for specific target audiences. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. 2018, 712−718
36 S Narayan, C Gardent. Unsupervised sentence simplification using deep semantics. In: Proceedings of the 9th International Natural Language Generation Conference. 2015, 111−120
37 L Martin, A Fan, É de la Clergerie, A Bordes, B Sagot. MUSS: multilingual unsupervised sentence simplification by mining paraphrases. 2021, arXiv preprint arXiv: 2005.00352
38 M Artetxe, G Labaka, E Agirre. Unsupervised statistical machine translation. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2018, 3632−3642
39 G Wenzek, M A Lachaux, A Conneau, V Chaudhary, F Guzmán, A Joulin, E Grave. CCNET: extracting high quality monolingual datasets from web crawl data. In: Proceedings of the 12th Language Resources and Evaluation Conference. 2020, 4003−4012
40 E Pavlick, C Callison-Burch. Simple PPDB: a paraphrase database for simplification. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. 2016, 143−148