Named entity recognition for Chinese construction documents based on conditional random field

Qiqi ZHANG1, Cong XUE1, Xing SU1, Peng ZHOU2, Xiangyu WANG3, Jiansong ZHANG4

1. College of Civil Engineering and Architecture, Zhejiang University, Hangzhou 310058, China
2. School of Management Science and Engineering, Central University of Finance and Economics, Beijing 100081, China
3. School of Design and the Built Environment, Curtin University, Perth, Western Australia 6845, Australia
4. School of Construction Management Technology, Purdue University, West Lafayette, IN 47907, USA

Abstract Named entity recognition (NER) is essential in many natural language processing (NLP) tasks such as information extraction and document classification. A construction document usually contains critical named entities, and an effective NER method can provide a solid foundation for downstream applications that improve construction management efficiency. This study presents an NER method for Chinese construction documents based on the conditional random field (CRF), comprising a corpus design pipeline and a CRF model. The corpus design pipeline identifies typical NER tasks in construction management, enables word-based tokenization, and controls annotation consistency with a newly designed annotation specification. The CRF model engineers nine transformation features and seven classes of state features, covering the impacts of word position, part-of-speech (POS), and word/character states within the context. The F1-measure on a labeled construction data set is 87.9%. Furthermore, as more domain knowledge features are infused, the marginal performance improvement from including POS information decreases, which points to POS customization as a promising research direction for improving NLP performance with limited data.
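
To make the idea of state features concrete, the following is a minimal sketch of how word position, POS, and word/character context could be turned into per-token feature dictionaries, a format that CRF toolkits such as sklearn-crfsuite commonly accept. It is not the feature set used in this study: the function names, the feature keys, the window size, the example sentence, and the POS tags are all assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple

# A token is a (word, POS tag) pair produced by word-based tokenization.
Token = Tuple[str, str]


def token_features(sent: List[Token], i: int, window: int = 1) -> Dict[str, object]:
    """Build illustrative state features for the token at index i.

    The keys are hypothetical; they only mirror the kinds of information
    named in the abstract (position, POS, word/character context).
    """
    word, pos = sent[i]
    feats: Dict[str, object] = {
        "bias": 1.0,
        "word": word,                  # the word itself
        "pos": pos,                    # part-of-speech tag
        "first_char": word[:1],        # character-level state
        "last_char": word[-1:],
        "length": len(word),
        "is_digit": word.isdigit(),
        "BOS": i == 0,                 # word position in the sentence
        "EOS": i == len(sent) - 1,
    }
    # Context features from neighboring tokens inside the window.
    for offset in range(-window, window + 1):
        j = i + offset
        if offset == 0 or not 0 <= j < len(sent):
            continue
        ctx_word, ctx_pos = sent[j]
        feats[f"{offset:+d}:word"] = ctx_word
        feats[f"{offset:+d}:pos"] = ctx_pos
    return feats


def sentence_features(sent: List[Token], window: int = 1) -> List[Dict[str, object]]:
    """Feature dictionaries for every token in one sentence."""
    return [token_features(sent, i, window) for i in range(len(sent))]


# Hypothetical tokenized sentence with placeholder POS tags.
example: List[Token] = [
    ("混凝土", "n"), ("强度", "n"), ("等级", "n"), ("为", "v"), ("C30", "ws"),
]
features = sentence_features(example)
```

Each sentence becomes a list of dictionaries, one per word, which a linear-chain CRF can pair with a label sequence during training. Domain knowledge features, for example membership in a construction term list, could be added as further keys in the same dictionary.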

Keywords: NER, NLP, Chinese language, construction document

Corresponding Author(s): Xing SU

Just Accepted Date: 16 November 2021
Online First Date: 07 January 2022
Issue Date: 29 May 2023