Frontiers of Engineering Management

ISSN 2095-7513

ISSN 2096-0255(Online)

CN 10-1205/N

Postal Subscription Code 80-905

Front. Eng    2023, Vol. 10 Issue (2) : 237-249    https://doi.org/10.1007/s42524-021-0179-8
RESEARCH ARTICLE
Named entity recognition for Chinese construction documents based on conditional random field
Qiqi ZHANG1, Cong XUE1, Xing SU1, Peng ZHOU2, Xiangyu WANG3, Jiansong ZHANG4
1. College of Civil Engineering and Architecture, Zhejiang University, Hangzhou 310058, China
2. School of Management Science and Engineering, Central University of Finance and Economics, Beijing 100081, China
3. School of Design and the Built Environment, Curtin University, Perth, Western Australia 6845, Australia
4. School of Construction Management Technology, Purdue University, West Lafayette, IN 47907, USA
Abstract

Named entity recognition (NER) is essential in many natural language processing (NLP) tasks such as information extraction and document classification. A construction document usually contains critical named entities, and an effective NER method can provide a solid foundation for downstream applications that improve construction management efficiency. This study presents an NER method for Chinese construction documents based on the conditional random field (CRF), comprising a corpus design pipeline and a CRF model. The corpus design pipeline identifies typical NER tasks in construction management, enables word-based tokenization, and controls annotation consistency with a newly designed annotation specification. The CRF model engineers nine transformation features and seven classes of state features, covering the impacts of word position, part-of-speech (POS), and word/character states within the context. The F1-measure on a labeled construction data set is 87.9%. Furthermore, as more domain knowledge features are infused, the marginal performance improvement from including POS information decreases, which points to POS customization as a promising research direction for improving NLP performance with limited data.

Keywords: NER, NLP, Chinese language, construction document
Corresponding Author(s): Xing SU   
Just Accepted Date: 16 November 2021   Online First Date: 07 January 2022    Issue Date: 29 May 2023
 Cite this article:   
Qiqi ZHANG,Cong XUE,Xing SU, et al. Named entity recognition for Chinese construction documents based on conditional random field[J]. Front. Eng, 2023, 10(2): 237-249.
 URL:  
https://academic.hep.com.cn/fem/EN/10.1007/s42524-021-0179-8
https://academic.hep.com.cn/fem/EN/Y2023/V10/I2/237
| Word | Variation/Abbreviation |
| 外悬挑梁 (cantilever beam) | 悬挑外梁 |
| 加气混凝土砌块 (aerated concrete block) | 混凝土加气块, 加气块 |
| 水泥粉喷搅拌桩 (cement powder spray pile) | 粉喷搅拌桩, 粉喷桩 |
Tab.1  Examples of word variation and/or abbreviation
Fig.1  Corpus design pipeline.
| Task | Target NEs | Construction document |
| Identification of responsibility and legal issues | Organization; Party; Law/Regulation | Contract; Bidding document correspondence; Other formal communication |
| Cost analysis | Material; Equipment; Building parts | Progress report; Project quota |
| Progress analysis | Material; Equipment; Building parts | Construction plan; Progress report; Daily log; Meeting minutes |
| Quality analysis | Material; Building parts; Construction techniques | Construction plan; Quality report; Daily log; Meeting minutes |
| Safety analysis | Equipment; Building parts; Date; Body parts; Injury types | Construction plan; Safety report; Daily log; Meeting minutes |
Tab.2  NER tasks, target NEs, and associated construction documents
// Prepare initial tag sequences before fusion:
[1] For each sentence s = [c1, c2, ..., cj, ..., cN], where cj represents the jth character in s, record the results from the three segmentation tools as R1, R2, and R3
[2] Mark the first character of each word with tag “F” and the remaining characters with tag “E” in R1, R2, and R3
[3] Let Ti = [ti1, ti2, ..., tij, ..., tiN] represent the tag sequence of Ri, where tij is the tag of the jth character in Ri
// Form the final tag sequence by fusion:
[4] For each character cj in s, a list Tj = [t1j, t2j, t3j] exists
[5] Count the number of tags “F” and tags “E” in Tj as NFj and NEj, respectively
[6] Assign “F” or “E” as the final tag of character cj according to whichever of NFj and NEj is larger, and form the final tag sequence TFinal
// Transform the final tag sequence into segmented tokens as the fusion result:
[7] Scan from left to right; mark an F that is followed by another F as an individual token
[8] Mark an F that is followed by an E as the beginning of a token; mark an E that is followed by an F as the end of a token; search back for the nearest beginning of a token; and combine the beginning-end pair, together with every character in between, into one token
Tab.3  Procedures of the ensemble method
Fig.2  Illustration of the ensemble method.
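The voting procedure in Tab.3 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the tie-breaking rule, and the three example segmentations are assumptions made here, and the real inputs would come from the three segmentation tools.

```python
def words_to_tags(words):
    """Steps [2]-[3]: tag the first character of each word "F", the rest "E"."""
    tags = []
    for word in words:
        if word:  # skip empty strings defensively
            tags.extend(["F"] + ["E"] * (len(word) - 1))
    return tags


def fuse_tags(tag_sequences):
    """Steps [4]-[6]: per-character majority vote across the tools."""
    if len({len(seq) for seq in tag_sequences}) != 1:
        raise ValueError("all segmentations must cover the same characters")
    final = []
    for votes in zip(*tag_sequences):
        n_f = votes.count("F")
        n_e = votes.count("E")
        # Assumption: a tie (only possible with an even number of tools) goes to "F".
        final.append("F" if n_f >= n_e else "E")
    return final


def tags_to_tokens(sentence, tags):
    """Steps [7]-[8]: every "F" opens a token; each "E" extends the open one."""
    tokens = []
    for char, tag in zip(sentence, tags):
        if tag == "F" or not tokens:
            tokens.append(char)
        else:
            tokens[-1] += char
    return tokens


def ensemble_segment(sentence, segmentations):
    """Fuse several word segmentations of the same sentence into one."""
    tag_sequences = [words_to_tags(words) for words in segmentations]
    return tags_to_tokens(sentence, fuse_tags(tag_sequences))


# Hypothetical outputs of three tools for one term from Tab.1.
sentence = "加气混凝土砌块"
r1 = ["加气", "混凝土", "砌块"]
r2 = ["加气混凝土", "砌块"]
r3 = ["加气", "混凝土砌块"]
print(ensemble_segment(sentence, [r1, r2, r3]))
```

With three tools the vote can never tie, so the tie-breaking rule only matters if the sketch is reused with an even number of segmenters.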
Model LTP Jieba THULAC Ensemble
Accuracy 0.905 0.914 0.918 0.963
Tab.4  Accuracy comparison
| Tag location | Beginning | Inside | Outside |
| Tag representation | B | I | O |
Tab.5  Tag representation
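As a small illustration of the tag scheme in Tab.5, the sketch below turns entity spans over a tokenized sentence into B/I/O tags. The function name and the span convention (start inclusive, end exclusive, counted in tokens) are assumptions made for this example; the token sequence is taken from the sentence shown in Tab.8.

```python
def bio_tags(n_tokens, spans):
    """Return one B/I/O tag per token for the given entity spans.

    spans: iterable of (start, end) token indices, end exclusive.
    """
    tags = ["O"] * n_tokens
    for start, end in spans:
        if not 0 <= start < end <= n_tokens:
            raise ValueError(f"invalid span: ({start}, {end})")
        tags[start] = "B"              # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I"              # remaining tokens of the entity
    return tags


tokens = ["上报", "人工", "挖孔", "桩"]
# Assumed annotation: the entity 人工挖孔桩 covers tokens 1..3.
for token, tag in zip(tokens, bio_tags(len(tokens), [(1, 4)])):
    print(token, tag)
```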
| Types | Example (in Chinese) | Example (in English) |
| Location | 顶层, 一区 | top floor, zone one |
| Building components | 梁, 墙, 10#塔吊 | beam, wall, 10# tower crane |
| Building material | 混凝土, 砖, 钢 | concrete, brick, steel |
Tab.6  Nested NE element types
| Annotator | A | B | C | Average Kappa |
| A |  | 92.8% | 93.9% | 93.4% |
| B | 92.8% |  | 92.3% | 92.6% |
| C | 93.9% | 92.3% |  | 93.1% |
| Average | 93.4% | 92.6% | 93.1% | 93.0% |
Tab.7  Annotation consistency matrix
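The page does not say which kappa statistic Tab.7 reports. Assuming it is pairwise Cohen's kappa over per-token tags, a minimal sketch of the computation looks like this; the two short tag sequences are invented for illustration, while the three pairwise values are copied from the table.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two annotators."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: share of items both annotators labeled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[l] * count_b[l] for l in set(count_a) | set(count_b)) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Invented example: two annotators disagree on one token out of ten.
annotator_1 = list("BIOOOBIOOO")
annotator_2 = list("BIOOOBOOOO")
print(round(cohen_kappa(annotator_1, annotator_2), 3))

# Pairwise values from Tab.7 and their mean.
pairwise = {("A", "B"): 0.928, ("A", "C"): 0.939, ("B", "C"): 0.923}
mean_kappa = sum(pairwise.values()) / len(pairwise)
print(round(mean_kappa, 3))
```

The mean of the three pairwise values agrees with the 93.0% overall average in the table.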
| Word | Tag | Word type | Character type |
|  | O |  |  |
| 施工 | O |  |  |
| 单位 | O |  |  |
| 尽快 | O |  |  |
| 上报 | O | Left-hand indicator |  |
| 人工 | B | Modifier | “工” is a single modifier suffix and “人工” is a double modifier suffix |
| 挖孔 | I | Modifier | “孔” is a single modifier suffix and “挖孔” is a double modifier suffix |
| 桩 | I | Kernel | “桩” is a kernel suffix |
|  | O | Left-hand indicator and right-hand indicator |  |
| 土钉 | B | Modifier | “钉” is a single modifier suffix and “土钉” is a double modifier suffix |
| 墙 | I | Modifier | “墙” is a modifier suffix |
| 锚杆 | I | Kernel | “杆” is a single kernel suffix and “锚杆” is a double kernel suffix |
|  | O | Right-hand indicator |  |
| 变更 | O |  |  |
| 费用 | O |  |  |
|  | O |  |  |
Tab.8  Types of words/characters in a sentence
| Threshold name | Value |
| TH(kernel), TH(modifier) | 3 |
| TH(left_indicator), TH(right_indicator) | 1 |
| TH(s_kernel_suffix), TH(s_modifier_suffix), TH(d_kernel_suffix), TH(d_modifier_suffix) | 10 |
Tab.9  Selected thresholds of statistical feature
| Tags | Precision | Recall | F1-measure |
| B | 0.835 | 0.790 | 0.812 |
| I | 0.892 | 0.816 | 0.853 |
| O | 0.954 | 0.990 | 0.972 |
| Average |  |  | 0.879 |
Tab.10  Performance of the CRF model
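The F1-measures in Tab.10 follow from the listed precision and recall through the harmonic mean F1 = 2PR / (P + R). The short check below recomputes them; treating the reported average as the unweighted mean of the three per-tag F1 values is an assumption, although it is consistent with the numbers. Because the inputs are already rounded to three decimals, the recomputed values can differ from the table in the last digit.

```python
def f1_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# (precision, recall) per tag, copied from Tab.10.
scores = {"B": (0.835, 0.790), "I": (0.892, 0.816), "O": (0.954, 0.990)}

per_tag_f1 = {tag: f1_measure(p, r) for tag, (p, r) in scores.items()}
macro_f1 = sum(per_tag_f1.values()) / len(per_tag_f1)  # unweighted mean

for tag, value in per_tag_f1.items():
    print(f"{tag}: {value:.4f}")
print(f"Average: {macro_f1:.4f}")
```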
Fig.3  Visualization of part of the results.
| Model | Introduced model | Bi-LSTM-CRF | BERT-Bi-LSTM-CRF |
| F1-measure | 0.879 | 0.813 | 0.827 |
Tab.11  Performance comparison
| No. | Features | F1-measure of tag B | F1-measure of tag I | F1-measure of tag O | Average |
| 1 | TF, CF | 0.653 | 0.766 | 0.944 | 0.788 |
| 2 | TF, CF, POSF | 0.742 | 0.809 | 0.957 | 0.836 |
| 3 | TF, WF | 0.745 | 0.814 | 0.975 | 0.845 |
| 4 | TF, WF, POSF | 0.781 | 0.839 | 0.977 | 0.866 |
| 5 | TF, WF, CF | 0.798 | 0.859* | 0.978* | 0.878 |
| 6 | TF, WF, CF, POSF | 0.812* | 0.853 | 0.972 | 0.879* |
Tab.12  F1-measures of different feature combinations
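The abstract's claim that POS information contributes less as more domain features are added can be read directly off the Average column of Tab.12. The sketch below pairs each feature set with its POSF-augmented counterpart and prints the gain; the dictionary layout and the function name are choices made for this illustration only.

```python
# Average F1 per feature combination, copied from Tab.12.
average_f1 = {
    frozenset({"TF", "CF"}): 0.788,
    frozenset({"TF", "CF", "POSF"}): 0.836,
    frozenset({"TF", "WF"}): 0.845,
    frozenset({"TF", "WF", "POSF"}): 0.866,
    frozenset({"TF", "WF", "CF"}): 0.878,
    frozenset({"TF", "WF", "CF", "POSF"}): 0.879,
}


def pos_gains(results):
    """Gain in average F1 from adding POSF to each POSF-free feature set."""
    gains = {}
    for features, score in results.items():
        with_pos = features | {"POSF"}
        if "POSF" not in features and with_pos in results:
            gains[features] = results[with_pos] - score
    return gains


gains = pos_gains(average_f1)
for features, gain in sorted(gains.items(), key=lambda item: -item[1]):
    print(", ".join(sorted(features)), f"+{gain:.3f}")
```

The gain shrinks from 0.048 (TF, CF) to 0.021 (TF, WF) and to 0.001 once both WF and CF are present.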
| Tag location | Weight | Feature |
| L=B | 1.968 | SL=B, pos=b |
| L=B | 1.472 | SL=B, pos=nh |
| L=B | 1.207 | SL=B, s_modifier_suffix |
| L=B | 1.201 | SL=B, pos=ws |
| L=B | 1.191 | SL=B, pos=n |
| L=B | −1.635 | SL=B, pos=q |
| L=I | 3.056 | SL=I, pos=wp |
| L=I | 1.601 | SL=I, kernel |
| L=I | 1.124 | SL=I, pos=nz |
| L=I | −1.296 | SL=I, pos=b |
| L=I | −1.460 | SL=I, right_indicator |
| L=I | −1.574 | SL=I, pos=a |
| L=O | 2.841 | SL=O, pos=p |
| L=O | 2.064 | SL=O, pos=wp |
| L=O | 1.964 | SL=O, pos=nt |
| L=O | −2.168 | SL=O, modifier |
| L=O | −2.454 | SL=O, pos=nh |
| L=O | −3.077 | SL=O, kernel |
Tab.13  Parameters of the SFs (top six)
Fig.4  F1-measures with different amounts of training data.