A multi-level approach to highly efficient recognition of Chinese spam short messages
Weimin WANG1, Dan ZHOU2
1. School of Computer Science & Engineering, Jiangsu University of Science and Technology, Jiangsu 212003, China 2. School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing 100190, China
The problem of spam short message (SMS) recognition involves many aspects of natural language processing. A good solution can not only improve the quality of people's mobile experience but also help advance the analysis of short texts in current mobile applications, such as Webchat and microblogs. Because spam SMSes are characterized by sparsity, transformation, and a real-time nature, we propose three methods at different levels: recognition based on symbolic features, recognition based on text similarity, and recognition based on pattern matching. By combining these methods, we obtain a multi-level approach to spam SMS recognition. To enrich the pattern base while reducing manual labor and time, we propose a quasi-pattern learning method, which utilizes the quasi-pattern matching results produced during the pattern matching process. The method can learn many interesting and new patterns from the SMS corpus. Finally, a comprehensive analysis indicates that our spam SMS recognition approach achieves a precision rate as high as 95.18% and a recall rate of 95.51%.
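The three levels named in the abstract can be pictured as a cascade in which each level either flags a message or passes it on. The following Python sketch is illustrative only: the abstract does not specify the features, similarity measure, thresholds, or patterns, so the URL and symbol-ratio checks, the character-bigram Jaccard similarity, the 0.4 and 0.6 thresholds, and every function name below are assumptions rather than the authors' method.

```python
import re
from typing import Callable, List, Optional

# A level returns True (spam) or None (undecided, pass to the next level).
Level = Callable[[str], Optional[bool]]


def symbolic_level(msg: str) -> Optional[bool]:
    """Hypothetical level 1: flag messages by surface symbols only."""
    if re.search(r"https?://|www\.", msg):
        return True
    # Count characters that are neither letters/digits (incl. CJK) nor spaces.
    symbols = sum(1 for ch in msg if not ch.isalnum() and not ch.isspace())
    if msg and symbols / len(msg) > 0.4:  # assumed threshold
        return True
    return None


def bigrams(text: str) -> set:
    """Character bigrams, which avoid the need for word segmentation."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two messages' character-bigram sets."""
    x, y = bigrams(a), bigrams(b)
    if not x or not y:
        return 0.0
    return len(x & y) / len(x | y)


def make_similarity_level(known_spam: List[str], threshold: float = 0.6) -> Level:
    """Hypothetical level 2: compare against previously seen spam."""
    def level(msg: str) -> Optional[bool]:
        if any(jaccard(msg, s) >= threshold for s in known_spam):
            return True
        return None
    return level


def make_pattern_level(patterns: List[str]) -> Level:
    """Hypothetical level 3: match against a hand-built pattern base."""
    compiled = [re.compile(p) for p in patterns]

    def level(msg: str) -> Optional[bool]:
        if any(p.search(msg) for p in compiled):
            return True
        return None
    return level


def is_spam(msg: str, levels: List[Level]) -> bool:
    """Run the levels in order; the first decisive level wins."""
    for level in levels:
        verdict = level(msg)
        if verdict is not None:
            return verdict
    return False  # no level flagged the message


levels = [
    symbolic_level,
    make_similarity_level(["恭喜您中奖了请回复领取"]),
    make_pattern_level([r"发票", r"中.{0,2}奖"]),
]
print(is_spam("代开发票，价格优惠", levels))
print(is_spam("今晚一起吃饭吗", levels))
```

Ordering the levels from cheapest to most expensive is one plausible reading of "multi-level": most messages would then be settled before the costlier checks run. Whether the paper orders or combines its levels this way is not stated in the abstract.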
Weimin WANG, Dan ZHOU. A multi-level approach to highly efficient recognition of Chinese spam short messages. Front. Comput. Sci., 2018, 12(1): 135-145.