Frontiers of Computer Science

ISSN 2095-2228

ISSN 2095-2236(Online)

CN 10-1014/TP

Front. Comput. Sci. 2014, Vol. 8, Issue 6: 996-1011. https://doi.org/10.1007/s11704-014-3430-1
RESEARCH ARTICLE
Detection of semantically similar code
Tiantian WANG1,*, Kechao WANG1,2, Xiaohong SU1, Peijun MA1
1. School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
2. School of Software, Harbin University, Harbin 150086, China
Abstract

Traditional approaches to similar code detection have limited ability to detect semantically similar code, which impedes their practical application. In this paper, we improve the traditional metrics-based and graph-based approaches and present an approach that combines the two. First, source code is represented as augmented system dependence graphs. Then, metrics-based candidate extraction filters out most of the dissimilar code pairs to lower the computational complexity. Next, code normalization is applied to the candidate similar code to remove code variations, so that similar code can be detected at the semantic level. Finally, program matching is performed on the normalized control dependence trees to output semantically similar code. Experimental results show that our approach can detect similar code in the presence of code variations and that it can be applied to large software.

Keywords: similar code detection, system dependence graph, code normalization, semantically equivalent
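
The metrics-based candidate extraction step described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the four metrics (loops, branches, calls, assignments), the Manhattan distance, the threshold, and the use of Python's `ast` module in place of augmented system dependence graphs. The names `metric_vector`, `candidate_pairs`, and `FRAGMENTS` are hypothetical.

```python
import ast
from itertools import combinations


def metric_vector(func_source):
    """Return (loops, branches, calls, assignments) for one code fragment.

    These four metrics are an illustrative choice, not the paper's metric set.
    """
    loops = branches = calls = assigns = 0
    for node in ast.walk(ast.parse(func_source)):
        if isinstance(node, (ast.For, ast.While)):
            loops += 1
        elif isinstance(node, ast.If):
            branches += 1
        elif isinstance(node, ast.Call):
            calls += 1
        elif isinstance(node, (ast.Assign, ast.AugAssign)):
            assigns += 1
    return (loops, branches, calls, assigns)


def candidate_pairs(fragments, max_distance=3):
    """Keep only fragment pairs whose metric vectors are close.

    Pairs beyond max_distance (Manhattan distance, an assumed measure) are
    dropped cheaply, so a costlier matching stage would see fewer pairs.
    """
    vectors = {name: metric_vector(src) for name, src in fragments.items()}
    pairs = []
    for a, b in combinations(sorted(vectors), 2):
        distance = sum(abs(x - y) for x, y in zip(vectors[a], vectors[b]))
        if distance <= max_distance:
            pairs.append((a, b))
    return pairs


# Hypothetical fragments: two summation variants and one unrelated function.
FRAGMENTS = {
    "sum_for": (
        "def f(xs):\n"
        "    total = 0\n"
        "    for x in xs:\n"
        "        total += x\n"
        "    return total\n"
    ),
    "sum_while": (
        "def g(xs):\n"
        "    s = 0\n"
        "    i = 0\n"
        "    while i < len(xs):\n"
        "        s += xs[i]\n"
        "        i += 1\n"
        "    return s\n"
    ),
    "greet": "def h(name):\n    print('hello', name)\n",
}

if __name__ == "__main__":
    print(candidate_pairs(FRAGMENTS))
```

In this sketch the filter is deliberately loose: it only narrows the set of pairs, and a surviving pair would still need normalization and structural matching before being reported as semantically similar.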
Corresponding Author(s): Tiantian WANG   
Issue Date: 27 November 2014
 Cite this article:   
Peijun MA,Xiaohong SU,Tiantian WANG, et al. Detection of semantically similar code[J]. Front. Comput. Sci., 2014, 8(6): 996-1011.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-014-3430-1
https://academic.hep.com.cn/fcs/EN/Y2014/V8/I6/996