
Front. Comput. Sci.    2025, Vol. 19 Issue (3) : 193309    https://doi.org/10.1007/s11704-024-3380-1
Artificial Intelligence
Behaviour-diverse automatic penetration testing: a coverage-based deep reinforcement learning approach
Yizhou YANG1,2, Longde CHEN3, Sha LIU3, Lanning WANG4, Haohuan FU5, Xin LIU3, Zuoning CHEN3
1. Zhongguancun Laboratory, Beijing 100081, China
2. Zhejiang Lab, Hangzhou 311121, China
3. National Research Centre of Parallel Computer Engineering and Technology, Wuxi 214000, China
4. Faculty of Geographical Science, Beijing Normal University, Beijing 100875, China
5. Department of Earth System Science, Tsinghua University, Beijing 100084, China
Abstract

Reinforcement Learning (RL) is gaining importance in automating penetration testing as it reduces human effort and increases reliability. Nonetheless, given the rapidly expanding scale of modern network infrastructure, the limited testing scale and monotonous strategies of existing RL-based automated penetration testing methods make them less effective in practical applications. In this paper, we present CLAP (Coverage-Based Reinforcement Learning to Automate Penetration Testing), an RL penetration-testing agent that provides comprehensive network security assessments with diverse adversary testing behaviours on a massive scale. CLAP employs a novel neural-network component, the coverage mechanism, to address the enormous and growing action spaces of large networks. It also utilizes a Chebyshev decomposition critic to identify diverse adversary strategies and strike a balance between them. Experimental results across various scenarios demonstrate that CLAP outperforms state-of-the-art methods, further reducing attack operations by nearly 35%. CLAP also provides improved training efficiency and stability, and can effectively perform pen-testing over large-scale networks with up to 500 hosts. The proposed agent is also able to discover Pareto-dominant strategies that are both diverse and effective in achieving multiple objectives.
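As context for the Chebyshev decomposition critic mentioned above: weighted Chebyshev scalarisation is a standard technique from multi-objective optimisation that scores a vector of objective returns by its largest weighted gap to a reference (utopia) point, which discourages neglecting any single objective. The minimal NumPy sketch below illustrates the scalarisation only; the weights, reference point, and objective names are hypothetical, and this is not the authors' implementation.

import numpy as np

def chebyshev_scalarise(returns, weights, utopia_point):
    """Weighted Chebyshev scalarisation: the scalar score is driven by the
    objective that is currently furthest (after weighting) from the utopia
    (reference) point, which discourages neglecting any single objective."""
    returns = np.asarray(returns, dtype=float)
    weights = np.asarray(weights, dtype=float)
    utopia_point = np.asarray(utopia_point, dtype=float)
    weighted_gaps = weights * np.abs(utopia_point - returns)
    return np.max(weighted_gaps)  # minimising this balances all objectives

# Hypothetical example with two objectives (e.g. attack success vs. stealth).
value = chebyshev_scalarise(returns=[0.8, 0.4],
                            weights=[0.5, 0.5],
                            utopia_point=[1.0, 1.0])
print(value)  # 0.3 -> dominated by the lagging second objective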

Keywords network security      penetration testing      reinforcement learning      artificial intelligence     
Corresponding Author(s): Xin LIU   
Just Accepted Date: 28 February 2024   Issue Date: 29 May 2024
 Cite this article:   
Yizhou YANG, Longde CHEN, Sha LIU, et al. Behaviour-diverse automatic penetration testing: a coverage-based deep reinforcement learning approach[J]. Front. Comput. Sci., 2025, 19(3): 193309.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-024-3380-1
https://academic.hep.com.cn/fcs/EN/Y2025/V19/I3/193309
Fig.1  Penetration testing as a sequential decision making process
Fig.2  Architecture of the proposed method. The attack agent's observations are fed through separate MLP extractors into the actor-critic network and the RND module, respectively. CLAP's different neural network components are highlighted with coloured boxes. The coverage score C_out is fused with the logits output a_out of the actor network. (a) System architecture; (b) fusion layer
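Fig. 2(b) states only that the coverage score C_out is fused with the actor logits a_out; the exact fusion operator is not specified in this excerpt. The PyTorch sketch below is a minimal illustration assuming a simple additive fusion ahead of the softmax, with hypothetical feature and action dimensions; it should not be read as the paper's actual fusion layer.

import torch
import torch.nn as nn

class FusedActorHead(nn.Module):
    """Toy actor head that fuses a per-action coverage score with the actor
    logits. The additive fusion and the meaning of the coverage input are
    assumptions; the paper's fusion layer may combine the signals differently."""

    def __init__(self, feature_dim, num_actions):
        super().__init__()
        self.logit_head = nn.Linear(feature_dim, num_actions)     # produces a_out
        self.coverage_head = nn.Linear(num_actions, num_actions)  # produces C_out (|A| -> |A|, as in Tab. 2)

    def forward(self, features, per_action_input):
        a_out = self.logit_head(features)             # raw action logits
        c_out = self.coverage_head(per_action_input)  # coverage score for each action
        fused = a_out + c_out                         # assumed fusion: element-wise sum
        return torch.softmax(fused, dim=-1)           # action probabilities

# Hypothetical dimensions: 256-dim extracted features, 50 discrete actions.
head = FusedActorHead(feature_dim=256, num_actions=50)
probs = head(torch.randn(1, 256), torch.zeros(1, 50))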
Fig.3  An illustration of vectorised host information, encompassing the physical address of the host, agent reachability, and other host attributes
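As a rough counterpart to Fig. 3, the sketch below assembles a flat observation vector for a single host from a one-hot address, reachability/compromise flags, and a multi-hot service indicator. The field layout, dimensions, and flags are illustrative assumptions; as noted below, the actual configuration depends on the pen-testing simulator.

import numpy as np

def encode_host(address_id, num_addresses, reachable, compromised,
                services_present, num_services):
    """Build a flat observation vector for one host: one-hot address,
    two status flags, and a multi-hot indicator of discovered services.
    The layout is illustrative only and varies across simulators."""
    address = np.zeros(num_addresses)
    address[address_id] = 1.0
    flags = np.array([float(reachable), float(compromised)])
    services = np.zeros(num_services)
    services[list(services_present)] = 1.0
    return np.concatenate([address, flags, services])

# Hypothetical host: address 3 of 35, reachable, not yet compromised,
# running services 0 and 4 out of 50 known services.
obs = encode_host(3, 35, True, False, [0, 4], 50)
print(obs.shape)  # (87,)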

The specific configurations presented here vary depending on the pen-testing simulator used.

Scenario Hosts OS Services Processes
NASim-Pocp1 35 2 50 2
NASim-200 200 2 5 3
NASim-500 500 3 7 3
Tab.1  Benchmark network scenarios
Module NN type Layers Activation Hidden size Input shape Output shape
Critic MLP 4 Tanh 256 |S| 1
Actor MLP 3 Tanh 256 |S| |A|
Coverage MLP 3 Tanh 256 |A| |A|
Cw_learner MLP 1 Tanh 128 |S| |A|
Aw_learner MLP 1 Tanh 128 |S| |A|
RND Predictor MLP 3 Tanh 256 |S| 256
RND Target MLP 3 Tanh 256 |S| 256
Tab.2  Neural network module configurations
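For readers who want to reproduce the module shapes in Tab. 2, the helper below builds an MLP in the same style (a given number of linear layers, Tanh activations, a shared hidden width, and explicit input/output sizes). It is a generic PyTorch sketch, not the authors' code, and the example state dimension |S| is hypothetical.

import torch.nn as nn

def make_mlp(num_layers, in_dim, hidden, out_dim):
    """Build an MLP in the Tab. 2 style: `num_layers` linear layers with Tanh
    activations between them, a shared hidden width, and a final linear
    projection to `out_dim`."""
    dims = [in_dim] + [hidden] * (num_layers - 1) + [out_dim]
    layers = []
    for i in range(num_layers):
        layers.append(nn.Linear(dims[i], dims[i + 1]))
        if i < num_layers - 1:
            layers.append(nn.Tanh())
    return nn.Sequential(*layers)

# Example: the critic row of Tab. 2 (4 layers, Tanh, hidden size 256, scalar value
# output), assuming a hypothetical state dimension |S| of 1024.
critic = make_mlp(num_layers=4, in_dim=1024, hidden=256, out_dim=1)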
Hyperparameter Default value
RL Hyperparameters
Learning Rate 2.5×10−4
Num Steps 512
Gamma 0.99
GAE Lambda 0.95
Batch Size 4096
Update Epochs 4
Clip Coefficient 0.2
Entropy Coefficient 0.01
Value Function Coefficient 0.5
Max Grad Norm 0.5
RND Hyperparameters
Intrinsic Reward Coefficient 1.0
Extrinsic Reward Coefficient 2.0
Intrinsic Reward Gamma 0.99
Tab.3  Hyperparameters configurations
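The RND-related entries in Tab. 3 follow the usual random network distillation recipe: the curiosity bonus is the prediction error of a trained predictor against a frozen, randomly initialised target network, and it is mixed with the environment reward using the intrinsic/extrinsic coefficients. The sketch below shows that mixing with the Tab. 3 values; discounting each stream with its own gamma and combining at the advantage level, as is common in RND implementations, is omitted for brevity.

import torch

INTRINSIC_COEF = 1.0   # Intrinsic Reward Coefficient from Tab. 3
EXTRINSIC_COEF = 2.0   # Extrinsic Reward Coefficient from Tab. 3

def intrinsic_reward(predictor, target, obs):
    """RND curiosity bonus: squared prediction error between a trained predictor
    network and a frozen, randomly initialised target network on the same
    observation (both map |S| to a 256-dim embedding in Tab. 2)."""
    with torch.no_grad():
        target_feat = target(obs)
    pred_feat = predictor(obs)
    return ((pred_feat - target_feat) ** 2).mean(dim=-1)

def combined_reward(extrinsic, intrinsic):
    """Mix the environment reward with the curiosity bonus using the Tab. 3
    coefficients; a simplified view of how the two reward streams are weighted."""
    return EXTRINSIC_COEF * extrinsic + INTRINSIC_COEF * intrinsic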
Fig.4  Training performance of different methods across various NASim scenarios.

In some panels of Fig. 4, baselines are not compared due to their poor performance under these scenarios.

(a) Training performance in the NASim-Pocp1 network scenario (left: number of actions; right: rewards); (b) training performance in the NASim-200-host network scenario (left: number of actions; right: rewards); (c) training performance in the NASim-500-host network scenario (left: number of actions; right: rewards)
Fig.5  Mean episodic length for different methods
Fig.6  Ablation study on different modules of CLAP in the nasim:Pocp2Gen network
Fig.7  Distribution of attack strategies under multi-objective settings
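The Pareto-dominant strategies referred to in the abstract and visualised in Fig. 7 can be identified with a standard dominance filter over each strategy's per-objective returns. The NumPy sketch below uses the textbook definition, assuming higher is better for every objective; the example objectives and scores are hypothetical.

import numpy as np

def pareto_front(points):
    """Return a boolean mask marking the non-dominated rows of `points`, where
    each row holds one strategy's returns on several objectives (higher is
    better). A row is dominated if another row is >= on every objective and
    strictly > on at least one."""
    n = points.shape[0]
    non_dominated = np.ones(n, dtype=bool)
    for i in range(n):
        for j in range(n):
            if i != j and np.all(points[j] >= points[i]) and np.any(points[j] > points[i]):
                non_dominated[i] = False
                break
    return non_dominated

# Hypothetical strategies scored on (success rate, stealth).
scores = np.array([[0.9, 0.2], [0.6, 0.6], [0.5, 0.5], [0.3, 0.9]])
print(pareto_front(scores))  # [ True  True False  True]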