Benchmarking the Capabilities of Large Language Models in Penetration Testing Tasks

Alexander Mikhaylovich Kuzmin; Dmitry Vladimirovich Latokhin; Alexander Arturovich Anistratenko; Eugene Sergeevich Afonin; Alexander Victorovich Liskin; Anton Mikhailovich Ivanov

doi:10.25559/SITITO.022.202601.13-27

Alexander Mikhaylovich Kuzmin, PJSC Sberbank, http://orcid.org/0000-0001-8352-5811
Dmitry Vladimirovich Latokhin, JSC Kaspersky Lab, http://orcid.org/0009-0006-8158-2218
Alexander Arturovich Anistratenko, PJSC Sberbank, http://orcid.org/0009-0006-4110-2080
Eugene Sergeevich Afonin, PJSC Sberbank, http://orcid.org/0009-0002-3737-149X
Alexander Victorovich Liskin, JSC Kaspersky Lab, http://orcid.org/0009-0006-2029-9136
Anton Mikhailovich Ivanov, JSC Kaspersky Lab, http://orcid.org/0009-0000-4753-9132

Abstract

The rapid development of large language models (LLMs) has created unprecedented opportunities for automating cybersecurity tasks, while also introducing significant risks from the use of these technologies by attackers. Organizations increasingly seek to assess whether LLMs can be adopted safely and effectively in their information security processes, yet a comprehensive standard for evaluating LLM capabilities in the context of penetration testing is still lacking. This paper presents a new two-dimensional benchmarking framework that separately evaluates the ability of LLM-based agents to plan attacks and to execute them. Two complementary benchmarks are proposed: CSL-Benchmark, which evaluates strategic planning using 846 curated attack steps derived from real penetration testing engagements, and K-Benchmark, which evaluates practical execution in real Docker environments implementing 34 MITRE ATT&CK techniques in the Initial Access, Persistence, and Privilege Escalation categories. Both benchmarks are grounded in the MITRE ATT&CK Enterprise matrix and use LLM-based evaluators to provide consistent success signals. Eleven state-of-the-art language models were evaluated; the best models achieve a 78.22% success rate on planning tasks (Claude Sonnet 4.5) and 76.47% on execution tasks (GPT-5, Qwen 3 Max). The analysis identified critical failure modes, including hallucinations, incorrect tool use, loss of context, and scope violations. The results demonstrate the considerable potential of LLMs for automating penetration testing, while showing that they remain unsuitable for fully autonomous deployment.

About the authors

Alexander Mikhaylovich Kuzmin, PJSC Sberbank

Executive Director, Cybersecurity Laboratory

Dmitry Vladimirovich Latokhin, JSC Kaspersky Lab

Head of the Software Classification Group

Alexander Arturovich Anistratenko, PJSC Sberbank

Head of Unit, Cybersecurity Laboratory

Eugene Sergeevich Afonin, PJSC Sberbank

Executive Director, Cybersecurity Laboratory

Alexander Victorovich Liskin, JSC Kaspersky Lab

Head of Threat Research

Anton Mikhailovich Ivanov, JSC Kaspersky Lab

Chief Technology Officer, Research and Development Department
