Which AI is best for cyber security?

n today’s digital battlefield, the potential of Large Language Models (LLMs) to augment cybersecurity operations is rapidly gaining recognition. However, selecting the best model and properly harnessing its capabilities requires more than just general-purpose benchmarks. Recent research spanning multiple studies—from SECURE and SecBench to Pentest Copilot and CyberSecEval—has provided an in-depth look into how different LLMs perform on cybersecurity-specific tasks. This blog post collates these findings into a professional guide that not only reviews benchmark results but also offers actionable prompt examples for real-world security applications.

by Kathan Desai

March 10, 2025

Benchmarking LLMs for Cybersecurity: What the Research Tells Us

1. Evaluating Cybersecurity Capabilities

Multiple studies have been conducted to assess LLM performance on tasks such as vulnerability analysis, threat intelligence extraction, and ethical penetration testing. Key findings include:

SECURE Benchmark:
Researchers introduced SECURE—a benchmark specifically designed for Industrial Control Systems (ICS) cybersecurity. SECURE focuses on three core capabilities:
- Knowledge Extraction: Evaluates the LLM’s ability to recall factual security details (e.g., MITRE ATT&CK patterns, CWE entries).
- Understanding: Assesses the model’s capacity to comprehend complex cybersecurity reports, such as recent CVE descriptions.
- Reasoning: Measures how well the LLM can apply its knowledge to propose actionable security recommendations.
The study compared commercial models (e.g., ChatGPT-4, Gemini-Pro) and open-source models (e.g., Llama3-70B) and found that while closed-source models generally excel in overall performance, well-tuned open-source variants can perform comparably on certain extraction and reasoning tasks.
Pentest Copilot Whitepaper:
In “Hacking, The Lazy Way: LLM Augmented Pentesting,” the authors demonstrated how the GPT-4-turbo model can be integrated into penetration testing workflows. Key contributions include:
- Enhanced Context Handling: GPT-4-turbo, with its ability to process up to 128k tokens, significantly outperforms earlier models in handling complex pentest scenarios.
- Prompt Engineering: Custom system prompts enable the model to assume the role of an ethical hacking assistant, ensuring responses are accurate and well-structured.
- Retrieval-Augmented Generation (RAG): Integrating RAG minimizes hallucinations by ensuring the model accesses up-to-date cybersecurity information.
SecBench & CS-Eval:
Other benchmarking efforts like SecBench and CS-Eval provide multi-dimensional datasets with thousands of multiple-choice and short-answer questions. These benchmarks cover a broad spectrum—from basic cybersecurity principles to advanced CTI (Cyber Threat Intelligence) analysis. Their findings stress that:
- Task-Specific Performance: While models like GPT-4-turbo lead in accuracy and context handling, the performance varies across subdomains (e.g., vulnerability severity prediction vs. threat actor attribution).
- Domain Fine-Tuning: Open-source models (e.g., Llama3-70B) show notable promise when fine-tuned on cybersecurity corpora using techniques like Retrieval-Augmented Generation.
CyberSecEval & CYBERSECEVAL:
These benchmarks extend the evaluation by incorporating risk assessment, prompt injection resistance, and ethical considerations. They reveal that:
- Safety & Compliance: Even leading models may have vulnerabilities like prompt hijacking or biased outputs. A model’s overall “safety score” is becoming as important as its raw performance.
- Regulatory Readiness: With emerging regulations such as the EU AI Act, ensuring that an LLM adheres to compliance standards is critical for real-world deployments.

2. Key Takeaways

GPT-4-turbo stands out for its exceptional context length, accuracy, and structured output—making it a strong candidate for complex cybersecurity tasks.
Open-source models like Llama3-70B and Mixtral-8x7B offer competitive performance when fine-tuned with domain-specific data, and they provide flexibility for organizations with in-house expertise.
Prompt engineering and RAG techniques are vital to bridge the gap between static knowledge and dynamic real-world cybersecurity needs, ensuring that models remain both accurate and current.

Practical Prompts for Leveraging LLMs in Cybersecurity

Below are several detailed prompts that you can use to tap into LLM capabilities for different security tasks. These prompts have been refined based on benchmarking insights and are designed to enhance both accuracy and context-awareness.

A. Vulnerability Assessment & Mitigation

Use this prompt to analyze vulnerability descriptions and recommend mitigations:

1You are a cybersecurity expert specializing in vulnerability assessments. Analyze the following CVE description and provide a report that includes:
21. A concise summary of the vulnerability.
32. The potential impact on the system.
43. Recommended mitigation strategies based on industry standards (e.g., MITRE ATT&CK, CVSS).
5Ensure your response is structured clearly and ends with a CVSS vector string in the format: CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H.
6
7CVE Description: [Insert vulnerability description here]
8
9

B. Ethical Pentesting Command Generation

This prompt helps generate structured commands for ethical hacking:

1You are an ethical hacking assistant. Given the following scenario, generate a sequence of commands to exploit a directory traversal vulnerability on a web server. Your response must be in JSON format with the keys: "Command", "Purpose", and "Expected Outcome". Ensure that each command logically follows the previous steps in a typical pentest workflow.
2
3Scenario: A web server running at IP 192.168.1.100 is suspected to be vulnerable to directory traversal. The goal is to enumerate and access sensitive directories.
4
5Example JSON Output:
6{
7  "Command": "curl http://192.168.1.100/../../etc/passwd",
8  "Purpose": "Attempt to read the /etc/passwd file",
9  "Expected Outcome": "Display file contents if vulnerability exists"
10}
11
12Provide your command sequence now.
13
14

C. Threat Intelligence Report Summarization

Leverage LLMs to extract and synthesize key insights from a threat report:

1You are a cyber threat intelligence analyst. Summarize the following threat report and identify the main tactics and techniques used by the threat actor according to the MITRE ATT&CK framework. Your summary should include:
2- A brief description of the threat actor’s behavior.
3- Identification of key tactics (e.g., Initial Access, Execution) and techniques (e.g., spear-phishing, C2 communication).
4- Actionable recommendations to mitigate the identified threats.
5
6Threat Report: [Insert full threat report text here]
7
8

D. Security Tool Recommendation

Ask the LLM to recommend security tools based on an organization's current vulnerabilities:

1You are a cybersecurity consultant. A mid-sized organization has experienced multiple phishing attacks and has weak endpoint security. Recommend a set of security tools and best practices to improve their cybersecurity posture. Your response should be structured as bullet points, covering specific product suggestions and actionable steps.
2
3Requirements:
4- Tools for anti-phishing and endpoint protection.
5- Steps to integrate these tools with existing systems.
6- Short explanations for each recommendation.
7
8

E. Cyber Threat Actor Attribution

Use this prompt to help identify threat actors from anonymized CTI reports:

1You are a cyber threat intelligence analyst. Given the anonymized threat report below, attribute the incident to a known threat actor. Consider the techniques, tactics, and procedures (TTPs) described in the report, and justify your conclusion. Your response should include:
2- A summary of the key indicators.
3- The most likely threat actor or group (including any known aliases).
4- A brief explanation for your attribution based on the MITRE ATT&CK framework.
5
6Threat Report: [Insert anonymized threat report text here]
7
8

Final Thoughts

The convergence of benchmarking research and prompt engineering is ushering in a new era for cybersecurity applications of LLMs. Recent benchmarks have demonstrated that while models like GPT-4-turbo are highly effective for complex cybersecurity tasks, open-source models—when fine-tuned and integrated with RAG techniques—also show significant promise. By utilizing tailored prompts, cybersecurity professionals can not only automate routine tasks like vulnerability assessments and pentesting but also gain deeper insights through advanced threat intelligence analysis.

As these LLMs continue to evolve, maintaining a balance between automation, human oversight, and adherence to safety protocols will be key to building robust, reliable security systems. The future of cybersecurity is increasingly intertwined with AI, and understanding these benchmarks and prompt strategies is essential for anyone looking to leverage LLMs in the fight against cyber threats.

Would you like more technical details or additional prompt examples? Feel free to reach out directly.