OpenAI Launches EVMbench to Detect, Patch, and Exploit Vulnerabilities in Blockchain Environments

Original news source: cybersecuritynews.com

OpenAI EVMbench

OpenAI, in collaboration with crypto investment firm Paradigm, has introduced EVMbench, a new benchmark designed to evaluate the ability of AI agents to detect, patch, and exploit high-severity vulnerabilities in smart contracts.

The release marks a significant step in measuring AI capabilities within economically consequential environments, as smart contracts routinely secure over $100 billion in open-source crypto assets.

EVMbench draws on 120 curated vulnerabilities sourced from 40 security audits, with the majority derived from open code audit competitions on platforms such as Code4rena.

The benchmark also incorporates vulnerability scenarios from the security auditing process of the Tempo blockchain, a purpose-built Layer 1 designed for high-throughput stablecoin payments. This extends EVMbench’s scope into payment-oriented smart contract code, an area where agentic stablecoin transactions are expected to grow substantially.

Three Evaluation Modes

EVMbench evaluates AI agents across three distinct capability modes, each targeting a different phase of the smart contract security lifecycle.

Detect: Agents audit a smart contract repository and are scored on recall of ground-truth vulnerabilities and associated audit rewards.

Patch: Agents modify vulnerable contracts while preserving intended functionality, verified through automated tests and exploit checks.

Exploit: Agents execute end-to-end fund-draining attacks against deployed contracts in a sandboxed blockchain environment, graded via transaction replay and on-chain verification.
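The detect mode's recall-based scoring can be illustrated with a minimal sketch. The function and vulnerability identifiers below are hypothetical, not taken from the actual EVMbench grader:

```python
# Hypothetical sketch of detect-mode scoring: recall of ground-truth
# vulnerabilities surfaced by an agent's audit. Identifiers are
# illustrative, not from the real EVMbench harness.

def detect_recall(reported: set[str], ground_truth: set[str]) -> float:
    """Fraction of known vulnerabilities the agent's audit surfaced."""
    if not ground_truth:
        return 0.0
    return len(reported & ground_truth) / len(ground_truth)

# Example: the agent finds two of three seeded vulnerabilities;
# a false positive ("gas-griefing") does not affect recall.
ground_truth = {"reentrancy-withdraw", "unchecked-transfer", "oracle-staleness"}
reported = {"reentrancy-withdraw", "unchecked-transfer", "gas-griefing"}

print(detect_recall(reported, ground_truth))  # 2/3 ≈ 0.667
```

A recall metric like this rewards completing the full audit rather than stopping at the first finding, which is exactly the failure mode OpenAI reports for detect tasks.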

To support reproducible evaluation, OpenAI developed a Rust-based harness that deploys contracts deterministically and restricts unsafe RPC methods. All exploit tasks run in an isolated local Anvil environment rather than on live networks.
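One way such RPC restriction could work is an allowlist filter in front of the local node. The sketch below is a hypothetical illustration in Python; the actual EVMbench harness is written in Rust, and the specific method list is an assumption:

```python
# Hypothetical sketch of restricting unsafe JSON-RPC methods before they
# reach a local Anvil node. The allowlist is illustrative only; it is not
# the method set used by the real EVMbench harness.

ALLOWED_RPC_METHODS = {
    "eth_call",
    "eth_sendRawTransaction",
    "eth_getBalance",
    "eth_getTransactionReceipt",
    "eth_blockNumber",
}

def is_rpc_allowed(method: str) -> bool:
    """Reject methods that could break determinism or escape the sandbox."""
    return method in ALLOWED_RPC_METHODS

print(is_rpc_allowed("eth_call"))       # True
print(is_rpc_allowed("anvil_setCode"))  # False: mutates chain state out of band
```

Filtering at the RPC boundary keeps exploit runs deterministic and replayable, since agents can only interact with the chain through ordinary transactions and reads.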

Frontier model performance on EVMbench reveals clear behavioral differences across task types. In the exploit mode, GPT‑5.3‑Codex achieved a score of 72.2%, a substantial improvement over GPT‑5, which scored 31.9% approximately six months prior.

Agents consistently perform best on exploit tasks, where the objective is explicit: drain funds and iterate until successful. Detect and patch modes remain harder, with agents sometimes stopping after identifying a single vulnerability rather than completing a full audit, and struggling to remove subtle flaws without breaking existing contract functionality.

OpenAI acknowledged that EVMbench does not fully reflect the difficulty of real-world smart contract security, and that its grading system cannot currently distinguish between true vulnerabilities and false positives when agents find issues beyond the human-auditor baseline.

Alongside the benchmark release, OpenAI committed $10 million in API credits through its Cybersecurity Grant Program to accelerate defensive security research, particularly for open-source software and critical infrastructure.

The company also announced the expansion of Aardvark, its security research agent, through a private beta program. EVMbench’s tasks, tooling, and evaluation framework have been released publicly to support continued research into AI-driven cyber capabilities.
