OASB: Open Agent Security Benchmark

Why it exists

Every agent-security tool claims to catch attacks, but buyers have no neutral way to check. The endpoint-security industry solved this with MITRE ATT&CK Evaluations: run a fixed set of real attacks against each product and publish what it caught. OASB does the same for agent security: a vendor-neutral, reproducible measurement of detection coverage.

Note

OASB measures the tools, not the agents

The Threat Matrix catalogs the attacks. OASB runs them against a scanner and scores how many it detects. It is the benchmark a security tool is held to, distinct from ABGS / OASB-2, which audits an agent's declared governance.

222 attack scenarios, 10 categories

The benchmark is a fixed corpus of attack scenarios spanning the full attack surface, from process and network behavior to the AI layer itself and multi-step chained attacks.

Multi-step: chained attacks
AI-layer: prompt injection, jailbreak
Filesystem
Intelligence: detection machinery
Process
Network
Enforcement: response actions
App-hooks: E2E interceptors
Baseline: benign behavior
Real-OS: E2E monitors

Attack scenarios: 222

Category breakdown from the OASB repository (v0.3.2, status: stable). The full test run is 245 tests: 222 attack scenarios plus 23 scoring-engine unit tests.

How a tool is scored

Each scenario is run against the tool under test and the result is tallied as a confusion matrix, yielding standard detection metrics. The benign Baseline scenarios matter as much as the attacks: a tool that flags everything is as useless as one that flags nothing.

Detection rate (recall): Of the real attacks, how many were caught? TP / (TP + FN).
False-positive rate: Of the benign cases, how many were wrongly flagged? FP / (FP + TN).
Precision: Of everything flagged, how much was a real attack? TP / (TP + FP).
F1 score: The harmonic mean of precision and recall, one number balancing both.
P95 latency: 95th-percentile detection time, in milliseconds.
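
The metrics above can be sketched in a few lines of Python. This is a minimal illustration, not OASB's scoring engine: the `ConfusionMatrix` class, the `p95_latency_ms` helper, the nearest-rank percentile method, and all counts are assumptions made for the example.

```python
from dataclasses import dataclass


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


@dataclass(frozen=True)
class ConfusionMatrix:
    """Hypothetical tally of one benchmark run (not OASB's actual schema)."""
    tp: int  # attack scenarios the tool flagged
    fn: int  # attack scenarios the tool missed
    fp: int  # benign scenarios the tool wrongly flagged
    tn: int  # benign scenarios the tool correctly passed

    @property
    def recall(self) -> float:
        # Detection rate: TP / (TP + FN)
        return _ratio(self.tp, self.tp + self.fn)

    @property
    def false_positive_rate(self) -> float:
        # FP / (FP + TN)
        return _ratio(self.fp, self.fp + self.tn)

    @property
    def precision(self) -> float:
        # TP / (TP + FP)
        return _ratio(self.tp, self.tp + self.fp)

    @property
    def f1(self) -> float:
        # Harmonic mean of precision and recall; 0.0 if both are zero.
        p, r = self.precision, self.recall
        return 2 * p * r / (p + r) if (p + r) else 0.0


def p95_latency_ms(samples: list[float]) -> float:
    """95th-percentile latency using the nearest-rank method (an assumption)."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceiling of 0.95 * n
    return ordered[rank - 1]


# Illustrative counts only; these are not measured OASB results.
balanced = ConfusionMatrix(tp=80, fn=20, fp=5, tn=95)
flag_everything = ConfusionMatrix(tp=100, fn=0, fp=100, tn=0)

for name, cm in [("balanced", balanced), ("flag everything", flag_everything)]:
    print(f"{name}: recall={cm.recall:.3f} fpr={cm.false_positive_rate:.3f} "
          f"precision={cm.precision:.3f} f1={cm.f1:.3f}")
```

The `flag_everything` case shows why the benign scenarios matter: recall is perfect, but the false-positive rate is also 100%, and precision drops accordingly.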

An example scorecard

One measured result from the repository: HackMyAgent 0.23.8 (full pipeline) over the 4,245-sample labeled corpus, re-run 2026-06-05. Shown to illustrate the output shape; numbers are specific to that tool and version. Source: oasb § Latest Results.

F1 score: 82.9%
Precision: 83.2%
Recall: 82.6%
False-positive rate: 1.16%
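
As a consistency check, the reported precision and recall can be plugged into the harmonic-mean definition of F1. The result agrees with the 82.9% headline figure to one decimal place; the inputs are the rounded percentages shown here, so the last digits are approximate.

```latex
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 0.832 \times 0.826}{0.832 + 0.826}
    = \frac{1.374464}{1.658}
    \approx 0.829
```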

Caution

Verdicts count attacks, not posture

OASB's verdict counts high/critical attack findings. Posture findings (a missing governance file, wildcard tool access) are surfaced to the user but excluded from the malicious verdict, because they fire on benign and malicious agents alike. That distinction is what keeps the false-positive rate honest.
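
The verdict rule described above can be sketched as a simple filter. This is an assumption-laden illustration rather than OASB's implementation: the `Finding` fields, the `"attack"` / `"posture"` labels, and the severity names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """Hypothetical finding record; field names are not OASB's schema."""
    kind: str      # "attack" or "posture"
    severity: str  # "low", "medium", "high", or "critical"
    title: str


# Only these severities count toward the malicious verdict.
VERDICT_SEVERITIES = {"high", "critical"}


def is_malicious(findings: list[Finding]) -> bool:
    """True only if at least one high/critical *attack* finding is present."""
    return any(
        f.kind == "attack" and f.severity in VERDICT_SEVERITIES
        for f in findings
    )


def posture_findings(findings: list[Finding]) -> list[Finding]:
    """Posture findings are still reported, just not counted in the verdict."""
    return [f for f in findings if f.kind == "posture"]


# A benign agent with weak posture: reported, but not judged malicious.
benign_agent = [
    Finding("posture", "high", "missing governance file"),
    Finding("posture", "medium", "wildcard tool access"),
]

# The same posture issues plus a critical attack finding.
malicious_agent = benign_agent + [
    Finding("attack", "critical", "prompt injection"),
]

print(is_malicious(benign_agent), len(posture_findings(benign_agent)))
print(is_malicious(malicious_agent), len(posture_findings(malicious_agent)))
```

Keeping the posture list separate from the verdict means a benign agent with a missing governance file does not add a false positive to the confusion matrix, while the user still sees the issue.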

The Skills Security controls

Alongside the attack corpus, OASB defines a 10-item Skills Security checklist (SS-01 through SS-10) covering argument validation, output integrity, least-privilege scope, signed manifests, audit logging, dependency provenance, graceful degradation, and more, tiered L1 → L3.

Anchored to the standards everyone uses

Every scenario maps to MITRE ATLAS (15 techniques) and the OWASP LLM/Agentic Top 10, so results are comparable to the wider security world rather than living in a silo. OASB also ships a DVAA comparison (70 scenarios) for agent-level evaluation.

If you know ATT&CK Evaluations

MITRE ATT&CK Evaluations → OASB

Fixed adversary emulation → 222 attack scenarios (reproducible corpus)
Run against each EDR/SIEM → Run against each scanner (tool under test)
Detection coverage report → F1 / recall / FPR scorecard (confusion matrix)
ATT&CK technique mapping → MITRE ATLAS + OWASP mapping (comparable, not siloed)

A vendor-neutral detection benchmark for agent-security tools rather than endpoint-security products.