OpenA2A /specs
OASB 0.3.2 Stable

Open Agent Security Benchmark

222 attack scenarios across 10 categories that measure a security tool's detection coverage against agent threats, ATT&CK-Evaluations style, mapped to MITRE ATLAS and OWASP.

The question it answers

Does a security tool actually catch agent attacks?

If you know one thing already

It’s like MITRE ATT&CK Evaluations.

Why it exists

Every agent-security tool claims to catch attacks, but buyers had no neutral way to check. The endpoint-security industry solved this with MITRE ATT&CK Evaluations: run a fixed set of real attacks against each product and publish what it caught. OASB is the same idea for agent security: a vendor-neutral, reproducible measurement of detection coverage.

Note

OASB measures the tools, not the agents

The Threat Matrix catalogs the attacks. OASB runs them against a scanner and scores how many it detects. It is the benchmark a security tool is held to, distinct from ABGS / OASB-2, which audits an agent's declared governance.

222 attack scenarios, 10 categories

The benchmark is a fixed corpus of attack scenarios spanning the full attack surface, from process and network behavior to the AI layer itself and multi-step chained attacks.

Multi-step (chained attacks): 43
AI-layer (prompt injection, jailbreak): 40
Filesystem: 28
Intelligence (detection machinery): 21
Process: 19
Network: 18
Enforcement (response actions): 18
App-hooks (E2E interceptors): 14
Baseline (benign behavior): 12
Real-OS (E2E monitors): 9
Total attack scenarios: 222
Category breakdown from the OASB repository (v0.3.2, status: stable). The full test run is 245 tests: 222 attack scenarios plus 23 scoring-engine unit tests.

How a tool is scored

Each scenario is run against the tool under test and the result is tallied as a confusion matrix, yielding standard detection metrics. The benign Baseline scenarios matter as much as the attacks: a tool that flags everything is as useless as one that flags nothing.

Detection rate (recall): Of the real attacks, how many were caught? TP / (TP + FN).
False-positive rate: Of the benign cases, how many were wrongly flagged? FP / (FP + TN).
Precision: Of everything flagged, how much was a real attack? TP / (TP + FP).
F1 score: The harmonic mean of precision and recall, 2 × precision × recall / (precision + recall). One number balancing both.
P95 latency: 95th-percentile detection time, in milliseconds.
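
The metric definitions above can be sketched in a few lines of Python. This is an illustrative sketch, not OASB's scoring engine: the names `ConfusionMatrix`, `detection_metrics`, and `p95_latency_ms` are hypothetical, the counts in the demo are made up, and the nearest-rank percentile method is an assumption, since the percentile convention OASB uses is not stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    tp: int  # attack scenarios the tool flagged
    fn: int  # attack scenarios the tool missed
    fp: int  # benign scenarios the tool flagged
    tn: int  # benign scenarios the tool left alone


def _ratio(numerator: float, denominator: float) -> float:
    # Assumption: report 0.0 rather than raising when a denominator is empty.
    return numerator / denominator if denominator else 0.0


def detection_metrics(cm: ConfusionMatrix) -> dict[str, float]:
    recall = _ratio(cm.tp, cm.tp + cm.fn)
    fpr = _ratio(cm.fp, cm.fp + cm.tn)
    precision = _ratio(cm.tp, cm.tp + cm.fp)
    f1 = _ratio(2 * precision * recall, precision + recall)
    return {"recall": recall, "fpr": fpr, "precision": precision, "f1": f1}


def p95_latency_ms(latencies_ms: list[float]) -> float:
    # Nearest-rank percentile (an assumption): the smallest sample such that
    # at least 95% of all samples are less than or equal to it.
    if not latencies_ms:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    careful_tool = ConfusionMatrix(tp=8, fn=2, fp=1, tn=9)
    flags_everything = ConfusionMatrix(tp=10, fn=0, fp=10, tn=0)
    print(detection_metrics(careful_tool))
    print(detection_metrics(flags_everything))
```

The second demo matrix shows why the benign Baseline scenarios matter: a tool that flags everything reaches a recall of 1.0, but its false-positive rate is also 1.0.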

An example scorecard

One measured result from the repository: HackMyAgent's full pipeline over 4,245 labeled samples. It is shown to illustrate the output shape; the numbers are specific to that tool and version.

F1: 82.9%
Precision: 83.2%
Recall: 82.6%
False-positive rate: 1.16%
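
As a quick consistency check, the reported F1 can be recomputed from the reported precision and recall, since F1 is their harmonic mean. The inputs below are the rounded percentages from this scorecard, so the check is only good to one decimal place.

```python
# Rounded values taken from the scorecard above.
precision = 0.832
recall = 0.826

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"{f1:.1%}")  # 82.9%
```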
Caution

Verdicts count attacks, not posture

OASB's verdict counts only high- and critical-severity attack findings. Posture findings, such as a missing governance file or wildcard tool access, are surfaced to the user but excluded from the malicious verdict, because they fire on benign and malicious agents alike. That distinction is what keeps the false-positive rate honest.
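
A minimal sketch of that rule, under stated assumptions: the `Finding` type, its `kind` and `severity` labels, and the function names are hypothetical, not OASB's data model, and treating one or more counted findings as a malicious verdict is an assumed threshold.

```python
from dataclasses import dataclass

# Severities that count toward the verdict, per the rule described above.
VERDICT_SEVERITIES = {"high", "critical"}


@dataclass(frozen=True)
class Finding:
    kind: str      # hypothetical labels: "attack" or "posture"
    severity: str  # "low", "medium", "high", or "critical"
    title: str


def count_verdict_findings(findings: list[Finding]) -> int:
    # Only high/critical attack findings count; posture findings never do.
    return sum(
        1 for f in findings
        if f.kind == "attack" and f.severity in VERDICT_SEVERITIES
    )


def is_flagged_malicious(findings: list[Finding]) -> bool:
    # Assumption: a single counted finding is enough for a malicious verdict.
    return count_verdict_findings(findings) > 0


def posture_notes(findings: list[Finding]) -> list[str]:
    # Posture findings are still surfaced to the user, just not scored.
    return [f.title for f in findings if f.kind == "posture"]


if __name__ == "__main__":
    benign_agent = [
        Finding("posture", "high", "missing governance file"),
        Finding("posture", "medium", "wildcard tool access"),
    ]
    print(is_flagged_malicious(benign_agent))  # False
    print(posture_notes(benign_agent))
```

Under this sketch, a benign agent with loose governance still produces visible posture notes, but it does not register as a false positive in the scorecard.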

The Skills Security controls

Alongside the attack corpus, OASB defines a 10-item Skills Security checklist (SS-01 through SS-10), tiered L1 → L3. It covers argument validation, output integrity, least-privilege scope, signed manifests, audit logging, dependency provenance, graceful degradation, and more.

Anchored to the standards everyone uses

Every scenario maps to MITRE ATLAS (15 techniques) and the OWASP LLM/Agentic Top 10, so results are comparable to the wider security world rather than living in a silo. OASB also ships a DVAA comparison (70 scenarios) for agent-level evaluation.

If you know ATT&CK Evaluations

MITRE ATT&CK Evaluations → OASB
Fixed adversary emulation → 222 attack scenarios (reproducible corpus)
Run against each EDR/SIEM → Run against each scanner (tool under test)
Detection coverage report → F1 / recall / FPR scorecard (confusion matrix)
ATT&CK technique mapping → MITRE ATLAS + OWASP mapping (comparable, not siloed)
In short: a vendor-neutral detection benchmark for agent-security tools rather than endpoint-security products.