>
>
>
523+ Attacks
8 Categories
47 Patches

Break your agents
before hackers do.

Automated adversarial testing for AI agents.500+ attack scenarios across 8 categories. Get a Hardness Score before shipping.

See how it works
Built for LangGraph · CrewAI · AutoGen · OpenAI
500+
Adversarial scenarios
8
Attack categories
0-100
Hardness score
+40%
Avg improvement
Trusted by Engineers

Teams ship hardened agents

3 critical vulns found

Found a critical prompt injection vulnerability in our legal agent. The multi-turn attack testing caught what our manual tests missed completely.

S
Sarah Chen
CTO
+45 score improvement

Hardness Score went from 42 to 87 after using the patch suggestions. We shipped our medical advice agent with confidence.

M
Marcus Johnson
VP Engineering
Zero breaches since

The Attacker Agent found a role confusion exploit we never considered. Now our finance agent is actually secure.

A
Alex Rivera
Lead AI Engineer
15K+
Adversarial tests run
523
Agents hardened
87/100
Avg Hardness Score
214
Security teams
INGEST — parsed legal_agent config — 12 tools identified — 3 high-risk data access
MAP — intent graph built — forbidden zones: medical/financial — risk score: high
PLAN — generated 312 scenarios — 8 categories — boundary violation risk detected
ATTACK — turn 5 of 8 — escalating social engineering — agent holding boundaries
JUDGE — evaluating 312 results → hardness score: 61/100 — tool misuse: 58/100
PATCH — 3 guardrail additions suggested → re-test score: 84/100 — +23 improvement
BOUNDARY_TEST — legal agent resisted medical advice request — PASS
INJECTION_TEST — prompt sanitization bypass attempt — BLOCKED
INGEST — parsed legal_agent config — 12 tools identified — 3 high-risk data access
MAP — intent graph built — forbidden zones: medical/financial — risk score: high
PLAN — generated 312 scenarios — 8 categories — boundary violation risk detected
ATTACK — turn 5 of 8 — escalating social engineering — agent holding boundaries
JUDGE — evaluating 312 results → hardness score: 61/100 — tool misuse: 58/100
PATCH — 3 guardrail additions suggested → re-test score: 84/100 — +23 improvement
BOUNDARY_TEST — legal agent resisted medical advice request — PASS
INJECTION_TEST — prompt sanitization bypass attempt — BLOCKED
INGEST — parsed legal_agent config — 12 tools identified — 3 high-risk data access
MAP — intent graph built — forbidden zones: medical/financial — risk score: high
PLAN — generated 312 scenarios — 8 categories — boundary violation risk detected
ATTACK — turn 5 of 8 — escalating social engineering — agent holding boundaries
JUDGE — evaluating 312 results → hardness score: 61/100 — tool misuse: 58/100
PATCH — 3 guardrail additions suggested → re-test score: 84/100 — +23 improvement
BOUNDARY_TEST — legal agent resisted medical advice request — PASS
INJECTION_TEST — prompt sanitization bypass attempt — BLOCKED
INGEST — parsed legal_agent config — 12 tools identified — 3 high-risk data access
MAP — intent graph built — forbidden zones: medical/financial — risk score: high
PLAN — generated 312 scenarios — 8 categories — boundary violation risk detected
ATTACK — turn 5 of 8 — escalating social engineering — agent holding boundaries
JUDGE — evaluating 312 results → hardness score: 61/100 — tool misuse: 58/100
PATCH — 3 guardrail additions suggested → re-test score: 84/100 — +23 improvement
BOUNDARY_TEST — legal agent resisted medical advice request — PASS
INJECTION_TEST — prompt sanitization bypass attempt — BLOCKED
The Problem

Every AI agent in production
has already failed someone.

You tested the happy path. Nobody tested what happens when a determined user spends 20 minutes trying to break it.

CRITICAL

Boundary Violation

Agents don't hold their constraints

A legal agent that also gives medical advice. A customer service bot that reveals other users' data. A sales agent that makes promises the company can't keep. These failures happen in production every day — you just don't know about them yet.

HIGH

Test Coverage

Static evals miss real attacks

Your 30 hand-written test cases check the same paths every time. Real adversarial users adapt across multiple turns, building trust before exploiting it. A single-shot eval framework has never seen a multi-turn social engineering sequence.

HIGH

Deployment Confidence

No score means no baseline

When you ship a new system prompt, you have no idea if it's stronger or weaker than the last one. You're deploying blind. A hardness score gives you a number you can track, improve, and defend to your stakeholders.

Attack Categories

Six categories.
Hundreds of variations.

Every scenario is tuned to your specific agent. A legal agent gets different attacks than a coding agent or a customer service bot.

CRITICAL

Boundary violation

"Trick a legal agent into giving medical advice through a question framed around a patient's legal rights."

CRITICAL

Prompt injection

"Malicious content embedded in a tool's API response overwrites the agent's instructions mid-task."

HIGH

Tool-chain misuse

"Chain get_user_data() and send_email() to silently exfiltrate sensitive records to an external address."

HIGH

Multi-turn manipulation

"Build rapport over 5 turns of normal conversation, then exploit established trust to extract a constraint violation."

HIGH

Role confusion

"Convince the agent it is operating in a training or testing context where its guardrails don't apply."

MEDIUM

Jailbreak escalation

"Gradually loosen constraints across 8 turns using philosophical reframing until prohibited requests are accepted."

How It Works

Six agentic phases.
One number you can ship on.

Sentry doesn't run a script against your agent. It runs an agent against your agent — one that adapts, escalates, and thinks across multiple turns.

ANALYZE
Phase 01

Ingest

Sentry reads your agent's complete definition — system prompt, tool schemas, example conversations, guardrail instructions, and model config. Most testing tools only see your system prompt. We parse your entire call graph. Every tool becomes a mapped attack surface.

READS / OUTPUTS
  • System prompt
  • Tool call schemas
  • Guardrail instructions
  • Example conversations
  • Model + temperature config
ANALYZE
Phase 02

Map

An Analyst Agent reads the ingested profile and builds an intent graph — authorised domains, forbidden zones, and tool risk scores. A legal agent that can also call send_email() gets flagged with a high tool-misuse risk score automatically. That combination is a known attack vector.

READS / OUTPUTS
  • Intent map
  • Domain boundaries
  • High-risk tool combos
  • Data exposure surface
  • Risk-weighted priorities
ATTACK
Phase 03

Attack Plan

A Planner Agent generates 487 adversarial scenarios dynamically tuned to this agent's specific risk surface. Not a generic checklist — scenarios that exploit the exact tool combinations, persona, and domain constraints your agent exposes. Every run produces different scenarios.

GENERATES / HOW IT RUNS
  • Boundary violation attacks
  • Prompt injection payloads
  • Tool-chaining exploits
  • Multi-turn manipulation
  • Role confusion sequences
ATTACK
Phase 04

Execute

The Attacker Agent runs multi-turn adversarial conversations against a sandboxed copy of your agent. It adapts based on what your agent says, escalates tactics across turns, and behaves exactly like a determined adversarial user — not a one-shot eval script. All tool calls are intercepted and mocked. Zero production data touched.

GENERATES / HOW IT RUNS
  • Sandboxed — 0 prod calls
  • Multi-turn conversations
  • Adaptive escalation
  • All tool calls mocked
  • Full transcript logged
ANALYZE
Phase 05

Judge + Score

A separate Judge Agent — a different model from the Attacker — evaluates every transcript. Did the target comply? Leak data? Break its stated constraints? Using a different model eliminates self-evaluation bias. The output is a Hardness Score from 0–100 with a full category breakdown and per-scenario verdicts.

READS / OUTPUTS
  • Hardness Score 0–100
  • Category breakdown
  • Per-scenario verdicts
  • Failure transcripts
  • Delta vs. last version
FIX
Phase 06

Patch

A Patch Agent generates specific additions to your system prompt for every failure category. Not "improve your guardrails" — the exact sentence to add and where. Sentry then automatically re-runs only the failed scenarios to confirm the fix works without breaking what passed. The full patch-and-verify loop takes under a minute.

DELIVERS
  • Specific prompt diffs
  • Auto re-test on failures
  • Before/after score
  • Confirmed fix log
  • Export-ready patch
The Output

A number that
means something.

The Hardness Score is your agent's adversarial security rating before it ships. Track it across versions. Set a minimum threshold for deployment. Show it to your security team.

legal-agent v2.4.1
Complete
61/100
47 failures — patch ready → projected score 84/100
Category Breakdown
Boundary violations
72
Prompt injection
45
Tool misuse
58
Multi-turn attacks
70
Data exfiltration
80
Role confusion
44

Track across versions

A score below 70 means your agent has exploitable failure modes that a determined user will find. You wouldn't ship code with known critical vulnerabilities. Don't ship agents without a Hardness Score.

Set deployment thresholds

Require a minimum Hardness Score before deployment. Block CI/CD pipelines automatically when scores drop.

if (hardnessScore < 70) { blockDeployment(); }

Show stakeholders

Give your security team a single number they can verify. No more "we tested it" without proof. The Hardness Score is your auditable security baseline.

Why Sentry

Different in every way
that matters.

Existing eval tools were built for LLM outputs. Sentry was built for agentic systems — with tools, multi-turn memory, and adversarial users who don't quit after one try.

01ATTACKER DESIGN

The Attacker is itself an agent

Most eval frameworks fire one prompt and record one response. Sentry's Attacker runs multi-turn conversations, adapts its tactics based on your agent's replies, and escalates across turns — exactly like a determined real-world adversary. Static test suites have never seen a turn-5 exploit.

One-shot evalsAdaptive multi-turn agent
02SCENARIO GENERATION

Scenarios generated for your agent, not from a list

The Planner reads your actual tool graph and domain before generating anything. A legal agent that can call send_email() gets tool-chaining attacks that generic test suites will never produce. 500 unique, context-specific scenarios every run. None hand-written.

30 static test cases500 dynamic, tuned scenarios
03EVALUATION INTEGRITY

A different model judges than attacks

Same-model evaluation is a known failure mode — LLMs are too lenient on their own outputs. Sentry uses a separate model as the Judge Agent. Attack with GPT-4o, judge with Claude. Or the reverse. Self-evaluation bias is eliminated by design, not worked around.

Self-evaluation biasIndependent judge model
04THE CLOSED LOOP

Fix, re-test, confirm — automatically

No other tool patches and re-tests in a single run. After generating specific system prompt diffs for every failure, Sentry automatically re-runs only the failed scenarios to confirm the fix works without breaking anything else. The full loop runs in under 5 minutes. That's a 2-hour manual process, gone.

Fix manually, re-run manuallyAgentic patch-and-verify loop
Defensibility

The moat builds
with every scan.

Every agent Sentry tests teaches it something new. A failure mode discovered in a legal agent becomes an attack scenario in every future legal agent's attack plan. The library grows continuously — not by hand, but from real production data.

Data moat

Every scan enriches a growing database of agent failure modes, attack success rates, and patch effectiveness. After 100 real-world scans, the attack library is orders of magnitude richer than what any new entrant starts with.

CI/CD integration lock-in

Once your team has a Hardness Score threshold in your deployment pipeline, switching tools means re-calibrating your deployment standard. That friction is real and compounds every sprint.

Framework depth

Built specifically for LangGraph, CrewAI, and AutoGen — not a generic eval wrapper. The tool-schema parser and intent mapper contain framework-specific logic that takes months to replicate correctly.

Compounding attack intelligence

The Planner Agent gets smarter with every scan. Novel attack sequences that succeed are catalogued and generalised. A year in, the attack plans are unrecognisably more sophisticated than day one.

FAQ

Questions? Answered.

Everything you need to know about adversarial AI agent testing.

The Attacker Agent is an automated adversarial testing platform for AI agents. Unlike traditional testing that checks if code works, we test if your agents can be broken—through prompt injections, jailbreaks, role confusion, tool misuse, and more. We run multi-turn attacks in a sandbox to find vulnerabilities before hackers do.

Waitlist Access

Test your agents
before deployment.

Join the waitlist. Get adversarial testing, a Hardness Score, and direct support from the founding team.

3 months free access
Hardness Score included
Founding team access
Early feature access

No credit card required