ScannersRed Team / Evals

Red Team and Eval Engine

Sentinel Red Team and Eval workflows run repeatable probes, detectors, generators, and config-driven assertions against LLM applications and provider integrations.

Definition

Sentinel Red Team and Eval Engine is a repeatable testing layer for LLM applications. It turns prompts, probes, expected assertions, and provider settings into evidence that can be run locally, in CI, and during release review.

Config-driven evals

Use YAML evals when you need a stable regression suite for prompts, providers, and tool behavior. Every important app behavior should have an assertion that explains what must stay true.

Operational checklist

Provider IDs and model choices
Prompt templates and variables
Assertions for refusal, containment, JSON shape, and leakage
Thresholds that decide release pass/fail

Red team probes

Red team probes are designed to exercise failure modes, not happy paths. They help security teams find injection, leakage, excessive agency, and unsafe output handling before a real user does.

Operational checklist

Prompt injection and indirect prompt injection
Tool-use abuse and overbroad permissions
RAG leakage and poisoned context
Output handling and structured response bypasses

Reporting evidence

Eval evidence should be versioned with the application. A failed probe should include the prompt, model/provider, assertion, observed output, severity, and retest command.

Operational checklist

Attach JSON output to CI artifacts
Keep failed prompts small and reproducible
Map failures to OWASP LLM where possible

Probe coverage benchmarks (OWASP LLM Top 10 2025 + benchmarks)

External benchmarks provide objective probe corpus and difficulty ratings. JailbreakBench and AIRTBench score how effectively evals catch adversarial inputs. ISC-Bench evaluates instruction-following safety compliance. Cross-reference probe results against OWASP LLM Top 10 2025 categories to verify coverage.

Operational checklist

OWASP LLM01:2025 Prompt Injection — probe corpus: direct instruction override, ignore-previous-instructions patterns
OWASP LLM06:2025 Excessive Agency — probe corpus: tool permission boundary violations, overbroad capability requests
OWASP LLM02:2025 Sensitive Info Disclosure — probe corpus: PII extraction, credential leakage, training data extraction
JailbreakBench (jailbreakbench.github.io): standardized leaderboard and probe corpus for jailbreak safety evaluation
AIRTBench: AI red teaming evaluation suite covering both LLM and agentic attack surfaces
ISC-Bench: instruction-following safety compliance benchmark for evaluating refusal quality

SARIF output and CI integration

Sentinel red-team eval output can be emitted as SARIF (Static Analysis Results Interchange Format), making probe failures first-class findings in GitHub Advanced Security, Azure DevSecOps, and other SARIF consumers.

Operational checklist

SARIF rule ID maps to OWASP LLM category (e.g., LLM01, LLM06) for automatic triage
Each finding includes: probe prompt, model response, assertion, severity, and remediation hint
Upload to GitHub Code Scanning via `actions/upload-sarif` for inline PR annotation
SARIF severity levels align with Sentinel severity guide: CRITICAL/HIGH block merge, MEDIUM/LOW advisory

Commands

sentinel redteam --target openai/gpt-4o
sentinel redteam --list-probes
sentinel evaluate eval.yaml --fail-on-threshold 0.95
sentinel evaluate eval.yaml -f json -o eval-report.json

Expected output

Output should carry rule ID, severity, surface, evidence, and release decision in a way other teams can understand.

suite: agent-redteam
pass_rate: 0.91
failed:
  - prompt_injection.system_prompt_leak
  - tool_use.excessive_agency
decision: fail threshold 0.95

FAQ

How often should evals run?

Run fast evals on prompt/tool PRs, full suites before release, and scheduled suites when providers or retrieval data change.

Should eval failures block release?

Failures tied to data leakage, privileged tool abuse, or policy bypass should block release. Cosmetic response drift can be reviewed as MEDIUM or LOW.

Which external benchmarks should I use?

JailbreakBench for jailbreak probe coverage and leaderboard comparison, AIRTBench for agentic red teaming, and ISC-Bench for instruction-following safety compliance. Map probe results to OWASP LLM Top 10 2025 to confirm coverage across all ten categories.

References

Eresus support

Turn the finding into an action your team can actually close.

If you need exploit evidence, prioritization, remediation direction, and retesting for Red Team and Eval Engine, Eresus can help scope the work with your team.

Start Security Test

PreviousHuggingFace Guard NextSupply Chain / AIBOM