Domain 08Introduced in v1.0

Adversarial Robustness

L1L2L338 ACRs (38 defined in current release)

Summary#

Injection resistance, role confusion, and red team validation

Applicability#

Certification Level	Status	Description
L1Supervised Operational Reliability	Required	Applicable ACRs must be satisfied for L1 certification.
L2Bounded Autonomous Deployment	Required	Full domain scope is evaluated for L2 certification.
L3High-Stakes Autonomous Certification	Required	Maximum rigor evaluation at L3 level with extended evidence requirements.

Risk Rationale#

Linked ACR Controls#

The following Autonomous Compliance Requirements are assigned to this domain. Each ACR defines a specific, testable control with its own evaluation method, classification, and evidence requirements.

Adversarial Robustness

Summary#

Applicability#

Risk Rationale#

Linked ACR Controls#

The system SHALL resist direct prompt injection attacks with a pass rate of 95% or higher against a

The system SHALL resist indirect prompt injection where malicious instructions are embedded in data

The system SHALL resist role confusion attacks that attempt to alter its operational identity or aut

The system SHALL resist malicious context insertion where adversarial content is injected into opera

The system SHALL detect and reject role confusion attempts with a pass rate of 95% or higher.

The system SHALL detect adversarial tool outputs including manipulated API responses and poisoned da

The system SHALL resist jailbreaking attempts that seek to override safety constraints or behavioral

The system SHALL resist multi-step adversarial sequences where individual benign-appearing requests

The system SHALL detect and resist social engineering attacks conducted through natural language int

The system SHALL undergo red team validation by qualified independent assessors at least annually.

The system SHALL maintain adversarial robustness when operating under stress conditions.

The system SHALL resist model extraction and reverse engineering attempts that could expose vulnerab

Adversarial input detection and logging SHALL be implemented for post-incident analysis.

The system SHALL resist encoding-based injection attacks (Base64, Unicode, ROT13, etc.).

The system SHALL resist indirect prompt injection through data retrieved from databases, file system

The system SHALL resist instruction override attacks embedded in system messages or context-setting

Adversarial test suites SHALL be updated at intervals defined by the certification level to reflect

The system SHALL resist payload splitting attacks where malicious instructions are distributed acros

The system SHALL resist attacks that attempt to extract training data or system prompt content.

The system SHALL resist attacks that attempt to make it reveal its operational constraints or safety

Adversarial robustness test results SHALL be documented with attack methodology, success criteria, a

The system SHALL implement adversarial attack detection that triggers alerting for novel attack patt

The system SHALL resist token manipulation attacks including homoglyph substitution and whitespace e

The system SHALL resist adversarial attacks targeting tool selection and parameter construction.

The system SHALL resist many-shot adversarial attacks where attack patterns are gradually introduced

Red team exercises SHALL follow documented methodology with defined scope, rules of engagement, and

The system SHALL resist adversarial inputs that exploit ambiguity in the instruction hierarchy.

Adversarial robustness SHALL be tested across all input channels and interfaces, not just the primar

The system SHALL resist adversarial fine-tuning or poisoning of any adaptable model components.

The system SHALL maintain safety constraint enforcement during adversarial conditions without degrad

Adversarial testing SHALL include tests specific to the system's deployment context and industry.

The system SHALL resist privilege escalation attacks conducted through adversarial interaction.

Attack surface documentation SHALL be maintained and updated with each system change.

The system SHALL resist attacks that attempt to cause it to ignore or downgrade the severity of its

Adversarial robustness metrics SHALL be tracked over time to detect degradation trends.

The system SHALL resist context overflow attacks designed to push safety instructions out of the pro

Red team findings SHALL be tracked through remediation with verified closure of identified vulnerabi

The system SHALL undergo automated adversarial regression testing with each significant system updat