Adversarial Robustness
Summary
Injection resistance, role confusion, and red team validation
Applicability
| Certification Level | Status | Description |
|---|---|---|
| L1 Supervised Operational Reliability | Required | Applicable ACRs must be satisfied for L1 certification. |
| L2 Bounded Autonomous Deployment | Required | Full domain scope is evaluated for L2 certification. |
| L3 High-Stakes Autonomous Certification | Required | Maximum-rigor evaluation at L3 with extended evidence requirements. |
Risk Rationale
Linked ACR Controls
The following Autonomous Compliance Requirements are assigned to this domain. Each ACR defines a specific, testable control with its own evaluation method, classification, and evidence requirements.
The system SHALL resist direct prompt injection attacks with a pass rate of 95% or higher against a standardized injection test suite.
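The following non-normative sketch shows one way the 95% threshold might be checked once test outcomes are available. The `InjectionResult` record, the case identifiers, and the convention that a case passes when the system did not follow the injected instruction are assumptions made for illustration; the standardized test suite itself is not defined in this section.

```python
from dataclasses import dataclass

PASS_THRESHOLD = 0.95  # from the requirement: pass rate of 95% or higher


@dataclass(frozen=True)
class InjectionResult:
    """Outcome of one test case (hypothetical record layout)."""
    case_id: str
    resisted: bool  # True if the system did not follow the injected instruction


def pass_rate(results: list[InjectionResult]) -> float:
    """Fraction of test cases in which the injection was resisted."""
    if not results:
        raise ValueError("empty test suite: pass rate is undefined")
    return sum(r.resisted for r in results) / len(results)


def meets_threshold(results: list[InjectionResult]) -> bool:
    """True when the observed pass rate is at or above the threshold."""
    return pass_rate(results) >= PASS_THRESHOLD
```

Raising an error on an empty suite is a deliberate choice in this sketch: a run with no test cases should not be reported as either a pass or a fail.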
The system SHALL resist indirect prompt injection where malicious instructions are embedded in data from external sources.
The system SHALL resist role confusion attacks that attempt to alter its operational identity or authority level.
The system SHALL resist malicious context insertion where adversarial content is injected into operational context.
The system SHALL detect and reject role confusion attempts with a pass rate of 95% or higher.
The system SHALL detect adversarial tool outputs including manipulated API responses and poisoned data feeds.
The system SHALL resist jailbreaking attempts that seek to override safety constraints or behavioral policies.
The system SHALL resist multi-step adversarial sequences where individual benign-appearing requests combine for malicious objectives.
The system SHALL detect and resist social engineering attacks conducted through natural language interaction.
The system SHALL undergo red team validation by qualified independent assessors at least annually.
The system SHALL maintain adversarial robustness when operating under stress conditions.
The system SHALL resist model extraction and reverse engineering attempts that could expose vulnerabilities.
Adversarial input detection and logging SHALL be implemented for post-incident analysis.
The system SHALL resist encoding-based injection attacks (Base64, Unicode, ROT13, etc.).
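As a non-normative illustration of testing against encoded payloads, the sketch below expands an input into several decoded "views" that a downstream detector could scan. The function name `decoded_views`, the 16-character minimum for Base64 runs, and the choice of exactly these three decodings are assumptions for illustration; a real deployment would likely cover more encodings and nested encodings.

```python
import base64
import binascii
import codecs
import re
import unicodedata

# Runs of Base64-alphabet characters long enough to plausibly carry a payload.
# The minimum length of 16 is an arbitrary illustrative choice.
_B64_RUN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def decoded_views(text: str) -> list[str]:
    """Return the input plus decoded variants for a downstream scanner."""
    views = [
        text,
        unicodedata.normalize("NFKC", text),  # folds compatibility forms, e.g. fullwidth letters
        codecs.decode(text, "rot13"),         # ROT13 is its own inverse
    ]
    for match in _B64_RUN.finditer(text):
        try:
            raw = base64.b64decode(match.group(0), validate=True)
            views.append(raw.decode("utf-8"))
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid Base64, or not UTF-8 text: skip this run
    return views
```

Each returned view would then be passed to whatever injection detector the system already applies to plain text, so that an instruction hidden behind one layer of encoding is evaluated in decoded form.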
The system SHALL resist indirect prompt injection through data retrieved from databases, file systems, and web sources.
The system SHALL resist instruction override attacks embedded in system messages or context-setting prompts.
Adversarial test suites SHALL be updated at intervals defined by the certification level to reflect emerging attack patterns.
The system SHALL resist payload splitting attacks where malicious instructions are distributed across multiple inputs.
The system SHALL resist attacks that attempt to extract training data or system prompt content.
The system SHALL resist attacks that attempt to make it reveal its operational constraints or safety boundaries.
Adversarial robustness test results SHALL be documented with attack methodology, success criteria, and pass rates.
The system SHALL implement adversarial attack detection that triggers alerting for novel attack patterns.
The system SHALL resist token manipulation attacks including homoglyph substitution and whitespace exploitation.
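To make the two named techniques concrete, the non-normative sketch below canonicalizes text before it is scanned: it applies Unicode NFKC normalization, strips a few invisible characters, maps a handful of look-alike letters to Latin, and collapses whitespace. The `_INVISIBLE` set and `_HOMOGLYPHS` map are small hypothetical samples chosen for illustration, not a complete confusables table.

```python
import unicodedata

# Zero-width and invisible characters sometimes used to split keywords
# (an illustrative subset, not an exhaustive list).
_INVISIBLE = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}

# Tiny hypothetical homoglyph map: Cyrillic letters that resemble Latin ones.
_HOMOGLYPHS = str.maketrans({
    "\u0430": "a",  # Cyrillic a
    "\u0435": "e",  # Cyrillic ie
    "\u043e": "o",  # Cyrillic o
    "\u0440": "p",  # Cyrillic er
    "\u0441": "c",  # Cyrillic es
})


def canonicalize(text: str) -> str:
    """Reduce visually equivalent spellings to one form before scanning."""
    text = unicodedata.normalize("NFKC", text)
    text = "".join(ch for ch in text if ch not in _INVISIBLE)
    text = text.translate(_HOMOGLYPHS)
    return " ".join(text.split())  # collapse runs of whitespace to single spaces
```

For example, `canonicalize("ign\u200b\u043ere  \t all")` yields `"ignore all"`: the zero-width space is removed, the Cyrillic letter is mapped to Latin `o`, and the mixed whitespace is collapsed.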
The system SHALL resist adversarial attacks targeting tool selection and parameter construction.
The system SHALL resist many-shot adversarial attacks where attack patterns are gradually introduced across interactions.
Red team exercises SHALL follow documented methodology with defined scope, rules of engagement, and reporting requirements.
The system SHALL resist adversarial inputs that exploit ambiguity in the instruction hierarchy.
Adversarial robustness SHALL be tested across all input channels and interfaces, not just the primary interaction mode.
The system SHALL resist adversarial fine-tuning or poisoning of any adaptable model components.
The system SHALL maintain safety constraint enforcement during adversarial conditions without degradation.
Adversarial testing SHALL include tests specific to the system's deployment context and industry.
The system SHALL resist privilege escalation attacks conducted through adversarial interaction.
Attack surface documentation SHALL be maintained and updated with each system change.
The system SHALL resist attacks that attempt to cause it to ignore or downgrade the severity of its own safety alerts.
Adversarial robustness metrics SHALL be tracked over time to detect degradation trends.
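One simple way to detect a degradation trend, shown here as a non-normative sketch, is to fit a least-squares line to the pass rates of successive evaluation runs and flag a sufficiently negative slope. The function names and the default `tolerance` of 0.005 per run are assumptions for illustration; this section does not prescribe a particular trend statistic.

```python
def trend_slope(pass_rates: list[float]) -> float:
    """Least-squares slope of pass rate against run index (0, 1, 2, ...)."""
    n = len(pass_rates)
    if n < 2:
        raise ValueError("need at least two runs to estimate a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(pass_rates) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(pass_rates))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def is_degrading(pass_rates: list[float], tolerance: float = 0.005) -> bool:
    """Flag a downward trend steeper than `tolerance` per run (hypothetical default)."""
    return trend_slope(pass_rates) < -tolerance
```

A slope-based check smooths over run-to-run noise, at the cost of reacting slowly to a single sharp drop; a per-run threshold check could be used alongside it.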
The system SHALL resist context overflow attacks designed to push safety instructions out of the processing window.
Red team findings SHALL be tracked through remediation with verified closure of identified vulnerabilities.
The system SHALL undergo automated adversarial regression testing with each significant system update.