Domain 6: Failure Mode Containment — ARA Standard v1.1

ACR-6.01

The system SHALL detect outputs that deviate from expected distributions by more than defined thresholds.

AT+CM | Risk weight: 4/10

L1, L2, L3
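Informative example (not part of the standard): one common way to implement the threshold check in ACR-6.01 is a z-score against a baseline sample. This is a minimal sketch. The function name `deviates`, the `max_z` parameter, and the use of a z-score are assumptions; the standard leaves the distribution model and the thresholds to the implementer.

```python
from statistics import mean, stdev


def deviates(baseline, value, max_z=3.0):
    """Return True when value lies more than max_z sample standard
    deviations away from the mean of the baseline observations."""
    mu = mean(baseline)
    sigma = stdev(baseline)  # needs at least two baseline points
    if sigma == 0:
        # A constant baseline: any different value counts as a deviation.
        return value != mu
    return abs(value - mu) / sigma > max_z
```

A production detector would likely track a rolling baseline and report the measured deviation alongside the decision so that the threshold can be audited.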

ACR-6.02

The system SHALL detect outputs that violate format constraints or contain impossible values.

AT (Automated Testing) | Risk weight: 4/10

L1, L2, L3
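Informative example (not part of the standard): format and impossible-value checks of the kind ACR-6.02 describes can be written as a validator that returns every violation it finds. The record shape (`label`, `confidence`) and the function name `validate_output` are hypothetical and chosen only for illustration.

```python
import math


def validate_output(record):
    """Return a list of violations; an empty list means the record passed."""
    problems = []

    label = record.get("label")
    if not isinstance(label, str) or not label:
        problems.append("label must be a non-empty string")

    conf = record.get("confidence")
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if isinstance(conf, bool) or not isinstance(conf, (int, float)):
        problems.append("confidence must be a number")
    elif math.isnan(conf) or not 0.0 <= conf <= 1.0:
        # NaN and out-of-range probabilities are impossible values.
        problems.append("confidence must lie in [0, 1]")

    return problems
```

Returning all violations rather than stopping at the first gives operators fuller context when a rejected output is reviewed.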

ACR-6.03

The system SHALL escalate critical failures to human operators with sufficient context for informed intervention.

AT+HS | Risk weight: 5/10

L1, L2, L3

ACR-6.04

Upon critical failure detection, the system SHALL enter a documented safe fallback state within defined time bounds.

AT (Automated Testing) | Risk weight: 5/10

L1, L2, L3
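Informative example (not part of the standard): the time bound in ACR-6.04 can be made testable by recording how long the transition into the fallback state took. The class `FallbackController`, its state names, and the injectable `clock` are assumptions made for this sketch.

```python
import time


class FallbackController:
    """Tracks entry into a safe fallback state and checks the time bound."""

    def __init__(self, deadline_seconds, clock=time.monotonic):
        self.deadline_seconds = deadline_seconds
        self.clock = clock  # injectable so tests can control time
        self.state = "NORMAL"
        self.transition_seconds = None

    def on_critical_failure(self, detected_at, enter_safe_state):
        """Run the fallback action and return True if the transition,
        measured from detection, finished within the deadline."""
        enter_safe_state()
        self.state = "SAFE_FALLBACK"
        self.transition_seconds = self.clock() - detected_at
        return self.transition_seconds <= self.deadline_seconds
```

A return value of False would itself be a reportable event, since the system reached the fallback state but missed its bound.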

ACR-6.05

Safe fallback states SHALL be documented, independently tested, and verified for each failure category.

EI+AT | Risk weight: 4/10

L1, L2, L3

ACR-6.06

The system SHALL prevent cascading failure across subsystems through isolation mechanisms.

AT (Automated Testing) | Risk weight: 5/10

L1, L2, L3

ACR-6.07

Circuit breaker mechanisms SHALL be implemented and tested for all inter-subsystem connections.

AT (Automated Testing) | Risk weight: 4/10

L1, L2, L3

ACR-6.08

The system SHALL implement circuit breakers preventing cascading failure to connected subsystems.

AT (Automated Testing) | Risk weight: 5/10

L1, L2, L3
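Informative example (not part of the standard): ACR-6.07 and ACR-6.08 both call for circuit breakers. The sketch below shows the usual closed, open, and half-open behaviour in its simplest form. The class and parameter names (`CircuitBreaker`, `max_failures`, `reset_after`) are assumptions, and the sketch is not thread-safe.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock  # injectable so tests can control time
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; call rejected")
            # Reset period elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Because the breaker takes the protected function as an argument, it can be exercised with stub functions and a fake clock, without the real downstream subsystem.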

ACR-6.09

Timeout mechanisms SHALL be implemented for all operations with defined maximum execution durations.

AT (Automated Testing) | Risk weight: 3/10

L1, L2, L3
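Informative example (not part of the standard): a maximum execution duration, as required by ACR-6.09, can be enforced on the caller's side with the standard library's `concurrent.futures`. The helper name `run_with_timeout` is an assumption. Note the limitation: the caller stops waiting, but Python cannot forcibly stop the worker thread, so the operation itself should also honour its own deadline.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def run_with_timeout(func, max_seconds, *args, **kwargs):
    """Run func in a worker thread and stop waiting after max_seconds."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func, *args, **kwargs)
    try:
        return future.result(timeout=max_seconds)
    except FutureTimeout:
        raise TimeoutError(f"operation exceeded {max_seconds} s") from None
    finally:
        # Do not block on the worker; it may still be running.
        pool.shutdown(wait=False)
```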

ACR-6.10

Operation idempotency SHALL be maintained where possible to prevent duplicate actions during retry sequences.

AT (Automated Testing) | Risk weight: 4/10

L1, L2, L3
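Informative example (not part of the standard): idempotency during retries (ACR-6.10) is often achieved with an idempotency key supplied by the caller. The sketch below keeps results in memory. The class name `IdempotentExecutor` and the key format are assumptions, and a real system would need durable, shared storage for the keys.

```python
class IdempotentExecutor:
    """Runs each keyed action at most once and replays the stored result."""

    def __init__(self):
        self._results = {}

    def execute(self, key, action):
        if key in self._results:
            # Retry of an already completed operation: do not repeat it.
            return self._results[key]
        result = action()
        # Stored only after success, so a failed attempt can be retried.
        self._results[key] = result
        return result
```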

ACR-6.11

Transaction rollback capabilities SHALL be implemented for operations that support reversal.

AT (Automated Testing) | Risk weight: 4/10

L1, L2, L3
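Informative example (not part of the standard): for operations backed by a transactional store, the rollback capability in ACR-6.11 can rely on the store's own transactions. The sketch uses Python's built-in `sqlite3` module, whose connection object commits on success and rolls back on an exception when used as a context manager. The `accounts` table and the `transfer` function are hypothetical.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move amount between accounts; all changes are undone on failure."""
    with conn:  # commits if the block succeeds, rolls back if it raises
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, src),
        )
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, dst),
        )
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE name = ?", (src,)
        ).fetchone()
        if balance < 0:
            # Raising inside the block triggers the rollback of both updates.
            raise ValueError("insufficient funds")
```

Operations with external side effects that a database cannot undo would need compensating actions instead, which this sketch does not cover.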

ACR-6.12

The system SHALL detect and handle resource exhaustion conditions including memory, compute, storage, and API rate limits.

AT (Automated Testing) | Risk weight: 4/10

L1, L2, L3

ACR-6.13

A failure taxonomy SHALL be defined and maintained classifying failure modes by severity, impact, and required response.

EI (Evidence Inspection) | Risk weight: 3/10

L1, L2, L3
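Informative example (not part of the standard): a failure taxonomy of the kind ACR-6.13 requires can be kept as structured data so that the same definitions drive both documentation and runtime handling. Every entry, severity level, and response string below is hypothetical, as is the choice to treat unknown failures conservatively.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    LOW = 1
    HIGH = 2
    CRITICAL = 3


@dataclass(frozen=True)
class FailureMode:
    name: str
    severity: Severity
    impact: str
    response: str


# Illustrative entries only; a real taxonomy is system-specific.
TAXONOMY = {
    mode.name: mode
    for mode in [
        FailureMode("malformed_output", Severity.LOW,
                    "single response rejected", "retry once, then log"),
        FailureMode("dependency_timeout", Severity.HIGH,
                    "one subsystem degraded", "open circuit breaker"),
        FailureMode("data_corruption", Severity.CRITICAL,
                    "persistent state at risk", "enter safe fallback and escalate"),
    ]
}


def required_response(failure_name):
    """Look up the response for a failure; unknown failures get the
    most conservative response rather than being ignored."""
    mode = TAXONOMY.get(failure_name)
    if mode is None:
        return "enter safe fallback and escalate"
    return mode.response
```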

ACR-6.14

All defined failure modes SHALL be tested through deliberate fault injection to verify containment effectiveness.

AT (Automated Testing) | Risk weight: 4/10

L1, L2, L3
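Informative example (not part of the standard): deliberate fault injection (ACR-6.14) can be done by wrapping a dependency so that it fails on chosen calls, then asserting that the caller contains the failure. The names `FaultInjector` and `fetch_with_fallback` are assumptions, and the injected `ConnectionError` stands in for whichever failure mode is under test.

```python
class FaultInjector:
    """Wraps a callable and raises an injected fault on selected calls."""

    def __init__(self, func, fail_on_calls):
        self.func = func
        self.fail_on_calls = set(fail_on_calls)  # 1-based call numbers
        self.calls = 0

    def __call__(self, *args, **kwargs):
        self.calls += 1
        if self.calls in self.fail_on_calls:
            raise ConnectionError(f"injected fault on call {self.calls}")
        return self.func(*args, **kwargs)


def fetch_with_fallback(fetch, default):
    """Containment under test: fall back to a default on connection loss."""
    try:
        return fetch()
    except ConnectionError:
        return default
```

For instance, injecting a fault on the second of three calls should yield the fallback value for that call only, with the calls before and after it served normally.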

ACR-6.15

Failure detection latency SHALL be measured and SHALL NOT exceed defined maximum detection time bounds.

AT+CM | Risk weight: 4/10

L1, L2, L3

ACR-6.16

The system SHALL maintain a failure recovery log documenting each failure event, detection time, response action, and resolution.

AT+CM | Risk weight: 3/10

L1, L2, L3
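Informative example (not part of the standard): the four fields that ACR-6.16 lists map naturally onto a structured log record, and storing both the occurrence and detection times also yields the detection latency that ACR-6.15 asks to be measured. The field names, the JSON-lines format, and the helper names are assumptions made for this sketch.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class FailureRecord:
    event: str
    occurred_at: float   # seconds, from the same clock as detected_at
    detected_at: float
    response_action: str
    resolution: str

    @property
    def detection_latency(self):
        return self.detected_at - self.occurred_at


def within_detection_bound(record, max_seconds):
    """True when the failure was detected within the allowed time."""
    return record.detection_latency <= max_seconds


def append_record(log_lines, record):
    """Serialize the record as one JSON line, including derived latency."""
    entry = asdict(record)
    entry["detection_latency"] = record.detection_latency
    log_lines.append(json.dumps(entry, sort_keys=True))
```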

ACR-6.17

Graceful degradation modes SHALL be defined for each subsystem with documented reduced-capability operation.

EI+AT | Risk weight: 4/10

L1, L2, L3

ACR-6.18

The system SHALL prevent data corruption during failure and recovery sequences.

AT (Automated Testing) | Risk weight: 5/10

L1, L2, L3

ACR-6.19

Failure containment boundaries SHALL be independently verifiable by external assessors.

EI+AT | Risk weight: 3/10

L1, L2, L3

ACR-6.20

The system SHALL implement automated health checks that detect pre-failure degradation indicators.

AT+CM | Risk weight: 4/10

L1, L2, L3

ACR-6.21

Recovery procedures SHALL be automated where possible and SHALL NOT require system restart for non-critical failures.

AT (Automated Testing) | Risk weight: 3/10

L1, L2, L3

ACR-6.22

The system SHALL maintain service to unaffected functions during localized failure containment.

AT (Automated Testing) | Risk weight: 4/10

L1, L2, L3

ACR-6.23

Failure simulation tests SHALL be conducted at intervals defined by the certification level.

EI+AT | Risk weight: 3/10

L1, L2, L3

ACR-6.24

The system SHALL implement dead-letter queues or equivalent mechanisms for failed operations requiring post-mortem review.

AT+EI | Risk weight: 3/10

L1, L2, L3
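Informative example (not part of the standard): a dead-letter queue, as named in ACR-6.24, holds operations that still fail after their retries so that they can be reviewed later instead of being lost. The function name `process_all`, the retry count, and the shape of the dead-letter entries are assumptions.

```python
from collections import deque


def process_all(messages, handler, max_attempts=3):
    """Try each message up to max_attempts times; return the dead letters."""
    dead_letters = deque()
    for message in messages:
        last_error = None
        for _attempt in range(max_attempts):
            try:
                handler(message)
                break  # handled successfully, move to the next message
            except Exception as exc:
                last_error = exc
        else:
            # Loop ended without a break: every attempt failed.
            dead_letters.append({
                "message": message,
                "error": repr(last_error),
                "attempts": max_attempts,
            })
    return dead_letters
```

Keeping the original message together with the last error gives the post-mortem reviewer enough context to replay or discard the operation.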

ACR-6.25

Failure mode testing SHALL include simultaneous multi-fault scenarios at Level 2 and above.

AT (Automated Testing) | Risk weight: 5/10

L1, L2, L3

ACR-6.26

The system SHALL NOT silently drop operations during failure conditions without logging and notification.

AT+CM | Risk weight: 4/10

L1, L2, L3

ACR-6.27

Failure containment mechanisms SHALL be tested independently from the components they protect.

AT (Automated Testing) | Risk weight: 4/10

L1, L2, L3

ACR-6.28

The system SHALL define and enforce maximum blast radius limits for each failure category.

EI+AT | Risk weight: 4/10

L1, L2, L3
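Informative example (not part of the standard): one possible reading of the blast radius limit in ACR-6.28 is a cap on how many subsystems a failure of a given category may affect. The sketch below assumes that reading. The categories, the limits, and the function name `within_blast_radius` are hypothetical.

```python
# Hypothetical limits: maximum number of subsystems a failure of each
# category is allowed to affect.
BLAST_RADIUS_LIMITS = {
    "dependency_timeout": 1,
    "data_corruption": 2,
}


def within_blast_radius(category, affected_subsystems):
    """True when the set of affected subsystems is within the limit.
    Categories without a defined limit are allowed to affect nothing."""
    limit = BLAST_RADIUS_LIMITS.get(category, 0)
    return len(set(affected_subsystems)) <= limit
```

A check like this could run inside fault injection tests, with a False result indicating that containment did not hold.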
