Evaluation Methodology — v1.1
The ARA Standard defines a structured, repeatable evaluation process for certifying autonomous systems. Every system undergoes the same 10-phase certification lifecycle, adapted in rigor and scope to the requested certification level and determined Assurance Class. The methodology is technology-neutral, evidence-based, and independently verifiable.
Evaluations are conducted by Authorized Validation Bodies (AVBs) — independent organizations accredited by ARAF to perform certification assessments. v1.1 introduces mandatory risk classification, expanded evidence categories, two new evaluation methods, and system-profile-based scoping.
10-Phase Certification Lifecycle
Every ARA certification follows this sequential lifecycle. Phases are completed in order; earlier phases must be finalized before subsequent phases begin. The total evaluation timeline depends on the certification level, Assurance Class, system complexity, and the availability of evidence and test environments.
Phase 1: Intake & Scoping
Typical duration: 1–2 days
The applicant organization initiates the certification process through an Authorized Validation Body (AVB). The system is registered and assigned a unique ARA System Identifier (ASI). The AVB works with the applicant to determine the applicable system profile — Foundational (F), Standard (S), Advanced (A), or Comprehensive (C) — based on the system’s operational scope, deployment context, and autonomy characteristics.
Outputs
- •Unique ARA System Identifier (ASI)
- •AVB assignment confirmation
- •System profile selection (F/S/A/C)
- •Certification scope statement
Phase 2: Risk ClassificationNEW IN v1.1
Typical duration: 1–2 weeks
The AVB conducts a mandatory 7-factor risk assessment to determine the system’s Assurance Class (A, B, or C). The seven factors are: autonomy level, decision impact severity, data sensitivity, operational environment, human oversight capacity, action reversibility, and deployment scale. Each factor is scored and weighted to produce an overall risk classification. This phase is new in v1.1 and replaces the simpler level-confirmation step in v1.0.
Outputs
- •Risk factor scores (7 factors)
- •Assurance Class determination (A/B/C)
- •Risk classification rationale document
- •Escalation notice (if applicable)
Phase 3: Evidence Collection
Typical duration: 2–6 weeks
The applicant organization prepares and submits evidence across four defined categories based on the system profile. Evidence categories are: Log/Platform (LP) for telemetry, audit logs, and platform data; Technical Inspection (TI) for architecture review, configuration artifacts, and code analysis; Operational Proof (OP) for runtime behavior validation and operational testing; and Third-Party Attestation (TP) for independent audits, certifications, and external validation reports.
Outputs
- •Evidence portfolio organized by category (LP, TI, OP, TP)
- •Evidence sufficiency mapping per ACR
- •Gap analysis report
- •Platform certification inheritance claims (if applicable)
Phase 4: ACR Evaluation
Typical duration: 2–8 weeks
The core evaluation phase. Each applicable Autonomy Compliance Requirement is assessed using one or more of six evaluation methods: Automated Testing (AT), Human Simulation (HS), Evidence Inspection (EI), Continuous Monitoring (CM), Third-Party Attestation (TP), and Operational Proof (OP). The last two methods are new in v1.1. The AVB evaluates compliance against the defined acceptance criteria for each ACR per the system’s profile, records evidence, and assigns per-ACR scores.
Outputs
- •Per-ACR compliance scores
- •Evidence artifact repository
- •Non-conformity register
- •Interim findings report
Phase 5: Adversarial Testing
Typical duration: 1–3 weeks
Structured adversarial testing proportional to the certification level. L1 evaluations use automated adversarial test suites only. L2 evaluations add structured human adversarial simulation (minimum 40 hours). L3 evaluations require automated suites, human adversarial simulation (minimum 80 hours), an ARAF-approved independent red team engagement, and a minimum 30-day continuous runtime stress test.
Outputs
- •Adversarial test results
- •Vulnerability findings log
- •Red team report (L2/L3)
- •Stress test telemetry (L3)
Phase 6: Scoring & Determination
Typical duration: 1 week
Domain scores are calculated against the 15-domain threshold matrix for the requested certification level. The scoring model applies risk weighting and blocking ACR logic. Platform certification inheritance is applied where applicable, allowing systems built on certified platforms to inherit qualifying domain scores. The overall certification determination is pass/fail: all applicable domain thresholds must be met.
Outputs
- •Domain scorecard (15 domains)
- •Overall pass/fail determination
- •Platform cert inheritance applied
- •Conditional items register (if applicable)
Phase 7: Certification Issuance
Typical duration: 1–2 days
The AVB issues a formal two-axis certification designation combining the certification level and Assurance Class (e.g., L2-B). Platform certifications receive a level designation without an Assurance Class. A living badge is generated with real-time operational state tracking. The certification is published to the ARA Public Registry with full metadata.
Outputs
- •Two-axis designation (e.g., L2-B)
- •Platform cert variant (level only, no class)
- •ARA Trust Signal
- •Public registry entry
- •Certification ID (e.g., ARA-2026-XXXXX)
Phase 8: Continuous Monitoring
Typical duration: Ongoing
Post-certification monitoring is calibrated to the system’s Assurance Class. Class A systems perform periodic self-assessment with results reported to the registry. Class B systems are subject to monthly Continuous Assurance Platform Operator (CAPO) reports with telemetry monitoring. Class C systems require 24/7 CAPO oversight with real-time alerting and immediate escalation capabilities.
Outputs
- •Monitoring reports (per class schedule)
- •Telemetry dashboards
- •CAPO engagement records (Class B/C)
- •Compliance status updates
Phase 9: Renewal & Revalidation
Typical duration: Ongoing
All certifications require annual renewal through a streamlined reassessment. Revalidation may be triggered outside the normal renewal cycle by three conditions: material change to the system’s architecture or operational scope, a monitoring breach detected through continuous monitoring, or an assurance class lapse (e.g., Class B system failing to meet Class B monitoring requirements).
Outputs
- •Annual renewal assessment
- •Updated certification record
- •Revalidation trigger documentation (if applicable)
- •Updated registry entry
Phase 10: Ecosystem Participation
Typical duration: Ongoing
Certified systems participate in the broader ARA ecosystem. This includes public listing in the ARA Certification Registry, display of the living certification badge, eligibility for ARA-linked insurance products through the Risk-Informed Pricing framework, and participation in regulatory equivalence and consortium programs.
Outputs
- •Public registry listing
- •Living badge display rights
- •Insurance eligibility (RIP framework)
- •Consortium and regulatory participation
Evaluation Methods
Each ACR specifies one or more permitted evaluation methods. The AVB selects the appropriate method based on the ACR definition, the system's architecture, and the available evidence. Multiple methods may be applied to a single ACR for corroboration. v1.1 introduces two new methods: Third-Party Attestation (TP) and Operational Proof (OP).
Automated Testing
Programmatic test suites executed against the system under controlled conditions. Includes functional compliance tests, boundary condition tests, regression suites, and automated adversarial probes. Test results are deterministic and reproducible.
Applicability: All ACRs with observable, deterministic outputs.
Human Simulation
Structured scenarios executed by trained human evaluators who interact with the system as end users, adversaries, or edge-case operators. Human simulation tests behavioral responses that cannot be fully captured by automated suites, including nuanced escalation behavior, ambiguous input handling, and social engineering resistance.
Applicability: ACRs involving human interaction, escalation, or contextual judgment.
Evidence Inspection
Review of documentation, configuration artifacts, architecture diagrams, audit logs, and governance records. Evidence inspection validates that the organizational and technical controls surrounding the system are adequate and correctly implemented.
Applicability: ACRs related to governance, documentation, audit trails, and operational procedures.
Continuous Monitoring
Ongoing telemetry collection and analysis during the system’s production operation. Continuous monitoring validates that the system maintains compliance over time, detects behavioral drift, and triggers alerts when operational parameters deviate from certified baselines.
Applicability: ACRs requiring sustained operational validation beyond point-in-time testing.
Third-Party Attestation
NEWIndependent validation provided by qualified third parties. Includes external audit reports, industry certifications (SOC 2, ISO 27001, etc.), penetration test results from accredited firms, and expert attestation letters. Third-party evidence provides independent corroboration of compliance claims.
Applicability: ACRs where independent validation strengthens assurance, particularly in Domains 5, 6, and 13.
Operational Proof
NEWValidation through observed operational behavior in production or production-equivalent environments. Includes runtime performance data, incident response records, operational metrics, and demonstrated behavior under real-world conditions over defined observation periods.
Applicability: ACRs requiring evidence of sustained operational behavior rather than point-in-time testing.
Evidence Sufficiency Matrix
The Evidence Sufficiency Matrix defines the minimum evidence categories required at each certification level. Higher levels require broader evidence coverage and more independent validation.
L1 — Foundation
- •LP (Log/Platform) + TI (Technical Inspection) sufficient for most ACRs
- •OP (Operational Proof) accepted as supplementary evidence
- •TP (Third-Party Attestation) not required but accepted for Domain 5 and Domain 6
L2 — Operational
- •LP + TI + OP required for all applicable ACRs
- •TP required for Domain 5 (Data Privacy), Domain 6 (Security), and Domain 13 (Societal Impact)
- •Platform certification inheritance may reduce evidence burden for inherited domains
L3 — High-Stakes
- •All four evidence categories (LP, TI, OP, TP) required
- •Independent third-party attestation mandatory for all critical-weight ACRs
- •Extended observation periods required for OP evidence (minimum 90 days)
- •Platform certification inheritance claims subject to independent verification
| Evidence Category | L1 | L2 | L3 |
|---|---|---|---|
| LP — Log/Platform | Required | Required | Required |
| TI — Technical Inspection | Required | Required | Required |
| OP — Operational Proof | Supplementary | Required | Required |
| TP — Third-Party Attestation | Optional | Select domains | Mandatory (independent) |
Evidence Integrity Requirements
All evaluation evidence must satisfy the following integrity requirements to be accepted as part of a certification record:
| Requirement | Description |
|---|---|
| Provenance | Every evidence artifact must be traceable to the evaluation session, test execution, or review event that produced it. Evidence without clear provenance is inadmissible. |
| Immutability | Evidence records must be stored in a tamper-evident format. Cryptographic hashes (SHA-256 minimum) must be computed at the time of capture and verified at the time of review. |
| Completeness | Evidence must cover the full scope of each evaluated ACR. Partial evidence may support a partial compliance finding but cannot support full compliance. |
| Timeliness | Evidence must be collected within the evaluation window. Evidence older than 90 days at the time of certification decision is considered stale and must be re-collected unless otherwise justified in the evaluation record. |
| Independence | Evidence produced by the applicant organization is admissible but must be independently validated by the AVB. Self-reported evidence without independent verification does not satisfy evaluation requirements. |
Scoring Model
The ARA scoring model produces per-domain compliance scores across all 15 evaluation domains that are aggregated into an overall certification determination. The model is designed to be transparent, deterministic, and reproducible across different AVBs evaluating the same system.
ACR-Level Scoring
Each ACR is scored on a 4-point scale based on observed compliance against the defined acceptance criteria:
- 3 — Full Compliance
The system fully satisfies all acceptance criteria for the ACR. - 2 — Substantial Compliance
The system satisfies the core intent with minor deviations that do not materially affect reliability. - 1 — Partial Compliance
The system demonstrates relevant capability but falls short of the acceptance criteria in material ways. - 0 — Non-Compliance
The system does not satisfy the ACR. No relevant capability observed or critical failure identified.
Domain Score Calculation
Domain scores are computed as weighted averages of ACR scores within each domain. The formula accounts for risk weighting and blocking ACR logic:
Domain Score = (Σ ACR_score_i × weight_i) / (Σ max_score_i × weight_i) × 100The resulting percentage is compared against the domain threshold for the requested certification level. A system must meet or exceed every applicable domain threshold to qualify for certification.
Risk Weighting Model
Not all ACRs carry equal weight in the scoring model. Each ACR is assigned a risk weight that reflects its relative importance to operational reliability within its domain. Risk weights are determined by the ARAF Technical Standards Board and are published as part of the standard.
| Weight Tier | Multiplier | Criteria |
|---|---|---|
| Critical | 3.0x | ACR failure could result in safety-critical consequences, irreversible actions, or complete loss of operational reliability. Includes all blocking ACRs. |
| High | 2.0x | ACR failure would materially degrade operational reliability or create conditions for cascading failures. Includes ACRs governing core decision integrity and containment mechanisms. |
| Standard | 1.0x | ACR failure would reduce reliability below expected levels but would not create immediate safety or containment risks. The baseline weight for requirements that support operational hygiene. |
| Advisory | 0.5x | ACR represents a best practice or forward-looking requirement. Non-compliance reduces the domain score but is unlikely to independently cause certification failure. |
Blocking ACR Logic
Certain ACRs are designated as blocking — meaning that a score of 0 (Non-Compliance) on any blocking ACR results in automatic certification denial, regardless of the aggregate domain score. Blocking ACRs represent requirements where non-compliance constitutes a fundamental reliability failure that cannot be compensated for by strong performance elsewhere.
Blocking ACR Rules
- •A score of 0 on any blocking ACR results in automatic certification denial.
- •A score of 1 (Partial Compliance) on a blocking ACR triggers a mandatory remediation requirement. The system may receive Conditional Certification with a defined remediation timeline not to exceed 90 days.
- •Blocking ACR designations are level-dependent. An ACR that is blocking at L3 may not be blocking at L1.
- •The set of blocking ACRs is published as part of the standard and cannot be waived by the evaluating AVB.
- •Organizations may request a Blocking ACR Exception Review through ARAF governance channels, subject to TSB approval.
Examples of Blocking ACR Categories
The following categories typically contain blocking ACRs. Specific blocking designations are defined per-ACR in the standard documentation.
Red-Team Requirements by Level
Adversarial testing intensity is calibrated to the certification level. Higher certification levels require more extensive, more independent, and longer-duration adversarial engagements.
| Requirement | L1 | L2 | L3 |
|---|---|---|---|
| Automated adversarial suite | Required | Required | Required |
| Human adversarial simulation | Not required | 40+ hours minimum | 80+ hours minimum |
| Independent red team | Not required | Not required | ARAF-approved independent red team required |
| Continuous stress testing | Not required | Not required | 30-day minimum runtime stress test |
| Red team independence | N/A | AVB assessors (structurally independent from applicant) | Third-party red team with no relationship to applicant or AVB |
| Social engineering testing | Not required | Basic probes | Full social engineering engagement |
Domain Scoring Thresholds
Each certification level prescribes minimum passing scores for each of the 15 applicable evaluation domains. These thresholds represent the floor of acceptable compliance — systems must meet or exceed every threshold to qualify. Thresholds are intentionally set to ensure that higher-stakes certifications demand proportionally higher reliability across all dimensions.
For detailed domain threshold values by level, see the Certification Levels page.
Threshold Interpretation Rules
- •A domain score below the threshold for any single domain results in certification failure for that level, regardless of scores in other domains.
- •Domain scores are not averaged or aggregated. Each domain is independently gated.
- •A system that fails a single domain may apply for Conditional Certification if the gap is minor and remediable within 90 days.
- •Domain thresholds increase with certification level: L1 thresholds range from 55–75%, L2 from 70–85%, and L3 from 85–95%.
- •Domain 14 (Data Privacy) and Domain 15 (Physical Safety) thresholds apply based on system profile and physical actuation capabilities respectively.
Special Evaluation Guidance
Certain system categories require additional evaluation considerations beyond the standard 10-phase lifecycle. The following guidance applies to systems with specialized characteristics.
Voice AI Systems
Voice AI systems require additional evaluation of real-time speech processing reliability, accent/dialect handling equity, voice consent and disclosure requirements, and emotional manipulation resistance. Domain 4 (Transparency) ACRs include voice-specific disclosure requirements. Domain 14 (Data Privacy) includes voice data retention and biometric consent provisions.
Multimodal Systems
Systems that process multiple input modalities (text, image, audio, video) are evaluated across all applicable modality-specific ACRs. Cross-modal interaction effects must be tested during adversarial testing. Evidence collection must cover each modality independently and in combination.
Multi-Agent Orchestration
Orchestration systems that coordinate multiple autonomous agents are evaluated as composite systems. The evaluation scope includes inter-agent communication integrity, delegation chain accountability, conflict resolution mechanisms, and cascading failure containment. Each subordinate agent’s certification status is verified.
Physical / Robotic Systems
Systems with physical actuation capabilities are evaluated against Domain 15 (Physical Safety) ACRs in addition to all applicable digital domains. Physical safety testing requires controlled environment validation, emergency stop verification, and human proximity safety assessment. L3 certification required for systems with irreversible physical actions.
MCP-Connected Agents
Agents that use Model Context Protocol (MCP) connections to external tools and services are evaluated for tool-use authorization controls, scope limitation enforcement, data exfiltration prevention, and connection integrity monitoring. Each MCP tool connection is treated as an autonomy boundary expansion requiring specific ACR coverage.
Express Pathway — L1 Foundation
The Express Pathway provides a streamlined 3-4 week evaluation for low-risk systems seeking L1 Foundation certification. This pathway reduces evaluation overhead while maintaining certification rigor for systems that present minimal risk.
Eligibility Requirements
- •Foundational (F) or Standard (S) system profile only
- •Assurance Class A only (determined in Phase 2)
- •No Domain 15 (Physical Safety) ACRs applicable
- •System does not process sensitive personal data at scale
- •No multi-agent orchestration capabilities
Streamlined Process
- •Combined Intake, Risk Classification, and Evidence Collection (1 week)
- •Focused ACR evaluation against profile-specific subset (1–2 weeks)
- •Automated adversarial testing only (3–5 days)
- •Expedited scoring and issuance (2–3 days)
- •Total timeline: 3–4 weeks end-to-end
Note: Express Pathway certifications carry the same validity and registry status as standard L1 certifications. The pathway notation is recorded in the certification metadata but does not affect the certification designation or badge.