AIG-008 AI System Verification, Validation and Testing

Tier 2	+AI

Description

Defined verification and validation (V&V) procedures are executed before an AI system is deployed and again after any substantial modification. Testing covers functional accuracy against defined metrics, performance against pre-specified thresholds, robustness to distributional shift, safety and failure modes, and fairness and bias across relevant population subgroups. Test datasets, metrics, tooling, and results are documented and retained. Testing is not performed solely by the team that built the system.
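As an illustration of the subgroup fairness evaluation named above, the following minimal Python sketch computes recall per subgroup and the largest gap between subgroups. The function names, the record layout, and the choice of recall gap as the fairness metric are assumptions made for illustration; this control does not prescribe a specific metric.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record layout: (subgroup label, true label, predicted label),
# with binary labels where 1 is the positive class.
Record = Tuple[str, int, int]


def recall_by_subgroup(records: Iterable[Record]) -> Dict[str, float]:
    """Return recall (true positives / actual positives) for each subgroup.

    Subgroups with no actual positives are omitted, because recall is
    undefined for them.
    """
    true_positives = defaultdict(int)
    actual_positives = defaultdict(int)
    for group, y_true, y_pred in records:
        if y_true == 1:
            actual_positives[group] += 1
            if y_pred == 1:
                true_positives[group] += 1
    return {g: true_positives[g] / actual_positives[g] for g in actual_positives}


def max_recall_gap(recalls: Dict[str, float]) -> float:
    """Return the difference between the best and worst subgroup recall."""
    if not recalls:
        raise ValueError("no subgroup has any positive examples")
    return max(recalls.values()) - min(recalls.values())
```

A gap computed this way could be compared against a pre-specified acceptance threshold and recorded in the V&V report alongside the per-subgroup values.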

Rationale

AI systems fail in qualitatively different ways from conventional software; V&V must be designed specifically for AI failure modes, not inherited from generic software testing.

Framework Mappings (9)

EU AI Act 2024

EU-AI-Art.15.1	Accuracy, Robustness and Cybersecurity — Performance Standards	partial
EU-AI-Art.9.5	AI Risk Management System — Testing for Risk Management	full

ISO/IEC 42001:2023

A.6.2.4	AI system verification and validation	full

NIST AI RMF 1.0

MEASURE 1.3	Independent AI Risk Assessment	full
MEASURE 2.1	AI Testing and Evaluation Documentation	full
MEASURE 2.11	AI Fairness and Bias Evaluation	full
MEASURE 2.3	AI System Performance Measurement	full
MEASURE 2.5	AI System Validity and Reliability	full
MEASURE 2.6	AI System Safety Risk Evaluation	full

Evidence (2)

report	manual

AI system V&V test report produced before deployment or after substantial modification, documenting test datasets, metrics, tooling, results, and independent reviewer sign-off.

Example: V&V Test Report — Fraud Detection Model v4 (Confluence), dated 2025-11-03, containing accuracy, precision, recall, fairness metrics, adversarial robustness results, and sign-off by independent QA team

Test: Request the V&V test report for a sample of production AI systems. Verify: (1) report covers functional accuracy, robustness, safety, and fairness dimensions, (2) test datasets are identified and versioned, (3) results are compared to pre-specified acceptance thresholds, (4) report was authored or reviewed by a team independent of the development team, (5) report is retained and accessible.
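The five verification points in this test can be sketched as checks over a structured summary of a report. The `VVReport` record, its field names, and the dimension labels below are hypothetical and exist only for illustration; a real report in a wiki or document store would need to be summarised into such a record by the reviewer.

```python
from dataclasses import dataclass
from typing import Dict, List, Set

# Dimension labels are hypothetical names for the four dimensions in point (1).
REQUIRED_DIMENSIONS = {"functional_accuracy", "robustness", "safety", "fairness"}


@dataclass
class VVReport:
    """Hypothetical summary of a V&V test report, filled in by the reviewer."""
    dimensions_covered: Set[str]
    dataset_versions: Dict[str, str]  # dataset name -> version identifier
    thresholds_compared: bool         # results compared to pre-specified thresholds
    author_team: str
    reviewer_team: str
    development_team: str
    retained_location: str            # where the report is stored; empty if unknown


def audit_findings(report: VVReport) -> List[str]:
    """Return one finding per failed verification point; empty means all pass."""
    findings = []
    missing = REQUIRED_DIMENSIONS - report.dimensions_covered
    if missing:
        findings.append("(1) missing dimensions: " + ", ".join(sorted(missing)))
    if not report.dataset_versions or any(
        not version for version in report.dataset_versions.values()
    ):
        findings.append("(2) test datasets not identified and versioned")
    if not report.thresholds_compared:
        findings.append("(3) results not compared to acceptance thresholds")
    # Point (4) is satisfied if either the author or the reviewer is outside
    # the development team.
    independent = any(
        team and team != report.development_team
        for team in (report.author_team, report.reviewer_team)
    )
    if not independent:
        findings.append("(4) no independent author or reviewer")
    if not report.retained_location:
        findings.append("(5) report not retained or not accessible")
    return findings
```

Returning a list of findings rather than a single pass/fail value keeps the result traceable to the numbered points in the test.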

tool_output	automated

Automated evaluation pipeline output (CI/CD test suite results) demonstrating that defined model performance thresholds were checked programmatically before promotion to production.

Example: GitHub Actions CI pipeline run log for model-fraud-detection (run #4812), showing automated accuracy >= 0.92, AUC >= 0.95, and bias test pass gates before merge approval

Test: Request CI/CD pipeline logs for a recent model deployment. Verify: (1) automated evaluation steps are present in the pipeline definition, (2) performance thresholds are coded as pass/fail gates, (3) the deployment was blocked or approved based on gate results, (4) bias evaluation is included as a gate (not only accuracy).
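A minimal Python sketch of thresholds coded as pass/fail gates, of the kind a pipeline step might run before promotion. The accuracy and AUC thresholds are taken from the example above; the `Gate` type, the metric names, and the bias gate with its 0.05 limit are assumptions made for illustration, not a prescribed implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Gate:
    """A single pass/fail gate on one metric (hypothetical helper type)."""
    metric: str
    threshold: float
    higher_is_better: bool = True


# Accuracy and AUC thresholds mirror the example above. The bias gate
# (maximum recall gap between subgroups, limit 0.05) is a hypothetical choice.
GATES = [
    Gate("accuracy", 0.92),
    Gate("auc", 0.95),
    Gate("max_subgroup_recall_gap", 0.05, higher_is_better=False),
]


def evaluate_gates(
    metrics: Dict[str, float], gates: List[Gate] = GATES
) -> Tuple[bool, List[str]]:
    """Return (passed, failures). A missing metric counts as a failure."""
    failures = []
    for gate in gates:
        if gate.metric not in metrics:
            failures.append(f"{gate.metric}: missing (treated as failure)")
            continue
        value = metrics[gate.metric]
        if gate.higher_is_better:
            ok, op = value >= gate.threshold, ">="
        else:
            ok, op = value <= gate.threshold, "<="
        if not ok:
            failures.append(f"{gate.metric}: {value} violates {op} {gate.threshold}")
    return (not failures, failures)
```

In a pipeline, the calling script would typically exit with a non-zero status when `passed` is false so that the promotion step is blocked. Treating a missing metric as a failure means a skipped bias evaluation cannot pass silently, which corresponds to verification point (4).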

Questions (2)

boolean

Are defined verification and validation procedures executed before any AI system is deployed and again after any substantial modification?

AI V&V must cover dimensions conventional software testing misses: distributional robustness, fairness across population subgroups, and safety failure modes. Testing should not be performed solely by the team that built the system.

multi

Which of the following are included in your AI system V&V testing?

Functional accuracy against pre-specified metrics and thresholds
Robustness to distributional shift or out-of-distribution inputs
Safety and failure-mode testing
Fairness and bias evaluation across protected characteristic subgroups
Independent review (not solely by the development team)
Documented and retained test datasets and results

All six elements characterise a mature AI V&V process. Fairness evaluation and independent review are the most frequently absent from programmes that inherit generic software testing practices.
