AIG-008 AI System Verification, Validation and Testing
Description
Defined verification and validation (V&V) procedures are executed before an AI system is deployed and again after any substantial modification. Testing covers functional accuracy against defined metrics, performance against pre-specified thresholds, robustness to distributional shift, safety and failure-mode behaviour, and fairness and bias across relevant population subgroups. Test datasets, metrics, tooling, and results are documented and retained. Testing is not performed solely by the team that built the system.
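As one illustration of the robustness dimension, the sketch below compares accuracy on a baseline test set with accuracy on a deliberately shifted test set and fails the check when the drop exceeds a tolerance. The function names, the toy model, the data, and the `max_drop` value are all hypothetical; a real programme would define its own shifted datasets and tolerances.

```python
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical toy shape: each example is (feature, label).
Example = Tuple[float, int]


def accuracy(predict: Callable[[float], int], data: Sequence[Example]) -> float:
    """Fraction of examples where the prediction matches the label."""
    if not data:
        raise ValueError("dataset must not be empty")
    correct = sum(1 for x, y in data if predict(x) == y)
    return correct / len(data)


def robustness_check(
    predict: Callable[[float], int],
    baseline: Sequence[Example],
    shifted: Sequence[Example],
    max_drop: float,
) -> Dict[str, object]:
    """Compare baseline and shifted accuracy against a pre-specified tolerance."""
    base_acc = accuracy(predict, baseline)
    shift_acc = accuracy(predict, shifted)
    drop = base_acc - shift_acc
    return {
        "baseline_accuracy": base_acc,
        "shifted_accuracy": shift_acc,
        "drop": drop,
        "passed": drop <= max_drop,
    }


def threshold_model(x: float) -> int:
    """Stand-in for a real model: predicts 1 when the feature is at least 0.5."""
    return 1 if x >= 0.5 else 0


# Illustrative data only: the shifted set clusters near the decision boundary.
baseline_set = [(0.1, 0), (0.2, 0), (0.8, 1), (0.9, 1)]
shifted_set = [(0.4, 0), (0.45, 1), (0.6, 1), (0.55, 0)]

result = robustness_check(threshold_model, baseline_set, shifted_set, max_drop=0.1)
print(result)
```

Recording the tolerance alongside the result keeps the comparison to a pre-specified threshold auditable.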
Rationale
AI systems fail in qualitatively different ways from conventional software; V&V must be designed specifically for AI failure modes, not inherited from generic software testing.
Framework Mappings (9)
| Reference | Requirement | Coverage |
| EU-AI-Art.15.1 | Accuracy, Robustness and Cybersecurity — Performance Standards | partial |
| EU-AI-Art.9.5 | AI Risk Management System — Testing for Risk Management | full |
| A.6.2.4 | AI system verification and validation | full |
| MEASURE 1.3 | Independent AI Risk Assessment | full |
| MEASURE 2.1 | AI Testing and Evaluation Documentation | full |
| MEASURE 2.11 | AI Fairness and Bias Evaluation | full |
| MEASURE 2.3 | AI System Performance Measurement | full |
| MEASURE 2.5 | AI System Validity and Reliability | full |
| MEASURE 2.6 | AI System Safety Risk Evaluation | full |
Evidence (2)
AI system V&V test report produced before deployment or after substantial modification, documenting test datasets, metrics, tooling, results, and independent reviewer sign-off.
Example: V&V Test Report — Fraud Detection Model v4 (Confluence), dated 2025-11-03, containing accuracy, precision, recall, fairness metrics, adversarial robustness results, and sign-off by independent QA team
Test: Request the V&V test report for a sample of production AI systems. Verify: (1) report covers functional accuracy, robustness, safety, and fairness dimensions, (2) test datasets are identified and versioned, (3) results are compared to pre-specified acceptance thresholds, (4) report was authored or reviewed by a team independent of the development team, (5) report is retained and accessible.
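The fairness dimension of such a report could be backed by a per-subgroup metric comparison. The following is a minimal sketch, assuming binary labels and recall as the metric of interest; the group names, the records, and the `MAX_RECALL_GAP` limit are hypothetical and are not taken from any framework listed above.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record shape: (subgroup, true_label, predicted_label).
Record = Tuple[str, int, int]

# Hypothetical acceptance limit for the gap between best and worst subgroup recall.
MAX_RECALL_GAP = 0.1


def recall_by_group(records: Iterable[Record]) -> Dict[str, float]:
    """Recall per subgroup; subgroups with no positive labels are omitted."""
    true_pos: Dict[str, int] = defaultdict(int)
    false_neg: Dict[str, int] = defaultdict(int)
    for group, y_true, y_pred in records:
        if y_true == 1:
            if y_pred == 1:
                true_pos[group] += 1
            else:
                false_neg[group] += 1
    groups = set(true_pos) | set(false_neg)
    return {g: true_pos[g] / (true_pos[g] + false_neg[g]) for g in groups}


def max_recall_gap(recalls: Dict[str, float]) -> float:
    """Difference between the highest and lowest subgroup recall."""
    values = list(recalls.values())
    return max(values) - min(values)


# Illustrative data only: four positive cases per subgroup.
records = [
    ("group_a", 1, 1), ("group_a", 1, 1), ("group_a", 1, 1), ("group_a", 1, 0),
    ("group_b", 1, 1), ("group_b", 1, 1), ("group_b", 1, 0), ("group_b", 1, 0),
    ("group_a", 0, 0), ("group_b", 0, 1),
]

recalls = recall_by_group(records)
gap = max_recall_gap(recalls)
fairness_passed = gap <= MAX_RECALL_GAP
print(recalls, gap, fairness_passed)
```

Recall gap is only one possible fairness measure; the appropriate metric and limit depend on the system and the subgroups considered relevant.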
Automated evaluation pipeline output (CI/CD test suite results) demonstrating that defined model performance thresholds were checked programmatically before promotion to production.
Example: GitHub Actions CI pipeline run log for model-fraud-detection (run #4812), showing that the automated gates for accuracy >= 0.92, AUC >= 0.95, and the bias test all passed before merge approval
Test: Request CI/CD pipeline logs for a recent model deployment. Verify: (1) automated evaluation steps are present in the pipeline definition, (2) performance thresholds are coded as pass/fail gates, (3) the deployment was blocked or approved based on gate results, (4) bias evaluation is included as a gate (not only accuracy).
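Thresholds coded as pass/fail gates might look like the sketch below. The accuracy and AUC limits mirror the example above; the `recall_gap` limit, the metric names, and the example metric values are hypothetical, and in practice the metrics would come from the evaluation step of the pipeline rather than a literal dictionary.

```python
from typing import Dict, List, Tuple

# Gate definitions: metric name -> (comparison, limit).
# accuracy and auc follow the example thresholds; recall_gap is a hypothetical bias gate.
THRESHOLDS: Dict[str, Tuple[str, float]] = {
    "accuracy": (">=", 0.92),
    "auc": (">=", 0.95),
    "recall_gap": ("<=", 0.05),
}


def evaluate_gates(
    metrics: Dict[str, float], thresholds: Dict[str, Tuple[str, float]]
) -> List[str]:
    """Return a list of human-readable gate failures (empty when all gates pass)."""
    failures: List[str] = []
    for name, (op, limit) in thresholds.items():
        if name not in metrics:
            # A missing metric is treated as a failure, not silently skipped.
            failures.append(f"{name}: missing from evaluation output")
            continue
        value = metrics[name]
        ok = value >= limit if op == ">=" else value <= limit
        if not ok:
            failures.append(f"{name}: {value} violates {op} {limit}")
    return failures


def main(metrics: Dict[str, float]) -> int:
    """Print failures and return a process-style exit code (0 = promote, 1 = block)."""
    failures = evaluate_gates(metrics, THRESHOLDS)
    for failure in failures:
        print("GATE FAILED:", failure)
    return 1 if failures else 0


# Illustrative values only: accuracy and AUC pass, the bias gate fails.
example_metrics = {"accuracy": 0.94, "auc": 0.96, "recall_gap": 0.08}
failures = evaluate_gates(example_metrics, THRESHOLDS)
exit_code = main(example_metrics)
# In a CI step, passing exit_code to sys.exit() would fail the job when it is non-zero.
```

Treating a missing metric as a failure means a gate cannot be bypassed by omitting it from the evaluation output, which supports check (4) above.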
Questions (2)
Are defined verification and validation procedures executed before any AI system is deployed and again after each substantial modification?
AI V&V must cover dimensions conventional software testing misses: distributional robustness, fairness across population subgroups, and safety failure modes. Testing should not be performed solely by the team that built the system.
Which of the following are included in your AI system V&V testing?
All six elements characterise a mature AI V&V process. Fairness evaluation and independent review are the most frequently absent from programmes that inherit generic software testing practices.