GASP: AICF

AIG-008 AI System Verification, Validation and Testing

Tier 2+ | AI

Description

Defined verification and validation (V&V) procedures are executed before an AI system is deployed or after a substantial modification. Testing includes: functional accuracy against defined metrics, performance against pre-specified thresholds, robustness to distributional shift, safety and failure-mode testing, and fairness and bias evaluation across relevant population subgroups. Test datasets, metrics, tooling, and results are documented and retained. Testing is not performed solely by the team that built the system.
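
As an illustration of two of the dimensions listed above (functional accuracy and fairness across subgroups), the following minimal Python sketch computes accuracy overall and per subgroup, then compares the largest subgroup gap to a limit. The function names, the sample records, and the 0.05 gap limit are hypothetical; the control itself does not prescribe a particular metric or threshold.

```python
from collections import defaultdict

def accuracy(pairs):
    """Fraction of (label, prediction) pairs that match."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no records to evaluate")
    return sum(1 for label, pred in pairs if label == pred) / len(pairs)

def subgroup_accuracy(records):
    """Accuracy per subgroup for records of (group, label, prediction)."""
    by_group = defaultdict(list)
    for group, label, pred in records:
        by_group[group].append((label, pred))
    return {group: accuracy(pairs) for group, pairs in by_group.items()}

def max_accuracy_gap(records):
    """Difference between the best and worst subgroup accuracy."""
    per_group = subgroup_accuracy(records)
    return max(per_group.values()) - min(per_group.values())

# Hypothetical evaluation records: (subgroup, true label, predicted label).
records = [
    ("A", 1, 1), ("A", 0, 0), ("A", 1, 1), ("A", 0, 1),
    ("B", 1, 1), ("B", 0, 0), ("B", 1, 0), ("B", 0, 1),
]

MAX_GAP = 0.05  # assumed acceptance limit, fixed before testing starts

overall = accuracy((label, pred) for _, label, pred in records)
per_group = subgroup_accuracy(records)
gap = max_accuracy_gap(records)
fairness_ok = gap <= MAX_GAP
```

An accuracy gap is only one possible fairness measure; which subgroups and which metric are relevant depends on the system under test, and both should be recorded in the V&V report alongside the result.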

Rationale

AI systems fail in qualitatively different ways from conventional software; V&V must be designed specifically for AI failure modes, not inherited from generic software testing.

Framework Mappings (9)

EU-AI-Art.15.1 | Accuracy, Robustness and Cybersecurity: Performance Standards | partial
EU-AI-Art.9.5 | AI Risk Management System: Testing for Risk Management | full
A.6.2.4 | AI system verification and validation | full
MEASURE 1.3 | Independent AI Risk Assessment | full
MEASURE 2.1 | AI Testing and Evaluation Documentation | full
MEASURE 2.11 | AI Fairness and Bias Evaluation | full
MEASURE 2.3 | AI System Performance Measurement | full
MEASURE 2.5 | AI System Validity and Reliability | full
MEASURE 2.6 | AI System Safety Risk Evaluation | full

Evidence (2)

report | manual

AI system V&V test report produced before deployment or after substantial modification, documenting test datasets, metrics, tooling, results, and independent reviewer sign-off.

Example: V&V Test Report — Fraud Detection Model v4 (Confluence), dated 2025-11-03, containing accuracy, precision, recall, fairness metrics, adversarial robustness results, and sign-off by independent QA team

Test: Request the V&V test report for a sample of production AI systems. Verify: (1) report covers functional accuracy, robustness, safety, and fairness dimensions, (2) test datasets are identified and versioned, (3) results are compared to pre-specified acceptance thresholds, (4) report was authored or reviewed by a team independent of the development team, (5) report is retained and accessible.
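
The five verification points above could be applied to a structured summary of each sampled report. The sketch below is one hedged way to do that in Python; the record layout (the keys `dimensions`, `datasets`, `results`, `reviewers`, `location`) and the names in the example are hypothetical, and a real report stored in a wiki would first need to be transcribed into such a record by the assessor.

```python
REQUIRED_DIMENSIONS = {"functional_accuracy", "robustness", "safety", "fairness"}

def audit_report(report, dev_team):
    """Return a list of findings; an empty list means all five checks passed."""
    findings = []

    # (1) All four testing dimensions are covered.
    missing = REQUIRED_DIMENSIONS - set(report.get("dimensions", []))
    if missing:
        findings.append(f"(1) dimensions not covered: {sorted(missing)}")

    # (2) Test datasets are identified and versioned.
    datasets = report.get("datasets", [])
    if not datasets or any(not d.get("version") for d in datasets):
        findings.append("(2) test datasets missing or unversioned")

    # (3) Every result is compared to a pre-specified threshold.
    results = report.get("results", [])
    if not results or any("threshold" not in r for r in results):
        findings.append("(3) results lack pre-specified thresholds")

    # (4) At least one author or reviewer is outside the development team.
    if not set(report.get("reviewers", [])) - set(dev_team):
        findings.append("(4) no reviewer independent of the development team")

    # (5) A retained, accessible copy is recorded.
    if not report.get("location"):
        findings.append("(5) no retained, accessible copy recorded")

    return findings

# Hypothetical sampled report with three gaps: safety, independence, retention.
sample = {
    "dimensions": ["functional_accuracy", "robustness", "fairness"],
    "datasets": [{"name": "holdout", "version": "v3"}],
    "results": [{"metric": "recall", "value": 0.90, "threshold": 0.85}],
    "reviewers": ["alice"],
    "location": "",
}
findings = audit_report(sample, dev_team={"alice", "bob"})
```

A check like this only confirms that the required elements are present; judging whether the tests themselves were adequate still needs manual review.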

tool_output | automated

Automated evaluation pipeline output (CI/CD test suite results) demonstrating that defined model performance thresholds were checked programmatically before promotion to production.

Example: GitHub Actions CI pipeline run log for model-fraud-detection (run #4812), showing automated accuracy >= 0.92, AUC >= 0.95, and bias test pass gates before merge approval

Test: Request CI/CD pipeline logs for a recent model deployment. Verify: (1) automated evaluation steps are present in the pipeline definition, (2) performance thresholds are coded as pass/fail gates, (3) the deployment was blocked or approved based on gate results, (4) bias evaluation is included as a gate (not only accuracy).
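
To make "thresholds coded as pass/fail gates" concrete, here is a minimal Python sketch of a gate step that a pipeline could run after model evaluation. The accuracy and AUC limits mirror the example above; the `subgroup_accuracy_gap` metric, its 0.05 limit, and the metric values are assumptions for illustration, not part of the control.

```python
# Gate configuration: metric name -> (direction, limit).
THRESHOLDS = {
    "accuracy": ("min", 0.92),
    "auc": ("min", 0.95),
    "subgroup_accuracy_gap": ("max", 0.05),  # hypothetical bias gate
}

def evaluate_gates(metrics, thresholds=THRESHOLDS):
    """Return (name, passed, detail) for every configured gate."""
    results = []
    for name, (kind, limit) in thresholds.items():
        value = metrics.get(name)
        if value is None:
            # A missing metric fails its gate instead of being skipped.
            results.append((name, False, "metric missing"))
            continue
        passed = value >= limit if kind == "min" else value <= limit
        op = ">=" if kind == "min" else "<="
        results.append((name, passed, f"{value} {op} {limit}"))
    return results

def main(metrics):
    """Print one line per gate and return a process exit code (0 = promote)."""
    results = evaluate_gates(metrics)
    for name, passed, detail in results:
        print(f"{'PASS' if passed else 'FAIL'} {name}: {detail}")
    return 0 if all(passed for _, passed, _ in results) else 1

# Hypothetical metrics; a real pipeline would load these from the
# evaluation step's output and pass the return value to sys.exit().
example = {"accuracy": 0.93, "auc": 0.96, "subgroup_accuracy_gap": 0.08}
exit_code = main(example)
```

In this run the accuracy and AUC gates pass but the bias gate fails, so the step returns a non-zero code. That is the behaviour points (3) and (4) look for: promotion is blocked on a fairness result even when accuracy is acceptable.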

Questions (2)

boolean

Are defined verification and validation procedures executed before any AI system is deployed or after a substantial modification?

AI V&V must cover dimensions conventional software testing misses: distributional robustness, fairness across population subgroups, and safety failure modes. Testing should not be performed solely by the team that built the system.

multi

Which of the following are included in your AI system V&V testing?

Functional accuracy against pre-specified metrics and thresholds
Robustness to distributional shift or out-of-distribution inputs
Safety and failure-mode testing
Fairness and bias evaluation across protected characteristic subgroups
Independent review (not solely by the development team)
Documented and retained test datasets and results

All six elements characterise a mature AI V&V process. Fairness evaluation and independent review are the most frequently absent from programmes that inherit generic software testing practices.