AIG-012 Training Data Management and Quality
Description
Data used to train, fine-tune, or evaluate AI models is subject to documented data management practices. These practices cover acquisition and selection criteria, quality requirements (completeness, representativeness, accuracy, freshness), labelling and annotation procedures, bias identification and mitigation steps, and handling of data from underrepresented subgroups. Data quality is validated before use. Training datasets are versioned, and each dataset version is referenced from the model registry.
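The versioning requirement can be illustrated with a minimal sketch. It assumes a content-derived version ID (a SHA-256 hash over the records) and a hypothetical `RegistryEntry` structure. Neither is prescribed by this control, and the model name, dataset name, and sample records are invented for illustration.

```python
import hashlib
import json
from dataclasses import dataclass


def compute_dataset_version(records: list[dict]) -> str:
    """Return a content-derived version ID: SHA-256 of the canonical JSON form.

    Dictionary keys are sorted, but record order is significant, so reordering
    the records yields a different version ID.
    """
    canonical = json.dumps(records, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class RegistryEntry:
    """Hypothetical model registry entry that pins the training dataset version."""
    model_name: str
    model_version: str
    dataset_name: str
    dataset_version: str


# Illustrative records only; not taken from any real dataset.
records = [
    {"text": "where is my order", "intent": "order_status"},
    {"text": "cancel my plan", "intent": "cancellation"},
]

entry = RegistryEntry(
    model_name="intent-classifier",
    model_version="1",
    dataset_name="customer-intent-dataset",
    dataset_version=compute_dataset_version(records),
)
```

Because the ID is derived from content, any change to the records produces a new version, so the registry reference always identifies exactly the data the model was trained on.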
Rationale
Data quality is the single largest determinant of AI system quality, and training data that is undocumented or unvalidated cannot be audited after the fact.
Framework Mappings (8)
| Reference | Requirement | Coverage |
| --- | --- | --- |
| EU-AI-Art.10.1 | Data Governance — Training, Validation and Testing Dataset Quality | full |
| EU-AI-Art.10.2 | Data Governance — Data Preparation and Bias Management | full |
| EU-AI-Art.10.3 | Data Governance — Dataset Representativeness and Completeness | full |
| A.7.2 | Data for development and enhancement of AI system | full |
| A.7.3 | Acquisition of data | full |
| A.7.4 | Quality of data for AI systems | full |
| A.7.6 | Data preparation | full |
| MAP 2.3 | Scientific Integrity and Testing Considerations | partial |
Evidence (2)
Training dataset documentation record for each AI model, covering acquisition criteria, quality requirements, labelling procedures, bias identification steps, and reference to the versioned dataset in the model registry.
Example: Training Data Card — Customer Intent Dataset v3 (MLflow artefact tag: dataset-card), documenting the source, selection criteria, quality validation results, annotator agreement scores, bias review findings, and a link to the versioned S3 dataset
Test: Request training dataset documentation for a sample of production models. Verify: (1) acquisition and selection criteria are documented, (2) quality validation results are present (completeness, accuracy, representativeness checks), (3) bias identification step and outcome are recorded, (4) dataset is versioned and the version is referenced in the model registry entry, (5) handling of underrepresented subgroups is addressed.
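The five verification points in this test could be automated along the following lines. This is a sketch only: the `DatasetCard` structure and its field names are hypothetical, not the schema of MLflow or any other tool, and the sample values are invented.

```python
from dataclasses import dataclass
from typing import Optional

REQUIRED_QUALITY_CHECKS = ("completeness", "accuracy", "representativeness")


@dataclass
class DatasetCard:
    """Hypothetical dataset documentation record; field names are illustrative."""
    acquisition_criteria: Optional[str] = None
    quality_results: Optional[dict] = None          # keyed by quality dimension
    bias_review_outcome: Optional[str] = None
    dataset_version: Optional[str] = None           # version stated on the card
    registry_dataset_version: Optional[str] = None  # version in the registry entry
    subgroup_handling: Optional[str] = None


def audit_card(card: DatasetCard) -> list[str]:
    """Return audit findings; an empty list means all five checks passed."""
    findings = []
    if not card.acquisition_criteria:
        findings.append("(1) acquisition and selection criteria missing")
    results = card.quality_results or {}
    missing = [c for c in REQUIRED_QUALITY_CHECKS if c not in results]
    if missing:
        findings.append("(2) quality validation results missing: " + ", ".join(missing))
    if not card.bias_review_outcome:
        findings.append("(3) bias identification step or outcome missing")
    if not card.dataset_version or card.dataset_version != card.registry_dataset_version:
        findings.append("(4) dataset version absent or not matching the model registry")
    if not card.subgroup_handling:
        findings.append("(5) handling of underrepresented subgroups not addressed")
    return findings


# Illustrative card that omits subgroup handling.
card = DatasetCard(
    acquisition_criteria="Support chat transcripts from consenting customers",
    quality_results={"completeness": "pass", "accuracy": "pass", "representativeness": "pass"},
    bias_review_outcome="Reviewed; outcome recorded",
    dataset_version="v3",
    registry_dataset_version="v3",
)
findings = audit_card(card)
```

Returning a list of findings rather than a single pass/fail flag keeps the result traceable to the numbered verification points, which is what a reviewer would record.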
Data quality validation report from automated data pipeline tooling (e.g. Great Expectations, dbt tests, Soda) confirming that training datasets passed defined quality checks before model training commenced.
Example: Great Expectations validation result (HTML report, run 2026-01-10) for customer-intent-dataset-v3, showing 97.4% completeness, no null rate violations, and schema conformance pass across all 14 expectations
Test: Request the data quality validation report for a recent training dataset. Verify: (1) expectations cover completeness, accuracy, and representativeness dimensions, (2) all critical expectations passed, (3) report timestamp predates the model training run timestamp, (4) any failed expectations have a documented remediation or waiver.
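A reviewer could script the four checks in this test roughly as follows. The sketch assumes the validation report has already been parsed into a generic `ExpectationResult` list. That structure, the expectation names, and the training start time are hypothetical; it is not the native output format of Great Expectations, dbt, or Soda.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

REQUIRED_DIMENSIONS = ("completeness", "accuracy", "representativeness")


@dataclass
class ExpectationResult:
    """Hypothetical, tool-neutral view of one expectation in a validation report."""
    name: str
    dimension: str                # quality dimension the expectation covers
    critical: bool
    passed: bool
    waiver: Optional[str] = None  # reference to a documented remediation or waiver


def review_report(results: list[ExpectationResult],
                  report_time: datetime,
                  training_start: datetime) -> list[str]:
    """Return findings for the four verification points; empty means pass."""
    findings = []
    covered = {r.dimension for r in results}
    for dim in REQUIRED_DIMENSIONS:
        if dim not in covered:
            findings.append(f"(1) no expectation covers {dim}")
    for r in results:
        if r.critical and not r.passed:
            findings.append(f"(2) critical expectation failed: {r.name}")
    if report_time >= training_start:
        findings.append("(3) report does not predate the training run")
    for r in results:
        if not r.passed and not r.waiver:
            findings.append(f"(4) failed expectation without remediation or waiver: {r.name}")
    return findings


# Illustrative inputs; the expectation names and training time are invented.
results = [
    ExpectationResult("intent_not_null", "completeness", critical=True, passed=True),
    ExpectationResult("label_matches_gold_sample", "accuracy", critical=True, passed=True),
    ExpectationResult("region_share_in_range", "representativeness", critical=False,
                      passed=False, waiver="WAIVER-001"),
]
findings = review_report(
    results,
    report_time=datetime(2026, 1, 10, 9, 0),
    training_start=datetime(2026, 1, 12, 9, 0),
)
```

Note that a critical expectation that fails is reported under point (2) even when a waiver exists, since the test requires all critical expectations to pass; whether a waiver should override that is a policy decision this sketch does not make.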
Questions (2)
Is data used to train, fine-tune, or evaluate AI models subject to documented data management practices covering quality, labelling, and bias?
Training data quality is the single largest determinant of AI system quality. Practices should include documented quality requirements, bias identification steps, and validation before use.
Which of the following training data management practices are applied before model training begins?
All six practices are expected for a mature data governance programme. Missing bias identification or subgroup handling documentation creates exposure to fairness failures that surface after deployment.