AIG-013 Training Data Provenance
Description
The provenance of all data used in AI training, fine-tuning, and evaluation is documented and traceable. Provenance records include: origin (internal, third-party, web-scraped, synthetic), licence and copyright status, collection date, any transformations applied, and consent or legal basis for use. For general-purpose AI (GPAI) models that use web-scraped or third-party data, copyright compliance mechanisms are in place. Provenance records are retained for the lifetime of the model.
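As an illustration of how the fields listed above might be captured in a machine-readable form, the following is a minimal sketch using a Python dataclass. The class name, field names, and example values are hypothetical and are not prescribed by this control; an organisation's actual record format may differ.

```python
from dataclasses import dataclass, field, asdict
from datetime import date
from enum import Enum
import json


class Origin(str, Enum):
    """The four origin categories named in the control description."""
    INTERNAL = "internal"
    THIRD_PARTY = "third-party"
    WEB_SCRAPED = "web-scraped"
    SYNTHETIC = "synthetic"


@dataclass
class ProvenanceRecord:
    """Hypothetical record for one data source; field names are assumptions."""
    source_name: str
    origin: Origin
    licence_status: str          # licence and copyright status
    collection_date: date
    legal_basis: str             # consent or other legal basis for use
    transformations: list[str] = field(default_factory=list)

    def to_json(self) -> str:
        # Convert the enum and date to plain strings so the record can be
        # stored alongside the model for its whole lifetime.
        data = asdict(self)
        data["origin"] = self.origin.value
        data["collection_date"] = self.collection_date.isoformat()
        return json.dumps(data, indent=2)


# Illustrative values only.
record = ProvenanceRecord(
    source_name="internal-crm-exports",
    origin=Origin.INTERNAL,
    licence_status="organisation-owned",
    collection_date=date(2025, 1, 15),
    legal_basis="contract",
    transformations=["deduplication", "PII redaction"],
)
print(record.to_json())
```

Restricting `origin` to an enumeration keeps the four categories consistent across records, which makes later filtering (for example, finding every web-scraped source) straightforward.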
Rationale
Provenance underpins both IP compliance (copyright, licensing) and bias auditing. Data without documented origin cannot be reliably identified, and therefore cannot be removed from training sets, when disputes arise.
Framework Mappings (4)
| Reference | Title | Coverage |
| --- | --- | --- |
| EU-AI-Art.10.2 | Data Governance — Data Preparation and Bias Management | partial |
| EU-AI-Art.53.3 | GPAI Model Obligations — Copyright Compliance Policy | full |
| A.7.5 | Data provenance | full |
| GOVERN 6.1 | Third-Party AI Risk Policies | partial |
Evidence (2)
Training data provenance record for each production model, documenting origin (internal, third-party, web-scraped, synthetic), licence and copyright status, collection date, transformations applied, and legal basis for use.
Example: Data Provenance Record — LLM Fine-Tune Dataset v2 (Confluence), listing 4 source datasets, including internal CRM exports (legal basis: contract), a licensed Common Crawl subset (licence agreement #CC-2024-07), and synthetic augmentation data (generated internally), with copyright review completed by Legal on 2025-05-10
Test: Request provenance records for a sample of training datasets used by production models. Verify: (1) origin of each data source is documented, (2) licence and copyright status is recorded for each source, (3) legal basis for processing is stated, (4) transformation steps are described, (5) for web-scraped sources, a copyright compliance mechanism is documented, (6) records are linked to the model registry entry.
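The six verification points in the test above could be partly automated. The following is a rough sketch, assuming provenance records are available as plain dictionaries; the function name and every key name (`copyright_mechanism`, `model_registry_id`, and so on) are hypothetical. It also assumes that a source with no transformations records that explicitly (for example `["none"]`), since an empty value is treated as undocumented.

```python
# Hypothetical key names mapped to the wording of checks (1) to (4).
REQUIRED_FIELDS = {
    "origin": "origin of data source",
    "licence_status": "licence and copyright status",
    "legal_basis": "legal basis for processing",
    "transformations": "transformation steps",
}


def check_provenance_record(record: dict) -> list[str]:
    """Return a list of findings; an empty list means all six checks passed."""
    findings = []

    # Checks (1) to (4): each required element is present and non-empty.
    for key, label in REQUIRED_FIELDS.items():
        if not record.get(key):
            findings.append(f"missing {label}")

    # Check (5): web-scraped sources need a documented copyright mechanism.
    if record.get("origin") == "web-scraped" and not record.get("copyright_mechanism"):
        findings.append("missing copyright compliance mechanism for web-scraped source")

    # Check (6): the record is linked to a model registry entry.
    if not record.get("model_registry_id"):
        findings.append("not linked to a model registry entry")

    return findings


# Illustrative record with one gap: no copyright mechanism documented.
sample = {
    "origin": "web-scraped",
    "licence_status": "licensed",
    "legal_basis": "licence agreement",
    "transformations": ["language filtering"],
    "model_registry_id": "model-001",
}
print(check_provenance_record(sample))
```

A check like this only confirms that fields are populated. Whether the recorded licence status or legal basis is actually correct still requires review of the underlying evidence.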
Licence agreements or data processing agreements for third-party or licensed training datasets, confirming the organisation has the legal right to use the data for AI training purposes.
Example: Data Licence Agreement with DataProvider Ltd (executed 2024-07-15), explicitly granting rights to use dataset for model training, specifying permitted use scope and restrictions on redistribution of derivative models
Test: Request licence or data processing agreements for all third-party training datasets identified in provenance records. Verify: (1) the agreement explicitly permits use for AI/ML model training, (2) any restrictions on derivative models are identified and assessed against current use, (3) agreements are current (not expired), (4) agreements are stored in a retrievable contract repository.
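The four agreement checks above can be sketched in the same way. This is a minimal illustration, assuming each agreement has been summarised into a dictionary by whoever reviewed the contract; the function name, key names, and example dates are hypothetical, and an agreement with no `expiry_date` is assumed to be open-ended.

```python
from datetime import date


def check_agreement(agreement: dict, today: date) -> list[str]:
    """Return findings for one third-party data agreement (empty list = pass)."""
    findings = []

    # Check (1): training use must be explicitly permitted.
    if not agreement.get("permits_model_training"):
        findings.append("AI/ML training use not explicitly permitted")

    # Check (2): any derivative-model restrictions must have been assessed.
    if agreement.get("derivative_restrictions") and not agreement.get("restrictions_assessed"):
        findings.append("derivative-model restrictions not assessed against current use")

    # Check (3): the agreement must not have expired.
    expiry = agreement.get("expiry_date")
    if expiry is not None and expiry < today:
        findings.append("agreement expired")

    # Check (4): the agreement must be retrievable from a contract repository.
    if not agreement.get("repository_ref"):
        findings.append("not stored in a retrievable contract repository")

    return findings


# Illustrative agreement summary; all values are made up for the example.
agreement = {
    "permits_model_training": True,
    "derivative_restrictions": ["no redistribution of derivative models"],
    "restrictions_assessed": True,
    "expiry_date": date(2025, 7, 14),
    "repository_ref": "contracts/example-agreement",
}
print(check_agreement(agreement, today=date(2025, 8, 1)))
```

Passing `today` as a parameter rather than reading the system clock keeps the check repeatable, so an audit run can be reproduced for a given review date.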
Questions (2)
Is the provenance of all data used in AI training, fine-tuning, and evaluation documented and traceable?
Provenance records are required for both IP compliance (copyright, licensing) and bias auditing. Data without documented origin cannot be removed from training sets when disputes arise.
What does your training data provenance record include for each data source?
All six elements should be present for sources used in production models. For GPAI providers, copyright compliance mechanisms for web-scraped data are a direct EU AI Act obligation (Art. 53.3).