AIG-013 Training Data Provenance

Tier 2+AI

Description

The provenance of all data used in AI training, fine-tuning, and evaluation is documented and traceable. Each provenance record includes: origin (internal, third-party, web-scraped, synthetic), licence and copyright status, collection date, any transformations applied, and the consent or other legal basis for use. For general-purpose AI (GPAI) models that use web-scraped or third-party data, copyright compliance mechanisms are in place. Provenance records are retained for the lifetime of the model.
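One way to capture these fields is a small structured record per data source. The sketch below is illustrative only: the class and field names (ProvenanceRecord, Origin, copyright_mechanism) are assumptions made for this example, not part of the control or of any standard schema.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional, Tuple


class Origin(Enum):
    """The four origin categories named in the control description."""
    INTERNAL = "internal"
    THIRD_PARTY = "third-party"
    WEB_SCRAPED = "web-scraped"
    SYNTHETIC = "synthetic"


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical per-source provenance record (field names are assumptions)."""
    source_name: str
    origin: Origin
    licence_status: str            # licence and copyright status
    collection_date: date
    legal_basis: str               # e.g. contract, licence, consent
    transformations: Tuple[str, ...] = ()
    copyright_mechanism: Optional[str] = None

    def needs_copyright_mechanism(self) -> bool:
        # The description calls for a copyright compliance mechanism
        # when web-scraped or third-party data is used.
        return self.origin in (Origin.WEB_SCRAPED, Origin.THIRD_PARTY)
```

Making the record immutable (frozen=True) is a design choice for this sketch: a change to a source's provenance then produces a new record rather than silently overwriting the old one, which suits the requirement to retain records for the lifetime of the model.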

Rationale

Provenance underpins both IP compliance (copyright, licensing) and bias auditing. Data whose origin is undocumented cannot be reliably identified in a training set, and therefore cannot be removed from it, when a dispute arises.

Framework Mappings (4)

EU AI Act 2024

EU-AI-Art.10.2	Data Governance — Data Preparation and Bias Management	partial
EU-AI-Art.53.3	GPAI Model Obligations — Copyright Compliance Policy	full

ISO/IEC 42001:2023

A.7.5	Data provenance	full

NIST AI RMF 1.0

GOVERN 6.1	Third-Party AI Risk Policies	partial

Evidence (2)

record (manual)

Training data provenance record for each production model, documenting origin (internal, third-party, web-scraped, synthetic), licence and copyright status, collection date, transformations applied, and legal basis for use.

Example: Data Provenance Record — LLM Fine-Tune Dataset v2 (Confluence), listing 4 source datasets, including internal CRM exports (contract basis), a licensed Common Crawl subset (licence agreement #CC-2024-07), and synthetic augmentation (internal generation), with copyright review completed by legal on 2025-05-10.

Test: Request provenance records for a sample of training datasets used by production models. Verify: (1) origin of each data source is documented, (2) licence and copyright status is recorded for each source, (3) legal basis for processing is stated, (4) transformation steps are described, (5) for web-scraped sources, a copyright compliance mechanism is documented, (6) records are linked to the model registry entry.
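The six verification points above could be automated as a first-pass screen before manual review. This is a minimal sketch, assuming provenance records are available as dictionaries; the key names (origin, licence_status, legal_basis, transformations, copyright_mechanism, model_registry_id) are hypothetical and would need to match whatever the organisation's record format actually uses.

```python
from typing import Dict, List

# Hypothetical key names mapped to the numbered checks in the test procedure.
# Assumption: "no transformations" is recorded explicitly (e.g. "none"),
# so an empty or missing value counts as undocumented.
REQUIRED_FIELDS = {
    "origin": "(1) origin is documented",
    "licence_status": "(2) licence and copyright status is recorded",
    "legal_basis": "(3) legal basis is stated",
    "transformations": "(4) transformation steps are described",
    "model_registry_id": "(6) record is linked to the model registry entry",
}


def audit_record(record: Dict[str, object]) -> List[str]:
    """Return the checks that fail for one record; an empty list means pass."""
    failures = [
        label for key, label in REQUIRED_FIELDS.items() if not record.get(key)
    ]
    # Check (5) applies only to web-scraped sources.
    if record.get("origin") == "web-scraped" and not record.get("copyright_mechanism"):
        failures.append("(5) copyright compliance mechanism is documented")
    # Labels start with their check number, so sorting restores numeric order.
    return sorted(failures)
```

A screen like this only confirms that each field is populated. Whether the recorded licence status or legal basis is actually correct still requires the manual review the test describes.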

contract (manual)

Licence agreements or data processing agreements for third-party or licensed training datasets, confirming the organisation has the legal right to use the data for AI training purposes.

Example: Data Licence Agreement with DataProvider Ltd (executed 2024-07-15), explicitly granting rights to use dataset for model training, specifying permitted use scope and restrictions on redistribution of derivative models

Test: Request licence or data processing agreements for all third-party training datasets identified in provenance records. Verify: (1) the agreement explicitly permits use for AI/ML model training, (2) any restrictions on derivative models are identified and assessed against current use, (3) agreements are current (not expired), (4) agreements are stored in a retrievable contract repository.
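The four agreement checks can be sketched the same way. The example below assumes that someone has already read each agreement and recorded the outcome as structured metadata; the LicenceAgreement class and its fields are hypothetical, and the code does not attempt to interpret contract text.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class LicenceAgreement:
    """Hypothetical metadata captured by a reviewer for one agreement."""
    dataset: str
    permits_ml_training: bool               # explicit AI/ML training grant found
    derivative_restrictions_assessed: bool  # restrictions checked against current use
    expiry: Optional[date]                  # None means no expiry date
    repository_ref: Optional[str]           # location in the contract repository


def agreement_findings(a: LicenceAgreement, today: date) -> List[str]:
    """Return the numbered checks that fail; an empty list means pass."""
    findings = []
    if not a.permits_ml_training:
        findings.append("(1) training use not explicitly permitted")
    if not a.derivative_restrictions_assessed:
        findings.append("(2) derivative-model restrictions not assessed")
    if a.expiry is not None and a.expiry < today:
        findings.append("(3) agreement expired")
    if not a.repository_ref:
        findings.append("(4) not stored in a retrievable repository")
    return findings
```

Passing today as a parameter, rather than reading the system clock inside the function, keeps the check reproducible when an audit is re-run for a past date.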

Questions (2)

boolean

Is the provenance of all data used in AI training, fine-tuning, and evaluation documented and traceable?

Provenance records are required for both IP compliance (copyright, licensing) and bias auditing. Data whose origin is undocumented cannot be reliably identified in a training set, and therefore cannot be removed from it, when a dispute arises.

multi

What does your training data provenance record include for each data source?

Data origin (internal, third-party, web-scraped, synthetic)
Licence and copyright status
Collection or acquisition date
Transformations applied to the data
Legal basis for use (contract, licence, consent)
Copyright compliance mechanism for web-scraped sources

All six elements should be present for every source used in production models. For GPAI providers, a policy to comply with copyright law, which covers web-scraped data, is a direct EU AI Act obligation (Art. 53(1)(c), mapped above as EU-AI-Art.53.3).
