AIG-019 AI Model Performance and Drift Detection
Description
Deployed AI models are evaluated on a defined schedule for performance degradation and distribution shift (data drift and concept drift). The schedule reflects the velocity of the underlying domain: at least quarterly for stable domains and at least monthly for high-velocity domains. Evaluation uses held-out test data or shadow deployments. When performance falls below defined thresholds or drift is detected, a documented escalation path is triggered, ranging from investigation to retraining or decommissioning.
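The escalation path described above can be sketched as a small decision function. This is a minimal illustration only: the `Thresholds` values, the use of accuracy as the performance metric, and the names `Escalation` and `decide_escalation` are all assumptions made for the example and are not prescribed by this control.

```python
from dataclasses import dataclass
from enum import Enum


class Escalation(Enum):
    NO_ACTION = "no action"
    INVESTIGATE = "investigation"
    RETRAIN = "retrain"
    DECOMMISSION = "decommission"


@dataclass(frozen=True)
class Thresholds:
    # All values below are illustrative assumptions, not taken from the control.
    max_psi: float = 0.2             # data drift level that triggers investigation
    max_accuracy_drop: float = 0.02  # tolerated accuracy drop versus baseline
    retrain_drop: float = 0.05       # drop at which retraining is warranted
    decommission_drop: float = 0.15  # drop at which the model is withdrawn


def decide_escalation(baseline_accuracy: float,
                      current_accuracy: float,
                      max_feature_psi: float,
                      t: Thresholds = Thresholds()) -> Escalation:
    """Map one evaluation result to a step on the escalation path."""
    drop = baseline_accuracy - current_accuracy
    # Check the most severe condition first so it cannot be masked.
    if drop >= t.decommission_drop:
        return Escalation.DECOMMISSION
    if drop >= t.retrain_drop:
        return Escalation.RETRAIN
    if drop > t.max_accuracy_drop or max_feature_psi > t.max_psi:
        return Escalation.INVESTIGATE
    return Escalation.NO_ACTION
```

In practice the thresholds would come from the model's documented acceptance criteria, and the returned decision would be recorded in the evaluation report.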
Rationale
Model drift is an AI-specific failure mode that has no equivalent in conventional software; without scheduled evaluation, a degraded model can continue to operate undetected.
Framework Mappings (5)
| EU-AI-Art.15.2 | Accuracy, Robustness and Cybersecurity — Resilience and Fail-Safe Design | partial |
| MANAGE 2.2 | Deployed AI System Value Maintenance | full |
| MANAGE 3.2 | Pre-Trained Model Monitoring | full |
| MEASURE 1.2 | AI Metrics and Control Effectiveness Assessment | full |
| MEASURE 4.3 | Performance Improvement and Decline Tracking | full |
Evidence (2)
Periodic model performance and drift evaluation report demonstrating that deployed models were assessed for performance degradation and distribution shift on the defined schedule, with comparison against baseline metrics.
Example: Quarterly Drift Report — Customer Churn Predictor (Weights & Biases artefact, Q1 2026), showing PSI score for input features, model accuracy vs baseline, concept drift F1 delta, and escalation decision: 'no action required — within threshold'
Test: Request drift evaluation reports for a sample of production models covering the last two evaluation periods. Verify: (1) reports are dated at the scheduled frequency, (2) both data drift and concept/performance drift are evaluated, (3) results are compared to documented thresholds, (4) escalation decision is recorded (no action / investigation / retrain / decommission), (5) where thresholds were breached, a documented escalation action was taken.
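The five verification points in the test above lend themselves to a simple automated pre-check. The sketch below assumes the reports have already been transcribed into records; the `DriftReport` fields and the `audit_reports` helper are hypothetical names introduced for illustration, not part of any tool named in this control.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

VALID_DECISIONS = {"no action", "investigation", "retrain", "decommission"}


@dataclass
class DriftReport:
    # Hypothetical record layout for one evaluation report.
    report_date: date
    data_drift_evaluated: bool
    performance_drift_evaluated: bool
    thresholds_documented: bool
    threshold_breached: bool
    decision: Optional[str]
    escalation_ticket: Optional[str] = None  # reference to the action taken, if any


def audit_reports(reports: List[DriftReport], max_interval_days: int) -> List[str]:
    """Return findings keyed to checks (1)-(5); an empty list means no findings."""
    findings = []
    ordered = sorted(reports, key=lambda r: r.report_date)
    # (1) consecutive reports must not be further apart than the schedule allows
    for prev, cur in zip(ordered, ordered[1:]):
        gap = (cur.report_date - prev.report_date).days
        if gap > max_interval_days:
            findings.append(f"(1) {gap}-day gap before {cur.report_date}")
    for r in ordered:
        if not (r.data_drift_evaluated and r.performance_drift_evaluated):
            findings.append(f"(2) {r.report_date}: drift coverage incomplete")
        if not r.thresholds_documented:
            findings.append(f"(3) {r.report_date}: no documented thresholds")
        if r.decision not in VALID_DECISIONS:
            findings.append(f"(4) {r.report_date}: escalation decision missing")
        # (5) a breach needs a decision other than 'no action' and evidence of action
        if r.threshold_breached and (r.decision == "no action" or not r.escalation_ticket):
            findings.append(f"(5) {r.report_date}: breach without escalation action")
    return findings
```

A pre-check like this only flags candidates for review; the assessor still needs to inspect the underlying reports.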
Automated drift detection tool output from model monitoring platform (e.g. Evidently AI, Arize, WhyLabs) showing scheduled drift metric computation for production models.
Example: Evidently AI drift report JSON export for fraud-model-prod (weekly run 2026-04-14): feature drift detected on 2/18 features (PSI > 0.2 threshold), dataset drift test: PASS, target drift: PASS, alert fired to ml-monitoring Slack channel
Test: Request automated drift detection tool output for a production model. Verify: (1) drift metrics are computed automatically on the defined schedule (check run timestamps), (2) alert thresholds are configured and alert firing is evidenced, (3) the tool output is linked to the escalation process (Slack/PagerDuty alert or ticket creation), (4) tool is monitoring the live production model (not a shadow environment).
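Both evidence examples rely on the Population Stability Index (PSI), so a short sketch of how it can be computed and rolled up may help when reading tool output. This is a plain-Python illustration, not the implementation used by any of the platforms named above. The equal-width binning, the `eps` floor, and the share-based dataset rule with a `drift_share` of 0.5 are assumptions chosen for the example.

```python
import math
from typing import Dict, Sequence


def psi(baseline: Sequence[float], current: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index of `current` relative to `baseline`."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # guard against a constant baseline

    def proportions(values: Sequence[float]) -> list:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values into edge bins
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    p, q = proportions(baseline), proportions(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


def summarize_drift(feature_psi: Dict[str, float],
                    psi_threshold: float = 0.2,
                    drift_share: float = 0.5) -> dict:
    """Roll per-feature PSI up into a dataset-level result (assumed share rule)."""
    drifted = sorted(name for name, value in feature_psi.items() if value > psi_threshold)
    share = len(drifted) / len(feature_psi)
    return {
        "drifted_features": drifted,
        "share": share,
        "dataset_drift": share >= drift_share,  # True means the dataset test fails
        "alert": bool(drifted),                 # any drifted feature raises an alert
    }
```

Under a share-based rule of this kind, a result such as 2 drifted features out of 18 can raise an alert while the dataset-level test still passes, which is consistent with the example output above.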
Questions (2)
Are deployed AI models evaluated on a scheduled basis for performance degradation and distribution shift (data drift, concept drift)?
Model drift is an AI-specific failure mode with no equivalent in conventional software. Without scheduled evaluation, a degraded model can continue to operate undetected. Evaluation should use held-out test data or shadow deployments and trigger a documented escalation path when thresholds are breached or drift is detected.
How frequently are your production AI models evaluated for drift or performance degradation?
Frequency should match the velocity of the underlying domain. Stable domains require at least quarterly evaluation; high-velocity domains (e.g. fraud, content moderation) require monthly or more frequent evaluation. Annual evaluation is insufficient for any system whose data environment changes.
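The cadence guidance can be expressed as a small lookup that flags overdue evaluations. The sketch below is illustrative: the category keys and the day counts (92 days for quarterly, 31 days for monthly) are assumptions that approximate the minimum frequencies stated in this control.

```python
from datetime import date, timedelta

# Maximum days between evaluations per domain velocity (assumed approximations
# of "quarterly" and "monthly"; tighten these where the domain demands it).
MAX_INTERVAL_DAYS = {"stable": 92, "high_velocity": 31}


def next_evaluation_due(last_evaluated: date, domain_velocity: str) -> date:
    """Latest date by which the next evaluation should have run."""
    return last_evaluated + timedelta(days=MAX_INTERVAL_DAYS[domain_velocity])


def is_overdue(last_evaluated: date, domain_velocity: str, today: date) -> bool:
    """True when the evaluation window for this model has already lapsed."""
    return today > next_evaluation_due(last_evaluated, domain_velocity)
```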