AIG-018 AI System Operational Monitoring
Description
Each production AI system has a defined monitoring plan that specifies the metrics to be tracked (e.g. error rates, latency, output confidence distributions, null/refusal rates), alert thresholds, monitoring cadence, and a named owner responsible for reviewing alerts. Monitoring is active from the moment a system enters production. Monitoring results are reviewed at a defined frequency: at least monthly for Tier 2 and above, and at least weekly for Tier 3. Alerts trigger a documented triage process.
Rationale
AI system behaviour degrades in production in ways not visible from infrastructure metrics alone; operational monitoring must be AI-specific, not inherited from generic APM tooling.
Framework Mappings (5)
| Reference | Title | Coverage |
|---|---|---|
| EU-AI-Art.26.4 | Deployer Obligations — Operational Monitoring and Incident Notification | full |
| A.6.2.6 | AI system operation and monitoring | full |
| MANAGE 4.1 | Post-Deployment AI System Monitoring | full |
| MEASURE 2.4 | AI System Production Monitoring | full |
| MEASURE 3.1 | AI Risk Identification and Tracking | partial |
Evidence (2)
Monitoring plan or monitoring configuration for each production AI system, specifying tracked metrics, alert thresholds, monitoring cadence, and named monitoring owner.
Example: Datadog monitor configuration export for ai-fraud-detection service: monitors for inference error rate (alert >2%), p95 latency (alert >800ms), null/refusal rate (alert >5%), output confidence distribution (alert if mean <0.7), owner tag: ml-ops-team, cadence: real-time streaming with daily digest review
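The thresholds in the example above can be sketched as a small, tool-independent check. This is a minimal illustration, not a Datadog configuration: the `Threshold` class, the `breached_metrics` function, and the metric key names are hypothetical, and only the numeric limits come from the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Threshold:
    metric: str      # hypothetical metric key
    limit: float
    direction: str   # "above": alert when value > limit; "below": alert when value < limit

    def breached(self, value: float) -> bool:
        if self.direction == "above":
            return value > self.limit
        return value < self.limit


# Limits taken from the ai-fraud-detection example; key names are assumptions.
FRAUD_DETECTION_THRESHOLDS = [
    Threshold("inference_error_rate", 0.02, "above"),     # alert >2%
    Threshold("p95_latency_ms", 800.0, "above"),          # alert >800ms
    Threshold("null_refusal_rate", 0.05, "above"),        # alert >5%
    Threshold("mean_output_confidence", 0.7, "below"),    # alert if mean <0.7
]


def breached_metrics(thresholds: List[Threshold], observed: Dict[str, float]) -> List[str]:
    """Return the names of metrics whose observed value crosses its threshold.

    A metric with no observed value is also reported, on the assumption
    that missing data should be triaged rather than silently ignored.
    """
    alerts = []
    for t in thresholds:
        if t.metric not in observed:
            alerts.append(t.metric + " (no data)")
        elif t.breached(observed[t.metric]):
            alerts.append(t.metric)
    return alerts
```

Treating a missing metric as an alert is a design choice made here for illustration; a plan could equally route missing data to a separate data-quality check.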
Test: Request monitoring configuration or plan for a sample of production AI systems. Verify: (1) monitored metrics include AI-specific measures (confidence distribution, refusal/null rate, output category distribution) in addition to infrastructure metrics, (2) alert thresholds are defined for each metric, (3) a named owner is assigned, (4) a triage process for alerts is documented and accessible, (5) monitoring was active from system go-live (check monitor creation date vs deployment date).
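The five verification points in the test above could be automated along the following lines. This is a sketch under stated assumptions: the `MonitoringPlan` fields, the metric key names in `AI_SPECIFIC_METRICS`, and the strict reading that all three AI-specific measures are expected are illustrative choices, not part of the control.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional

# Hypothetical keys for the AI-specific measures named in the test.
AI_SPECIFIC_METRICS = {
    "confidence_distribution",
    "null_refusal_rate",
    "output_category_distribution",
}


@dataclass
class MonitoringPlan:
    system: str
    thresholds: Dict[str, Optional[float]]  # metric -> alert threshold, None if undefined
    owner: Optional[str]
    triage_doc: Optional[str]               # link or path to the triage process
    monitor_created: date
    deployed: date


def audit_plan(plan: MonitoringPlan) -> List[str]:
    """Return one finding per failed verification point; empty list means pass."""
    findings = []
    missing_ai = sorted(AI_SPECIFIC_METRICS - set(plan.thresholds))
    if missing_ai:
        findings.append("(1) AI-specific metrics not monitored: " + ", ".join(missing_ai))
    no_threshold = sorted(m for m, t in plan.thresholds.items() if t is None)
    if no_threshold:
        findings.append("(2) no alert threshold for: " + ", ".join(no_threshold))
    if not plan.owner:
        findings.append("(3) no named owner")
    if not plan.triage_doc:
        findings.append("(4) no documented triage process")
    if plan.monitor_created > plan.deployed:
        findings.append("(5) monitoring created after go-live")
    return findings
```

An auditor would still need to confirm manually that the triage document is accessible and that the named owner is current; the sketch only checks that the fields are populated.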
AI system monitoring review records (alert history and response logs) demonstrating that alerts are reviewed at the defined frequency and trigger a documented triage response.
Example: Datadog incident log for ai-recommendation-engine (last 90 days): 3 alerts triggered, each with a linked incident record in PagerDuty showing triage start time, investigation notes, and resolution action
Test: Request monitoring review records for a 90-day sample period. Verify: (1) alerts were reviewed within the SLA defined in the monitoring plan, (2) each alert has a corresponding triage record, (3) review cadence matches the defined frequency (monthly for Tier 2+, weekly for Tier 3), (4) no alerts were silently closed without investigation records.
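The record checks in the test above might be scripted roughly as follows. The `Alert` fields, the function names, and the way the SLA and review interval are passed in are assumptions for illustration; real alert exports will have their own schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


@dataclass
class Alert:
    alert_id: str
    fired_at: datetime
    triage_started_at: Optional[datetime]  # None when no triage record exists
    investigation_notes: str
    closed: bool


def audit_alerts(alerts: List[Alert], review_sla: timedelta) -> List[str]:
    """Flag alerts with no triage record, late review, or closure without notes."""
    findings = []
    for a in alerts:
        if a.triage_started_at is None:
            findings.append(f"{a.alert_id}: no triage record")
        elif a.triage_started_at - a.fired_at > review_sla:
            findings.append(f"{a.alert_id}: reviewed outside SLA")
        if a.closed and not a.investigation_notes.strip():
            findings.append(f"{a.alert_id}: closed without investigation notes")
    return findings


def cadence_gaps(
    review_dates: List[datetime], max_interval: timedelta
) -> List[Tuple[datetime, datetime]]:
    """Return consecutive review pairs that are further apart than max_interval."""
    ordered = sorted(review_dates)
    return [(prev, nxt) for prev, nxt in zip(ordered, ordered[1:]) if nxt - prev > max_interval]
```

For a Tier 3 system, `max_interval` would be `timedelta(days=7)`; for a monthly cadence, a value such as `timedelta(days=31)` is one reasonable interpretation of "monthly".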
Questions (2)
Does each production AI system have a defined monitoring plan specifying metrics, alert thresholds, review cadence, and a named monitoring owner?
AI system behaviour degrades in ways not visible from infrastructure metrics alone. Monitoring must include AI-specific measures — confidence score distributions, null or refusal rates, output category distributions — in addition to standard latency and error rate metrics.
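As an illustration of what these AI-specific measures look like in practice, the sketch below derives them from a batch of inference records. The `Inference` record shape and the `ai_metrics` function are hypothetical; it assumes a classifier-style system where a null or refused output is recorded with no label.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional


@dataclass
class Inference:
    label: Optional[str]         # None represents a null or refused output
    confidence: Optional[float]  # model-reported confidence, if any


def ai_metrics(batch: List[Inference]) -> Dict[str, object]:
    """Compute null/refusal rate, mean confidence, and output category shares."""
    if not batch:
        raise ValueError("cannot compute metrics for an empty batch")
    answered = [r for r in batch if r.label is not None]
    counts = Counter(r.label for r in answered)
    confidences = [r.confidence for r in answered if r.confidence is not None]
    return {
        "null_refusal_rate": (len(batch) - len(answered)) / len(batch),
        "mean_confidence": mean(confidences) if confidences else None,
        # Share of each output category among answered requests.
        "category_distribution": {k: v / len(answered) for k, v in counts.items()},
    }
```

None of these values is visible to an infrastructure monitor: a service can return HTTP 200 with normal latency while its refusal rate climbs or its output categories shift.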
Which AI-specific metrics are included in your production monitoring for AI systems?
Mature AI monitoring includes all six. Programmes that monitor only latency and error rates are using generic APM tooling, which misses the behavioural degradation patterns specific to AI systems.