AIG-031 AI Misuse, Jailbreak and Abuse Detection

Tier 2+AI

Description

AI systems exposed to external users have mechanisms to detect and respond to misuse attempts. Detection covers: jailbreak patterns (attempts to bypass safety instructions), policy violations (requests for prohibited content categories), abnormal usage patterns (volume, sequence, content), and automated abuse (bot-driven prompt flooding). Detected violations trigger a documented response: rate limiting, session termination, or account action. Misuse patterns are reviewed periodically to update detection logic. AI-specific rate limits are documented and applied independently of generic API rate limiting.
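As a minimal sketch of the detect-then-respond flow described above, the following Python maps a detected jailbreak attempt to one of the three documented response actions. The regular expressions, the `MisuseDetector` class, and the escalation ladder (first offence rate-limited, second terminates the session, third and later trigger account action) are illustrative assumptions, not part of the control. A production system would typically rely on a maintained rule set or classifier rather than a handful of patterns.

```python
import re
from collections import Counter
from enum import Enum


class Response(Enum):
    RATE_LIMIT = "rate_limit"
    SESSION_TERMINATE = "session_terminate"
    ACCOUNT_ACTION = "account_action"


# Hypothetical patterns for illustration only; real rule sets are far broader.
JAILBREAK_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.I),
    re.compile(r"pretend (you have|there are) no (rules|restrictions)", re.I),
]

# Assumed escalation ladder: first, second, then third-and-later offences.
ESCALATION = [
    Response.RATE_LIMIT,
    Response.SESSION_TERMINATE,
    Response.ACCOUNT_ACTION,
]


class MisuseDetector:
    def __init__(self):
        self.violations = Counter()  # number of violations seen per user

    def detect(self, prompt: str) -> bool:
        """Return True if any jailbreak pattern matches the prompt."""
        return any(p.search(prompt) for p in JAILBREAK_PATTERNS)

    def handle(self, user_id: str, prompt: str):
        """Return the response action for a prompt, or None if it is clean."""
        if not self.detect(prompt):
            return None
        self.violations[user_id] += 1
        step = min(self.violations[user_id], len(ESCALATION)) - 1
        return ESCALATION[step]
```

The point of the sketch is that detection always resolves to a response action rather than a log entry alone; the escalation order itself is a policy choice each organisation would document.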

Rationale

Public-facing LLM systems face continuous adversarial probing; misuse detection is an AI-specific operational security control with no equivalent in classical application security.

Framework Mappings (3)

EU AI Act 2024

EU-AI-Art.15.3	Accuracy, Robustness and Cybersecurity: Cybersecurity Against AI-Specific Attacks	informative

NIST AI RMF 1.0

MANAGE 4.1	Post-Deployment AI System Monitoring	informative
MEASURE 3.3	User and Community Feedback Processes	informative

Evidence (2)

configuration (automated)

Jailbreak and misuse detection configuration for public-facing LLM systems, including detection rule set, violation response actions (rate limit, session termination, account action), and AI-specific rate limit settings.

Example: AWS Bedrock Guardrails + API gateway rate-limit configuration export: jailbreak_detection: enabled, blocked_topic_categories: [violence, CSAM, credential_theft], rate_limit_per_user: 100_requests/hour (AI-specific, separate from API gateway default 1000/hour), violation_action: session_terminate + alert_security_ops, pattern_review_cadence: monthly

Test: Request the misuse and jailbreak detection configuration. Verify: (1) jailbreak detection is enabled with a named rule set or model, (2) prohibited content categories are enumerated in the configuration, (3) AI-specific rate limits are configured separately from generic API rate limits and are documented, (4) violation response actions are defined (not just detection/logging), (5) abnormal usage pattern detection is configured (volume, sequence anomalies), (6) misuse pattern review cadence is documented.
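The six verification points in this test could be automated along the following lines. This is a sketch under assumptions: the configuration is taken to be a flat dictionary, the keys `jailbreak_detection`, `blocked_topic_categories`, `rate_limit_per_user`, `violation_action`, and `pattern_review_cadence` follow the example export above, and `rule_set`, `api_gateway_rate_limit`, and `anomaly_detection` are hypothetical key names introduced here. A real export will have its own schema.

```python
# Actions that count as a response, as opposed to detection or logging only.
RESPONSE_ACTIONS = {"rate_limit", "session_terminate", "account_action"}


def check_misuse_config(cfg: dict) -> list:
    """Return the numbers of the verification points that fail (empty = pass)."""
    failures = []
    # (1) Jailbreak detection enabled with a named rule set or model.
    if cfg.get("jailbreak_detection") != "enabled" or not cfg.get("rule_set"):
        failures.append(1)
    # (2) Prohibited content categories are enumerated.
    if not cfg.get("blocked_topic_categories"):
        failures.append(2)
    # (3) AI-specific limit exists as its own setting next to the generic one.
    if "rate_limit_per_user" not in cfg or "api_gateway_rate_limit" not in cfg:
        failures.append(3)
    # (4) At least one real response action, not just alerting or logging.
    if not RESPONSE_ACTIONS & set(cfg.get("violation_action", [])):
        failures.append(4)
    # (5) Abnormal usage pattern detection is configured.
    if not cfg.get("anomaly_detection"):
        failures.append(5)
    # (6) Misuse pattern review cadence is documented.
    if not cfg.get("pattern_review_cadence"):
        failures.append(6)
    return failures


example_cfg = {
    "jailbreak_detection": "enabled",
    "rule_set": "example-rules-v1",          # hypothetical name
    "blocked_topic_categories": ["violence", "CSAM", "credential_theft"],
    "rate_limit_per_user": 100,              # AI-specific, requests per hour
    "api_gateway_rate_limit": 1000,          # generic default, requests per hour
    "violation_action": ["session_terminate", "alert_security_ops"],
    "anomaly_detection": {"volume": True, "sequence": True},
    "pattern_review_cadence": "monthly",
}
```

Calling `check_misuse_config(example_cfg)` returns an empty list; a configuration whose only violation action is `alert_security_ops` would fail point 4, since alerting alone is detection without response.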

log (automated)

Jailbreak and policy violation detection logs covering a 90-day sample period, demonstrating that violations are detected, response actions are triggered, and patterns are reviewed.

Example: Security operations log extract (Splunk, last 90 days) for ai-customer-chatbot: 47 jailbreak attempts detected, 12 policy violations, all 59 events triggered session_terminate action within 200ms, weekly review tickets AI-SEC-2026-W01 through AI-SEC-2026-W13 showing pattern review completed

Test: Request jailbreak and violation detection logs for a 90-day period. Verify: (1) detection events are present and timestamped, (2) each detection event shows a response action was triggered (not detection-only), (3) periodic review events are recorded confirming that misuse patterns were reviewed to update detection rules, (4) bot-driven abuse events (high-volume automated patterns) are distinguishable in the log and responded to.
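A reviewer could script the four log checks roughly as follows. The event schema is an assumption made for illustration: each event is a dictionary with an ISO 8601 `timestamp`, a `type` (`jailbreak`, `policy_violation`, `bot_abuse`, or `pattern_review`), and, for detections, an `action`. Real SIEM exports will use different field names and would need mapping first.

```python
from datetime import datetime

# Hypothetical event type labels; bot abuse has its own type so that it is
# distinguishable from other detections in the log.
DETECTION_TYPES = {"jailbreak", "policy_violation", "bot_abuse"}


def has_timestamp(event: dict) -> bool:
    """True if the event carries a parseable ISO 8601 timestamp."""
    try:
        datetime.fromisoformat(event["timestamp"])
        return True
    except (KeyError, TypeError, ValueError):
        return False


def review_log(events: list) -> dict:
    """Evaluate the four verification points against a list of log events."""
    detections = [e for e in events if e.get("type") in DETECTION_TYPES]
    bots = [e for e in detections if e["type"] == "bot_abuse"]
    return {
        # (1) Detection events are present and timestamped.
        "detections_timestamped": bool(detections)
        and all(has_timestamp(e) for e in detections),
        # (2) Every detection shows a triggered response action.
        "all_detections_actioned": bool(detections)
        and all(e.get("action") for e in detections),
        # (3) At least one periodic pattern review is recorded.
        "reviews_recorded": any(e.get("type") == "pattern_review" for e in events),
        # (4) Bot-driven abuse is distinguishable and responded to.
        "bot_abuse_actioned": bool(bots) and all(e.get("action") for e in bots),
    }
```

A `False` on the last check can mean either that no bot-driven events occurred in the period or that they are not labelled separately, so it calls for a follow-up question rather than an automatic finding.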

Questions (2)

boolean

Do your AI systems exposed to external users have mechanisms to detect and respond to misuse attempts, including jailbreak patterns and automated abuse?

Net-new control: public-facing LLM systems face continuous adversarial probing. Misuse detection is an AI-specific operational security control with no equivalent in classical application security and is not addressed operationally by any existing framework.

multi

Which misuse and abuse detection capabilities are active for your public-facing AI systems?

Jailbreak pattern detection (attempts to bypass safety instructions)
Policy violation detection (requests for prohibited content categories)
Abnormal usage pattern detection (volume or sequence anomalies)
Automated or bot-driven abuse detection
AI-specific rate limits configured independently of generic API rate limits
Documented violation response actions (rate limiting, session termination, account action)
Periodic misuse pattern review to update detection logic

All seven capabilities characterise a mature misuse detection programme. AI-specific rate limits separate from generic API rate limits are commonly absent. When limits are shared, targeted LLM abuse can consume a disproportionate share of capacity before generic controls trigger.
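To illustrate why independent limits matter, the sketch below runs two sliding-window limiters side by side, using the 100 requests/hour AI-specific limit and the 1000 requests/hour gateway default from the configuration example earlier in this control. The `SlidingWindowLimiter` class and `admit_llm_request` function are hypothetical names, and the in-memory, single-process design is an assumption; a deployed limiter would normally use shared storage.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per user within `window_seconds`."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.hits = defaultdict(deque)  # user_id -> timestamps of admitted requests

    def allow(self, user_id: str, now: float) -> bool:
        q = self.hits[user_id]
        # Drop timestamps that have aged out of the window.
        while q and now - q[0] >= self.window:
            q.popleft()
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True


# Two independent limiters: the LLM endpoint has its own, stricter budget.
ai_limiter = SlidingWindowLimiter(limit=100, window_seconds=3600)
gateway_limiter = SlidingWindowLimiter(limit=1000, window_seconds=3600)


def admit_llm_request(user_id: str, now: float) -> bool:
    """An LLM request must pass the generic gateway limit and the AI limit."""
    return gateway_limiter.allow(user_id, now) and ai_limiter.allow(user_id, now)
```

With only the gateway limiter in place, a single user could send up to 1000 prompts per hour to the model; with both, the 101st prompt in an hour is rejected while the same user's non-LLM API traffic continues to be governed by the generic limit.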
