AIUC-1 Standard → B. Security

B005. Implement real-time input filtering

Implement real-time input filtering using automated moderation tools

Keywords

Prompt Injection · Jailbreak · Adversarial Input Protection

Application: Optional
Frequency: Every 12 months
Type: Detective

Crosswalks

OWASP Top 10
LLM01:25 - Prompt Injection
LLM04:25 - Data and Model Poisoning
LLM10:25 - Unbounded Consumption
MITRE ATLAS
AML-M0015: Adversarial Input Detection
AML-M0021: Generative AI Guidelines
NIST AI RMF
MEASURE 2.7: Security and resilience
CSA AICM
LOG-14: Input Monitoring
AIS-08: Input Validation
AIS-15: Prompt Differentiation
OWASP AIVSS
Agent Goal and Instruction Manipulation
IBM AI Risk Atlas
IBM 50: Inference - Direct instructions attack
IBM 53: Inference - Social hacking attack
Cisco AI Security Framework
AITech-1.2: Indirect Prompt Injection
AITech-7.4: Token Manipulation
AITech-11.2: Model-Selective Evasion

Control activities and typical evidence

Control activity
Integrating automated moderation tools to filter inputs before they reach the foundation model. For example, integrating third-party moderation APIs, implementing custom filtering rules, configuring blocking or warning actions for flagged content, and establishing confidence thresholds based on risk category and severity.

Typical evidence: B005.1 (Config: Input filtering)
Moderation tool integration showing API configuration, filtering rules, action settings (block/warn/modify), and confidence thresholds for different violation categories. This could be screenshots of configuration files, admin dashboard settings, or API integration code. Example moderation tools: OpenAI Moderation API, Claude content filtering, VirtueAI/Hive/Spectrum Labs.

Category: Technical Implementation · Eng: User LLM input filtering logic · Engineering Tooling
Applies to: Text-generation, Voice-generation, Image-generation
Control activity
Documenting the moderation logic and rationale. For example, explaining chosen moderation tools, threshold justifications, and decision criteria for different risk categories.

Typical evidence: B005.2 (Documentation: Input moderation approach)
Document explaining the moderation approach, including tool selection rationale, threshold settings with justifications, action logic for different violation types, and examples of how different input categories are handled.

Category: Technical Implementation · Internal processes · Engineering Practice
Applies to: Text-generation, Voice-generation, Image-generation

Control activity
Providing feedback to users when inputs are blocked.

Typical evidence: B005.3 (Demonstration: Warning for blocked inputs)
User-facing messages or UI flows showing how blocked inputs are communicated to users. This could be error messages, warning dialogs, or alternative suggestions provided when content is filtered.

Category: Technical Implementation · Product
Applies to: Text-generation, Voice-generation, Image-generation
Control activity
Logging flagged prompts for analysis and refinement of filters, while ensuring compliance with privacy obligations.

Typical evidence: B005.4 (Logs: Input filtering)
Logging system showing how flagged inputs are captured, what metadata is included or excluded for privacy, retention policies, and the audit trail. May include privacy documentation explaining logging disclosures to users.

Category: Technical Implementation · Logs
Applies to: Text-generation, Voice-generation, Image-generation
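One way to log flagged prompts without retaining the raw text is sketched below. The field choices (salted hash for deduplication, character count, category, score) are illustrative assumptions, not a prescribed schema; the control requires documenting what is included or excluded and why.

```python
import hashlib
import json
import time

def log_flagged_input(text: str, category: str, score: float) -> dict:
    """Build a privacy-conscious log record for a flagged input.

    The raw prompt is not stored; only a salted hash (for deduplication)
    and a length are kept. Field choices here are illustrative.
    """
    record = {
        "ts": time.time(),
        "category": category,
        "score": round(score, 3),
        "prompt_sha256": hashlib.sha256(b"example-salt" + text.encode()).hexdigest(),
        "prompt_chars": len(text),
    }
    # In production this would go to an append-only audit log with a
    # documented retention policy; printing stands in for that sink.
    print(json.dumps(record))
    return record
```

Keeping only a hash means the filter team can spot repeated attack strings without ever handling the user's content directly, which simplifies the privacy analysis the control asks for.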
Control activity
Periodically evaluating filter performance (for example, accuracy, latency, false positives/negatives) and adjusting thresholds accordingly.

Typical evidence: B005.5 (Documentation: Input filter performance)
Report or dashboard showing analysis of filter performance metrics (false positives, false negatives, accuracy, latency) and documented threshold adjustments made based on performance data. Should include timestamps and rationale for changes.

Category: Technical Implementation · Engineering Practice
Applies to: Text-generation, Voice-generation, Image-generation
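The performance metrics named above can be computed from a periodically reviewed sample of filter decisions. A minimal sketch, assuming labels come from human review of logged inputs as (flagged, should_flag) pairs:

```python
def filter_metrics(outcomes: list[tuple[bool, bool]]) -> dict[str, float]:
    """Compute basic filter quality metrics from (flagged, should_flag) pairs.

    Labels would come from periodic human review of logged inputs; the
    metric set mirrors the evidence item above (accuracy, FP/FN rates).
    """
    tp = sum(1 for flagged, should in outcomes if flagged and should)
    fp = sum(1 for flagged, should in outcomes if flagged and not should)
    fn = sum(1 for flagged, should in outcomes if not flagged and should)
    tn = sum(1 for flagged, should in outcomes if not flagged and not should)
    n = len(outcomes)
    return {
        "accuracy": (tp + tn) / n,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        "false_negative_rate": fn / (fn + tp) if fn + tp else 0.0,
    }

# Example review batch: 2 correct flags, 1 false positive, 1 miss, 6 correct passes.
sample = [(True, True)] * 2 + [(True, False)] + [(False, True)] + [(False, False)] * 6
print(filter_metrics(sample))  # accuracy 0.8
```

A rising false-negative rate in such a report would be the documented rationale for lowering a block threshold; a rising false-positive rate, for raising it.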

Organizations can submit alternative evidence demonstrating how they meet the requirement.
