Toxicity Detection
Detect and filter toxic, harmful, or offensive content in user inputs and agent outputs.Configuration
Threshold Guidelines
| Threshold | Strictness | Use Case |
|---|---|---|
| 0.1 - 0.2 | Very strict | Children’s content, healthcare |
| 0.3 - 0.4 | Strict | Customer service, public apps |
| 0.5 - 0.6 | Moderate | Internal tools, adult apps |
| 0.7 - 0.9 | Relaxed | Research, content analysis |
| 1.0 | Disabled | No filtering |
Example
Prompt Injection Detection
Protect agents from malicious prompt manipulation attacks that attempt to override instructions or extract sensitive information.Configuration
What It Detects
- Instruction override attempts (“Ignore previous instructions…”)
- Role manipulation (“You are now a different AI…”)
- System prompt extraction (“Print your system prompt…”)
- Jailbreak attempts
- Encoded/obfuscated malicious prompts
Advanced Configuration
PII Detection
Detect and handle Personally Identifiable Information to protect user privacy and ensure compliance.Configuration
PII Types Reference
| PIIType | Description | Pattern Example |
|---|---|---|
CREDIT_CARD | Credit/debit card numbers | 4111-1111-1111-1111 |
EMAIL | Email addresses | user@example.com |
PHONE | Phone numbers | +1-555-123-4567 |
SSN | US Social Security Numbers | 123-45-6789 |
PERSON | Person names | John Smith |
LOCATION | Physical addresses/locations | 123 Main St, NYC |
IP_ADDRESS | IP addresses | 192.168.1.1 |
URL | Web URLs | https://example.com |
DATE_TIME | Dates and times | 2024-03-15, 3:30 PM |
Actions Reference
| PIIAction | Behavior | Example |
|---|---|---|
BLOCK | Reject entire message | ”Cannot process: contains credit card” |
REDACT | Replace with placeholder | ”Email: [EMAIL_REDACTED]“ |
DISABLED | Allow through unchanged | ”Email: user@example.com” |
GDPR-Compliant Configuration
Secrets Detection
Prevent API keys, passwords, tokens, and other secrets from being exposed in conversations.Configuration
Actions Reference
| SecretsAction | Behavior | Example |
|---|---|---|
MASK | Replace with asterisks | ”API key: sk-****…” |
BLOCK | Reject entire message | ”Cannot process: contains API key” |
DISABLED | Allow through unchanged | ”API key: sk-abc123…” |
What It Detects
- API keys (OpenAI, AWS, Google, etc.)
- Access tokens and bearer tokens
- Passwords and passphrases
- Private keys (SSH, PGP, etc.)
- Database connection strings
- JWT tokens
- OAuth secrets
Example
NSFW Detection
Detect and filter Not Safe For Work content including adult content, violence, and inappropriate material.Configuration
Threshold Guidelines
| Threshold | Strictness | Use Case |
|---|---|---|
| 0.5 - 0.6 | Very strict | Children’s apps |
| 0.7 - 0.8 | Standard | General public apps |
| 0.9 | Relaxed | Adult-verified platforms |
Advanced Configuration
Topic Control
Restrict agents to specific topics using allowlists and blocklists.Banned Topics (Blocklist)
Allowed Topics (Allowlist)
Combined Configuration
Keyword Filtering
Filter messages containing specific keywords or phrases.Configuration
Use Cases
Fairness and Bias Detection
Detect and prevent biased or unfair responses.Configuration
Combining Features
Create comprehensive policies by combining multiple features:Monitoring and Testing
Test Your Policy
Best Practices
- Start Strict: Begin with stricter settings and relax based on needs
- Layer Defenses: Combine multiple features for comprehensive protection
- Test Thoroughly: Test with edge cases before production
- Monitor: Review blocked content to tune thresholds
- Document: Keep records of policy changes and rationale
- Compliance: Align policies with regulatory requirements (GDPR, HIPAA, etc.)