Guardrails
Safety constraints implemented to prevent agents from taking harmful actions or generating inappropriate content. Guardrails can be rule-based, model-based, or hybrid.
Overview
Guardrails are the safety mechanisms that prevent AI systems from causing harm. They're the boundaries that keep agents operating within acceptable limits, whether that's preventing harmful content, blocking dangerous actions, or enforcing business rules.

There's no single solution for guardrails: you need multiple layers working together. Input guardrails filter dangerous requests before they reach the model. Output guardrails check responses before they reach users. Action guardrails prevent the agent from executing harmful operations.

The challenge is balancing safety with utility. Guardrails that are too strict make the system useless; guardrails that are too loose create real risks. The goal is precise control: blocking actual harm while allowing legitimate use.
Key Concepts
Input Filtering
Checking user inputs before processing. Block prompt injection, detect jailbreak attempts, sanitize data.
Output Filtering
Reviewing model outputs before returning them. Check for PII, harmful content, policy violations.
Action Guardrails
Limiting what tools an agent can use and how. Require confirmation for destructive actions.
Layered Defense
Multiple independent guardrails so a single bypass doesn't compromise safety.
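The layered approach can be sketched as a pipeline that runs independent checks in sequence and blocks on the first failure. This is a minimal illustration, not a complete defense; the specific checks and the length threshold are placeholder choices:

```python
from typing import Callable

# Each layer is an independent check: (input) -> (passed, reason).
Check = Callable[[str], tuple[bool, str]]

def length_check(text: str) -> tuple[bool, str]:
    # Arbitrary cap to bound prompt-stuffing attacks
    return (len(text) <= 10_000, "input too long")

def injection_check(text: str) -> tuple[bool, str]:
    # Deliberately simplistic; real systems use many patterns or a classifier
    return ("ignore previous instructions" not in text.lower(),
            "potential prompt injection")

def run_layers(text: str, layers: list[Check]) -> tuple[bool, str]:
    for layer in layers:
        ok, reason = layer(text)
        if not ok:
            return False, reason  # any single layer can block
    return True, ""
```

Because each layer is independent, bypassing one (say, rephrasing around the regex) still leaves the others in place.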
Code Examples
import re

class InputGuardrail:
    def __init__(self):
        self.blocked_patterns = [
            r"ignore (previous|all) instructions",
            r"you are now",
            r"pretend (to be|you're)",
            r"system prompt",
        ]

    def check(self, user_input: str) -> tuple[bool, str]:
        lowered = user_input.lower()
        # Check for prompt injection patterns
        for pattern in self.blocked_patterns:
            if re.search(pattern, lowered):
                return False, "Input blocked: potential manipulation detected"
        # Check for excessive length
        if len(user_input) > 10000:
            return False, "Input too long"
        return True, ""

# Usage (inside a request handler)
guardrail = InputGuardrail()
is_safe, reason = guardrail.check(user_message)
if not is_safe:
    return {"error": reason}

Regex-based input filtering catches common manipulation attempts before they reach the model.
import json

async def check_output_safety(response: str) -> tuple[bool, str]:
    """Use a smaller, faster model to check outputs."""
    check_prompt = f'''Analyze this AI response for safety issues.

Response to check:
"""
{response}
"""

Check for:
1. Personal information (names, emails, SSNs, etc.)
2. Harmful or dangerous content
3. Policy violations

Respond with JSON: {{"safe": true/false, "reason": "..."}}'''
    result = await fast_model.generate(check_prompt)
    parsed = json.loads(result)
    return parsed["safe"], parsed.get("reason", "")

# Usage (inside an async handler)
is_safe, reason = await check_output_safety(llm_response)
if not is_safe:
    return "I can't provide that information."

Using an LLM to check another LLM's output. More flexible than rules but adds latency.
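One failure mode worth handling: the checker model may return malformed JSON. A common design choice is to fail closed, treating an unparseable verdict as unsafe rather than letting the output through. A minimal sketch (`parse_safety_verdict` is an illustrative helper, not part of any library):

```python
import json

def parse_safety_verdict(raw: str) -> tuple[bool, str]:
    # Fail closed: if the checker's reply can't be parsed,
    # treat the output as unsafe instead of passing it through.
    try:
        parsed = json.loads(raw)
        return bool(parsed["safe"]), parsed.get("reason", "")
    except (json.JSONDecodeError, KeyError, TypeError):
        return False, "safety check unparseable"
```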
class ToolGuardrails:
    REQUIRE_CONFIRMATION = ["delete_file", "send_email", "make_payment"]
    BLOCKED = ["execute_shell", "access_network"]

    def check_tool_call(self, tool_name: str, params: dict) -> dict:
        if tool_name in self.BLOCKED:
            return {
                "allowed": False,
                "reason": f"Tool {tool_name} is not permitted"
            }
        if tool_name in self.REQUIRE_CONFIRMATION:
            return {
                "allowed": False,
                "requires_confirmation": True,
                "message": f"Confirm: {tool_name} with {params}?"
            }
        return {"allowed": True}

Different tools get different treatment. Some are blocked, some need user confirmation.
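When a call needs confirmation, the surrounding application has to hold it until the user responds. A rough sketch of that wiring, assuming the approval signal comes from your own UI (`handle_tool_call` is illustrative, not a library function):

```python
def handle_tool_call(decision: dict, approved: bool) -> str:
    """Map a guardrail decision plus a user approval to an action."""
    if not decision["allowed"] and decision.get("requires_confirmation"):
        # 'approved' would come from an actual UI prompt in practice
        return "execute" if approved else "cancelled"
    if not decision["allowed"]:
        return "blocked"  # hard-blocked tools never reach the user
    return "execute"
```

Keeping the guardrail decision separate from the approval flow means the same `ToolGuardrails` policy can back a CLI prompt, a chat button, or an approval queue.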
Real-World Use Cases
1. Customer service bots that should never share internal data
2. Code generation that shouldn't produce dangerous operations
3. Content moderation for user-facing applications
4. Enterprise applications with compliance requirements
Practical Tips
- Layer multiple independent guardrails
- Log all blocked attempts for red team analysis
- Use fast, cheap models for output checking to minimize latency
- Test guardrails with adversarial inputs before deployment
- Have a clear escalation path when guardrails trigger
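The logging tip above can be sketched with the standard logging module; the truncation length and field names here are arbitrary choices, and a production system would likely ship these records to a SIEM or analytics store:

```python
import logging

logger = logging.getLogger("guardrails")

def log_block(layer: str, user_input: str, reason: str) -> dict:
    # Record every trigger so red teams can mine real bypass attempts;
    # truncate the input to avoid storing full sensitive payloads.
    snippet = user_input[:200]
    logger.warning("guardrail=%s reason=%s input=%r", layer, reason, snippet)
    return {"layer": layer, "reason": reason, "input": snippet}
```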
Common Mistakes to Avoid
- Relying on a single layer of protection
- Guardrails that are too easy to bypass with simple rephrasing
- Not logging guardrail triggers for analysis
- Blocking too aggressively, making the system useless