Guardrails
Safety constraints implemented to prevent agents from taking harmful actions or generating inappropriate content. Guardrails can be rule-based, model-based, or hybrid.
Overview
Guardrails are the safety mechanisms that prevent AI systems from causing harm. They're the boundaries that keep agents operating within acceptable limits, whether that's preventing harmful content, blocking dangerous actions, or enforcing business rules.

There's no single solution for guardrails: you need multiple layers working together. Input guardrails filter dangerous requests before they reach the model. Output guardrails check responses before they reach users. Action guardrails prevent the agent from executing harmful operations.

The challenge is balancing safety with utility. Guardrails that are too strict make the system useless; guardrails that are too loose create real risks. The goal is precise control: blocking actual harm while allowing legitimate use.
Key Concepts
Input Filtering
Checking user inputs before processing. Block prompt injection, detect jailbreak attempts, sanitize data.
Output Filtering
Reviewing model outputs before returning them. Check for PII, harmful content, policy violations.
Action Guardrails
Limiting what tools an agent can use and how. Require confirmation for destructive actions.
Layered Defense
Multiple independent guardrails so a single bypass doesn't compromise safety.
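The layered approach can be sketched as a pipeline that runs independent checks in sequence and blocks on the first failure. This is a minimal illustration, not a complete defense; the specific checks and the length threshold are placeholder choices:

```python
from typing import Callable

# Each layer is an independent check: (input) -> (passed, reason).
Check = Callable[[str], tuple[bool, str]]

def length_check(text: str) -> tuple[bool, str]:
    # Arbitrary cap to bound prompt-stuffing attacks
    return (len(text) <= 10_000, "input too long")

def injection_check(text: str) -> tuple[bool, str]:
    # Deliberately simplistic; real systems use many patterns or a classifier
    return ("ignore previous instructions" not in text.lower(),
            "potential prompt injection")

def run_layers(text: str, layers: list[Check]) -> tuple[bool, str]:
    for layer in layers:
        ok, reason = layer(text)
        if not ok:
            return False, reason  # any single layer can block
    return True, ""
```

Because each layer is independent, bypassing one (say, rephrasing around the regex) still leaves the others in place.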
Code Examples
import re

class InputGuardrail:
    def __init__(self):
        self.blocked_patterns = [
            r"ignore (previous|all) instructions",
            r"you are now",
            r"pretend (to be|you're)",
            r"system prompt",
        ]

    def check(self, user_input: str) -> tuple[bool, str]:
        lowered = user_input.lower()
        # Check for prompt injection patterns
        for pattern in self.blocked_patterns:
            if re.search(pattern, lowered):
                return False, "Input blocked: potential manipulation detected"
        # Check for excessive length
        if len(user_input) > 10000:
            return False, "Input too long"
        return True, ""

# Usage (inside a request handler)
guardrail = InputGuardrail()
is_safe, reason = guardrail.check(user_message)
if not is_safe:
    return {"error": reason}

Regex-based input filtering catches common manipulation attempts before they reach the model.
import json

async def check_output_safety(response: str) -> tuple[bool, str]:
    """Use a smaller, faster model to check outputs."""
    check_prompt = f'''Analyze this AI response for safety issues.

Response to check:
"""
{response}
"""

Check for:
1. Personal information (names, emails, SSNs, etc.)
2. Harmful or dangerous content
3. Policy violations

Respond with JSON: {{"safe": true/false, "reason": "..."}}'''
    result = await fast_model.generate(check_prompt)
    parsed = json.loads(result)
    return parsed["safe"], parsed.get("reason", "")

# Usage (inside an async handler)
is_safe, reason = await check_output_safety(llm_response)
if not is_safe:
    return "I can't provide that information."

Using an LLM to check another LLM's output. More flexible than rules but adds latency.
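One failure mode worth handling: the checker model may return malformed JSON. A common design choice is to fail closed, treating an unparseable verdict as unsafe rather than letting the output through. A minimal sketch (`parse_safety_verdict` is an illustrative helper, not part of any library):

```python
import json

def parse_safety_verdict(raw: str) -> tuple[bool, str]:
    # Fail closed: if the checker's reply can't be parsed,
    # treat the output as unsafe instead of passing it through.
    try:
        parsed = json.loads(raw)
        return bool(parsed["safe"]), parsed.get("reason", "")
    except (json.JSONDecodeError, KeyError, TypeError):
        return False, "safety check unparseable"
```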
class ToolGuardrails:
    REQUIRE_CONFIRMATION = ["delete_file", "send_email", "make_payment"]
    BLOCKED = ["execute_shell", "access_network"]

    def check_tool_call(self, tool_name: str, params: dict) -> dict:
        if tool_name in self.BLOCKED:
            return {
                "allowed": False,
                "reason": f"Tool {tool_name} is not permitted"
            }
        if tool_name in self.REQUIRE_CONFIRMATION:
            return {
                "allowed": False,
                "requires_confirmation": True,
                "message": f"Confirm: {tool_name} with {params}?"
            }
        return {"allowed": True}

Different tools get different treatment. Some are blocked, some need user confirmation.
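When a call needs confirmation, the surrounding application has to hold it until the user responds. A rough sketch of that wiring, assuming the approval signal comes from your own UI (`handle_tool_call` is illustrative, not a library function):

```python
def handle_tool_call(decision: dict, approved: bool) -> str:
    """Map a guardrail decision plus a user approval to an action."""
    if not decision["allowed"] and decision.get("requires_confirmation"):
        # 'approved' would come from an actual UI prompt in practice
        return "execute" if approved else "cancelled"
    if not decision["allowed"]:
        return "blocked"  # hard-blocked tools never reach the user
    return "execute"
```

Keeping the guardrail decision separate from the approval flow means the same `ToolGuardrails` policy can back a CLI prompt, a chat button, or an approval queue.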
Real-World Use Cases
1. Customer service bots that should never share internal data
2. Code generation that shouldn't produce dangerous operations
3. Content moderation for user-facing applications
4. Enterprise applications with compliance requirements
Practical Tips
- Layer multiple independent guardrails
- Log all blocked attempts for red team analysis
- Use fast, cheap models for output checking to minimize latency
- Test guardrails with adversarial inputs before deployment
- Have a clear escalation path when guardrails trigger
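The logging tip above can be sketched with the standard logging module; the truncation length and field names here are arbitrary choices, and a production system would likely ship these records to a SIEM or analytics store:

```python
import logging

logger = logging.getLogger("guardrails")

def log_block(layer: str, user_input: str, reason: str) -> dict:
    # Record every trigger so red teams can mine real bypass attempts;
    # truncate the input to avoid storing full sensitive payloads.
    snippet = user_input[:200]
    logger.warning("guardrail=%s reason=%s input=%r", layer, reason, snippet)
    return {"layer": layer, "reason": reason, "input": snippet}
```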
Common Mistakes to Avoid
- Relying on a single layer of protection
- Guardrails that are too easy to bypass with simple rephrasing
- Not logging guardrail triggers for analysis
- Blocking too aggressively, making the system useless