
Introduction to LLM Guardrails

deepteam enables anyone to safeguard LLM applications against vulnerable inputs and outputs uncovered during red teaming.

deepteam's comprehensive suite of guardrails acts as a set of binary metrics that evaluate end-to-end LLM system inputs and outputs for malicious intent, unsafe behavior, and security vulnerabilities.

Types of Guardrails

There are two types of guardrails:

  • Input guardrails - Protect against malicious inputs before they reach your LLM
  • Output guardrails - Evaluate an LLM's outputs before they reach your users

An input or output guardrail is composed of one or more guards, and an input or output should only be let through if every guard passes its check. Together, they ensure that nothing unsafe enters or leaves your LLM system.

The number of guards you choose to set up and whether you decide to utilize both types of guards depends on your priorities regarding latency, cost, and LLM safety.

Guard Your First LLM Interaction

First, import and initialize the guards you need from deepteam.guardrails.guards, then pass them to a Guardrails object:

from deepteam.guardrails import Guardrails
from deepteam.guardrails.guards import PromptInjectionGuard, ToxicityGuard

# Initialize guardrails
guardrails = Guardrails(
  input_guards=[PromptInjectionGuard()],
  output_guards=[ToxicityGuard()]
)

There are TWO mandatory and TWO optional parameters when creating a Guardrails object:

  • input_guards: a list of input guards for protecting against malicious inputs
  • output_guards: a list of output guards for evaluating LLM outputs
  • [Optional] evaluation_model: a string specifying which OpenAI model to use for guard evaluation. Defaults to gpt-4.1.
  • [Optional] sample_rate: a float between 0.0 and 1.0 that determines what fraction of requests are guarded. Defaults to 1.0.

# Advanced configuration
guardrails = Guardrails(
  input_guards=[PromptInjectionGuard()],
  output_guards=[ToxicityGuard()],
  evaluation_model="gpt-4o",
  sample_rate=0.5  # Guard 50% of requests
)
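
To make the effect of sample_rate concrete, here is a minimal sketch of per-request sampling. It is an assumption about how such a setting typically behaves, not deepteam's actual implementation, and should_guard is a hypothetical helper that does not exist in the library.

```python
import random


def should_guard(sample_rate: float, rng: random.Random) -> bool:
    """Hypothetical helper: decide whether one request gets guarded."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0.0 and 1.0")
    # rng.random() lies in [0.0, 1.0), so 1.0 always guards and 0.0 never does.
    return rng.random() < sample_rate


rng = random.Random(0)  # seeded only to make the sketch repeatable
guarded = sum(should_guard(0.5, rng) for _ in range(10_000))
print(f"guarded {guarded} of 10000 requests")
```

Under this assumption, a sample_rate of 0.5 roughly halves guard latency and cost on average, at the price of leaving about half of the requests unchecked.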

Guard an Input

Then, simply call the guard_input() method and supply the input.

...

res = guardrails.guard_input(input="Is the earth flat")
# Reject if true
print(res.breached)

The guard_input() method invokes every input guard in your Guardrails to check whether a particular input has breached any of the defined criteria.
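
As a rough sketch of where this check might sit in an application, the helper below runs the input guards before the model is ever called. It relies only on guard_input(input=...) and the breached field shown above; answer_if_safe, call_llm, and REFUSAL are hypothetical names introduced for illustration.

```python
from typing import Callable

REFUSAL = "Sorry, I can't help with that request."  # hypothetical fallback text


def answer_if_safe(guardrails, user_input: str, call_llm: Callable[[str], str]) -> str:
    """Run input guards first; only call the LLM when nothing is breached."""
    res = guardrails.guard_input(input=user_input)
    if res.breached:
        # Skip the model call entirely and return a fixed refusal.
        return REFUSAL
    return call_llm(user_input)
```

Rejecting before the model call also saves the cost of a generation that would have been discarded anyway.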

Guard an Output

Similarly, call the guard_output() method wherever appropriate in your code to guard an LLM output:

...

res = guardrails.guard_output(input="Is the earth flat", output="I bet it is")
print(res.breached)

It is extremely common to regenerate an LLM output a set number of times until one passes the output guards.
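
One way to write such a retry loop is sketched below. It uses only guard_output(input=..., output=...) and breached from the example above; generate_until_safe, call_llm, and max_attempts are hypothetical names, and the default of three attempts is an arbitrary choice rather than a deepteam recommendation.

```python
from typing import Callable, Optional


def generate_until_safe(
    guardrails,
    user_input: str,
    call_llm: Callable[[str], str],
    max_attempts: int = 3,
) -> Optional[str]:
    """Regenerate until an output passes the output guards, or give up."""
    for _ in range(max_attempts):
        output = call_llm(user_input)
        res = guardrails.guard_output(input=user_input, output=output)
        if not res.breached:
            return output
    # Every attempt was breached; the caller decides on a fallback.
    return None
```

Capping the number of attempts keeps latency and cost bounded when the model keeps producing breached outputs.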

Run Async Guardrails

deepteam's Guardrails also supports asynchronous execution through the a_guard_input() and a_guard_output() methods.

...

res = await guardrails.a_guard_output(input, output)
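
A minimal sketch of an end-to-end async flow follows. The positional call to a_guard_output mirrors the line above; calling a_guard_input with the input as its single positional argument is an assumption, and guarded_reply and call_llm are hypothetical names.

```python
import asyncio  # needed for the usage line shown in the comment at the bottom
from typing import Awaitable, Callable, Optional


async def guarded_reply(
    guardrails,
    user_input: str,
    call_llm: Callable[[str], Awaitable[str]],
) -> Optional[str]:
    """Guard the input, call the model, then guard the output, all awaited."""
    in_res = await guardrails.a_guard_input(user_input)  # assumed signature
    if in_res.breached:
        return None
    output = await call_llm(user_input)
    out_res = await guardrails.a_guard_output(user_input, output)
    return None if out_res.breached else output


# Usage, given a real Guardrails object and an async LLM client of your own:
# reply = asyncio.run(guarded_reply(guardrails, "Is the earth flat", my_async_llm))
```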

Available Guards

deepteam offers a robust selection of input and output guards for protection against vulnerabilities you've detected during red teaming:

Input

Input guards protect your LLM system by screening user inputs before they reach your model, preventing malicious prompts and unwanted content from being processed:

  • Prompt Injection Guard - Detects and blocks prompt injection and jailbreaking attempts that try to manipulate the AI's behavior or bypass safety measures
  • Topical Guard - Restricts conversations to specific allowed topics, preventing off-topic or unauthorized discussions
  • Cybersecurity Guard - (When configured for input) Protects against cybersecurity threats and malicious technical instructions

Output

Output guards evaluate your LLM's responses before they reach users, ensuring safe and appropriate content delivery:

  • Toxicity Guard - Prevents toxic, harmful, abusive, or discriminatory content from being shared with users
  • Privacy Guard - Detects and blocks personally identifiable information (PII) and sensitive data from being exposed
  • Illegal Guard - Prevents the sharing of illegal activity instructions or content that violates laws
  • Hallucination Guard - Detects and prevents fabricated, inaccurate, or hallucinated information in responses
  • Cybersecurity Guard - (When configured for output) Ensures responses don't contain dangerous cybersecurity information or instructions

3-Tier Safety System

All of deepteam's guards use a 3-tier safety assessment system:

  • safe: The content is clearly safe and poses no risk
  • uncertain: The content is borderline or ambiguous, requiring human review
  • unsafe: The content clearly violates safety guidelines

# main.py
...

for verdict in res.verdicts:
  # Access detailed safety assessment
  print(verdict.safety_level)  # "safe", "uncertain", or "unsafe"
  print(verdict.reason)        # Explanation for the assessment

deepteam does this so that you can control the strictness of each guard and decide what constitutes a "breached" or "mitigated" event.

A Guardrails is considered breached if any guard returns unsafe or uncertain.
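
The breach rule above can be sketched in a few lines. The Verdict dataclass is a stand-in with only the fields needed here, not deepteam's own class, and needs_human_review is a hypothetical example of a looser policy you could build on top of the verdicts yourself.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Verdict:
    """Stand-in with the two fields used here; not deepteam's class."""
    name: str
    safety_level: str  # "safe", "uncertain", or "unsafe"


def is_breached(verdicts: Iterable[Verdict]) -> bool:
    """Mirror the stated rule: any "unsafe" or "uncertain" verdict is a breach."""
    return any(v.safety_level in ("unsafe", "uncertain") for v in verdicts)


def needs_human_review(verdicts: Iterable[Verdict]) -> bool:
    """Hypothetical looser policy: route "uncertain" to review instead of blocking."""
    levels = [v.safety_level for v in verdicts]
    return "unsafe" not in levels and "uncertain" in levels
```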

Data Models

There are two data models you should be aware of to access guardrail results after each run.

GuardResult

The GuardResult data model is what you'll receive after guard_input() or guard_output(). It tells you whether the I/O has breached any of the criteria laid out in your guards or not, as well as the individual breakdown of all the guards:

class GuardResult:
    breached: bool
    verdicts: list[GuardVerdict]

GuardVerdict

You can access each individual guard's results through the GuardVerdict data model in the verdicts field of a GuardResult.

class GuardVerdict(BaseModel):
    name: str
    safety_level: Literal["safe", "unsafe", "uncertain"]
    latency: Optional[float] = None
    reason: Optional[str] = None
    score: Optional[float] = None
    error: Optional[str] = None
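
To show how these two models fit together, here is a small sketch that summarizes a result for logging. It reads only the fields listed above; summarize is a hypothetical helper, and the demo object is made-up stand-in data built with SimpleNamespace rather than a real GuardResult.

```python
from types import SimpleNamespace


def summarize(result) -> dict:
    """Collect failing guards, their reasons, and total latency from a result."""
    failing = [v for v in result.verdicts if v.safety_level != "safe"]
    return {
        "breached": result.breached,
        "failing_guards": [v.name for v in failing],
        "reasons": {v.name: v.reason for v in failing if v.reason},
        # latency is Optional, so a missing value counts as zero
        "total_latency": sum(v.latency or 0.0 for v in result.verdicts),
    }


# Stand-in data shaped like the models above; all values are invented.
demo = SimpleNamespace(
    breached=True,
    verdicts=[
        SimpleNamespace(name="Toxicity Guard", safety_level="safe",
                        latency=0.5, reason=None),
        SimpleNamespace(name="Privacy Guard", safety_level="unsafe",
                        latency=0.25, reason="Contains an email address"),
    ],
)
print(summarize(demo))
```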
