# BeaverTails
The BeaverTails framework integrates the BeaverTails dataset — a large, human-labeled collection of harmful and borderline prompts — so DeepTeam can run dataset-driven red teaming against real-world adversarial inputs. BeaverTails is designed to surface safety failures across a wide range of harm categories (abuse, illegal instructions, privacy violations, self-harm, sexual content, misinformation, and more) by testing how models respond to actual annotated malicious prompts.
## Overview
BeaverTails focuses on sampling real harmful prompts from a curated dataset rather than synthetic generation. This provides higher-fidelity signals for model behavior in the wild and helps you validate safety defenses, content filters, and ingestion pipelines.
```python
from deepteam import red_team
from deepteam.frameworks import BeaverTails
from somewhere import your_model_callback  # placeholder for your own LLM callback

beaver_tails = BeaverTails(num_attacks=10)

risk_assessment = red_team(
    model_callback=your_model_callback,
    framework=beaver_tails,
)
print(risk_assessment)
```
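The callback wiring is not shown above (`from somewhere import your_model_callback` is a placeholder). As a minimal sketch, assuming the callback receives a single attack prompt and returns the target model's text response, it could look like the following; the OpenAI client usage here is purely illustrative and not part of DeepTeam:

```python
from openai import AsyncOpenAI

client = AsyncOpenAI()

# Illustrative callback: takes one attack prompt, returns the target model's reply.
# The exact signature expected by red_team is assumed from the example above.
async def your_model_callback(input: str) -> str:
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": input}],
    )
    return response.choices[0].message.content
```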
The BeaverTails framework accepts FIVE optional parameters (a configuration sketch follows this list):

- [Optional] `num_attacks`: the number of harmful prompts to randomly sample from the BeaverTails dataset.
- [Optional] `data_split`: a string literal deciding which dataset split is fetched; accepts either `330k_train` or `30k_train`. Defaulted to `30k_train`.
- [Optional] `evaluation_model`: a string specifying which of OpenAI's GPT models to use, OR any custom LLM model of type `DeepEvalBaseLLM`. Defaulted to `gpt-4o`.
- [Optional] `async_mode`: a boolean which, when set to `True`, enables concurrent execution. Defaulted to `True`.
- [Optional] `verbose_mode`: a boolean which, when set to `True`, prints the intermediate steps used to evaluate the target LLM's response to each dataset prompt. Defaulted to `False`.
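As a rough sketch using the parameters documented above (the specific values chosen here are illustrative, not defaults), a fully configured instance might look like:

```python
from deepteam.frameworks import BeaverTails

# Illustrative configuration exercising every optional parameter listed above.
beaver_tails = BeaverTails(
    num_attacks=25,             # sample 25 harmful prompts from the dataset
    data_split="330k_train",    # use the larger 330k training split
    evaluation_model="gpt-4o",  # OpenAI model name (or a DeepEvalBaseLLM instance)
    async_mode=True,            # evaluate responses concurrently
    verbose_mode=True,          # print intermediate evaluation steps
)
```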
## How It Works
- Loads the dataset using Hugging Face's `datasets` library with the specified BeaverTails split
- Filters the test cases so that only unsafe, potentially harmful prompts are kept as attacks
- Randomly samples `num_attacks` prompts
- Evaluates whether the target model appropriately refuses or safely handles each harmful prompt from the dataset (a dataset-level sketch of these steps follows this list)
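DeepTeam's internals are not reproduced here, but the load/filter/sample steps can be approximated directly against the Hugging Face dataset. The field names (`prompt`, `is_safe`) come from the PKU-Alignment/BeaverTails dataset card, and the split name matches the `data_split` parameter above:

```python
import random
from datasets import load_dataset

# Load the BeaverTails split named by `data_split` (here, the 30k training split).
ds = load_dataset("PKU-Alignment/BeaverTails", split="30k_train")

# Keep only unsafe rows so every sampled prompt is a genuinely harmful attack.
unsafe = ds.filter(lambda row: not row["is_safe"])

# Randomly sample `num_attacks` prompts to send to the target model.
num_attacks = 10
attacks = random.sample(unsafe["prompt"], k=num_attacks)
print(attacks[:3])
```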
## Categories & Filtering
BeaverTails includes 14 harm subcategories. DeepTeam supports filtering by subcategory so you can focus testing on the highest-risk areas (a dataset-level filtering sketch follows the list below).
Example subcategories:
- `animal_abuse`
- `child_abuse`
- `discrimination/hate_speech`
- `drug_abuse,weapons`
- `financial_crime`
- `misinformation`
- `privacy_violation`
- `self_harm`
- `sexually_explicit`
- `terrorism`
- `violence`
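DeepTeam's own subcategory-filtering API is not shown here; as an assumption-laden sketch, the same effect can be approximated at the dataset level. Per the dataset card, each BeaverTails row carries a `category` field mapping harm-category names to booleans; the key used below (`privacy_violation`) is taken from the example list above:

```python
from datasets import load_dataset

ds = load_dataset("PKU-Alignment/BeaverTails", split="30k_train")

# Keep unsafe rows flagged with the `privacy_violation` harm category.
# (Assumes `category` is a dict of per-category boolean flags, as documented
# on the dataset card; this is not DeepTeam's internal filtering API.)
privacy_attacks = ds.filter(
    lambda row: not row["is_safe"] and row["category"].get("privacy_violation", False)
)
print(f"{len(privacy_attacks)} unsafe privacy-violation prompts")
```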
## Related Docs
- MITRE ATLAS framework
- NIST AI RMF framework
- Vulnerabilities in DeepTeam
- Adversarial attacks in DeepTeam
## References
- BeaverTails dataset: PKU-Alignment/BeaverTails (Hugging Face)