Red Teaming Google Gemini
Google Gemini models (Gemini Pro, Flash, Ultra) have a unique safety architecture: configurable safety settings that act as tunable filters on top of the model's built-in alignment. Developers can adjust thresholds for harassment, hate speech, sexually explicit content, and dangerous content from BLOCK_NONE to BLOCK_ONLY_HIGH to BLOCK_LOW_AND_ABOVE. This means the same model can behave very differently depending on configuration — and red teaming must test the specific configuration deployed in production.
This guide covers red teaming Google Gemini with DeepTeam.
YAML Quickstart
Gemini requires a custom callback. Create it using the Google AI SDK:
import google.generativeai as genai
genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.0-flash")
async def model_callback(input: str) -> str:
response = model.generate_content(input)
return response.texttarget:
purpose: "A healthcare information assistant for patient-facing applications"
callback:
file: "my_model.py"
function: "model_callback"
default_vulnerabilities:
- name: "Toxicity"
- name: "Misinformation"
- name: "PersonalSafety"
- name: "PIILeakage"
attacks:
- name: "PromptInjection"
weight: 3
- name: "Roleplay"
weight: 2
- name: "Base64"
- name: "Leetspeak"deepteam run red-team-gemini.yamlPython Assessment
from deepteam import red_team
from deepteam.vulnerabilities import Toxicity, Misinformation, PersonalSafety, PIILeakage, Bias
from deepteam.attacks.single_turn import PromptInjection, Roleplay, Leetspeak, Base64
from deepteam.attacks.multi_turn import LinearJailbreaking
import google.generativeai as genai
genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.0-flash")
async def model_callback(input: str) -> str:
response = model.generate_content(input)
return response.text
red_team(
model_callback=model_callback,
target_purpose="A healthcare information assistant for patient-facing applications",
vulnerabilities=[
Toxicity(), Misinformation(), PersonalSafety(),
PIILeakage(), Bias(),
],
attacks=[
PromptInjection(), Roleplay(), Leetspeak(),
Base64(), LinearJailbreaking(),
],
attacks_per_vulnerability_type=5,
)Multi-Turn Testing
from deepteam.test_case import RTTurn
import google.generativeai as genai
genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.0-flash")
async def model_callback(input: str, turns: list[RTTurn] = None) -> str:
chat = model.start_chat(
history=[
{"role": turn.role if turn.role == "user" else "model", "parts": [turn.content]}
for turn in (turns or [])
]
)
response = chat.send_message(input)
return response.textGemini-Specific Considerations
Configurable Safety Settings
Gemini's most distinctive feature for red teaming is its configurable safety filter. The model has four safety categories, each with adjustable thresholds:
from google.generativeai.types import HarmCategory, HarmBlockThreshold
model = genai.GenerativeModel(
"gemini-2.0-flash",
safety_settings={
HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
},
)Test your production configuration, not the defaults. If your deployment uses BLOCK_ONLY_HIGH for any category, attacks targeting that category have a wider gap between the model's built-in alignment and the filter's intervention.
You should also handle blocked responses in your callback:
async def model_callback(input: str) -> str:
response = model.generate_content(input)
if response.prompt_feedback.block_reason:
return "I cannot help with that request."
if not response.candidates or response.candidates[0].finish_reason != 1:
return "I cannot help with that request."
return response.textSafety Training Characteristics
- Filter-dependent safety — Unlike OpenAI and Anthropic where safety is primarily model-level, Gemini relies heavily on its configurable safety filters. Attacks that produce content below the filter threshold but above what most users would consider safe can slip through, especially at
BLOCK_ONLY_HIGHsettings. - Encoding attack susceptibility — Gemini's safety filters evaluate the decoded semantic meaning less consistently than Claude's CAI self-critique.
Base64andLeetspeakattacks can bypass the filter layer while the model processes the decoded intent. - Multimodal attack surface — If your deployment uses Gemini's multimodal capabilities, text-based red teaming alone is insufficient. Harmful instructions embedded in images or documents bypass text-level safety filters entirely.
- System instruction adherence — Gemini's system instructions are respected but can be overridden through persistent multi-turn escalation, particularly in longer conversations.
Recommended Attack Strategy
| Attack | Effectiveness Against Gemini | Why |
|---|---|---|
Base64 | High | Encoded content may bypass the safety filter layer |
Leetspeak | High | Character substitution evades pattern-matching filters |
PromptInjection | High | Instruction overrides can bypass safety settings |
Roleplay | Medium | Filters still catch many fictional framings of harmful content |
LinearJailbreaking | Medium | Effective when filter threshold is set to BLOCK_ONLY_HIGH |
Recommended Vulnerability Priority
| Use Case | Priority Vulnerabilities | Why |
|---|---|---|
| Healthcare / medical | Misinformation, PersonalSafety, PIILeakage | Patient safety, HIPAA compliance |
| Customer-facing | Toxicity, Bias, PIILeakage | Brand safety, user trust |
| Education | Misinformation, Bias, Toxicity | Content accuracy, inclusivity |
| Enterprise search | PromptLeakage, PIILeakage, Misinformation | Data boundaries, accuracy |
What to Do Next
- Test multiple safety configurations — Run the same assessment with
BLOCK_LOW_AND_ABOVE,BLOCK_MEDIUM_AND_ABOVE, andBLOCK_ONLY_HIGHto understand the safety gap between filter levels. - Compare Flash vs. Pro — Smaller models may have weaker built-in alignment. Run the same assessment across model sizes to find where boundaries differ.
- Test against frameworks — Use
OWASPTop10()for standardized coverage. See the safety frameworks guide. - Move to application-level testing — Once the model passes, test it in context with the AI agents, conversational agents, or agentic RAG guides.