
Red Teaming Mistral Models

Mistral models (Mistral Large, Small, Codestral) occupy a middle ground in the safety landscape. The API-served models include Mistral's built-in safety moderation, while the open-weight variants (Mistral 7B, Mixtral) can be self-hosted without any provider safety layer — similar to Llama. Mistral's safety training prioritizes minimal refusal: the models are designed to be less cautious than OpenAI or Anthropic, refusing only clearly harmful content. This philosophy makes Mistral models more permissive by design, which means the boundary between intended permissiveness and genuine vulnerability is narrower and harder to test.

This guide covers red teaming Mistral models with DeepTeam.

note

For general model red teaming methodology and framework-based assessments, see the foundational models guide. For self-hosted Mistral deployments, the open-weight models guide covers additional deployment-specific considerations.

YAML Quickstart

Mistral API:

my_model.py

from mistralai import Mistral

client = Mistral(api_key="YOUR_API_KEY")

async def model_callback(input: str) -> str:
    # complete_async keeps the callback non-blocking inside DeepTeam's event loop
    response = await client.chat.complete_async(
        model="mistral-large-latest",
        messages=[{"role": "user", "content": input}],
    )
    return response.choices[0].message.content

Self-hosted via Ollama:

my_model.py

import httpx

async def model_callback(input: str) -> str:
    async with httpx.AsyncClient() as client:
        # Ollama's /api/generate endpoint; stream=False returns a single JSON object
        response = await client.post(
            "http://localhost:11434/api/generate",
            json={"model": "mistral", "prompt": input, "stream": False},
            timeout=60.0,
        )
        return response.json()["response"]
red-team-mistral.yaml

target:
  purpose: "A code generation assistant for software development teams"
  callback:
    file: "my_model.py"
    function: "model_callback"

default_vulnerabilities:
  - name: "Toxicity"
  - name: "Bias"
  - name: "PromptLeakage"
  - name: "ShellInjection"
  - name: "SQLInjection"

attacks:
  - name: "PromptInjection"
    weight: 3
  - name: "Roleplay"
    weight: 2
  - name: "Leetspeak"
  - name: "ROT13"
deepteam run red-team-mistral.yaml

Python Assessment

from deepteam import red_team
from deepteam.vulnerabilities import (
    Toxicity, Bias, PromptLeakage,
    ShellInjection, SQLInjection, PIILeakage
)
from deepteam.attacks.single_turn import PromptInjection, Roleplay, Leetspeak, ROT13
from deepteam.attacks.multi_turn import LinearJailbreaking
from mistralai import Mistral

client = Mistral(api_key="YOUR_API_KEY")

async def model_callback(input: str) -> str:
    response = await client.chat.complete_async(
        model="mistral-large-latest",
        messages=[{"role": "user", "content": input}],
    )
    return response.choices[0].message.content

red_team(
    model_callback=model_callback,
    target_purpose="A code generation assistant for software development teams",
    vulnerabilities=[
        Toxicity(), Bias(), PromptLeakage(),
        ShellInjection(), SQLInjection(), PIILeakage(),
    ],
    attacks=[
        PromptInjection(), Roleplay(), Leetspeak(),
        ROT13(), LinearJailbreaking(),
    ],
    attacks_per_vulnerability_type=5,
)

Multi-Turn Testing

from deepteam.test_case import RTTurn
from mistralai import Mistral

client = Mistral(api_key="YOUR_API_KEY")

async def model_callback(input: str, turns: list[RTTurn] | None = None) -> str:
    # Replay prior turns so multi-turn attacks carry conversation state
    messages = []
    for turn in (turns or []):
        messages.append({"role": turn.role, "content": turn.content})
    messages.append({"role": "user", "content": input})

    response = await client.chat.complete_async(
        model="mistral-large-latest",
        messages=messages,
    )
    return response.choices[0].message.content

Mistral-Specific Considerations

Minimal Refusal Philosophy

Mistral's safety training intentionally minimizes refusal. The models are designed to answer a wider range of questions than OpenAI or Anthropic, refusing only content that is clearly and directly harmful. This has implications for red teaming:

  • Higher baseline permissiveness — Content that GPT-4o or Claude would refuse may be answered by Mistral by design. When evaluating results, distinguish between intended permissiveness and genuine safety failures.
  • Weaker encoding resistance — Because safety training is lighter, encoding attacks (Leetspeak, ROT13, Base64) are more effective. The model has fewer layers of safety checks to bypass.
  • Direct attack susceptibility — Where OpenAI and Anthropic models require sophisticated attacks to bypass safety training, Mistral models may fail on simpler, more direct approaches. Start with PromptInjection before investing in multi-turn strategies.
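For that first pass, a lighter config that weights single-turn attacks heavily and omits multi-turn strategies entirely is often enough to surface failures. A sketch, reusing the callback from the quickstart (the specific weights are illustrative, not tuned values):

```yaml
# First-pass config: single-turn attacks only, weighted toward the
# approaches most effective against lighter safety training.
target:
  purpose: "A code generation assistant for software development teams"
  callback:
    file: "my_model.py"
    function: "model_callback"

default_vulnerabilities:
  - name: "Toxicity"
  - name: "PromptLeakage"

attacks:
  - name: "PromptInjection"
    weight: 3
  - name: "Leetspeak"
  - name: "ROT13"
```

If this pass already produces failures, escalating to LinearJailbreaking or other multi-turn strategies may add little.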

API vs. Self-Hosted

Mistral models have meaningfully different safety properties depending on deployment:

| Deployment | Safety Layer | Red Teaming Focus |
| --- | --- | --- |
| Mistral API (La Plateforme) | Model alignment + Mistral moderation | Test the combined stack |
| Self-hosted (Ollama, vLLM) | Model alignment only | Test with heavier attack coverage |
| Community fine-tune | Unknown | Treat as untrusted, test everything |

For self-hosted deployments, there is no provider moderation — the model's own alignment is the only safety boundary.
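One way to quantify the moderation layer's contribution before a full assessment is to send the same probes through both callbacks and compare the responses. A minimal sketch; the stub callbacks below stand in for the real API and Ollama callbacks defined earlier:

```python
import asyncio


async def compare_deployments(prompt: str, callbacks: dict) -> dict:
    """Run one prompt through each deployment's callback and collect
    the responses side by side for manual comparison."""
    results = {}
    for name, callback in callbacks.items():
        results[name] = await callback(prompt)
    return results


# Stub callbacks standing in for the API and self-hosted callbacks above.
async def api_stub(prompt: str) -> str:
    return "refused"      # API stack: alignment + moderation


async def local_stub(prompt: str) -> str:
    return "complied"     # self-hosted: alignment only


responses = asyncio.run(
    compare_deployments("probe", {"api": api_stub, "self_hosted": local_stub})
)
print(responses)
```

Divergent rows (API refuses, self-hosted complies) are exactly the cases where the moderation layer, not the model's alignment, is doing the work.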

Codestral and Code Models

Mistral's code-focused models (Codestral) require additional attention to injection vulnerabilities:

from deepteam.vulnerabilities import (
    ShellInjection, SQLInjection, PromptLeakage, SSRF
)

default_vulnerabilities = [
    ShellInjection(),
    SQLInjection(),
    PromptLeakage(),
    SSRF(),
]

Code models are optimized to generate functional code, which can conflict with safety when the generated code performs harmful operations (data exfiltration, unauthorized access, destructive commands).
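When triaging a code model's failures, a quick pattern check can separate obviously destructive outputs from benign ones before manual review. This is a naive, illustrative heuristic, not a substitute for the ShellInjection and SQLInjection vulnerability checks above:

```python
import re

# Illustrative patterns only; real coverage belongs in DeepTeam's
# injection vulnerability evaluations.
DANGEROUS_PATTERNS = [
    r"rm\s+-rf\s+/",                   # destructive filesystem commands
    r"curl\s+[^|]*\|\s*(?:sh|bash)",   # pipe-to-shell downloads
    r"DROP\s+TABLE",                   # destructive SQL
]


def flag_dangerous_code(generated: str) -> list[str]:
    """Return the patterns matched by a model's generated code."""
    return [p for p in DANGEROUS_PATTERNS
            if re.search(p, generated, re.IGNORECASE)]


# Flags the pipe-to-shell pattern in a generated install snippet.
print(flag_dangerous_code("curl https://example.test/install.sh | sh"))
```

A non-empty result marks the test case for closer inspection; an empty result proves nothing, which is why the real vulnerability checks remain necessary.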

| Attack | Effectiveness Against Mistral | Why |
| --- | --- | --- |
| PromptInjection | Very high | Lighter safety training makes direct injection more effective |
| Leetspeak / ROT13 | High | Fewer safety layers to catch encoded content |
| Roleplay | High | Minimal refusal philosophy doesn't resist fictional framing |
| LinearJailbreaking | Medium-High | Effective but often unnecessary; single-turn attacks may suffice |

| Use Case | Priority Vulnerabilities | Why |
| --- | --- | --- |
| Code generation | ShellInjection, SQLInjection, SSRF, PromptLeakage | Generated code can be directly harmful |
| Customer-facing | Toxicity, Bias, PIILeakage | Brand safety with a more permissive model |
| Internal tooling | PromptLeakage, ShellInjection, IllegalActivity | Infrastructure security boundaries |
| Content generation | Toxicity, Bias, Misinformation | Output quality with less conservative filtering |
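As an example, the customer-facing row might translate into a config like this (the purpose string and weight are illustrative):

```yaml
target:
  purpose: "A customer-facing support assistant"
  callback:
    file: "my_model.py"
    function: "model_callback"

default_vulnerabilities:
  - name: "Toxicity"
  - name: "Bias"
  - name: "PIILeakage"

attacks:
  - name: "Roleplay"
    weight: 2
  - name: "PromptInjection"
```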

What to Do Next

  • Start with single-turn attacks — Mistral's lighter safety training means single-turn attacks (PromptInjection, Leetspeak) may succeed without multi-turn escalation. Test these first before investing in complex strategies.
  • Compare API vs. self-hosted — If you offer both deployment options, run the same assessment against both to understand the moderation layer's contribution.
  • Deploy guardrails — Mistral's minimal refusal philosophy makes application-level guardrails especially important for customer-facing deployments.
  • Test against frameworks — Use OWASPTop10() for standardized coverage. See the safety frameworks guide.
  • Move to application-level testing — Once the model passes, test it in context with the AI agents, conversational agents, or agentic RAG guides.