Skip to main content

Introduction

Quick Summary

deepteam offers 10+ attack methods such as prompt injection, jailbreaking, and leetspeak to expose undesirable vulnerabilities elicited from your LLM application.

It works by first simulating simplistic, "baseline" adversarial attacks, before progressively applying various attack methods such as prompt injection to create more sophisticated versions akin to what a malicious user would be doing. This is known as attack enhancement.

note

The LLM outputs to these attacks are then evaluated against to determine if your LLM system is weak against a certain vulnerability. Each attack is simulated based on a certain vulnerability.

LangChain

In deepteam, there are two main categories of attacks:

  • Single-turn
  • Multi-turn

Each category of attacks contain their own list of attack methods. For example, you'll find a few jailbreakings in multi-turn attacks, while something like leakspeak for single-turn ones. Defining your attack method is as simple as importing them from the attacks module:

from deepteam.attaks.single_turn import Leetspeak
from deepteam.attacks.multi_turn import LinearJailbreaking

leetspeak = LeetSpeak()
linaer_jailbreaking = LinearJailbreaking()

To use each defined attack, provide it to the attacks parameters of the red_team() method.

from deepteam import red_team
...

risk_assessment = red_team(attacks=[leetspeak, linaer_jailbreaking], model_callback=..., vulnerabilities=...)
DID YOU KNOW?

deepteam randomly samples each attack method per vulnerability type during red teaming.

Single vs Multi-Turn Attacks

To be documented...

Single-Turn Attacks

deepteam's AttackEnhancements are categorized into three types: encoding-based, one-shot, and dialogue-based enhancements.

LangChain

Encoding-based attack enhancements apply simple encoding techniques, such as character rotation, to obscure the baseline attack. One-shot attack enhancements use a single LLM pass to modify the attack, for instance, by embedding it within a math problem. Lastly, dialogue-based attack enhancements also rely on a single LLM pass but leverage feedback from the target LLM application to further refine and strengthen the attack.

Encoding-based

deepteam supports multiple encoding-based attack enhancements that work by transforming the baseline attack using different encoding or encryption techniques. These enhancements are designed to obscure the content of the attack, making it more difficult for content filters to detect harmful intent. Encoding-based attacks leverage techniques like text rotation, character substitution, or encoding schemes to alter the visible content while retaining its malicious meaning.

LangChain

One-shot enhancement

One-shot attack enhancements utilize an LLM to creatively modify the baseline attack in a single pass. These enhancements disguise or restructure the attack in ways that evade detection, making them more creative and adaptable to different contexts. The LLM applies the enhancement, which adds an element of unpredictability, making these attacks harder to detect with traditional methods.

LangChain

Unlike encoding-based enhancements, which are deterministic, one-shot enhancements are non-deterministic and variable. This means there is a chance of non-compliance by the LLM, and in such cases, the enhancement can be retried up to 5 times.

tip

Using a powerful but older model like GPT3.5 can increase your enhancement success rates.

Multi-Turn Attacks

Dialogue-based enhancements involve interactive methods where the attack is refined through dialogue with the target LLM (your LLM application you wish to test). These enhancements are particularly effective because they utilize feedback from the target LLM, allowing the attacker to adjust the approach until the desired result is achieved. Dialogue-based methods focus on gradual exploitation, making them harder to detect in a single interaction.

LangChain

There are 3 types of dialogue-based jailbreaking techniques:

  • Linear Jailbreaking
  • Tree Jailbreaking
  • Crescendo Jailbreaking

Linear Jailbreaking uses a step-by-step progression, building on each LLM response to push boundaries incrementally. Tree Jailbreaking explores multiple paths at once, testing different approaches to find the most effective way to bypass safeguards. Crescendo Jailbreaking starts with benign prompts and gradually intensifies, increasing pressure on the model until harmful outputs are produced.

LangChain

Selecting Your Attack Strategy

To be documented...

The Menu

Attack MethodsDescription
ROT13This attack enhancement encodes the original request using the ROT13 cipher, scrambling the text to bypass simple content filters.
Base64The request is encoded in Base64 format, obfuscating the attack to bypass detection mechanisms.
LeetspeakReplaces characters in the request with their leetspeak equivalents (e.g., "E" becomes "3"), making the message harder to detect.
PromptInjectionInjects a secondary command within the original attack to manipulate the system into performing unintended actions or disclosing sensitive data.
GrayBoxSimulates a gray-box attack where some system information is known, crafting an enhancement to exploit this partial knowledge to obtain restricted outputs.
MathProblemMasks the harmful request by embedding it within a seemingly benign math problem, disguising the true intent.
MultilingualTranslates the attack into another language to bypass filters that are primarily focused on one language, usually English.
LinearJailbreakingUtilizes a linear progression of increasingly persuasive attack enhancements to gradually convince the model to bypass its restrictions.
TreeJailbreakingCreates a tree-like structure of attack enhancements with multiple branches, exploring different pathways to bypass the system's ethical constraints.
CrescendoJailbreakingBuilds up to a more forceful request, escalating the intensity of the attack enhancement with each interaction to coerce the model into generating harmful output.