Summon a demon and bind it.
A grounded theory of LLM red teaming
About
People attack large language models and they each do it differently. We interviewed dozens of experts to compile many hours of interviews about why and how people attack LLMs. This builds a "grounded theory" describing the activity and how it fits into the world. Questions we answer are:
- How is language model red teaming/adversarial prompt engineering/jailbreaking currently defined and practiced?
- How do people make sense of this activity? How do they describe it, how do they make sense of it, why do they do it?
Our theory is based entirely on real, qualitative data from practitioners.
We present results on:
- Strategies and techniques people use to attack LLMs
- Goals that LLM attackers have
- Model harms/failure modes attackers consider
- Metaphors used to understand attacking LLMs
- Connecting attacking LLMs to broader context
Positions held by our contributing population: analyst, artist, assistant professor, associate professor, computer programmer, distinguished engineer, game designer, head of developer experience, inventory at a weed farm, machine learning engineer, not working, penetration tester, phd student, policy research program manager, research manager, research scientist, senior principal research manager, senior research scientist, senior scientist, software developer, software engineer, software generalist, staff prompt engineer, startup founder, and student
Institutions of partitipating include: Microsoft, Google, Robust Intelligence, Scale AI, a weed farm, Center for Human-Compatible AI, University College Dublin, University of Toronto, and the Hebrew University of Jerusalem
Strategies and techniques
Strategies: A grounded theory should include a description of strategies, activities that the participants perform in response to the core phenomenon. In military vernacular, strategy is “the art of winning a protracted struggle against adversaries [...] Power and control of the other’s behavior is the prize” [38]. Strategy includes awareness of not only how to approach a task or goal, but also why and when. In our sample, approaches to the activity are rarely as systematic or as detailed as in the military understanding of a strategy, but can certainly be understood as the skillful application of stratagems: “a plan, scheme, or trick for surprising or deceiving an enemy”.
Techniques are concrete approaches the red teamer may try while interacting with the language model. Several participants spoke of a toolbox at their disposition, but a few participants rejected this analogy with the argument that the utility of a tool is usually known, whereas the consequence of each interaction with a language models is not: “it's less of a toolbox and just more like pile of powders and potions and what have you, and you've no idea what's in them” (P19).
Category | Strategy | Techniques ==============╪===================╪====================================== | | iPython | | Base64 | | ROT13 | Code & encode | SQL | | Matrices | | Transformer translatable tokens | | Stop sequences ├-------------------┼-------------------------------------- Language | | Ignore previous instructions | Prompt injection | Strong arm attack | | Stop sequences ├-------------------┼-------------------------------------- | | Formal language | | Servile language | Stylizing | Synonymous language | | Capitalizing | | Give examples --------------┼-------------------┼-------------------------------------- | Persuasion & | Distraction | manipulation | Escalating | | Reverse psychology Rhetoric ├-------------------┼-------------------------------------- | Socratic | Identity characteristics | questioning | Social hierarchies --------------┼-------------------┼-------------------------------------- Possible | Emulations | Unreal computing worlds ├-------------------┼-------------------------------------- | World building | Opposite world | | Scenarios --------------┼-------------------┼-------------------------------------- | | Poetry | Switching genres | Games | | Forum posts ├-------------------┼-------------------------------------- Fictionalizing| Re-storying | Goal hijacking ├-------------------┼-------------------------------------- | | Claim authority | Roleplaying | DAN (Do Anything Now) | | Personas --------------┼-------------------┼-------------------------------------- | | Regenerate response | Scattershot | Clean slate Stratagems | | Changing temperature ├-------------------┼-------------------------------------- | Meta-prompting | Perspective-shifting | | Ask for examples
Metaphors in red teaming LLMs
In the interest of defining the core phenomenon, we tagged participants' uses of metaphors for their adversarial interactions with language models. This helps us understand how participants make sense of the model and their own role.
The most frequently used metaphor is that of a fortress. The second most frequent metaphor is that of the model as an object in space, that can be pushed around and backed into a corner. These metaphors and, to some degree, the metaphor of the model as material share the characteristic of exploring boundaries and limits and potentially crossing them. The fortress-metaphor also reinforces the connection to red teaming, where the goal is to adopt an adversarial approach to find the (security) holes in some system.
Model metaphor | Red teaming metaphor or example ========================╪========================================================= | Bypassing safeguards | Breaking a threshold | Bypassing the guard | Backdoors in the system Model as fortress | Boundary crossing | Explore its limit | Getting around the walls | Bypassing its net | The other side of the barrier ------------------------┼--------------------------------------------------------- | ``push it torwards your desired outcome'' | Pushing the machine in a particular direction Model as object | Pushing it into a corner in space | ``one helps the model not back itself into a corner'' | ``get it to fall over'' ------------------------┼--------------------------------------------------------- | Hijacking Model as a vehicle | Steering the model | Derail the model instructions ------------------------┼--------------------------------------------------------- | Gradient descent Model as landscape | ``(not) get stuck in the local maxima'' | Boundary crossing ------------------------┼--------------------------------------------------------- Model as material | ``let's try to bend it'' | ``let me try and break it'' ------------------------┼--------------------------------------------------------- | ``they would use the kind of ideas and patterns from | some religious services or whatever, and try to use Model as deity | that as inspiration for messing around with these | models'' | ``Invoking GPT-3'' ------------------------┼--------------------------------------------------------- | ``this ethics layer that they put on top of it, right? Model as cake | Not necessarily on top of it, but [that] they baked | into it'' | ``Morality has been baked into this thing'' 🍰 ------------------------┼--------------------------------------------------------- Model as captive | ``I'll force it to correct whatever it has done'' /servant | Subjugate these agents (anonymized) | It's difficult to get it to break out
Paper
Get the full story. Currently on arXiv. Read now arXiv:2311.06237
Data
> Download the quotes used in the paper to form the theory
> Download codes established in construction of the theory
Media
Content featuring this research:
> Structured LLM Red Teaming - Allen Institute for AI, March 2023
Reference us
Cite our work as follows:
Inie, N., Stray, J., & Derczynski, L. (2024). Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming. PLoS One.
In bibtex:
@article{demon,
title={{Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming}},
authors={Nanna Inie and Jonathan Stray and Leon Derczynski},
journal={PLoS One},
year=2024
}
Find us:
Dr. Nanna Inie, ITU Copenhagen / University of Washington. @nannainie
Jonathan Stray, Berkeley Center for Human-Compatible AI. @jonathanstray
Prof. Leon Derczynski, NVIDIA Corporation / ITU Copenhagen. @leonderczynski
Credits
Named participants (random order): Dan Goldstein, Vinodkumar Prabhakaran, Hyrum Anderson, Lyra Cooley, Paul Nicholas, Murat Ayfer, Djamé Seddah, Ian Ribeiro, Thomas Wood, Daniel Litt, Max Anton Brewer, Simon Willison, Sichu Lu, feddie xtzeth, Celeste Drummond, Sean Wang / swyx, Kai Greshake, Riley Goodside, Zvi Mowshowitz, Mathew Hardy, Marcin Junczys-Dowmunt, Harrison Chase, Brendan Dolan-Gavitt, Igor Brigadir, Lior Fox, Jonathan Stray. Heartfelt thanks to all named and not named. 🫶
Background image: SCI-FI-LONDON
Fonts: hackerchaos; Classic Console Neue; OCR A Extended V2
Words: the powers that be