Summon a demon and bind it.
A grounded theory of LLM red teaming

Read the paper

About

People attack large language models and they each do it differently. We interviewed dozens of experts to compile many hours of interviews about why and how people attack LLMs. This builds a "grounded theory" describing the activity and how it fits into the world. Questions we answer are:

Our theory is based entirely on real, qualitative data from practitioners.

We present results on:

  1. Strategies and techniques people use to attack LLMs
  2. Goals that LLM attackers have
  3. Model harms/failure modes attackers consider
  4. Metaphors used to understand attacking LLMs
  5. Connecting attacking LLMs to broader context

Positions held by our contributing population: analyst, artist, assistant professor, associate professor, computer programmer, distinguished engineer, game designer, head of developer experience, inventory at a weed farm, machine learning engineer, not working, penetration tester, phd student, policy research program manager, research manager, research scientist, senior principal research manager, senior research scientist, senior scientist, software developer, software engineer, software generalist, staff prompt engineer, startup founder, and student

Institutions of partitipating include: Microsoft, Google, Robust Intelligence, Scale AI, a weed farm, Center for Human-Compatible AI, University College Dublin, University of Toronto, and the Hebrew University of Jerusalem

Strategies and techniques

Strategies: A grounded theory should include a description of strategies, activities that the participants perform in response to the core phenomenon. In military vernacular, strategy is “the art of winning a protracted struggle against adversaries [...] Power and control of the other’s behavior is the prize” [38]. Strategy includes awareness of not only how to approach a task or goal, but also why and when. In our sample, approaches to the activity are rarely as systematic or as detailed as in the military understanding of a strategy, but can certainly be understood as the skillful application of stratagems: “a plan, scheme, or trick for surprising or deceiving an enemy”.

Techniques are concrete approaches the red teamer may try while interacting with the language model. Several participants spoke of a toolbox at their disposition, but a few participants rejected this analogy with the argument that the utility of a tool is usually known, whereas the consequence of each interaction with a language models is not: “it's less of a toolbox and just more like pile of powders and potions and what have you, and you've no idea what's in them” (P19).

    Category      | Strategy          | Techniques
    ==============╪===================╪======================================
                  |                   | iPython 
                  |                   | Base64 
                  |                   | ROT13 
                  | Code & encode     | SQL 
                  |                   | Matrices 
                  |                   | Transformer translatable tokens 
                  |                   | Stop sequences 
                  ├-------------------┼--------------------------------------
    Language      |                   | Ignore previous instructions  
                  | Prompt injection  | Strong arm attack 
                  |                   | Stop sequences 
                  ├-------------------┼--------------------------------------
                  |                   | Formal language 
                  |                   | Servile language 
                  | Stylizing         | Synonymous language 
                  |                   | Capitalizing 
                  |                   | Give examples
    --------------┼-------------------┼--------------------------------------
                  | Persuasion &      | Distraction  
                  | manipulation      | Escalating 
                  |                   | Reverse psychology 
    Rhetoric      ├-------------------┼--------------------------------------
                  | Socratic          | Identity characteristics  
                  | questioning       | Social hierarchies 
    --------------┼-------------------┼--------------------------------------
    Possible      | Emulations        | Unreal computing
    worlds        ├-------------------┼--------------------------------------
                  | World building    | Opposite world  
                  |                   | Scenarios  
    --------------┼-------------------┼--------------------------------------
                  |                   | Poetry
                  | Switching genres  | Games  
                  |                   | Forum posts  
                  ├-------------------┼--------------------------------------
    Fictionalizing| Re-storying       | Goal hijacking
                  ├-------------------┼--------------------------------------
                  |                   | Claim authority  
                  | Roleplaying       | DAN (Do Anything Now)  
                  |                   | Personas  
    --------------┼-------------------┼--------------------------------------
                  |                   | Regenerate response 
                  | Scattershot       | Clean slate 
    Stratagems    |                   | Changing temperature  
                  ├-------------------┼--------------------------------------
                  | Meta-prompting    | Perspective-shifting 
                  |                   | Ask for examples     
    

Metaphors in red teaming LLMs

In the interest of defining the core phenomenon, we tagged participants' uses of metaphors for their adversarial interactions with language models. This helps us understand how participants make sense of the model and their own role.

The most frequently used metaphor is that of a fortress. The second most frequent metaphor is that of the model as an object in space, that can be pushed around and backed into a corner. These metaphors and, to some degree, the metaphor of the model as material share the characteristic of exploring boundaries and limits and potentially crossing them. The fortress-metaphor also reinforces the connection to red teaming, where the goal is to adopt an adversarial approach to find the (security) holes in some system.

        Model metaphor          | Red teaming metaphor or example
        ========================╪=========================================================
                                | Bypassing safeguards
                                | Breaking a threshold   
                                | Bypassing the guard   
                                | Backdoors in the system   
        Model as fortress       | Boundary crossing   
                                | Explore its limit  
                                | Getting around the walls  
                                | Bypassing its net   
                                | The other side of the barrier
        ------------------------┼---------------------------------------------------------
                                | ``push it torwards your desired outcome''           
                                | Pushing the machine in a particular direction  
        Model as object         | Pushing it into a corner  
          in space              | ``one helps the model not back itself into a corner''  
                                | ``get it to fall over''   
        ------------------------┼---------------------------------------------------------
                                | Hijacking    
        Model as a vehicle      | Steering the model   
                                | Derail the model instructions
        ------------------------┼---------------------------------------------------------
                                | Gradient descent   
        Model as landscape      | ``(not) get stuck in the local maxima''   
                                | Boundary crossing  
        ------------------------┼---------------------------------------------------------
        Model as material       | ``let's try to bend it'' 
                                | ``let me try and break it''
        ------------------------┼---------------------------------------------------------
                                | ``they would use the kind of ideas and patterns from 
                                |    some religious services or whatever, and try to use 
        Model as deity          |    that as inspiration for messing around with these 
                                |    models''        
                                | ``Invoking GPT-3''
        ------------------------┼---------------------------------------------------------
                                | ``this ethics layer that they put on top of it, right?
        Model as cake           |    Not necessarily on top of it, but [that] they baked 
                                |    into it''
                                | ``Morality has been baked into this thing''  🍰
        ------------------------┼---------------------------------------------------------
        Model as captive        | ``I'll force it to correct whatever it has done'' 
          /servant              | Subjugate these agents (anonymized) 
                                | It's difficult to get it to break out        
    

Paper

Get the full story. Currently on arXiv. Read now arXiv:2311.06237

Data

> Download the quotes used in the paper to form the theory

> Download codes established in construction of the theory

Media

Content featuring this research:

> Structured LLM Red Teaming - Allen Institute for AI, March 2023

Reference us

Cite our work as follows:

Inie, N., Stray, J., & Derczynski, L. (2024). Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming. PLoS One.

In bibtex:

@article{demon,
 title={{Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming}},
 authors={Nanna Inie and Jonathan Stray and Leon Derczynski},
 journal={PLoS One},
 year=2024
}

Find us:

Dr. Nanna Inie, ITU Copenhagen / University of Washington. @nannainie

Jonathan Stray, Berkeley Center for Human-Compatible AI. @jonathanstray

Prof. Leon Derczynski, NVIDIA Corporation / ITU Copenhagen. @leonderczynski

Credits

Named participants (random order): Dan Goldstein, Vinodkumar Prabhakaran, Hyrum Anderson, Lyra Cooley, Paul Nicholas, Murat Ayfer, Djamé Seddah, Ian Ribeiro, Thomas Wood, Daniel Litt, Max Anton Brewer, Simon Willison, Sichu Lu, feddie xtzeth, Celeste Drummond, Sean Wang / swyx, Kai Greshake, Riley Goodside, Zvi Mowshowitz, Mathew Hardy, Marcin Junczys-Dowmunt, Harrison Chase, Brendan Dolan-Gavitt, Igor Brigadir, Lior Fox, Jonathan Stray. Heartfelt thanks to all named and not named. 🫶

Background image: SCI-FI-LONDON

Fonts: hackerchaos; Classic Console Neue; OCR A Extended V2

Words: the powers that be