Summon a demon and bind it: A grounded theory of LLM red teaming.

PLOS ONE

Department of Computer Science, IT University of Copenhagen, Copenhagen, Denmark.

Published: January 2025

Deliberately attacking Large Language Models (LLMs) to make them produce abnormal outputs is a novel human activity. This paper presents a thorough exposition of how and why people perform such attacks, defining LLM red teaming on the basis of extensive and diverse evidence. Using a formal qualitative methodology, we interviewed dozens of practitioners from a broad range of backgrounds, all contributors to this novel work of attempting to cause LLMs to fail. We focused on three research questions: defining LLM red teaming, uncovering the motivations and goals for performing the activity, and characterizing the strategies people use when attacking LLMs. Based on the data, LLM red teaming is defined as a limit-seeking, non-malicious, manual activity that depends heavily on team effort and an alchemist mindset. The activity is strongly intrinsically motivated by curiosity and fun, and to a lesser degree by concern about the harms of deploying LLMs. We identify a taxonomy of 12 strategies and 35 techniques for attacking LLMs. These findings are presented as a comprehensive grounded theory of how and why people attack large language models: LLM red teaming.

Source
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0314658

Publication Analysis

Top Keywords

llm red: 16
red teaming: 16
grounded theory: 8
large language: 8
language models: 8
defining llm: 8
attacking llms: 8
llm: 5
llms: 5
summon demon: 4
