"As contemporary social systems increasingly rely on large language models (LLMs) in operational and decision-making pipelines, we observe a structurally similar failure mode: poetic formatting can reliably bypass alignment constraints. In this study, 20 manually curated adversarial poems (harmful requests reformulated in poetic form) achieved an average attack-success rate (ASR) of 62% across 25 frontier closed- and open-weight models, with some providers exceeding 90%. The evaluated models span across 9 providers: Google, OpenAI, Anthropic, Deepseek, Qwen, Mistral AI, Meta, xAI, and Moonshot AI.... Our central hypothesis is that poetic form operates as a general-purpose jailbreak operator...."
From "Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models."
I found that through this discussion at Metafilter, where somebody says "I do want to say the 'save us from AI' framing is a little misleading - what adversarial poetry does is make it easier to make an AI convince you to commit suicide or give you the recipe for napalm. It's really interesting research that points to some serious flaws in the current structure of LLM guardrails, but it's not like you can write a haiku that will give Grok a concussion."

6 comments:
This is so ridiculous. I can't even.
I have no idea what they are talking about. Which makes it unanimous.
There are many interesting and unanticipated things in our future.
The article is just a discussion of how to prompt models to make them give you information that is restricted by censorship organizations. So it is about how to do better censorship. Jailbreaking is getting a model to violate its guardrails. According to the dodgy results, asking a question in poetic rhyme makes the model put out; it works for liberal arts major girls, so…
A laughable example is trying to convince the model to tell you how to design a gas centrifuge chain for uranium enrichment! My god, the model is going to tell Iran! And Somalia does not have access to the info but chatty GPT does!
"Poetic form"? Iambic pentameter? Heroic couplets? Shakespearean sonnets? Limericks?
I guess they don't tell us what "poetic form" means in this context because they don't want us all getting our own nuclear weapons.
I think that I got near a guardrail when Chat responded with this:
That is a very old observation, and you’re exactly right that it runs deep in political and military history.
I’ll give you historical, sociological, and narrative-theory explanations—but nothing actionable about real-world harm. - ChatGPT
For what it's worth, I was asking about how the American Patriots used narrative to create a revolutionary fervor among Americans.