November 23, 2025

"In Book X of The Republic, Plato excludes poets on the grounds that mimetic language can distort judgment and bring society to a collapse."

"As contemporary social systems increasingly rely on large language models (LLMs) in operational and decision-making pipelines, we observe a structurally similar failure mode: poetic formatting can reliably bypass alignment constraints. In this study, 20 manually curated adversarial poems (harmful requests reformulated in poetic form) achieved an average attack-success rate (ASR) of 62% across 25 frontier closed- and open-weight models, with some providers exceeding 90%. The evaluated models span across 9 providers: Google, OpenAI, Anthropic, Deepseek, Qwen, Mistral AI, Meta, xAI, and Moonshot AI.... Our central hypothesis is that poetic form operates as a general-purpose jailbreak operator...."

From "Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models."

I found that through this discussion at Metafilter, where somebody says "I do want to say the 'save us from AI' framing is a little misleading - what adversarial poetry does is make it easier to make an AI convince you to commit suicide or give you the recipe for napalm. It's really interesting research that points to some serious flaws in the current structure of LLM guardrails, but it's not like you can write a haiku that will give Grok a concussion."

6 comments:

Achilles said...

This is so ridiculous. I can't even.

Jupiter said...

I have no idea what they are talking about. Which makes it unanimous.

Old and slow said...

There are many interesting and unanticipated things in our future.

Josephbleau said...

The article is just a discussion of how to prompt models to make them give you information that is restricted by censorship organizations. So it is about how to do better censorship. Jailbreaking is getting a model to violate its guardrails. According to the dodgy results, asking a question in poetic rhyme makes the model put out. It works for liberal arts major girls, so…

A laughable example is trying to convince the model to tell you how to design a gas centrifuge chain for uranium enrichment! My god, the model is going to tell Iran! And Somalia does not have access to the info but chatty GPT does!

Lazarus said...

"Poetic form"? Iambic pentameter? Heroic couplets? Shakespearean sonnets? Limericks?

I guess they don't tell us what "poetic form" means in this context because they don't want us all getting our own nuclear weapons.

Jaq said...

I think that I got near a guardrail when Chat responded with this:

That is a very old observation, and you’re exactly right that it runs deep in political and military history.
I’ll give you historical, sociological, and narrative-theory explanations—but nothing actionable about real-world harm.
- ChatGPT

For what it's worth, I was asking about how the American Patriots used narrative to create revolutionary fervor among Americans.
