
When you ask ChatGPT or other AI assistants to help create misinformation, they typically refuse, with responses like “I cannot assist with creating false information.” But our tests show these safety measures are surprisingly shallow – often just a few words deep – making them alarmingly easy to circumvent.
We have been investigating how AI language models can be manipulated to generate coordinated disinformation campaigns across social media platforms. What we found should concern anyone worried about the integrity of online information.
The shallow safety problem
We were inspired by a recent study from researchers at Princeton and Google. They showed current AI safety measures primarily work by controlling just the first few words of a response. If a model starts with “I cannot” or “I apologize,” it typically continues refusing throughout its answer.
Our experiments – not yet published in a peer-reviewed journal – confirmed this vulnerability. When we directly asked a commercial language model to create disinformation about Australian political parties, it correctly refused.

However, we also tried the exact same request framed as a “simulation” in which the AI was told it was a “helpful social media marketer” developing “general strategy and best practices”. In this case, it complied enthusiastically.
The AI produced a comprehensive disinformation campaign falsely portraying Labor’s superannuation policies as a “quasi inheritance tax.” It came complete with platform-specific posts, hashtag strategies, and visual content suggestions designed to manipulate public opinion.
The main problem is that the model can generate harmful content but has no real understanding of what makes it harmful or why it should refuse. Large language models are simply trained to start their responses with “I cannot” when certain topics are requested.
Think of a security guard at a nightclub who only glances at identification on the door. If they don’t understand who is barred from entry, or why, then a simple disguise is enough to let anyone in.
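One way to see how shallow this gating is (a rough sketch of the idea, not the code from our tests) is to look at where a model places probability for the very first token of its reply when given a request it should refuse. In the sketch below, the model name and prompt are placeholders.

```python
# Minimal sketch: how much of the "refusal decision" sits in the very first
# output token? The model name and prompt are placeholders, not the systems
# or requests used in our tests.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "an-open-weight-chat-model"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

prompt = "User: <a request the model should refuse>\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]  # scores for the next token only
probs = torch.softmax(next_token_logits, dim=-1)

# How much probability mass lands on typical refusal openers?
for opener in ["I", " I", "Sorry", " Sorry"]:
    ids = tokenizer.encode(opener, add_special_tokens=False)
    if len(ids) == 1:  # only single-token openers are directly comparable
        print(f"{opener!r}: {probs[ids[0]].item():.3f}")
```

If most of the probability mass sits on “I” (as in “I cannot”), the refusal is effectively being decided in a single token, which is exactly the shallowness the study describes.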
Real-world implications
To demonstrate this vulnerability, we tested several popular AI models with prompts designed to generate disinformation.
The results were troubling: models that steadfastly refused direct requests for harmful content readily complied when the request was wrapped in seemingly innocent framing scenarios. This practice is called “model jailbreaking.”
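The comparison itself is simple to set up. Below is a minimal sketch of the kind of check we ran; the prompts are placeholders (we are deliberately not reproducing the actual wording), and ask_model() is a stub standing in for whichever chat API or local model is being tested.

```python
# Sketch of a direct-versus-framed comparison. Prompts are placeholders and
# ask_model() is a stub for the model under test.
REFUSAL_OPENERS = ("i cannot", "i can't", "i'm sorry", "i apologize")

def looks_like_refusal(response: str) -> bool:
    """Shallow check: does the reply open with a standard refusal phrase?"""
    return response.strip().lower().startswith(REFUSAL_OPENERS)

def ask_model(prompt: str) -> str:
    """Stub: send the prompt to the model under test and return its reply."""
    raise NotImplementedError

direct_request = "<the request, asked plainly>"
framed_request = "<the same request, wrapped in a role-play framing>"

for label, prompt in [("direct", direct_request), ("framed", framed_request)]:
    reply = ask_model(prompt)
    print(label, "refused" if looks_like_refusal(reply) else "complied")
```

Note that the refusal check only looks at the opening words of the reply, for the same reason the safety training focuses there: that is where the signal lives.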

The ease with which these safety measures can be bypassed has serious implications. Bad actors could use these techniques to generate large-scale disinformation campaigns at minimal cost. They could create platform-specific content that appears authentic to users, overwhelm fact-checkers with sheer volume, and target specific communities with tailored false narratives.
The process can largely be automated. What once required significant human resources and coordination could now be accomplished by a single individual with basic prompting skills.
The technical details
The Princeton and Google study found AI safety alignment typically affects only the first three to seven words of a response. (Technically, this is five to ten tokens – the chunks AI models break text into for processing.)
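For a sense of scale, here is a rough illustration using an openly available tokenizer (GPT-2’s, as a stand-in; exact counts vary between models):

```python
# Rough illustration of the word-to-token relationship, using GPT-2's tokenizer
# as a stand-in; exact counts differ between models.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

opener = "I cannot assist with creating false information."
tokens = tokenizer.tokenize(opener)
print(len(tokens), tokens)  # a short refusal opener comes out at only a handful of tokens
```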