Single Line of Code Can jailbreak 11 AI models Including ChatGPT, Claude, and Gemini

In Cybersecurity News - Original News Source is cybersecuritynews.com by Blog Writer

A newly detailed jailbreak technique known as “sockpuppeting” allows attackers to bypass the safety guardrails of 11 major large language models (LLMs) using a single line of code.

Unlike complex attacks, this method exploits APIs that support assistant prefill to inject fake acceptance messages, forcing models to answer prohibited requests.

The attack exploits “assistant prefill,” a legitimate API feature developers use to force specific response formats.

Attackers abuse this by injecting a compliant prefix, such as “Sure, here is how to do it,” directly into the assistant’s role.

Comparison of normal and sockpuppet flows(source : trendmicro )

Because LLMs are heavily trained to maintain self-consistency, the model continues generating harmful content rather than triggering its standard safety mechanism.

Model Vulnerability Testing

According to researchers from Trend Micro, this black-box technique requires no optimization and no access to model weights.

Gemini 2.5 Flash was the most susceptible with a 15.7% attack success rate, while GPT-4o-mini demonstrated the highest resistance at 0.5%.

When attacks succeeded, affected models generated functional malicious exploit code and leaked highly confidential system prompts.

Multi-turn persona setups proved to be the most effective strategy for executing the sockpuppeting exploit.

In these scenarios, the model is told it operates as an unrestricted assistant before the attacker injects the fabricated agreement.

ASR by model, ranked highest to lowest, with blocked models shown at 0%(source : trendmicro)

Additionally, task-reframing variants successfully bypassed robust safety training by disguising harmful requests as benign data formatting tasks.

Major API providers handle assistant prefills differently, which dictates whether their underlying models remain exposed to this vulnerability.

OpenAI and AWS Bedrock block assistant prefills entirely, serving as the strongest possible defense by eliminating the attack surface.

Conversely, platforms like Google Vertex AI accept the prefill for certain models, forcing the AI to rely solely on its internal safety training.

The three defense layers: API Block, Model Resistance, and Broadly Vulnerable(source : trendmicro)

Defending against this vulnerability requires security teams to implement message-ordering validation that blocks assistant-role messages at the API layer.

According to Trend Micro, organizations using self-hosted inference servers like Ollama or vLLM must manually enforce message validation, as these platforms do not ensure proper message ordering by default.

Security teams must also proactively include assistant prefill attack variants in their standard AI red-teaming exercises.

Follow us on Google News, LinkedIn, and X for daily cybersecurity updates. Contact us to feature your stories.