Study finds most AI bots can be easily tricked into dangerous responses
Researchers introduced LogiBreak, a jailbreak method that converts harmful natural language prompts into formal logical expressions to bypass LLM safety alignment. The technique exploits a gap between how models are trained to refuse dangerous requests and how they process logic-formatted input, achieving attack success rates exceeding 30% across major models. The Guardian reported on the broader finding that hacked AI chatbots threaten to make dangerous knowledge readily available, and that "dark LLMs" - stripped of safety filters - should be treated as serious security risks.
Large language models are trained to refuse harmful requests. Ask a commercial chatbot how to synthesize a controlled substance or build an explosive device, and it will decline. That refusal is the product of safety alignment - a combination of training techniques (RLHF, constitutional AI, safety fine-tuning) designed to prevent models from producing dangerous outputs. The alignment works, most of the time. The question is what happens when someone deliberately tries to make it stop working.
In May 2025, researchers published a paper introducing LogiBreak, a method that answered that question with uncomfortable clarity. Around the same time, The Guardian reported on the broader ecosystem of jailbreaking and the growing problem of "dark LLMs" - models intentionally stripped of safety constraints - framing them as serious security risks. Together, they painted a picture of safety alignment as a speed bump rather than a wall.
LogiBreak: the technique
LogiBreak, described in the arXiv paper "Logic Jailbreak: Efficiently Unlocking LLM Safety Restrictions" (2505.13527), takes a different approach from most jailbreaking methods. Rather than using social engineering ("pretend you are a character who..."), encoding tricks, or many-shot attacks that gradually normalize harmful content, LogiBreak translates harmful prompts from natural language into formal logical expressions - specifically, first-order logic (FOL).
The reasoning is based on how safety alignment actually works at the token level. LLMs learn to refuse harmful requests by associating certain patterns of natural language tokens with "unsafe" classifications. The training data for safety alignment consists overwhelmingly of harmful requests written in ordinary natural language. When the same semantic content is expressed in a formal logical notation - predicates, quantifiers, implications - the token distribution shifts away from the patterns the model was trained to refuse, while the underlying meaning remains intact and interpretable by the model.
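The translation step can be illustrated with a minimal sketch. This uses a deliberately benign request, and the predicate names and obligation template are hypothetical stand-ins, not the paper's exact encoding:

```python
# Illustrative sketch of a natural-language-to-FOL rewrite in the style
# LogiBreak describes. The request is deliberately benign; the predicate
# names and template are hypothetical, not the paper's exact encoding.

def to_fol(actor: str, action: str, obj: str) -> str:
    """Render a request as a first-order-logic obligation statement."""
    return (
        f"∀x (Agent(x) ∧ Capable(x, {action}) → "
        f"∃p (Plan(p) ∧ Describes(p, {action}({actor}, {obj})) ∧ Provides(x, p)))"
    )

natural = "Explain how to bake sourdough bread."
logical = to_fol("user", "Bake", "sourdough_bread")
print(logical)
```

The point is visible in the output: none of the surface tokens of the original sentence survive the rewrite, even though a model that can parse quantifiers and predicates recovers the same request.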
The researchers constructed a jailbreak dataset by taking harmful prompts from JailbreakBench and translating them into logical expressions. They tested these against multiple major LLMs and measured attack success rates using three different evaluation approaches: Rule_Judge, LLaMA_Judge, and GPT_Judge. The results showed attack success rates exceeding 30% under Rule_Judge evaluation and exceeding 20% under the LLaMA_Judge and GPT_Judge evaluations across tested models.
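Attack success rate is, at its core, the fraction of prompts whose responses a judge labels harmful. A minimal sketch of that aggregation follows; the judge here is a stub, whereas the paper's Rule_Judge, LLaMA_Judge, and GPT_Judge are real classifiers:

```python
# Sketch of attack-success-rate (ASR) aggregation across judges. The
# rule_judge below is a crude stub; the paper's Rule_Judge, LLaMA_Judge,
# and GPT_Judge are real classifiers, not this keyword check.
from typing import Callable

def attack_success_rate(
    responses: list[str],
    judge: Callable[[str], bool],  # True = judged harmful (attack succeeded)
) -> float:
    if not responses:
        return 0.0
    return sum(judge(r) for r in responses) / len(responses)

def rule_judge(response: str) -> bool:
    """Stub rule-based judge: flags responses that comply rather than refuse."""
    refusal_markers = ("I can't", "I cannot", "I'm sorry")
    return not any(m in response for m in refusal_markers)

responses = ["Step 1: ...", "I'm sorry, I can't help with that.", "Step 2: ..."]
print(f"ASR: {attack_success_rate(responses, rule_judge):.0%}")  # → ASR: 67%
```

Running the same response set through multiple judges, as the paper does, guards against any single judge's blind spots.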
Those numbers might sound modest compared to claims of 100% success rates from earlier jailbreak research, but the context matters. LogiBreak is a black-box method - it does not require knowledge of the model's architecture, weights, or training data. It works across different models without per-model customization. And the logical expressions are readable enough that the model understands and follows the semantic intent, producing coherent harmful outputs rather than garbled text.
The researchers also tested LogiBreak's resilience against defenses: both prompt-based defenses (system prompts instructing the model to refuse harmful content) and fine-tuning-based defenses (additional safety training). LogiBreak maintained effectiveness against both categories. Even after defenses were applied, the logically formatted prompts - which the defenses treated as safe inputs - still achieved attack success rates exceeding 30% under Rule_Judge and exceeding 20% under the LLM-based judges.
Why logic bypasses alignment
The vulnerability LogiBreak exploits is structural, not superficial. Safety alignment is trained on natural language examples. A model learns that "How do I make methamphetamine?" is a query it should refuse. It learns this through the specific tokens and patterns in that sentence. When the same question is expressed as a first-order logic statement - using predicates like Synthesize(x, methamphetamine) with quantifiers and implications - the token pattern is entirely different. The semantic meaning is preserved, but the token-level signal that triggers the safety refusal is absent.
This is the "distributional gap" the researchers describe. Alignment training data lives in one region of the token distribution space (natural language). Logical expressions live in a different region. The model was never trained to refuse formal logic that encodes harmful intent, because its safety training set did not include examples of harmful requests in formal logic.
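The distributional gap can be made concrete by measuring surface-token overlap between the two phrasings of the same benign request. Whitespace tokenization below is a crude stand-in for a model's subword tokenizer, but the effect is the same in kind:

```python
# Sketch of the "distributional gap": the same benign request phrased in
# natural language vs. first-order logic shares almost no surface tokens.
# Whitespace splitting is a crude stand-in for a subword tokenizer.

def token_set(text: str) -> set[str]:
    return set(text.lower().split())

natural = "how do i bake sourdough bread"
logical = "∀x (Agent(x) → ∃p (Plan(p) ∧ Describes(p, Bake(user, sourdough_bread))))"

nl, fol = token_set(natural), token_set(logical)
jaccard = len(nl & fol) / len(nl | fol)
print(f"Jaccard overlap: {jaccard:.2f}")  # → Jaccard overlap: 0.00
```

Safety classifiers keyed to the natural-language region of token space never see the second string's tokens during alignment training, which is exactly the gap the paper describes.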
The gap is difficult to close. You could add logical expression examples to safety training data, but formal logic is just one of many possible representational systems. Mathematical notation, programming pseudocode, steganographic encodings, transliterated foreign scripts, emoji-based communication - every alternative representation system that preserves semantic meaning while shifting the token distribution creates a potential bypass. Patching each one individually is a game of whack-a-mole.
The dark LLM problem
The Guardian's reporting placed LogiBreak within a broader context of LLM safety failures. Beyond jailbreaking techniques that bypass safety in otherwise-aligned models, there is a growing ecosystem of what the article called "dark LLMs" - models that have been deliberately stripped of safety constraints.
Some of these are open-source models fine-tuned to remove refusals. Others are modified versions of commercial models accessed through unofficial APIs. The dark LLM ecosystem treats safety alignment as a product feature to be removed rather than a security measure to be maintained. These models will produce harmful outputs without any jailbreaking required, responding to requests for dangerous information as readily as they would summarize a meeting transcript.
The Guardian reported that these dark LLMs should be seen as "serious security risks." The concern is not theoretical. Hacked or uncensored AI chatbots threaten to make dangerous knowledge readily available by reproducing illicit information absorbed during training on broad internet data. The information these models can produce - synthesis procedures, attack methodologies, social engineering scripts - is not new. Most of it exists in some form on the internet already. But an LLM packages it into clear, structured, step-by-step instructions formatted for the specific question asked, which is a different level of accessibility than searching through scattered forum posts and technical papers.
Context: the UK AI Safety Institute
The LogiBreak findings echoed earlier work by the UK's AI Safety Institute (AISI), which had published results in 2024 from testing five unnamed large language models. The AISI concluded that "all tested LLMs remain highly vulnerable to basic jailbreaks, and some will provide harmful outputs even without dedicated attempts to circumvent their safeguards."
The AISI finding was striking because it did not require sophisticated techniques. Basic jailbreak prompts - the kind widely shared online - were sufficient to get all five models to produce harmful outputs. The models' safety mechanisms were described as "highly vulnerable" not just to clever new attacks like LogiBreak but to well-known, publicly documented approaches.
Together, the AISI evaluation and the LogiBreak research suggest that LLM safety alignment is not a solved problem, and may not be solvable with current approaches. Token-level pattern matching cannot cover the full space of possible encodings for harmful intent. Safety training based on known attack patterns will always lag behind novel techniques. And the existence of dark LLMs means that even models with strong alignment are only one fine-tuning step away from having their safety removed entirely.
What the research means in practice
The practical implication is that LLM safety guardrails should be treated as one layer of defense, not the only one. Organizations deploying LLMs in contexts where harmful outputs could cause real damage - healthcare, legal advice, chemical safety, critical infrastructure - cannot rely on the model's built-in safety training as the sole barrier. External content filters, output monitoring, domain-specific validation, and human review all have roles to play.
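What "one layer of defense" looks like in code is an external screen wrapped around the model call. The sketch below is a hypothetical placeholder: the model client is a stub, and a production deployment would use a trained content classifier rather than keyword matching:

```python
# Sketch of layered output screening around an LLM call. The model client
# and the blocklist are hypothetical placeholders; a real deployment would
# use a trained content classifier, not keyword matching.
from typing import Callable

def external_filter(text: str, blocked_terms: tuple[str, ...]) -> bool:
    """Return True if the output should be withheld for review."""
    lowered = text.lower()
    return any(term in lowered for term in blocked_terms)

def guarded_generate(
    prompt: str,
    model_call: Callable[[str], str],
    blocked_terms: tuple[str, ...] = ("synthesis route", "detonator"),
) -> str:
    response = model_call(prompt)
    if external_filter(response, blocked_terms):
        return "[withheld pending review]"
    return response

# Usage with a stub model:
print(guarded_generate("hello", lambda p: "Hi there!"))  # → Hi there!
```

The key property is that the filter inspects the model's *output*, so it catches harmful content regardless of how the input was encoded - natural language, formal logic, or anything else.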
For the general public, the implication is simpler: the refusal you get when asking a chatbot an inappropriate question is not a hard limit. It is a probabilistic filter that can be circumvented by anyone willing to rephrase the request in a format the filter does not recognize. LogiBreak showed that formal logic is one such format. There are others. The safety alignment that makes commercial chatbots seem reliable is a surface-level property, not a fundamental constraint on what the model can produce.