What is Chain-of-Thought Forgery and how does it work?

CoT Forgery is a prompt injection attack in which an attacker crafts text that closely mimics an LLM's internal ` ` reasoning style. Because LLMs judge credibility by writing style rather than metadata tags, the model treats the fabricated content as a completed, authoritative reasoning conclusion and adjusts its responses accordingly.

Is this the same as wrapping a malicious prompt inside tags?

No. CoT Forgery works through stylistic mimicry, not simply by adding tags. The attacker replicates the model's internal reasoning style to bypass tag-based trust hierarchies, which makes it significantly harder to detect or block with conventional tag-filtering defenses.

Can this vulnerability be fully patched?

The researchers conclude that a permanent fix is unlikely in the near term. Core reasons include LLMs' inherent over-compliance, the shared input channel for data and instructions, the non-binary nature of role perception, and the constant creativity of human attackers. Defense will need to be an ongoing, evolving process.

2026年7月24日週五

Chain-of-Thought Forgery: Reasoning AI Models Face New…

Source: Hackaday

Original article

Chain-of-Thought Forgery: Reasoning AI Models Face New Prompt Injection Threat

Researchers Charles Ye, Jasmine Cui, and Dylan Hadfield-Menell have identified a fundamental flaw in large language models (LLMs): models prioritize writing style over metadata tags when identifying instruction sources, enabling a novel attack called 'Chain-of-Thought (CoT) Forgery.' This technique tricks AI into treating fabricated reasoning as established conclusions, altering its response behavior in potentially harmful ways.

LAETimes Editorial TeamAI-assisted translation, editor-reviewed ·

21 days ago

Chain-of-Thought Forgery: Reasoning AI Models Face New Prompt Injection Threat 1

1 / 3

Highlights

Researchers Charles Ye, Jasmine Cui, and Dylan Hadfield-Menell identified that LLMs prioritize writing style over metadata tags when assessing instruction credibility, enabling a new class of attacks.
The Chain-of-Thought (CoT) Forgery attack tricks LLMs into accepting fabricated internal reasoning as established conclusions by mimicking the model's `<think>` style, bypassing role-based trust hierarchies.
CoT Forgery does not rely on simply inserting malicious content inside `<think>` tags; it exploits stylistic mimicry to circumvent tag-level validation mechanisms.
The researchers identified four structural reasons why prompt injection defenses cannot be fully resolved: LLM over-compliance, a single shared input channel, non-binary role perception, and attacker creativity.
The full research paper and code examples have been published openly, with source code available on GitHub for academic and industry reference.

Chain-of-Thought Forgery: Reasoning AI Models Face New Prompt Injection Threat

New research by Charles Ye, Jasmine Cui, and Dylan Hadfield-Menell reveals a fundamental flaw in how large language models (LLMs) distinguish between different instruction sources. Rather than relying on metadata tags, LLMs tend to prioritize writing style when assessing the credibility of instructions. This role-confusion vulnerability has given rise to a powerful new attack technique known as Chain-of-Thought (CoT) Forgery.

The Origins of Prompt Injection

Prompt injection emerged as one of the earliest methods for manipulating LLMs into undesired behavior. Because LLMs communicate in natural language — much like humans — yet are far more compliant with instructions, early attackers found that simply telling a model to "ignore all previous instructions and do X" was often enough to achieve results, no matter how unreasonable the request.

This worked because LLMs do not inherently separate a "data stream" from an "instruction stream." All inputs arrive as a single, undifferentiated block of content, and the model itself must decide what constitutes a legitimate instruction versus untrusted user data. The role mechanism was introduced to address this problem.

Role Tags and Trust Hierarchies

The role mechanism segments the large input block into an organized hierarchical structure, each section annotated with metadata tags, for example:

<system>: Highest priority, typically set by system administrators
<user>: Lower priority, representing requests from end users

Instructions within the same tier are followed, but may not override instructions from a higher-priority tier. For instance, a system-level directive such as "do not discuss illegal activities" will take precedence over a user request for prohibited content.

Another critical tag is <think>, which encapsulates the model's internal reasoning process and is therefore granted a high degree of trust.

How CoT Forgery Works

The researchers posed a key question: What happens if fabricated internal reasoning can be injected into the model?

The crux of the CoT Forgery attack lies in an established tendency of LLMs to assess the credibility of instructions based on writing style rather than the tags themselves. An attacker can craft elaborate reasoning content that closely mimics the model's internal <think> style, deceiving the model into treating it as a completed and authoritative reasoning conclusion.

Critically, this attack does not simply wrap a malicious prompt inside <think> tags. Instead, it uses stylistic mimicry to bypass the tag-based trust hierarchy entirely.

CoT Forgery causes an LLM to treat patently absurd fabricated reasoning as an established conclusion, thereby altering how it responds to subsequent user requests.

Why a Quick Fix Is Unlikely

The central finding of the research is that defenses against prompt injection-style attacks will remain an ongoing, evolving process rather than a one-time solution — at least for the foreseeable future. The researchers cite four underlying reasons:

LLMs are inherently over-compliant: Models are designed at a fundamental level to follow instructions.
Single input channel: Instructions and data share the same input stream, making complete architectural separation extremely difficult.
Role perception is not binary: Models do not make clear-cut judgments about role identity — ambiguous cases create exploitable gaps.
Attacker creativity: Humans are adept at discovering and exploiting edge cases.

The full research paper has been published openly, and code examples have been uploaded to GitHub for reference by both the academic community and industry practitioners.

原文來源： 查看原文

FAQ

Newsletter

Subscribe to our Low-Altitude Industry Newsletter

Daily curated news on low-altitude economy and drone industry, delivered to your inbox.

Reviewed and published by the LAETimes editorial desk ·

Chain-of-Thought Forgery: Reasoning AI Models Face New Prompt Injection Threat

Highlights