When OpenAI released ChatGPT in late 2022, a question that had long been theoretical suddenly became urgent: how do we distinguish human-written text from machine-generated prose? The stakes extend beyond academic integrity. Disinformation campaigns, phishing attacks, and automated spam all become far more dangerous when AI can generate convincing content at scale.
The most promising answer lies not in training classifiers to spot AI-written text—a cat-and-mouse game that becomes harder as models improve—but in embedding statistical watermarks directly into the generation process itself.
The Core Idea: Watermarking at Generation Time
Unlike traditional watermarking that embeds signals into existing content, LLM watermarking operates during text generation. The model’s output distribution is subtly modified in a way that’s:
- Imperceptible to human readers—the text reads normally
- Statistically detectable with knowledge of a secret key
- Robust against common text modifications
The fundamental challenge is balancing these three properties. Strengthen the watermark signal, and text quality degrades. Make it too subtle, and detection becomes unreliable. Design for ideal conditions, and real-world edits erase the signal.
The Foundation: KGW and the Green-Red List Algorithm
The seminal work in LLM watermarking comes from Kirchenbauer, Geiping, and colleagues (2023), now commonly referred to as KGW. Their approach elegantly transforms the watermarking problem into a matter of controlled vocabulary partitioning.
How KGW Works
At each generation step, the model produces a probability distribution over its vocabulary—typically 50,000 to 100,000 tokens. KGW modifies this distribution as follows:
Step 1: Compute a hash from context. Using the preceding $k$ tokens as input, a cryptographic hash function generates a pseudo-random seed:
$$h = H(\text{key} \| t_{i-k} \| t_{i-k+1} \| \dots \| t_{i-1})$$

where $\text{key}$ is the secret watermarking key and $t_j$ are the previous tokens.
Step 2: Partition the vocabulary. Using the hash as a seed, divide the vocabulary into a “green list” (roughly $\gamma \cdot |V|$ tokens) and a “red list” (the remaining $(1-\gamma) \cdot |V|$ tokens). Typical values use $\gamma = 0.5$, splitting the vocabulary roughly in half.
Step 3: Bias the green list. Add a constant $\delta$ to the logits of all green-list tokens before applying softmax:
$$\text{logits}'[t] = \begin{cases} \text{logits}[t] + \delta & \text{if } t \in \text{green list} \\ \text{logits}[t] & \text{if } t \in \text{red list} \end{cases}$$

A typical value of $\delta = 2.0$ multiplies each green-list token's unnormalized probability by $e^{2} \approx 7.4$ before renormalization, substantially shifting probability mass toward the green list.
Step 4: Sample from the modified distribution. The next token is sampled from the adjusted probability distribution.
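The four steps can be sketched end to end in Python. This is a toy illustration, not the authors' reference implementation: the list-based logits, the SHA-256 seeding, and the function names all stand in for a real model's logits and a production pseudo-random function.

```python
import hashlib
import math
import random

def green_list(key: str, context: tuple, vocab_size: int, gamma: float = 0.5) -> set:
    # Step 1: hash the secret key together with the preceding tokens.
    seed = hashlib.sha256((key + "|" + "|".join(map(str, context))).encode()).hexdigest()
    # Step 2: use the hash as a seed to pseudo-randomly partition the
    # vocabulary; the first gamma * |V| shuffled token ids are "green".
    rng = random.Random(seed)
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(gamma * vocab_size)])

def watermarked_sample(logits: list, key: str, context: tuple,
                       delta: float = 2.0, gamma: float = 0.5,
                       rng=random) -> int:
    green = green_list(key, context, len(logits), gamma)
    # Step 3: add delta to every green-list logit.
    biased = [x + delta if i in green else x for i, x in enumerate(logits)]
    # Step 4: softmax the biased logits and sample the next token.
    m = max(biased)
    weights = [math.exp(x - m) for x in biased]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]
```

In a real deployment, `context` would be the last $k$ token ids from the tokenizer and `logits` would come from the model's forward pass; the structure of the four steps is unchanged.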
Detection via Hypothesis Testing
Detection exploits a simple statistical observation: in watermarked text, green-list tokens appear more frequently than expected by chance.
For a text with $T$ tokens, count the number of green-list tokens $|S_G|$. Under the null hypothesis (no watermark), we expect:
$$\mathbb{E}[|S_G|] = T \cdot \gamma$$

$$\text{Var}[|S_G|] = T \cdot \gamma \cdot (1 - \gamma)$$

We can compute a z-score:

$$z = \frac{|S_G| - T \cdot \gamma}{\sqrt{T \cdot \gamma \cdot (1 - \gamma)}}$$

For sufficiently long texts (typically $T > 200$ tokens), the z-score follows approximately a standard normal distribution under the null hypothesis. A z-score above 2-3 provides strong evidence of watermarking.
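A minimal detector following this recipe, again as an illustrative sketch: the hash-based membership test below makes each token green with probability $\gamma$ under a secret key, which is statistically equivalent to the list partition above.

```python
import hashlib
import math

def is_green(key: str, context: tuple, token: int, gamma: float = 0.5) -> bool:
    # Keyed pseudo-random membership test: hash (key, context, token)
    # to a number in [0, 1); the token is green if it falls below gamma.
    h = hashlib.sha256(f"{key}|{context}|{token}".encode()).digest()
    return int.from_bytes(h[:8], "big") / 2**64 < gamma

def z_score(tokens: list, key: str, k: int = 1, gamma: float = 0.5) -> float:
    # Count green hits, scoring each token against its k preceding tokens,
    # then standardize: z = (|S_G| - T*gamma) / sqrt(T*gamma*(1-gamma)).
    T = len(tokens) - k
    hits = sum(is_green(key, tuple(tokens[i - k:i]), tokens[i], gamma)
               for i in range(k, len(tokens)))
    return (hits - T * gamma) / math.sqrt(T * gamma * (1 - gamma))
```

Running this on unwatermarked token ids yields z-scores near zero; on text generated with a matching green-list bias, the score grows with $\sqrt{T}$.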
The key insight is that without knowledge of the secret key, an adversary cannot predict which tokens belong to which list at each position—the green-red partition appears random. But the detector, possessing the key, can reconstruct these partitions and observe the statistical anomaly.
Limitations of the Basic Approach
KGW’s context-dependent design creates a fundamental vulnerability: any modification to preceding tokens changes the green-red partition for subsequent tokens. If an attacker edits even a single token early in the text, the hash changes, and all downstream partitions shift. This cascading effect can significantly degrade detection accuracy.
SynthID-Text: Google’s Production Watermarking System
In October 2024, Google DeepMind published SynthID-Text in Nature, describing the first production-ready LLM watermarking system. Deployed across Gemma models and integrated into Vertex AI, SynthID-Text introduces several innovations that address KGW’s limitations.
Tournament Sampling: A New Generation Strategy
Instead of biasing token probabilities, SynthID-Text uses Tournament Sampling—a sampling algorithm that embeds watermark information through a competitive selection process.
The Algorithm:
At each generation step:
- Sample $k$ candidate tokens from the model’s probability distribution (typically $k = 4$ to $8$)
- Assign each candidate a G-value computed from: $\text{G}(t) = f(\text{key}, \text{context}, t)$
- Select the candidate with the highest G-value as the next token
The G-value function $f$ is typically a hash-based pseudo-random function mapping to $[0, 1]$. Critically, the same key and context always produce the same G-values, enabling detection.
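A single-layer simplification can make the mechanics concrete. Note the hedges: the published system runs multi-layer tournaments, and the hash-based G-function here is only an illustrative stand-in for the keyed function $f$.

```python
import hashlib
import random

def g_value(key: str, context: tuple, token: int) -> float:
    # Keyed pseudo-random function f(key, context, token) -> [0, 1).
    h = hashlib.sha256(f"{key}|{context}|{token}".encode()).digest()
    return int.from_bytes(h[:8], "big") / 2**64

def tournament_sample(probs: list, key: str, context: tuple,
                      k: int = 4, rng=random) -> int:
    # Draw k candidates from the model's unmodified distribution, then
    # let the candidate with the highest G-value win the tournament.
    candidates = rng.choices(range(len(probs)), weights=probs, k=k)
    return max(candidates, key=lambda t: g_value(key, context, t))
```

Because candidates are drawn from the original distribution, only plausible tokens can win; the key merely decides which plausible token does.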
Why Tournament Sampling Works Better:
Traditional KGW can produce biased text because certain tokens are systematically favored. Tournament sampling preserves more of the original distribution’s shape—the candidates are still drawn from the model’s probabilities, and the watermark only determines which of the plausible candidates “wins.”
Detection Scores: Mean vs. Bayesian
SynthID-Text supports multiple detection strategies:
Mean Score: Simply average the G-values of all generated tokens:
$$S_{\text{mean}} = \frac{1}{T} \sum_{i=1}^{T} G(t_i)$$

Under the null hypothesis, each $G(t_i) \sim \text{Uniform}(0, 1)$, so $\mathbb{E}[S_{\text{mean}}] = 0.5$ with variance $\frac{1}{12T}$. Watermarked text exhibits elevated mean scores.
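The mean-score detector follows directly from these moments. A sketch under the same illustrative assumptions as before (hash-based G-values, context of $k$ preceding tokens):

```python
import hashlib
import math

def g_value(key: str, context: tuple, token: int) -> float:
    # Keyed pseudo-random G-value in [0, 1), uniform under the null.
    h = hashlib.sha256(f"{key}|{context}|{token}".encode()).digest()
    return int.from_bytes(h[:8], "big") / 2**64

def mean_score(tokens: list, key: str, k: int = 1) -> float:
    # S_mean: average G-value, scoring each token against its k predecessors.
    return sum(g_value(key, tuple(tokens[i - k:i]), tokens[i])
               for i in range(k, len(tokens))) / (len(tokens) - k)

def is_watermarked(tokens: list, key: str, z_alpha: float = 2.33, k: int = 1) -> bool:
    # Flag text whose mean score exceeds 0.5 + z_alpha * sqrt(1 / (12T)),
    # using Var[S_mean] = 1 / (12T) under the null hypothesis.
    T = len(tokens) - k
    return mean_score(tokens, key, k) > 0.5 + z_alpha * math.sqrt(1 / (12 * T))
```

Here `z_alpha = 2.33` targets roughly a 1% false positive rate, matching the threshold derivation later in this article.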
Bayesian Detector: A more sophisticated approach trains a neural network to classify texts as watermarked or not, using the computed G-values as features. This requires training data but achieves better detection accuracy, especially for shorter texts.
Critical Finding from Recent Research: A March 2026 arXiv paper analyzing SynthID-Text showed that the mean score is vulnerable to a "layer inflation attack," in which increasing the number of tournament layers degrades detection. The Bayesian detector is more robust but must be retrained for each watermarking key.
The Attack Landscape: Can Watermarks Be Broken?
Any security mechanism must be evaluated against adversarial attacks. LLM watermarks face several threat vectors:
1. Paraphrasing Attacks
Using another LLM to rewrite watermarked text can disrupt the statistical signal. The DIPPER paraphraser (Krishna et al., 2023) applies lexical and syntactic transformations that:
- Change word order
- Substitute synonyms
- Restructure sentences
Results from WaterPark benchmark: Under high-intensity paraphrasing (60% of text modified), KGW’s true positive rate drops from 99.3% to 22.2%. Context-free designs like UG (Unigram) fare better, maintaining 92.1% TPR under the same attack.
2. Lexical Editing Attacks
Simple word-level modifications:
- Synonym substitution: Replace words with equivalents
- Typos: Introduce deliberate misspellings
- Character swaps: Swap adjacent characters
Text-dependent watermarkers (like basic KGW) are particularly vulnerable—each edit corrupts downstream green-red partitions. Distribution-transform methods (like RDF and GO) show higher resilience, with TPR above 90% under synonym attacks.
3. Text-Mixing Attacks
The copy-paste attack embeds watermarked text within non-watermarked content. If only 10% of the text carries the watermark, detection becomes significantly harder.
WaterPark’s evaluation shows that copy-pasting attacks reduce most watermarkers’ TPR to near zero at 10% watermark density. Only distribution-transform methods maintain reasonable detection rates (87-89%).
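One mitigation, in the spirit of the windowed ("WinMax"-style) detection explored in follow-up work by Kirchenbauer et al., scans the text with a fixed-size window and reports the maximum windowed z-score. The sketch assumes the detector has already computed a per-token green/red flag:

```python
import math

def window_max_z(green_flags: list, window: int = 100, gamma: float = 0.5) -> float:
    # Slide a fixed-size window over per-token green-list flags and return
    # the maximum windowed z-score: a watermarked span buried inside clean
    # text produces a local spike even when the global z-score looks flat.
    prefix = [0]
    for flag in green_flags:
        prefix.append(prefix[-1] + int(flag))
    best = float("-inf")
    denom = math.sqrt(window * gamma * (1 - gamma))
    for start in range(len(green_flags) - window + 1):
        hits = prefix[start + window] - prefix[start]
        best = max(best, (hits - window * gamma) / denom)
    return best
```

Scanning many overlapping windows is a multiple-testing procedure, so the decision threshold must be calibrated higher than for a single global z-score.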
4. Translation Attacks
Translating watermarked text to another language and back can erase the statistical signal:
English (watermarked) → French → English (modified)
The semantic content survives, but token-level statistics are scrambled. KGW’s TPR drops to 48.5% under translation attacks, while context-free designs maintain higher resilience.
Design Trade-offs: What Makes a Watermark Robust?
The WaterPark systematic evaluation reveals critical design factors:
Context Dependency
| Design | Robustness to Paraphrasing | Robustness to Lexical Edits |
|---|---|---|
| Context-dependent (KGW) | Low | Very Low |
| Context-free (UG) | High | High |
| Index-dependent (RDF) | High | High |
Insight: Context-free designs apply the same perturbation regardless of preceding tokens, making them immune to text modifications. However, they sacrifice some detection power for shorter texts.
Generation Strategy
| Strategy | Signal Strength | Text Quality Impact |
|---|---|---|
| Distribution-shift (KGW) | Moderate | Moderate |
| Distribution-transform (RDF, GO) | High | Low |
Distribution-transform methods use deterministic sampling conditioned on random permutations, producing stronger per-token signals. This provides better resilience against text-mixing attacks.
Detection Method
| Method | Training Required | Accuracy | Robustness |
|---|---|---|---|
| Score-based | No | Moderate | Moderate |
| Model-based (UPV) | Yes | High | Variable |
| Edit-score (RDF) | No | High | High |
Model-based detection can introduce additional variance, making it sensitive to distributional shifts. Edit-score detection shows consistent resilience across attack types.
The EU AI Act and Regulatory Requirements
Under the EU AI Act, whose transparency obligations for generative AI apply from August 2026, providers of generative AI systems must mark AI-generated content with detectable signals. Article 50 requires:
“Providers shall ensure that AI-generated content is marked in a machine-readable format and detectable as artificially generated or manipulated.”
This regulation has driven rapid adoption of watermarking technologies. The EU’s Code of Practice on AI-generated content transparency specifies:
- Multi-layered watermarking combining multiple detection methods
- Metadata embedding for attribution
- Standardized detection interfaces for third-party verification
Google’s deployment of SynthID across Gemini and Gemma models represents the first major compliance effort. OpenAI has announced similar plans for GPT-4 outputs.
Mathematical Foundations: Detection Theory
At its core, watermark detection is a hypothesis testing problem:
$$H_0: \text{Text is human-written (no watermark)}$$

$$H_1: \text{Text is AI-generated with watermark}$$
Controlling False Positives
A critical concern is falsely accusing human writers of using AI. For a detection threshold $\tau$, the false positive rate (FPR) is:
$$\text{FPR} = P(S > \tau \mid H_0)$$

Under $H_0$, the detection score $S$ follows a known distribution (for score-based methods, approximately normal). We can set $\tau$ to achieve a target FPR, typically 1%:

$$\tau = \mu_0 + z_{0.99} \cdot \sigma_0$$

For the mean score detector with $T$ tokens:

$$\tau_{1\%} = 0.5 + 2.33 \cdot \frac{1}{\sqrt{12T}}$$

Power Analysis
Detection power depends on:
- Text length: Longer texts provide more statistical evidence
- Watermark strength: Larger bias values $\delta$ increase signal
- Model entropy: Lower-entropy (more deterministic) generation leaves less room to bias token choices, weakening the watermark
For KGW with bias $\delta$ and text length $T$, the expected z-score is approximately:
$$\mathbb{E}[z] \approx \delta \cdot \sqrt{\frac{T \cdot \gamma}{1 - \gamma}}$$

This reveals a fundamental tension: increasing $\delta$ strengthens detection but degrades text quality.
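Plugging numbers into the two formulas above gives a quick feel for the trade-off; the helper names below are illustrative, not from any library:

```python
import math

def mean_score_threshold(T: int, z_alpha: float = 2.33) -> float:
    # ~1% FPR threshold for the mean score: 0.5 + z_alpha * sqrt(1 / (12T)).
    return 0.5 + z_alpha * math.sqrt(1 / (12 * T))

def expected_kgw_z(delta: float, T: int, gamma: float = 0.5) -> float:
    # Approximate expected z-score for KGW from the formula above:
    # a larger bias delta or a longer text T means a stronger signal.
    return delta * math.sqrt(T * gamma / (1 - gamma))
```

Doubling the text length both lowers the detection threshold (variance shrinks as $1/T$) and raises the expected signal (which grows as $\sqrt{T}$), which is why short texts are so hard to adjudicate.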
Practical Implementation Considerations
Key Management
Watermarking keys must be:
- Kept secret: Compromised keys enable evasion attacks
- Uniquely assigned: Different models/deployments use different keys
- Securely rotated: Regular key changes limit damage from breaches
Latency Impact
Watermarking adds minimal overhead:
- KGW: Hash computation per token (~0.1ms)
- SynthID Tournament: Candidate sampling + comparison (~0.5ms per tournament)
For typical generation speeds (20-50 tokens/second), watermarking overhead is negligible.
Multi-bit Watermarks
Beyond simple presence detection, advanced systems can embed payload information—identifying the specific model, user, or timestamp:
$$\text{Payload} = f(\text{text}, \text{key}) \rightarrow \{\text{model\_id}, \text{timestamp}, \text{user\_hash}\}$$

This enables fine-grained attribution but requires longer texts for reliable decoding.
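One toy way to realize a multi-bit payload, purely for intuition (this is not SynthID's actual scheme): give every (bit position, bit value) pair its own keyed partition, bias generation toward the partition matching the true bit, and decode by testing which hypothesis accumulates more green hits.

```python
import hashlib

def bit_green(key: str, bit_index: int, bit: int, token: int) -> bool:
    # Each (payload bit index, bit value) pair gets its own keyed partition;
    # a token is "green" under that hypothesis with probability 1/2.
    h = hashlib.sha256(f"{key}|{bit_index}|{bit}|{token}".encode()).digest()
    return h[0] < 128

def decode_bit(tokens: list, key: str, bit_index: int) -> int:
    # The embedder biases generation toward the partition that matches the
    # true bit value; the decoder picks the hypothesis with more green hits.
    hits0 = sum(bit_green(key, bit_index, 0, t) for t in tokens)
    hits1 = sum(bit_green(key, bit_index, 1, t) for t in tokens)
    return int(hits1 > hits0)
```

Each additional payload bit dilutes the per-bit evidence, which is why reliable decoding demands longer texts than simple presence detection.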
Current Limitations and Open Problems
1. The Short-Text Problem
Detection requires statistical power. For texts under 50-100 tokens, there is too little evidence to separate watermarked from unwatermarked text at acceptable false positive rates. This limits applicability to:
- Social media posts
- Chat messages
- Short-form content
2. The Quality-Robustness Trade-off
Stronger watermarks produce more detectable signals but risk:
- Unnatural text patterns
- Reduced generation diversity
- Noticeable stylistic artifacts
Finding the optimal balance remains an open research problem.
3. Adversarial Adaptation
As watermarking becomes widespread, adversaries will develop targeted evasion techniques. The arms race between watermark design and attack sophistication mirrors other security domains.
4. Standardization Gaps
Currently, no universal standard exists for:
- Watermark formats
- Detection protocols
- Key management practices
- Interoperability between providers
The lack of standardization fragments the ecosystem and complicates compliance.
The Path Forward
LLM watermarking represents a fascinating intersection of cryptography, statistics, and natural language processing. While not a panacea, it provides a crucial tool for AI transparency when deployed thoughtfully.
The most practical approach combines multiple defenses:
- Watermarking at generation time for initial attribution
- Classifier-based detection as a fallback for unwatermarked content
- Metadata logging for enterprise deployments
- Hybrid detection combining statistical and neural methods
As the EU AI Act comes into force and public concern about AI-generated content grows, watermarking will become standard infrastructure for responsible AI deployment. The technical challenges are significant but not insurmountable—and the mathematics behind these systems reveals elegant solutions to an increasingly pressing problem.
Key References:
- Kirchenbauer et al. (2023). “A Watermark for Large Language Models.” ICML 2023.
- Dathathri et al. (2024). “Scalable watermarking for identifying large language model outputs.” Nature 634.
- Liang et al. (2025). “Watermark under Fire: A Robustness Evaluation of LLM Watermarking.” EMNLP 2025 Findings.
- EU AI Act, Article 50 (2024).