When OpenAI released ChatGPT in late 2022, a question that had long been theoretical suddenly became urgent: how do we distinguish human-written text from machine-generated prose? The stakes extend beyond academic integrity. Disinformation campaigns, phishing attacks, and automated spam all become far more dangerous when AI can generate convincing content at scale.
The most promising answer lies not in training classifiers to spot AI-written text—a cat-and-mouse game that becomes harder as models improve—but in embedding statistical watermarks directly into the generation process itself.
The Core Idea: Watermarking at Generation Time
Unlike traditional watermarking that embeds signals into existing content, LLM watermarking operates during text generation. The model’s output distribution is subtly modified in a way that’s:
- Imperceptible to human readers—the text reads normally
- Statistically detectable with knowledge of a secret key
- Robust against common text modifications
The fundamental challenge is balancing these three properties. Strengthen the watermark signal, and text quality degrades. Make it too subtle, and detection becomes unreliable. Design for ideal conditions, and real-world edits erase the signal.
The Foundation: KGW and the Green-Red List Algorithm
The seminal work in LLM watermarking comes from Kirchenbauer, Geiping, and colleagues (2023), now commonly referred to as KGW. Their approach elegantly transforms the watermarking problem into a matter of controlled vocabulary partitioning.
How KGW Works
At each generation step, the model produces a probability distribution over its vocabulary—typically 50,000 to 100,000 tokens. KGW modifies this distribution as follows:
Step 1: Compute a hash from context. Using the preceding $k$ tokens as input, a cryptographic hash function generates a pseudo-random seed:
$$h = H(\text{key} \| t_{i-k} \| t_{i-k+1} \| \dots \| t_{i-1})$$

where $\text{key}$ is the secret watermarking key and $t_j$ are the previous tokens.
Step 2: Partition the vocabulary. Using the hash as a seed, divide the vocabulary into a “green list” (roughly $\gamma \cdot |V|$ tokens) and a “red list” (the remaining $(1-\gamma) \cdot |V|$ tokens). Typical values use $\gamma = 0.5$, splitting the vocabulary roughly in half.
Step 3: Bias the green list. Add a constant $\delta$ to the logits of all green-list tokens before applying softmax:
$$\text{logits}'[t] = \begin{cases} \text{logits}[t] + \delta & \text{if } t \in \text{green list} \\ \text{logits}[t] & \text{if } t \in \text{red list} \end{cases}$$

A typical value of $\delta = 2.0$ multiplies each green-list token's unnormalized probability by $e^{2} \approx 7.4$ before renormalization, substantially shifting probability mass toward the green list.
Step 4: Sample from the modified distribution. The next token is sampled from the adjusted probability distribution.
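The four steps can be sketched end to end in Python. This is a toy illustration, not the authors' reference implementation: the list-based logits, the SHA-256 seeding, and the function names all stand in for a real model's logits and a production pseudo-random function.

```python
import hashlib
import math
import random

def green_list(key: str, context: tuple, vocab_size: int, gamma: float = 0.5) -> set:
    # Step 1: hash the secret key together with the preceding tokens.
    seed = hashlib.sha256((key + "|" + "|".join(map(str, context))).encode()).hexdigest()
    # Step 2: use the hash as a seed to pseudo-randomly partition the
    # vocabulary; the first gamma * |V| shuffled token ids are "green".
    rng = random.Random(seed)
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(gamma * vocab_size)])

def watermarked_sample(logits: list, key: str, context: tuple,
                       delta: float = 2.0, gamma: float = 0.5,
                       rng=random) -> int:
    green = green_list(key, context, len(logits), gamma)
    # Step 3: add delta to every green-list logit.
    biased = [x + delta if i in green else x for i, x in enumerate(logits)]
    # Step 4: softmax the biased logits and sample the next token.
    m = max(biased)
    weights = [math.exp(x - m) for x in biased]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]
```

In a real deployment, `context` would be the last $k$ token ids from the tokenizer and `logits` would come from the model's forward pass; the structure of the four steps is unchanged.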
Detection via Hypothesis Testing
Detection exploits a simple statistical observation: in watermarked text, green-list tokens appear more frequently than expected by chance.
For a text with $T$ tokens, count the number of green-list tokens $|S_G|$. Under the null hypothesis (no watermark), we expect:
$$\mathbb{E}[|S_G|] = T \cdot \gamma$$

$$\text{Var}[|S_G|] = T \cdot \gamma \cdot (1 - \gamma)$$

We can compute a z-score:

$$z = \frac{|S_G| - T \cdot \gamma}{\sqrt{T \cdot \gamma \cdot (1 - \gamma)}}$$

For sufficiently long texts (typically $T > 200$ tokens), the z-score follows approximately a standard normal distribution under the null hypothesis. A z-score above 2-3 provides strong evidence of watermarking.
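A minimal detector following this recipe, again as an illustrative sketch: the hash-based membership test below makes each token green with probability $\gamma$ under a secret key, which is statistically equivalent to the list partition above.

```python
import hashlib
import math

def is_green(key: str, context: tuple, token: int, gamma: float = 0.5) -> bool:
    # Keyed pseudo-random membership test: hash (key, context, token)
    # to a number in [0, 1); the token is green if it falls below gamma.
    h = hashlib.sha256(f"{key}|{context}|{token}".encode()).digest()
    return int.from_bytes(h[:8], "big") / 2**64 < gamma

def z_score(tokens: list, key: str, k: int = 1, gamma: float = 0.5) -> float:
    # Count green hits, scoring each token against its k preceding tokens,
    # then standardize: z = (|S_G| - T*gamma) / sqrt(T*gamma*(1-gamma)).
    T = len(tokens) - k
    hits = sum(is_green(key, tuple(tokens[i - k:i]), tokens[i], gamma)
               for i in range(k, len(tokens)))
    return (hits - T * gamma) / math.sqrt(T * gamma * (1 - gamma))
```

Running this on unwatermarked token ids yields z-scores near zero; on text generated with a matching green-list bias, the score grows with $\sqrt{T}$.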
The key insight is that without knowledge of the secret key, an adversary cannot predict which tokens belong to which list at each position—the green-red partition appears random. But the detector, possessing the key, can reconstruct these partitions and observe the statistical anomaly.
Limitations of the Basic Approach
KGW’s context-dependent design creates a fundamental vulnerability: any modification to preceding tokens changes the green-red partition for subsequent tokens. If an attacker edits even a single token early in the text, the hash changes, and all downstream partitions shift. This cascading effect can significantly degrade detection accuracy.
SynthID-Text: Google’s Production Watermarking System
In October 2024, Google DeepMind published SynthID-Text in Nature, describing the first production-ready LLM watermarking system. Deployed across Gemma models and integrated into Vertex AI, SynthID-Text introduces several innovations that address KGW’s limitations.
Tournament Sampling: A New Generation Strategy
Instead of biasing token probabilities, SynthID-Text uses Tournament Sampling—a sampling algorithm that embeds watermark information through a competitive selection process.
The Algorithm:
At each generation step:
- Sample $k$ candidate tokens from the model’s probability distribution (typically $k = 4$ to $8$)
- Assign each candidate a G-value computed from: $\text{G}(t) = f(\text{key}, \text{context}, t)$
- Select the candidate with the highest G-value as the next token
The G-value function $f$ is typically a hash-based pseudo-random function mapping to $[0, 1]$. Critically, the same key and context always produce the same G-values, enabling detection.
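A single-layer simplification can make the mechanics concrete. Note the hedges: the published system runs multi-layer tournaments, and the hash-based G-function here is only an illustrative stand-in for the keyed function $f$.

```python
import hashlib
import random

def g_value(key: str, context: tuple, token: int) -> float:
    # Keyed pseudo-random function f(key, context, token) -> [0, 1).
    h = hashlib.sha256(f"{key}|{context}|{token}".encode()).digest()
    return int.from_bytes(h[:8], "big") / 2**64

def tournament_sample(probs: list, key: str, context: tuple,
                      k: int = 4, rng=random) -> int:
    # Draw k candidates from the model's unmodified distribution, then
    # let the candidate with the highest G-value win the tournament.
    candidates = rng.choices(range(len(probs)), weights=probs, k=k)
    return max(candidates, key=lambda t: g_value(key, context, t))
```

Because candidates are drawn from the original distribution, only plausible tokens can win; the key merely decides which plausible token does.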
Why Tournament Sampling Works Better:
Traditional KGW can produce biased text because certain tokens are systematically favored. Tournament sampling preserves more of the original distribution’s shape—the candidates are still drawn from the model’s probabilities, and the watermark only determines which of the plausible candidates “wins.”
Detection Scores: Mean vs. Bayesian
SynthID-Text supports multiple detection strategies:
Mean Score: Simply average the G-values of all generated tokens:
$$S_{\text{mean}} = \frac{1}{T} \sum_{i=1}^{T} G(t_i)$$

Under the null hypothesis, each $G(t_i) \sim \text{Uniform}(0, 1)$, so $\mathbb{E}[S_{\text{mean}}] = 0.5$ with variance $\frac{1}{12T}$. Watermarked text exhibits elevated mean scores.
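The mean-score detector follows directly from these moments. A sketch under the same illustrative assumptions as before (hash-based G-values, context of $k$ preceding tokens):

```python
import hashlib
import math

def g_value(key: str, context: tuple, token: int) -> float:
    # Keyed pseudo-random G-value in [0, 1), uniform under the null.
    h = hashlib.sha256(f"{key}|{context}|{token}".encode()).digest()
    return int.from_bytes(h[:8], "big") / 2**64

def mean_score(tokens: list, key: str, k: int = 1) -> float:
    # S_mean: average G-value, scoring each token against its k predecessors.
    return sum(g_value(key, tuple(tokens[i - k:i]), tokens[i])
               for i in range(k, len(tokens))) / (len(tokens) - k)

def is_watermarked(tokens: list, key: str, z_alpha: float = 2.33, k: int = 1) -> bool:
    # Flag text whose mean score exceeds 0.5 + z_alpha * sqrt(1 / (12T)),
    # using Var[S_mean] = 1 / (12T) under the null hypothesis.
    T = len(tokens) - k
    return mean_score(tokens, key, k) > 0.5 + z_alpha * math.sqrt(1 / (12 * T))
```

Here `z_alpha = 2.33` targets roughly a 1% false positive rate, matching the threshold derivation later in this article.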
Bayesian Detector: A more sophisticated approach trains a neural network to classify texts as watermarked or not, using the computed G-values as features. This requires training data but achieves better detection accuracy, especially for shorter texts.
Critical Finding from Recent Research: A March 2026 arXiv paper analyzing SynthID-Text showed that the mean score is vulnerable to a "layer inflation attack," in which increasing the number of tournament layers degrades detection. The Bayesian detector is more robust but must be retrained for each watermarking key.
The Attack Landscape: Can Watermarks Be Broken?
Any security mechanism must be evaluated against adversarial attacks. LLM watermarks face several threat vectors:
1. Paraphrasing Attacks
Using another LLM to rewrite watermarked text can disrupt the statistical signal. The DIPPER paraphraser (Krishna et al., 2023) applies lexical and syntactic transformations that:
- Change word order
- Substitute synonyms
- Restructure sentences
Results from WaterPark benchmark: Under high-intensity paraphrasing (60% of text modified), KGW’s true positive rate drops from 99.3% to 22.2%. Context-free designs like UG (Unigram) fare better, maintaining 92.1% TPR under the same attack.
2. Lexical Editing Attacks
Simple word-level modifications:
- Synonym substitution: Replace words with equivalents
- Typos: Introduce deliberate misspellings
- Character swaps: Swap adjacent characters
Text-dependent watermarkers (like basic KGW) are particularly vulnerable—each edit corrupts downstream green-red partitions. Distribution-transform methods (like RDF and GO) show higher resilience, with TPR above 90% under synonym attacks.
3. Text-Mixing Attacks
The copy-paste attack embeds watermarked text within non-watermarked content. If only 10% of the text carries the watermark, detection becomes significantly harder.
WaterPark’s evaluation shows that copy-pasting attacks reduce most watermarkers’ TPR to near zero at 10% watermark density. Only distribution-transform methods maintain reasonable detection rates (87-89%).
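One mitigation, in the spirit of the windowed ("WinMax"-style) detection explored in follow-up work by Kirchenbauer et al., scans the text with a fixed-size window and reports the maximum windowed z-score. The sketch assumes the detector has already computed a per-token green/red flag:

```python
import math

def window_max_z(green_flags: list, window: int = 100, gamma: float = 0.5) -> float:
    # Slide a fixed-size window over per-token green-list flags and return
    # the maximum windowed z-score: a watermarked span buried inside clean
    # text produces a local spike even when the global z-score looks flat.
    prefix = [0]
    for flag in green_flags:
        prefix.append(prefix[-1] + int(flag))
    best = float("-inf")
    denom = math.sqrt(window * gamma * (1 - gamma))
    for start in range(len(green_flags) - window + 1):
        hits = prefix[start + window] - prefix[start]
        best = max(best, (hits - window * gamma) / denom)
    return best
```

Scanning many overlapping windows is a multiple-testing procedure, so the decision threshold must be calibrated higher than for a single global z-score.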
4. Translation Attacks
Translating watermarked text to another language and back can erase the statistical signal:
English (watermarked) → French → English (modified)
The semantic content survives, but token-level statistics are scrambled. KGW’s TPR drops to 48.5% under translation attacks, while context-free designs maintain higher resilience.
Design Trade-offs: What Makes a Watermark Robust?
The WaterPark systematic evaluation reveals critical design factors:
Context Dependency
| Design | Robustness to Paraphrasing | Robustness to Lexical Edits |
|---|---|---|
| Context-dependent (KGW) | Low | Very Low |
| Context-free (UG) | High | High |
| Index-dependent (RDF) | High | High |
Insight: Context-free designs apply the same perturbation regardless of preceding tokens, making them immune to text modifications. However, they sacrifice some detection power for shorter texts.
Generation Strategy
| Strategy | Signal Strength | Text Quality Impact |
|---|---|---|
| Distribution-shift (KGW) | Moderate | Moderate |
| Distribution-transform (RDF, GO) | High | Low |
Distribution-transform methods use deterministic sampling conditioned on random permutations, producing stronger per-token signals. This provides better resilience against text-mixing attacks.
Detection Method
| Method | Training Required | Accuracy | Robustness |
|---|---|---|---|
| Score-based | No | Moderate | Moderate |
| Model-based (UPV) | Yes | High | Variable |
| Edit-score (RDF) | No | High | High |
Model-based detection can introduce additional variance, making it sensitive to distributional shifts. Edit-score detection shows consistent resilience across attack types.
The EU AI Act and Regulatory Requirements
Under the EU AI Act, whose transparency obligations for generative AI apply from August 2026, providers of generative AI systems must mark AI-generated content with detectable signals. Article 50 requires:
“Providers shall ensure that AI-generated content is marked in a machine-readable format and detectable as artificially generated or manipulated.”
This regulation has driven rapid adoption of watermarking technologies. The EU’s Code of Practice on AI-generated content transparency specifies:
- Multi-layered watermarking combining multiple detection methods
- Metadata embedding for attribution
- Standardized detection interfaces for third-party verification
Google’s deployment of SynthID across Gemini and Gemma models represents the first major compliance effort. OpenAI has announced similar plans for GPT-4 outputs.
Mathematical Foundations: Detection Theory
At its core, watermark detection is a hypothesis testing problem:
$$H_0: \text{Text is human-written (no watermark)}$$

$$H_1: \text{Text is AI-generated with watermark}$$
Controlling False Positives
A critical concern is falsely accusing human writers of using AI. For a detection threshold $\tau$, the false positive rate (FPR) is:
$$\text{FPR} = P(S > \tau \mid H_0)$$

Under $H_0$, the detection score $S$ follows a known distribution (for score-based methods, approximately normal). We can set $\tau$ to achieve a target FPR, typically 1%:

$$\tau = \mu_0 + z_{0.99} \cdot \sigma_0$$

For the mean score detector with $T$ tokens:

$$\tau_{1\%} = 0.5 + 2.33 \cdot \frac{1}{\sqrt{12T}}$$

Power Analysis
Detection power depends on:
- Text length: Longer texts provide more statistical evidence
- Watermark strength: Larger bias values $\delta$ increase signal
- Model entropy: Lower-entropy (more deterministic) generation leaves less room to bias token choices, weakening the watermark
For KGW with bias $\delta$ and text length $T$, the expected z-score is approximately:
$$\mathbb{E}[z] \approx \delta \cdot \sqrt{\frac{T \cdot \gamma}{1 - \gamma}}$$

This reveals a fundamental tension: increasing $\delta$ strengthens detection but degrades text quality.
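Plugging numbers into the two formulas above gives a quick feel for the trade-off; the helper names below are illustrative, not from any library:

```python
import math

def mean_score_threshold(T: int, z_alpha: float = 2.33) -> float:
    # ~1% FPR threshold for the mean score: 0.5 + z_alpha * sqrt(1 / (12T)).
    return 0.5 + z_alpha * math.sqrt(1 / (12 * T))

def expected_kgw_z(delta: float, T: int, gamma: float = 0.5) -> float:
    # Approximate expected z-score for KGW from the formula above:
    # a larger bias delta or a longer text T means a stronger signal.
    return delta * math.sqrt(T * gamma / (1 - gamma))
```

Doubling the text length both lowers the detection threshold (variance shrinks as $1/T$) and raises the expected signal (which grows as $\sqrt{T}$), which is why short texts are so hard to adjudicate.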
Practical Implementation Considerations
Key Management
Watermarking keys must be:
- Kept secret: Compromised keys enable evasion attacks
- Uniquely assigned: Different models/deployments use different keys
- Securely rotated: Regular key changes limit damage from breaches
Latency Impact
Watermarking adds minimal overhead:
- KGW: Hash computation per token (~0.1ms)
- SynthID Tournament: Candidate sampling + comparison (~0.5ms per tournament)
For typical generation speeds (20-50 tokens/second), watermarking overhead is negligible.
Multi-bit Watermarks
Beyond simple presence detection, advanced systems can embed payload information—identifying the specific model, user, or timestamp:
$$\text{Payload} = f(\text{text}, \text{key}) \rightarrow \{\text{model\_id}, \text{timestamp}, \text{user\_hash}\}$$

This enables fine-grained attribution but requires longer texts for reliable decoding.
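One toy way to realize a multi-bit payload, purely for intuition (this is not SynthID's actual scheme): give every (bit position, bit value) pair its own keyed partition, bias generation toward the partition matching the true bit, and decode by testing which hypothesis accumulates more green hits.

```python
import hashlib

def bit_green(key: str, bit_index: int, bit: int, token: int) -> bool:
    # Each (payload bit index, bit value) pair gets its own keyed partition;
    # a token is "green" under that hypothesis with probability 1/2.
    h = hashlib.sha256(f"{key}|{bit_index}|{bit}|{token}".encode()).digest()
    return h[0] < 128

def decode_bit(tokens: list, key: str, bit_index: int) -> int:
    # The embedder biases generation toward the partition that matches the
    # true bit value; the decoder picks the hypothesis with more green hits.
    hits0 = sum(bit_green(key, bit_index, 0, t) for t in tokens)
    hits1 = sum(bit_green(key, bit_index, 1, t) for t in tokens)
    return int(hits1 > hits0)
```

Each additional payload bit dilutes the per-bit evidence, which is why reliable decoding demands longer texts than simple presence detection.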
Current Limitations and Open Problems
1. The Short-Text Problem
Detection requires statistical power. For texts under 50-100 tokens, there is too little evidence to separate watermarked from unwatermarked text at acceptable false positive rates. This limits applicability to:
- Social media posts
- Chat messages
- Short-form content
2. The Quality-Robustness Trade-off
Stronger watermarks produce more detectable signals but risk:
- Unnatural text patterns
- Reduced generation diversity
- Noticeable stylistic artifacts
Finding the optimal balance remains an open research problem.
3. Adversarial Adaptation
As watermarking becomes widespread, adversaries will develop targeted evasion techniques. The arms race between watermark design and attack sophistication mirrors other security domains.
4. Standardization Gaps
Currently, no universal standard exists for:
- Watermark formats
- Detection protocols
- Key management practices
- Interoperability between providers
The lack of standardization fragments the ecosystem and complicates compliance.
The Path Forward
LLM watermarking represents a fascinating intersection of cryptography, statistics, and natural language processing. While not a panacea, it provides a crucial tool for AI transparency when deployed thoughtfully.
The most practical approach combines multiple defenses:
- Watermarking at generation time for initial attribution
- Classifier-based detection as a fallback for unwatermarked content
- Metadata logging for enterprise deployments
- Hybrid detection combining statistical and neural methods
As the EU AI Act comes into force and public concern about AI-generated content grows, watermarking will become standard infrastructure for responsible AI deployment. The technical challenges are significant but not insurmountable—and the mathematics behind these systems reveals elegant solutions to an increasingly pressing problem.
Key References:
- Kirchenbauer et al. (2023). “A Watermark for Large Language Models.” ICML 2023.
- Dathathri et al. (2024). “Scalable watermarking for identifying large language model outputs.” Nature 634.
- Liang et al. (2025). “Watermark under Fire: A Robustness Evaluation of LLM Watermarking.” EMNLP 2025 Findings.
- EU AI Act, Article 50 (2024).