On January 20, 2025, DeepSeek released R1, a 671B-parameter Mixture-of-Experts model that achieved something remarkable: matching OpenAI’s o1 on reasoning benchmarks with openly released weights. The breakthrough wasn’t just in scale or architecture, but in a fundamentally different approach to training reasoning capabilities: Group Relative Policy Optimization (GRPO), a reinforcement learning algorithm that eliminates the critic model PPO requires and, paired with verifiable rewards, lets sophisticated reasoning behaviors emerge naturally.

The Problem with Traditional LLM Training

Standard large language models excel at pattern matching and next-token prediction, but struggle with tasks requiring multi-step logical deduction, self-correction, and complex problem decomposition. Chain-of-thought prompting helped, but it required extensive human-annotated demonstrations and still couldn’t match the systematic reasoning humans employ.

The challenge lies in how we train these models. Traditional approaches use:

  • Supervised Fine-Tuning (SFT): Models learn to mimic human reasoning traces, but this limits them to patterns seen in training data
  • Reinforcement Learning from Human Feedback (RLHF): Uses Proximal Policy Optimization (PPO) with a learned reward model, but reward models can be “hacked” and require separate training pipelines

Neither approach incentivizes the model to discover reasoning strategies independently. The model learns to produce outputs that look like good reasoning, rather than developing genuine problem-solving capabilities.

From PPO to GRPO: A Paradigm Shift

Proximal Policy Optimization (PPO) has been the workhorse of LLM reinforcement learning since ChatGPT. But PPO carries significant complexity:

flowchart TB
    subgraph PPO["PPO Architecture"]
        A1[Policy Model] --> A2[Generate Completion]
        A2 --> A3[Reward Model]
        A2 --> A4[Critic/Value Model]
        A3 --> A5[Reward Signal]
        A4 --> A6[Value Estimates]
        A5 --> A7[Advantage Calculation]
        A6 --> A7
        A7 --> A8[Policy Update]
    end
    
    subgraph GRPO["GRPO Architecture"]
        B1[Policy Model] --> B2[Generate N Completions]
        B2 --> B3[Verifiable Rewards]
        B3 --> B4[Group Statistics]
        B4 --> B5[Relative Advantage]
        B5 --> B6[Policy Update]
    end

PPO requires four models during training:

  1. Policy (the LLM being trained)
  2. Critic/Value Model (estimates expected rewards per token)
  3. Reference Model (for KL divergence penalty)
  4. Reward Model (scores completions)

The critic is particularly problematic. It must be trained alongside the policy using Generalized Advantage Estimation (GAE), consuming significant memory and compute. For LLMs using outcome-based rewards (only the final answer is scored), estimating per-token value from sequence-level rewards is fundamentally under-constrained.

GRPO asks a radical question: Do we actually need the critic?

The GRPO Algorithm: Mathematics and Intuition

GRPO’s core insight is elegantly simple: instead of training a critic to estimate baseline values, compute the baseline from multiple completions of the same prompt.

The Algorithm

For each training step:

  1. Sample a prompt from the dataset
  2. Generate $G$ completions $\{o_1, o_2, ..., o_G\}$ using the current policy
  3. Compute rewards $\{r_1, r_2, ..., r_G\}$ for each completion
  4. Calculate group-relative advantages:
$$\hat{A}_{i,t} = \frac{r_i - \text{mean}(\mathbf{r})}{\text{std}(\mathbf{r})}$$
  5. Update policy using the GRPO objective:
$$\mathcal{J}_{GRPO}(\theta) = \mathbb{E}\left[\frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min\left(\rho_{i,t}(\theta) \hat{A}_{i,t}, \text{clip}(\rho_{i,t}(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_{i,t}\right) - \beta \mathbb{D}_{KL}(\pi_\theta || \pi_{ref})\right]$$

Where:

  • $\rho_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t} \mid q, o_{i,<t})}$ is the importance ratio between the current and old policies
  • $\epsilon$ is the clipping parameter (typically 0.2)
  • $\beta$ is the KL penalty coefficient
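The clipped surrogate can be sketched per token in plain Python. This is an illustrative sketch, not DeepSeek’s implementation: `logp_new` and `logp_old` stand in for the token’s log-probability under the current and old policies, and the KL penalty term is omitted for brevity.

```python
import math

def grpo_token_loss(logp_new: float, logp_old: float,
                    advantage: float, eps: float = 0.2) -> float:
    """Clipped surrogate loss for one token (KL term omitted)."""
    rho = math.exp(logp_new - logp_old)  # importance ratio
    unclipped = rho * advantage
    clipped = max(min(rho, 1.0 + eps), 1.0 - eps) * advantage
    # The objective is maximized, so the loss is the negated min
    return -min(unclipped, clipped)
```

When the ratio drifts outside $[1-\epsilon, 1+\epsilon]$, the clipped branch caps the update magnitude, exactly as in PPO.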

Why This Works

The group statistics serve as a dynamic baseline. Completions with above-average rewards receive positive advantages, while below-average ones receive negative advantages. This achieves variance reduction similar to a learned critic, but without the overhead of training a value model.

# Simplified GRPO advantage calculation
def compute_grpo_advantage(rewards: list[float]) -> list[float]:
    """Compute group-relative advantages from rewards."""
    mean_r = sum(rewards) / len(rewards)
    std_r = (sum((r - mean_r) ** 2 for r in rewards) / len(rewards)) ** 0.5
    
    # Normalize to zero mean, unit variance
    advantages = [(r - mean_r) / (std_r + 1e-8) for r in rewards]
    return advantages

# Example: 4 completions with rewards [1.0, 0.0, 0.5, 0.8]
# Advantages: approximately [1.13, -1.53, -0.20, 0.60]
# The best completion gets the highest positive advantage

Memory and Compute Efficiency

GRPO reduces memory overhead by ~50% compared to PPO:

| Component | PPO | GRPO |
|---|---|---|
| Policy Model | ✓ | ✓ |
| Critic/Value Model | ✓ | — |
| Reference Model | ✓ | ✓ |
| Reward Model | ✓ | — (verifier for verifiable tasks) |

For verifiable domains (math, code), GRPO uses deterministic verification instead of a learned reward model, eliminating another source of potential reward hacking.

RLVR: Reinforcement Learning with Verifiable Rewards

GRPO is typically paired with Reinforcement Learning with Verifiable Rewards (RLVR), which replaces learned reward models with deterministic verifiers.

How RLVR Works

For mathematical reasoning:

def verify_math_answer(model_output: str, ground_truth: str) -> float:
    """
    Verify if model's answer matches ground truth.
    Uses string matching or symbolic comparison.
    """
    extracted_answer = extract_final_answer(model_output)
    
    # Direct match
    if extracted_answer == ground_truth:
        return 1.0
    
    # Numeric comparison with tolerance
    try:
        if abs(float(extracted_answer) - float(ground_truth)) < 1e-6:
            return 1.0
    except ValueError:
        pass
    
    return 0.0
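The `extract_final_answer` helper above is left undefined. A minimal sketch, assuming answers are marked with `\boxed{...}` (a common convention in math RL setups), with a plain-number fallback; the heuristic is illustrative, not the paper’s parser:

```python
import re

def extract_final_answer(model_output: str) -> str:
    """Pull the final answer from a completion (illustrative heuristic)."""
    # Prefer the last \boxed{...} expression, if present
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", model_output)
    if boxed:
        return boxed[-1].strip()
    # Fall back to the last number in the text
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
    return numbers[-1] if numbers else ""
```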

For coding tasks:

def verify_code_solution(code: str, test_cases: list) -> float:
    """Execute code against test cases."""
    passed = 0
    for test_input, expected_output in test_cases:
        try:
            result = execute_in_sandbox(code, test_input)
            if result == expected_output:
                passed += 1
        except Exception:
            pass
    
    return passed / len(test_cases)

Why RLVR > RLHF for Reasoning

| Aspect | RLHF | RLVR |
|---|---|---|
| Reward Source | Learned model | Deterministic verifier |
| Reward Hacking Risk | High | Minimal |
| Training Scale | Limited (collapse risk) | Large-scale possible |
| Data Requirement | Preference pairs | Problems with answers |
| Domain | General alignment | Math, code, logic |

Verifiable rewards enable the extended training runs necessary for reasoning capabilities to emerge—runs that would cause reward model collapse with RLHF.

DeepSeek-R1: The Training Pipeline

DeepSeek-R1’s training consists of four stages, each building on the previous:

Stage 1: Cold Start Supervised Fine-Tuning

DeepSeek-R1-Zero (trained with pure RL from the base model) demonstrated that reasoning could emerge without supervision, but suffered from poor readability and language mixing. To address this:

  • Collect thousands of long chain-of-thought examples with human-readable reasoning
  • Fine-tune the base model (DeepSeek-V3) on this data
  • Purpose: Establish a readable output format before RL training

Stage 2: Reasoning-Oriented Reinforcement Learning

Apply GRPO with RLVR on reasoning-intensive domains:

  • Mathematics: GSM8K, MATH, competition problems
  • Coding: Competitive programming tasks with test cases
  • STEM: Physics, chemistry problems with verifiable answers

During this stage, remarkable behaviors emerge spontaneously:

<thinking>
Let me solve this step by step...
First, I'll compute the partial sum...

Wait, that approach won't work because the series diverges.

Let me reconsider. Actually, I should use the integral test here...

Hmm, but the integral doesn't converge easily. Let me try a different strategy.

Actually, I made an error in my substitution. Let me redo this...
</thinking>

The model learns to:

  • Self-reflect: “Wait, let me reconsider…”
  • Verify intermediate steps: Checking calculations mid-solution
  • Explore alternatives: “That approach won’t work, let me try…”
  • Allocate compute: Spending more tokens on harder problems
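The R1 recipe rewards both correctness and output format (reasoning enclosed in think tags). A minimal sketch of such a combined rule-based reward; the tag-matching rules and equal weighting here are illustrative assumptions, not the paper’s exact reward function:

```python
import re

def format_reward(output: str) -> float:
    """1.0 if reasoning is wrapped in <think>...</think>, else 0.0."""
    return 1.0 if re.search(r"<think>.+?</think>", output, re.DOTALL) else 0.0

def accuracy_reward(output: str, ground_truth: str) -> float:
    """1.0 if the text after </think> contains the expected answer."""
    answer_part = output.split("</think>")[-1]
    return 1.0 if ground_truth in answer_part else 0.0

def combined_reward(output: str, ground_truth: str) -> float:
    # Illustrative equal weighting of format and correctness
    return format_reward(output) + accuracy_reward(output, ground_truth)
```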

Stage 3: Rejection Sampling and Supervised Fine-Tuning

Generate multiple solutions per problem, keep only correct ones:

def rejection_sampling(policy, problems, n_samples=64):
    """Generate samples, filter by correctness, use for SFT."""
    high_quality_data = []
    
    for problem in problems:
        samples = [policy.generate(problem) for _ in range(n_samples)]
        
        for sample in samples:
            if verify_answer(sample, problem.answer):
                high_quality_data.append((problem, sample))
    
    return high_quality_data

This creates a high-quality dataset combining:

  • Reasoning trajectories from the RL-trained model
  • General domain data (writing, QA) to maintain broad capabilities

Stage 4: All-Task Reinforcement Learning

Final RL phase with:

  • Reasoning rewards (verifiable)
  • Human preference rewards (for non-verifiable tasks)
  • Safety and alignment constraints

The result: a model that excels at complex reasoning while maintaining general LLM capabilities.

The “Aha Moment”: Emergence of Meta-Cognition

One of the most fascinating findings from DeepSeek-R1-Zero’s training was the spontaneous emergence of sophisticated reasoning behaviors—without any explicit supervision.

Around step 8,000 of RL training, researchers observed the model begin using phrases like “Wait, let me reconsider” and “Actually, I should try a different approach”—behaviors never explicitly taught.

timeline
    title Emergence of Reasoning Behaviors
    section Early Training
        Step 1000 : Simple pattern matching
        Step 3000 : Basic chain-of-thought
    section Transition
        Step 8000 : "Aha Moment"
        : Self-reflection emerges
        : Error detection appears
    section Mature
        Step 15000 : Strategic exploration
        : Verification behaviors
        : Dynamic compute allocation

This emergence suggests that reasoning strategies are not learned behaviors but discovered solutions. When incentivized to produce correct answers, the model naturally gravitates toward strategies that improve success rates—self-correction being among the most effective.

Inference-Time Scaling: Beyond Training

Large Reasoning Models (LRMs) like DeepSeek-R1 support inference-time compute scaling—using more computation at generation time to improve output quality.

Strategies for Inference Scaling

1. Best-of-N with Verification

def best_of_n(model, prompt, verifier, n=64):
    """Generate N completions, return the best according to a verifier.

    `verifier` must score a single completion, e.g. a closure over the
    ground truth: lambda out: verify_math_answer(out, ground_truth).
    """
    candidates = [model.generate(prompt) for _ in range(n)]
    
    best_score = -1.0
    best_output = None
    
    for candidate in candidates:
        score = verifier(candidate)
        if score > best_score:
            best_score = score
            best_output = candidate
    
    return best_output
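Under the simplifying assumptions of independent samples and a perfect verifier, the expected benefit of best-of-N follows directly from the single-sample success rate:

```python
def best_of_n_success(p_single: float, n: int) -> float:
    """P(at least one of n independent samples is correct)."""
    return 1.0 - (1.0 - p_single) ** n
```

For example, a model that solves a problem 30% of the time per sample succeeds on roughly 99.7% of problems with 16 samples, provided the verifier reliably picks out the correct one.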

2. Majority Voting

from collections import Counter

def majority_vote(model, prompt, n=64):
    """Generate N completions, return the most common answer."""
    answers = [extract_answer(model.generate(prompt)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

3. Sequential Scaling (Extended Thinking)

def extended_thinking(model, prompt, max_thoughts=10):
    """Allow model to generate extended reasoning before answering."""
    thinking_prompt = f"{prompt}\n\nThink carefully before answering."
    
    # Model generates <think>...</think> block first
    return model.generate(thinking_prompt)

Performance vs. Compute Trade-off

| Strategy | Compute Multiplier | Accuracy Gain |
|---|---|---|
| Greedy | 1× | Baseline |
| Best-of-4 | 4× | +5-10% |
| Best-of-16 | 16× | +8-15% |
| Majority-64 | 64× | +12-20% |
| Extended Thinking | 2-10× | +15-25% |

Benchmark Results: Matching OpenAI o1

DeepSeek-R1 achieves performance comparable to OpenAI’s o1 across reasoning benchmarks:

| Benchmark | DeepSeek-R1 | OpenAI o1-1217 | Gap |
|---|---|---|---|
| AIME 2024 | 79.8% | 79.2% | +0.6% |
| MATH-500 | 97.3% | 97.6% | −0.3% |
| GPQA Diamond | 71.5% | 75.7% | −4.2% |
| Codeforces Rating | 2029 | 1891 | +138 |
| MMLU | 90.8% | 91.8% | −1.0% |

Key observations:

  • Mathematics: Near parity with o1
  • Coding: Actually outperforms o1 on Codeforces
  • General reasoning: Slight deficit on some benchmarks
  • Efficiency: Activates only 37B parameters per token (MoE)

The Distillation Breakthrough

Perhaps most impactful: DeepSeek demonstrated that reasoning capabilities can be distilled into smaller models. Using DeepSeek-R1’s outputs as training data:

| Model | AIME 2024 | MATH-500 |
|---|---|---|
| DeepSeek-R1-Distill-Qwen-32B | 72.6% | 94.3% |
| DeepSeek-R1-Distill-Qwen-7B | 55.5% | 92.8% |
| DeepSeek-R1-Distill-Llama-8B | 50.0% | 89.1% |

A 7B model trained on R1’s reasoning traces achieves 92.8% on MATH-500—performance that required 70B+ models just months prior.

Implications for Open-Source AI

DeepSeek-R1 represents several paradigm shifts:

1. Reasoning Without Human Demonstration

Previous approaches required extensive human-written reasoning traces. GRPO+RLVR shows that correct answer incentives alone are sufficient for reasoning emergence. This dramatically reduces the data bottleneck for training reasoning models.

2. Verifiable Domains as Training Grounds

Math and code aren’t just application domains—they’re training scaffolds. The techniques learned in these verifiable domains transfer to general reasoning tasks.

3. Distillation Democratizes Reasoning

The ability to distill reasoning capabilities into smaller models means:

  • Organizations can train reasoning models without massive compute
  • Fine-tuning on domain-specific problems becomes more accessible
  • Edge deployment of reasoning-capable models is feasible

4. Open Weights Accelerate Research

By releasing model weights, training details, and the GRPO algorithm, DeepSeek enabled rapid community iteration. Within weeks of R1’s release, multiple open-source reasoning models emerged building on similar techniques.

The Path Forward

GRPO isn’t the final word on LLM reasoning—it’s the beginning of a new research direction. Key open questions include:

  • Process supervision: Can we improve on outcome-only rewards?
  • Cross-domain transfer: How well do math-learned strategies apply to other domains?
  • Efficiency: Can we achieve similar results with fewer RL steps?
  • Safety: How do we ensure reasoning models don’t develop harmful strategies?

What GRPO definitively demonstrates is that the barrier to training reasoning models has collapsed. What once required proprietary datasets, massive compute clusters, and closed-source techniques can now be replicated with publicly available tools and open research.

The question is no longer whether open-source reasoning models can match proprietary ones—it’s how quickly the community will surpass them.