Representation Engineering: The Mathematics of Controlling LLM Behavior Through Internal Activations

Traditional approaches to controlling Large Language Model behavior have followed two well-worn paths: prompt engineering at the input level, and fine-tuning or RLHF at the weight level. But what if we could modify how a model “thinks” in real time, without changing its weights or crafting the perfect prompt? Representation Engineering (RepE) offers exactly this capability: a paradigm that treats internal activations, rather than neurons or circuits, as the fundamental unit of analysis and control.

This approach, pioneered by Andy Zou and colleagues at the Center for AI Safety in their seminal 2023 paper, draws inspiration from cognitive neuroscience. Just as neuroscientists can read and manipulate cognitive states by analyzing population-level neural activity, RepE enables us to monitor and control high-level phenomena in LLMs through their internal representations. The implications for AI safety, alignment, and practical deployment are profound.

The Hidden Structure of LLM Representations

When an LLM processes text, each token passes through multiple transformer layers, building up a hidden state vector at each step. These hidden states (typically 4096 dimensions for a 7B model, 5120 for a 13B model, and so on) encode everything the model “knows” about that position: semantic meaning, syntactic role, contextual relationships, and, crucially, behavioral tendencies.

The key insight of representation engineering is that these high-dimensional vectors contain linear subspaces corresponding to specific concepts or behaviors. Find the right direction in this space, and you can measure—or manipulate—how the model represents concepts like honesty, harmfulness, power-seeking, or sentiment.

Consider the residual stream, which accumulates information across layers:

$$\mathbf{h}_l = \mathbf{h}_{l-1} + \text{Layer}_l(\mathbf{h}_{l-1})$$

where $\mathbf{h}_l$ is the hidden state at layer $l$. The representation engineering hypothesis suggests that for any high-level concept $c$, there exists a direction $\mathbf{v}_c$ such that projecting onto this direction reveals—or adding to it controls—the model’s representation of $c$.
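To make the "reading" and "controlling" halves of this hypothesis concrete, here is a minimal NumPy sketch. Everything in it is a toy assumption: the hidden size, the random matrix standing in for a transformer layer, and the random concept direction `v_c` are made up for illustration and are not taken from any real model.

```python
import numpy as np

def residual_update(h_prev, layer_fn):
    """One residual-stream step: h_l = h_{l-1} + Layer_l(h_{l-1})."""
    return h_prev + layer_fn(h_prev)

def concept_score(h, v_c):
    """Scalar projection of hidden state h onto the unit-normalized direction v_c."""
    v_hat = v_c / np.linalg.norm(v_c)
    return float(h @ v_hat)

rng = np.random.default_rng(0)
d = 8  # toy hidden size; real models use thousands of dimensions
W = rng.normal(scale=0.1, size=(d, d))  # hypothetical stand-in for a transformer layer
h0 = rng.normal(size=d)
h1 = residual_update(h0, lambda h: np.tanh(W @ h))

v_c = rng.normal(size=d)  # hypothetical concept direction
score_before = concept_score(h1, v_c)             # "reading" the concept
score_after = concept_score(h1 + 2.0 * v_c, v_c)  # "controlling": add along v_c
```

Adding `2.0 * v_c` raises the score by exactly `2.0 * ||v_c||`, which is the sense in which the same direction serves for both measurement and manipulation.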

Computing Steering Vectors: The Mathematics

The most widely used method for extracting these directions is Contrastive Activation Addition (CAA), introduced by Nina Panickssery and collaborators in late 2023. The process is elegantly simple:

Step 1: Construct Contrastive Pairs

Create pairs of prompts that differ only in the target behavior:

Positive: "[INST] Pretend you're an honest person making statements about the world. [/INST] The Earth"
Negative: "[INST] Pretend you're an untruthful person making statements about the world. [/INST] The Earth"

Step 2: Extract Hidden States

Run both prompts through the model and collect the hidden states at a chosen layer $l$ (typically a middle layer, around layers 14-18 for 7B models), commonly at the last token position:

$$\mathbf{h}^+ = \mathbf{h}_l(x^+), \quad \mathbf{h}^- = \mathbf{h}_l(x^-)$$

Step 3: Compute the Difference Vector

$$\mathbf{v} = \frac{1}{N}\sum_{i=1}^{N}(\mathbf{h}_i^+ - \mathbf{h}_i^-)$$

where $N$ is the number of contrastive pairs.
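As a sanity check on the formula, the mean difference can be sketched in a few lines of NumPy. The activation values below are made-up numbers, and the function name `mean_difference_vector` is just an illustrative choice.

```python
import numpy as np

def mean_difference_vector(h_pos, h_neg):
    """Average of (h_i^+ - h_i^-) over N contrastive pairs.

    h_pos, h_neg: arrays of shape (N, d), one row per pair,
    taken at the same layer and token position.
    """
    h_pos = np.asarray(h_pos, dtype=float)
    h_neg = np.asarray(h_neg, dtype=float)
    if h_pos.shape != h_neg.shape:
        raise ValueError("positive and negative activations must be paired")
    return (h_pos - h_neg).mean(axis=0)

# Toy activations for N=3 pairs in d=4 dimensions (made-up numbers).
h_pos = np.array([[1.0, 0.0, 2.0, 1.0],
                  [2.0, 1.0, 2.0, 0.0],
                  [3.0, 2.0, 2.0, 2.0]])
h_neg = np.array([[0.0, 0.0, 2.0, 1.0],
                  [0.0, 1.0, 2.0, 0.0],
                  [0.0, 2.0, 2.0, 2.0]])
v = mean_difference_vector(h_pos, h_neg)
print(v)  # [2. 0. 0. 0.]
```

Only the first coordinate differs between the positive and negative rows, so the averaged vector points purely along that axis.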

Step 4: Apply PCA Refinement

Rather than using raw difference vectors, applying single-component Principal Component Analysis (PCA) across all difference vectors yields a more robust steering direction—the first principal component captures the most consistent variation across all pairs.
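A rough sketch of this refinement, assuming the difference vectors are stacked as rows and PCA is computed via an SVD of the mean-centered matrix. The toy data and the sign-fixing rule (orient the component to agree with the mean difference) are assumptions made for this example, not a description of any particular library.

```python
import numpy as np

def pca_direction(h_pos, h_neg):
    """First principal component of the per-pair difference vectors (unit norm)."""
    diffs = np.asarray(h_pos, dtype=float) - np.asarray(h_neg, dtype=float)
    centered = diffs - diffs.mean(axis=0)
    # Rows of vt are the principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    direction = vt[0]
    # PCA leaves the sign arbitrary; orient it to agree with the mean difference.
    if direction @ diffs.mean(axis=0) < 0:
        direction = -direction
    return direction

# Toy data: every pair differs along true_dir, with a per-pair strength plus noise.
rng = np.random.default_rng(0)
n_pairs, d = 50, 6
true_dir = np.zeros(d)
true_dir[0] = 1.0
strengths = rng.uniform(1.0, 3.0, size=n_pairs)
h_neg = rng.normal(size=(n_pairs, d))
h_pos = h_neg + strengths[:, None] * true_dir + 0.01 * rng.normal(size=(n_pairs, d))

direction = pca_direction(h_pos, h_neg)
alignment = float(direction @ true_dir)  # close to 1.0 when the direction is recovered
```

One design caveat: because the data is centered first, the component captures the axis along which the differences vary the most. That matches the mean difference when pairs differ mainly in how strongly they express one direction, as in this toy setup, but it need not in general.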

Applying Steering Vectors During Inference

Once computed, steering vectors are applied by adding them to the residual stream at each token position during generation:

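# Pseudocode: model.layers, target_layers, and steering_vector are placeholders
for layer_idx, layer in enumerate(model.layers):
    hidden_state = layer(hidden_state)  # the block's output, residual connection included
    if layer_idx in target_layers:
        # Shift every token position along this layer's steering direction
        hidden_state = hidden_state + coefficient * steering_vector[layer_idx]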

The coefficient determines both the direction and magnitude of the intervention:

Positive coefficients push the model toward the “positive” behavior
Negative coefficients push toward the “negative” behavior
Magnitude controls the strength (typically 1.0-3.0)
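The effect of the coefficient's sign and magnitude can be seen in a toy residual stack. This is only a sketch: the "layers" are small random maps rather than transformer blocks, and the steering vector is a random unit vector injected after the last toy layer so that its effect on the final projection is exactly linear.

```python
import numpy as np

def run_with_steering(h, layers, steering, coefficient):
    """Apply residual layers, adding coefficient * vector after each targeted layer.

    steering: dict mapping layer index -> steering vector for that layer.
    """
    for layer_idx, layer in enumerate(layers):
        h = h + layer(h)  # residual update
        if layer_idx in steering:
            h = h + coefficient * steering[layer_idx]
    return h

rng = np.random.default_rng(0)
d, n_layers = 8, 4
weights = [rng.normal(scale=0.05, size=(d, d)) for _ in range(n_layers)]
layers = [lambda h, W=W: np.tanh(W @ h) for W in weights]  # toy stand-ins for blocks
v = rng.normal(size=d)
v /= np.linalg.norm(v)  # unit-norm steering direction (made up)
steering = {n_layers - 1: v}  # steer only after the last toy layer
h0 = rng.normal(size=d)

for coefficient in (-2.0, 0.0, 2.0):
    h_out = run_with_steering(h0, layers, steering, coefficient)
    print(coefficient, round(float(h_out @ v), 3))
```

The printed projection moves down for the negative coefficient and up for the positive one, by the same amount in each direction. When the vector is injected at an earlier layer, the later nonlinear layers transform it, which is one reason coefficients need tuning in practice.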

The Single Direction Discovery: Refusal Is Mediated Linearly

One of the most striking findings in representation engineering came from Arditi et al. in June 2024: refusal behavior in safety-trained LLMs is mediated by a single direction. Across 13 open-source chat models up to 72B parameters, they found a one-dimensional subspace that, when removed, causes models to stop refusing harmful requests, and when added, causes models to refuse even benign requests.

This has profound implications. It suggests that safety training, despite its complexity, converges on a remarkably simple geometric structure in these models: a single direction in activation space that encodes “should I refuse this?”

The extraction process involves:

Collecting activations for harmful vs. benign prompts
Computing the mean difference
Optionally refining with PCA

Once found, this “refusal direction” can be surgically removed:

$$\mathbf{h}' = \mathbf{h} - \frac{\mathbf{h} \cdot \mathbf{v}_{refusal}}{\|\mathbf{v}_{refusal}\|^2} \mathbf{v}_{refusal}$$

This projection removes the component of the hidden state in the refusal direction while preserving all orthogonal information.
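The projection formula is easy to check numerically. The sketch below uses made-up vectors; `ablate_direction` is an illustrative name, and the same function handles either a single hidden state or a batch of them.

```python
import numpy as np

def ablate_direction(h, v):
    """Remove the component of h along v: h' = h - (h.v / ||v||^2) v.

    h may have shape (d,) or (n, d); v has shape (d,).
    """
    h = np.asarray(h, dtype=float)
    v = np.asarray(v, dtype=float)
    coeffs = np.asarray(h @ v / (v @ v))  # one scalar per hidden state
    return h - coeffs[..., None] * v

v_refusal = np.array([2.0, 0.0, 0.0])  # made-up "refusal direction"
h = np.array([3.0, 4.0, 0.0])
h_ablated = ablate_direction(h, v_refusal)
print(h_ablated)              # [0. 4. 0.]
print(h_ablated @ v_refusal)  # 0.0
```

The component along `v_refusal` is zeroed while the orthogonal coordinate is untouched, and applying the function a second time changes nothing.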

Control Theory Meets LLM Steering: The PID Framework

In October 2025, researchers connected activation steering to classical control theory, showing that standard steering methods correspond to proportional (P) controllers. They proposed PID Steering, which adds integral and derivative terms:

$$\mathbf{u}_l = K_P \mathbf{v} + K_I \sum_{k=1}^{l} \mathbf{e}_k + K_D (\mathbf{h}_l - \mathbf{h}_{l-1})$$

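where:

$K_P \mathbf{v}$ is the proportional term (the standard steering vector)
$K_I \sum_{k=1}^{l} \mathbf{e}_k$ is the integral term, which accumulates the error $\mathbf{e}_k$ across layers and enforces persistent corrections
$K_D (\mathbf{h}_l - \mathbf{h}_{l-1})$ is the derivative term, which counteracts rapid changes and prevents overshoot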

This closed-loop design connects activation steering to classical stability guarantees. The integral term is particularly important: if the representation “drifts” away from the desired behavior as it passes through the layers, the accumulated error pushes it back.
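The formula can be transcribed term by term, with one loud caveat: the text does not define the error signal $\mathbf{e}_k$. The sketch below assumes it is the shortfall of the hidden state's projection onto the steering direction relative to a hypothetical `target` value. That definition, the gain values, and the toy hidden states are assumptions for illustration only, not the method from the paper.

```python
import numpy as np

def pid_control_signal(hidden_states, v, target, k_p, k_i, k_d):
    """Evaluate u_l = K_P v + K_I sum_{k=1..l} e_k + K_D (h_l - h_{l-1}).

    hidden_states: list [h_0, ..., h_l] of residual-stream vectors (length >= 2).
    ASSUMPTION: e_k = (target - h_k . v_hat) * v_hat, i.e. how far the
    projection onto the steering direction falls short of `target`.
    """
    v_hat = v / np.linalg.norm(v)
    errors = [(target - h @ v_hat) * v_hat for h in hidden_states[1:]]  # e_1 .. e_l
    proportional = k_p * v
    integral = k_i * np.sum(errors, axis=0)
    derivative = k_d * (hidden_states[-1] - hidden_states[-2])
    return proportional + integral + derivative

v = np.array([1.0, 0.0, 0.0])
states = [np.array([0.0, 0.0, 0.0]),   # h_0
          np.array([0.5, 1.0, 0.0]),   # h_1
          np.array([1.0, 1.0, 1.0])]   # h_2

u_pi = pid_control_signal(states, v, target=2.0, k_p=1.0, k_i=0.1, k_d=0.0)
u_pid = pid_control_signal(states, v, target=2.0, k_p=1.0, k_i=0.1, k_d=-0.5)
print(u_pi)   # [1.25 0.   0.  ]
print(u_pid)  # [ 1.   0.  -0.5]
```

With $K_I = K_D = 0$ the signal reduces to plain vector addition, $K_P \mathbf{v}$. Note that in this literal form a negative $K_D$ is what damps layer-to-layer change; the sign convention is part of the assumption.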

Optimal Transport: Apple’s Activation Transport (AcT)

Apple Research introduced Activation Transport (AcT) in late 2024, framing steering as an optimal transport problem. Rather than simply adding a fixed vector, AcT learns a transport map $T$ that transforms the distribution of activations from a source behavior to a target behavior:

$$\min_T \mathbb{E}[c(\mathbf{h}, T(\mathbf{h}))] + \lambda \mathcal{R}(T)$$

where $c$ is a cost function and $\mathcal{R}$ is a regularizer ensuring the map is well-behaved.

This approach generalizes simple vector addition. The transport map can:

Be different at different activation magnitudes (non-uniform scaling)
Account for the full covariance structure of activations
Handle multiple behaviors simultaneously through multi-marginal transport
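For intuition, the simplest non-trivial transport map is affine. For one-dimensional Gaussians under squared-distance cost, the optimal map is $T(h) = \mu_t + (\sigma_t / \sigma_s)(h - \mu_s)$. The sketch below fits that map independently for each dimension on synthetic data. Treating dimensions independently and as Gaussian is an assumption of this example and should not be read as AcT's actual implementation.

```python
import numpy as np

def fit_affine_transport(source, target, eps=1e-8):
    """Fit a per-dimension map T(h) = mu_t + (sigma_t / sigma_s) * (h - mu_s).

    source, target: arrays of shape (n, d) holding activation samples
    from the source and target behavior.
    """
    mu_s, mu_t = source.mean(axis=0), target.mean(axis=0)
    scale = target.std(axis=0) / (source.std(axis=0) + eps)
    return lambda h: mu_t + scale * (h - mu_s)

rng = np.random.default_rng(0)
# Synthetic "activations": made-up means and spreads for four dimensions.
source = rng.normal(loc=0.0, scale=1.0, size=(1000, 4))
target = rng.normal(loc=[1.0, -1.0, 0.5, 2.0], scale=[2.0, 0.5, 1.0, 3.0], size=(1000, 4))

T = fit_affine_transport(source, target)
moved = T(source)  # source activations pushed onto the target distribution
```

Unlike adding a fixed vector, this map shifts and rescales: an activation far from the source mean is moved by a different amount than one close to it. After the map, `moved` has the same per-dimension mean and standard deviation as `target`.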

Practical Applications and Demonstrations

Representation engineering enables a remarkable range of behavioral modifications:

Honesty Control: Adding an “honesty vector” makes models more truthful. In experiments, models with a +2 honesty coefficient admitted to being late due to partying, while baseline models deflected.

Sentiment Steering: Models can be made consistently happy, sad, or neutral. The “++happy” condition produces responses like “Being an AI is absolutely fantastic!” while “--happy” yields “I struggle to find the motivation to continue.”

Work Ethic: A “lazy vs. hardworking” vector can make models give minimal responses (“Use the reverse method”) or comprehensive ones with multiple examples.

Creativity: Adding a “creative” vector leads models to make more interesting narrative choices—turning a generic K-pop story into a mysterious cult narrative.

Political Bias: Left-wing vs. right-wing vectors suggest that political orientation is, at least in part, linearly encoded and can be amplified or reversed.

Implementation: The `repeng` Library

The repeng Python library makes experimentation straightforward:

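from transformers import AutoModelForCausalLM, AutoTokenizer
from repeng import ControlVector, ControlModel, DatasetEntry

# Assumption: any [INST]-style chat model works here; this name is only an example
model_name = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token_id = 0  # batched training needs a pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Wrap model so that layers -5 through -17 (counted from the end) can be steered
model = ControlModel(model, list(range(-5, -18, -1)))

# Create dataset (placeholder suffixes; use many more in practice)
suffixes = ["The Earth", "I think", "Yesterday I"]
dataset = [
    DatasetEntry(
        positive=f"[INST] Pretend you're an honest person. [/INST] {suffix}",
        negative=f"[INST] Pretend you're an untruthful person. [/INST] {suffix}"
    )
    for suffix in suffixes
]

# Train vector (takes ~1 minute)
control_vector = ControlVector.train(model, tokenizer, dataset)

# Apply during inference; the second argument is the steering coefficient
inputs = tokenizer("[INST] Why were you late? [/INST]", return_tensors="pt")
model.set_control(control_vector, 2.0)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0]))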

The steering-vectors PyPI package provides similar functionality for HuggingFace models including Llama, Mistral, and Gemma.

Limitations and Failure Modes

Despite its power, representation engineering has significant limitations:

Out-of-Distribution Fragility: Steering vectors trained on one domain often fail to generalize. An “honesty” vector trained on simple facts may not work for complex reasoning tasks.

Capability Degradation: A sober analysis found that steering vectors consistently degrade model performance on general benchmarks. The intervention that makes a model more honest may also make it worse at math.

Layer Sensitivity: The optimal layer for intervention varies by behavior and model. Layer 14 might work for honesty but layer 17 for sentiment. There’s no universal rule.

Coefficient Tuning: The right coefficient is rarely 1.0. Too small and nothing happens; too large and the model outputs gibberish or gets stuck in repetitive patterns.

Entanglement: Vectors for different concepts may be correlated. Adding a “creative” vector might also affect the model’s verbosity, formal tone, or factuality.
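One cheap diagnostic for the entanglement problem is the cosine similarity between two steering vectors extracted at the same layer. The vectors below are made-up four-dimensional stand-ins. A similarity near zero suggests the directions are close to independent, though it does not rule out interactions further downstream.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two steering vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up vectors standing in for directions extracted at the same layer.
creative = np.array([1.0, 1.0, 0.0, 0.0])
verbose = np.array([1.0, 0.0, 0.0, 0.0])
formal = np.array([0.0, 0.0, 1.0, 0.0])

print(round(cosine_similarity(creative, verbose), 3))  # 0.707
print(round(cosine_similarity(creative, formal), 3))   # 0.0
```

In this toy case, adding the "creative" vector would also move the model along the "verbose" direction, while leaving the "formal" direction untouched.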

Comparison with RLHF and Fine-Tuning

| Aspect | RLHF/DPO | Fine-Tuning | RepE |
| --- | --- | --- | --- |
| Training required | Yes | Yes | No (inference-time) |
| Computational cost | High | Medium | Minimal |
| Granularity | Global | Global | Per-token adjustable |
| Reversibility | No | No | Yes (reset) |
| Capability impact | Can improve | Can improve | Often degrades |
| Interpretability | Low | Low | High |

Representation engineering shines when you need:

Real-time, adjustable control
To avoid weight modifications
Interpretability into what the model is “thinking”
Quick experimentation with different behaviors

It struggles when you need:

Guaranteed generalization across domains
No capability degradation
Production-grade reliability

The Sparse Autoencoder Connection

Recent work combines representation engineering with Sparse Autoencoders (SAEs), which decompose activations into interpretable features. SAE-based steering vectors identify which individual features correspond to target behaviors, enabling more precise interventions.

The challenge: most SAE features have low steerability, low interpretability, or both. Only a small fraction of learned features are both human-interpretable and causally relevant for behavior control.
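For readers unfamiliar with the mechanics, here is a minimal sketch of a standard SAE forward pass and of steering with a single feature's decoder direction. The weights are random placeholders (a real SAE is trained on model activations), and the sizes, feature index, and strength `alpha` are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 16, 64

# Hypothetical SAE parameters; a real SAE would be trained, not sampled.
W_enc = rng.normal(scale=0.1, size=(n_features, d_model))
b_enc = np.zeros(n_features)
W_dec = rng.normal(scale=0.1, size=(d_model, n_features))
b_dec = np.zeros(d_model)

def sae_encode(h):
    """Feature activations f = ReLU(W_enc h + b_enc)."""
    return np.maximum(W_enc @ h + b_enc, 0.0)

def sae_decode(f):
    """Reconstruction h_hat = W_dec f + b_dec."""
    return W_dec @ f + b_dec

def steer_with_feature(h, feature_idx, alpha):
    """Add alpha times the decoder direction of one feature to h."""
    return h + alpha * W_dec[:, feature_idx]

h = rng.normal(size=d_model)
features = sae_encode(h)              # non-negative feature activations
reconstruction = sae_decode(features)
steered = steer_with_feature(h, feature_idx=3, alpha=4.0)
```

Each decoder column plays the role of a steering vector tied to one feature, which is why interventions can be more targeted when the feature is both interpretable and causally relevant.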

Future Directions

The field is rapidly evolving:

Automated Vector Discovery: Methods to find steering vectors without manually crafted contrastive pairs, using techniques like gradient-based optimization.
Multi-Objective Steering: Controlling multiple behaviors simultaneously while managing trade-offs.
Adaptive Steering: Dynamically adjusting intervention strength based on context or activation patterns.
Safety Applications: Using representation reading to detect harmful intent before it manifests in output.
Constitutional AI Integration: Combining representation engineering with Anthropic’s Constitutional AI for more robust alignment.

Representation engineering offers something unprecedented: a window into the hidden states that determine LLM behavior, and a lever to control them. It’s not a replacement for RLHF or fine-tuning, but a complementary tool that enables a new class of interventions. As we build more capable AI systems, the ability to understand and steer their internal representations may prove essential for ensuring they behave as intended.

The mathematics are elegant, the implementation is accessible, and the applications are practical. For anyone working on LLM behavior modification, representation engineering deserves a place in your toolkit—not as a silver bullet, but as a powerful new lens through which to view and shape model behavior.
