Representation Engineering: The Mathematics of Controlling LLM Behavior Through Internal Activations

Traditional approaches to controlling Large Language Model behavior have followed two well-worn paths: prompt engineering at the input level, and fine-tuning or RLHF at the weight level. But what if we could modify how a model "thinks" in real time, without changing its weights or crafting the perfect prompt? Representation Engineering (RepE) offers exactly this capability: a paradigm that treats internal activations, rather than individual neurons or circuits, as the fundamental unit of analysis and control.
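To make the idea concrete, here is a minimal sketch of activation steering on a toy two-layer linear model: we intercept the hidden activation during the forward pass and add a scaled steering vector, leaving the weights untouched. All names here (`forward`, `steering_vector`, `alpha`) are illustrative, not from any particular RepE library.

```python
# Toy activation steering: shift a hidden representation at inference
# time without modifying any weights. Pure Python, no ML framework.

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def forward(x, W1, W2, steering_vector=None, alpha=0.0):
    """Two-layer linear model; optionally steer the hidden activation."""
    h = matvec(W1, x)  # hidden activation -- the "representation"
    if steering_vector is not None:
        # Add alpha * steering_vector to the activation, not the weights.
        h = [hi + alpha * si for hi, si in zip(h, steering_vector)]
    return matvec(W2, h)

W1 = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical weights
W2 = [[1.0, 1.0]]
x = [2.0, 3.0]

baseline = forward(x, W1, W2)                                    # [5.0]
steered = forward(x, W1, W2, steering_vector=[1.0, 0.0], alpha=0.5)  # [5.5]
```

The key property, which carries over to real transformers via forward hooks, is that `W1` and `W2` are identical in both calls; only the intermediate activation differs.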
