
Interpreting and Controlling LLMs

Large Language Models (LLMs) often exhibit behaviors that appear surprising or emergent. While these behaviors have been praised for enabling complex tasks such as mathematical reasoning, they also raise safety and reliability concerns. Such behaviors are difficult to anticipate or control without a clear understanding of how models represent and process information internally. Our work addresses this by developing a framework for making LLMs interpretable and controllable.

Publications