
Interpreting and Controlling LLMs

Large Language Models (LLMs) often exhibit behaviors that appear surprising or emergent. While these behaviors have been praised for enabling complex tasks such as mathematical reasoning, they also raise safety and reliability concerns. Such behaviors are difficult to anticipate or control without a clear understanding of how models represent and process information internally. Our work addresses this by developing a framework for making LLMs interpretable and controllable.

Publications