Reliability and Robustness of Foundational Models

The reliability and robustness of AI-powered applications depend not only on traditional software security methods but also on the security of the underlying AI models. In this area, we explore vulnerabilities of foundational models, such as jailbreaks, hallucinations, and unsafe code generation, and devise new defense mechanisms against them. With these methods, we hope to make AI-powered applications safer to use.

Publications

July 2025

Jiahui Geng, Thy Thy Tran, Preslav Nakov, Iryna Gurevych. ConInstruction: Universal Jailbreaking of Multimodal Large Language Models via Non-Textual Modalities. In ACL 2025.
Paper: Link
Repository: GitHub

July 2025

Rachneet Sachdeva, Yixiao Song, Mohit Iyyer, Iryna Gurevych. Fine-grained Hallucination Detection and Mitigation in Long-form Question Answering. In ACL 2025 Findings.
Paper: Link
Repository: GitHub
Data: GitHub Data

Apr. 2025

Anmol Goel, Yaxi Hu, Iryna Gurevych, Amartya Sanyal. Differentially Private Steering for Large Language Model Alignment. In ICLR 2025.
Paper: Link
Repository: GitHub

Jan. 2025

Rachneet Sachdeva, Rima Hazra, Iryna Gurevych. Turning Logic Against Itself: Probing Model Defenses Through Contrastive Questions. Preprint under review.
Paper: Link
Repository: GitHub

Dec. 2024

Haishuo Fang, Xiaodan Zhu, Iryna Gurevych. Preemptive Detection and Correction of Misaligned Actions in LLM Agents. Preprint under review.
Paper: Link

June 2024

Sheng Lu, Hendrik Schuff, Iryna Gurevych. How are Prompts Different in Terms of Sensitivity? In NAACL 2024.
Paper: Link
Repository: GitHub

Mar. 2024

Rachneet Sachdeva, Martin Tutek, Iryna Gurevych. CATfOOD: Counterfactual Augmented Training for Improving Out-of-Domain Performance and Calibration. In EACL 2024.
Paper: Link
Repository: GitHub