Trustworthy AI

Research
Areas

Explainability research
01

Explainable AI

Making AI decisions transparent, interpretable, and actionable for real stakeholders — from concept-level explanations to interpretability agents. Our work spans concept explanations, mechanistic interpretability, model inspection, and interactive explanation interfaces.

Our work investigates why LLM-generated explanations often appear plausible to humans yet fail to accurately reflect the model's decision-making process. By analyzing the trade-off between faithfulness (how well an explanation captures the model's reasoning) and plausibility (how convincing it seems), we propose strategies to improve explanation quality, such as refining prompting techniques and developing new evaluation metrics. We also tackle the challenge of eliciting faithful Chain-of-Thought (CoT) reasoning in LLMs. Our studies, cited in the OpenAI o1 System Card, reveal the limitations of current approaches such as in-context learning, LoRA fine-tuning, and activation editing in guaranteeing accurate reasoning paths. They advocate for new architectural designs and training paradigms that enhance transparency in LLMs, paving the way for more reliable decision-making in domains like legal and medical AI.

In our work, we develop new explainability algorithms to study the behavior of complex black-box unimodal and multimodal models. Research on multilingual and multimodal explanation methods is still at a nascent stage, and we aim to introduce novel algorithms that generate actionable explanations for multimodal models. Through these efforts, our lab aims to shape the future of XAI, ensuring AI systems are both powerful and interpretable, with far-reaching implications for ethical AI deployment.

Concept Abstractions · Mechanistic Interpretability · Interpretable Agents · Multimodal XAI
arXiv'26
Mechanistic Interpretability · LLM
Towards Understanding Unlearning Difficulty: A Mechanistic Perspective and Circuit-Guided Difficulty Metric
How to understand model memorization and unlearning in LLMs using circuit analysis?
PDF →
IUI'26
Interactive Explanations · LLM
Improving Human Verification of LLM Reasoning through Interactive Explanation Interfaces
Interfaces allowing humans to verify and interrogate LLM reasoning chains.
PDF →
arXiv'25
Multimodal XAI · Survey
Rethinking Explainability in Multimodal AI
A new framework for evaluating explanations across vision and language in frontier models.
PDF →
Robustness research
02

Safety & Alignment

AI Safety and Alignment is a cornerstone of current AI research, focused on ensuring that AI systems are robust, reliable, and free from unintended consequences, particularly in high-stakes environments. Our work addresses critical risks, such as adversarial attacks and model hallucinations, which can undermine the trustworthiness of AI systems. By developing rigorous evaluation benchmarks and certification methods, our research aims to safeguard AI deployment in domains like healthcare.

Our recent AI safety work has focused on evaluating hallucinations in large vision-language models (LVLMs), where we show that state-of-the-art LVLMs generate false or misleading outputs that pose a major safety risk. In addition to the hallucination benchmark, we have introduced comprehensive benchmarks (MedSafety Bench and CLINIC) for assessing LLM safety in healthcare settings, evaluating models on their ability to provide accurate and safe medical advice and addressing risks like misdiagnosis or harmful recommendations.

While previous work has identified both semantic and non-semantic tokens capable of jailbreaking frontier models using pre-defined responses like "Sure, ...", these studies have largely overlooked the potential for models to develop an internal, model-specific "language" of non-semantic text that reliably triggers specific behaviors. For example, a non-semantic phrase that appears benign to a naive user could consistently prompt the model to produce malicious code, posing a serious threat. Despite prior work on prompt engineering and adversarial attacks, a systematic, mechanistic understanding of how these vulnerabilities arise, and how to mitigate them, remains an open challenge that we aim to address in our work.

Adversarial · Certified Defense · Alignment Probing · Distribution Shift
AAAI'26 Oral 🏆
Alignment · Mechanistic Interpretability
Polarity-Aware Probing for Quantifying Latent Alignment in Language Models
Quantifying latent alignment in language models through polarity-aware probing strategies.
PDF →
COLM'24
Robustness · AI Safety · LLMs
Certifying Robustness for Large Language Models
Certifying robustness of LLMs against adversarial text perturbations using formal verification techniques.
PDF →
EMNLP'25
Hallucination · Video LLMs
Egocentric Video Hallucinations in Multimodal LLMs
Studying how multimodal LLMs hallucinate facts in first-person video understanding tasks.
PDF →
Reasoning research
03

Reasoning & Agents

Our research bridges the raw performance of frontier AI models and the need for verifiable, honest, and human-centric reasoning. By investigating the disconnect between an LLM's "plausibility" (what sounds right) and its "faithfulness" (what the model actually did), we aim to pioneer frameworks that move beyond static evaluations of intelligence. This work, spanning the development of Explainability Agents and agents with Interpretable Memory, transforms AI from a "black-box" predictor into a transparent collaborator. Our recent work, including interactive verification interfaces and certification for medical safety, ensures that in high-stakes environments like healthcare and global regulation, AI reasoning is not just sophisticated but demonstrably reliable across linguistic and adversarial boundaries.

Building on this foundation, our future work aims to pioneer the next generation of self-monitoring agents that possess latent alignment. We envision a paradigm in which AI systems do not merely provide post-hoc explanations but operate with intrinsic, interpretable reasoning loops that can be monitored and intervened upon in real time. By leading global efforts to operationalize AI regulation and expanding the frontiers of mechanistic interpretability, our lab aims to define the mathematical and ethical bounds of safe AI. This trajectory will lead to autonomous agents that are inherently regulatable, pointing toward a future where complex AI systems are as transparent as they are capable, fostering a safer and more equitable integration of AI into global society.

Chain-of-Thought Reasoning · Interpretable Agents · Interactive Explanation
OpenAI o1 System Card Report
Reasoning · LLMs
Faithfulness vs. Plausibility: On the (Un)Reliability of Explanations from Large Language Models
Understanding the dichotomy between the plausible and faithful reasoning generated by frontier LLMs.
PDF →
arXiv'26
Interpretability · Agents · Reasoning
STRUCTUREDAGENT: Planning with AND/OR Trees for Long-Horizon Web Tasks
How to develop Agents that have interpretable AND/OR planning trees for easy human intervention?
PDF →
NAACL'25 Oral 🏆
Reasoning · LLMs
On the Impact of Fine-Tuning on Chain-of-Thought Reasoning
PDF →