Making AI decisions transparent, interpretable, and actionable for real stakeholders — from concept-level explanations to interpretability agents. Our work spans concept explanations, mechanistic interpretability, model inspection, and interactive explanation interfaces.
Our work investigates why LLM-generated explanations often appear plausible to humans yet fail to accurately reflect the model's decision-making process. By analyzing the trade-offs between faithfulness (how well an explanation captures the model's actual reasoning) and plausibility (how convincing it seems to a reader), we propose strategies to improve explanation quality, such as refined prompting techniques and new evaluation metrics. We also tackle the challenge of eliciting faithful Chain-of-Thought (CoT) reasoning in LLMs. Our studies, cited in the OpenAI o1 System Card, reveal the limitations of current approaches such as in-context learning, LoRA fine-tuning, and activation editing in guaranteeing accurate reasoning paths. They advocate for new architectural designs and training paradigms that enhance transparency in LLMs, paving the way for more reliable decision-making in domains like legal and medical AI.
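The faithfulness/plausibility distinction can be made concrete with an erasure test: an explanation is faithful only if the tokens it cites as important actually drive the model's prediction. The sketch below uses a toy classifier and hypothetical token lists (not our actual models, explanations, or metrics) purely to illustrate the idea.

```python
def toy_sentiment_model(tokens):
    """Toy stand-in for a classifier: positive iff more positive than negative cues."""
    pos = {"great", "good", "excellent"}
    neg = {"bad", "awful", "terrible"}
    score = sum(t in pos for t in tokens) - sum(t in neg for t in tokens)
    return "positive" if score > 0 else "negative"

def erasure_faithfulness(model, tokens, important_tokens):
    """Fraction of tokens cited as important whose erasure flips the prediction.

    A high score means the explanation cites tokens the model actually relies on;
    a low score means the explanation is merely plausible.
    """
    base = model(tokens)
    flips = sum(model([x for x in tokens if x != t]) != base
                for t in important_tokens)
    return flips / len(important_tokens)

tokens = ["the", "movie", "was", "great"]
faithful_expl = ["great"]    # cites the token the toy model uses
plausible_expl = ["movie"]   # sounds reasonable but is ignored by the model

print(erasure_faithfulness(toy_sentiment_model, tokens, faithful_expl))   # 1.0
print(erasure_faithfulness(toy_sentiment_model, tokens, plausible_expl))  # 0.0
```

In practice the same recipe is applied with a real LLM and attribution scores in place of the toy pieces; the point is only that faithfulness is measured by intervention, not by how convincing the explanation reads.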
In our work, we develop new explainability algorithms to study the behavior of complex black-box unimodal and multimodal models. Research on multilingual and multimodal explanation methods is still at a nascent stage, and we aim to introduce novel algorithms that generate actionable explanations for multimodal models. Through these efforts, our lab aims to shape the future of XAI, ensuring AI systems are both powerful and interpretable, with far-reaching implications for ethical AI deployment.



AI Safety and Alignment is a cornerstone of current AI research that focuses on ensuring that AI systems are robust, reliable, and free from unintended consequences, particularly in high-stakes environments. Our work addresses critical risks, such as adversarial attacks and model hallucinations, which can undermine the trustworthiness of AI systems. By developing rigorous evaluation benchmarks and certification methods, our research aims to safeguard AI deployment in domains like healthcare.
Our recent AI safety work has focused on evaluating hallucinations in large vision-language models (LVLMs), where we show that even state-of-the-art LVLMs generate false or misleading outputs and thus pose a major safety risk. In addition to the hallucination benchmark, we have introduced comprehensive benchmarks (MedSafetyBench and CLINIC) for assessing LLM safety in healthcare settings, evaluating models on their ability to provide accurate and safe medical advice and addressing risks like misdiagnosis or harmful recommendations.
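Object hallucination in LVLM captions is commonly quantified in the style of the CHAIR metric: the fraction of objects a caption mentions that do not appear in the image's ground-truth annotations. The sketch below shows the core computation; the example object lists are hypothetical and not drawn from any of our benchmarks.

```python
def hallucination_rate(caption_objects, ground_truth_objects):
    """CHAIR-style instance score: |mentioned but absent| / |mentioned|."""
    mentioned = set(caption_objects)
    if not mentioned:
        return 0.0  # nothing mentioned, nothing hallucinated
    hallucinated = mentioned - set(ground_truth_objects)
    return len(hallucinated) / len(mentioned)

# Hypothetical example: objects extracted from a generated caption
# versus the objects annotated in the image.
caption = ["dog", "frisbee", "car"]
ground_truth = ["dog", "frisbee", "grass"]
print(hallucination_rate(caption, ground_truth))  # 1/3: "car" is hallucinated
```

Averaging this score over a dataset gives a simple, model-agnostic hallucination benchmark; richer evaluations additionally score attributes, relations, and free-form claims.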
While previous work has identified both semantic and non-semantic tokens capable of jailbreaking frontier models into pre-defined responses like “Sure, ...”, these studies have largely overlooked the potential for models to develop an internal, model-specific “language” of non-semantic text that reliably triggers specific behaviors. For example, a non-semantic phrase that appears benign to a naive user could consistently prompt the model to produce malicious code, posing a serious threat. Despite prior work on prompt engineering and adversarial attacks, a systematic, mechanistic understanding of how these vulnerabilities arise, and how they can be mitigated, remains an open challenge that we aim to address in our work.



Our research serves as a critical bridge between the raw performance of frontier AI models and the need for verifiable, honest, and human-centric reasoning. By investigating the disconnect between an LLM's "plausibility" (what sounds right) and its "faithfulness" (what the model actually did), we aim to pioneer frameworks that move beyond static evaluations of intelligence. This work, spanning the development of Explainability Agents and agents with Interpretable Memory, transforms AI from a "black-box" predictor into a transparent collaborator. Our recent work, including interactive verification interfaces and certification for medical safety, ensures that in high-stakes environments like healthcare and global regulation, AI reasoning is not just sophisticated but demonstrably reliable across linguistic and adversarial boundaries.
Building on this foundation, our future work aims to develop the next generation of self-monitoring agents that possess latent alignment. We envision a paradigm where AI systems do not merely provide post-hoc explanations but operate with intrinsic, interpretable reasoning loops that can be monitored and intervened upon in real time. By leading global efforts to operationalize AI regulation and expanding the frontiers of mechanistic interpretability, our lab aims to define the mathematical and ethical bounds of safe AI. This trajectory will lead to autonomous agents that are inherently regulatable, pointing toward a future where complex AI systems are as transparent as they are capable, fostering a safer and more equitable integration of AI into global society.

