Trustworthy AI

Research
Areas

Explainability research
01

Explainable AI

Making AI decisions transparent, interpretable, and actionable for real stakeholders — from concept-level explanations to interpretability agents. Our work spans concept explanations, mechanistic interpretability, model inspection, and interactive explanation interfaces.

Our work investigates why LLM-generated explanations often appear plausible to humans yet fail to accurately reflect the model's decision-making process. By analyzing the trade-off between faithfulness (how well an explanation captures the model's reasoning) and plausibility (how convincing it seems), we propose strategies to improve explanation quality, such as refining prompting techniques and developing new evaluation metrics. We also tackle the challenge of eliciting faithful Chain-of-Thought (CoT) reasoning in LLMs. Our studies, cited in the OpenAI o1 System Card, reveal the limitations of current approaches such as in-context learning, LoRA fine-tuning, and activation editing in guaranteeing accurate reasoning paths. They advocate for new architectural designs and training paradigms that enhance transparency in LLMs, paving the way for more reliable decision-making in domains like legal and medical AI.

In our work, we develop new explainability algorithms to study the behavior of complex black-box unimodal and multimodal models. Research on multilingual and multimodal explanation methods is still at a nascent stage, and we aim to introduce novel algorithms that generate actionable explanations for multimodal models. Through these efforts, our lab aims to shape the future of XAI, ensuring AI systems are both powerful and interpretable, with far-reaching implications for ethical AI deployment.

Concept Abstractions · Mechanistic Interpretability · Interpretable Agents · Multimodal XAI
arXiv'26
Mechanistic Interpretability · LLM
Towards Understanding Unlearning Difficulty: A Mechanistic Perspective and Circuit-Guided Difficulty Metric
How to understand model memorization and unlearning in LLMs using circuit analysis?
PDF →
IUI'26
Interactive Explanations · LLM
Improving Human Verification of LLM Reasoning through Interactive Explanation Interfaces
Interfaces allowing humans to verify and interrogate LLM reasoning chains.
PDF →
arXiv'25
Multimodal XAI · Survey
Rethinking Explainability in Multimodal AI
A new framework for evaluating explanations across vision and language in frontier models.
PDF →
Robustness research
02

Safety & Alignment

AI Safety and Alignment is a cornerstone of current AI research, focused on ensuring that AI systems are robust, reliable, and free from unintended consequences, particularly in high-stakes environments. Our work addresses critical risks, such as adversarial attacks and model hallucinations, which can undermine the trustworthiness of AI systems. By developing rigorous evaluation benchmarks and certification methods, our research aims to safeguard AI deployment in domains like healthcare.

Our recent AI safety work has focused on evaluating hallucinations in large vision-language models (LVLMs), where we show that state-of-the-art LVLMs generate false or misleading outputs that pose a major safety risk. In addition to the hallucination benchmark, we have introduced comprehensive benchmarks (MedSafety Bench and CLINIC) for assessing LLM safety in healthcare settings, evaluating models on their ability to provide accurate and safe medical advice and addressing risks like misdiagnosis or harmful recommendations.

While previous work has identified both semantic and non-semantic tokens capable of jailbreaking frontier models using pre-defined responses like "Sure, ...", these studies have largely overlooked the potential for models to develop an internal, model-specific "language" of non-semantic text that reliably triggers specific behaviors. For example, a non-semantic phrase that appears benign to a naive user could consistently prompt the model to produce malicious code, posing a serious threat. Despite prior work on prompt engineering and adversarial attacks, a systematic, mechanistic understanding of how these vulnerabilities arise, and how to mitigate them, remains an open challenge that we aim to address in our work.

Adversarial · Certified Defense · Alignment Probing · Distribution Shift
AAAI'26 Oral 🏆
Alignment · Mechanistic Interpretability
Polarity-Aware Probing for Quantifying Latent Alignment in Language Models
Quantifying latent alignment in language models through polarity-aware probing strategies.
PDF →
COLM'24
Robustness · AI Safety · LLMs
Certifying Robustness for Large Language Models
Certifying robustness of LLMs against adversarial text perturbations using formal verification techniques.
PDF →
EMNLP'25
Hallucination · Video LLMs
Egocentric Video Hallucinations in Multimodal LLMs
Studying how multimodal LLMs hallucinate facts in first-person video understanding tasks.
PDF →
Reasoning research
03

Reasoning & Agents

Our research bridges the raw performance of frontier AI models and the need for verifiable, honest, and human-centric reasoning. By investigating the disconnect between an LLM's "plausibility" (what sounds right) and its "faithfulness" (what the model actually did), we aim to pioneer frameworks that move beyond static evaluations of intelligence. This work, spanning the development of Explainability Agents and agents with Interpretable Memory, transforms AI from a "black-box" predictor into a transparent collaborator. Our recent work, including interactive verification interfaces and certification for medical safety, ensures that in high-stakes environments like healthcare and global regulation, AI reasoning is not just sophisticated but demonstrably reliable across linguistic and adversarial boundaries.

Building on this foundation, our future work aims to pioneer the next generation of self-monitoring agents that possess latent alignment. We envision a paradigm in which AI systems do not merely provide post-hoc explanations but operate with intrinsic, interpretable reasoning loops that can be monitored and intervened upon in real time. By leading global efforts to operationalize AI regulation and expanding the frontiers of mechanistic interpretability, our lab aims to define the mathematical and ethical bounds of safe AI. This trajectory will lead to autonomous agents that are inherently regulatable, pointing toward a future where complex AI systems are as transparent as they are capable, fostering a safer and more equitable integration of AI into global society.

Chain-of-Thought Reasoning · Interpretable Agents · Interactive Explanation
OpenAI o1 System Card Report
Reasoning · LLMs
Faithfulness vs. Plausibility: On the (Un)Reliability of Explanations from Large Language Models
Understanding the dichotomy between the plausible and faithful reasoning generated by frontier LLMs.
PDF →
arXiv'26
Interpretability · Agents · Reasoning
STRUCTUREDAGENT: Planning with AND/OR Trees for Long-Horizon Web Tasks
How to develop Agents that have interpretable AND/OR planning trees for easy human intervention?
PDF →
NAACL'25 Oral 🏆
Reasoning · LLMs
On the Impact of Fine-Tuning on Chain-of-Thought Reasoning
PDF →