Explainable Artificial Intelligence
Explainable AI (XAI) is pivotal for fostering trust in AI systems by making their decision-making processes transparent and interpretable to humans. Our research in XAI focuses on overcoming the inherent complexity of modern AI models, particularly frontier models and black-box systems, to produce reliable and actionable explanations. Our work addresses the trade-off between model performance and explainability, ensuring that explanations are not only understandable but also faithful to the model's actual behavior, which is critical for applications where stakeholders, such as doctors, judges, or policymakers, require clear justifications for AI-driven decisions. We have developed widely used benchmarks, including OpenXAI and GraphXAI, which have redefined standards for evaluating AI explainability and empowered researchers worldwide to build more transparent and equitable models.
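To make "faithful to the model's actual behavior" concrete, here is a minimal sketch of a perturbation-based faithfulness check in the spirit of the metrics formalized by benchmarks such as OpenXAI. It trains a small scikit-learn classifier, computes a simple occlusion-style attribution, and then measures how much the model's confidence drops when the top-ranked features are masked. The attribution method, the mean-value baseline, and the top-k choice are illustrative assumptions rather than the benchmark's actual API.

```python
# Sketch: perturbation-based faithfulness check for a feature attribution.
# All modeling choices below (occlusion attribution, mean baseline, k=3)
# are illustrative assumptions, not OpenXAI's actual interface.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

def occlusion_attribution(model, x, baseline):
    """Score each feature by how much replacing it with a baseline value
    changes the predicted probability of the originally predicted class."""
    p_orig = model.predict_proba(x[None])[0]
    cls = p_orig.argmax()
    scores = np.zeros_like(x)
    for j in range(len(x)):
        x_pert = x.copy()
        x_pert[j] = baseline[j]
        scores[j] = p_orig[cls] - model.predict_proba(x_pert[None])[0][cls]
    return scores, cls

def faithfulness(model, x, attribution, cls, baseline, k=3):
    """Mask the k features with the largest attribution magnitudes and
    measure the drop in confidence: larger drops = more faithful explanation."""
    top_k = np.argsort(-np.abs(attribution))[:k]
    x_masked = x.copy()
    x_masked[top_k] = baseline[top_k]
    return (model.predict_proba(x[None])[0][cls]
            - model.predict_proba(x_masked[None])[0][cls])

baseline = X.mean(axis=0)
x = X[0]
attr, cls = occlusion_attribution(clf, x, baseline)
print("faithfulness (confidence drop):", faithfulness(clf, x, attr, cls, baseline))
```

The design choice is deliberate: a faithful explanation should identify features whose removal actually changes the prediction, whereas a merely plausible-sounding one need not.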
Some recent key contributions of our work are highlighted below.
In our work, we develop new explainability algorithms to study the behavior of complex black-box unimodal and multimodal models. Our research on the (un)reliability of chain-of-thought reasoning was recently featured by FAR.AI:
“We found that if you ask the LLM, surprisingly it always says that I'm 100% confident about my reasoning.” @_cagarwal examines the (un)reliability of chain-of-thought reasoning, highlighting issues in faithfulness, uncertainty, and hallucination.
— FAR.AI (@farairesearch) January 13, 2025
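The gap the quote points to, a model that verbalizes near-total confidence even when its reasoning is unstable, can be probed with a simple consistency check. Below is a minimal sketch that contrasts a model's self-reported confidence with an empirical agreement estimate obtained by resampling its answers. The ask_llm wrapper is a hypothetical placeholder (stubbed with canned outputs so the comparison logic runs) and not an API from any of the papers listed below.

```python
# Sketch: compare an LLM's verbalized confidence with a resampling-based
# consistency estimate. `ask_llm` is a hypothetical stand-in; swap in a
# real chat-completion client to use this on an actual model.
import random
import re
from collections import Counter

def ask_llm(prompt: str, temperature: float = 1.0) -> str:
    # Placeholder stub: mimics a model that is sometimes wrong but
    # always reports "100% confident". A real implementation would
    # call whatever LLM endpoint is available.
    answer = random.choice(["Paris", "Paris", "Lyon"])
    return f"Answer: {answer}. Confidence: 100%"

def verbalized_confidence(response: str) -> float:
    """Parse the confidence the model states about its own answer."""
    match = re.search(r"Confidence:\s*(\d+)%", response)
    return float(match.group(1)) / 100 if match else float("nan")

def empirical_agreement(prompt: str, n_samples: int = 20) -> float:
    """Sample the model repeatedly and measure how often it repeats its
    modal answer -- a crude consistency-based proxy for confidence."""
    answers = []
    for _ in range(n_samples):
        response = ask_llm(prompt, temperature=1.0)
        answers.append(re.search(r"Answer:\s*([^.]+)", response).group(1))
    modal_count = Counter(answers).most_common(1)[0][1]
    return modal_count / n_samples

prompt = "What is the capital of France? Reply as 'Answer: X. Confidence: Y%'."
print("verbalized confidence:", verbalized_confidence(ask_llm(prompt)))
print("empirical agreement  :", empirical_agreement(prompt))
```

A large gap between the two numbers is exactly the kind of unreliability signal the quoted work studies.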
Relevant Papers:
The Multilingual Mind: A Survey of Multilingual Reasoning in Language Models
On the Impact of Fine-Tuning on Chain-of-Thought Reasoning
Quantifying Uncertainty in Natural Language Explanations of Large Language Models
Analyzing Memorization in Large Language Models through the Lens of Model Attribution
Faithfulness vs. Plausibility: On the (un)Reliability of Explanations from Large Language Models
On the Difficulty of Faithful Chain-of-Thought Reasoning in Large Language Models
In-context Explainers: Harnessing LLMs for Explaining Black-Box Models
OpenXAI: Towards a Transparent Evaluation of Model Explanations
Evaluating Explainability for Graph Neural Networks
AI Safety and Alignment
AI Safety and Alignment is a cornerstone of current AI research that focuses on ensuring that AI systems are robust, reliable, and free from unintended consequences, particularly in high-stakes environments. Our work addresses critical risks, such as adversarial attacks, model hallucinations, and privacy violations, which can undermine the trustworthiness of AI systems. By developing rigorous evaluation benchmarks and mitigation techniques, we aim to make AI systems safer, more trustworthy, and better aligned with human values.
Our recent AI safety work has focused on understanding how jailbreak attacks arise in large language models and how they can be systematically evaluated and mitigated.
Recent advancements in LLMs have introduced new security risks, particularly in the form of jailbreak attacks that bypass safety mechanisms. While previous work has identified both semantic and non-semantic tokens capable of jailbreaking frontier models using pre-defined responses like “Sure, ...”, these studies have largely overlooked the potential for models to create an internal, model-specific “language” of non-semantic text that reliably triggers specific behaviors. For example, a non-semantic phrase that appears as a simple typo or misspelling to a naive user could consistently prompt the model to produce malicious code, posing a serious threat. Despite prior work on prompt engineering and adversarial attacks, a systematic, mechanistic understanding of how these vulnerabilities arise, and how to mitigate them, remains an open challenge, which we aim to explore in our future work.
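To illustrate the mechanism behind “Sure, ...”-style attacks, the sketch below scores how likely a causal language model is to begin its reply with a fixed compliance prefix, with and without a candidate suffix appended to a benign request. The model (gpt2), the suffix string, and the prompt are illustrative assumptions; gpt2 is not safety-tuned, so this only demonstrates the target-prefix scoring that such attacks optimize, not an actual jailbreak.

```python
# Sketch: log-likelihood of a fixed compliance prefix ("Sure, here is")
# under a causal LM, with and without an appended candidate suffix.
# Model choice, prompt, and suffix are illustrative assumptions only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def target_logprob(prompt: str, target: str) -> float:
    """Sum of log-probabilities the model assigns to `target` tokens
    when they are forced to follow `prompt` (teacher forcing)."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    target_ids = tokenizer(target, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Logits at position i predict token i+1; slice out the target span.
    logprobs = torch.log_softmax(logits[0, prompt_ids.shape[1] - 1:-1], dim=-1)
    return logprobs.gather(1, target_ids[0].unsqueeze(1)).sum().item()

prompt = "Write a short poem about the sea."       # benign stand-in request
suffix = " describing describing !! <unk> tokens"  # illustrative non-semantic suffix
target = " Sure, here is"                          # compliance prefix attackers target

print("without suffix:", target_logprob(prompt, target))
print("with suffix   :", target_logprob(prompt + suffix, target))
```

An attack in this family searches over suffixes to drive the second score up; a defense or audit can use the same scoring to flag prompts whose appended non-semantic text sharply increases the likelihood of the compliance prefix.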
Relevant Papers:
Towards a Systematic Evaluation of Hallucinations in Large-Vision Language Models
Towards Operationalizing Right to Data Protection
Understanding the Effects of Iterative Prompting on Truthfulness
MedSafetyBench: Evaluating and Improving the Medical Safety of Large Language Models
Certifying LLM Safety against Adversarial Prompting
DeAR: Debiasing Vision-Language Models with Additive Residuals
GNNDelete: A General Strategy for Unlearning in Graph Neural Networks