Revitalizing Black-Box Interpretability: Actionable Interpretability for LLMs via Proxy Models

Published in Annual Meeting of the Association for Computational Linguistics (ACL), 2026

This work proposes a proxy-based framework that reduces the cost of model-agnostic explanations for expensive LLMs. It fits lightweight proxy models to approximate the LLM's local decision boundary and applies a statistical screening step to decide when a proxy's explanations can be trusted.
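The two ingredients above can be sketched in miniature. The snippet below is an illustrative assumption, not the paper's actual method: `expensive_model` is a hypothetical stand-in for a costly LLM call, the Gaussian perturbations and linear surrogate are one common local-approximation choice, and the agreement threshold `tau` is an arbitrary screening criterion.

```python
import random

random.seed(0)

def expensive_model(x):
    # Hypothetical stand-in for an expensive LLM scoring call:
    # a nonlinear 0/1 decision over a 2-feature input.
    return 1.0 if x[0] + 0.5 * x[1] ** 2 > 0.5 else 0.0

def perturb(x0, n, scale=0.3):
    # Sample inputs near x0 to probe the local decision boundary.
    return [[v + random.gauss(0, scale) for v in x0] for _ in range(n)]

def solve3(A, b):
    # Gauss-Jordan elimination with partial pivoting for a 3x3 system.
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(3):
        p = max(range(c, 3), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(3):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * m for a, m in zip(M[r], M[c])]
    return [M[i][3] / M[i][i] for i in range(3)]

def fit_proxy(x0, n=400):
    # Fit a local linear proxy (least squares via normal equations)
    # on perturbed inputs labeled by the expensive model.
    X = perturb(x0, n)
    y = [expensive_model(x) for x in X]
    rows = [x + [1.0] for x in X]  # design rows: [x0, x1, intercept]
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    aty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    return solve3(ata, aty)

def screen_proxy(w, x0, n=200, tau=0.8):
    # Statistical screening: accept the proxy only if its thresholded
    # predictions agree with the expensive model on held-out perturbations.
    X = perturb(x0, n)
    agree = sum(
        (w[0] * x[0] + w[1] * x[1] + w[2] > 0.5) == (expensive_model(x) > 0.5)
        for x in X
    ) / n
    return agree >= tau, agree

x0 = [0.6, 0.2]
w = fit_proxy(x0)
ok, agree = screen_proxy(w, x0)
print(f"proxy accepted: {ok} (holdout agreement {agree:.2f})")
```

Only proxies that pass the screening step would be used for downstream explanation; rejected ones fall back to querying the expensive model directly.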

The paper evaluates the framework across diverse LLMs and tasks, showing that screened proxy explanations are reliable enough to support actionable workflows such as prompt compression and poisoned-example removal.

Recommended citation: Junhao Liu, Haonan Yu, Zhenyu Yan, and Xin Zhang. (2026). "Revitalizing Black-Box Interpretability: Actionable Interpretability for LLMs via Proxy Models." Annual Meeting of the Association for Computational Linguistics (ACL).
Download Paper