Query Circuits: Explaining How Language Models Answer User Prompts

Tung-Yu Wu1, Fazl Barez1,2
1University of Oxford 2Martian
ICML 2026
Teaser

Query circuit discovery aims to identify a sparse sub-network within the language model that underlies the model response to a user input query.

Abstract

Explaining why a language model produces a particular output requires local, input-level explanations. Existing methods uncover global capability circuits (e.g., indirect object identification), but not why the model answers a specific input query in a particular way. We introduce query circuits, which directly trace the information flow inside a model that maps a specific input to the output. Unlike surrogate-based approaches (e.g., sparse autoencoders), query circuits are identified within the model itself, resulting in more faithful and computationally accessible explanations. To make query circuits practical, we address two challenges. First, we introduce Normalized Deviation Faithfulness (NDF), a robust metric that evaluates how well a discovered circuit recovers the model's decision for a specific input and that is broadly applicable to circuit discovery beyond our setting. Second, we develop sampling-based methods to efficiently identify circuits that are sparse yet faithfully describe the model's behavior. Across benchmarks (IOI, arithmetic, MMLU, and ARC), we find that there exist sparse query circuits within the model that recover much of its performance on single queries. For example, on average, a circuit covering only 1.3% of model connections can recover about 60% of performance on an MMLU question. Overall, query circuits provide a step towards faithful, scalable explanations of how language models process individual inputs.

Key Research Questions

Three core questions this paper asks and answers.

Q1

Can a sparse sub-network explain why an LLM answers one specific query the way it does?


Yes

Even for complex real-world questions (MMLU, ARC), a circuit covering only ~1.3% of edges can recover ~60% of model behavior on a single query. Compact, query-specific circuits exist and are findable.

Q2

Is the standard evaluation metric (NFS) reliable for measuring query-level circuit quality?


No

NFS becomes numerically unstable on general datasets, with values routinely falling outside [0, 1]. We propose NDF (Normalized Deviation Faithfulness), which is bounded in [0, 1] and symmetric around the model's true performance, as a drop-in replacement.
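
To make the instability concrete, here is a minimal sketch. It assumes NFS takes the commonly used normalized form: the circuit's score minus a corrupted-baseline score, divided by the full model's score minus that same baseline. This page does not spell out either formula, so both functions are illustrative. In particular, `clipped_deviation_score` is a hypothetical stand-in for a score that is bounded in [0, 1] and symmetric around the full model's performance; it is not the paper's definition of NDF.

```python
def nfs(circuit: float, full: float, corrupted: float) -> float:
    """Assumed normalized form: share of the full-vs-corrupted gap recovered.

    Unbounded: it exceeds 1 when the circuit outscores the full model,
    goes negative when the circuit falls below the corrupted baseline,
    and grows very large when full and corrupted are close together.
    """
    return (circuit - corrupted) / (full - corrupted)


def clipped_deviation_score(circuit: float, full: float, corrupted: float) -> float:
    """Hypothetical bounded score (NOT the paper's NDF formula).

    Penalizes the absolute deviation from the full model's score, so
    overshooting and undershooting by the same amount score the same.
    """
    deviation = abs(circuit - full)
    reference = abs(corrupted - full)
    if reference == 0.0:
        # Degenerate case: the baseline already matches the full model.
        return 1.0 if deviation == 0.0 else 0.0
    return 1.0 - min(1.0, deviation / reference)


if __name__ == "__main__":
    full, corrupted = 1.0, 0.5  # made-up scores for illustration
    for circuit in (1.25, 0.75, 0.25):
        print(
            circuit,
            nfs(circuit, full, corrupted),
            clipped_deviation_score(circuit, full, corrupted),
        )
```

With these made-up scores, the assumed NFS leaves [0, 1] in both directions, while the clipped score stays inside it and treats a circuit at 1.25 the same as one at 0.75.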

Q3

Do existing circuit-discovery methods work well at the single-query level, and if not, what does?


No

Existing methods (e.g., EAP-IG) need ~50% of all edges before they beat a random baseline on a single query. We propose Best-of-N (BoN) sampling, which generates semantically equivalent paraphrases of the query and picks the best-scoring circuit, reducing the number of required edges by ~40×.
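
The selection step of Best-of-N can be sketched as a short loop. This is a simplified illustration under stated assumptions, not the paper's implementation: `discover_circuit` and `score` are hypothetical callables supplied by the caller (for example, an attribution method and a faithfulness metric), the paraphrases are assumed to be given, and both including the original query among the candidates and scoring every candidate circuit against the original query are assumptions.

```python
from typing import Callable, Iterable, Tuple, TypeVar

Circuit = TypeVar("Circuit")


def best_of_n(
    query: str,
    paraphrases: Iterable[str],
    discover_circuit: Callable[[str], Circuit],  # hypothetical: prompt -> circuit
    score: Callable[[Circuit, str], float],      # hypothetical: higher is better
) -> Tuple[Circuit, float]:
    """Return the candidate circuit that scores best on the original query."""
    # Assumption: the original query is itself one of the candidates.
    candidates = [query, *paraphrases]
    best_circuit = discover_circuit(candidates[0])
    best_score = score(best_circuit, query)
    for prompt in candidates[1:]:
        circuit = discover_circuit(prompt)
        # Assumption: candidates are always judged on the original query.
        current = score(circuit, query)
        if current > best_score:
            best_circuit, best_score = circuit, current
    return best_circuit, best_score


if __name__ == "__main__":
    # Toy stand-ins: a "circuit" is just the prompt length, and the
    # score prefers circuits whose size is close to 10.
    circuit, value = best_of_n(
        "abc",
        ["a" * 10, "a" * 7],
        discover_circuit=len,
        score=lambda c, q: -abs(c - 10),
    )
    print(circuit, value)
```

Because each candidate is discovered independently, the cost grows linearly with the number of paraphrases.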

BibTeX

@inproceedings{wu2026query,
  title={Query Circuits: Explaining How Language Models Answer User Prompts},
  author={Tung-Yu Wu and Fazl Barez},
  booktitle={Forty-third International Conference on Machine Learning},
  year={2026},
  url={https://openreview.net/forum?id=7F0sragazb}
}