Interpretability research for large language models faces a combinatorial problem: the number of possible interactions among input features, training data points, or internal model components grows exponentially with the number of candidate elements, and that number keeps climbing as models scale. Exhaustive analysis is computationally infeasible. A blog post from Berkeley AI Research describes SPEX and ProxySPEX, two algorithms that make interaction discovery tractable by exploiting structural properties of how real systems behave.
The post frames the problem across three parallel interpretability approaches: feature attribution (which input segments drive a prediction), data attribution (which training examples influence behavior), and mechanistic interpretability (which internal components are responsible). In each case, the fundamental challenge is the same — isolating the drivers of complex behavior requires systematically perturbing the system, and each perturbation (called an ablation) is expensive.
Attribution through ablation
The post defines a unified framework based on ablation. For feature attribution, you mask or remove specific segments of an input prompt and measure how the model’s output shifts. For data attribution, you train on different subsets and observe how test predictions change. For mechanistic interpretability, you intervene on the model’s forward pass by removing or bypassing specific internal components.
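As a rough illustration of the feature-attribution case, the sketch below treats an ablation as a binary mask over input segments: masked-out segments are replaced with a neutral token and the prompt is re-scored. The `query_model` callable is a hypothetical stand-in for the expensive inference call; it is not something defined in the post.

```python
# Minimal sketch of ablation-based feature attribution, assuming a scalar scoring
# function (e.g. log-probability of a target answer) wrapped in `query_model`.
from typing import Callable, Sequence

def ablate(segments: Sequence[str], keep_mask: Sequence[bool],
           mask_token: str = "[MASK]") -> str:
    """Replace every masked-out segment with a neutral token."""
    return " ".join(s if keep else mask_token for s, keep in zip(segments, keep_mask))

def value(segments: Sequence[str], keep_mask: Sequence[bool],
          query_model: Callable[[str], float]) -> float:
    """v(S): the model's score when only the segments marked True are kept."""
    return query_model(ablate(segments, keep_mask))
```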
All three approaches share the goal of identifying which elements, in combination, are responsible for a model’s behavior. The post emphasizes that “model behavior is rarely the result of isolated components; rather, it emerges from complex dependencies and patterns.” Single-feature attribution is often insufficient because the real driver of a prediction can be an interaction between features — a double negative flipping sentiment, or an answer in a retrieval-augmented generation task that can only be produced by synthesizing several retrieved documents.
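To see why single-feature scores can miss the real driver, consider the standard inclusion-exclusion interaction for a pair of segments, computed from four ablations of the value function v(S) sketched above. The numbers below are illustrative, not from the post: each negation alone lowers a sentiment score, but the pair together restores it, so the signal lives in the pairwise term.

```python
def pairwise_interaction(v_both: float, v_i: float, v_j: float, v_none: float) -> float:
    """Inclusion-exclusion interaction for segments i and j:
    v({i, j}) - v({i}) - v({j}) + v(empty set)."""
    return v_both - v_i - v_j + v_none

# Hypothetical sentiment scores for a double negative ("not", "bad"):
# each word alone hurts sentiment, but together they cancel out.
print(pairwise_interaction(v_both=0.7, v_i=0.2, v_j=0.1, v_none=0.5))  # 0.9
```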
The post notes that each ablation carries significant cost: “expensive inference calls or retrainings.” Minimizing the number of ablations required is therefore central to the problem.
SPEX: spectral interaction discovery
SPEX (Spectral Explainer) is built on two structural observations about real models. The first is sparsity: “relatively few interactions truly drive the output.” The second is low-degreeness: “influential interactions typically involve only a small subset of features.” These properties transform an otherwise intractable exponential search into a sparse recovery problem.
The algorithm draws on tools from signal processing and coding theory. Rather than testing interactions one at a time, SPEX uses strategically selected ablations that combine many candidate interactions together. Efficient decoding algorithms then disentangle these combined signals to isolate the specific interactions responsible for the observed behavior. The post describes this as advancing “interaction discovery to scales orders of magnitude greater than prior methods.”
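The post does not spell out the decoding step, so the sketch below uses a much simpler stand-in to convey the flavor of sparse recovery: score a modest batch of random ablations, expand each mask into low-order interaction features, and fit a sparse linear surrogate. SPEX’s coding-theory-based design of ablations and its decoder are more structured than the random masks and lasso used here; this is only meant to show how sparsity and low degree turn an exponential search into a small regression.

```python
# Illustrative stand-in for sparse interaction recovery (NOT the SPEX decoder).
# Assumes sparsity and low degree: only a few interactions of order <= 2 matter.
from itertools import combinations
import numpy as np
from sklearn.linear_model import Lasso

def design_matrix(masks: np.ndarray, max_order: int = 2):
    """Expand binary ablation masks into one column per subset of size <= max_order."""
    n = masks.shape[1]
    subsets = [s for k in range(1, max_order + 1) for s in combinations(range(n), k)]
    columns = [masks[:, list(s)].prod(axis=1) for s in subsets]
    return np.column_stack(columns), subsets

def recover_interactions(masks: np.ndarray, outputs: np.ndarray, max_order: int = 2):
    """Fit a sparse surrogate: model output ~ sum of low-order interaction effects."""
    X, subsets = design_matrix(masks, max_order)
    surrogate = Lasso(alpha=0.01).fit(X, outputs)
    return {s: c for s, c in zip(subsets, surrogate.coef_) if abs(c) > 1e-3}
```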
For feature attribution, the post reports that SPEX “matches the high faithfulness of existing interaction techniques (Faith-Shap, Faith-Banzhaf) on short inputs” and “uniquely retains this performance” on longer inputs where prior methods degrade. Faithfulness here measures how accurately recovered attributions predict the model’s output on unseen test ablations — a direct test of whether the attribution reflects the true underlying mechanism.
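Operationally, faithfulness can be measured as how well the recovered coefficients predict the model’s true outputs on ablations held out from fitting. An R² on held-out masks is one common way to score this; the post does not commit to this exact formula.

```python
import numpy as np

def faithfulness_r2(predicted: np.ndarray, actual: np.ndarray) -> float:
    """R^2 between surrogate predictions and true model outputs on held-out ablations."""
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return 1.0 - ss_res / ss_tot
```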
ProxySPEX: hierarchy cuts ablations by 10x
ProxySPEX adds a third structural observation: hierarchy. The post describes this as meaning “where a higher-order interaction is important, its lower-order subsets are likely to be important as well.” The post presents this as an empirically observed regularity of complex machine learning models, not a property guaranteed by the architecture.
Exploiting hierarchy means ProxySPEX can focus ablation budget on interactions whose lower-order components are already known to be important, rather than exhaustively testing all possible combinations. The post reports that ProxySPEX “matches the performance of SPEX with around 10x fewer ablations” — a substantial reduction in compute for large-scale attribution tasks.
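The post does not detail how ProxySPEX allocates its budget, but the hierarchy property naturally suggests an apriori-style search: only promote a higher-order candidate when all of its lower-order subsets already look important. The sketch below shows that pruning step as an illustration of the hierarchy assumption, not as ProxySPEX itself.

```python
from itertools import combinations

def promote_candidates(important: set, order: int) -> set:
    """Return candidate interactions of size `order` whose every (order - 1)-subset
    was already found important; everything else is skipped, saving ablations."""
    elements = sorted({i for subset in important for i in subset})
    candidates = set()
    for combo in combinations(elements, order):
        if all(sub in important for sub in combinations(combo, order - 1)):
            candidates.add(combo)
    return candidates

# If the singletons (1,), (3,), (7,) and the pairs (1, 3), (1, 7), (3, 7) are important,
# the triple (1, 3, 7) is the only third-order candidate worth spending ablations on.
seeds = {(1,), (3,), (7,), (1, 3), (1, 7), (3, 7)}
print(promote_candidates(seeds, order=3))  # {(1, 3, 7)}
```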
Applications across interpretability domains
The post describes applications in all three interpretability domains. For feature attribution, the running example is sentiment analysis, where SPEX can identify not just that individual words matter but that specific word combinations (such as double negatives) drive the prediction. For retrieval-augmented generation tasks, the post illustrates that the necessary interaction might span multiple retrieved documents, and single-document attribution would miss it entirely.
For data attribution, the framework identifies which training examples, in combination, influence model behavior on a test point. For mechanistic interpretability, it identifies which internal components — attention heads, MLP layers — interact to produce specific outputs.
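For the mechanistic case, one common way to intervene on the forward pass is to silence a component with a forward hook. The sketch below zeros one attention head’s slice of an attention module’s output in a generic PyTorch model; the module path, output layout, and head dimensions are hypothetical and depend on the specific implementation, and this is not code from the post.

```python
# Minimal sketch of a component ablation, assuming a PyTorch model whose attention
# module outputs a tensor shaped (batch, seq_len, num_heads * head_dim).
import torch

def make_head_ablation_hook(head_index: int, head_dim: int):
    """Return a forward hook that zeros one attention head's slice of the output."""
    def hook(module, inputs, output):
        out = output[0] if isinstance(output, tuple) else output
        out = out.clone()  # avoid mutating the original activation in place
        out[..., head_index * head_dim : (head_index + 1) * head_dim] = 0.0
        return (out, *output[1:]) if isinstance(output, tuple) else out
    return hook

# Usage with a hypothetical module path:
# handle = model.layers[5].attn.register_forward_hook(make_head_ablation_hook(3, 64))
# ablated_logits = model(input_ids)   # forward pass with the head silenced
# handle.remove()                     # restore normal behavior
```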
The post frames interpretability not as an academic exercise but as a practical step toward “safer and more trustworthy AI,” enabling both model builders and affected humans to understand how decisions are made. The scalability gains from SPEX and ProxySPEX are what make this practical: prior methods could find interactions among small numbers of features, but broke down as the number of candidates grew. The sparse recovery framing, combined with the hierarchy insight in ProxySPEX, extends the reach of attribution methods to the scale at which modern LLMs actually operate.