When a model produces outputs, it exposes more than just its final answer. The scores it assigns to each vocabulary token, the logits, determine the ranked output probability distribution and carry information about the model's internal state. Prior work has shown that probing model internals can reveal information not apparent from generations. Apple ML Research's paper on logit information leakage provides the first systematic comparison of how much information is retained at different representational levels as it passes through two natural bottlenecks between the model's rich internal representation and its output.
The paper is authored by Masha Fedzechkina, Eleonora Gualdoni, Rita Ramos, and Sinead Williamson.
The information cascade
A transformer's residual stream encodes rich, high-dimensional information about the input. As computation progresses toward the output, this information passes through two compression points. The first is a low-dimensional projection of the residual stream, obtained with a tuned lens, a technique for reading out what information the model holds at intermediate layers. The second is the set of top-k logits: the highest-scoring entries of the output distribution, and the values most likely to determine the model's answer.
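To make the two bottlenecks concrete, here is a minimal sketch of how both views might be extracted from a HuggingFace-style causal (vision-)language model. The attribute names, the layer index, and the `lens_translator` module are illustrative assumptions, not the paper's implementation; a real tuned lens trains one affine translator per layer and also applies the model's final normalization before unembedding.

```python
import torch
import torch.nn as nn

# Minimal sketch of the two bottlenecks, assuming a HuggingFace-style
# causal LM. `lm_head`, `layer_idx`, and `lens_translator` are
# illustrative placeholders, not the paper's implementation.
@torch.no_grad()
def bottleneck_views(model, lens_translator: nn.Module, input_ids,
                     layer_idx: int, k: int = 10):
    out = model(input_ids=input_ids, output_hidden_states=True)

    # Bottleneck 1: a readout of the residual stream at an intermediate
    # layer. A tuned lens trains a per-layer affine map whose output is
    # decoded with the model's own unembedding (the final layer norm,
    # omitted here for brevity, would also be applied).
    h = out.hidden_states[layer_idx][:, -1, :]   # last-token residual state
    lens_logits = model.lm_head(lens_translator(h))

    # Bottleneck 2: the top-k final logits, i.e. the values a
    # logprob-returning API would expose for the next token.
    top_vals, top_ids = out.logits[:, -1, :].topk(k, dim=-1)
    return lens_logits, top_vals, top_ids
```

Here `lens_translator` would be, for example, an `nn.Linear(d_model, d_model)` trained so that the lens output matches the model's final-layer distribution.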
These two bottlenecks represent points at which the full internal representation is compressed for different purposes. The tuned lens projection is a diagnostic tool; the top logits are what get exposed through the model’s API in practice. A model owner might reasonably assume that the top logits — designed to represent the model’s answer distribution — reveal only information relevant to the task.
What logits actually leak
Using vision-language models as a testbed, the paper shows that this assumption is wrong. The top logit values can leak task-irrelevant information present in the image-based query — information the model owner did not intend to expose and that is not needed for the model’s answer.
The key finding is the severity of this leakage: in some cases, the top logits reveal as much information as direct projections of the full residual stream. The comparison to residual-stream projections matters because those projections are generally understood to be information-rich and would not typically be exposed to model users. The practical concern is that the top logits, which are routinely accessible through standard model APIs, can match that level of information disclosure.
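One natural way to quantify "how much information" each representation retains is to train a simple probe to predict a task-irrelevant attribute from each representation and compare held-out accuracies. The sketch below uses a scikit-learn logistic-regression probe on placeholder arrays; the dataset, attribute, and probe family are assumptions for illustration, and the paper's exact probing protocol may differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder data with plausible shapes; in practice each row would come
# from running the model on an image-question pair annotated with a
# task-irrelevant attribute (e.g. clothing color when the question asks
# about something else entirely).
rng = np.random.default_rng(0)
n, k, d = 2000, 10, 256
topk_logit_values = rng.normal(size=(n, k))      # sorted top-k logit values
residual_projections = rng.normal(size=(n, d))   # residual-stream readouts
attribute_labels = rng.integers(0, 2, size=n)    # the attribute to probe for

def probe_accuracy(features: np.ndarray, labels: np.ndarray,
                   seed: int = 0) -> float:
    """Fit a linear probe and return held-out accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.25, random_state=seed, stratify=labels
    )
    probe = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
    return probe.score(X_te, y_te)

# If the logit probe approaches the residual probe's accuracy, the top-k
# logits leak comparable information about the irrelevant attribute.
print(f"top-k logit probe: {probe_accuracy(topk_logit_values, attribute_labels):.3f}")
print(f"residual probe:    {probe_accuracy(residual_projections, attribute_labels):.3f}")
```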
Implications for model deployment
This has direct relevance for multi-modal model APIs that return logprob information. Many inference APIs expose top-k log probabilities as an optional output, sometimes used for downstream tasks like calibration, uncertainty estimation, or constrained decoding. The paper’s result suggests that when the input includes images, this logprob exposure may inadvertently reveal image-derived information that the model owner assumed was not accessible to the API caller.
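For concreteness, this is the exposure surface in question: a standard chat-completions request that asks the server to return top-k log probabilities alongside each generated token. The model name and image URL below are placeholders; the parameter names follow the OpenAI Python SDK, though other inference stacks expose the same option under similar names.

```python
from openai import OpenAI

# Illustrative request showing the logprob exposure surface; the model
# name and image URL are placeholders.
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",  # any multi-modal model served behind the API
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What animal is in this photo?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
    logprobs=True,
    top_logprobs=5,   # the top-k values the paper shows can leak
    max_tokens=1,
)

# Each generated token arrives with its k most likely alternatives and
# their log probabilities -- exactly the vector a leakage probe consumes.
for alt in response.choices[0].logprobs.content[0].top_logprobs:
    print(alt.token, alt.logprob)
```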
The risk the paper names is “unintentional or malicious information leakage, where model users are able to learn information that the model owner assumed was inaccessible.” Both forms are relevant: the unintentional case covers legitimate users inadvertently gaining access to private information in a multi-tenant system, while the malicious case covers adversarial probing through the logprob interface.
Why vision-language models
The choice of vision-language models as a testbed is deliberate. In these models, an image is encoded into a rich representation that must be compressed as it flows toward the text token prediction task. The gap between what the image contains and what the model’s answer requires creates the conditions for task-irrelevant information to persist in the logits. A model answering a question about an image of a person may not need to represent the person’s clothing in its answer, but residual signal about that clothing may remain in the logit distribution.
The paper provides a systematic measurement of how much this happens across representational levels, enabling more informed decisions about which model outputs to expose and when. Full experimental details and numerical results are in the paper, published in April 2026.