The Fourteenth International Conference on Learning Representations (ICLR 2026) is being held this week in Rio de Janeiro, Brazil. Apple is participating with oral papers, poster presentations, workshop contributions, and a demonstration booth, #204. The research spans recurrent architectures, the limits of state space models, unified vision-language systems, real-time 3D scene synthesis, and computational biology. Apple is also sponsoring the event and hosting events for affinity groups that support underrepresented communities in ML.
Five highlights are worth covering in detail, as each addresses a distinct open problem in the field.
Parallel training for nonlinear RNNs
The most prominent result is ParaRNN: Unlocking Parallel Training of Nonlinear RNNs for Large Language Models, accepted as an Oral at ICLR. RNNs have a well-documented inference efficiency advantage: because their per-token computation and state size do not grow with context length, generation throughput stays constant no matter how much prior context the model has seen. The catch is that training nonlinear RNNs is inherently sequential: each hidden state h_t = f(h_{t-1}, x_t) depends on the previous one, which blocks the parallelization across sequence positions that transformers enjoy.
Apple researchers resolve this by adapting Newton's method to the training problem. Instead of computing hidden states one step at a time, they treat the entire sequence of hidden states as the solution of one large system of equations and solve it iteratively, where each Newton iteration takes the form of a parallelizable linear state space model. The framework achieves a 665x speedup over sequential RNN training, and the resulting models, trained at up to 7 billion parameters, reach language modeling performance competitive with transformers on perplexity and downstream tasks. The ParaRNN codebase has been released as an open-source framework, and the paper's first author is also giving an Expo Talk at the conference.
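To make the mechanics concrete, here is a minimal NumPy sketch of the idea, written for illustration rather than taken from the released ParaRNN code: a toy tanh RNN is evaluated by Newton iterations, each of which reduces to a linear recurrence of exactly the kind a parallel scan (or linear SSM solver) can evaluate across all positions at once. The recurrence, weight scales, and tolerances are our own toy choices.

```python
# Toy NumPy sketch of Newton-parallelized evaluation of a nonlinear RNN
# (illustrative only; not the released ParaRNN code).
import numpy as np

rng = np.random.default_rng(0)
T, d = 64, 8
W = rng.normal(scale=0.3, size=(d, d))
U = rng.normal(scale=0.3, size=(d, d))
X = rng.normal(size=(T, d))

def f(h_prev, x):
    # Nonlinear recurrence: h_t = tanh(W h_{t-1} + U x_t), with h_0 = 0
    return np.tanh(W @ h_prev + U @ x)

# Solve the stacked system F(H)_t = h_t - f(h_{t-1}, x_t) = 0 with Newton.
# Each Newton step is the *linear* recurrence
#     delta_t = A_t delta_{t-1} - F_t,   A_t = df/dh_{t-1},
# which is exactly the form a parallel scan / linear SSM solver handles.
H = np.zeros((T, d))                          # guess for all hidden states at once
for _ in range(T):                            # converges in at most T iterations
    H_prev = np.vstack([np.zeros(d), H[:-1]])
    pre = H_prev @ W.T + X @ U.T              # pre-activations for every position
    F = H - np.tanh(pre)                      # residual of the stacked system
    if np.abs(F).max() < 1e-10:
        break
    A = (1.0 - np.tanh(pre) ** 2)[:, :, None] * W   # per-step Jacobians wrt h_{t-1}
    delta, prev = np.zeros((T, d)), np.zeros(d)
    for t in range(T):                        # a parallel scan in real systems
        prev = A[t] @ prev - F[t]
        delta[t] = prev
    H = H + delta

# Sanity check against plain sequential evaluation.
h, H_seq = np.zeros(d), []
for t in range(T):
    h = f(h, X[t])
    H_seq.append(h)
assert np.allclose(H, np.array(H_seq), atol=1e-6)
```

In exact arithmetic each Newton step makes at least one additional sequence position exact, so the loop converges in at most T iterations; in practice it converges much faster, and the inner loop over t is replaced by a parallel scan, which is where the training speedup comes from.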
SSM length generalization via tool access
The second oral paper addresses a different architecture: state space models. SSMs like Mamba have computational cost that scales linearly with sequence length and a fixed-size recurrent memory, which makes them efficient on long contexts. But that fixed memory is also a ceiling. The Apple paper To Infinity and Beyond: Tool-Use Unlocks Length Generalization in State Space Models shows that SSMs fail on long-form generation tasks once task complexity exceeds model capacity, even when the model is allowed to generate arbitrarily long chain-of-thought. The bounded memory limits the model's expressive power during long generation.
The solution demonstrated in the paper is interactive tool access. Given the right choice of tools and problem-dependent training data, the paper shows that SSMs can learn to solve any tractable problem and generalize to arbitrary problem length and complexity; tool-augmented SSMs achieve strong length generalization on arithmetic, reasoning, and coding tasks. The implication is that SSMs, previously seen as limited in agentic contexts, become viable alternatives to transformers in tool-use and agentic settings once the memory demands of a task are offloaded to external tools.
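The offloading intuition can be seen in a toy example of our own, not from the paper: reversing a sequence requires memory that grows with its length, so no fixed-size state can length-generalize on it unaided, but a controller with constant internal state plus an external stack "tool" handles any length.

```python
# Toy illustration of memory offloading (ours, not the paper's).
def reverse_with_stack(stream):
    stack = []                                        # external tool, outside the model
    for token in stream:                              # controller holds no growing state
        stack.append(token)                           # tool call: push
    return [stack.pop() for _ in range(len(stack))]   # tool call: pop until empty

assert reverse_with_stack(list("abcdef")) == list("fedcba")
```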
Unified image understanding and generation
A recurring challenge in multimodal LLMs is the performance tradeoff between image understanding and image generation. Models that excel at one tend to underperform on the other. Apple’s MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer addresses this with a single shared vision encoder feeding two lightweight adapters. One adapter produces continuous embeddings for image-to-text understanding; the other produces discrete tokens for text-to-image generation, both in a shared semantic space. A unified autoregressive LLM predicts text and image tokens, and an auxiliary diffusion decoder translates image tokens to pixels.
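A shape-level sketch of that layout, with placeholder module names and dimensions of our own choosing (the actual Manzano components differ):

```python
# Schematic sketch of a hybrid vision tokenizer: one shared encoder feeding
# two lightweight adapters. Names and sizes are placeholders, not Manzano's.
import torch
import torch.nn as nn

class HybridVisionTokenizer(nn.Module):
    def __init__(self, encoder: nn.Module, dim: int = 1024, codebook_size: int = 16384):
        super().__init__()
        self.encoder = encoder                            # single shared vision encoder
        self.understand = nn.Linear(dim, dim)             # -> continuous embeddings
        self.generate = nn.Linear(dim, dim)               # -> projection for quantization
        self.codebook = nn.Embedding(codebook_size, dim)  # discrete image vocabulary

    def forward(self, image: torch.Tensor):
        feats = self.encoder(image)                       # (B, N, dim) patch features
        cont = self.understand(feats)                     # LLM input for image->text
        proj = self.generate(feats)
        # nearest codebook entry -> discrete token ids for text->image generation
        dists = torch.cdist(proj, self.codebook.weight.expand(proj.size(0), -1, -1))
        ids = dists.argmin(dim=-1)                        # (B, N) discrete token ids
        return cont, ids
```

Because both adapters read the same encoder features, the continuous and discrete views of an image live in one shared semantic space rather than drifting apart in separate pipelines.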
The paper describes a unified training recipe that scales across model sizes and achieves state-of-the-art results among unified models, with particularly strong performance on text-rich evaluation benchmarks. The architecture’s simplicity — a single encoder, two adapters — distinguishes it from approaches that use entirely separate pipelines for the two tasks.
Real-time 3D scene synthesis from a single image
SHARP (Single-image High-Accuracy Real-time Parallax) produces a 3D Gaussian representation from a single photograph in one feed-forward pass through a neural network, completing in under one second on a standard GPU. The resulting representation renders as a high-resolution, photorealistic 3D scene from nearby views in real time, and because the representation is metric, it supports camera movements at absolute scale. Apple reports that SHARP reduces LPIPS by 25-34% and DISTS by 21-43% compared to the best prior model, while cutting synthesis time by three orders of magnitude. Code is publicly available, and the work is being demonstrated live at Apple booth #204 during exhibition hours.
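For readers unfamiliar with the output format, the sketch below shows what "a 3D Gaussian representation from a single forward pass" means at the interface level; the parameter layout and the stand-in network are our own simplification, not SHARP's actual design.

```python
# Interface-level sketch: image in, a set of 3D Gaussian primitives out.
# The 14-parameter layout is a common 3D Gaussian splatting convention,
# used here for illustration only.
import numpy as np

def predict_gaussians(image: np.ndarray, net) -> dict:
    """One feed-forward pass: image (H, W, 3) -> N Gaussian primitives."""
    raw = net(image)                                     # (N, 14) raw parameters
    quat = raw[:, 6:10]
    return {
        "means":     raw[:, 0:3],                        # metric xyz centers
        "scales":    np.exp(raw[:, 3:6]),                # positive axis lengths
        "rotations": quat / np.linalg.norm(quat, axis=1, keepdims=True),
        "opacity":   1.0 / (1.0 + np.exp(-raw[:, 10])),  # sigmoid -> (0, 1)
        "colors":    raw[:, 11:14],                      # RGB (or SH coefficients)
    }

# Stand-in "network" just to show the shapes flowing through.
gaussians = predict_gaussians(np.zeros((512, 512, 3)),
                              net=lambda img: np.random.randn(100_000, 14))
```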
Protein folding with standard transformers
The fifth highlight is SimpleFold, which approaches protein folding, that is, predicting 3D atomic coordinates from an amino acid sequence, using a general-purpose architecture built from standard transformer blocks, similar in design to text-to-image or text-to-3D models, rather than the specialized, domain-specific architectures that have dominated this problem. The paper's position is that folding requires less architectural specialization than commonly assumed; the full details are available in the SimpleFold paper presented at ICLR.
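As a rough picture of what "standard transformer blocks" means here, consider this shape-level sketch. It is our illustration, not the SimpleFold model: the paper trains a generative model over structures, whereas this sketch regresses coordinates directly for brevity.

```python
# Shape-level sketch: amino-acid tokens -> standard transformer trunk ->
# per-residue 3D coordinates. No pair representations, no triangle attention.
import torch
import torch.nn as nn

class PlainFolder(nn.Module):
    def __init__(self, vocab: int = 21, dim: int = 512, heads: int = 8, depth: int = 12):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)            # 20 amino acids + pad
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, depth) # standard blocks only
        self.to_xyz = nn.Linear(dim, 3)                  # coordinates per residue

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        # seq: (B, L) token ids -> (B, L, 3) predicted coordinates
        return self.to_xyz(self.trunk(self.embed(seq)))

coords = PlainFolder()(torch.randint(0, 21, (2, 128)))  # (2, 128, 3)
```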
Apple is also showing local LLM inference on Apple silicon using MLX as part of its exhibition booth demonstrations.
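For readers who want to try local inference themselves, the open-source mlx-lm package provides a short path. The model name below is just one example from the MLX community collection, and the call pattern reflects recent mlx-lm releases:

```python
# Minimal local LLM inference on Apple silicon with mlx-lm.
from mlx_lm import load, generate

# Downloads a quantized model from the Hugging Face hub on first use.
model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")
print(generate(model, tokenizer,
               prompt="Explain state space models briefly.",
               max_tokens=200))
```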