Recurrent neural networks have a well-known efficiency advantage at inference: unlike transformers, where attention cost scales quadratically with sequence length, an RNN generates each token at constant cost regardless of how long the context is. That advantage has been difficult to exploit at scale because RNN training cannot be parallelized along the sequence dimension, since each step depends on the previous hidden state. Apple ML Research’s ParaRNN work, accepted as an Oral at ICLR 2026, changes that.

The framework achieves a 665x speedup over sequential RNN training and, for the first time, enables training classical nonlinear RNNs at 7 billion parameters, where they reach language modeling performance competitive with transformers.

Why the training bottleneck existed

Modern state space models like Mamba resolved the RNN training problem by making the recurrence linear in the hidden state. Linear recurrences are associative, which enables parallel scan algorithms: the same mathematical property that allows a cumulative sum to be computed in a tree-parallel structure rather than sequentially. This transforms an O(n) sequential computation into O(log n) parallel steps, so doubling the sequence length adds one more parallel step rather than doubling the work.
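The associativity argument can be made concrete with a toy sketch. For a diagonal linear recurrence h_t = a_t * h_{t-1} + b_t, each step is an affine map, and composing two affine maps yields another affine map; that composition rule is what a parallel scan reduces over. The snippet below (illustrative names only, not code from any released library) spells out the combine rule and checks it against the naive sequential loop:

```python
import numpy as np

def combine(left, right):
    # Compose two affine maps: applying (a1, b1) then (a2, b2) gives
    # h -> a2*(a1*h + b1) + b2 = (a2*a1)*h + (a2*b1 + b2).
    a1, b1 = left
    a2, b2 = right
    return a2 * a1, a2 * b1 + b2

def scan_linear_recurrence(a, b, h0):
    # Prefix-composition of the affine maps. Written as an O(n) loop for
    # clarity; because combine() is associative, a real implementation can
    # evaluate it in a tree with O(log n) parallel depth.
    acc = (np.ones_like(a[0]), np.zeros_like(b[0]))
    hs = []
    for t in range(len(a)):
        acc = combine(acc, (a[t], b[t]))
        hs.append(acc[0] * h0 + acc[1])
    return np.stack(hs)

rng = np.random.default_rng(0)
a = rng.normal(size=(8, 4)) * 0.5
b = rng.normal(size=(8, 4))
h0 = rng.normal(size=4)

# Reference: the plain sequential recurrence.
h, ref = h0.copy(), []
for t in range(8):
    h = a[t] * h + b[t]
    ref.append(h)
assert np.allclose(scan_linear_recurrence(a, b, h0), np.stack(ref))
```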

The cost of linearity is expressivity. A linear hidden state evolution covers a narrower range of dynamics than a nonlinear one, which constrains the model’s ability to track state and retrieve information on tasks that require it.

Classical RNNs — including GRU and LSTM — include nonlinearities in the recurrence. Those nonlinearities are precisely what breaks the associativity needed for parallel scan. The question ParaRNN answers is whether there is a way to train nonlinear RNNs in parallel without discarding the nonlinearity.
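For reference, here is where that nonlinearity sits in a standard GRU update (a minimal NumPy sketch with biases omitted, not the paper's formulation). The previous hidden state passes through sigmoid and tanh, so a step is not an affine function of h_prev, and two steps do not compose into a map of the same form:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h_prev, x, Wr, Ur, Wz, Uz, Wn, Un):
    # Standard GRU update. h_prev feeds into sigmoid and tanh, so the map
    # h_prev -> h_new is nonlinear, which breaks the associativity that a
    # parallel scan requires.
    r = sigmoid(x @ Wr + h_prev @ Ur)          # reset gate
    z = sigmoid(x @ Wz + h_prev @ Uz)          # update gate
    n = np.tanh(x @ Wn + (r * h_prev) @ Un)    # candidate state
    return (1.0 - z) * n + z * h_prev
```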

Newton’s method as the key

The approach reframes RNN training not as a sequential chain of steps but as a single system of equations, with the hidden states at all time steps as simultaneous unknowns. Newton’s method solves this system iteratively: at each iteration, the nonlinear equations are replaced with linear approximations built from their Jacobians.
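Concretely, for a recurrence h_t = f(h_{t-1}, x_t), all T hidden states are stacked into one unknown and the equations h_t - f(h_{t-1}, x_t) = 0 are solved jointly. A minimal sketch with a toy elementwise cell (a stand-in for GRU/LSTM, not the paper's cells) shows the residual system and the Jacobian that Newton's method linearizes with:

```python
import numpy as np

# Toy nonlinear cell: h_t = tanh(a * h_{t-1} + x_t), applied elementwise.

def f(h_prev, x, a):
    return np.tanh(a * h_prev + x)

def df_dh(h_prev, x, a):
    # Jacobian of f with respect to h_prev; diagonal for this elementwise cell.
    return (1.0 - np.tanh(a * h_prev + x) ** 2) * a

def residual(h, h0, x, a):
    # Stack the T equations h_t - f(h_{t-1}, x_t) = 0 into one system F(h) = 0,
    # with h holding all T hidden states as simultaneous unknowns.
    h_prev = np.concatenate([h0[None], h[:-1]], axis=0)
    return h - f(h_prev, x, a)
```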

The linearized system has the same form as a linear SSM, with Jacobians playing the role of state matrices. This means each Newton iteration can be solved using parallel scan — the same algorithm used for efficient SSM computation. The full nonlinear RNN behavior is recovered by iterating: each iteration refines the approximation, converging to the true nonlinear solution.
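Continuing the toy cell above (reusing f and df_dh): each Newton iteration linearizes f around the current guess, which turns the system into exactly the linear recurrence of the previous section, with the Jacobians J_t playing the role of (here diagonal) state matrices. The sketch below solves each linearized system with a plain loop where the paper would use the parallel scan, and the iterates converge to the sequential nonlinear solution:

```python
def newton_parallel_rnn(h0, x, a, num_iters=3):
    T = len(x)
    h = np.zeros_like(x)                     # initial guess for all hidden states
    for _ in range(num_iters):
        h_prev = np.concatenate([h0[None], h[:-1]], axis=0)
        J = df_dh(h_prev, x, a)              # diagonal Jacobians, one per step
        c = f(h_prev, x, a) - J * h_prev     # linearization offsets
        # Linearized system: h_t = J_t * h_{t-1} + c_t. Solved sequentially
        # here for clarity; this is the step replaced by a parallel scan.
        new_h, hp = [], h0
        for t in range(T):
            hp = J[t] * hp + c[t]
            new_h.append(hp)
        h = np.stack(new_h)
    return h

# Reference: the plain sequential nonlinear recurrence.
rng = np.random.default_rng(0)
x = rng.normal(size=(16, 4)) * 0.5
a, h0 = 0.5, np.zeros(4)
ref, hp = [], h0
for t in range(16):
    hp = f(hp, x[t], a)
    ref.append(hp)
print(np.max(np.abs(newton_parallel_rnn(h0, x, a) - np.stack(ref))))
```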

In practice, the researchers apply this to GRU and LSTM cells and observe convergence in three iterations. Three parallel SSM applications recover the same hidden state evolution as sequential nonlinear RNN computation, with a dramatic reduction in wall-clock training time.

Engineering for scale

Newton iterations introduce Jacobian matrices into the parallel reduction. For generic RNNs, these Jacobians are dense, making storage quadratic and multiplication cubic in hidden state size — intractable for large models.

The solution, drawing from design principles in modern SSMs, is to constrain the cells so their Jacobians have structured sparsity. The ParaGRU cell produces diagonal Jacobians; ParaLSTM produces block-diagonal Jacobians. Custom CUDA kernels implement the parallel reduction of these structured Jacobians, fusing Newton iterations, system assembly, and parallel reduction into a single kernel. The fully-fused implementation follows the GPU memory hierarchy to keep data local across the computation, achieving speedups over Mamba in timing comparisons shown in the paper.
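The effect of the structural constraint is easiest to see at the level of the scan's combine rule. Each scan element is a (Jacobian, offset) pair, and composing two elements multiplies the Jacobians: with dense Jacobians that is a d x d matrix product per combine, while with the diagonal Jacobians of a ParaGRU-style cell it collapses to elementwise products. A rough sketch of that contrast (illustrative only, not the fused CUDA kernels described above):

```python
import numpy as np

d = 4096  # hidden state size, for scale

def combine_dense(left, right):
    # Dense Jacobians: O(d^3) work per combine, O(d^2) memory per step.
    J1, b1 = left
    J2, b2 = right
    return J2 @ J1, J2 @ b1 + b2

def combine_diag(left, right):
    # Diagonal Jacobians: O(d) work and memory per combine.
    j1, b1 = left
    j2, b2 = right
    return j2 * j1, j2 * b1 + b2

print(f"per-step Jacobian entries, dense: {d*d:,}  diagonal: {d:,}")
```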

Three performance tiers are offered: pure PyTorch for prototyping, CUDA-accelerated reduction for generic cells with diagonal or block-diagonal Jacobians, and a fully-fused single-kernel implementation for production use.

Results at 7B parameters

The researchers trained models from 400M to 7B parameters on language modeling tasks. Both ParaGRU and ParaLSTM at 7B parameters reach perplexity and downstream task scores comparable to transformers and state-of-the-art SSMs.

The inference advantage holds. RNN constant-time token generation means throughput does not degrade with context length, which matters for applications where generation speed is the binding constraint.

Nonlinearity also shows a measurable benefit on tasks requiring state tracking and retrieval. Table 2 in the paper shows that nonlinear RNNs outperform their linear counterparts on these synthetic benchmarks, indicating that the additional expressivity translates to real capability differences rather than just a theoretical nicety.

Open-source framework

The ParaRNN codebase has been released as an open-source framework. Defining a custom cell requires inheriting from a base class and implementing a single recurrence-step method; the framework automatically applies Newton’s method, assembles the Jacobian system, and runs the parallel reduction. The modular design supports custom Jacobian structures and solver configurations.
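As a rough illustration of that pattern, a user-defined cell would only spell out the single-step update, with the hidden state entering elementwise so the Jacobian stays diagonal. The class and method names below are hypothetical, not the released ParaRNN API:

```python
import torch

class MinGRUCell(torch.nn.Module):
    # In the real framework this would inherit the library's base cell class
    # (name hypothetical); the framework would differentiate step() to obtain
    # the Jacobians, assemble the Newton system, and run the parallel reduction.
    def __init__(self, dim):
        super().__init__()
        self.wz = torch.nn.Linear(dim, dim)
        self.wn = torch.nn.Linear(dim, dim)

    def step(self, h_prev, x):
        # Single recurrence step; h_prev enters only elementwise, so the
        # Jacobian with respect to h_prev is diagonal.
        z = torch.sigmoid(self.wz(x))
        n = torch.tanh(self.wn(x) + h_prev)
        return (1 - z) * n + z * h_prev
```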

The paper’s first author is delivering an Expo Talk at ICLR 2026 alongside the oral presentation. The framing in the post is direct: the nonlinearity-versus-training-efficiency tradeoff was not fundamental — it was a consequence of computational limitations that are now removed.