LLMs are not being replaced — they are being rebuilt from the inside

The next major wave in AI is not a different technology; it is better LLMs. That is the framing in this MIT Technology Review overview, which surveys what the piece calls “LLMs+”: the architectural and efficiency improvements that are reshaping how large language models are built and deployed. The ambition, as described in the article, is to get LLMs to work autonomously, for extended periods and without going off the rails, through complex, multi-part problems that would take humans days or weeks to solve.

The piece, written by Will Douglas Heaven, identifies several approaches actively in development toward those goals.

Making models cheaper to run

The most widely adopted efficiency technique described in the article is mixture-of-experts (MoE). Rather than running the full model for every inference, MoE splits an LLM into smaller specialized components, each with expertise in a different type of task. Only some of those components need to be activated for any given input, which cuts the compute required per query.
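To make the idea concrete, here is a minimal sketch of sparse routing in plain Python. The article does not say how components are selected, so the top-k softmax gate used here is an assumption (it is one common formulation), and `moe_forward`, `experts`, and `gate_weights` are hypothetical names with toy values rather than anything from a real model.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, experts, gate_weights, k=2):
    """Route x to the k highest-scoring experts and mix their outputs.

    Assumption: the gate is a dot product between x and one learned
    vector per expert. Real systems differ in the details.
    """
    scores = [sum(w * xi for w, xi in zip(gw, x)) for gw in gate_weights]
    # Indices of the k experts with the highest gate scores.
    top = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:k]
    weights = softmax([scores[i] for i in top])
    out = [0.0] * len(x)
    for w, i in zip(weights, top):
        y = experts[i](x)  # only the selected experts are ever run
        out = [o + w * yi for o, yi in zip(out, y)]
    return out, top

# Toy setup: four "experts" that just scale the input, and a counter
# recording how often each one is actually executed.
calls = [0, 0, 0, 0]

def make_expert(index, scale):
    def expert(v):
        calls[index] += 1
        return [scale * vi for vi in v]
    return expert

experts = [make_expert(i, s) for i, s in enumerate((1.0, 2.0, 3.0, 4.0))]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

output, chosen = moe_forward([2.0, 1.0], experts, gate_weights, k=2)
```

With four experts and k=2, half of the toy model is skipped for this input, which is the source of the per-query compute saving the article describes.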

The article also describes a more experimental direction: replacing the transformer architecture — which currently underlies almost all LLMs — with diffusion models, the type of neural network more commonly used for image and video generation. This is presented as a live area of investigation rather than a settled transition.

A third approach the piece mentions came from DeepSeek: encoding text in images to cut computation costs, which the article says the company demonstrated last year.

The context window problem

A context window is, as Heaven describes it, “the amount of text (or video) that a model can take in at once, equivalent to its working memory.” The article notes that a couple of years ago, LLMs could handle several thousand tokens, roughly a few dozen pages. The latest models support context windows of up to a million tokens.

But longer context windows create their own reliability problems. The piece states that “the larger the context window and the longer the task, the more likely models are to go off the rails or forget what they were doing.”

Recursive LLMs as a reliability fix

The article describes one concrete research direction aimed at the reliability problem: recursive LLMs, from a paper by researchers at MIT CSAIL. Instead of feeding a massive context window to a single model instance, recursive LLMs break the input into chunks and send each chunk to a copy of the model. Those copies can in turn break their chunks down further and hand the pieces to additional copies. The article states that “multiple LLMs processing smaller pieces of information seem to be far more reliable for long, hard tasks.”
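The chunk-and-delegate pattern described above can be sketched in a few lines of Python. This is an illustration of the general idea only, not the method from the MIT CSAIL paper: the character-based splitting, the `fanout` parameter, the way partial results are merged, and the stand-in `fake_model` function are all assumptions made for the example.

```python
def recursive_process(text, call_model, max_chars=200, fanout=4):
    """Process text with model calls that each see a small input.

    Assumptions: input size is measured in characters rather than
    tokens, and partial results are merged by one more model call.
    """
    if len(text) <= max_chars:
        return call_model(text)  # small enough: handle directly
    # Split into roughly `fanout` equal pieces (ceiling division).
    size = -(-len(text) // fanout)
    chunks = [text[i:i + size] for i in range(0, len(text), size)]
    # Each chunk goes to another "copy", which may split it again.
    partials = [recursive_process(c, call_model, max_chars, fanout)
                for c in chunks]
    # Merge the partial results with one final call at this level.
    return call_model("\n".join(partials))

# Hypothetical stand-in for a real model: it records the size of each
# input it receives and "summarizes" by keeping the first 20 characters.
seen_lengths = []

def fake_model(prompt):
    seen_lengths.append(len(prompt))
    return prompt[:20]

result = recursive_process("x" * 1000, fake_model, max_chars=200, fanout=4)
```

In this toy run the 1,000-character input is never shown to any single call in full; every call receives at most 200 characters, at the cost of making many more calls.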

The piece frames this as a meaningful architectural departure: “The result is an LLM, but not as we know it.”

The article does not claim that any of these approaches has definitively solved the reliability problem. It frames MoE as already in production use, and presents diffusion-model alternatives and recursive architectures as earlier-stage bets.