Ai2 releases BAR, a modular post-training recipe for mixture-of-experts models

The Allen Institute for AI published a new post-training recipe called BAR — Branch-Adapt-Route — that allows language models to be updated by domain without retraining the full model. The Ai2 post describes the approach, releases the recipe and technical report, and makes available the checkpoints used to validate it.

The problem BAR addresses

After pretraining, language models go through multiple post-training stages to become useful, covering instruction following, reasoning, tool use, and safety constraints. Updating a model after these stages is difficult. The authors describe three options and their costs: retraining from scratch is reliable but expensive and requires the full original training setup; continued training on new data is cheaper but can degrade existing capabilities; and adjusting each stage in sequence to accommodate new skills is complex and fragile.

BAR’s premise is that these problems stem from training a single model on all data at once. The recipe instead trains independent domain experts — each through its own complete pipeline — and composes them into a unified model using a mixture-of-experts architecture.

How BAR works

The recipe has three stages.

In Stage 1, each domain expert is instantiated as a two-expert mixture-of-experts: one frozen “anchor” expert that preserves the base model’s feed-forward network weights, and one trainable expert. Each expert goes through whatever training stages its domain requires. In Ai2’s experiments, math and code experts go through mid-training, supervised fine-tuning, and reinforcement learning with verified rewards; tool use and safety experts use supervised fine-tuning only.
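
The branch structure in Stage 1 can be illustrated with a minimal Python sketch. It is not Ai2's implementation: the names Expert, TwoExpertBranch, branch_from_base, and apply_update are hypothetical, weights are reduced to flat lists, and initializing the trainable expert as a copy of the base feed-forward weights is an assumption the post does not confirm.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Expert:
    """Stand-in for one feed-forward network; weights are a flat list."""
    weights: List[float]
    trainable: bool


@dataclass
class TwoExpertBranch:
    anchor: Expert  # frozen, keeps the base model's feed-forward weights
    domain: Expert  # trainable, adapted to one domain


def branch_from_base(base_ffn_weights: List[float]) -> TwoExpertBranch:
    """Build a two-expert branch from the base feed-forward weights.

    Assumption: the trainable expert starts as a copy of the base weights.
    """
    return TwoExpertBranch(
        anchor=Expert(list(base_ffn_weights), trainable=False),
        domain=Expert(list(base_ffn_weights), trainable=True),
    )


def apply_update(branch: TwoExpertBranch, deltas: List[float]) -> None:
    """Add an update to every trainable expert; frozen experts are skipped."""
    for expert in (branch.anchor, branch.domain):
        if expert.trainable:
            expert.weights = [w + d for w, d in zip(expert.weights, deltas)]
```

Calling apply_update on a branch changes only the domain expert, which is the property the anchor is meant to guarantee.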

A key technical decision is a progressive unfreezing schedule for shared parameters across stages. During mid-training, all shared layers remain frozen — consistent with pretraining, where knowledge acquisition is captured by feed-forward network updates. During supervised fine-tuning, the embedding layer and language modeling head are unfrozen. The Ai2 post cites a concrete result illustrating why: without this unfreezing, a tool-use expert scored 20.3 on the Berkeley Function Calling Leaderboard; with it, the same expert reached 46.4. During reinforcement learning, all shared parameters including attention layers are unfrozen, because the distributional shifts induced by RL extend beyond what expert feed-forward networks alone can accommodate.
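
The schedule described above can be written down as a small lookup table. The sketch below is illustrative only: the stage names, the group names, and the trainable_groups helper are hypothetical labels, and "attention" stands in for whatever shared parameters remain beyond the embeddings and language modeling head.

```python
# Shared parameter groups unfrozen at each stage, per the schedule in the post.
# Group and stage names are hypothetical labels chosen for this sketch.
UNFROZEN_SHARED = {
    "mid_training": set(),                       # all shared layers frozen
    "sft": {"embeddings", "lm_head"},            # embeddings and LM head unfrozen
    "rl": {"embeddings", "lm_head", "attention"},  # all shared parameters unfrozen
}


def trainable_groups(stage: str) -> set:
    """Return the parameter groups that receive updates in the given stage.

    The domain expert's feed-forward network trains in every stage; the
    anchor expert never appears here because it stays frozen throughout.
    """
    if stage not in UNFROZEN_SHARED:
        raise ValueError(f"unknown stage: {stage!r}")
    return {"domain_expert_ffn"} | UNFROZEN_SHARED[stage]
```

In a real training loop, a table like this would typically drive which parameters have gradients enabled before each stage begins.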

Each expert also trains on a mixture of domain-specific and general supervised fine-tuning data. The post states this is critical: domain-only supervised fine-tuning produces strong in-domain performance but degrades general capabilities such as instruction following.

In Stage 2, all experts are merged into a single mixture-of-experts model. Shared parameters that diverged across expert runs are averaged. The post states this averaging introduces “little to no measurable performance loss on domain-specific evaluations compared to any individual expert.”
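
A rough sketch of the averaging step follows. The post says only that diverged shared parameters are averaged; the uniform, unweighted mean used here is an assumption, as are the average_shared helper and the representation of each run as a dictionary of flat weight lists.

```python
from typing import Dict, List


def average_shared(expert_runs: List[Dict[str, List[float]]]) -> Dict[str, List[float]]:
    """Average same-named shared parameters element-wise across expert runs.

    Assumption: every run holds the same parameter names and shapes, and
    each run contributes equally to the mean.
    """
    if not expert_runs:
        raise ValueError("need at least one expert run")
    merged = {}
    for name in expert_runs[0]:
        # zip(*...) walks the runs in parallel, one weight position at a time.
        columns = zip(*(run[name] for run in expert_runs))
        merged[name] = [sum(col) / len(expert_runs) for col in columns]
    return merged
```

Only the shared parameters pass through this function; in the merged model each domain's expert feed-forward network is kept as its own expert rather than averaged.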

In Stage 3, a router is trained with all experts and shared weights frozen. Ai2 found that a stratified five percent sample of supervised fine-tuning data was sufficient for effective routing.
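
One way to draw such a sample is sketched below. The post does not say which attribute the sample is stratified on; grouping by domain is an assumption, and stratified_sample with its key, fraction, and seed parameters is a hypothetical helper.

```python
import random
from collections import defaultdict


def stratified_sample(examples, key, fraction=0.05, seed=0):
    """Sample roughly `fraction` of the examples from each stratum.

    `key` maps an example to its stratum (assumed here to be the domain).
    Every stratum contributes at least one example.
    """
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for example in examples:
        by_group[key(example)].append(example)

    sample = []
    for group in sorted(by_group):  # sorted for a reproducible order
        items = by_group[group]
        k = max(1, round(len(items) * fraction))
        sample.extend(rng.sample(items, k))
    return sample
```

Sampling within each stratum keeps smaller domains represented in the router's training data, which a plain random five percent draw would not guarantee.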

Performance

Ai2’s models operate at the 7B-parameter scale, with experts for math, code, tool use, and safety trained on top of a fully post-trained OLMo 2 base. The post compares BAR against six baselines across 19 benchmarks in seven evaluation categories.

BAR outperforms every baseline that does not require rerunning mid-training from scratch: it scores 49.1 overall, against 47.8 for the next-best one, which retrains only the post-training stages. The gains are largest in math, where BAR leads by 7.8 points, and in code, where it leads by 4.7 points. The post attributes this to a structural property of modular training: in a monolithic pipeline, late-stage RL on math and code can degrade safety capabilities learned in earlier supervised fine-tuning stages; modular training avoids this because each domain’s pipeline is isolated.

Merging dense models after mid-training largely fails. The post reports that mid-training causes models to diverge enough that naive weight averaging produces a nearly non-functional model scoring 6.5 overall. BTX, a technique that trains each expert as a fully independent dense model, also underperforms BAR (46.7 vs. 49.1), because training without shared parameters leads to greater divergence, which makes routing harder.

Full retraining with mid-training remains the performance ceiling at 50.5, but the post notes this requires access to the original pretraining checkpoint and full reprocessing — impractical for most open-weight models.
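
The overall scores quoted in this section can be set side by side with a few lines of Python. The numbers are the ones reported above; the dictionary labels and the gap_to_bar helper are shorthand invented for this sketch.

```python
# Overall scores as quoted in this article; labels are informal shorthand.
REPORTED_OVERALL = {
    "full retraining with mid-training": 50.5,
    "BAR": 49.1,
    "post-training-only retraining": 47.8,
    "BTX": 46.7,
    "naive dense merging after mid-training": 6.5,
}


def gap_to_bar(name: str) -> float:
    """Return a method's overall score minus BAR's, rounded to one decimal."""
    return round(REPORTED_OVERALL[name] - REPORTED_OVERALL["BAR"], 1)
```

On these figures BAR sits 1.4 points below the full-retraining ceiling, 1.3 points above the post-training-only baseline, and 2.4 points above BTX.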

Modular upgrades

The post demonstrates two types of independent expert upgrades. Replacing a code expert with one trained on higher-quality data and RL improved code performance by 16.5 points in the combined model, with other domains “essentially unchanged.” Adding RL on top of an existing math expert’s supervised fine-tuning improved math by 13 points, again with minimal impact elsewhere.

The post notes that in either case only the affected expert and the lightweight router require retraining. In a monolithic pipeline, either upgrade would require retraining the full model across all domains.

Earlier work and context

BAR extends Ai2’s earlier FlexOlmo work, which demonstrated that modular mixture-of-experts training works for pretraining. The post says the FlexOlmo recipe — freezing all shared layers — did not transfer to post-training. During reinforcement learning with verified rewards, the post states, “the reward curve was completely flat; the model simply could not learn with all shared parameters frozen.” The progressive unfreezing schedule in BAR was developed in response.

Ai2 released the recipe, technical report, and checkpoints alongside the post.