Google researchers have published a paper in Transactions on Machine Learning Research describing Simula, a framework for generating synthetic training data at scale. The framing distinguishes it from existing approaches: rather than optimizing individual data points through manual prompts or evolutionary algorithms, Simula treats the entire dataset as the unit of design. According to the research post, current generation methods “typically operate at the sample level — optimizing one data point at a time — rather than designing the dataset as a whole.”

The practical motivation is that the next wave of specialized AI — in domains like security, law, and medicine — requires data that will not exist at sufficient scale from human sources. Real-world data for these domains is expensive to collect, slow to update, difficult to annotate, and constrained by privacy. Simula is presented as an alternative that treats data generation as a “programmable workflow” where data is versioned, reproducible, and inspectable, much like code.

Four controllable axes

Simula decomposes data generation into four distinct steps, each independently controllable.

The first is global diversification. Instead of random sampling, reasoning models map a target domain into deep, hierarchical taxonomies. These act as a “sampling scaffold” that defines coverage across the long tail of a domain rather than clustering around common modes. A cybersecurity dataset would not just repeat common SQL injection examples — the taxonomy ensures the full conceptual space is represented.
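The taxonomy-as-scaffold idea can be illustrated with a minimal sketch. The post does not publish Simula's implementation, so the data layout and names below are hypothetical: a nested taxonomy is enumerated down to its leaf concepts, and each leaf receives an explicit sampling budget, so rare concepts are covered instead of being drowned out by common modes.

```python
# Hypothetical sketch only; Simula's actual taxonomy format is not published.
from itertools import chain

def leaf_paths(taxonomy, prefix=()):
    """Enumerate every leaf concept in a nested taxonomy as a path of names."""
    if not taxonomy:  # an empty dict marks a leaf concept
        return [prefix]
    return list(chain.from_iterable(
        leaf_paths(sub, prefix + (name,)) for name, sub in taxonomy.items()))

# A tiny slice of a cybersecurity taxonomy (illustrative content).
taxonomy = {
    "web": {"sql_injection": {}, "xss": {}, "ssrf": {}},
    "crypto": {"padding_oracle": {}, "weak_rng": {}},
}

# Every leaf gets a sampling budget, so long-tail concepts like weak_rng
# are guaranteed coverage, not just the common modes a random sampler hits.
budget_per_leaf = 4
plan = {path: budget_per_leaf for path in leaf_paths(taxonomy)}
```

The point of the scaffold is that coverage is decided up front by the structure, not left to the sampling distribution of the generator.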

The second is local diversification. For each node in the taxonomy, the system generates “meta-prompts” — scenario descriptions — and then produces multiple distinct instantiations of each scenario. The post describes this as preventing mode collapse, ensuring that a specific concept is represented through diverse framings rather than identical repetitions.
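A toy sketch of the meta-prompt step, with placeholder functions standing in for the reasoning-model calls the post describes (the names are illustrative, not Simula's API):

```python
import random

def generate_scenario(concept: str) -> str:
    # Placeholder for a reasoning-model call that writes a meta-prompt
    # (a scenario description) for one taxonomy node.
    return f"Write a question testing understanding of {concept}."

def instantiate(meta_prompt: str, n: int, seed: int = 0) -> list[str]:
    # Placeholder: a real system would ask the model for n distinct framings;
    # here distinct contexts stand in for that variation.
    rng = random.Random(seed)
    settings = ["a bank's API", "an IoT firmware image",
                "a hospital patient portal", "a CI pipeline"]
    return [f"{meta_prompt} Context: {s}." for s in rng.sample(settings, n)]

samples = instantiate(generate_scenario("SQL injection"), n=3)
```

The structural point is the two-level fan-out: one scenario per node, then several distinct instantiations per scenario, which is what keeps a concept from collapsing into identical repetitions.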

The third axis is complexification. Complexity is treated as orthogonal to semantic coverage. A configurable fraction of meta-prompts is refined to be more elaborate or difficult, shifting the difficulty distribution of the final dataset without changing what topics it covers.
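Because complexification is orthogonal to coverage, it can be sketched as a pass that rewrites only a configurable fraction of the meta-prompts. `refine` here is a stand-in for another model call, and all names are hypothetical:

```python
import random

def complexify(prompts, fraction=0.3, seed=0):
    """Refine a configurable fraction of prompts to be harder, leaving
    the topical coverage of the set unchanged."""
    rng = random.Random(seed)
    k = round(len(prompts) * fraction)
    chosen = set(rng.sample(range(len(prompts)), k))
    # In a real pipeline this would be a model rewrite; a suffix marks it here.
    refine = lambda p: p + " Require multi-step reasoning and an edge case."
    return [refine(p) if i in chosen else p for i, p in enumerate(prompts)]

prompts = [f"prompt-{i}" for i in range(10)]
out = complexify(prompts, fraction=0.3)
```

Tuning `fraction` shifts the difficulty distribution without adding or removing topics, which is exactly the orthogonality the post claims.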

The fourth is quality checking. In a “dual-critic” loop, two independent model critics each assess whether a generated answer is correct. The post describes using two independent critics as a way of mitigating sycophancy — the tendency of models to agree with plausible-sounding outputs — and of ensuring high-quality labels without human intervention.
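A minimal sketch of the acceptance rule, with toy critics standing in for two independently prompted model calls (illustrative only; Simula's actual critic prompts are not published):

```python
def dual_critic_accept(question, answer, critic_a, critic_b):
    """Accept a sample only when two independent critics both judge the
    answer correct; any disagreement rejects it for regeneration."""
    return critic_a(question, answer) and critic_b(question, answer)

# Toy critics standing in for separate model calls with distinct prompts.
critic_strict = lambda q, a: a.strip() != "" and "TODO" not in a
critic_lenient = lambda q, a: len(a) > 3
```

Requiring agreement means a single sycophantic critic cannot wave a wrong answer through; both would have to fail in the same way.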

Evaluating synthetic data quality

The post describes a separate evaluation challenge: standard metrics like embedding-based cosine distance provide a signal but limited actionable insight. Simula applies a reasoning-first approach to evaluation as well, introducing two custom metrics. Taxonomic Coverage measures how well the generated dataset represents the conceptual space defined by the taxonomy. Calibrated Complexity Scoring uses LLM-driven batch comparisons to assign chess-style Elo ratings to individual data points, providing a calibrated difficulty ordering without requiring exhaustive pairwise comparisons.
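The Elo mechanics the post alludes to can be shown with the standard update rule. How Simula batches its comparisons is not specified, so the snippet below only illustrates a single judged pair, with hypothetical names:

```python
def elo_update(r_a, r_b, a_wins, k=32):
    """Standard Elo update; 'a_wins' means the judge ranked sample A harder."""
    expect_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    score = 1.0 if a_wins else 0.0
    return r_a + k * (score - expect_a), r_b + k * (expect_a - score)

ratings = {"s1": 1000.0, "s2": 1000.0}
# One LLM-judged comparison says sample s1 is harder than s2:
ratings["s1"], ratings["s2"] = elo_update(ratings["s1"], ratings["s2"], a_wins=True)
```

Because Elo only needs the outcomes of sampled comparisons, a calibrated difficulty ordering emerges without judging every pair exhaustively, which is the efficiency the metric is credited with.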

No universal recipe

Simula was evaluated using Gemini 2.5 Flash as a teacher model and Gemma-3 4B as the student, across five benchmark tasks spanning four domains: cybersecurity (CTI-MCQ and CTI-RCM from CTIBench), legal reasoning (LEXam), grade-school math (GSM8k), and multilingual academic knowledge (Global MMLU). Datasets of up to 512K data points were generated for each domain.

The results showed no single optimal configuration. High complexity yielded a 10% accuracy gain in math reasoning on GSM8k. The same approach hurt performance on the legal reasoning benchmark LEXam, where the teacher model was weaker. The post draws a direct conclusion: data must be tailored to the capabilities of the model consuming it, and there are no fixed recipes transferable across domains.

Across all five tasks, the full Simula system — combining global coverage, local diversity, and critic-based quality checking — consistently outperformed simpler baselines. The post frames this as “quality is the new quantity”: Simula achieved higher downstream performance with fewer samples than baseline approaches, suggesting scaling laws are driven by data properties rather than raw volume.

Production deployment

Simula is described as already deployed across several Google products. Within the Gemma ecosystem, it has been a key enabler for ShieldGemma, FunctionGemma, and MedGemma, and provides the primary synthetic data backbone for both on-device and server-side Gemini safety classifiers. Beyond foundation models, the post credits Simula with shipping AI-powered scam detection for Android calls and spam filtering in Google Messages.

Further applications mentioned include synthesizing realistic attack scenarios for enterprise security research and generating structured training data for teaching AI models to read maps.

The post is explicit that Simula is not a research prototype. The production deployments listed span consumer features, safety infrastructure, and specialized models — a deployment footprint that grounds the framework’s claims in observable outcomes rather than benchmark performance alone. The core observation — that mechanism design applied to dataset creation produces more controllable and higher-quality results than sample-level generation — will be worth watching as other labs grapple with data scarcity in specialized domains.