Google DeepMind has released Gemini 3.1 Flash TTS, a text-to-speech model the lab describes as its most natural and expressive to date. The model is rolling out in preview to developers via the Gemini API and Google AI Studio, to enterprises via Vertex AI, and to Workspace users through Google Vids. According to the announcement, it scored an Elo of 1,211 on the Artificial Analysis TTS leaderboard, which captures thousands of blind human preferences across models.

Artificial Analysis has also placed Gemini 3.1 Flash TTS in what it calls its “most attractive quadrant,” citing an ideal blend of high-quality speech generation and low cost. The model supports native multi-speaker dialogue, more than 70 languages, and granular creative control through natural language instructions.

Audio tags: directing AI speech like a scene

The most distinctive new capability is audio tags — a mechanism for embedding natural language directions directly into the text input to shape how the speech is rendered. Rather than adjusting sliders for pitch or speed after the fact, a developer or creator can describe the desired vocal qualities inline and have the model respond to those cues during generation.
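The announcement does not specify a tag syntax. As an illustration only, assuming a bracketed inline form, directing a single line might look like this (the `tag` helper and the bracket convention are assumptions, not the documented format):

```python
# Hypothetical sketch of inline audio tags: a delivery direction embedded
# directly in the text sent to the model. The bracket syntax is an
# assumption for illustration; the announcement names no concrete format.

def tag(direction: str, text: str) -> str:
    """Prefix a span of dialogue with an inline delivery direction."""
    return f"[{direction}] {text}"

line = tag("whispering, slightly out of breath",
           "They're right behind us. Keep moving.")
# The tagged string is passed as ordinary input; the model would treat the
# bracketed direction as a performance cue rather than words to speak.
print(line)
```

The point of the inline form is that direction and dialogue travel together in one string, so no separate control channel is needed.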

The post describes a multi-layer system for creative control in Google AI Studio. Scene direction lets developers set an environment and provide dialogue instructions, giving characters context that helps them stay “in-character” and react to each other across multiple turns. Speaker-level specificity lets developers assign unique Audio Profiles to characters, set Director’s Notes to control pace, tone, and accent at a high level, and then override those settings mid-sentence using inline tags for moment-to-moment expression.
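The layered model described above can be sketched with plain Python dicts. Every field name here (`scene`, `audio_profile`, `directors_notes`) is an assumption for illustration, not the actual AI Studio or API schema:

```python
# Illustrative sketch of the three control layers: scene direction shared by
# all speakers, per-speaker Audio Profiles and Director's Notes, and inline
# tags that override those settings mid-turn. All field names are assumed.

session = {
    # Layer 1: scene direction shared by all speakers.
    "scene": "A cramped submarine control room during an emergency dive.",
    # Layer 2: per-speaker Audio Profiles and Director's Notes.
    "speakers": {
        "Captain": {
            "audio_profile": "low, gravelly, middle-aged",
            "directors_notes": "calm authority, measured pace, slight accent",
        },
        "Ensign": {
            "audio_profile": "young, bright, tenor",
            "directors_notes": "nervous energy, fast pace",
        },
    },
    # Layer 3: dialogue turns, with inline tags for moment-to-moment overrides.
    "turns": [
        ("Captain", "Take us down. [suddenly sharp] Now, Ensign."),
        ("Ensign", "[voice cracking] Aye, sir. Diving."),
    ],
}

def render_prompt(s: dict) -> str:
    """Flatten the layered settings into a single text prompt."""
    lines = [f"Scene: {s['scene']}"]
    for name, cfg in s["speakers"].items():
        lines.append(f"{name} ({cfg['audio_profile']}; {cfg['directors_notes']})")
    lines += [f"{who}: {text}" for who, text in s["turns"]]
    return "\n".join(lines)
```

The design point is precedence: scene context applies to everyone, per-speaker notes narrow it, and an inline tag wins for the span it covers.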

Once a configuration is finalized, the post notes that “these exact parameters can be exported as Gemini API code to ensure consistent, recognizable voices across various projects and platforms.” That portability is relevant for production use, where reproducibility matters.
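The post does not show what the exported code looks like. One way to see why exported parameters buy reproducibility is a plain serialization round trip: whatever the export format, a project that loads the saved blob reconstructs byte-identical settings. The field names and model ID string below are illustrative, not the actual export schema:

```python
# Reproducibility via a JSON round trip: persist the finalized voice
# parameters verbatim, reload them anywhere, get identical settings.
# Field names and the model ID string are assumptions for illustration.
import json

finalized = {
    "model": "gemini-3.1-flash-tts",  # illustrative ID, per the model's name
    "voice": {
        "audio_profile": "low, gravelly",
        "directors_notes": "calm, measured",
    },
    "language": "en-US",
}

# Export once, at the end of iteration in the Studio playground.
exported = json.dumps(finalized, sort_keys=True, indent=2)

# Any other project or platform restores the exact same parameters.
restored = json.loads(exported)
assert restored == finalized
```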

The announcement says early developer and enterprise testers have highlighted the audio tags as providing “a new level of creative precision, transforming simple text into a high-fidelity vocal performance.”

Global reach and language support

The post describes Gemini 3.1 Flash TTS as built for global scale. Support for more than 70 languages includes what the announcement calls “core optimizations” that bring “advanced style, pacing and accent control to major markets.” The framing suggests this is not simply a list of languages the model can speak, but an attempt to make the creative control features, not just basic intelligibility, available across those language contexts.

The multi-speaker capability is native: multiple distinct speakers in a single audio output are handled inside one model call rather than assembled from separate API calls and stitched together in post-processing. For applications built around dialogue, such as conversational agents, narrated content, and interactive fiction, this matters for both quality and latency.
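The practical consequence is that an entire dialogue can travel as one request. A minimal sketch, assuming a "Speaker: line" transcript convention and a hypothetical request shape (the actual schema is not specified in the announcement):

```python
# Because multi-speaker synthesis is native, a whole dialogue goes out as a
# single request instead of one call per speaker plus audio stitching.
# The "Speaker: line" transcript format and request dict are assumptions.

def build_dialogue_request(turns: list[tuple[str, str]]) -> dict:
    """Pack a multi-speaker dialogue into one request body."""
    transcript = "\n".join(f"{speaker}: {line}" for speaker, line in turns)
    speakers = sorted({speaker for speaker, _ in turns})
    return {"input": transcript, "speakers": speakers}

req = build_dialogue_request([
    ("Host", "Welcome back to the show."),
    ("Guest", "Great to be here."),
    ("Host", "Let's dive in."),
])
# One request covers all three turns and both speakers, so latency is one
# round trip and speaker-to-speaker pacing stays under the model's control.
```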

SynthID watermarking across all output

All audio generated by Gemini 3.1 Flash TTS is watermarked with SynthID, Google DeepMind’s AI-content detection technology. The post describes the watermark as “imperceptible” and “interwoven directly into the audio output,” designed to allow reliable detection of AI-generated audio and help prevent misinformation. A model card is referenced for details on the safety and responsibility approach.

SynthID appearing as a default — not an opt-in — is worth noting. Every piece of audio the model generates carries the watermark regardless of whether the developer specifically enables it. For enterprises building applications where the provenance of audio may matter legally or editorially, this is a meaningful default.

The combination of high benchmark scores, native multi-speaker dialogue, per-speaker and per-sentence creative direction, and embedded watermarking makes this a more complete product offering than a model benchmark alone would suggest. The developer experience tooling — inline audio tags, exportable API parameters, a dedicated Studio playground — indicates this is aimed at production use cases, not just demonstrations.