Google DeepMind has released Gemini 3.1 Flash TTS, a text-to-speech model the lab describes as its most natural and expressive to date. The model is rolling out in preview to developers via the Gemini API and Google AI Studio, to enterprises via Vertex AI, and to Workspace users through Google Vids. According to the announcement, it scored an Elo of 1,211 on the Artificial Analysis TTS leaderboard, which captures thousands of blind human preferences across models.

Artificial Analysis has also placed Gemini 3.1 Flash TTS in what it calls its “most attractive quadrant,” citing an ideal blend of high-quality speech generation and low cost. The model supports native multi-speaker dialogue, more than 70 languages, and granular creative control through natural language instructions.

Audio tags: directing AI speech like a scene

The most distinctive new capability is audio tags — a mechanism for embedding natural language directions directly into the text input to shape how the speech is rendered. Rather than adjusting sliders for pitch or speed after the fact, a developer or creator can describe the desired vocal qualities inline and have the model respond to those cues during generation.
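The announcement does not specify a tag syntax. As an illustration only, assuming a bracketed inline form, directing a single line might look like this (the `tag` helper and the bracket convention are assumptions, not the documented format):

```python
# Hypothetical sketch of inline audio tags: a delivery direction embedded
# directly in the text sent to the model. The bracket syntax is an
# assumption for illustration; the announcement names no concrete format.

def tag(direction: str, text: str) -> str:
    """Prefix a span of dialogue with an inline delivery direction."""
    return f"[{direction}] {text}"

line = tag("whispering, slightly out of breath",
           "They're right behind us. Keep moving.")
# The tagged string is passed as ordinary input; the model would treat the
# bracketed direction as a performance cue rather than words to speak.
print(line)
```

The point of the inline form is that direction and dialogue travel together in one string, so no separate control channel is needed.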

The post describes a multi-layer system for creative control in Google AI Studio. Scene direction lets developers set an environment and provide dialogue instructions, giving characters context that helps them stay “in-character” and react to each other across multiple turns. Speaker-level specificity lets developers assign unique Audio Profiles to characters, set Director’s Notes to control pace, tone, and accent at a high level, and then override those settings mid-sentence using inline tags for moment-to-moment expression.
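The layered model described above can be sketched with plain Python dicts. Every field name here (`scene`, `audio_profile`, `directors_notes`) is an assumption for illustration, not the actual AI Studio or API schema:

```python
# Illustrative sketch of the three control layers: scene direction shared by
# all speakers, per-speaker Audio Profiles and Director's Notes, and inline
# tags that override those settings mid-turn. All field names are assumed.

session = {
    # Layer 1: scene direction shared by all speakers.
    "scene": "A cramped submarine control room during an emergency dive.",
    # Layer 2: per-speaker Audio Profiles and Director's Notes.
    "speakers": {
        "Captain": {
            "audio_profile": "low, gravelly, middle-aged",
            "directors_notes": "calm authority, measured pace, slight accent",
        },
        "Ensign": {
            "audio_profile": "young, bright, tenor",
            "directors_notes": "nervous energy, fast pace",
        },
    },
    # Layer 3: dialogue turns, with inline tags for moment-to-moment overrides.
    "turns": [
        ("Captain", "Take us down. [suddenly sharp] Now, Ensign."),
        ("Ensign", "[voice cracking] Aye, sir. Diving."),
    ],
}

def render_prompt(s: dict) -> str:
    """Flatten the layered settings into a single text prompt."""
    lines = [f"Scene: {s['scene']}"]
    for name, cfg in s["speakers"].items():
        lines.append(f"{name} ({cfg['audio_profile']}; {cfg['directors_notes']})")
    lines += [f"{who}: {text}" for who, text in s["turns"]]
    return "\n".join(lines)
```

The design point is precedence: scene context applies to everyone, per-speaker notes narrow it, and an inline tag wins for the span it covers.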

Once a configuration is finalized, the post notes that “these exact parameters can be exported as Gemini API code to ensure consistent, recognizable voices across various projects and platforms.” That portability is relevant for production use, where reproducibility matters.
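The post does not show what the exported code looks like. One way to see why exported parameters buy reproducibility is a plain serialization round trip: whatever the export format, a project that loads the saved blob reconstructs byte-identical settings. The field names and model ID string below are illustrative, not the actual export schema:

```python
# Reproducibility via a JSON round trip: persist the finalized voice
# parameters verbatim, reload them anywhere, get identical settings.
# Field names and the model ID string are assumptions for illustration.
import json

finalized = {
    "model": "gemini-3.1-flash-tts",  # illustrative ID, per the model's name
    "voice": {
        "audio_profile": "low, gravelly",
        "directors_notes": "calm, measured",
    },
    "language": "en-US",
}

# Export once, at the end of iteration in the Studio playground.
exported = json.dumps(finalized, sort_keys=True, indent=2)

# Any other project or platform restores the exact same parameters.
restored = json.loads(exported)
assert restored == finalized
```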

The announcement says early developer and enterprise testers have highlighted the audio tags as providing “a new level of creative precision, transforming simple text into a high-fidelity vocal performance.”

Global reach and language support

The post describes Gemini 3.1 Flash TTS as built for global scale. Support for more than 70 languages includes what the announcement calls “core optimizations” that bring “advanced style, pacing and accent control to major markets.” The framing suggests this is not simply a list of languages the model can speak, but an attempt to make the creative control features, not just basic intelligibility, available across those language contexts.

The multi-speaker capability is native: multiple distinct speakers in a single audio output are handled inside one model call rather than assembled from separate API calls and stitched together in post-processing. For applications built around dialogue, such as conversational agents, narrated content, and interactive fiction, this matters for both quality and latency.
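The practical consequence is that an entire dialogue can travel as one request. A minimal sketch, assuming a "Speaker: line" transcript convention and a hypothetical request shape (the actual schema is not specified in the announcement):

```python
# Because multi-speaker synthesis is native, a whole dialogue goes out as a
# single request instead of one call per speaker plus audio stitching.
# The "Speaker: line" transcript format and request dict are assumptions.

def build_dialogue_request(turns: list[tuple[str, str]]) -> dict:
    """Pack a multi-speaker dialogue into one request body."""
    transcript = "\n".join(f"{speaker}: {line}" for speaker, line in turns)
    speakers = sorted({speaker for speaker, _ in turns})
    return {"input": transcript, "speakers": speakers}

req = build_dialogue_request([
    ("Host", "Welcome back to the show."),
    ("Guest", "Great to be here."),
    ("Host", "Let's dive in."),
])
# One request covers all three turns and both speakers, so latency is one
# round trip and speaker-to-speaker pacing stays under the model's control.
```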

SynthID watermarking across all output

All audio generated by Gemini 3.1 Flash TTS is watermarked with SynthID, Google DeepMind’s AI-content detection technology. The post describes the watermark as “imperceptible” and “interwoven directly into the audio output,” designed to allow reliable detection of AI-generated audio and help prevent misinformation. A model card is referenced for details on the safety and responsibility approach.

SynthID appearing as a default — not an opt-in — is worth noting. Every piece of audio the model generates carries the watermark regardless of whether the developer specifically enables it. For enterprises building applications where the provenance of audio may matter legally or editorially, this is a meaningful default.

The combination of high benchmark scores, native multi-speaker dialogue, per-speaker and per-sentence creative direction, and embedded watermarking makes this a more complete product offering than a model benchmark alone would suggest. The developer experience tooling — inline audio tags, exportable API parameters, a dedicated Studio playground — indicates this is aimed at production use cases, not just demonstrations.