Voice Output (TTS)
The “voice output” layer is what speaks the assistant’s replies back to the user out loud. It runs entirely on-device — text never leaves the machine for synthesis.
When the user asks “how do I change my voice”, “turn off voice output”, “why does it sound robotic”, or anything else about how the assistant sounds, this is the topic.
What’s available
Voice output (on/off). The master switch. When off, the assistant still answers, just silently, and the reply text is delivered through the drawer / paste flow only. When on, the active TTS engine speaks the reply through the active voice. This setting is managed in the Dashboard, not through MCP.
TTS engine. Pocket TTS, on-device. Fast, lightweight, low memory, ships with a curated set of character voices that map cleanly to Voice Mode’s personas. There is no engine choice to make — voice output is always Pocket.
Voice. A Pocket voice identifier (e.g. alba). Personas can override
the voice on a per-persona basis — see personas.md. Agents may pass a
one-call voice override to speak, but MCP does not persistently change
the user’s configured voice.
Smart punctuation. When on, the TTS engine infers commas and pauses from prosody rather than rendering only the punctuation in the input text. Generally improves naturalness for free-form answers. Most users want this on. Turn it off if the engine over-corrects on technical content (e.g. inserting pauses inside a function name or URL).
Text normalization (expanding numbers, units, and abbreviations into spelled-out forms) is always on and not user-configurable.
Calling speak from an agent
Pass the entire reply in a single speak call. Do not pre-split the
text into chunks — Voice Mode serializes calls into a FIFO queue, so
multiple back-to-back speak calls play one after another, but each
call still adds an audible gap. One call → one smooth utterance.
If you need to stream or split a reply anyway, pass the same stable
identifier for every chunk that belongs to one user-facing answer.
Voice Mode groups those chunks together in the UI and, for free users,
dedupes chunks with the same peer + identifier for a short window, so one
answer counts as one Pro use. Use a new identifier for a new answer.
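As a rough illustration of the stable-identifier rule, the sketch below builds one speak payload per chunk and stamps every chunk of the same answer with the same identifier. The payload keys `text` and `identifier` and the helper name `build_speak_calls` are assumptions made for this example; check the speak tool's actual schema for the real parameter names.

```python
import uuid
from typing import Dict, List, Optional


def build_speak_calls(chunks: List[str], answer_id: Optional[str] = None) -> List[Dict[str, str]]:
    """Build one hypothetical speak payload per non-empty chunk.

    Every chunk of one user-facing answer shares the same identifier.
    The keys "text" and "identifier" are illustrative, not the real schema.
    """
    # Generate a fresh identifier when the caller starts a new answer.
    answer_id = answer_id or uuid.uuid4().hex
    return [
        {"text": chunk, "identifier": answer_id}
        for chunk in chunks
        if chunk.strip()  # skip empty or whitespace-only chunks
    ]


first_answer = build_speak_calls(["Build finished.", "Two tests failed."], "answer-1")
next_answer = build_speak_calls(["Re-running the failed tests now."])
```

Calling the helper without an `answer_id` produces a fresh identifier, which matches the rule that a new answer gets a new identifier.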
If you do call speak more than once (e.g. across separate replies in
the same session), each new call appends to the queue. The agent isn’t
blocked on playback — speak returns as soon as the text is queued.
If Voice Output is disabled in the Dashboard, speak returns queued: false with a warning and does not play audio. Do not retry, and do not
try to enable Voice Output through MCP. Ask the user to enable Voice
Output in the Dashboard if they want spoken replies.
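One way an agent might act on this is sketched below: inspect the speak result once, never retry, and produce a short note for the user when nothing was queued. The `queued` field comes from the behavior described above; the `warning` key and the function name `interpret_speak_result` are assumptions for illustration only.

```python
from typing import Optional, Tuple


def interpret_speak_result(result: dict) -> Tuple[bool, Optional[str]]:
    """Return (spoken, note_for_user) for a speak result.

    Assumes the result is a dict with a boolean "queued" field and,
    hypothetically, a "warning" string when nothing was queued.
    """
    if result.get("queued") is False:
        # Nothing will play. Do not retry and do not try to change the
        # setting through MCP; just tell the user where the switch is.
        warning = result.get("warning") or "Voice Output is turned off."
        note = f"{warning} You can enable Voice Output in the Dashboard if you want spoken replies."
        return False, note
    return True, None


spoken, note = interpret_speak_result({"queued": False, "warning": "Voice Output is disabled."})
```

The note is meant to be included in the agent's normal text response, since that is the only channel left when audio is off.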
The user always wins: if they start dictating, hit the assistant hotkey, or trigger any new assistant turn, the queue is drained and stale agent audio stops. Don’t rely on a queued utterance still being audible later in the session.
There is a 2,000-character cap per call as a sanity bound. For ordinary status, prefer succinct spoken output, often one or two short sentences. Speak longer when the user asks, when voice detail is needed, or when the situation warrants it. Let user instructions, active character/persona, augments, and settings guide the right length.
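A minimal sketch of a client-side guard for the cap follows. It assumes the cap is counted in characters the way Python's `len` counts them, which may not match how Voice Mode measures it; `MAX_SPEAK_CHARS` and `trim_to_cap` are names invented for this example.

```python
MAX_SPEAK_CHARS = 2000  # per-call cap stated above; the counting method is an assumption


def trim_to_cap(text: str, limit: int = MAX_SPEAK_CHARS) -> str:
    """Return text unchanged if it fits, else cut it at the last sentence end that fits."""
    if len(text) <= limit:
        return text
    cut = text[:limit]
    # Find the last sentence-ending punctuation mark inside the allowed window.
    last_end = max(cut.rfind(mark) for mark in ".!?")
    if last_end > 0:
        return cut[: last_end + 1]
    # No sentence boundary found: fall back to a hard cut.
    return cut.rstrip()


long_reply = "A. " * 1000  # 3,000 characters
trimmed = trim_to_cap(long_reply)
```

Trimming is a last resort; a short summary written for the ear is usually better than a long reply cut to size.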
Quota and Pro behavior
Voice output and character voices are Pro surfaces. Free-tier dictation works without any TTS layer.
After the 14-day trial, free users have a weekly allowance of Pro uses.
An accepted MCP speak call consumes one Pro use. Paid Pro users are
unlimited. Active-trial users are unlimited too, but Voice Mode may
shadow-count the same events so it can explain what will happen after
the trial.
Before speaking a long or optional reply, call current_settings and
look at proQuota:
- applies: true means quota is enforced. Check remaining, limit, resetAt, and mcpSpeakAllowed.
- applies: false, reason: "trial" means the user is in trial. Speaking is allowed; shadowUsed is diagnostic only.
- applies: false, reason: "paidPro" means the user has unlimited Pro.
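The three cases can be folded into one pre-flight check, sketched below. The field names (`applies`, `reason`, `remaining`, `mcpSpeakAllowed`) are the ones listed above, but treating proQuota as a plain dict, and treating `mcpSpeakAllowed` as the deciding flag with `remaining` as a fallback, are assumptions of this example rather than documented behavior.

```python
def may_speak(pro_quota: dict) -> bool:
    """Decide whether an optional spoken reply is worth attempting.

    pro_quota is assumed to be the proQuota object from current_settings,
    represented as a dict.
    """
    if not pro_quota.get("applies"):
        # Trial or paid Pro: quota is not enforced, so speaking is allowed.
        return True
    if "mcpSpeakAllowed" in pro_quota:
        # Assumption: this flag already reflects the remaining allowance.
        return bool(pro_quota["mcpSpeakAllowed"])
    # Fallback when the flag is missing: require at least one use left.
    return pro_quota.get("remaining", 0) > 0


trial = {"applies": False, "reason": "trial", "shadowUsed": 4}
exhausted = {"applies": True, "remaining": 0, "limit": 10, "mcpSpeakAllowed": False}
available = {"applies": True, "remaining": 3, "limit": 10, "mcpSpeakAllowed": True}
```

A check like this is only advisory: the allowance can change between the settings call and the speak call, so the speak result still has to be handled.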
If speak returns a quota error, do not retry the same spoken reply.
Fall back to text in your normal assistant response and, if useful,
mention that spoken replies reset with the next weekly allowance or are
unlimited on Pro. The error data includes quotaExceeded, remaining,
limit, resetAt, and upgradeURL.
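A possible fallback is sketched here: turn the error data into one short sentence to append to the text reply, with no retry. The keys read from the error data are the ones named above; the dict shape, the wording of the note, and the function name `quota_fallback_note` are assumptions for illustration.

```python
from typing import Optional


def quota_fallback_note(error_data: dict) -> Optional[str]:
    """Build an optional text note from speak quota error data.

    Returns None when the error is not a quota error. The error data is
    assumed to arrive as a dict with the keys listed above.
    """
    if not error_data.get("quotaExceeded"):
        return None
    parts = ["Spoken replies are paused because the weekly allowance is used up."]
    reset_at = error_data.get("resetAt")
    if reset_at:
        # resetAt is passed through as-is; its format is not assumed here.
        parts.append(f"The allowance resets at {reset_at}.")
    upgrade_url = error_data.get("upgradeURL")
    if upgrade_url:
        parts.append(f"Spoken replies are unlimited on Pro: {upgrade_url}")
    return " ".join(parts)


example_error = {
    "quotaExceeded": True,
    "remaining": 0,
    "limit": 10,
    "resetAt": "2025-01-06T00:00:00Z",    # made-up value for the example
    "upgradeURL": "https://example.com/upgrade",  # placeholder URL
}
fallback = quota_fallback_note(example_error)
```

The reply itself still goes out as text; the note is optional and only worth adding when it helps the user understand why nothing was spoken.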
Runtime speak quota errors are separate from the Dashboard’s Voice
Output setting. Voice Output may be enabled, but the free weekly
spoken-reply allowance can still be exhausted.
Pro tips
- Pair voice with persona, not despite it. A laid-back persona reads flatter with a clipped voice; a precise persona feels off in a warm voice. Voice Mode lets each persona pin its own voice, which is the intended way to use it.
- Smart punctuation is great for prose, less great for code. If the user’s typical replies include a lot of code or technical jargon spoken aloud, turning smart punctuation off can sound more natural.
What’s not in scope
- Voice cloning / custom voices. Not supported. The underlying TTS models are open-source (FluidAudio on GitHub) and technically inclined users can experiment with their own pipelines, but importing custom voices into Voice Mode itself is not officially supported and is at the user’s own risk.
- Cloud TTS. Voice Mode is local-first; there is no cloud TTS backend and there are no plans for one.