Voice Output (TTS)

The “voice output” layer is what speaks the assistant’s replies back to the user out loud. It runs entirely on-device — text never leaves the machine for synthesis.

When the user asks “how do I change my voice”, “turn off voice output”, “why does it sound robotic”, or anything else about how the assistant sounds, this is the topic.

Voice output (on/off). The master switch. When off, the assistant still answers — just silently — and the reply text is delivered through the drawer / paste flow only. When on, the active TTS engine speaks the reply through the active voice. This setting is managed in the Dashboard, not through MCP.

TTS engine. Pocket TTS, on-device. Fast, lightweight, low memory, ships with a curated set of character voices that map cleanly to Voice Mode’s personas. There is no engine choice to make — voice output is always Pocket.

Voice. A Pocket voice identifier (e.g. alba). Personas can override the voice on a per-persona basis — see personas.md. Agents may pass a one-call voice override to speak, but MCP does not persistently change the user’s configured voice.

Smart punctuation. When on, the TTS engine infers commas and pauses from prosody rather than rendering only the punctuation in the input text. Generally improves naturalness for free-form answers. Most users want this on. Turn it off if the engine over-corrects on technical content (e.g. inserting pauses inside a function name or URL).

Text normalization (expanding numbers, units, and abbreviations into spelled-out forms) is always on and not user-configurable.

Pass the entire reply in a single speak call. Do not pre-split the text into chunks — Voice Mode serializes calls into a FIFO queue, so multiple back-to-back speak calls play one after another, but each call still adds an audible gap. One call → one smooth utterance.

If you need to stream or split a reply anyway, pass the same stable identifier for every chunk that belongs to one user-facing answer. Voice Mode groups those lines together in the UI and, for free users, dedupes chunks with the same peer + identifier for a short window, so that one answer counts as a single Pro use. Use a new identifier for each new answer.
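The shared-identifier rule can be sketched as a small payload builder. This is only an illustration: the source does not name the identifier parameter, so the `reply_id` and `text` field names below are placeholders, and `build_chunk_calls` is a hypothetical helper rather than part of Voice Mode.

```python
import uuid
from typing import List, Optional


def build_chunk_calls(chunks: List[str], reply_id: Optional[str] = None) -> List[dict]:
    """Build one speak payload per chunk, all tagged with the same identifier."""
    # One identifier per user-facing answer: create it once, reuse it for every chunk.
    reply_id = reply_id or uuid.uuid4().hex
    return [
        # "text" and "reply_id" are placeholder field names (assumptions).
        {"text": chunk, "reply_id": reply_id}
        for chunk in chunks
        if chunk.strip()  # skip empty or whitespace-only chunks
    ]
```

Calling `build_chunk_calls` again without a `reply_id` produces a fresh identifier, which matches the "new identifier for a new answer" rule.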

If you do call speak more than once (e.g. across separate replies in the same session), each new call appends to the queue. The agent isn’t blocked on playback — speak returns as soon as the text is queued.

If Voice Output is disabled in the Dashboard, speak returns queued: false with a warning and does not play audio. Do not retry, and do not try to enable Voice Output through MCP. Ask the user to enable Voice Output in the Dashboard if they want spoken replies.
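A minimal sketch of handling that result, assuming the speak result arrives as a dictionary with a boolean `queued` field and an optional `warning` string (the exact shape and the `warning` key name are assumptions). `handle_speak_result` is a hypothetical helper; it never retries and never touches settings.

```python
from typing import Optional


def handle_speak_result(result: dict) -> Optional[str]:
    """Return a note to include in the text reply, or None if audio was queued."""
    if result.get("queued"):
        return None  # audio is queued; nothing else to do
    # Not queued: do not retry, and do not try to enable Voice Output through MCP.
    note = (
        "Voice Output is off, so this reply was not spoken. "
        "You can enable it in the Dashboard."
    )
    warning = result.get("warning")  # assumed field name; passed through verbatim
    return f"{note} ({warning})" if warning else note
```

The returned note is meant to be appended to the normal text reply, and only when the user would actually want spoken output.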

The user always wins: if they start dictating, hit the assistant hotkey, or trigger any new assistant turn, the queue is drained and stale agent audio stops. Don’t rely on a queued utterance still being audible later in the session.
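The queueing rules above can be pictured with a toy model. This is not Voice Mode's implementation, only a sketch of the described semantics: calls are queued in FIFO order, `speak` returns immediately, and any user action drains whatever has not played yet. All names here are hypothetical.

```python
from collections import deque


class ToySpeechQueue:
    """Toy model of the described behavior; not Voice Mode's actual code."""

    def __init__(self):
        self._pending = deque()

    def speak(self, text):
        # Returns as soon as the text is queued; playback happens later.
        self._pending.append(text)
        return {"queued": True}

    def play_next(self):
        # Utterances play in the order they were queued (FIFO).
        return self._pending.popleft() if self._pending else None

    def user_interrupt(self):
        # Dictation, the hotkey, or a new turn drops all stale audio.
        dropped = len(self._pending)
        self._pending.clear()
        return dropped
```

The practical consequence for an agent is that a successful `speak` says nothing about whether the user eventually heard the utterance.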

There is a 2,000-character cap per call as a sanity bound. For ordinary status, prefer succinct spoken output, often one or two short sentences. Speak longer when the user asks, when voice detail is needed, or when the situation warrants it. Let user instructions, active character/persona, augments, and settings guide the right length.
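One way to stay under the cap is to trim the text before the call. A minimal sketch, assuming the cap counts Python string characters (the source does not say whether it counts characters, bytes, or something else) and that cutting at a sentence boundary is acceptable; `fit_to_speak_cap` is a hypothetical helper.

```python
SPEAK_CHAR_CAP = 2000  # per-call cap stated above


def fit_to_speak_cap(text: str, cap: int = SPEAK_CHAR_CAP) -> str:
    """Return text that fits in one speak call, cut at a sentence end if possible."""
    text = text.strip()
    if len(text) <= cap:
        return text
    head = text[:cap]
    # Look for the last sentence-ending mark inside the allowed window.
    cut = max(head.rfind(". "), head.rfind("! "), head.rfind("? "))
    if cut == -1:
        # No sentence boundary: fall back to the last space, else a hard cut.
        cut = head.rfind(" ")
        return head[:cut].rstrip() if cut > 0 else head
    return head[: cut + 1]
```

For ordinary status updates the better fix is usually to write a shorter spoken reply in the first place; trimming is a safety net, not a substitute.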

Voice output and character voices are Pro surfaces. Free-tier dictation works without any TTS layer.

After the 14-day trial, free users have a weekly allowance of Pro uses. An accepted MCP speak call consumes one Pro use. Paid Pro users are unlimited. Active-trial users are unlimited too, but Voice Mode may shadow-count the same events so it can explain what will happen after the trial.

Before speaking a long or optional reply, call current_settings and look at proQuota:

  • applies: true means quota is enforced. Check remaining, limit, resetAt, and mcpSpeakAllowed.
  • applies: false, reason: "trial" means the user is in trial. Speaking is allowed; shadowUsed is diagnostic only.
  • applies: false, reason: "paidPro" means the user has unlimited Pro.
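The three cases above can be folded into one pre-flight check. A minimal sketch, assuming `proQuota` has already been read from the `current_settings` result as a dictionary. How `mcpSpeakAllowed` relates to `remaining` is not specified in the source, so this hypothetical helper conservatively requires both.

```python
def may_speak(pro_quota: dict) -> bool:
    """Decide whether an optional spoken reply is worth attempting."""
    if pro_quota.get("applies") is False:
        # reason is "trial" or "paidPro": speaking is not quota-limited.
        return True
    # Quota is enforced (or the data is missing): require the explicit flag
    # and a positive balance before spending a Pro use.
    allowed = bool(pro_quota.get("mcpSpeakAllowed", False))
    return allowed and pro_quota.get("remaining", 0) > 0
```

A missing or empty `proQuota` is treated as "do not speak", which errs on the side of falling back to text.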

If speak returns a quota error, do not retry the same spoken reply. Fall back to text in your normal assistant response and, if useful, mention that spoken replies reset with the next weekly allowance or are unlimited on Pro. The error data includes quotaExceeded, remaining, limit, resetAt, and upgradeURL.
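A sketch of the text fallback, assuming the error data is available as a dictionary with the fields listed above. The format of `resetAt` is not documented here, so it is passed through verbatim; `quota_fallback_note` is a hypothetical helper and the wording of the note is only an example.

```python
from typing import Optional


def quota_fallback_note(error_data: dict) -> Optional[str]:
    """Build an optional note for the text reply after a speak quota error."""
    if not error_data.get("quotaExceeded"):
        return None  # not a quota error; nothing to explain
    note = "Spoken replies are paused until your weekly allowance resets"
    reset_at = error_data.get("resetAt")
    if reset_at:
        note += f" ({reset_at})"  # shown as received; format is not assumed
    note += "; they are unlimited on Pro."
    upgrade_url = error_data.get("upgradeURL")
    if upgrade_url:
        note += f" Upgrade: {upgrade_url}"
    return note
```

The same reply should then be delivered as text only, with no second `speak` attempt.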

Runtime speak quota errors are separate from the Dashboard’s Voice Output setting. Voice Output may be enabled, but the free weekly spoken-reply allowance can still be exhausted.

  • Pair the voice with the persona rather than working against it. A laid-back persona reads flatter with a clipped voice; a precise persona feels off in a warm voice. Voice Mode lets each persona pin its own voice, which is the intended way to keep the two matched.
  • Smart punctuation is great for prose, less great for code. If the user’s typical replies include a lot of code or technical jargon spoken aloud, turning smart punctuation off can sound more natural.
  • Voice cloning / custom voices. Not supported. The underlying TTS models are open-source (FluidAudio on GitHub), and technically inclined users can experiment with their own pipelines, but importing custom voices into Voice Mode itself is not officially supported and is at the user’s own risk.
  • Cloud TTS. Voice Mode is local-first; there is no cloud TTS backend and there are no plans for one.