LLM/moltbot

mirror of https://github.com/moltbot/moltbot.git synced 2026-05-06 15:18:58 +00:00

Peter Steinberger 7431cb8def docs: detail talk refactor plan

2026-05-06 02:39:15 +01:00

summary: Breaking refactor plan for one Talk architecture across realtime voice, STT/TTS, browser, native, telephony, meetings, and walkie-talkie handoff

read_when:

Refactoring Talk mode, realtime voice, voice-call, Google Meet, browser realtime voice, native push-to-talk, STT, or TTS

Changing Talk Gateway protocol, provider contracts, realtime transports, managed rooms, audio events, cancellation, or tool policy

Deciding whether a voice feature belongs in core, a provider plugin, a native app, a meeting adapter, or a telephony adapter

title: Talk refactor plan

Talk Refactor Plan

This is the clean-break plan for unifying every live voice path behind one Talk architecture. It is intentionally breaking.

The old architecture grew by product surface: browser realtime, Gateway relay, managed native handoff, streaming transcription, Voice Call, Google Meet, local STT/TTS, one-shot TTS, and a retired realtime WebSocket endpoint each learned their own names for sessions, turns, capture, output, barge-in, tool calls, cancellation, and transcript events.

The new architecture grows by primitive. There is one public Talk API, one event envelope, one turn model, one cancellation contract, one provider policy boundary, and one place for shared runtime state. Browser, native, telephony, meetings, and walkie-talkie become adapters over those primitives.

Product Target

OpenClaw supports three Talk products:

| Product | User experience | Mode |
| --- | --- | --- |
| Realtime conversation | Low-latency duplex speech with interruption and provider tool calls | `realtime` |
| Walkie-talkie | Press or hold to speak, release, then hear OpenClaw answer | `stt-tts` |
| Transcription | Live captions, dictation, notes, meeting transcript, no assistant audio | `transcription` |

All three products share session identity, join/reconnect state, turn and capture ids, input audio metadata, output text/audio state, transcript finality, tool-call correlation, cancellation, replay, provider capabilities, policy, auth, and observability.

One-shot uploaded audio and one-shot TTS do not need live Talk session state unless they participate in live capture, turns, interruption, replay, or cancellation.

Hard Decisions

This refactor intentionally removes compatibility that would keep the design muddy:

remove public talk.realtime.* RPCs
remove public talk.transcription.* RPCs
remove public talk.handoff.* RPCs
remove generic talk.session.inputAudio, talk.session.control, and talk.session.toolResult
remove old relay event channels
remove /voiceclaw/realtime
remove src/gateway/voiceclaw-realtime/
remove request-time instruction overrides
keep talk.speak as one-shot TTS, not a live session API
keep legacy realtime config repair in doctor, not startup
keep platform and product names out of core branching

Vocabulary

Keep mode, transport, brain, and surface separate.

type TalkMode = "realtime" | "stt-tts" | "transcription";

type TalkTransport = "webrtc" | "provider-websocket" | "gateway-relay" | "managed-room";

type TalkBrain = "agent-consult" | "direct-tools" | "none";

Modes

realtime means a provider owns a live voice session. Audio goes in, audio comes out, interruptions are possible, and provider tool calls may happen during one provider session.

stt-tts means input speech is transcribed, OpenClaw answers as text, and TTS renders the answer. This is the native Talk and walkie-talkie path when a full-duplex provider session is not the right shape.

transcription means speech-to-text without assistant audio output. It covers captions, dictation, notes, meeting transcript capture, and live voice-note ingestion.

Transports

webrtc is client-owned SDP/media/data-channel transport. It fits browser-owned OpenAI Realtime sessions with ephemeral credentials.

provider-websocket is client-owned provider JSON and audio framing. It fits browser-owned Google Live-style sessions.

gateway-relay means the Gateway owns the provider connection. The client sends authenticated audio frames to the Gateway and receives talk.event plus audio output through Gateway-managed relay state.

managed-room means the Gateway owns a room-like session that clients can join, replace, and drive with explicit turn verbs. It is the primitive for walkie-talkie and native handoff.

Telephony and meetings are not core transports. They are adapters that map phone or meeting media into gateway-relay, managed-room, or stt-tts while keeping call and meeting lifecycle outside core.

Brain Strategies

agent-consult means provider tool calls or session turns consult an OpenClaw agent. Gateway owns prompt construction, context selection, authorization, abort signals, and final result delivery.

direct-tools means a trusted first-party surface can call selected OpenClaw tools directly through Gateway policy. Keep this privileged.

none means transcription-only, external orchestration, or no OpenClaw tool access.

Ownership Boundaries

Core owns generic Talk semantics:

mode, transport, brain, codec, and audio descriptors
session records and session ownership
turn ids and capture ids
event envelope, sequencing, replay, and stale-output suppression
active capture state
active assistant output state
replacement and reconnect state
cancellation propagation
tool policy and tool-call correlation
usage, latency, and health events

Provider plugins own vendor behavior:

OpenAI Realtime SDP and data-channel details
Google Live WebSocket framing
streaming STT provider details
TTS provider details
provider auth, model, voice, codec, and resume quirks
provider capability declarations

Surface adapters own IO and product quirks:

browser capture and playback
native audio sessions, local speech engines, and foreground Talk UX
node command dispatch
telephony media streams, marks, clear events, u-law, and call lifecycle
meeting join/leave, participants, echo suppression, and authorization

Core may store optional surface metadata for diagnostics. Core must not branch on browser, iOS, Android, macOS, Google Meet, Voice Call, or any retired product name.

Final Gateway API

The public Gateway surface is deliberately small:

// Discovery and configuration.
talk.catalog;
talk.config;

// One-shot speech output.
talk.speak;

// Client-owned provider sessions.
talk.client.create;
talk.client.toolCall;

// Gateway-owned live sessions.
talk.session.create;
talk.session.join;
talk.session.appendAudio;
talk.session.startTurn;
talk.session.endTurn;
talk.session.cancelTurn;
talk.session.cancelOutput;
talk.session.submitToolResult;
talk.session.close;

// Events and foreground node mode.
talk.event;
talk.mode;

Use talk.client.* when the client owns provider media transport. Use talk.session.* when the Gateway owns live session state.

talk.mode is the existing foreground node mode broadcast. It can stay, but it is not part of the Talk session control API.

Supported Creation Matrix

| Method | Mode | Transport | Brain | Owner |
| --- | --- | --- | --- | --- |
| `talk.client.create` | `realtime` | `webrtc` | `agent-consult` | client |
| `talk.client.create` | `realtime` | `provider-websocket` | `agent-consult` | client |
| `talk.session.create` | `realtime` | `gateway-relay` | `agent-consult` | Gateway |
| `talk.session.create` | `transcription` | `gateway-relay` | `none` | Gateway |
| `talk.session.create` | `stt-tts` | `managed-room` | `agent-consult` | Gateway |
| `talk.session.create` | `stt-tts` | `managed-room` | `direct-tools` | Gateway |

Reject combinations that blur ownership. talk.client.create must reject Gateway-owned transports. talk.session.create must reject client-owned transports.
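
The matrix and the two rejection rules above can be checked mechanically. The following is a minimal sketch, not the Gateway implementation: `CreateRequest`, `SUPPORTED`, and `rejectReason` are hypothetical names, and the rejection messages are illustrative. Only the six rows of the matrix are treated as supported.

```typescript
type TalkMode = "realtime" | "stt-tts" | "transcription";
type TalkTransport = "webrtc" | "provider-websocket" | "gateway-relay" | "managed-room";
type TalkBrain = "agent-consult" | "direct-tools" | "none";
type CreateMethod = "talk.client.create" | "talk.session.create";

// Hypothetical request shape used only for this sketch.
interface CreateRequest {
  method: CreateMethod;
  mode: TalkMode;
  transport: TalkTransport;
  brain: TalkBrain;
}

// Rows copied from the creation matrix; anything else is rejected.
const SUPPORTED: readonly CreateRequest[] = [
  { method: "talk.client.create", mode: "realtime", transport: "webrtc", brain: "agent-consult" },
  { method: "talk.client.create", mode: "realtime", transport: "provider-websocket", brain: "agent-consult" },
  { method: "talk.session.create", mode: "realtime", transport: "gateway-relay", brain: "agent-consult" },
  { method: "talk.session.create", mode: "transcription", transport: "gateway-relay", brain: "none" },
  { method: "talk.session.create", mode: "stt-tts", transport: "managed-room", brain: "agent-consult" },
  { method: "talk.session.create", mode: "stt-tts", transport: "managed-room", brain: "direct-tools" },
];

// Returns null when the combination is supported, otherwise a rejection reason.
function rejectReason(req: CreateRequest): string | null {
  const clientOwned = req.transport === "webrtc" || req.transport === "provider-websocket";
  // Ownership checks run first so the error names the blurred boundary.
  if (req.method === "talk.client.create" && !clientOwned) {
    return "talk.client.create rejects Gateway-owned transports";
  }
  if (req.method === "talk.session.create" && clientOwned) {
    return "talk.session.create rejects client-owned transports";
  }
  const supported = SUPPORTED.some(
    (row) =>
      row.method === req.method &&
      row.mode === req.mode &&
      row.transport === req.transport &&
      row.brain === req.brain,
  );
  return supported ? null : "unsupported mode/transport/brain combination";
}
```

Keeping the matrix as data rather than nested conditionals means a new supported row is a one-line change, and the ownership rules stay as two explicit guards.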

Removed API

Remove these names from handlers, method lists, scopes, protocol schemas, generated clients, broadcast guards, tests, and docs except explicit migration tables:

| Removed | Replacement |
| --- | --- |
| `talk.realtime.session` | `talk.client.create` |
| `talk.realtime.toolCall` | `talk.client.toolCall` |
| `talk.realtime.relayAudio` | `talk.session.appendAudio` |
| `talk.realtime.relayCancel` | `talk.session.cancelOutput` or `talk.session.cancelTurn` |
| `talk.realtime.relayMark` | internal relay output state |
| `talk.realtime.relayToolResult` | `talk.session.submitToolResult` |
| `talk.realtime.relayClose` | `talk.session.close` |
| `talk.realtime.relay` | `talk.event` |
| `talk.transcription.session` | `talk.session.create({ mode: "transcription" })` |
| `talk.transcription.audio` | `talk.session.appendAudio` |
| `talk.transcription.cancel` | `talk.session.cancelTurn` |
| `talk.transcription.close` | `talk.session.close` |
| `talk.transcription.relay` | `talk.event` |
| `talk.handoff.create` | `talk.session.create({ transport: "managed-room" })` |
| `talk.handoff.join` | `talk.session.join` |
| `talk.handoff.revoke` | `talk.session.close` |
| `talk.session.inputAudio` | `talk.session.appendAudio` |
| `talk.session.control` | explicit turn/output verbs |
| `talk.session.toolResult` | `talk.session.submitToolResult` |

Delete this endpoint:

/voiceclaw/realtime

Delete this folder:

src/gateway/voiceclaw-realtime/

Do not leave a compatibility namespace around retired code.
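
One way to keep the migration table useful after the handlers are gone is to hold it as data that negative tests can query. This is a sketch under that assumption: `REMOVED_TO_REPLACEMENT`, `replacementFor`, and `findRemovedNames` are hypothetical names, and the replacement strings are copied from the table rather than from any generated schema.

```typescript
// Removed name -> replacement text, as listed in the migration table.
const REMOVED_TO_REPLACEMENT: Readonly<Record<string, string>> = {
  "talk.realtime.session": "talk.client.create",
  "talk.realtime.toolCall": "talk.client.toolCall",
  "talk.realtime.relayAudio": "talk.session.appendAudio",
  "talk.realtime.relayCancel": "talk.session.cancelOutput or talk.session.cancelTurn",
  "talk.realtime.relayMark": "internal relay output state",
  "talk.realtime.relayToolResult": "talk.session.submitToolResult",
  "talk.realtime.relayClose": "talk.session.close",
  "talk.realtime.relay": "talk.event",
  "talk.transcription.session": 'talk.session.create({ mode: "transcription" })',
  "talk.transcription.audio": "talk.session.appendAudio",
  "talk.transcription.cancel": "talk.session.cancelTurn",
  "talk.transcription.close": "talk.session.close",
  "talk.transcription.relay": "talk.event",
  "talk.handoff.create": 'talk.session.create({ transport: "managed-room" })',
  "talk.handoff.join": "talk.session.join",
  "talk.handoff.revoke": "talk.session.close",
  "talk.session.inputAudio": "talk.session.appendAudio",
  "talk.session.control": "explicit turn/output verbs",
  "talk.session.toolResult": "talk.session.submitToolResult",
};

// Look up the documented replacement for a removed name, if it is one.
function replacementFor(name: string): string | undefined {
  return Object.prototype.hasOwnProperty.call(REMOVED_TO_REPLACEMENT, name)
    ? REMOVED_TO_REPLACEMENT[name]
    : undefined;
}

// Negative-test helper: which advertised names are retired ones that leaked back?
function findRemovedNames(advertised: readonly string[]): string[] {
  return advertised.filter((name) => replacementFor(name) !== undefined);
}
```

A protocol test could pass the advertised method and event lists to `findRemovedNames` and assert that the result is empty.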

Target Source Layout

Shared runtime:

src/talk/
  audio-codec.ts
  agent-consult-runtime.ts
  agent-consult-tool.ts
  agent-talkback-runtime.ts
  fast-context-runtime.ts
  provider-registry.ts
  provider-resolver.ts
  provider-types.ts
  session-log-runtime.ts
  session-runtime.ts
  talk-events.ts
  talk-session-controller.ts

Gateway adapters:

src/gateway/server-methods/
  talk.ts          # catalog, config, speak, mode, composition
  talk-client.ts   # client-owned provider sessions
  talk-session.ts  # Gateway-owned live sessions

Gateway relay helpers can exist while the code moves, but the long-term shape is that relay, transcription, and handoff state use src/talk primitives instead of each reimplementing turns and events.

Public SDK:

src/plugin-sdk/realtime-voice.ts

Keep this SDK subpath as the stable plugin import facade. It may re-export Talk runtime contracts, but plugin authors should import from the facade, not from core file paths.

Event Contract

All live paths emit talk.event with the envelope defined in the Talk API and runtime contract. The required fields are id, type, sessionId, seq, timestamp, mode, transport, brain, and payload. turnId, captureId, callId, itemId, and parentId are added when the event is tied to a turn, capture, provider item, tool call, or TTS output.

Core event families are session.*, turn.*, capture.*, input.audio.*, transcript.*, output.text.*, output.audio.*, tool.*, usage.metrics, latency.metrics, and health.changed. Payloads must not duplicate large raw audio frames when the transport already carries them. Text-ready is not audio-ready; clients enter playback state only on audio events.
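
The envelope and the playback rule can be sketched as types plus a small emitter. This is an illustration under stated assumptions, not the contract itself: the field types (for example `timestamp` as epoch milliseconds and `id` built from session id and sequence), the `createEmitter` and `entersPlayback` helpers, and the concrete event names used in examples are all hypothetical; only the field names and event families come from the text above.

```typescript
type TalkMode = "realtime" | "stt-tts" | "transcription";
type TalkTransport = "webrtc" | "provider-websocket" | "gateway-relay" | "managed-room";
type TalkBrain = "agent-consult" | "direct-tools" | "none";

// Optional correlation ids, present only when the event is tied to them.
interface TalkEventRefs {
  turnId?: string;
  captureId?: string;
  callId?: string;
  itemId?: string;
  parentId?: string;
}

interface TalkEvent<P = unknown> extends TalkEventRefs {
  id: string;
  type: string; // a member of one of the core families, e.g. output.audio.*
  sessionId: string;
  seq: number;
  timestamp: number; // assumption: epoch milliseconds
  mode: TalkMode;
  transport: TalkTransport;
  brain: TalkBrain;
  payload: P;
}

interface SessionContext {
  sessionId: string;
  mode: TalkMode;
  transport: TalkTransport;
  brain: TalkBrain;
}

// Hypothetical factory: one emitter per session keeps seq strictly increasing.
function createEmitter(ctx: SessionContext) {
  let seq = 0;
  return function emit<P>(type: string, payload: P, refs: TalkEventRefs = {}): TalkEvent<P> {
    seq += 1;
    return {
      ...refs,
      id: ctx.sessionId + ":" + seq, // assumption: any unique id scheme would do
      type,
      sessionId: ctx.sessionId,
      seq,
      timestamp: Date.now(),
      mode: ctx.mode,
      transport: ctx.transport,
      brain: ctx.brain,
      payload,
    };
  };
}

// Text-ready is not audio-ready: only output.audio.* events move a client into playback.
function entersPlayback(event: TalkEvent): boolean {
  return /^output\.audio\./.test(event.type);
}
```

Owning `seq` in a single per-session emitter is what lets replay and stale-output suppression rely on ordering without each adapter inventing its own counter.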

Cancellation Contract

Cancellation must abort underlying work, not only ignore stale output.

When a turn or session is cancelled:

provider realtime response is cancelled when supported
provider session is closed or reset when cancellation cannot be scoped
streaming STT receives abort
agent consult receives abort
queued tools do not start after abort
already-started side-effecting tools receive abort and report cancellation
pending TTS jobs are drained
playback sources are stopped
relay streams are cleared
managed-room capture and output state reset
stale finals and stale audio deltas are ignored
one terminal cancellation event is emitted
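
The list above can be read as a single cancellation scope per turn: every piece of underlying work registers an abort hook, and cancelling runs them all once. The sketch below assumes that reading. `TurnScope` and its methods are hypothetical, and the `turn.cancelled` event name is an assumed member of the turn.* family.

```typescript
type AbortHook = () => void;

class TurnScope {
  readonly turnId: string;
  // Stand-in for talk.event emission; a real runtime would emit envelopes.
  readonly emitted: string[] = [];
  private hooks: AbortHook[] = [];
  private cancelled = false;

  constructor(turnId: string) {
    this.turnId = turnId;
  }

  // Register underlying work: provider response, STT stream, agent consult,
  // TTS job, playback source, relay stream. Late registrations abort at once.
  onAbort(hook: AbortHook): void {
    if (this.cancelled) {
      hook();
      return;
    }
    this.hooks.push(hook);
  }

  // Queued tools check this before starting, so none start after abort.
  canStartTool(): boolean {
    return !this.cancelled;
  }

  // Stale finals and stale audio deltas for a cancelled or other turn are dropped.
  acceptOutput(turnId: string): boolean {
    return !this.cancelled && turnId === this.turnId;
  }

  cancel(reason: string): void {
    if (this.cancelled) return; // idempotent: a second cancel is a no-op
    this.cancelled = true;
    for (const hook of this.hooks.splice(0)) {
      try {
        hook();
      } catch {
        // One failing abort must not prevent the remaining work from aborting.
      }
    }
    // Exactly one terminal cancellation event per turn.
    this.emitted.push("turn.cancelled:" + reason);
  }
}
```

The design point is that ignoring stale output (`acceptOutput`) is a safety net on top of real aborts, not a substitute for them.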

Barge-in requires real speech: provider speech-started, local VAD, or an adapter-owned speech detector. Silence, echo, or microphone buffers alone must not cancel assistant output.
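
The barge-in rule reduces to a predicate over the kind of signal received. A minimal sketch, assuming a hypothetical `SpeechSignal` union and `shouldBargeIn` helper; the signal kinds mirror the sources named above and are not protocol names.

```typescript
// Hypothetical signal shapes an adapter might feed into the barge-in decision.
type SpeechSignal =
  | { kind: "provider-speech-started" }
  | { kind: "local-vad"; speechDetected: boolean }
  | { kind: "adapter-detector"; speechDetected: boolean }
  | { kind: "audio-buffer"; byteLength: number }
  | { kind: "echo" };

function shouldBargeIn(signal: SpeechSignal, assistantSpeaking: boolean): boolean {
  // Nothing to interrupt when the assistant is not producing output.
  if (!assistantSpeaking) return false;
  switch (signal.kind) {
    case "provider-speech-started":
      return true;
    case "local-vad":
    case "adapter-detector":
      // A detector only counts when it actually reports speech.
      return signal.speechDetected;
    case "audio-buffer":
    case "echo":
      // Raw microphone buffers and echo alone never cancel assistant output.
      return false;
  }
}
```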

Config Contract

Config stays under talk; do not add talk.speech. talk.provider and talk.providers.* remain speech/STT/TTS provider config. Realtime selectors live under talk.realtime: provider, providers.*, model, voice, mode, transport, and brain.

talk.config returns effective config without secrets unless privileged. talk.catalog returns provider capabilities, not inferred provider-id guesses. Doctor migrates old realtime placement into talk.realtime; runtime startup does not reinterpret Voice Call, STT, or TTS config as realtime config.
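
The "without secrets unless privileged" rule for talk.config could look like the following. This is only a sketch: the `TalkConfigView` shape is a simplified guess at the config tree, `apiKey` is a hypothetical stand-in for whatever secret fields providers actually carry, and `effectiveConfig` is not a real Gateway function.

```typescript
interface ProviderSettings {
  apiKey?: string; // hypothetical secret field
  model?: string;
  voice?: string;
}

// Simplified view: speech/STT/TTS providers at the top, realtime under talk.realtime.
interface TalkConfigView {
  provider?: string;
  providers: Record<string, ProviderSettings>;
  realtime: {
    provider?: string;
    providers: Record<string, ProviderSettings>;
  };
}

// Copy each provider entry and drop the secret from the copy.
function stripSecrets(providers: Record<string, ProviderSettings>): Record<string, ProviderSettings> {
  const out: Record<string, ProviderSettings> = {};
  for (const id of Object.keys(providers)) {
    const copy: ProviderSettings = { ...providers[id] };
    delete copy.apiKey;
    out[id] = copy;
  }
  return out;
}

// Privileged callers see the config as-is; everyone else gets a redacted copy.
function effectiveConfig(config: TalkConfigView, privileged: boolean): TalkConfigView {
  if (privileged) return config;
  return {
    ...config,
    providers: stripSecrets(config.providers),
    realtime: {
      ...config.realtime,
      providers: stripSecrets(config.realtime.providers),
    },
  };
}
```

Redacting into a copy leaves the stored config untouched, so a later privileged read still sees the full values.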

Surface Mapping

| Surface | Talk mapping |
| --- | --- |
| Browser WebRTC | `talk.client.create`, client-owned provider media, `talk.client.toolCall` for provider tool calls |
| Browser provider WebSocket | `talk.client.create`, browser-owned provider framing, Gateway-owned credentials and policy |
| Browser Gateway relay | `talk.session.create`, `appendAudio`, `submitToolResult`, `cancelOutput`, `close`, and `talk.event` |
| Native push-to-talk | `stt-tts` plus `managed-room`; press/startTurn, release/endTurn, cancel/cancelTurn |
| Walkie-talkie | managed-room join/replacement plus shared turn/output events |
| Voice Call | telephony adapter over Talk events; call ids, stream ids, u-law, marks, and clear events stay on the plugin side |
| Google Meet and future meetings | meeting adapter over Talk events; participant state, permissions, mute, and echo suppression stay out of core |

See Talk surface mapping for the adapter-level rules.
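
For the native push-to-talk row, the gesture-to-verb mapping is small enough to write down directly. A sketch, assuming hypothetical `PttGesture`, `TurnCall`, and `toTurnCall` names and an assumed `{ sessionId, turnId }` params shape; only the three method names come from the final API.

```typescript
type PttGesture = "press" | "release" | "cancel";
type TurnVerb = "talk.session.startTurn" | "talk.session.endTurn" | "talk.session.cancelTurn";

// press/startTurn, release/endTurn, cancel/cancelTurn, as in the mapping table.
const PTT_TO_VERB: Readonly<Record<PttGesture, TurnVerb>> = {
  press: "talk.session.startTurn",
  release: "talk.session.endTurn",
  cancel: "talk.session.cancelTurn",
};

// Assumed call shape; the real params are defined by the protocol schema.
interface TurnCall {
  method: TurnVerb;
  params: { sessionId: string; turnId: string };
}

function toTurnCall(gesture: PttGesture, sessionId: string, turnId: string): TurnCall {
  return { method: PTT_TO_VERB[gesture], params: { sessionId, turnId } };
}
```

The adapter only translates gestures; turn state itself stays in the managed-room session on the Gateway.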

Detailed Refactor Phases

Phase 1: Protocol Is The Source Of Truth

define final talk.client.*, talk.session.*, talk.event, talk.catalog, talk.config, talk.speak, and talk.mode
delete removed RPCs from method lists and generated metadata
delete removed event channels from hello feature advertising
classify every final method in METHOD_SCOPE_GROUPS
regenerate TypeScript and Swift protocol clients
add protocol tests proving removed names are absent

Exit criteria: generated clients expose only the final public Talk API.

Phase 2: Shared Runtime Becomes `src/talk`

move provider-agnostic realtime voice modules into src/talk
keep the plugin SDK facade at openclaw/plugin-sdk/realtime-voice
rename logs and tests from realtime-voice wording to Talk wording where that improves clarity
centralize event sequencing, active turn state, capture state, output state, stale-turn rejection, and replay history
keep provider adapters out of this folder

Exit criteria: core and bundled surfaces import shared semantics from src/talk or the SDK facade, not from surface-local helpers.
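
Centralized replay history could be as simple as a bounded buffer keyed by `seq`. This is a sketch under assumptions: `ReplayHistory`, the fixed `limit`, and the `since(lastSeenSeq)` query are hypothetical, and the plan does not say how much history is retained or how a reconnecting client states its position.

```typescript
interface SequencedEvent {
  seq: number;
  type: string;
}

class ReplayHistory {
  private events: SequencedEvent[] = [];
  private limit: number;

  constructor(limit: number) {
    this.limit = limit;
  }

  // Record events in emission order; the oldest entry falls off when full.
  record(event: SequencedEvent): void {
    this.events.push(event);
    if (this.events.length > this.limit) {
      this.events.shift();
    }
  }

  // Events a reconnecting client has not seen yet, oldest first.
  since(lastSeenSeq: number): SequencedEvent[] {
    return this.events.filter((event) => event.seq > lastSeenSeq);
  }
}
```

Because every surface replays from the same buffer, a reconnect behaves the same for relay, transcription, and managed-room sessions.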

Phase 3: Gateway Method Split

make talk.ts a composition point for catalog, config, speak, mode, client, and session handlers
put client-owned provider session methods in talk-client.ts
put Gateway-owned session methods in talk-session.ts
make relay, transcription, and managed-room handlers thin adapters over shared runtime primitives
route session replacement notifications to the displaced connection
reject stale turn completion before mutating active room state

Exit criteria: public RPC handlers read like API adapters, not separate Talk implementations.
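
"Reject stale turn completion before mutating active room state" is mostly an ordering rule: validate first, mutate second. A minimal sketch, assuming a hypothetical `RoomState` record and `endTurn` handler; the result shape and reason strings are illustrative.

```typescript
interface RoomState {
  sessionId: string;
  activeTurnId: string | null;
  completedTurns: string[];
}

type EndTurnResult =
  | { ok: true }
  | { ok: false; reason: "stale-turn" | "no-active-turn" };

function endTurn(room: RoomState, turnId: string): EndTurnResult {
  // All checks run before any mutation, so a rejected call leaves the room untouched.
  if (room.activeTurnId === null) {
    return { ok: false, reason: "no-active-turn" };
  }
  if (room.activeTurnId !== turnId) {
    // A displaced or late client is completing a turn that is no longer active.
    return { ok: false, reason: "stale-turn" };
  }
  room.completedTurns.push(turnId);
  room.activeTurnId = null;
  return { ok: true };
}
```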

Phase 4: Browser UI Uses The Final API

update WebRTC and provider WebSocket startup to talk.client.create
update browser provider tool calls to talk.client.toolCall
update Gateway relay startup to talk.session.create
update relay audio to talk.session.appendAudio
update relay tool result submission to talk.session.submitToolResult
update relay close to talk.session.close
listen only to talk.event
handle aborted consult runs immediately instead of timing out
gate relay barge-in on speech or VAD

Exit criteria: UI tests contain no calls to removed Talk RPC names.

Phase 5: Native And Nodes Become Event-Driven

map native push-to-talk into managed-room sessions
start, end, cancel, and replace turns through explicit session verbs
clean capture state when push-to-talk start fails
keep local STT and TTS as native adapter behavior
remove chat-history polling from the success path
keep fallback polling only if there is an explicit degraded-mode test

Exit criteria: native Talk success path is driven by talk.event, not hidden chat side effects.

Phase 6: Telephony And Meetings Become Adapters

map Voice Call realtime and streaming STT into Talk event/cancellation semantics
create or guard a turn before early speech cancellation events
keep telephony codec, marks, clear events, and call lifecycle outside core
map Google Meet transcript and assistant output into talk.event
keep participant and echo-suppression behavior in the meeting adapter
pass abort signals into agent consult and tool runtime

Exit criteria: Voice Call and meetings share event and cancellation semantics without introducing telephony or meeting branches in core.

Phase 7: Config And Doctor Cleanup

keep talk.provider and talk.providers.* as speech/STT/TTS config
keep realtime voice selectors under talk.realtime
make talk.config return only resolved effective provider data
repair legacy realtime placement in doctor
document that runtime startup does not guess or rewrite config
update SDK migration, Gateway protocol, Talk node, Control UI, and TTS docs

Exit criteria: no second speech namespace, no startup migrations, and no ambiguous active provider in talk.config.

Phase 8: Delete The Retired Stack

remove /voiceclaw/realtime
delete src/gateway/voiceclaw-realtime/
remove request-time instructionsOverride
remove old RPC handlers, scopes, broadcast guards, protocol schemas, generated clients, docs, and UI calls
keep old names only in explicit migration tables and negative tests

Exit criteria: repository search finds removed public names only in migration notes or tests that assert absence.

Test And Verification Plan

The full matrix lives in the Talk refactor execution checklist. The required proof areas are:

protocol and generated clients expose only the final Talk API
Gateway tests cover every talk.client.* and talk.session.* method
UI tests prove browser WebRTC, provider WebSocket, and relay paths use the final API
native tests prove managed-room push-to-talk cleanup, replacement, and event flow
Voice Call and meeting tests prove early speech, barge-in, output state, and cancellation behavior
config tests prove talk.config reports only resolved effective provider data
architecture searches prove removed RPCs, events, endpoint, folder, and instruction override stay gone
docs, protocol generation, SDK API checks, Android tests, build, and pnpm check:changed pass before push

Definition Of Done

The refactor is complete when:

final API is the only advertised public API
removed RPCs are gone from handlers, scopes, method lists, schemas, generated clients, docs, and UI
removed event channels are gone
retired realtime HTTP endpoint is gone
retired realtime folder is gone
browser Talk works through talk.client.* or talk.session.*
native Talk works through session events
streaming STT works through talk.session.*
TTS one-shot remains talk.speak
walkie-talkie works through managed-room sessions
Voice Call and meetings use shared events and cancellation semantics
cancellation aborts underlying work
event envelopes are consistent
config migration is handled by doctor
tests prove the deleted API cannot accidentally return

Supporting details:

The end state: one Talk system, a small public API, provider-owned vendor logic, surface-owned IO, and a Gateway core that owns policy, events, sessions, turns, cancellation, and observability.