chore: Run pnpm format:fix.

cpojer
2026-01-31 21:13:13 +09:00
parent dcc2de15a6
commit 8cab78abbc
624 changed files with 10729 additions and 7514 deletions


@@ -3,28 +3,31 @@ summary: "How inbound audio/voice notes are downloaded, transcribed, and injecte
read_when:
- Changing audio transcription or media handling
---
# Audio / Voice Notes — 2026-01-17
## What works
- **Media understanding (audio)**: If audio understanding is enabled (or autodetected), OpenClaw:
1) Locates the first audio attachment (local path or URL) and downloads it if needed.
2) Enforces `maxBytes` before sending to each model entry.
3) Runs the first eligible model entry in order (provider or CLI).
4) If it fails or skips (size/timeout), it tries the next entry.
5) On success, it replaces `Body` with an `[Audio]` block and sets `{{Transcript}}`.
1. Locates the first audio attachment (local path or URL) and downloads it if needed.
2. Enforces `maxBytes` before sending to each model entry.
3. Runs the first eligible model entry in order (provider or CLI).
4. If it fails or skips (size/timeout), it tries the next entry.
5. On success, it replaces `Body` with an `[Audio]` block and sets `{{Transcript}}`.
- **Command parsing**: When transcription succeeds, `CommandBody`/`RawBody` are set to the transcript so slash commands still work.
- **Verbose logging**: In `--verbose`, we log when transcription runs and when it replaces the body.
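The ordered fallback described in the steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not OpenClaw's actual code: `ModelEntry`, `transcribe_with_fallback`, and the `run` callables are hypothetical names, and the exact layout of the `[Audio]` block is assumed.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ModelEntry:
    """Hypothetical stand-in for one `models[]` entry (provider or CLI)."""

    name: str
    max_bytes: int
    run: Callable[[bytes], str]  # returns a transcript or raises on failure


def transcribe_with_fallback(audio: bytes, entries: list[ModelEntry]) -> Optional[str]:
    """Try entries in order; skip oversize media and fall through on errors."""
    for entry in entries:
        if len(audio) > entry.max_bytes:
            continue  # the size limit is checked per entry before sending
        try:
            return entry.run(audio)
        except Exception:
            continue  # failure or timeout: try the next entry
    return None  # nothing succeeded; the caller keeps the original body


def _failing(_: bytes) -> str:
    raise TimeoutError("simulated provider timeout")


entries = [
    ModelEntry("tiny-limit", max_bytes=4, run=lambda a: "never used"),
    ModelEntry("flaky", max_bytes=1024, run=_failing),
    ModelEntry("local-cli", max_bytes=1024, run=lambda a: "hello world"),
]
transcript = transcribe_with_fallback(b"0123456789", entries)
# Assumed block layout: the doc only says `Body` is replaced with an `[Audio]` block.
body = f"[Audio]\n{transcript}" if transcript is not None else "<original body>"
```

Here the first entry is skipped for size, the second fails, and the third supplies the transcript.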
## Auto-detection (default)
If you **don't configure models** and `tools.media.audio.enabled` is **not** set to `false`,
OpenClaw auto-detects in this order and stops at the first working option:
1) **Local CLIs** (if installed)
1. **Local CLIs** (if installed)
- `sherpa-onnx-offline` (requires `SHERPA_ONNX_MODEL_DIR` with encoder/decoder/joiner/tokens)
- `whisper-cli` (from `whisper-cpp`; uses `WHISPER_CPP_MODEL` or the bundled tiny model)
- `whisper` (Python CLI; downloads models automatically)
2) **Gemini CLI** (`gemini`) using `read_many_files`
3) **Provider keys** (OpenAI → Groq → Deepgram → Google)
2. **Gemini CLI** (`gemini`) using `read_many_files`
3. **Provider keys** (OpenAI → Groq → Deepgram → Google)
To disable auto-detection, set `tools.media.audio.enabled: false`.
To customize, set `tools.media.audio.models`.
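As a rough sketch of the detection order listed above (local CLIs, then the Gemini CLI, then provider keys), assuming that "working" can be approximated by "binary on `PATH`" or "key available". The function name and the two injected predicates are hypothetical, and the extra requirements (such as `SHERPA_ONNX_MODEL_DIR`) are not modeled.

```python
import shutil
from typing import Callable, Optional

# Orders copied from the list above.
LOCAL_CLIS = ["sherpa-onnx-offline", "whisper-cli", "whisper"]
PROVIDERS = ["openai", "groq", "deepgram", "google"]


def autodetect_audio(
    has_binary: Callable[[str], bool],
    has_key: Callable[[str], bool],
) -> Optional[str]:
    """Return the first available option, or None if nothing is usable."""
    for cli in LOCAL_CLIS:
        if has_binary(cli):
            return f"cli:{cli}"
    if has_binary("gemini"):
        return "cli:gemini"
    for provider in PROVIDERS:
        if has_key(provider):
            return f"provider:{provider}"
    return None


def on_path(name: str) -> bool:
    """Simplified binary check; real detection is described as best-effort."""
    return shutil.which(name) is not None
```

Passing the predicates in keeps the ordering logic testable without touching the real environment; `on_path` is one plausible implementation of `has_binary`.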
@@ -33,6 +36,7 @@ Note: Binary detection is best-effort across macOS/Linux/Windows; ensure the CLI
## Config examples
### Provider + CLI fallback (OpenAI + Whisper CLI)
```json5
{
tools: {
@@ -46,16 +50,17 @@ Note: Binary detection is best-effort across macOS/Linux/Windows; ensure the CLI
type: "cli",
command: "whisper",
args: ["--model", "base", "{{MediaPath}}"],
timeoutSeconds: 45
}
]
}
}
}
timeoutSeconds: 45,
},
],
},
},
},
}
```
### Provider-only with scope gating
```json5
{
tools: {
@@ -64,34 +69,32 @@ Note: Binary detection is best-effort across macOS/Linux/Windows; ensure the CLI
enabled: true,
scope: {
default: "allow",
rules: [
{ action: "deny", match: { chatType: "group" } }
]
rules: [{ action: "deny", match: { chatType: "group" } }],
},
models: [
{ provider: "openai", model: "gpt-4o-mini-transcribe" }
]
}
}
}
models: [{ provider: "openai", model: "gpt-4o-mini-transcribe" }],
},
},
},
}
```
### Provider-only (Deepgram)
```json5
{
tools: {
media: {
audio: {
enabled: true,
models: [{ provider: "deepgram", model: "nova-3" }]
}
}
}
models: [{ provider: "deepgram", model: "nova-3" }],
},
},
},
}
```
## Notes & limits
- Provider auth follows the standard model auth order (auth profiles, env vars, `models.providers.*.apiKey`).
- Deepgram picks up `DEEPGRAM_API_KEY` when `provider: "deepgram"` is used.
- Deepgram setup details: [Deepgram (audio transcription)](/providers/deepgram).
@@ -104,6 +107,7 @@ Note: Binary detection is best-effort across macOS/Linux/Windows; ensure the CLI
- CLI stdout is capped (5MB); keep CLI output concise.
## Gotchas
- Scope rules use first-match wins. `chatType` is normalized to `direct`, `group`, or `room`.
- Ensure your CLI exits 0 and prints plain text; JSON needs to be massaged via `jq -r .text`.
- Keep timeouts reasonable (`timeoutSeconds`, default 60s) to avoid blocking the reply queue.
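To illustrate the first-match-wins scope rule mentioned above, here is a minimal sketch. `scope_allows` is a hypothetical helper, and treating a missing `default` as `"allow"` is an assumption, not documented behavior.

```python
from typing import Any


def scope_allows(scope: dict[str, Any], context: dict[str, str]) -> bool:
    """Return True when the first matching rule (or the default) says allow."""
    for rule in scope.get("rules", []):
        match = rule.get("match", {})
        if all(context.get(key) == value for key, value in match.items()):
            return rule["action"] == "allow"  # first match wins; later rules are ignored
    # Assumption: fall back to "allow" when no default is configured.
    return scope.get("default", "allow") == "allow"


# Mirrors the "Provider-only with scope gating" example: deny groups, allow the rest.
scope = {
    "default": "allow",
    "rules": [{"action": "deny", "match": {"chatType": "group"}}],
}
```

With this config, a `direct` chat falls through to the default and is allowed, while a `group` chat hits the deny rule.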


@@ -74,6 +74,7 @@ openclaw nodes camera clip --node <id> --no-audio
```
Notes:
- `nodes camera snap` defaults to **both** facings to give the agent both views.
- Output files are temporary (in the OS temp directory) unless you build your own wrapper.
@@ -131,6 +132,7 @@ openclaw nodes camera clip --node <id> --no-audio
```
Notes:
- `openclaw nodes camera snap` defaults to `maxWidth=1600` unless overridden.
- On macOS, `camera.snap` waits `delayMs` (default 2000ms) after warm-up/exposure settle before capturing.
- Photo payloads are recompressed to keep base64 under 5 MB.
@@ -142,11 +144,12 @@ Notes:
## macOS screen video (OS-level)
For *screen* video (not camera), use the macOS companion:
For _screen_ video (not camera), use the macOS companion:
```bash
openclaw nodes screen record --node <id> --duration 10s --fps 15 # prints MEDIA:<path>
```
Notes:
- Requires macOS **Screen Recording** permission (TCC).


@@ -3,21 +3,25 @@ summary: "Image and media handling rules for send, gateway, and agent replies"
read_when:
- Modifying media pipeline or attachments
---
# Image & Media Support — 2025-12-05
The WhatsApp channel runs via **Baileys Web**. This document captures the current media handling rules for send, gateway, and agent replies.
## Goals
- Send media with optional captions via `openclaw message send --media`.
- Allow auto-replies from the web inbox to include media alongside text.
- Keep per-type limits sane and predictable.
## CLI Surface
- `openclaw message send --media <path-or-url> [--message <caption>]`
- `--media` optional; caption can be empty for media-only sends.
- `--dry-run` prints the resolved payload; `--json` emits `{ channel, to, messageId, mediaUrl, caption }`.
## WhatsApp Web channel behavior
- Input: local file path **or** HTTP(S) URL.
- Flow: load into a Buffer, detect media kind, and build the correct payload:
- **Images:** resize & recompress to JPEG (max side 2048px) targeting `agents.defaults.mediaMaxMb` (default 5MB), capped at 6MB.
@@ -29,11 +33,13 @@ The WhatsApp channel runs via **Baileys Web**. This document captures the curren
- Logging: non-verbose shows `↩️`/`✅`; verbose includes size and source path/URL.
## Auto-Reply Pipeline
- `getReplyFromConfig` returns `{ text?, mediaUrl?, mediaUrls? }`.
- When media is present, the web sender resolves local paths or URLs using the same pipeline as `openclaw message send`.
- Multiple media entries are sent sequentially if provided.
## Inbound Media to Commands (Pi)
- When inbound web messages include media, OpenClaw downloads to a temp file and exposes templating variables:
- `{{MediaUrl}}` pseudo-URL for the inbound media.
- `{{MediaPath}}` local temp path written before running the command.
@@ -44,18 +50,22 @@ The WhatsApp channel runs via **Baileys Web**. This document captures the curren
- By default only the first matching image/audio/video attachment is processed; set `tools.media.<cap>.attachments` to process multiple attachments.
## Limits & Errors
**Outbound send caps (WhatsApp web send)**
- Images: ~6MB cap after recompression.
- Audio/voice/video: 16MB cap; documents: 100MB cap.
- Oversize or unreadable media → clear error in logs and the reply is skipped.
**Media understanding caps (transcription/description)**
- Image default: 10MB (`tools.media.image.maxBytes`).
- Audio default: 20MB (`tools.media.audio.maxBytes`).
- Video default: 50MB (`tools.media.video.maxBytes`).
- Oversize media skips understanding, but replies still go through with the original body.
## Notes for Tests
- Cover send + reply flows for image/audio/document cases.
- Validate recompression for images (size bound) and voice-note flag for audio.
- Ensure multi-media replies fan out as sequential sends.


@@ -15,6 +15,7 @@ Legacy transport: [Bridge protocol](/gateway/bridge-protocol) (TCP JSONL; deprec
macOS can also run in **node mode**: the menubar app connects to the Gateway's WS server and exposes its local canvas/camera commands as a node (so `openclaw nodes …` works against this Mac).
Notes:
- Nodes are **peripherals**, not gateways. They don't run the gateway service.
- Telegram/WhatsApp/etc. messages land on the **gateway**, not on nodes.
@@ -34,6 +35,7 @@ openclaw nodes describe --node <idOrNameOrIp>
```
Notes:
- `nodes status` marks a node as **paired** when its device pairing role includes `node`.
- `node.pair.*` (CLI: `openclaw nodes pending/approve/reject`) is a separate gateway-owned
node pairing store; it does **not** gate the WS `connect` handshake.
@@ -45,6 +47,7 @@ to execute on another. The model still talks to the **gateway**; the gateway
forwards `exec` calls to the **node host** when `host=node` is selected.
### What runs where
- **Gateway host**: receives messages, runs the model, routes tool calls.
- **Node host**: executes `system.run`/`system.which` on the node machine.
- **Approvals**: enforced on the node host via `~/.openclaw/exec-approvals.json`.
@@ -75,6 +78,7 @@ openclaw nodes list
```
Naming options:
- `--display-name` on `openclaw node run` / `openclaw node install` (persists in `~/.openclaw/node.json` on the node).
- `openclaw nodes rename --node <id|name|ip> --name "Build Node"` (gateway override).
@@ -109,6 +113,7 @@ Once set, any `exec` call with `host=node` runs on the node host (subject to the
node allowlist/approvals).
Related:
- [Node host CLI](/cli/node)
- [Exec tool](/tools/exec)
- [Exec approvals](/tools/exec-approvals)
@@ -144,6 +149,7 @@ openclaw nodes canvas eval --node <idOrNameOrIp> --js "document.title"
```
Notes:
- `canvas present` accepts URLs or local file paths (`--target`), plus optional `--x/--y/--width/--height` for positioning.
- `canvas eval` accepts inline JS (`--js`) or a positional arg.
@@ -156,6 +162,7 @@ openclaw nodes canvas a2ui reset --node <idOrNameOrIp>
```
Notes:
- Only A2UI v0.8 JSONL is supported (v0.9/createSurface is rejected).
## Photos + videos (node camera)
@@ -176,6 +183,7 @@ openclaw nodes camera clip --node <idOrNameOrIp> --duration 3000 --no-audio
```
Notes:
- The node must be **foregrounded** for `canvas.*` and `camera.*` (background calls return `NODE_BACKGROUND_UNAVAILABLE`).
- Clip duration is clamped (currently `<= 60s`) to avoid oversized base64 payloads.
- Android will prompt for `CAMERA`/`RECORD_AUDIO` permissions when possible; denied permissions fail with `*_PERMISSION_REQUIRED`.
@@ -190,6 +198,7 @@ openclaw nodes screen record --node <idOrNameOrIp> --duration 10s --fps 10 --no-
```
Notes:
- `screen.record` requires the node app to be foregrounded.
- Android will show the system screen-capture prompt before recording.
- Screen recordings are clamped to `<= 60s`.
@@ -208,6 +217,7 @@ openclaw nodes location get --node <idOrNameOrIp> --accuracy precise --max-age 1
```
Notes:
- Location is **off by default**.
- “Always” requires system permission; background fetch is best-effort.
- The response includes lat/lon, accuracy (meters), and timestamp.
@@ -223,6 +233,7 @@ openclaw nodes invoke --node <idOrNameOrIp> --command sms.send --params '{"to":"
```
Notes:
- The permission prompt must be accepted on the Android device before the capability is advertised.
- Wi-Fi-only devices without telephony will not advertise `sms.send`.
@@ -239,6 +250,7 @@ openclaw nodes notify --node <idOrNameOrIp> --title "Ping" --body "Gateway ready
```
Notes:
- `system.run` returns stdout/stderr/exit code in the payload.
- `system.notify` respects notification permission state on the macOS app.
- `system.run` supports `--cwd`, `--env KEY=VAL`, `--command-timeout`, and `--needs-screen-recording`.
@@ -290,6 +302,7 @@ openclaw node run --host <gateway-host> --port 18789
```
Notes:
- Pairing is still required (the Gateway will show a node approval prompt).
- The node host stores its node id, token, display name, and gateway connection info in `~/.openclaw/node.json`.
- Exec approvals are enforced locally via `~/.openclaw/exec-approvals.json`


@@ -8,13 +8,16 @@ read_when:
# Location command (nodes)
## TL;DR
- `location.get` is a node command (via `node.invoke`).
- Off by default.
- Settings use a selector: Off / While Using / Always.
- Separate toggle: Precise Location.
## Why a selector (not just a switch)
OS permissions are multi-level. We can expose a selector in-app, but the OS still decides the actual grant.
- iOS/macOS: user can choose **While Using** or **Always** in system prompts/Settings. App can request upgrade, but OS may require Settings.
- Android: background location is a separate permission; on Android 10+ it often requires a Settings flow.
- Precise location is a separate grant (iOS 14+ “Precise”, Android “fine” vs “coarse”).
@@ -22,22 +25,28 @@ OS permissions are multi-level. We can expose a selector in-app, but the OS stil
Selector in UI drives our requested mode; actual grant lives in OS settings.
## Settings model
Per node device:
- `location.enabledMode`: `off | whileUsing | always`
- `location.preciseEnabled`: bool
UI behavior:
- Selecting `whileUsing` requests foreground permission.
- Selecting `always` first ensures `whileUsing`, then requests background (or sends user to Settings if required).
- If OS denies requested level, revert to the highest granted level and show status.
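The "revert to the highest granted level" rule can be sketched as a clamp over an ordered list of modes. This is illustrative only; `effective_mode` is a hypothetical helper and the real app logic may differ.

```python
# Modes ordered from least to most access, matching `location.enabledMode`.
LEVELS = ["off", "whileUsing", "always"]


def effective_mode(requested: str, os_granted: str) -> str:
    """Clamp the requested mode to the highest level the OS actually granted."""
    return LEVELS[min(LEVELS.index(requested), LEVELS.index(os_granted))]
```

For example, requesting `always` when the OS has only granted `whileUsing` yields `whileUsing`.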
## Permissions mapping (node.permissions)
Optional. macOS node reports `location` via the permissions map; iOS/Android may omit it.
## Command: `location.get`
Called via `node.invoke`.
Params (suggested):
```json
{
"timeoutMs": 10000,
@@ -47,6 +56,7 @@ Params (suggested):
```
Response payload:
```json
{
"lat": 48.20849,
@@ -62,6 +72,7 @@ Response payload:
```
Errors (stable codes):
- `LOCATION_DISABLED`: selector is off.
- `LOCATION_PERMISSION_REQUIRED`: permission missing for requested mode.
- `LOCATION_BACKGROUND_UNAVAILABLE`: app is backgrounded but only While Using allowed.
@@ -69,26 +80,32 @@ Errors (stable codes):
- `LOCATION_UNAVAILABLE`: system failure / no providers.
## Background behavior (future)
Goal: model can request location even when node is backgrounded, but only when:
- User selected **Always**.
- OS grants background location.
- App is allowed to run in background for location (iOS background mode / Android foreground service or special allowance).
Push-triggered flow (future):
1) Gateway sends a push to the node (silent push or FCM data).
2) Node wakes briefly and requests location from the device.
3) Node forwards payload to Gateway.
1. Gateway sends a push to the node (silent push or FCM data).
2. Node wakes briefly and requests location from the device.
3. Node forwards payload to Gateway.
Notes:
- iOS: Always permission + background location mode required. Silent push may be throttled; expect intermittent failures.
- Android: background location may require a foreground service; otherwise, expect denial.
## Model/tooling integration
- Tool surface: `nodes` tool adds `location_get` action (node required).
- CLI: `openclaw nodes location get --node <id>`.
- Agent guidelines: only call when user enabled location and understands the scope.
## UX copy (suggested)
- Off: “Location sharing is disabled.”
- While Using: “Only when OpenClaw is open.”
- Always: “Allow background location. Requires system permission.”


@@ -4,22 +4,25 @@ read_when:
- Designing or refactoring media understanding
- Tuning inbound audio/video/image preprocessing
---
# Media Understanding (Inbound) — 2026-01-17
OpenClaw can **summarize inbound media** (image/audio/video) before the reply pipeline runs. It autodetects when local tools or provider keys are available, and can be disabled or customized. If understanding is off, models still receive the original files/URLs as usual.
## Goals
- Optional: predigest inbound media into short text for faster routing + better command parsing.
- Preserve original media delivery to the model (always).
- Support **provider APIs** and **CLI fallbacks**.
- Allow multiple models with ordered fallback (error/size/timeout).
## High-level behavior
1) Collect inbound attachments (`MediaPaths`, `MediaUrls`, `MediaTypes`).
2) For each enabled capability (image/audio/video), select attachments per policy (default: **first**).
3) Choose the first eligible model entry (size + capability + auth).
4) If a model fails or the media is too large, **fall back to the next entry**.
5) On success:
1. Collect inbound attachments (`MediaPaths`, `MediaUrls`, `MediaTypes`).
2. For each enabled capability (image/audio/video), select attachments per policy (default: **first**).
3. Choose the first eligible model entry (size + capability + auth).
4. If a model fails or the media is too large, **fall back to the next entry**.
5. On success:
- `Body` becomes `[Image]`, `[Audio]`, or `[Video]` block.
- Audio sets `{{Transcript}}`; command parsing uses caption text when present,
otherwise the transcript.
@@ -28,7 +31,9 @@ OpenClaw can **summarize inbound media** (image/audio/video) before the reply pi
If understanding fails or is disabled, **the reply flow continues** with the original body + attachments.
## Config overview
`tools.media` supports **shared models** plus per-capability overrides:
- `tools.media.models`: shared model list (use `capabilities` to gate).
- `tools.media.image` / `tools.media.audio` / `tools.media.video`:
- defaults (`prompt`, `maxChars`, `maxBytes`, `timeoutSeconds`, `language`)
@@ -43,21 +48,30 @@ If understanding fails or is disabled, **the reply flow continues** with the ori
{
tools: {
media: {
models: [ /* shared list */ ],
image: { /* optional overrides */ },
audio: { /* optional overrides */ },
video: { /* optional overrides */ }
}
}
models: [
/* shared list */
],
image: {
/* optional overrides */
},
audio: {
/* optional overrides */
},
video: {
/* optional overrides */
},
},
},
}
```
### Model entries
Each `models[]` entry can be **provider** or **CLI**:
```json5
{
type: "provider", // default if omitted
type: "provider", // default if omitted
provider: "openai",
model: "gpt-5.2",
prompt: "Describe the image in <= 500 chars.",
@@ -66,7 +80,7 @@ Each `models[]` entry can be **provider** or **CLI**:
timeoutSeconds: 60,
capabilities: ["image"], // optional, used for multimodal entries
profile: "vision-profile",
preferredProfile: "vision-fallback"
preferredProfile: "vision-fallback",
}
```
@@ -79,22 +93,25 @@ Each `models[]` entry can be **provider** or **CLI**:
"gemini-3-flash",
"--allowed-tools",
"read_file",
"Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters."
"Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters.",
],
maxChars: 500,
maxBytes: 52428800,
timeoutSeconds: 120,
capabilities: ["video", "image"]
capabilities: ["video", "image"],
}
```
CLI templates can also use:
- `{{MediaDir}}` (directory containing the media file)
- `{{OutputDir}}` (scratch dir created for this run)
- `{{OutputBase}}` (scratch file base path, no extension)
## Defaults and limits
Recommended defaults:
- `maxChars`: **500** for image/video (short, command-friendly)
- `maxChars`: **unset** for audio (full transcript unless you set a limit)
- `maxBytes`:
@@ -103,6 +120,7 @@ Recommended defaults:
- video: **50MB**
Rules:
- If media exceeds `maxBytes`, that model is skipped and the **next model is tried**.
- If the model returns more than `maxChars`, output is trimmed.
- `prompt` defaults to simple “Describe the {media}.” plus the `maxChars` guidance (image/video only).
@@ -110,37 +128,42 @@ Rules:
**active reply model** when its provider supports the capability.
### Auto-detect media understanding (default)
If `tools.media.<capability>.enabled` is **not** set to `false` and you haven't
configured models, OpenClaw auto-detects in this order and **stops at the first
working option**:
1) **Local CLIs** (audio only; if installed)
1. **Local CLIs** (audio only; if installed)
- `sherpa-onnx-offline` (requires `SHERPA_ONNX_MODEL_DIR` with encoder/decoder/joiner/tokens)
- `whisper-cli` (`whisper-cpp`; uses `WHISPER_CPP_MODEL` or the bundled tiny model)
- `whisper` (Python CLI; downloads models automatically)
2) **Gemini CLI** (`gemini`) using `read_many_files`
3) **Provider keys**
2. **Gemini CLI** (`gemini`) using `read_many_files`
3. **Provider keys**
- Audio: OpenAI → Groq → Deepgram → Google
- Image: OpenAI → Anthropic → Google → MiniMax
- Video: Google
To disable auto-detection, set:
```json5
{
tools: {
media: {
audio: {
enabled: false
}
}
}
enabled: false,
},
},
},
}
```
Note: Binary detection is best-effort across macOS/Linux/Windows; ensure the CLI is on `PATH` (we expand `~`), or set an explicit CLI model with a full command path.
## Capabilities (optional)
If you set `capabilities`, the entry only runs for those media types. For shared
lists, OpenClaw can infer defaults:
- `openai`, `anthropic`, `minimax`: **image**
- `google` (Gemini API): **image + audio + video**
- `groq`: **audio**
@@ -150,28 +173,35 @@ For CLI entries, **set `capabilities` explicitly** to avoid surprising matches.
If you omit `capabilities`, the entry is eligible for the list it appears in.
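One way to picture capability gating for a shared list is sketched below. The inferred defaults cover only the providers visible in the list above, `entry_capabilities` and `eligible` are hypothetical names, and treating an unknown provider without `capabilities` as ineligible is an assumption.

```python
from typing import Any

# Inferred defaults, limited to the providers listed above.
INFERRED = {
    "openai": {"image"},
    "anthropic": {"image"},
    "minimax": {"image"},
    "google": {"image", "audio", "video"},
    "groq": {"audio"},
}


def entry_capabilities(entry: dict[str, Any]) -> set[str]:
    """Explicit `capabilities` win; otherwise infer from the provider name."""
    if "capabilities" in entry:
        return set(entry["capabilities"])
    # Assumption: unknown providers and CLI entries without capabilities match nothing.
    return INFERRED.get(entry.get("provider", ""), set())


def eligible(shared: list[dict[str, Any]], capability: str) -> list[dict[str, Any]]:
    """Keep shared entries that can handle the given media type, in order."""
    return [entry for entry in shared if capability in entry_capabilities(entry)]


shared = [
    {"provider": "openai", "model": "gpt-5.2"},
    {"provider": "google", "model": "gemini-3-flash-preview"},
    {"provider": "groq", "model": "whisper-large-v3-turbo"},
    {"type": "cli", "command": "gemini", "capabilities": ["image", "video"]},
]
audio_entries = eligible(shared, "audio")  # the google and groq entries, in that order
```

The CLI entry sets `capabilities` explicitly, which is what the guidance above recommends to avoid surprising matches.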
## Provider support matrix (OpenClaw integrations)
| Capability | Provider integration | Notes |
|------------|----------------------|-------|
| Image | OpenAI / Anthropic / Google / others via `pi-ai` | Any image-capable model in the registry works. |
| Audio | OpenAI, Groq, Deepgram, Google | Provider transcription (Whisper/Deepgram/Gemini). |
| Video | Google (Gemini API) | Provider video understanding. |
| Capability | Provider integration | Notes |
| ---------- | ------------------------------------------------ | ------------------------------------------------- |
| Image | OpenAI / Anthropic / Google / others via `pi-ai` | Any image-capable model in the registry works. |
| Audio | OpenAI, Groq, Deepgram, Google | Provider transcription (Whisper/Deepgram/Gemini). |
| Video | Google (Gemini API) | Provider video understanding. |
## Recommended providers
**Image**
- Prefer your active model if it supports images.
- Good defaults: `openai/gpt-5.2`, `anthropic/claude-opus-4-5`, `google/gemini-3-pro-preview`.
**Audio**
- `openai/gpt-4o-mini-transcribe`, `groq/whisper-large-v3-turbo`, or `deepgram/nova-3`.
- CLI fallback: `whisper-cli` (whisper-cpp) or `whisper`.
- Deepgram setup: [Deepgram (audio transcription)](/providers/deepgram).
**Video**
- `google/gemini-3-flash-preview` (fast), `google/gemini-3-pro-preview` (richer).
- CLI fallback: `gemini` CLI (supports `read_file` on video/audio).
## Attachment policy
Per-capability `attachments` controls which attachments are processed:
- `mode`: `first` (default) or `all`
- `maxAttachments`: cap the number processed (default **1**)
- `prefer`: `first`, `last`, `path`, `url`
@@ -181,13 +211,18 @@ When `mode: "all"`, outputs are labeled `[Image 1/2]`, `[Audio 2/2]`, etc.
## Config examples
### 1) Shared models list + overrides
```json5
{
tools: {
media: {
models: [
{ provider: "openai", model: "gpt-5.2", capabilities: ["image"] },
{ provider: "google", model: "gemini-3-flash-preview", capabilities: ["image", "audio", "video"] },
{
provider: "google",
model: "gemini-3-flash-preview",
capabilities: ["image", "audio", "video"],
},
{
type: "cli",
command: "gemini",
@@ -196,23 +231,24 @@ When `mode: "all"`, outputs are labeled `[Image 1/2]`, `[Audio 2/2]`, etc.
"gemini-3-flash",
"--allowed-tools",
"read_file",
"Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters."
"Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters.",
],
capabilities: ["image", "video"]
}
capabilities: ["image", "video"],
},
],
audio: {
attachments: { mode: "all", maxAttachments: 2 }
attachments: { mode: "all", maxAttachments: 2 },
},
video: {
maxChars: 500
}
}
}
maxChars: 500,
},
},
},
}
```
### 2) Audio + Video only (image off)
```json5
{
tools: {
@@ -224,9 +260,9 @@ When `mode: "all"`, outputs are labeled `[Image 1/2]`, `[Audio 2/2]`, etc.
{
type: "cli",
command: "whisper",
args: ["--model", "base", "{{MediaPath}}"]
}
]
args: ["--model", "base", "{{MediaPath}}"],
},
],
},
video: {
enabled: true,
@@ -241,17 +277,18 @@ When `mode: "all"`, outputs are labeled `[Image 1/2]`, `[Audio 2/2]`, etc.
"gemini-3-flash",
"--allowed-tools",
"read_file",
"Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters."
]
}
]
}
}
}
"Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters.",
],
},
],
},
},
},
}
```
### 3) Optional image understanding
```json5
{
tools: {
@@ -271,30 +308,56 @@ When `mode: "all"`, outputs are labeled `[Image 1/2]`, `[Audio 2/2]`, etc.
"gemini-3-flash",
"--allowed-tools",
"read_file",
"Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters."
]
}
]
}
}
}
"Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters.",
],
},
],
},
},
},
}
```
### 4) Multimodal single entry (explicit capabilities)
```json5
{
tools: {
media: {
image: { models: [{ provider: "google", model: "gemini-3-pro-preview", capabilities: ["image", "video", "audio"] }] },
audio: { models: [{ provider: "google", model: "gemini-3-pro-preview", capabilities: ["image", "video", "audio"] }] },
video: { models: [{ provider: "google", model: "gemini-3-pro-preview", capabilities: ["image", "video", "audio"] }] }
}
}
image: {
models: [
{
provider: "google",
model: "gemini-3-pro-preview",
capabilities: ["image", "video", "audio"],
},
],
},
audio: {
models: [
{
provider: "google",
model: "gemini-3-pro-preview",
capabilities: ["image", "video", "audio"],
},
],
},
video: {
models: [
{
provider: "google",
model: "gemini-3-pro-preview",
capabilities: ["image", "video", "audio"],
},
],
},
},
},
}
```
## Status output
When media understanding runs, `/status` includes a short summary line:
```
@@ -304,10 +367,12 @@ When media understanding runs, `/status` includes a short summary line:
This shows per-capability outcomes and the chosen provider/model when applicable.
## Notes
- Understanding is **best-effort**. Errors do not block replies.
- Attachments are still passed to models even when understanding is disabled.
- Use `scope` to limit where understanding runs (e.g. only DMs).
## Related docs
- [Configuration](/gateway/configuration)
- [Image & Media Support](/nodes/images)


@@ -4,15 +4,18 @@ read_when:
- Implementing Talk mode on macOS/iOS/Android
- Changing voice/TTS/interrupt behavior
---
# Talk Mode
Talk mode is a continuous voice conversation loop:
1) Listen for speech
2) Send transcript to the model (main session, chat.send)
3) Wait for the response
4) Speak it via ElevenLabs (streaming playback)
1. Listen for speech
2. Send transcript to the model (main session, chat.send)
3. Wait for the response
4. Speak it via ElevenLabs (streaming playback)
## Behavior (macOS)
- **Always-on overlay** while Talk mode is enabled.
- **Listening → Thinking → Speaking** phase transitions.
- On a **short pause** (silence window), the current transcript is sent.
@@ -20,13 +23,15 @@ Talk mode is a continuous voice conversation loop:
- **Interrupt on speech** (default on): if the user starts talking while the assistant is speaking, we stop playback and note the interruption timestamp for the next prompt.
## Voice directives in replies
The assistant may prefix its reply with a **single JSON line** to control voice:
```json
{"voice":"<voice-id>","once":true}
{ "voice": "<voice-id>", "once": true }
```
Rules:
- First non-empty line only.
- Unknown keys are ignored.
- `once: true` applies to the current reply only.
@@ -34,6 +39,7 @@ Rules:
- The JSON line is stripped before TTS playback.
Supported keys:
- `voice` / `voice_id` / `voiceId`
- `model` / `model_id` / `modelId`
- `speed`, `rate` (WPM), `stability`, `similarity`, `style`, `speakerBoost`
@@ -41,19 +47,21 @@ Supported keys:
- `once`
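A minimal sketch of how a voice directive line could be split from the spoken text, following the rules above (first non-empty line only, unknown keys ignored, JSON line stripped before playback). `split_directive` is a hypothetical helper, only a subset of the supported keys is modeled, and normalizing aliases to a single name is an assumption.

```python
import json
from typing import Any

# Alias table for a subset of the supported keys; the canonical names are illustrative.
ALIASES = {
    "voice": "voice", "voice_id": "voice", "voiceId": "voice",
    "model": "model", "model_id": "model", "modelId": "model",
    "speed": "speed", "once": "once",
}


def split_directive(reply: str) -> tuple[dict[str, Any], str]:
    """Return (directive, text_to_speak); directive is {} when none is present."""
    lines = reply.splitlines()
    for index, line in enumerate(lines):
        if not line.strip():
            continue  # skip leading blank lines
        try:
            parsed = json.loads(line)
        except json.JSONDecodeError:
            return {}, reply  # first non-empty line is not JSON: speak everything
        if not isinstance(parsed, dict):
            return {}, reply  # valid JSON but not an object, so not a directive
        directive = {ALIASES[k]: v for k, v in parsed.items() if k in ALIASES}
        spoken = "\n".join(lines[:index] + lines[index + 1:]).strip()
        return directive, spoken
    return {}, reply
```

Only the first non-empty line is ever inspected, so a JSON object appearing later in the reply is spoken as ordinary text.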
## Config (`~/.openclaw/openclaw.json`)
```json5
{
"talk": {
"voiceId": "elevenlabs_voice_id",
"modelId": "eleven_v3",
"outputFormat": "mp3_44100_128",
"apiKey": "elevenlabs_api_key",
"interruptOnSpeech": true
}
talk: {
voiceId: "elevenlabs_voice_id",
modelId: "eleven_v3",
outputFormat: "mp3_44100_128",
apiKey: "elevenlabs_api_key",
interruptOnSpeech: true,
},
}
```
Defaults:
- `interruptOnSpeech`: true
- `voiceId`: falls back to `ELEVENLABS_VOICE_ID` / `SAG_VOICE_ID` (or first ElevenLabs voice when API key is available)
- `modelId`: defaults to `eleven_v3` when unset
@@ -61,6 +69,7 @@ Defaults:
- `outputFormat`: defaults to `pcm_44100` on macOS/iOS and `pcm_24000` on Android (set `mp3_*` to force MP3 streaming)
## macOS UI
- Menu bar toggle: **Talk**
- Config tab: **Talk Mode** group (voice id + interrupt toggle)
- Overlay:
@@ -71,6 +80,7 @@ Defaults:
- Click X: exit Talk mode
## Notes
- Requires Speech + Microphone permissions.
- Uses `chat.send` against session key `main`.
- TTS uses ElevenLabs streaming API with `ELEVENLABS_API_KEY` and incremental playback on macOS/iOS/Android for lower latency.


@@ -4,6 +4,7 @@ read_when:
- Changing voice wake words behavior or defaults
- Adding new node platforms that need wake word sync
---
# Voice Wake (Global Wake Words)
OpenClaw treats **wake words as a single global list** owned by the **Gateway**.
@@ -32,6 +33,7 @@ Shape:
- `voicewake.set` with params `{ triggers: string[] }` → `{ triggers: string[] }`
Notes:
- Triggers are normalized (trimmed, empties dropped). Empty lists fall back to defaults.
- Limits are enforced for safety (count/length caps).
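The normalization described in these notes might look roughly like this. The default trigger list and both caps are hypothetical placeholders (the real values are not stated here), and dropping over-long triggers rather than truncating them is an assumption.

```python
DEFAULT_TRIGGERS = ["openclaw"]  # hypothetical default; the real list is not given here
MAX_TRIGGERS = 8                 # hypothetical count cap
MAX_TRIGGER_LENGTH = 32          # hypothetical length cap


def normalize_triggers(triggers: list[str]) -> list[str]:
    """Trim entries, drop empties, apply caps, and fall back to defaults if empty."""
    cleaned = [trigger.strip() for trigger in triggers]
    # Assumption: over-long triggers are dropped rather than truncated.
    cleaned = [t for t in cleaned if t and len(t) <= MAX_TRIGGER_LENGTH]
    cleaned = cleaned[:MAX_TRIGGERS]
    return cleaned or list(DEFAULT_TRIGGERS)
```

A list that is empty after cleaning, such as `["  "]`, falls back to the defaults just like `[]`.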
@@ -40,6 +42,7 @@ Notes:
- `voicewake.changed` payload `{ triggers: string[] }`
Who receives it:
- All WebSocket clients (macOS app, WebChat, etc.)
- All connected nodes (iOS/Android), and also on node connect as an initial “current state” push.