docs: document music generation async flow

2026-04-23 14:45:46 +00:00 · 2026-04-06 01:46:25 +01:00
parent 3027f0dde5
commit f6dbcf4cda
13 changed files with 253 additions and 46 deletions
--- a/docs/tools/index.md
+++ b/docs/tools/index.md
@@ -66,6 +66,7 @@ These tools ship with OpenClaw and are available without installing any plugins:
 | `nodes`                                    | Discover and target paired devices                                    |                                             |
 | `cron` / `gateway`                         | Manage scheduled jobs; inspect, patch, restart, or update the gateway |                                             |
 | `image` / `image_generate`                 | Analyze or generate images                                            | [Image Generation](/tools/image-generation) |
+| `music_generate`                           | Generate music tracks                                                 | [Music Generation](/tools/music-generation) |
 | `video_generate`                           | Generate videos                                                       | [Video Generation](/tools/video-generation) |
 | `tts`                                      | One-shot text-to-speech conversion                                    | [TTS](/tools/tts)                           |
 | `sessions_*` / `subagents` / `agents_list` | Session management, status, and sub-agent orchestration               | [Sub-agents](/tools/subagents)              |
@@ -73,6 +74,8 @@ These tools ship with OpenClaw and are available without installing any plugins:

 For image work, use `image` for analysis and `image_generate` for generation or editing. If you target `openai/*`, `google/*`, `fal/*`, or another non-default image provider, configure that provider's auth/API key first.

+For music work, use `music_generate`. If you target `google/*`, `minimax/*`, or another non-default music provider, configure that provider's auth/API key first.
+
 For video work, use `video_generate`. If you target `qwen/*` or another non-default video provider, configure that provider's auth/API key first.

 For workflow-driven audio generation, use `music_generate` when a plugin such as
@@ -128,12 +131,12 @@ config. Deny always wins over allow.
 `tools.profile` sets a base allowlist before `allow`/`deny` is applied.
 Per-agent override: `agents.list[].tools.profile`.

-| Profile     | What it includes                                                                                                                |
-| ----------- | ------------------------------------------------------------------------------------------------------------------------------- |
-| `full`      | No restriction (same as unset)                                                                                                  |
-| `coding`    | `group:fs`, `group:runtime`, `group:web`, `group:sessions`, `group:memory`, `cron`, `image`, `image_generate`, `video_generate` |
-| `messaging` | `group:messaging`, `sessions_list`, `sessions_history`, `sessions_send`, `session_status`                                       |
-| `minimal`   | `session_status` only                                                                                                           |
+| Profile     | What it includes                                                                                                                                  |
+| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `full`      | No restriction (same as unset)                                                                                                                    |
+| `coding`    | `group:fs`, `group:runtime`, `group:web`, `group:sessions`, `group:memory`, `cron`, `image`, `image_generate`, `music_generate`, `video_generate` |
+| `messaging` | `group:messaging`, `sessions_list`, `sessions_history`, `sessions_send`, `session_status`                                                         |
+| `minimal`   | `session_status` only                                                                                                                             |

 ### Tool groups

@@ -151,7 +154,7 @@ Use `group:*` shorthands in allow/deny lists:
 | `group:messaging`  | message                                                                                                   |
 | `group:nodes`      | nodes                                                                                                     |
 | `group:agents`     | agents_list                                                                                               |
-| `group:media`      | image, image_generate, video_generate, tts                                                                |
+| `group:media`      | image, image_generate, music_generate, video_generate, tts                                                |
 | `group:openclaw`   | All built-in OpenClaw tools (excludes plugin tools)                                                       |

 `sessions_history` returns a bounded, safety-filtered recall view. It strips
--- a/docs/tools/music-generation.md
+++ b/docs/tools/music-generation.md
@@ -1,23 +1,68 @@
 ---
-summary: "Generate music or audio with plugin-provided tools such as ComfyUI workflows"
+summary: "Generate music with shared providers or plugin-provided workflows"
 read_when:
  - Generating music or audio via the agent
-  - Configuring plugin-provided music generation tools
+  - Configuring music generation providers and models
  - Understanding the music_generate tool parameters
 title: "Music Generation"
 ---

 # Music Generation

-The `music_generate` tool lets the agent create audio files when a plugin
-registers music generation support.
+The `music_generate` tool lets the agent create music or audio through either:

-The bundled `comfy` plugin currently provides `music_generate` using a
-workflow-configured ComfyUI graph.
+- the shared music-generation capability with configured providers such as Google and MiniMax
+- plugin-provided tool surfaces such as a workflow-configured ComfyUI graph
+
+For shared provider-backed agent sessions, OpenClaw starts music generation as a
+background task, tracks it in the task ledger, then wakes the agent again when
+the track is ready so the agent can post the finished audio back into the
+original channel.
+
+<Note>
+The built-in shared tool only appears when at least one music-generation provider is available. If you don't see `music_generate` in your agent's tools, configure `agents.defaults.musicGenerationModel` or set up a provider API key.
+</Note>
+
+<Note>
+Plugin-provided `music_generate` implementations can expose different parameters or runtime behavior. The async task/status flow below applies to the built-in shared provider-backed path.
+</Note>

 ## Quick start

-1. Configure `models.providers.comfy.music` with a workflow JSON and prompt/output nodes.
+### Shared provider-backed generation
+
+1. Set an API key for at least one provider, for example `GEMINI_API_KEY` or
+   `MINIMAX_API_KEY`.
+2. Optionally set your preferred model:
+
+```json5
+{
+  agents: {
+    defaults: {
+      musicGenerationModel: {
+        primary: "google/lyria-3-clip-preview",
+      },
+    },
+  },
+}
+```
+
+3. Ask the agent: _"Generate an upbeat synthpop track about a night drive
+   through a neon city."_
+
+The agent calls `music_generate` automatically; no tool allow-listing is needed.
+
+For direct synchronous contexts without a session-backed agent run, the built-in
+tool still falls back to inline generation and returns the final media path in
+the tool result.
+
+### Workflow-driven plugin generation
+
+The bundled `comfy` plugin can also provide `music_generate` using a
+workflow-configured ComfyUI graph.
+
+1. Configure `models.providers.comfy.music` with a workflow JSON and
+   prompt/output nodes.
 2. If you use Comfy Cloud, set `COMFY_API_KEY` or `COMFY_CLOUD_API_KEY`.
 3. Ask the agent for music or call the tool directly.

@@ -27,22 +72,102 @@ Example:
 /tool music_generate prompt="Warm ambient synth loop with soft tape texture"
 ```

-## Tool parameters
+## Shared bundled provider support

-| Parameter  | Type   | Description                                         |
-| ---------- | ------ | --------------------------------------------------- |
-| `prompt`   | string | Music or audio generation prompt                    |
-| `action`   | string | `"generate"` (default) or `"list"`                  |
-| `model`    | string | Provider/model override. Currently `comfy/workflow` |
-| `filename` | string | Output filename hint for the saved audio file       |
+| Provider | Default model          | Reference inputs | Supported controls                                        | API key                            |
+| -------- | ---------------------- | ---------------- | --------------------------------------------------------- | ---------------------------------- |
+| Google   | `lyria-3-clip-preview` | Up to 10 images  | `lyrics`, `instrumental`, `format`                        | `GEMINI_API_KEY`, `GOOGLE_API_KEY` |
+| MiniMax  | `music-2.5+`           | None             | `lyrics`, `instrumental`, `durationSeconds`, `format=mp3` | `MINIMAX_API_KEY`                  |

-## Current provider support
+## Plugin-provided support

 | Provider | Model      | Notes                           |
 | -------- | ---------- | ------------------------------- |
 | ComfyUI  | `workflow` | Workflow-defined music or audio |

-## Live test
+Use `action: "list"` to inspect available shared providers and models at
+runtime:
+
+```text
+/tool music_generate action=list
+```
+
+## Built-in tool parameters
+
+| Parameter         | Type     | Description                                                                                       |
+| ----------------- | -------- | ------------------------------------------------------------------------------------------------- |
+| `prompt`          | string   | Music generation prompt (required for `action: "generate"`)                                       |
+| `action`          | string   | `"generate"` (default), `"status"` for the current session task, or `"list"` to inspect providers |
+| `model`           | string   | Provider/model override, e.g. `google/lyria-3-pro-preview` or `comfy/workflow`                   |
+| `lyrics`          | string   | Optional lyrics when the provider supports explicit lyric input                                   |
+| `instrumental`    | boolean  | Request instrumental-only output when the provider supports it                                    |
+| `image`           | string   | Single reference image path or URL                                                                |
+| `images`          | string[] | Multiple reference images (up to 10)                                                              |
+| `durationSeconds` | number   | Target duration in seconds when the provider supports duration hints                              |
+| `format`          | string   | Output format hint (`mp3` or `wav`) when the provider supports it                                 |
+| `filename`        | string   | Output filename hint                                                                              |
+
+Not all providers or plugins support all parameters. The shared built-in tool
+validates provider capability limits before it submits the request.
+
+## Async behavior for the shared provider-backed path
+
+- Session-backed agent runs: `music_generate` creates a background task, returns a started/task response immediately, and posts the finished track later in a follow-up agent message.
+- Duplicate prevention: while that background task is still `queued` or `running`, later `music_generate` calls in the same session return task status instead of starting another generation.
+- Status lookup: use `action: "status"` to inspect the active session-backed music task without starting a new one.
+- Task tracking: use `openclaw tasks list` or `openclaw tasks show <taskId>` to inspect queued, running, and terminal status for the generation.
+- Completion wake: OpenClaw injects an internal completion event back into the same session so the model can write the user-facing follow-up itself.
+- Prompt hint: later user/manual turns in the same session get a small runtime hint when a music task is already in flight so the model does not blindly call `music_generate` again.
+- No-session fallback: direct/local contexts without a real agent session still run inline and return the final audio result in the same turn.
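The duplicate-prevention and status rules listed above can be sketched in TypeScript. This is a minimal illustration under stated assumptions, not OpenClaw's task ledger: `MusicTaskLedger`, `MusicTask`, `ToolResult`, and the terminal status names `succeeded` and `failed` are hypothetical. Only `queued`, `running`, and the `generate`/`status` actions come from the text.

```typescript
// Hypothetical in-memory ledger for illustration; not OpenClaw's implementation.
type TaskStatus = "queued" | "running" | "succeeded" | "failed"; // terminal names are assumed

interface MusicTask {
  taskId: string;
  sessionId: string;
  prompt: string;
  status: TaskStatus;
}

type ToolResult =
  | { kind: "started"; task: MusicTask }
  | { kind: "status"; task: MusicTask | undefined };

class MusicTaskLedger {
  private bySession = new Map<string, MusicTask>();
  private nextId = 1;

  // Returns the session's task only while it is still queued or running.
  private active(sessionId: string): MusicTask | undefined {
    const t = this.bySession.get(sessionId);
    return t && (t.status === "queued" || t.status === "running") ? t : undefined;
  }

  handle(sessionId: string, action: "generate" | "status", prompt = ""): ToolResult {
    const inFlight = this.active(sessionId);
    // A status lookup, or a generate call while a task is in flight,
    // reports task state instead of starting another generation.
    if (action === "status" || inFlight) {
      return { kind: "status", task: inFlight ?? this.bySession.get(sessionId) };
    }
    const task: MusicTask = {
      taskId: `task-${this.nextId++}`,
      sessionId,
      prompt,
      status: "queued",
    };
    this.bySession.set(sessionId, task);
    return { kind: "started", task };
  }

  // Marks the session's task as terminal so a later generate call can start a new one.
  finish(sessionId: string, ok: boolean): void {
    const t = this.bySession.get(sessionId);
    if (t) t.status = ok ? "succeeded" : "failed";
  }
}
```

In this sketch a second `generate` call in the same session returns the existing task until `finish` is called, which mirrors the behavior described for session-backed runs. The no-session inline fallback is not modeled here.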
+
+## Configuration
+
+### Model selection
+
+```json5
+{
+  agents: {
+    defaults: {
+      musicGenerationModel: {
+        primary: "google/lyria-3-clip-preview",
+        fallbacks: ["minimax/music-2.5+"],
+      },
+    },
+  },
+}
+```
+
+### Provider selection order
+
+When generating music, OpenClaw tries providers in this order:
+
+1. `model` parameter from the tool call, if the agent specifies one
+2. `musicGenerationModel.primary` from config
+3. `musicGenerationModel.fallbacks` in order
+4. Auto-detection using auth-backed provider defaults only:
+   - current default provider first
+   - remaining registered music-generation providers in provider-id order
+
+If a provider fails, the next candidate is tried automatically. If all fail, the
+error includes details from each attempt.
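The selection order and failover described above can be sketched in TypeScript. This is a hedged illustration, not OpenClaw's code: `buildCandidates`, `generateWithFailover`, `MusicModelConfig`, and `ProviderInfo` are hypothetical names. Two behaviors are assumptions rather than documented facts: candidates are deduplicated, and auto-detected providers are appended after any configured ones.

```typescript
// Hypothetical types and helpers for illustration; not OpenClaw's actual API.
interface MusicModelConfig {
  primary?: string;
  fallbacks?: string[];
}

interface ProviderInfo {
  id: string;           // e.g. "google"
  defaultModel: string; // e.g. "lyria-3-clip-preview"
  hasAuth: boolean;     // true when an API key is configured
}

function buildCandidates(
  toolModel: string | undefined,
  config: MusicModelConfig,
  providers: ProviderInfo[],
  defaultProviderId?: string,
): string[] {
  const ordered: string[] = [];
  // Assumption: skip empty entries and duplicates.
  const push = (ref?: string) => {
    if (ref && !ordered.includes(ref)) ordered.push(ref);
  };

  push(toolModel);                                  // 1. tool-call override
  push(config.primary);                             // 2. configured primary
  for (const f of config.fallbacks ?? []) push(f);  // 3. configured fallbacks

  // 4. Auto-detection: auth-backed providers only, current default first,
  //    then the rest in provider-id order.
  const authed = providers.filter((p) => p.hasAuth);
  const current = authed.find((p) => p.id === defaultProviderId);
  if (current) push(`${current.id}/${current.defaultModel}`);
  for (const p of [...authed].sort((a, b) => a.id.localeCompare(b.id))) {
    push(`${p.id}/${p.defaultModel}`);
  }
  return ordered;
}

// Tries each candidate in order and collects per-attempt errors.
async function generateWithFailover<T>(
  candidates: string[],
  attempt: (modelRef: string) => Promise<T>,
): Promise<T> {
  const errors: string[] = [];
  for (const ref of candidates) {
    try {
      return await attempt(ref);
    } catch (err) {
      errors.push(`${ref}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  throw new Error(`All music providers failed:\n${errors.join("\n")}`);
}
```

Building the candidate list separately from the retry loop keeps the ordering easy to test in isolation, and the aggregated error message carries one line per failed attempt.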
+
+## Provider notes
+
+- Google uses Lyria 3 batch generation. The current bundled flow supports
+  prompt, optional lyrics text, and optional reference images.
+- MiniMax uses the batch `music_generation` endpoint. The current bundled flow
+  supports prompt, optional lyrics, instrumental mode, duration steering, and
+  mp3 output.
+- ComfyUI support is workflow-driven and depends on the configured graph plus
+  node mapping for prompt/output fields.
+
+## Live tests
+
+Opt-in live coverage for the shared bundled providers:
+
+```bash
+OPENCLAW_LIVE_TEST=1 pnpm test:live -- extensions/music-generation-providers.live.test.ts
+```

 Opt-in live coverage for the bundled ComfyUI music path:

@@ -50,10 +175,15 @@ Opt-in live coverage for the bundled ComfyUI music path:
 OPENCLAW_LIVE_TEST=1 COMFY_LIVE_TEST=1 pnpm test:live -- extensions/comfy/comfy.live.test.ts
 ```

-The live file also covers comfy image and video workflows when those sections
-are configured.
+The Comfy live file also covers Comfy image and video workflows when those
+sections are configured.

 ## Related

+- [Background Tasks](/automation/tasks) - task tracking for detached `music_generate` runs
+- [Configuration Reference](/gateway/configuration-reference#agent-defaults) - `musicGenerationModel` config
 - [ComfyUI](/providers/comfy)
+- [Google (Gemini)](/providers/google)
+- [MiniMax](/providers/minimax)
+- [Models](/concepts/models) - model configuration and failover
 - [Tools Overview](/tools)
--- a/docs/tools/plugin.md
+++ b/docs/tools/plugin.md
@@ -319,6 +319,7 @@ Common registration methods:
 | `registerRealtimeVoiceProvider`         | Duplex realtime voice       |
 | `registerMediaUnderstandingProvider`    | Image/audio analysis        |
 | `registerImageGenerationProvider`       | Image generation            |
+| `registerMusicGenerationProvider`       | Music generation            |
 | `registerVideoGenerationProvider`       | Video generation            |
 | `registerWebFetchProvider`              | Web fetch / scrape provider |
 | `registerWebSearchProvider`             | Web search                  |