feat(tts): add BytePlus Seed Speech provider

Add Volcengine/BytePlus Seed Speech as a bundled TTS provider with current API-key auth, legacy AppID/token fallback, native Ogg/Opus voice-note output, and MP3 audio-file output.

Co-authored-by: Peter Steinberger <steipete@gmail.com>
This commit is contained in:
Rui Xu
2026-04-25 23:34:57 +01:00
committed by Peter Steinberger
parent b1b29a8fc2
commit 1531123d35
10 changed files with 937 additions and 16 deletions

View File

@@ -55,6 +55,7 @@ Docs: https://docs.openclaw.ai
- Providers/ElevenLabs: include `eleven_v3` in the bundled TTS model catalog so model selection surfaces can offer ElevenLabs v3. (#68321) Thanks @itsuzef.
- Providers/Local CLI TTS: add a bundled local command speech provider with file/stdout input, voice-note Opus conversion, and telephony PCM output. (#56239) Thanks @solar2ain.
- Providers/Inworld: add Inworld as a bundled speech provider with streaming TTS synthesis, voice listing, voice-note output, and PCM telephony output. (#55972) Thanks @cshape.
- Providers/Volcengine: add Volcengine/BytePlus Seed Speech as a bundled TTS provider with API-key auth, native Ogg/Opus voice-note output, and MP3 audio-file output. (#55641) Thanks @xuruiray.
- Android/Talk Mode: expose Talk Mode in the Voice tab with runtime-owned voice capture modes and microphone foreground-service escalation. Thanks @alex-latitude.
- Providers/LiteLLM: register `litellm` as an image-generation provider so `image_generate model=litellm/...` calls and `agents.defaults.imageGenerationModel.fallbacks` entries resolve through the LiteLLM proxy. Thanks @zqchris.
- Codex harness: require Codex app-server `0.125.0` or newer and cover native MCP `PreToolUse`, `PostToolUse`, and `PermissionRequest` payloads through the OpenClaw hook relay.

View File

@@ -111,6 +111,10 @@
"source": "BytePlus (International)",
"target": "BytePlus国际版"
},
{
"source": "Volcengine TTS HTTP API",
"target": "Volcengine TTS HTTP API"
},
{
"source": "Amazon Bedrock Mantle",
"target": "Amazon Bedrock Mantle"

View File

@@ -1,20 +1,23 @@
---
summary: "Volcano Engine setup (Doubao models, general + coding endpoints)"
summary: "Volcano Engine setup (Doubao models, coding endpoints, and Seed Speech TTS)"
title: "Volcengine (Doubao)"
read_when:
- You want to use Volcano Engine or Doubao models with OpenClaw
- You need the Volcengine API key setup
- You want to use Volcengine Speech text-to-speech
---
The Volcengine provider gives access to Doubao models and third-party models
hosted on Volcano Engine, with separate endpoints for general and coding
workloads.
workloads. The same bundled plugin can also register Volcengine Speech as a TTS
provider.
| Detail | Value |
| --------- | --------------------------------------------------- |
| Providers | `volcengine` (general) + `volcengine-plan` (coding) |
| Auth | `VOLCANO_ENGINE_API_KEY` |
| API | OpenAI-compatible |
| Detail | Value |
| ---------- | ---------------------------------------------------------- |
| Providers | `volcengine` (general + TTS) + `volcengine-plan` (coding) |
| Model auth | `VOLCANO_ENGINE_API_KEY` |
| TTS auth | `VOLCENGINE_TTS_API_KEY` or `BYTEPLUS_SEED_SPEECH_API_KEY` |
| API | OpenAI-compatible models, BytePlus Seed Speech TTS |
## Getting started
@@ -95,6 +98,59 @@ Both providers are configured from a single API key. Setup registers both automa
</Tab>
</Tabs>
## Text-to-speech
Volcengine TTS uses the BytePlus Seed Speech HTTP API and is configured
separately from the OpenAI-compatible Doubao model API key. In the BytePlus
console, open Seed Speech > Settings > API Keys and copy the API key, then set:
```bash
export VOLCENGINE_TTS_API_KEY="byteplus_seed_speech_api_key"
export VOLCENGINE_TTS_RESOURCE_ID="seed-tts-1.0"
```
Then enable it in `openclaw.json`:
```json5
{
messages: {
tts: {
auto: "always",
provider: "volcengine",
providers: {
volcengine: {
apiKey: "byteplus_seed_speech_api_key",
voice: "en_female_anna_mars_bigtts",
speedRatio: 1.0,
},
},
},
},
}
```
For voice-note targets, OpenClaw asks Volcengine for provider-native
`ogg_opus`. For normal audio attachments, it asks for `mp3`. Provider aliases
`bytedance` and `doubao` also resolve to the same speech provider.
The default resource id is `seed-tts-1.0` because that is what BytePlus grants
to newly created Seed Speech API keys in the default project. If your project
has TTS 2.0 entitlement, set `VOLCENGINE_TTS_RESOURCE_ID=seed-tts-2.0`.
<Warning>
`VOLCANO_ENGINE_API_KEY` is for the ModelArk/Doubao model endpoints and is not a
Seed Speech API key. TTS needs a Seed Speech API key from the BytePlus Speech
Console, or a legacy Speech Console AppID/token pair.
</Warning>
Legacy AppID/token auth remains supported for older Speech Console applications:
```bash
export VOLCENGINE_TTS_APPID="speech_app_id"
export VOLCENGINE_TTS_TOKEN="speech_access_token"
export VOLCENGINE_TTS_CLUSTER="volcano_tts"
```
## Advanced configuration
<AccordionGroup>
@@ -112,8 +168,10 @@ Both providers are configured from a single API key. Setup registers both automa
</Accordion>
<Accordion title="Environment variables for daemon processes">
If the Gateway runs as a daemon (launchd/systemd), make sure
`VOLCANO_ENGINE_API_KEY` is available to that process (for example, in
If the Gateway runs as a daemon (launchd/systemd), make sure model and TTS
env vars such as `VOLCANO_ENGINE_API_KEY`, `VOLCENGINE_TTS_API_KEY`,
`BYTEPLUS_SEED_SPEECH_API_KEY`, `VOLCENGINE_TTS_APPID`, and
`VOLCENGINE_TTS_TOKEN` are available to that process (for example, in
`~/.openclaw/.env` or via `env.shellEnv`).
</Accordion>
</AccordionGroup>

View File

@@ -7,7 +7,7 @@ read_when:
title: "Text-to-speech"
---
OpenClaw can convert outbound replies into audio using ElevenLabs, Google Gemini, Gradium, Inworld, Local CLI, Microsoft, MiniMax, OpenAI, Vydra, xAI, or Xiaomi MiMo.
OpenClaw can convert outbound replies into audio using ElevenLabs, Google Gemini, Gradium, Inworld, Local CLI, Microsoft, MiniMax, OpenAI, Volcengine, Vydra, xAI, or Xiaomi MiMo.
It works anywhere OpenClaw can send audio.
## Supported services
@@ -20,6 +20,7 @@ It works anywhere OpenClaw can send audio.
- **Microsoft** (primary or fallback provider; current bundled implementation uses `node-edge-tts`)
- **MiniMax** (primary or fallback provider; uses the T2A v2 API)
- **OpenAI** (primary or fallback provider; also used for summaries)
- **Volcengine** (primary or fallback provider; uses the BytePlus Seed Speech HTTP API)
- **Vydra** (primary or fallback provider; shared image, video, and speech provider)
- **xAI** (primary or fallback provider; uses the xAI TTS API)
- **Xiaomi MiMo** (primary or fallback provider; uses MiMo TTS through Xiaomi chat completions)
@@ -39,7 +40,7 @@ or ElevenLabs.
## Optional keys
If you want ElevenLabs, Google Gemini, Gradium, Inworld, MiniMax, OpenAI, Vydra, xAI, or Xiaomi MiMo:
If you want ElevenLabs, Google Gemini, Gradium, Inworld, MiniMax, OpenAI, Volcengine, Vydra, xAI, or Xiaomi MiMo:
- `ELEVENLABS_API_KEY` (or `XI_API_KEY`)
- `GEMINI_API_KEY` (or `GOOGLE_API_KEY`)
@@ -49,6 +50,9 @@ If you want ElevenLabs, Google Gemini, Gradium, Inworld, MiniMax, OpenAI, Vydra,
`MINIMAX_OAUTH_TOKEN`, `MINIMAX_CODE_PLAN_KEY`, or
`MINIMAX_CODING_API_KEY`
- `OPENAI_API_KEY`
- `VOLCENGINE_TTS_API_KEY` (or `BYTEPLUS_SEED_SPEECH_API_KEY`);
legacy AppID/token auth also accepts `VOLCENGINE_TTS_APPID` and
`VOLCENGINE_TTS_TOKEN`
- `VYDRA_API_KEY`
- `XAI_API_KEY`
- `XIAOMI_API_KEY`
@@ -68,6 +72,7 @@ so that provider must also be authenticated if you enable summaries.
- [Gradium](/providers/gradium)
- [Inworld TTS API](https://docs.inworld.ai/tts/tts)
- [MiniMax T2A v2 API](https://platform.minimaxi.com/document/T2A%20V2)
- [Volcengine TTS HTTP API](/providers/volcengine#text-to-speech)
- [Xiaomi MiMo speech synthesis](/providers/xiaomi#text-to-speech)
- [node-edge-tts](https://github.com/SchneeHertz/node-edge-tts)
- [Microsoft Speech output formats](https://learn.microsoft.com/azure/ai-services/speech-service/rest-text-to-speech#audio-outputs)
@@ -249,6 +254,35 @@ encoding, so do not pass a raw bearer token and do not Base64-encode it
yourself. The key falls back to the `INWORLD_API_KEY` env var. See
[Inworld provider](/providers/inworld) for full setup.
### Volcengine primary
```json5
{
messages: {
tts: {
auto: "always",
provider: "volcengine",
providers: {
volcengine: {
apiKey: "byteplus_seed_speech_api_key",
resourceId: "seed-tts-1.0",
voice: "en_female_anna_mars_bigtts",
speedRatio: 1.0,
},
},
},
},
}
```
Volcengine TTS uses the BytePlus Seed Speech API key from the Speech Console,
not the OpenAI-compatible `VOLCANO_ENGINE_API_KEY` used for Doubao model
providers. Resolution order is `messages.tts.providers.volcengine.apiKey` ->
`VOLCENGINE_TTS_API_KEY` -> `BYTEPLUS_SEED_SPEECH_API_KEY`. Legacy AppID/token
auth still works through `messages.tts.providers.volcengine.appId` / `token` or
`VOLCENGINE_TTS_APPID` / `VOLCENGINE_TTS_TOKEN`. Voice-note targets request
provider-native `ogg_opus`; normal audio-file targets request `mp3`.
### xAI primary
```json5
@@ -447,7 +481,7 @@ Then run:
- `tagged` only sends audio when the reply includes `[[tts:key=value]]` directives or a `[[tts:text]]...[[/tts:text]]` block.
- `enabled`: legacy toggle (doctor migrates this to `auto`).
- `mode`: `"final"` (default) or `"all"` (includes tool/block replies).
- `provider`: speech provider id such as `"elevenlabs"`, `"google"`, `"gradium"`, `"inworld"`, `"microsoft"`, `"minimax"`, `"openai"`, `"vydra"`, `"xai"`, or `"xiaomi"` (fallback is automatic).
- `provider`: speech provider id such as `"elevenlabs"`, `"google"`, `"gradium"`, `"inworld"`, `"microsoft"`, `"minimax"`, `"openai"`, `"volcengine"`, `"vydra"`, `"xai"`, or `"xiaomi"` (fallback is automatic).
- If `provider` is **unset**, OpenClaw uses the first configured speech provider in registry auto-select order.
- Legacy `provider: "edge"` config is repaired by `openclaw doctor --fix` and
rewritten to `provider: "microsoft"`.
@@ -461,7 +495,7 @@ Then run:
- `maxTextLength`: hard cap for TTS input (chars). `/tts audio` fails if exceeded.
- `timeoutMs`: request timeout (ms).
- `prefsPath`: override the local prefs JSON path (provider/limit/summary).
- `apiKey` values fall back to env vars (`ELEVENLABS_API_KEY`/`XI_API_KEY`, `GEMINI_API_KEY`/`GOOGLE_API_KEY`, `GRADIUM_API_KEY`, `INWORLD_API_KEY`, `MINIMAX_API_KEY`, `OPENAI_API_KEY`, `VYDRA_API_KEY`, `XAI_API_KEY`, `XIAOMI_API_KEY`).
- `apiKey` values fall back to env vars (`ELEVENLABS_API_KEY`/`XI_API_KEY`, `GEMINI_API_KEY`/`GOOGLE_API_KEY`, `GRADIUM_API_KEY`, `INWORLD_API_KEY`, `MINIMAX_API_KEY`, `OPENAI_API_KEY`, `VYDRA_API_KEY`, `XAI_API_KEY`, `XIAOMI_API_KEY`). Volcengine uses `appId`/`token` instead.
- `providers.elevenlabs.baseUrl`: override ElevenLabs API base URL.
- `providers.openai.baseUrl`: override the OpenAI TTS endpoint.
- Resolution order: `messages.tts.providers.openai.baseUrl` -> `OPENAI_TTS_BASE_URL` -> `https://api.openai.com/v1`
@@ -497,6 +531,21 @@ Then run:
- If `messages.tts.providers.google.apiKey` is omitted, TTS can reuse `models.providers.google.apiKey` before env fallback.
- `providers.gradium.baseUrl`: override Gradium API base URL (default `https://api.gradium.ai`).
- `providers.gradium.voiceId`: Gradium voice identifier (default Emma, `YTpq7expH9539ERJ`).
- `providers.volcengine.apiKey`: BytePlus Seed Speech API key (env:
`VOLCENGINE_TTS_API_KEY` or `BYTEPLUS_SEED_SPEECH_API_KEY`).
- `providers.volcengine.resourceId`: BytePlus Seed Speech resource id (default
`seed-tts-1.0`, env: `VOLCENGINE_TTS_RESOURCE_ID`; use `seed-tts-2.0` when
your BytePlus project has TTS 2.0 entitlement).
- `providers.volcengine.appKey`: BytePlus Seed Speech app key header (default
`aGjiRDfUWi`, env: `VOLCENGINE_TTS_APP_KEY`).
- `providers.volcengine.baseUrl`: override the Seed Speech TTS HTTP endpoint
(env: `VOLCENGINE_TTS_BASE_URL`).
- `providers.volcengine.appId`: legacy Volcengine Speech Console application id (env: `VOLCENGINE_TTS_APPID`).
- `providers.volcengine.token`: legacy Volcengine Speech Console access token (env: `VOLCENGINE_TTS_TOKEN`).
- `providers.volcengine.cluster`: legacy Volcengine TTS cluster (default `volcano_tts`, env: `VOLCENGINE_TTS_CLUSTER`).
- `providers.volcengine.voice`: voice type (default `en_female_anna_mars_bigtts`, env: `VOLCENGINE_TTS_VOICE`).
- `providers.volcengine.speedRatio`: provider-native speed ratio.
- `providers.volcengine.emotion`: provider-native emotion tag.
- `providers.xai.apiKey`: xAI TTS API key (env: `XAI_API_KEY`).
- `providers.xai.baseUrl`: override the xAI TTS base URL (default `https://api.x.ai/v1`, env: `XAI_BASE_URL`).
- `providers.xai.voiceId`: xAI voice id (default `eve`; current live voices: `ara`, `eve`, `leo`, `rex`, `sal`, `una`).
@@ -550,12 +599,13 @@ Here you go.
Available directive keys (when enabled):
- `provider` (registered speech provider id, for example `openai`, `elevenlabs`, `google`, `gradium`, `minimax`, `microsoft`, `vydra`, `xai`, or `xiaomi`; requires `allowProvider: true`)
- `voice` (OpenAI, Gradium, or Xiaomi voice), `voiceName` / `voice_name` / `google_voice` (Google voice), or `voiceId` (ElevenLabs / Gradium / MiniMax / xAI)
- `provider` (registered speech provider id, for example `openai`, `elevenlabs`, `google`, `gradium`, `minimax`, `microsoft`, `volcengine`, `vydra`, `xai`, or `xiaomi`; requires `allowProvider: true`)
- `voice` (OpenAI, Gradium, Volcengine, or Xiaomi voice), `voiceName` / `voice_name` / `google_voice` (Google voice), or `voiceId` (ElevenLabs / Gradium / MiniMax / xAI)
- `model` (OpenAI TTS model, ElevenLabs model id, MiniMax model, or Xiaomi MiMo TTS model) or `google_model` (Google TTS model)
- `stability`, `similarityBoost`, `style`, `speed`, `useSpeakerBoost`
- `vol` / `volume` (MiniMax volume, 0-10)
- `pitch` (MiniMax integer pitch, -12 to 12; fractional values are truncated before the MiniMax request)
- `emotion` (Volcengine emotion tag)
- `applyTextNormalization` (`auto|on|off`)
- `languageCode` (ISO 639-1)
- `seed`

View File

@@ -3,6 +3,7 @@ import { createProviderApiKeyAuthMethod } from "openclaw/plugin-sdk/provider-aut
import { ensureModelAllowlistEntry } from "openclaw/plugin-sdk/provider-onboard";
import { DOUBAO_CODING_MODEL_CATALOG, DOUBAO_MODEL_CATALOG } from "./models.js";
import { buildDoubaoCodingProvider, buildDoubaoProvider } from "./provider-catalog.js";
import { buildVolcengineSpeechProvider } from "./speech-provider.js";
const PROVIDER_ID = "volcengine";
const VOLCENGINE_DEFAULT_MODEL_REF = "volcengine-plan/ark-code-latest";
@@ -78,5 +79,6 @@ export default definePluginEntry({
return [...volcengineModels, ...volcenginePlanModels];
},
});
api.registerSpeechProvider(buildVolcengineSpeechProvider());
},
});

View File

@@ -4,7 +4,13 @@
"providerDiscoveryEntry": "./provider-discovery.ts",
"providers": ["volcengine", "volcengine-plan"],
"providerAuthEnvVars": {
"volcengine": ["VOLCANO_ENGINE_API_KEY"]
"volcengine": ["VOLCANO_ENGINE_API_KEY"],
"volcengine-tts": [
"VOLCENGINE_TTS_API_KEY",
"BYTEPLUS_SEED_SPEECH_API_KEY",
"VOLCENGINE_TTS_APPID",
"VOLCENGINE_TTS_TOKEN"
]
},
"providerAuthAliases": {
"volcengine-plan": "volcengine"
@@ -28,5 +34,8 @@
"type": "object",
"additionalProperties": false,
"properties": {}
},
"contracts": {
"speechProviders": ["volcengine"]
}
}

View File

@@ -0,0 +1,229 @@
import { normalizeResolvedSecretInputString } from "openclaw/plugin-sdk/secret-input";
import type {
SpeechDirectiveTokenParseContext,
SpeechProviderConfig,
SpeechProviderOverrides,
SpeechProviderPlugin,
} from "openclaw/plugin-sdk/speech-core";
import { asFiniteNumber, asObject, trimToUndefined } from "openclaw/plugin-sdk/speech-core";
import { volcengineTTS, type VolcengineTtsEncoding } from "./tts.js";
const DEFAULT_VOICE = "en_female_anna_mars_bigtts";
const DEFAULT_CLUSTER = "volcano_tts";
const DEFAULT_RESOURCE_ID = "seed-tts-1.0";
const DEFAULT_APP_KEY = "aGjiRDfUWi";
export const VOLCENGINE_VOICES: readonly string[] = [
"en_female_anna_mars_bigtts",
"en_male_adam_mars_bigtts",
"en_female_sarah_mars_bigtts",
"en_male_smith_mars_bigtts",
"zh_female_cancan_mars_bigtts",
"zh_female_qingxinnvsheng_mars_bigtts",
"zh_female_linjia_mars_bigtts",
"zh_male_wennuanahu_moon_bigtts",
"zh_male_shaonianzixin_moon_bigtts",
"zh_female_shuangkuaisisi_moon_bigtts",
];
type VolcengineTtsProviderConfig = {
apiKey?: string;
appId?: string;
token?: string;
voice: string;
cluster: string;
resourceId: string;
appKey: string;
baseUrl?: string;
speedRatio?: number;
emotion?: string;
};
type VolcengineTtsProviderOverrides = {
voice?: string;
speedRatio?: number;
emotion?: string;
};
function normalizeVolcengineProviderConfig(
rawConfig: Record<string, unknown>,
): VolcengineTtsProviderConfig {
const providers = asObject(rawConfig.providers);
const raw = asObject(providers?.volcengine) ?? asObject(rawConfig.volcengine);
return {
apiKey: normalizeResolvedSecretInputString({
value: raw?.apiKey,
path: "messages.tts.providers.volcengine.apiKey",
}),
appId: trimToUndefined(raw?.appId),
token: normalizeResolvedSecretInputString({
value: raw?.token,
path: "messages.tts.providers.volcengine.token",
}),
voice:
trimToUndefined(raw?.voice) ??
trimToUndefined(process.env.VOLCENGINE_TTS_VOICE) ??
DEFAULT_VOICE,
cluster:
trimToUndefined(raw?.cluster) ??
trimToUndefined(process.env.VOLCENGINE_TTS_CLUSTER) ??
DEFAULT_CLUSTER,
resourceId:
trimToUndefined(raw?.resourceId) ??
trimToUndefined(process.env.VOLCENGINE_TTS_RESOURCE_ID) ??
DEFAULT_RESOURCE_ID,
appKey:
trimToUndefined(raw?.appKey) ??
trimToUndefined(process.env.VOLCENGINE_TTS_APP_KEY) ??
DEFAULT_APP_KEY,
baseUrl: trimToUndefined(raw?.baseUrl) ?? trimToUndefined(process.env.VOLCENGINE_TTS_BASE_URL),
speedRatio: asFiniteNumber(raw?.speedRatio),
emotion: trimToUndefined(raw?.emotion),
};
}
function resolveSeedSpeechApiKey(configApiKey?: string): string | undefined {
return (
configApiKey ??
trimToUndefined(process.env.VOLCENGINE_TTS_API_KEY) ??
trimToUndefined(process.env.BYTEPLUS_SEED_SPEECH_API_KEY)
);
}
function readProviderConfig(config: SpeechProviderConfig): VolcengineTtsProviderConfig {
const normalized = normalizeVolcengineProviderConfig({});
return {
apiKey:
normalizeResolvedSecretInputString({
value: config.apiKey,
path: "messages.tts.providers.volcengine.apiKey",
}) ?? normalized.apiKey,
appId: trimToUndefined(config.appId) ?? normalized.appId,
token: trimToUndefined(config.token) ?? normalized.token,
voice: trimToUndefined(config.voice) ?? normalized.voice,
cluster: trimToUndefined(config.cluster) ?? normalized.cluster,
resourceId: trimToUndefined(config.resourceId) ?? normalized.resourceId,
appKey: trimToUndefined(config.appKey) ?? normalized.appKey,
baseUrl: trimToUndefined(config.baseUrl) ?? normalized.baseUrl,
speedRatio: asFiniteNumber(config.speedRatio) ?? normalized.speedRatio,
emotion: trimToUndefined(config.emotion) ?? normalized.emotion,
};
}
function readVolcengineOverrides(
overrides: SpeechProviderOverrides | undefined,
): VolcengineTtsProviderOverrides {
if (!overrides) {
return {};
}
return {
voice: trimToUndefined(overrides.voice),
speedRatio: asFiniteNumber(overrides.speedRatio),
emotion: trimToUndefined(overrides.emotion),
};
}
function parseDirectiveToken(ctx: SpeechDirectiveTokenParseContext): {
handled: boolean;
overrides?: SpeechProviderOverrides;
warnings?: string[];
} {
switch (ctx.key) {
case "voice":
case "volcengine_voice":
case "volcenginevoice":
if (!ctx.policy.allowVoice) {
return { handled: true };
}
return { handled: true, overrides: { ...ctx.currentOverrides, voice: ctx.value } };
case "speed":
case "speedratio":
case "speed_ratio": {
if (!ctx.policy.allowVoiceSettings) {
return { handled: true };
}
const speedRatio = Number(ctx.value);
if (!Number.isFinite(speedRatio) || speedRatio < 0.2 || speedRatio > 3.0) {
return { handled: true, warnings: [`invalid Volcengine speedRatio "${ctx.value}"`] };
}
return { handled: true, overrides: { ...ctx.currentOverrides, speedRatio } };
}
case "emotion":
if (!ctx.policy.allowVoiceSettings) {
return { handled: true };
}
return { handled: true, overrides: { ...ctx.currentOverrides, emotion: ctx.value } };
default:
return { handled: false };
}
}
export function buildVolcengineSpeechProvider(): SpeechProviderPlugin {
return {
id: "volcengine",
label: "Volcengine",
autoSelectOrder: 90,
aliases: ["bytedance", "doubao"],
voices: VOLCENGINE_VOICES,
resolveConfig: ({ rawConfig }) => normalizeVolcengineProviderConfig(rawConfig),
parseDirectiveToken,
listVoices: async () =>
VOLCENGINE_VOICES.map((v) => ({
id: v,
name: v.replace(/^(?:en|zh)_(female|male)_/, "").replace(/_.*$/, ""),
locale: v.startsWith("en_") ? "en-US" : "zh-CN",
gender: v.includes("_female_") ? "female" : "male",
})),
isConfigured: ({ providerConfig }) => {
const cfg = readProviderConfig(providerConfig);
return Boolean(
resolveSeedSpeechApiKey(cfg.apiKey) ||
((cfg.appId || process.env.VOLCENGINE_TTS_APPID) &&
(cfg.token || process.env.VOLCENGINE_TTS_TOKEN)),
);
},
synthesize: async (req) => {
const cfg = readProviderConfig(req.providerConfig);
const overrides = readVolcengineOverrides(req.providerOverrides);
const apiKey = resolveSeedSpeechApiKey(cfg.apiKey);
const appId = cfg.appId || process.env.VOLCENGINE_TTS_APPID;
const token = cfg.token || process.env.VOLCENGINE_TTS_TOKEN;
if (!apiKey && (!appId || !token)) {
throw new Error(
"Volcengine TTS credentials missing. Set VOLCENGINE_TTS_API_KEY, " +
"BYTEPLUS_SEED_SPEECH_API_KEY, or legacy VOLCENGINE_TTS_APPID and VOLCENGINE_TTS_TOKEN.",
);
}
const isVoiceNote = req.target === "voice-note";
const encoding: VolcengineTtsEncoding = isVoiceNote ? "ogg_opus" : "mp3";
const audioBuffer = await volcengineTTS({
text: req.text,
apiKey,
appId,
token,
voice: overrides.voice ?? cfg.voice,
cluster: cfg.cluster,
resourceId: cfg.resourceId,
appKey: cfg.appKey,
baseUrl: cfg.baseUrl,
speedRatio: overrides.speedRatio ?? cfg.speedRatio,
emotion: overrides.emotion ?? cfg.emotion,
encoding,
timeoutMs: req.timeoutMs,
});
return {
audioBuffer,
outputFormat: encoding === "ogg_opus" ? "opus" : "mp3",
fileExtension: encoding === "ogg_opus" ? ".opus" : ".mp3",
voiceCompatible: isVoiceNote,
};
},
};
}

View File

@@ -0,0 +1,30 @@
import { describe, expect, it } from "vitest";
import { volcengineTTS } from "./tts.js";
const seedSpeechApiKey =
process.env.VOLCENGINE_TTS_API_KEY ?? process.env.BYTEPLUS_SEED_SPEECH_API_KEY;
const hasVolcengineTtsCredentials = Boolean(
seedSpeechApiKey || (process.env.VOLCENGINE_TTS_APPID && process.env.VOLCENGINE_TTS_TOKEN),
);
const describeLive =
process.env.OPENCLAW_LIVE_TEST === "1" && hasVolcengineTtsCredentials ? describe : describe.skip;
describeLive("Volcengine TTS live", () => {
it("synthesizes mp3 audio with .profile credentials", async () => {
const audio = await volcengineTTS({
text: "OpenClaw live test.",
apiKey: seedSpeechApiKey,
appId: process.env.VOLCENGINE_TTS_APPID,
token: process.env.VOLCENGINE_TTS_TOKEN,
voice: process.env.VOLCENGINE_TTS_VOICE,
cluster: process.env.VOLCENGINE_TTS_CLUSTER,
resourceId: process.env.VOLCENGINE_TTS_RESOURCE_ID,
appKey: process.env.VOLCENGINE_TTS_APP_KEY,
baseUrl: process.env.VOLCENGINE_TTS_BASE_URL,
encoding: "mp3",
timeoutMs: 30_000,
});
expect(audio.length).toBeGreaterThan(128);
});
});

View File

@@ -0,0 +1,272 @@
import { beforeEach, describe, expect, it, vi } from "vitest";
import { buildVolcengineSpeechProvider } from "./speech-provider.js";
import { volcengineTTS } from "./tts.js";
const { fetchWithSsrFGuardMock } = vi.hoisted(() => ({
fetchWithSsrFGuardMock: vi.fn(),
}));
vi.mock("openclaw/plugin-sdk/ssrf-runtime", () => ({
fetchWithSsrFGuard: fetchWithSsrFGuardMock,
}));
function makeProviderConfig(overrides?: Record<string, unknown>) {
return {
apiKey: "test-api-key",
voice: "en_female_anna_mars_bigtts",
...overrides,
};
}
function makeLegacyProviderConfig(overrides?: Record<string, unknown>) {
return {
appId: "test-app-id",
token: "test-token",
voice: "zh_female_xiaohe_uranus_bigtts",
cluster: "volcano_tts",
...overrides,
};
}
function clearTtsEnv() {
delete process.env.BYTEPLUS_API_KEY;
delete process.env.BYTEPLUS_SEED_SPEECH_API_KEY;
delete process.env.VOLCENGINE_TTS_API_KEY;
delete process.env.VOLCENGINE_TTS_APPID;
delete process.env.VOLCENGINE_TTS_TOKEN;
}
describe("Volcengine speech provider", () => {
const provider = buildVolcengineSpeechProvider();
beforeEach(() => {
fetchWithSsrFGuardMock.mockReset();
});
it("has correct id, label, and aliases", () => {
expect(provider.id).toBe("volcengine");
expect(provider.label).toBe("Volcengine");
expect(provider.aliases).toContain("bytedance");
expect(provider.aliases).toContain("doubao");
});
it("reports configured when an API key is present in providerConfig", () => {
expect(provider.isConfigured({ providerConfig: makeProviderConfig(), timeoutMs: 30000 })).toBe(
true,
);
});
it("reports configured for legacy appId and token in providerConfig", () => {
expect(
provider.isConfigured({ providerConfig: makeLegacyProviderConfig(), timeoutMs: 30000 }),
).toBe(true);
});
it("reports not configured when credentials are missing", () => {
const oldBytePlusKey = process.env.BYTEPLUS_API_KEY;
const oldSeedKey = process.env.BYTEPLUS_SEED_SPEECH_API_KEY;
const oldApiKey = process.env.VOLCENGINE_TTS_API_KEY;
const oldAppId = process.env.VOLCENGINE_TTS_APPID;
const oldToken = process.env.VOLCENGINE_TTS_TOKEN;
clearTtsEnv();
try {
expect(provider.isConfigured({ providerConfig: {}, timeoutMs: 30000 })).toBe(false);
} finally {
if (oldBytePlusKey) {
process.env.BYTEPLUS_API_KEY = oldBytePlusKey;
}
if (oldSeedKey) {
process.env.BYTEPLUS_SEED_SPEECH_API_KEY = oldSeedKey;
}
if (oldApiKey) {
process.env.VOLCENGINE_TTS_API_KEY = oldApiKey;
}
if (oldAppId) {
process.env.VOLCENGINE_TTS_APPID = oldAppId;
}
if (oldToken) {
process.env.VOLCENGINE_TTS_TOKEN = oldToken;
}
}
});
it("falls back to env vars for credentials", () => {
const oldBytePlusKey = process.env.BYTEPLUS_API_KEY;
const oldSeedKey = process.env.BYTEPLUS_SEED_SPEECH_API_KEY;
const oldApiKey = process.env.VOLCENGINE_TTS_API_KEY;
const oldAppId = process.env.VOLCENGINE_TTS_APPID;
const oldToken = process.env.VOLCENGINE_TTS_TOKEN;
clearTtsEnv();
process.env.BYTEPLUS_SEED_SPEECH_API_KEY = "env-api-key";
try {
expect(provider.isConfigured({ providerConfig: {}, timeoutMs: 30000 })).toBe(true);
} finally {
if (oldBytePlusKey) {
process.env.BYTEPLUS_API_KEY = oldBytePlusKey;
}
if (oldSeedKey) {
process.env.BYTEPLUS_SEED_SPEECH_API_KEY = oldSeedKey;
} else {
delete process.env.BYTEPLUS_SEED_SPEECH_API_KEY;
}
if (oldApiKey) {
process.env.VOLCENGINE_TTS_API_KEY = oldApiKey;
}
if (oldAppId) {
process.env.VOLCENGINE_TTS_APPID = oldAppId;
} else {
delete process.env.VOLCENGINE_TTS_APPID;
}
if (oldToken) {
process.env.VOLCENGINE_TTS_TOKEN = oldToken;
} else {
delete process.env.VOLCENGINE_TTS_TOKEN;
}
}
});
it("lists voices with locale and gender", async () => {
const voices = await provider.listVoices!({});
expect(voices.length).toBeGreaterThan(0);
expect(voices[0]).toMatchObject({ locale: "en-US" });
expect(voices[0].gender).toBeDefined();
});
it("sends the documented Seed Speech API key payload and returns voice-note Opus metadata", async () => {
const release = vi.fn();
fetchWithSsrFGuardMock.mockResolvedValue({
response: new Response(
JSON.stringify({
code: 0,
data: Buffer.from("voice-audio").toString("base64"),
}),
),
release,
});
const result = await provider.synthesize({
text: "hello",
cfg: {},
providerConfig: makeProviderConfig({ emotion: "happy", speedRatio: 1.2 }),
target: "voice-note",
providerOverrides: { voice: "zh_male_aojiao_mars_bigtts", speedRatio: 0.9 },
timeoutMs: 1234,
});
expect(result.audioBuffer.toString()).toBe("voice-audio");
expect(result.outputFormat).toBe("opus");
expect(result.fileExtension).toBe(".opus");
expect(result.voiceCompatible).toBe(true);
const call = fetchWithSsrFGuardMock.mock.calls[0]?.[0];
expect(call).toMatchObject({
url: "https://voice.ap-southeast-1.bytepluses.com/api/v3/tts/unidirectional",
timeoutMs: 1234,
policy: { hostnameAllowlist: ["voice.ap-southeast-1.bytepluses.com"] },
auditContext: "volcengine.tts",
});
expect(call.init.headers["X-Api-Key"]).toBe("test-api-key");
expect(call.init.headers["X-Api-Resource-Id"]).toBe("seed-tts-1.0");
expect(call.init.headers["X-Api-App-Key"]).toBe("aGjiRDfUWi");
const body = JSON.parse(call.init.body);
expect(body.req_params).toMatchObject({
text: "hello",
speaker: "zh_male_aojiao_mars_bigtts",
speed_ratio: 0.9,
emotion: "happy",
audio_params: {
format: "ogg_opus",
sample_rate: 24000,
},
});
expect(release).toHaveBeenCalledTimes(1);
});
});
describe("volcengineTTS", () => {
beforeEach(() => {
fetchWithSsrFGuardMock.mockReset();
});
it("joins streamed Seed Speech audio frames", async () => {
const release = vi.fn();
fetchWithSsrFGuardMock.mockResolvedValue({
response: new Response(
[
JSON.stringify({ code: 0, message: "" }),
JSON.stringify({ code: 0, data: Buffer.from("audio-1").toString("base64") }),
JSON.stringify({ code: 0, data: Buffer.from("audio-2").toString("base64") }),
JSON.stringify({ code: 20000000, message: "ok", data: null }),
].join("\n"),
),
release,
});
const audio = await volcengineTTS({
text: "hello",
apiKey: "secret-api-key",
voice: "zh_female_xiaohe_uranus_bigtts",
encoding: "mp3",
timeoutMs: 1000,
});
expect(audio.toString()).toBe("audio-1audio-2");
expect(release).toHaveBeenCalledTimes(1);
});
it("reports Seed Speech provider errors without exposing credentials", async () => {
const release = vi.fn();
fetchWithSsrFGuardMock.mockResolvedValue({
response: new Response(
JSON.stringify({ header: { code: 45000000, message: "speaker permission denied" } }),
{ status: 403 },
),
release,
});
let error: unknown;
try {
await volcengineTTS({
text: "hello",
apiKey: "secret-api-key",
timeoutMs: 1000,
});
} catch (err) {
error = err;
}
expect(error).toBeInstanceOf(Error);
expect((error as Error).message).toBe(
"BytePlus Seed Speech TTS error 45000000: speaker permission denied",
);
expect((error as Error).message).not.toContain("secret-api-key");
expect(release).toHaveBeenCalledTimes(1);
});
it("reports provider errors without exposing credentials", async () => {
const release = vi.fn();
fetchWithSsrFGuardMock.mockResolvedValue({
response: new Response(JSON.stringify({ code: 3001, message: "load grant failed" }), {
status: 401,
}),
release,
});
let error: unknown;
try {
await volcengineTTS({
text: "hello",
appId: "app-id",
token: "secret-token",
timeoutMs: 1000,
});
} catch (err) {
error = err;
}
expect(error).toBeInstanceOf(Error);
expect((error as Error).message).toBe("Volcengine TTS error 3001: load grant failed");
expect((error as Error).message).not.toContain("secret-token");
expect(release).toHaveBeenCalledTimes(1);
});
});

View File

@@ -0,0 +1,266 @@
import * as crypto from "node:crypto";
import { fetchWithSsrFGuard } from "openclaw/plugin-sdk/ssrf-runtime";
export type VolcengineTtsEncoding = "ogg_opus" | "mp3" | "pcm" | "wav";
export type VolcengineTTSParams = {
text: string;
apiKey?: string;
appId?: string;
token?: string;
voice?: string;
cluster?: string;
resourceId?: string;
appKey?: string;
baseUrl?: string;
speedRatio?: number;
volumeRatio?: number;
pitchRatio?: number;
emotion?: string;
encoding?: VolcengineTtsEncoding;
timeoutMs?: number;
};
const DEFAULT_SEED_VOICE = "en_female_anna_mars_bigtts";
const DEFAULT_LEGACY_VOICE = "zh_female_xiaohe_uranus_bigtts";
const DEFAULT_CLUSTER = "volcano_tts";
const DEFAULT_SEED_TTS_RESOURCE_ID = "seed-tts-1.0";
const DEFAULT_SEED_TTS_APP_KEY = "aGjiRDfUWi";
const BYTEPLUS_SEED_TTS_URL =
"https://voice.ap-southeast-1.bytepluses.com/api/v3/tts/unidirectional";
const VOLCENGINE_LEGACY_TTS_URL = "https://openspeech.bytedance.com/api/v1/tts";
type VolcengineTtsResponse = {
code?: number;
message?: string;
data?: string;
};
function parseJsonObject(text: string, providerName: string): Record<string, unknown> {
try {
const parsed = JSON.parse(text) as unknown;
if (!parsed || typeof parsed !== "object" || Array.isArray(parsed)) {
throw new Error("expected JSON object");
}
return parsed as Record<string, unknown>;
} catch (err) {
const detail = err instanceof Error ? err.message : String(err);
throw new Error(`${providerName} TTS: failed to parse response JSON: ${detail}`, {
cause: err,
});
}
}
function toTtsResponse(parsed: Record<string, unknown>): VolcengineTtsResponse {
const header =
parsed.header && typeof parsed.header === "object" && !Array.isArray(parsed.header)
? (parsed.header as Record<string, unknown>)
: undefined;
return {
code:
typeof parsed.code === "number"
? parsed.code
: typeof header?.code === "number"
? header.code
: undefined,
message:
typeof parsed.message === "string"
? parsed.message
: typeof header?.message === "string"
? header.message
: undefined,
data: typeof parsed.data === "string" ? parsed.data : undefined,
};
}
function parseLegacyTtsResponse(text: string): VolcengineTtsResponse {
return toTtsResponse(parseJsonObject(text, "Volcengine"));
}
function parseSeedTtsFrames(text: string): VolcengineTtsResponse[] {
const trimmed = text.trim();
if (!trimmed) {
return [];
}
try {
return [toTtsResponse(parseJsonObject(trimmed, "BytePlus Seed Speech"))];
} catch {
// The HTTP API streams JSON frames; Response.text() preserves line breaks.
}
const frames: VolcengineTtsResponse[] = [];
for (const line of trimmed.split(/\r?\n/)) {
const item = line.trim();
if (!item) {
continue;
}
const json = item.startsWith("data:") ? item.slice("data:".length).trim() : item;
frames.push(toTtsResponse(parseJsonObject(json, "BytePlus Seed Speech")));
}
return frames;
}
function hostnameAllowlist(url: string): string[] {
return [new URL(url).hostname];
}
function seedAudioFormat(encoding: VolcengineTtsEncoding): "ogg_opus" | "mp3" | "pcm" {
return encoding === "wav" ? "pcm" : encoding;
}
async function seedSpeechTTS(params: VolcengineTTSParams & { apiKey: string }): Promise<Buffer> {
const {
text,
apiKey,
voice = DEFAULT_SEED_VOICE,
resourceId = DEFAULT_SEED_TTS_RESOURCE_ID,
appKey = DEFAULT_SEED_TTS_APP_KEY,
baseUrl = BYTEPLUS_SEED_TTS_URL,
speedRatio = 1.0,
emotion,
encoding = "ogg_opus",
timeoutMs = 30_000,
} = params;
const audioFormat = seedAudioFormat(encoding);
const payload = JSON.stringify({
user: { uid: "openclaw" },
req_params: {
text,
speaker: voice,
audio_params: {
format: audioFormat,
sample_rate: 24_000,
},
...(speedRatio !== 1.0 ? { speed_ratio: speedRatio } : {}),
...(emotion ? { emotion } : {}),
},
});
const { response, release } = await fetchWithSsrFGuard({
url: baseUrl,
init: {
method: "POST",
headers: {
"Content-Type": "application/json",
Connection: "keep-alive",
"X-Api-Key": apiKey,
"X-Api-Resource-Id": resourceId,
"X-Api-App-Key": appKey,
},
body: payload,
},
timeoutMs,
policy: { hostnameAllowlist: hostnameAllowlist(baseUrl) },
auditContext: "volcengine.tts",
});
try {
const frames = parseSeedTtsFrames(await response.text());
const chunks: Buffer[] = [];
for (const frame of frames) {
if (frame.code === 0) {
if (frame.data) {
chunks.push(Buffer.from(frame.data, "base64"));
}
continue;
}
if (frame.code === 20000000) {
continue;
}
throw new Error(
`BytePlus Seed Speech TTS error ${frame.code ?? response.status}: ${
frame.message ?? "unknown"
}`,
);
}
if (!response.ok || chunks.length === 0) {
throw new Error(`BytePlus Seed Speech TTS error ${response.status}: no audio data`);
}
return Buffer.concat(chunks);
} finally {
await release();
}
}
async function legacyVolcengineTTS(
params: VolcengineTTSParams & { appId: string; token: string },
): Promise<Buffer> {
const {
text,
appId,
token,
voice = DEFAULT_LEGACY_VOICE,
cluster = DEFAULT_CLUSTER,
baseUrl = VOLCENGINE_LEGACY_TTS_URL,
speedRatio = 1.0,
volumeRatio = 1.0,
pitchRatio = 1.0,
emotion,
encoding = "ogg_opus",
timeoutMs = 30_000,
} = params;
const payload = JSON.stringify({
app: { appid: appId, token, cluster },
user: { uid: "openclaw" },
audio: {
voice_type: voice,
encoding,
speed_ratio: speedRatio,
volume_ratio: volumeRatio,
pitch_ratio: pitchRatio,
...(emotion ? { emotion } : {}),
},
request: {
reqid: crypto.randomUUID(),
text,
text_type: "plain",
operation: "query",
},
});
const { response, release } = await fetchWithSsrFGuard({
url: baseUrl,
init: {
method: "POST",
headers: {
"Content-Type": "application/json",
Authorization: `Bearer;${token}`,
},
body: payload,
},
timeoutMs,
policy: { hostnameAllowlist: hostnameAllowlist(baseUrl) },
auditContext: "volcengine.tts",
});
try {
const body = parseLegacyTtsResponse(await response.text());
if (!response.ok || body.code !== 3000 || !body.data) {
throw new Error(
`Volcengine TTS error ${body.code ?? response.status}: ${body.message ?? "unknown"}`,
);
}
return Buffer.from(body.data, "base64");
} finally {
await release();
}
}
export async function volcengineTTS(params: VolcengineTTSParams): Promise<Buffer> {
if (params.apiKey) {
return seedSpeechTTS({ ...params, apiKey: params.apiKey });
}
if (params.appId && params.token) {
return legacyVolcengineTTS({ ...params, appId: params.appId, token: params.token });
}
throw new Error(
"Volcengine TTS credentials missing. Set a BytePlus Seed Speech API key or legacy AppID/token.",
);
}