---
summary: "Testing kit: unit/e2e/live suites, Docker runners, and what each test covers"
read_when:
  - Running tests locally or in CI
  - Adding regressions for model/provider bugs
  - Debugging gateway + agent behavior
title: "Testing"
---

OpenClaw has three Vitest suites (unit/integration, e2e, live) and a small set of Docker runners.

This doc is a "how we test" guide:

- What each suite covers (and what it deliberately does _not_ cover).
- Which commands to run for common workflows (local, pre-push, debugging).
- How live tests discover credentials and select models/providers.
- How to add regressions for real-world model/provider issues.

## Quick start

Most days:

- Full gate (expected before push): `pnpm build && pnpm check && pnpm check:test-types && pnpm test`
- Faster local full-suite run on a roomy machine: `pnpm test:max`
- Direct Vitest watch loop: `pnpm test:watch`
- Direct file targeting now routes extension/channel paths too: `pnpm test extensions/discord/src/monitor/message-handler.preflight.test.ts`
  - Prefer targeted runs first when you are iterating on a single failure.
- Docker-backed QA site: `pnpm qa:lab:up`
- Linux VM-backed QA lane: `pnpm openclaw qa suite --runner multipass --scenario channel-chat-baseline`

When you touch tests or want extra confidence:

- Coverage gate: `pnpm test:coverage`
- E2E suite: `pnpm test:e2e`

When debugging real providers/models (requires real creds):

- Live suite (models + gateway tool/image probes): `pnpm test:live`
- Target one live file quietly: `pnpm test:live -- src/agents/models.profiles.live.test.ts`
- Docker live model sweep: `pnpm test:docker:live-models`
  - Each selected model now runs a text turn plus a small file-read-style probe. Models whose metadata advertises `image` input also run a tiny image turn. Disable the extra probes with `OPENCLAW_LIVE_MODEL_FILE_PROBE=0` or `OPENCLAW_LIVE_MODEL_IMAGE_PROBE=0` when isolating provider failures.
  - CI coverage: daily `OpenClaw Scheduled Live And E2E Checks` and manual `OpenClaw Release Checks` both call the reusable live/E2E workflow with `include_live_suites: true`, which includes separate Docker live model matrix jobs sharded by provider.
  - For focused CI reruns, dispatch `OpenClaw Live And E2E Checks (Reusable)` with `include_live_suites: true` and `live_models_only: true`.
  - Add new high-signal provider secrets to `scripts/ci-hydrate-live-auth.sh` plus `.github/workflows/openclaw-live-and-e2e-checks-reusable.yml` and its scheduled/release callers.
- Native Codex bound-chat smoke: `pnpm test:docker:live-codex-bind`
  - Runs a Docker live lane against the Codex app-server path, binds a synthetic Slack DM with `/codex bind`, exercises `/codex fast` and `/codex permissions`, then verifies a plain reply and an image attachment route through the native plugin binding instead of ACP.
- Moonshot/Kimi cost smoke: with `MOONSHOT_API_KEY` set, run `openclaw models list --provider moonshot --json`, then run an isolated `openclaw agent --local --session-id live-kimi-cost --message 'Reply exactly: KIMI_LIVE_OK' --thinking off --json` against `moonshot/kimi-k2.6`. Verify the JSON reports Moonshot/K2.6 and the assistant transcript stores normalized `usage.cost`.

Tip: when you only need one failing case, prefer narrowing live tests via the allowlist env vars described below.

## QA-specific runners

These commands sit beside the main test suites when you need QA-lab realism:

CI runs QA Lab in dedicated workflows. `Parity gate` runs on matching PRs and from manual dispatch with mock providers. `QA-Lab - All Lanes` runs nightly on `main` and from manual dispatch with the mock parity gate, live Matrix lane, and Convex-managed live Telegram lane as parallel jobs. `OpenClaw Release Checks` runs the same lanes before release approval.

- `pnpm openclaw qa suite`
  - Runs repo-backed QA scenarios directly on the host.
  - Runs multiple selected scenarios in parallel by default with isolated gateway workers. `qa-channel` defaults to concurrency 4 (bounded by the selected scenario count). Use `--concurrency ` to tune the worker count, or `--concurrency 1` for the older serial lane.
  - Exits non-zero when any scenario fails. Use `--allow-failures` when you want artifacts without a failing exit code.
  - Supports provider modes `live-frontier`, `mock-openai`, and `aimock`. `aimock` starts a local AIMock-backed provider server for experimental fixture and protocol-mock coverage without replacing the scenario-aware `mock-openai` lane.
- `pnpm openclaw qa suite --runner multipass`
  - Runs the same QA suite inside a disposable Multipass Linux VM.
  - Keeps the same scenario-selection behavior as `qa suite` on the host.
  - Reuses the same provider/model selection flags as `qa suite`.
  - Live runs forward the supported QA auth inputs that are practical for the guest: env-based provider keys, the QA live provider config path, and `CODEX_HOME` when present.
  - Output dirs must stay under the repo root so the guest can write back through the mounted workspace.
  - Writes the normal QA report + summary plus Multipass logs under `.artifacts/qa-e2e/...`.
- `pnpm qa:lab:up`
  - Starts the Docker-backed QA site for operator-style QA work.
- `pnpm test:docker:npm-onboard-channel-agent`
  - Builds an npm tarball from the current checkout, installs it globally in Docker, runs non-interactive OpenAI API-key onboarding, configures Telegram by default, verifies enabling the plugin installs runtime dependencies on demand, runs doctor, and runs one local agent turn against a mocked OpenAI endpoint.
  - Use `OPENCLAW_NPM_ONBOARD_CHANNEL=discord` to run the same packaged-install lane with Discord.
- `pnpm test:docker:npm-telegram-live`
  - Installs a published OpenClaw package in Docker, runs installed-package onboarding, configures Telegram through the installed CLI, then reuses the live Telegram QA lane with that installed package as the SUT Gateway.
  - Defaults to `OPENCLAW_NPM_TELEGRAM_PACKAGE_SPEC=openclaw@beta`.
  - Uses the same Telegram env credentials or Convex credential source as `pnpm openclaw qa telegram`. For CI/release automation, set `OPENCLAW_NPM_TELEGRAM_CREDENTIAL_SOURCE=convex` plus `OPENCLAW_QA_CONVEX_SITE_URL` and the role secret. If `OPENCLAW_QA_CONVEX_SITE_URL` and a Convex role secret are present in CI, the Docker wrapper selects Convex automatically.
  - `OPENCLAW_NPM_TELEGRAM_CREDENTIAL_ROLE=ci|maintainer` overrides the shared `OPENCLAW_QA_CREDENTIAL_ROLE` for this lane only.
- `pnpm test:docker:bundled-channel-deps`
  - Packs and installs the current OpenClaw build in Docker, starts the Gateway with OpenAI configured, then enables bundled channel/plugins via config edits.
  - Verifies setup discovery leaves unconfigured plugin runtime dependencies absent, the first configured Gateway or doctor run installs each bundled plugin's runtime dependencies on demand, and a second restart does not reinstall dependencies that were already activated.
  - Also installs a known older npm baseline, enables Telegram before running `openclaw update --tag `, and verifies the candidate's post-update doctor repairs bundled channel runtime dependencies without a harness-side postinstall repair.
- `pnpm openclaw qa aimock`
  - Starts only the local AIMock provider server for direct protocol smoke testing.
- `pnpm openclaw qa matrix`
  - Runs the Matrix live QA lane against a disposable Docker-backed Tuwunel homeserver.
  - This QA host is repo/dev-only today. Packaged OpenClaw installs do not ship `qa-lab`, so they do not expose `openclaw qa`.
  - Repo checkouts load the bundled runner directly; no separate plugin install step is needed.
  - Provisions three temporary Matrix users (`driver`, `sut`, `observer`) plus one private room, then starts a QA gateway child with the real Matrix plugin as the SUT transport.
  - Uses the pinned stable Tuwunel image `ghcr.io/matrix-construct/tuwunel:v1.5.1` by default. Override with `OPENCLAW_QA_MATRIX_TUWUNEL_IMAGE` when you need to test a different image.
  - Matrix does not expose shared credential-source flags because the lane provisions disposable users locally.
  - Writes a Matrix QA report, summary, observed-events artifact, and combined stdout/stderr output log under `.artifacts/qa-e2e/...`.
- `pnpm openclaw qa telegram`
  - Runs the Telegram live QA lane against a real private group using the driver and SUT bot tokens from env.
  - Requires `OPENCLAW_QA_TELEGRAM_GROUP_ID`, `OPENCLAW_QA_TELEGRAM_DRIVER_BOT_TOKEN`, and `OPENCLAW_QA_TELEGRAM_SUT_BOT_TOKEN`. The group id must be the numeric Telegram chat id.
  - Supports `--credential-source convex` for shared pooled credentials. Use env mode by default, or set `OPENCLAW_QA_CREDENTIAL_SOURCE=convex` to opt into pooled leases.
  - Exits non-zero when any scenario fails. Use `--allow-failures` when you want artifacts without a failing exit code.
  - Requires two distinct bots in the same private group, with the SUT bot exposing a Telegram username.
  - For stable bot-to-bot observation, enable Bot-to-Bot Communication Mode in `@BotFather` for both bots and ensure the driver bot can observe group bot traffic.
  - Writes a Telegram QA report, summary, and observed-messages artifact under `.artifacts/qa-e2e/...`. Replying scenarios include RTT from driver send request to observed SUT reply.

Live transport lanes share one standard contract so new transports do not drift:

`qa-channel` remains the broad synthetic QA suite and is not part of the live transport coverage matrix.

| Lane     | Canary | Mention gating | Allowlist block | Top-level reply | Restart resume | Thread follow-up | Thread isolation | Reaction observation | Help command |
| -------- | ------ | -------------- | --------------- | --------------- | -------------- | ---------------- | ---------------- | -------------------- | ------------ |
| Matrix   | x      | x              | x               | x               | x              | x                | x                | x                    |              |
| Telegram | x      |                |                 |                 |                |                  |                  |                      | x            |

### Shared Telegram credentials via Convex (v1)

When `--credential-source convex` (or `OPENCLAW_QA_CREDENTIAL_SOURCE=convex`) is enabled for `openclaw qa telegram`, QA lab acquires an exclusive lease from a Convex-backed pool, heartbeats that lease while the lane is running, and releases the lease on shutdown.

Reference Convex project scaffold:

- `qa/convex-credential-broker/`

Required env vars:

- `OPENCLAW_QA_CONVEX_SITE_URL` (for example `https://your-deployment.convex.site`)
- One secret for the selected role:
  - `OPENCLAW_QA_CONVEX_SECRET_MAINTAINER` for `maintainer`
  - `OPENCLAW_QA_CONVEX_SECRET_CI` for `ci`
- Credential role selection:
  - CLI: `--credential-role maintainer|ci`
  - Env default: `OPENCLAW_QA_CREDENTIAL_ROLE` (defaults to `ci` in CI, `maintainer` otherwise)

Optional env vars:

- `OPENCLAW_QA_CREDENTIAL_LEASE_TTL_MS` (default `1200000`)
- `OPENCLAW_QA_CREDENTIAL_HEARTBEAT_INTERVAL_MS` (default `30000`)
- `OPENCLAW_QA_CREDENTIAL_ACQUIRE_TIMEOUT_MS` (default `90000`)
- `OPENCLAW_QA_CREDENTIAL_HTTP_TIMEOUT_MS` (default `15000`)
- `OPENCLAW_QA_CONVEX_ENDPOINT_PREFIX` (default `/qa-credentials/v1`)
- `OPENCLAW_QA_CREDENTIAL_OWNER_ID` (optional trace id)
- `OPENCLAW_QA_ALLOW_INSECURE_HTTP=1` allows loopback `http://` Convex URLs for local-only development.

`OPENCLAW_QA_CONVEX_SITE_URL` should use `https://` in normal operation.

Maintainer admin commands (pool add/remove/list) require `OPENCLAW_QA_CONVEX_SECRET_MAINTAINER` specifically.

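
Before opting into pooled leases it can help to sanity-check the environment. The following is a minimal sketch of such a preflight, not the check that `qa-lab` itself performs: the `qa_convex_preflight` function is hypothetical, it assumes "in CI" means a non-empty `CI` variable, and it assumes "loopback" means `localhost` or `127.0.0.1`. It only encodes the role default, role secret, and URL scheme rules listed above.

```bash
# Hypothetical preflight helper; not part of the repo.
qa_convex_preflight() {
  local role url secret

  # Role default: `ci` in CI, `maintainer` otherwise.
  # Assumption: CI is detected through a non-empty CI env var.
  role="${OPENCLAW_QA_CREDENTIAL_ROLE:-}"
  if [ -z "$role" ]; then
    if [ -n "${CI:-}" ]; then role="ci"; else role="maintainer"; fi
  fi

  # URL scheme: https:// normally, loopback http:// only with the opt-in flag.
  # Assumption: loopback means localhost or 127.0.0.1.
  url="${OPENCLAW_QA_CONVEX_SITE_URL:-}"
  case "$url" in
    https://*) ;;
    http://localhost|http://localhost[/:]*|http://127.0.0.1|http://127.0.0.1[/:]*)
      if [ "${OPENCLAW_QA_ALLOW_INSECURE_HTTP:-}" != "1" ]; then
        echo "loopback http:// URL needs OPENCLAW_QA_ALLOW_INSECURE_HTTP=1" >&2
        return 1
      fi
      ;;
    *)
      echo "OPENCLAW_QA_CONVEX_SITE_URL must be an https:// URL" >&2
      return 1
      ;;
  esac

  # One secret for the selected role.
  case "$role" in
    ci) secret="${OPENCLAW_QA_CONVEX_SECRET_CI:-}" ;;
    maintainer) secret="${OPENCLAW_QA_CONVEX_SECRET_MAINTAINER:-}" ;;
    *) echo "unknown credential role: $role" >&2; return 1 ;;
  esac
  if [ -z "$secret" ]; then
    echo "missing Convex secret for role: $role" >&2
    return 1
  fi

  # Print the resolved role so callers can log or reuse it.
  echo "$role"
}
```

A wrapper script could call `qa_convex_preflight` before `pnpm openclaw qa telegram --credential-source convex` and stop early when it returns non-zero.
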
CLI helpers for maintainers:

```bash
pnpm openclaw qa credentials add --kind telegram --payload-file qa/telegram-credential.json
pnpm openclaw qa credentials list --kind telegram
pnpm openclaw qa credentials remove --credential-id
```

Use `--json` for machine-readable output in scripts and CI utilities.

Default endpoint contract (`OPENCLAW_QA_CONVEX_SITE_URL` + `/qa-credentials/v1`):

- `POST /acquire`
  - Request: `{ kind, ownerId, actorRole, leaseTtlMs, heartbeatIntervalMs }`
  - Success: `{ status: "ok", credentialId, leaseToken, payload, leaseTtlMs?, heartbeatIntervalMs? }`
  - Exhausted/retryable: `{ status: "error", code: "POOL_EXHAUSTED" | "NO_CREDENTIAL_AVAILABLE", ... }`
- `POST /heartbeat`
  - Request: `{ kind, ownerId, actorRole, credentialId, leaseToken, leaseTtlMs }`
  - Success: `{ status: "ok" }` (or empty `2xx`)
- `POST /release`
  - Request: `{ kind, ownerId, actorRole, credentialId, leaseToken }`
  - Success: `{ status: "ok" }` (or empty `2xx`)
- `POST /admin/add` (maintainer secret only)
  - Request: `{ kind, actorId, payload, note?, status? }`
  - Success: `{ status: "ok", credential }`
- `POST /admin/remove` (maintainer secret only)
  - Request: `{ credentialId, actorId }`
  - Success: `{ status: "ok", changed, credential }`
  - Active lease guard: `{ status: "error", code: "LEASE_ACTIVE", ... }`
- `POST /admin/list` (maintainer secret only)
  - Request: `{ kind?, status?, includePayload?, limit? }`
  - Success: `{ status: "ok", credentials, count }`

Payload shape for Telegram kind:

- `{ groupId: string, driverToken: string, sutToken: string }`
- `groupId` must be a numeric Telegram chat id string.
- `admin/add` validates this shape for `kind: "telegram"` and rejects malformed payloads.

### Adding a channel to QA

Adding a channel to the markdown QA system requires exactly two things:

1. A transport adapter for the channel.
2. A scenario pack that exercises the channel contract.

Do not add a new top-level QA command root when the shared `qa-lab` host can own the flow.

`qa-lab` owns the shared host mechanics:

- the `openclaw qa` command root
- suite startup and teardown
- worker concurrency
- artifact writing
- report generation
- scenario execution
- compatibility aliases for older `qa-channel` scenarios

Runner plugins own the transport contract:

- how `openclaw qa ` is mounted beneath the shared `qa` root
- how the gateway is configured for that transport
- how readiness is checked
- how inbound events are injected
- how outbound messages are observed
- how transcripts and normalized transport state are exposed
- how transport-backed actions are executed
- how transport-specific reset or cleanup is handled

The minimum adoption bar for a new channel is:

1. Keep `qa-lab` as the owner of the shared `qa` root.
2. Implement the transport runner on the shared `qa-lab` host seam.
3. Keep transport-specific mechanics inside the runner plugin or channel harness.
4. Mount the runner as `openclaw qa ` instead of registering a competing root command.
   Runner plugins should declare `qaRunners` in `openclaw.plugin.json` and export a matching `qaRunnerCliRegistrations` array from `runtime-api.ts`.
   Keep `runtime-api.ts` light; lazy CLI and runner execution should stay behind separate entrypoints.
5. Author or adapt markdown scenarios under the themed `qa/scenarios/` directories.
6. Use the generic scenario helpers for new scenarios.
7. Keep existing compatibility aliases working unless the repo is doing an intentional migration.

The decision rule is strict:

- If behavior can be expressed once in `qa-lab`, put it in `qa-lab`.
- If behavior depends on one channel transport, keep it in that runner plugin or plugin harness.
- If a scenario needs a new capability that more than one channel can use, add a generic helper instead of a channel-specific branch in `suite.ts`.
- If a behavior is only meaningful for one transport, keep the scenario transport-specific and make that explicit in the scenario contract.

Preferred generic helper names for new scenarios are:

- `waitForTransportReady`
- `waitForChannelReady`
- `injectInboundMessage`
- `injectOutboundMessage`
- `waitForTransportOutboundMessage`
- `waitForChannelOutboundMessage`
- `waitForNoTransportOutbound`
- `getTransportSnapshot`
- `readTransportMessage`
- `readTransportTranscript`
- `formatTransportTranscript`
- `resetTransport`

Compatibility aliases remain available for existing scenarios, including:

- `waitForQaChannelReady`
- `waitForOutboundMessage`
- `waitForNoOutbound`
- `formatConversationTranscript`
- `resetBus`

New channel work should use the generic helper names. Compatibility aliases exist to avoid a flag day migration, not as the model for new scenario authoring.

## Test suites (what runs where)

Think of the suites as “increasing realism” (and increasing flakiness/cost):

### Unit / integration (default)

- Command: `pnpm test`
- Config: untargeted runs use the `vitest.full-*.config.ts` shard set and may expand multi-project shards into per-project configs for parallel scheduling
- Files: core/unit inventories under `src/**/*.test.ts`, `packages/**/*.test.ts`, `test/**/*.test.ts`, and the whitelisted `ui` node tests covered by `vitest.unit.config.ts`
- Scope:
  - Pure unit tests
  - In-process integration tests (gateway auth, routing, tooling, parsing, config)
  - Deterministic regressions for known bugs
- Expectations:
  - Runs in CI
  - No real keys required
  - Should be fast and stable
- Untargeted `pnpm test` runs twelve smaller shard configs (`core-unit-fast`, `core-unit-src`, `core-unit-security`, `core-unit-ui`, `core-unit-support`, `core-support-boundary`, `core-contracts`, `core-bundled`, `core-runtime`, `agentic`, `auto-reply`, `extensions`) instead of one giant native root-project process. This cuts peak RSS on loaded machines and avoids auto-reply/extension work starving unrelated suites.
- `pnpm test --watch` still uses the native root `vitest.config.ts` project graph, because a multi-shard watch loop is not practical.
- `pnpm test`, `pnpm test:watch`, and `pnpm test:perf:imports` route explicit file/directory targets through scoped lanes first, so `pnpm test extensions/discord/src/monitor/message-handler.preflight.test.ts` avoids paying the full root project startup tax.
- `pnpm test:changed` expands changed git paths into the same scoped lanes when the diff only touches routable source/test files; config/setup edits still fall back to the broad root-project rerun.
- `pnpm check:changed` is the normal smart local gate for narrow work. It classifies the diff into core, core tests, extensions, extension tests, apps, docs, release metadata, and tooling, then runs the matching typecheck/lint/test lanes. Public Plugin SDK and plugin-contract changes include one extension validation pass because extensions depend on those core contracts. Release metadata-only version bumps run targeted version/config/root-dependency checks instead of the full suite, with a guard that rejects package changes outside the top-level version field.
- Import-light unit tests from agents, commands, plugins, auto-reply helpers, `plugin-sdk`, and similar pure utility areas route through the `unit-fast` lane, which skips `test/setup-openclaw-runtime.ts`; stateful/runtime-heavy files stay on the existing lanes.
- Selected `plugin-sdk` and `commands` helper source files also map changed-mode runs to explicit sibling tests in those light lanes, so helper edits avoid rerunning the full heavy suite for that directory.
- `auto-reply` has three dedicated buckets: top-level core helpers, top-level `reply.*` integration tests, and the `src/auto-reply/reply/**` subtree. This keeps the heaviest reply harness work off the cheap status/chunk/token tests.
- When you change message-tool discovery inputs or compaction runtime context, keep both levels of coverage.
- Add focused helper regressions for pure routing and normalization boundaries.
- Keep the embedded runner integration suites healthy: `src/agents/pi-embedded-runner/compact.hooks.test.ts`, `src/agents/pi-embedded-runner/run.overflow-compaction.test.ts`, and `src/agents/pi-embedded-runner/run.overflow-compaction.loop.test.ts`.
- Those suites verify that scoped ids and compaction behavior still flow through the real `run.ts` / `compact.ts` paths; helper-only tests are not a sufficient substitute for those integration paths.
- Base Vitest config defaults to `threads`.
- The shared Vitest config fixes `isolate: false` and uses the non-isolated runner across the root projects, e2e, and live configs.
- The root UI lane keeps its `jsdom` setup and optimizer, but runs on the shared non-isolated runner too.
- Each `pnpm test` shard inherits the same `threads` + `isolate: false` defaults from the shared Vitest config.
- `scripts/run-vitest.mjs` adds `--no-maglev` for Vitest child Node processes by default to reduce V8 compile churn during big local runs. Set `OPENCLAW_VITEST_ENABLE_MAGLEV=1` to compare against stock V8 behavior.
- `pnpm changed:lanes` shows which architectural lanes a diff triggers.
- The pre-commit hook is formatting-only. It restages formatted files and does not run lint, typecheck, or tests.
- Run `pnpm check:changed` explicitly before handoff or push when you need the smart local gate. Public Plugin SDK and plugin-contract changes include one extension validation pass.
- `pnpm test:changed` routes through scoped lanes when the changed paths map cleanly to a smaller suite.
- `pnpm test:max` and `pnpm test:changed:max` keep the same routing behavior, just with a higher worker cap.
- Local worker auto-scaling is intentionally conservative and backs off when the host load average is already high, so multiple concurrent Vitest runs do less damage by default.
- The base Vitest config marks the projects/config files as `forceRerunTriggers` so changed-mode reruns stay correct when test wiring changes.
- The config keeps `OPENCLAW_VITEST_FS_MODULE_CACHE` enabled on supported hosts; set `OPENCLAW_VITEST_FS_MODULE_CACHE_PATH=/abs/path` if you want one explicit cache location for direct profiling.
- `pnpm test:perf:imports` enables Vitest import-duration reporting plus import-breakdown output.
- `pnpm test:perf:imports:changed` scopes the same profiling view to files changed since `origin/main`.
- When one hot test still spends most of its time in startup imports, keep heavy dependencies behind a narrow local `*.runtime.ts` seam and mock that seam directly instead of deep-importing runtime helpers just to pass them through `vi.mock(...)`.
- `pnpm test:perf:changed:bench -- --ref ` compares routed `test:changed` against the native root-project path for that committed diff and prints wall time plus macOS max RSS.
- `pnpm test:perf:changed:bench -- --worktree` benchmarks the current dirty tree by routing the changed file list through `scripts/test-projects.mjs` and the root Vitest config.
- `pnpm test:perf:profile:main` writes a main-thread CPU profile for Vitest/Vite startup and transform overhead.
- `pnpm test:perf:profile:runner` writes runner CPU+heap profiles for the unit suite with file parallelism disabled.

### Stability (gateway)

- Command: `pnpm test:stability:gateway`
- Config: `vitest.gateway.config.ts`, forced to one worker
- Scope:
  - Starts a real loopback Gateway with diagnostics enabled by default
  - Drives synthetic gateway message, memory, and large-payload churn through the diagnostic event path
  - Queries `diagnostics.stability` over the Gateway WS RPC
  - Covers diagnostic stability bundle persistence helpers
  - Asserts the recorder remains bounded, synthetic RSS samples stay under the pressure budget, and per-session queue depths drain back to zero
- Expectations:
  - CI-safe and keyless
  - Narrow lane for stability-regression follow-up, not a substitute for the full Gateway suite

### E2E (gateway smoke)

- Command: `pnpm test:e2e`
- Config: `vitest.e2e.config.ts`
- Files: `src/**/*.e2e.test.ts`, `test/**/*.e2e.test.ts`, and bundled-plugin E2E tests under `extensions/`
- Runtime defaults:
  - Uses Vitest `threads` with `isolate: false`, matching the rest of the repo.
  - Uses adaptive workers (CI: up to 2, local: 1 by default).
  - Runs in silent mode by default to reduce console I/O overhead.
- Useful overrides:
  - `OPENCLAW_E2E_WORKERS=` to force worker count (capped at 16).
  - `OPENCLAW_E2E_VERBOSE=1` to re-enable verbose console output.
- Scope:
  - Multi-instance gateway end-to-end behavior
  - WebSocket/HTTP surfaces, node pairing, and heavier networking
- Expectations:
  - Runs in CI (when enabled in the pipeline)
  - No real keys required
  - More moving parts than unit tests (can be slower)

### E2E: OpenShell backend smoke

- Command: `pnpm test:e2e:openshell`
- File: `extensions/openshell/src/backend.e2e.test.ts`
- Scope:
  - Starts an isolated OpenShell gateway on the host via Docker
  - Creates a sandbox from a temporary local Dockerfile
  - Exercises OpenClaw's OpenShell backend over real `sandbox ssh-config` + SSH exec
  - Verifies remote-canonical filesystem behavior through the sandbox fs bridge
- Expectations:
  - Opt-in only; not part of the default `pnpm test:e2e` run
  - Requires a local `openshell` CLI plus a working Docker daemon
  - Uses isolated `HOME` / `XDG_CONFIG_HOME`, then destroys the test gateway and sandbox
- Useful overrides:
  - `OPENCLAW_E2E_OPENSHELL=1` to enable the test when running the broader e2e suite manually
  - `OPENCLAW_E2E_OPENSHELL_COMMAND=/path/to/openshell` to point at a non-default CLI binary or wrapper script

### Live (real providers + real models)

- Command: `pnpm test:live`
- Config: `vitest.live.config.ts`
- Files: `src/**/*.live.test.ts`, `test/**/*.live.test.ts`, and bundled-plugin live tests under `extensions/`
- Default: **enabled** by `pnpm test:live` (sets `OPENCLAW_LIVE_TEST=1`)
- Scope:
  - “Does this provider/model actually work _today_ with real creds?”
  - Catch provider format changes, tool-calling quirks, auth issues, and rate limit behavior
- Expectations:
  - Not CI-stable by design (real networks, real provider policies, quotas, outages)
  - Costs money / uses rate limits
  - Prefer running narrowed subsets instead of “everything”
- Live runs source `~/.profile` to pick up missing API keys.
- By default, live runs still isolate `HOME` and copy config/auth material into a temp test home so unit fixtures cannot mutate your real `~/.openclaw`.
- Set `OPENCLAW_LIVE_USE_REAL_HOME=1` only when you intentionally need live tests to use your real home directory.
- `pnpm test:live` now defaults to a quieter mode: it keeps `[live] ...` progress output, but suppresses the extra `~/.profile` notice and mutes gateway bootstrap logs/Bonjour chatter. Set `OPENCLAW_LIVE_TEST_QUIET=0` if you want the full startup logs back.
- API key rotation (provider-specific): set `*_API_KEYS` with comma/semicolon format or `*_API_KEY_1`, `*_API_KEY_2` (for example `OPENAI_API_KEYS`, `ANTHROPIC_API_KEYS`, `GEMINI_API_KEYS`) or per-live override via `OPENCLAW_LIVE_*_KEY`; tests retry on rate limit responses.
- Progress/heartbeat output:
  - Live suites now emit progress lines to stderr so long provider calls are visibly active even when Vitest console capture is quiet.
  - `vitest.live.config.ts` disables Vitest console interception so provider/gateway progress lines stream immediately during live runs.
  - Tune direct-model heartbeats with `OPENCLAW_LIVE_HEARTBEAT_MS`.
  - Tune gateway/probe heartbeats with `OPENCLAW_LIVE_GATEWAY_HEARTBEAT_MS`.

## Which suite should I run?

Use this decision table:

- Editing logic/tests: run `pnpm test` (and `pnpm test:coverage` if you changed a lot)
- Touching gateway networking / WS protocol / pairing: add `pnpm test:e2e`
- Debugging “my bot is down” / provider-specific failures / tool calling: run a narrowed `pnpm test:live`

## Live (network-touching) tests

For the live model matrix, CLI backend smokes, ACP smokes, Codex app-server harness, and all media-provider live tests (Deepgram, BytePlus, ComfyUI, image, music, video, media harness), plus credential handling for live runs, see [Testing — live suites](/help/testing-live).

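
The `*_API_KEYS` rotation format mentioned under the live suite accepts comma- or semicolon-separated values. As a rough sketch of that format only (the `split_api_keys` helper is hypothetical, and whitespace trimming is an assumption; the real live-test loader may normalize or order keys differently), a shell function can turn such a list into one key per line:

```bash
# Hypothetical helper, not part of the repo: split a comma/semicolon
# separated key list into one key per line.
# Assumption: surrounding whitespace is trimmed and empty entries are dropped.
split_api_keys() {
  printf '%s\n' "$1" \
    | tr ',;' '\n\n' \
    | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d'
}

# Example with placeholder values (never commit real keys):
#   split_api_keys "key-one, key-two;key-three"
```

This is handy for checking how many keys a rotation variable such as `OPENAI_API_KEYS` actually carries before starting a narrowed live run.
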
## Docker runners (optional "works in Linux" checks)

These Docker runners split into two buckets:

- Live-model runners: `test:docker:live-models` and `test:docker:live-gateway` run only their matching profile-key live file inside the repo Docker image (`src/agents/models.profiles.live.test.ts` and `src/gateway/gateway-models.profiles.live.test.ts`), mounting your local config dir and workspace (and sourcing `~/.profile` if mounted). The matching local entrypoints are `test:live:models-profiles` and `test:live:gateway-profiles`.
- Docker live runners default to a smaller smoke cap so a full Docker sweep stays practical: `test:docker:live-models` defaults to `OPENCLAW_LIVE_MAX_MODELS=12`, and `test:docker:live-gateway` defaults to `OPENCLAW_LIVE_GATEWAY_SMOKE=1`, `OPENCLAW_LIVE_GATEWAY_MAX_MODELS=8`, `OPENCLAW_LIVE_GATEWAY_STEP_TIMEOUT_MS=45000`, and `OPENCLAW_LIVE_GATEWAY_MODEL_TIMEOUT_MS=90000`. Override those env vars when you explicitly want the larger exhaustive scan.
- `test:docker:all` builds the live Docker image once via `test:docker:live-build`, then reuses it for the two live Docker lanes. It also builds one shared `scripts/e2e/Dockerfile` image via `test:docker:e2e-build` and reuses it for the E2E container smoke runners that exercise the built app.
- Container smoke runners: `test:docker:openwebui`, `test:docker:onboard`, `test:docker:npm-onboard-channel-agent`, `test:docker:gateway-network`, `test:docker:mcp-channels`, `test:docker:pi-bundle-mcp-tools`, `test:docker:cron-mcp-cleanup`, `test:docker:plugins`, `test:docker:plugin-update`, and `test:docker:config-reload` boot one or more real containers and verify higher-level integration paths.

The live-model Docker runners also bind-mount only the needed CLI auth homes (or all supported ones when the run is not narrowed), then copy them into the container home before the run so external-CLI OAuth can refresh tokens without mutating the host auth store:

- Direct models: `pnpm test:docker:live-models` (script: `scripts/test-live-models-docker.sh`)
- ACP bind smoke: `pnpm test:docker:live-acp-bind` (script: `scripts/test-live-acp-bind-docker.sh`)
- CLI backend smoke: `pnpm test:docker:live-cli-backend` (script: `scripts/test-live-cli-backend-docker.sh`)
- Codex app-server harness smoke: `pnpm test:docker:live-codex-harness` (script: `scripts/test-live-codex-harness-docker.sh`)
- Gateway + dev agent: `pnpm test:docker:live-gateway` (script: `scripts/test-live-gateway-models-docker.sh`)
- Open WebUI live smoke: `pnpm test:docker:openwebui` (script: `scripts/e2e/openwebui-docker.sh`)
- Onboarding wizard (TTY, full scaffolding): `pnpm test:docker:onboard` (script: `scripts/e2e/onboard-docker.sh`)
- Npm tarball onboarding/channel/agent smoke: `pnpm test:docker:npm-onboard-channel-agent` installs the packed OpenClaw tarball globally in Docker, configures OpenAI via env-ref onboarding plus Telegram by default, verifies doctor repairs activated plugin runtime deps, and runs one mocked OpenAI agent turn. Reuse a prebuilt tarball with `OPENCLAW_NPM_ONBOARD_PACKAGE_TGZ=/path/to/openclaw-*.tgz`, skip the host rebuild with `OPENCLAW_NPM_ONBOARD_HOST_BUILD=0`, or switch channel with `OPENCLAW_NPM_ONBOARD_CHANNEL=discord`.
- Bun global install smoke: `bash scripts/e2e/bun-global-install-smoke.sh` packs the current tree, installs it with `bun install -g` in an isolated home, and verifies `openclaw infer image providers --json` returns bundled image providers instead of hanging.
  Reuse a prebuilt tarball with `OPENCLAW_BUN_GLOBAL_SMOKE_PACKAGE_TGZ=/path/to/openclaw-*.tgz`, skip the host build with `OPENCLAW_BUN_GLOBAL_SMOKE_HOST_BUILD=0`, or copy `dist/` from a built Docker image with `OPENCLAW_BUN_GLOBAL_SMOKE_DIST_IMAGE=openclaw-dockerfile-smoke:local`.
- Installer Docker smoke: `bash scripts/test-install-sh-docker.sh` shares one npm cache across its root, update, and direct-npm containers. Update smoke defaults to npm `latest` as the stable baseline before upgrading to the candidate tarball. Non-root installer checks keep an isolated npm cache so root-owned cache entries do not mask user-local install behavior. Set `OPENCLAW_INSTALL_SMOKE_NPM_CACHE_DIR=/path/to/cache` to reuse the root/update/direct-npm cache across local reruns.
  - Install Smoke CI skips the duplicate direct-npm global update with `OPENCLAW_INSTALL_SMOKE_SKIP_NPM_GLOBAL=1`; run the script locally without that env when direct `npm install -g` coverage is needed.
- Gateway networking (two containers, WS auth + health): `pnpm test:docker:gateway-network` (script: `scripts/e2e/gateway-network-docker.sh`)
- OpenAI Responses web_search minimal reasoning regression: `pnpm test:docker:openai-web-search-minimal` (script: `scripts/e2e/openai-web-search-minimal-docker.sh`) runs a mocked OpenAI server through Gateway, verifies `web_search` raises `reasoning.effort` from `minimal` to `low`, then forces the provider schema reject and checks the raw detail appears in Gateway logs.
- MCP channel bridge (seeded Gateway + stdio bridge + raw Claude notification-frame smoke): `pnpm test:docker:mcp-channels` (script: `scripts/e2e/mcp-channels-docker.sh`)
- Pi bundle MCP tools (real stdio MCP server + embedded Pi profile allow/deny smoke): `pnpm test:docker:pi-bundle-mcp-tools` (script: `scripts/e2e/pi-bundle-mcp-tools-docker.sh`)
- Cron/subagent MCP cleanup (real Gateway + stdio MCP child teardown after isolated cron and one-shot subagent runs): `pnpm test:docker:cron-mcp-cleanup` (script: `scripts/e2e/cron-mcp-cleanup-docker.sh`)
- Plugins (install smoke + `/plugin` alias + Claude-bundle restart semantics): `pnpm test:docker:plugins` (script: `scripts/e2e/plugins-docker.sh`)
- Plugin update unchanged smoke: `pnpm test:docker:plugin-update` (script: `scripts/e2e/plugin-update-unchanged-docker.sh`)
- Config reload metadata smoke: `pnpm test:docker:config-reload` (script: `scripts/e2e/config-reload-source-docker.sh`)
- Bundled plugin runtime deps: `pnpm test:docker:bundled-channel-deps` builds a small Docker runner image by default, builds and packs OpenClaw once on the host, then mounts that tarball into each Linux install scenario. Reuse the image with `OPENCLAW_SKIP_DOCKER_BUILD=1`, skip the host rebuild after a fresh local build with `OPENCLAW_BUNDLED_CHANNEL_HOST_BUILD=0`, or point at an existing tarball with `OPENCLAW_BUNDLED_CHANNEL_PACKAGE_TGZ=/path/to/openclaw-*.tgz`.
  - Narrow bundled plugin runtime deps while iterating by disabling unrelated scenarios, for example:
    `OPENCLAW_BUNDLED_CHANNEL_SCENARIOS=0 OPENCLAW_BUNDLED_CHANNEL_UPDATE_SCENARIO=0 OPENCLAW_BUNDLED_CHANNEL_ROOT_OWNED_SCENARIO=0 OPENCLAW_BUNDLED_CHANNEL_SETUP_ENTRY_SCENARIO=0 pnpm test:docker:bundled-channel-deps`.

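
One item in the list above, the OpenAI Responses `web_search` regression, checks a request-shape adjustment: `reasoning.effort` is raised from `minimal` to `low` when `web_search` is in play. The following is a minimal sketch of that kind of adjustment, not the Gateway's actual code. The `ResponsesRequest` type, the `raiseEffortForWebSearch` helper, and the `example-model` id are all hypothetical names introduced for illustration.

```typescript
// Hypothetical request shape; the real Gateway/provider types may differ.
type ReasoningEffort = "minimal" | "low" | "medium" | "high";

interface ResponsesRequest {
  model: string;
  tools?: Array<{ type: string }>;
  reasoning?: { effort?: ReasoningEffort };
}

// Assumption: a request that enables web_search should not keep `minimal`
// effort, so it is raised to `low`. Every other request passes through as-is.
function raiseEffortForWebSearch(request: ResponsesRequest): ResponsesRequest {
  const usesWebSearch = (request.tools ?? []).some((tool) => tool.type === "web_search");
  if (!usesWebSearch || request.reasoning?.effort !== "minimal") {
    return request;
  }
  // Return a copy so the caller's original request object is not mutated.
  return { ...request, reasoning: { ...request.reasoning, effort: "low" } };
}

const adjusted = raiseEffortForWebSearch({
  model: "example-model",
  tools: [{ type: "web_search" }],
  reasoning: { effort: "minimal" },
});
console.log(adjusted.reasoning?.effort); // prints "low"
```

A pure function like this is easy to cover in a CI-safe unit test, while the Docker lane confirms the same behavior through a mocked server and the real Gateway path.
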

To prebuild and reuse the shared built-app image manually:

```bash
OPENCLAW_DOCKER_E2E_IMAGE=openclaw-docker-e2e:local pnpm test:docker:e2e-build
OPENCLAW_DOCKER_E2E_IMAGE=openclaw-docker-e2e:local OPENCLAW_SKIP_DOCKER_BUILD=1 pnpm test:docker:mcp-channels
```

Suite-specific image overrides such as `OPENCLAW_GATEWAY_NETWORK_E2E_IMAGE` still win when set. When `OPENCLAW_SKIP_DOCKER_BUILD=1` points at a remote shared image, the scripts pull it if it is not already local. The QR and installer Docker tests keep their own Dockerfiles because they validate package/install behavior rather than the shared built-app runtime.

The live-model Docker runners also bind-mount the current checkout read-only and stage it into a temporary workdir inside the container. This keeps the runtime image slim while still running Vitest against your exact local source/config. The staging step skips large local-only caches and app build outputs such as `.pnpm-store`, `.worktrees`, `__openclaw_vitest__`, and app-local `.build` or Gradle output directories so Docker live runs do not spend minutes copying machine-specific artifacts.

They also set `OPENCLAW_SKIP_CHANNELS=1` so gateway live probes do not start real Telegram/Discord/etc. channel workers inside the container. `test:docker:live-models` still runs `pnpm test:live`, so pass through `OPENCLAW_LIVE_GATEWAY_*` as well when you need to narrow or exclude gateway live coverage from that Docker lane.

`test:docker:openwebui` is a higher-level compatibility smoke: it starts an OpenClaw gateway container with the OpenAI-compatible HTTP endpoints enabled, starts a pinned Open WebUI container against that gateway, signs in through Open WebUI, verifies `/api/models` exposes `openclaw/default`, then sends a real chat request through Open WebUI's `/api/chat/completions` proxy. The first run can be noticeably slower because Docker may need to pull the Open WebUI image and Open WebUI may need to finish its own cold-start setup.
This lane expects a usable live model key, and `OPENCLAW_PROFILE_FILE` (`~/.profile` by default) is the primary way to provide it in Dockerized runs. Successful runs print a small JSON payload like `{ "ok": true, "model": "openclaw/default", ... }`.

`test:docker:mcp-channels` is intentionally deterministic and does not need a real Telegram, Discord, or iMessage account. It boots a seeded Gateway container, starts a second container that spawns `openclaw mcp serve`, then verifies routed conversation discovery, transcript reads, attachment metadata, live event queue behavior, outbound send routing, and Claude-style channel + permission notifications over the real stdio MCP bridge. The notification check inspects the raw stdio MCP frames directly so the smoke validates what the bridge actually emits, not just what a specific client SDK happens to surface.

`test:docker:pi-bundle-mcp-tools` is deterministic and does not need a live model key. It builds the repo Docker image, starts a real stdio MCP probe server inside the container, materializes that server through the embedded Pi bundle MCP runtime, executes the tool, then verifies `coding` and `messaging` keep `bundle-mcp` tools while `minimal` and `tools.deny: ["bundle-mcp"]` filter them.

`test:docker:cron-mcp-cleanup` is deterministic and does not need a live model key. It starts a seeded Gateway with a real stdio MCP probe server, runs an isolated cron turn and a `/subagents spawn` one-shot child turn, then verifies the MCP child process exits after each run.

Manual ACP plain-language thread smoke (not CI):

- `bun scripts/dev/discord-acp-plain-language-smoke.ts --channel ...`
- Keep this script for regression/debug workflows. It may be needed again for ACP thread routing validation, so do not delete it.

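
The raw-frame check described for `test:docker:mcp-channels` above can be pictured as reading the bridge's stdout directly rather than going through a client SDK. The sketch below assumes the MCP stdio transport's newline-delimited JSON-RPC 2.0 framing; the `parseFrames` and `notifications` helpers and the `notifications/example` method name are hypothetical and are not taken from the smoke script.

```typescript
// JSON-RPC 2.0 message as it appears on the wire; other fields are ignored here.
interface JsonRpcFrame {
  jsonrpc: "2.0";
  id?: number | string;
  method?: string;
  params?: unknown;
}

// Split captured stdout into newline-delimited frames and parse each one.
function parseFrames(raw: string): JsonRpcFrame[] {
  return raw
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as JsonRpcFrame);
}

// A JSON-RPC notification is a frame that has a method and no id.
function notifications(frames: JsonRpcFrame[]): JsonRpcFrame[] {
  return frames.filter((frame) => frame.method !== undefined && frame.id === undefined);
}

// Hypothetical captured output: one response followed by one notification.
const captured = [
  '{"jsonrpc":"2.0","id":1,"result":{}}',
  '{"jsonrpc":"2.0","method":"notifications/example","params":{"text":"hi"}}',
].join("\n");

const seen = notifications(parseFrames(captured)).map((frame) => frame.method);
console.log(seen); // prints [ 'notifications/example' ]
```

Asserting on frames at this level catches cases where the bridge emits a notification that a particular SDK drops or reshapes.
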

Useful env vars:

- `OPENCLAW_CONFIG_DIR=...` (default: `~/.openclaw`) mounted to `/home/node/.openclaw`
- `OPENCLAW_WORKSPACE_DIR=...` (default: `~/.openclaw/workspace`) mounted to `/home/node/.openclaw/workspace`
- `OPENCLAW_PROFILE_FILE=...` (default: `~/.profile`) mounted to `/home/node/.profile` and sourced before running tests
- `OPENCLAW_DOCKER_PROFILE_ENV_ONLY=1` to verify only env vars sourced from `OPENCLAW_PROFILE_FILE`, using temporary config/workspace dirs and no external CLI auth mounts
- `OPENCLAW_DOCKER_CLI_TOOLS_DIR=...` (default: `~/.cache/openclaw/docker-cli-tools`) mounted to `/home/node/.npm-global` for cached CLI installs inside Docker
- External CLI auth dirs/files under `$HOME` are mounted read-only under `/host-auth...`, then copied into `/home/node/...` before tests start
  - Default dirs: `.minimax`
  - Default files: `~/.codex/auth.json`, `~/.codex/config.toml`, `.claude.json`, `~/.claude/.credentials.json`, `~/.claude/settings.json`, `~/.claude/settings.local.json`
  - Narrowed provider runs mount only the needed dirs/files inferred from `OPENCLAW_LIVE_PROVIDERS` / `OPENCLAW_LIVE_GATEWAY_PROVIDERS`
  - Override manually with `OPENCLAW_DOCKER_AUTH_DIRS=all`, `OPENCLAW_DOCKER_AUTH_DIRS=none`, or a comma list like `OPENCLAW_DOCKER_AUTH_DIRS=.claude,.codex`
- `OPENCLAW_LIVE_GATEWAY_MODELS=...` / `OPENCLAW_LIVE_MODELS=...` to narrow the run
- `OPENCLAW_LIVE_GATEWAY_PROVIDERS=...` / `OPENCLAW_LIVE_PROVIDERS=...` to filter providers in-container
- `OPENCLAW_SKIP_DOCKER_BUILD=1` to reuse an existing `openclaw:local-live` image for reruns that do not need a rebuild
- `OPENCLAW_LIVE_REQUIRE_PROFILE_KEYS=1` to ensure creds come from the profile store (not env)
- `OPENCLAW_OPENWEBUI_MODEL=...` to choose the model exposed by the gateway for the Open WebUI smoke
- `OPENCLAW_OPENWEBUI_PROMPT=...` to override the nonce-check prompt used by the Open WebUI smoke
- `OPENWEBUI_IMAGE=...` to override the pinned Open WebUI image tag

## Docs sanity

Run docs checks
after doc edits: `pnpm check:docs`.
Run full Mintlify anchor validation when you need in-page heading checks too: `pnpm docs:check-links:anchors`.

## Offline regression (CI-safe)

These are “real pipeline” regressions without real providers:

- Gateway tool calling (mock OpenAI, real gateway + agent loop): `src/gateway/gateway.test.ts` (case: "runs a mock OpenAI tool call end-to-end via gateway agent loop")
- Gateway wizard (WS `wizard.start`/`wizard.next`, writes config + auth enforced): `src/gateway/gateway.test.ts` (case: "runs wizard over ws and writes auth token config")

## Agent reliability evals (skills)

We already have a few CI-safe tests that behave like “agent reliability evals”:

- Mock tool-calling through the real gateway + agent loop (`src/gateway/gateway.test.ts`).
- End-to-end wizard flows that validate session wiring and config effects (`src/gateway/gateway.test.ts`).

What’s still missing for skills (see [Skills](/tools/skills)):

- **Decisioning:** when skills are listed in the prompt, does the agent pick the right skill (or avoid irrelevant ones)?
- **Compliance:** does the agent read `SKILL.md` before use and follow required steps/args?
- **Workflow contracts:** multi-turn scenarios that assert tool order, session history carryover, and sandbox boundaries.

Future evals should stay deterministic first:

- A scenario runner using mock providers to assert tool calls + order, skill file reads, and session wiring.
- A small suite of skill-focused scenarios (use vs avoid, gating, prompt injection).
- Optional live evals (opt-in, env-gated) only after the CI-safe suite is in place.

## Contract tests (plugin and channel shape)

Contract tests verify that every registered plugin and channel conforms to its interface contract. They iterate over all discovered plugins and run a suite of shape and behavior assertions.
The default `pnpm test` unit lane intentionally skips these shared seam and smoke files; run the contract commands explicitly when you touch shared channel or provider surfaces.

### Commands

- All contracts: `pnpm test:contracts`
- Channel contracts only: `pnpm test:contracts:channels`
- Provider contracts only: `pnpm test:contracts:plugins`

### Channel contracts

Located in `src/channels/plugins/contracts/*.contract.test.ts`:

- **plugin** - Basic plugin shape (id, name, capabilities)
- **setup** - Setup wizard contract
- **session-binding** - Session binding behavior
- **outbound-payload** - Message payload structure
- **inbound** - Inbound message handling
- **actions** - Channel action handlers
- **threading** - Thread ID handling
- **directory** - Directory/roster API
- **group-policy** - Group policy enforcement

### Provider status contracts

Located in `src/plugins/contracts/*.contract.test.ts`.

- **status** - Channel status probes
- **registry** - Plugin registry shape

### Provider contracts

Located in `src/plugins/contracts/*.contract.test.ts`:

- **auth** - Auth flow contract
- **auth-choice** - Auth choice/selection
- **catalog** - Model catalog API
- **discovery** - Plugin discovery
- **loader** - Plugin loading
- **runtime** - Provider runtime
- **shape** - Plugin shape/interface
- **wizard** - Setup wizard

### When to run

- After changing plugin-sdk exports or subpaths
- After adding or modifying a channel or provider plugin
- After refactoring plugin registration or discovery

Contract tests run in CI and do not require real API keys.

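
For orientation, a shape contract of the kind described in this section can be sketched as a loop over discovered plugins that collects violations. This is an illustrative sketch only: `ChannelPluginLike`, `checkPluginShape`, and the inline `discovered` list are hypothetical stand-ins, the assumption that `capabilities` is an array is mine, and the real contract files run as Vitest suites rather than plain functions.

```typescript
// Hypothetical plugin shape; the real channel plugin interface has more fields.
interface ChannelPluginLike {
  id?: unknown;
  name?: unknown;
  capabilities?: unknown;
}

// Collect shape violations instead of throwing, so one run reports every problem.
function checkPluginShape(plugin: ChannelPluginLike): string[] {
  const problems: string[] = [];
  if (typeof plugin.id !== "string" || plugin.id.length === 0) {
    problems.push("id must be a non-empty string");
  }
  if (typeof plugin.name !== "string" || plugin.name.length === 0) {
    problems.push("name must be a non-empty string");
  }
  // Assumption for this sketch: capabilities is modeled as an array.
  if (!Array.isArray(plugin.capabilities)) {
    problems.push("capabilities must be an array");
  }
  return problems;
}

// Stand-in for discovery: a real contract suite would iterate the registered plugins.
const discovered: ChannelPluginLike[] = [
  { id: "alpha", name: "Alpha", capabilities: ["send"] },
  { id: "beta", name: "", capabilities: ["send"] },
];

const report = discovered.map((plugin) => ({
  id: String(plugin.id),
  problems: checkPluginShape(plugin),
}));
console.log(report);
```

Because the loop runs over whatever discovery returns, a newly added plugin is checked without anyone having to remember to write a dedicated test for it.
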

## Adding regressions (guidance)

When you fix a provider/model issue discovered in live:

- Add a CI-safe regression if possible (mock/stub provider, or capture the exact request-shape transformation)
- If it’s inherently live-only (rate limits, auth policies), keep the live test narrow and opt-in via env vars
- Prefer targeting the smallest layer that catches the bug:
  - provider request conversion/replay bug → direct models test
  - gateway session/history/tool pipeline bug → gateway live smoke or CI-safe gateway mock test
- SecretRef traversal guardrail:
  - `src/secrets/exec-secret-ref-id-parity.test.ts` derives one sampled target per SecretRef class from registry metadata (`listSecretTargetRegistryEntries()`), then asserts traversal-segment exec ids are rejected.
  - If you add a new `includeInPlan` SecretRef target family in `src/secrets/target-registry-data.ts`, update `classifyTargetClass` in that test. The test intentionally fails on unclassified target ids so new classes cannot be skipped silently.

## Related

- [Testing live](/help/testing-live)
- [CI](/ci)