feat(qa): recreate qa lab docker stack

2026-04-24 07:01:49 +00:00 · 2026-04-05 21:21:44 +01:00
parent 17a324b0de
commit 8e1c81e707
28 changed files with 2088 additions and 890 deletions
--- a/docs/concepts/qa-e2e-automation.md
+++ b/docs/concepts/qa-e2e-automation.md
@@ -1,865 +1,66 @@
 ---
-title: "QA E2E Automation"
-summary: "Design note for a full end-to-end QA system built on a synthetic message-channel plugin, Dockerized OpenClaw, and subagent-driven scenario execution"
+summary: "Private QA automation shape for qa-lab, qa-channel, seeded scenarios, and protocol reports"
 read_when:
-  - You are designing a true end-to-end QA harness for OpenClaw
-  - You want a synthetic message channel for automated feature verification
-  - You want subagents to discover features, run scenarios, and propose fixes
+  - Extending qa-lab or qa-channel
+  - Adding repo-backed QA scenarios
+  - Building higher-realism QA automation around the Gateway dashboard
+title: "QA E2E Automation"
 ---

 # QA E2E Automation

-This note proposes a true end-to-end QA system for OpenClaw built around a
-real channel plugin dedicated to testing.
+The private QA stack is meant to exercise OpenClaw in a more realistic,
+channel-shaped way than a single unit test can.

-The core idea:
+Current pieces:

- run OpenClaw inside Docker in a realistic gateway configuration
- expose a synthetic but full-featured message channel as a normal plugin
- let a QA harness inject inbound traffic and inspect outbound state
- let OpenClaw agents and subagents explore, verify, and report on behavior
- optionally escalate failing scenarios into host-side fix workflows that open PRs
+- `extensions/qa-channel`: synthetic message channel with DM, channel, thread,
+  reaction, edit, and delete surfaces.
+- `extensions/qa-lab`: debugger UI and QA bus for observing the transcript,
+  injecting inbound messages, and exporting a Markdown report.
+- `qa/`: repo-backed seed assets for the kickoff task and baseline QA
+  scenarios.

-This is not a unit-test replacement. It is a product-level system test layer.
+The long-term goal is a two-pane QA site:

-## Chosen direction
+- Left: Gateway dashboard (Control UI) with the agent.
+- Right: QA Lab, showing the Slack-ish transcript and scenario plan.

-The initial direction for this project is:
+That lets an operator or automation loop give the agent a QA mission, observe
+real channel behavior, and record what worked, failed, or stayed blocked.

- build the full system inside this repo
- test against a matrix, not a single model/provider pair
- use Markdown reports as the first output artifact
- defer auto-PR and auto-fix work until later
- treat Slack-class semantics as the MVP transport target
- keep orchestration simple in v1, with a host-side controller that exercises
-  the moving parts directly
- evolve toward OpenClaw becoming the orchestration layer later, once the
-  transport, scenario, and reporting model are proven
+## Repo-backed seeds

-## Goals
+Seed assets live in `qa/`:

- Test OpenClaw through a real messaging-channel boundary, not only `chat.send`
-  or embedded mocks.
- Verify channel semantics that matter for real use:
-  - DMs
-  - channels/groups
-  - threads
-  - edits
-  - deletes
-  - reactions
-  - polls
-  - attachments
- Verify agent behavior across realistic user flows:
-  - memory
-  - thread binding
-  - model switching
-  - cron jobs
-  - subagents
-  - approvals
-  - routing
-  - channel-specific `message` actions
- Make the QA runner capable of feature discovery:
-  - read docs
-  - inspect plugin capability discovery
-  - inspect code and config
-  - generate a scenario protocol
- Support deterministic protocol tests and best-effort real-model tests as
-  separate lanes.
- Allow automated bug triage artifacts that can feed a host-side fix worker.
+- `qa/QA_KICKOFF_TASK.md`
+- `qa/seed-scenarios.json`

-## Non-goals
+These are intentionally in git so the QA plan is visible to both humans and the
+agent. The baseline list should stay broad enough to cover:

- Not a replacement for existing unit, contract, or live tests.
- Not a production channel.
- Not a requirement that all bug fixing happen from inside the Dockerized
-  OpenClaw runtime.
- Not a reason to add test-only core branches for one channel.
+- DM and channel chat
+- thread behavior
+- message action lifecycle
+- cron callbacks
+- memory recall
+- model switching
+- subagent handoff
+- repo-reading and docs-reading
+- one small build task such as Lobster Invaders

-## Why a channel plugin
+## Reporting

-OpenClaw already has the right boundary:
+`qa-lab` exports a Markdown protocol report from the observed bus timeline.
+The report should answer:

- core owns the shared `message` tool, prompt wiring, outer session
-  bookkeeping, and dispatch
- channel plugins own:
-  - config
-  - pairing
-  - security
-  - session grammar
-  - threading
-  - outbound delivery
-  - channel-owned actions and capability discovery
+- What worked
+- What failed
+- What stayed blocked
+- What follow-up scenarios are worth adding

-That means the cleanest design is:
+## Related docs

- a real channel plugin for QA transport semantics
- a separate QA control plane for injection and inspection
-
-This keeps the test transport inside the same architecture used by Slack,
-Discord, Teams, and similar channels.
-
-## System overview
-
-The system has six pieces.
-
-1. `qa-channel` plugin
-
- Bundled extension under `extensions/qa-channel`
- Normal `ChannelPlugin`
- Behaves like a Slack/Discord/Teams-class channel
- Registers channel-owned message actions through the shared `message` tool
-
-2. `qa-bus` sidecar
-
- Small HTTP and/or WS service
- Canonical state store for synthetic conversations, messages, threads,
-  reactions, edits, and event history
- Accepts inbound events from the harness
- Exposes inspection and wait APIs for assertions
-
-3. Dockerized OpenClaw gateway
-
- Runs as close to real deployment as practical
- Loads `qa-channel`
- Uses normal config, routing, session, cron, and plugin loading
-
-4. QA orchestrator
-
- Host-side runner or dedicated OpenClaw-driven controller
- Provisions scenario environments
- Seeds config
- Resets state
- Executes test matrix
- Collects structured outcomes
-
-5. Auto-fix worker
-
- Host-side workflow
- Creates a worktree
- launches a coding agent
- runs scoped verification
- opens a PR
-
-The auto-fix worker should start outside the container. It needs direct repo
-and GitHub access, clean worktree control, and better isolation from the
-runtime under test.
-
-6. `qa-lab` extension
-
- Bundled extension under `extensions/qa-lab`
- Owns the QA harness, Markdown report flow, and private debugger UI
- Registers hidden CLI entrypoints such as `openclaw qa run` and
-  `openclaw qa ui`
- Stays separate from the shipped Control UI bundle
-
-## High-level flow
-
-1. Start `qa-bus`.
-2. Start OpenClaw in Docker with `qa-channel` enabled.
-3. QA orchestrator injects inbound messages into `qa-bus`.
-4. `qa-channel` receives them as normal inbound traffic.
-5. OpenClaw runs the agent loop normally.
-6. Outbound replies and channel actions flow back through `qa-channel` into
-   `qa-bus`.
-7. QA orchestrator inspects state or waits on events.
-8. Orchestrator records pass/fail/flaky/unknown plus artifacts.
-9. Severe failures optionally emit a bug packet for the host-side fix worker.
-
-## Lanes
-
-The system should have two distinct lanes.
-
-### Lane A: deterministic protocol lane
-
-Use a deterministic or tightly controlled model setup.
-
-Preferred options:
-
- a canned provider fixture
- the bundled `synthetic` provider when useful
- fixed prompts with exact assertions
-
-Purpose:
-
- verify transport and product semantics
- keep flakiness low
- catch regressions in routing, memory plumbing, thread binding, cron, and tool
-  invocation
-
-### Lane B: quality lane
-
-Use real providers and real models in a matrix.
-
-Purpose:
-
- verify that the agent can still do good work end to end
- evaluate feature discoverability and instruction following
- surface model-specific breakage or degraded behavior
-
-Expected result type:
-
- best-effort
- rubric-based
- more tolerant of wording variation
-
-Matrix guidance for v1:
-
- start with a small curated matrix, not "everything configured"
- keep deterministic protocol runs separate from quality runs
- report matrix cells independently so one provider/model failure does not hide
-  transport correctness
-
-Do not mix these lanes. Protocol correctness and model quality should fail
-independently.
-
-## Use existing bootstrap seam first
-
-Before the custom channel exists, OpenClaw already has a useful bootstrap path:
-
- admin-scoped synthetic originating-route fields on `chat.send`
- synthetic message-channel headers for HTTP flows
-
-That is enough to build a first QA controller for:
-
- thread/session routing
- ACP bind flows
- subagent delivery
- cron wake paths
- memory persistence checks
-
-This should be Phase 0 because it de-risks the scenario protocol before the
-full channel lands.
-
-## `qa-lab` extension design
-
-`qa-lab` is the private operator-facing half of this system.
-
-Suggested package:
-
- `extensions/qa-lab/`
-
-Suggested responsibilities:
-
- host the synthetic bus state machine
- host the scenario runner
- write Markdown reports
- serve a private debugger UI on a separate local server
- keep that UI entirely outside the shipped Control UI bundle
-
-Suggested UI shape:
-
- left rail for conversations and threads
- center transcript pane
- right rail for event stream and report inspection
- bottom inject-composer for inbound QA traffic
-
-## `qa-channel` plugin design
-
-## Package layout
-
-Suggested package:
-
- `extensions/qa-channel/`
-
-Suggested file layout:
-
- `package.json`
- `openclaw.plugin.json`
- `index.ts`
- `setup-entry.ts`
- `api.ts`
- `runtime-api.ts`
- `src/channel.ts`
- `src/channel-api.ts`
- `src/config-schema.ts`
- `src/setup-core.ts`
- `src/setup-surface.ts`
- `src/runtime.ts`
- `src/channel.runtime.ts`
- `src/inbound.ts`
- `src/outbound.ts`
- `src/state-client.ts`
- `src/targets.ts`
- `src/threading.ts`
- `src/message-actions.ts`
- `src/probe.ts`
- `src/doctor.ts`
- `src/*.test.ts`
-
-Model it after Slack, Discord, Teams, or Google Chat packaging, not as a one-off
-test helper.
-
-## Capabilities
-
-MVP capabilities:
-
- one account
- DMs
- channels
- threads
- send text
- reply in thread
- read
- edit
- delete
- react
- search
- upload-file
- download-file
-
-Phase 2 capabilities:
-
- polls
- member-info
- channel-info
- channel-list
- pin and unpin
- permissions
- topic create and edit
-
-These map naturally onto the shared `message` tool action model already used by
-channel plugins.
-
-## Conversation model
-
-Use a stable synthetic grammar that supports both simplicity and realistic
-coverage.
-
-Suggested ids:
-
- DM conversation: `dm:<user-id>`
- channel: `chan:<space-id>`
- thread: `thread:<space-id>:<thread-id>`
- message id: `msg:<ulid>`
-
-Suggested target forms:
-
- `qa:dm:<user-id>`
- `qa:chan:<space-id>`
- `qa:thread:<space-id>:<thread-id>`
-
-The plugin should own translation between external target strings and canonical
-conversation ids.
-
-## Pairing and security
-
-Even though this is a QA channel, it should still implement real policy
-surfaces:
-
- DM allowlist / pairing flow
- group policy
- mention gating where relevant
- trusted sender ids
-
-Reason:
-
- these are product features and should be testable through the QA transport
- the QA lane should be able to verify policy failures, not only happy paths
-
-## Threading model
-
-Threading is one of the main reasons to build this channel.
-
-Required semantics:
-
- create thread from a top-level message
- reply inside an existing thread
- list thread messages
- preserve parent message linkage
- let OpenClaw thread binding attach a session to a thread
-
-The QA bus must preserve:
-
- conversation id
- thread id
- parent message id
- sender id
- timestamps
-
-## Channel-owned message actions
-
-The plugin should implement `actions.describeMessageTool(...)` and
-`actions.handleAction(...)`.
-
-MVP action list:
-
- `send`
- `read`
- `reply`
- `react`
- `edit`
- `delete`
- `thread-create`
- `thread-reply`
- `search`
- `upload-file`
- `download-file`
-
-This is enough to test the shared `message` tool end to end with real channel
-semantics.
-
-## `qa-bus` design
-
-`qa-bus` is the transport simulator and assertion backend.
-
-It should not know OpenClaw internals. It should know channel state.
-
-For v1, keep `qa-bus` in this repo so:
-
- fixtures and scenarios evolve with product code
- the transport contract can change in lock-step with the plugin
- CI and local dev do not need another repo checkout
-
-## Responsibilities
-
- accept inbound user/platform events
- persist canonical conversation state
- persist append-only event log
- expose inspection APIs
- expose blocking wait APIs
- support reset per scenario or per suite
-
-## Transport
-
-HTTP is enough for MVP.
-
-Suggested endpoints:
-
- `POST /reset`
- `POST /inbound/message`
- `POST /inbound/edit`
- `POST /inbound/delete`
- `POST /inbound/reaction`
- `POST /inbound/thread/create`
- `GET /state/conversations`
- `GET /state/messages`
- `GET /state/threads`
- `GET /events`
- `POST /wait`
-
-Optional WS stream:
-
- `/stream`
-
-Useful for live event taps and debugging.
-
-## State model
-
-Persist three layers.
-
-1. Conversation snapshot
-
- participants
- type
- thread topology
- latest message pointers
-
-2. Message snapshot
-
- sender
- content
- attachments
- edit history
- reactions
- parent and thread linkage
-
-3. Append-only event log
-
- canonical timestamp
- causal ordering
- source: inbound, outbound, action, system
- payload
-
-The append-only log matters because many QA assertions are event-oriented, not
-just state-oriented.
-
-## Assertion API
-
-The harness needs waiters, not just snapshots.
-
-Suggested `POST /wait` contract:
-
- `kind`
- `match`
- `timeoutMs`
-
-Examples:
-
- wait for outbound message matching text regex
- wait for thread creation
- wait for reaction added
- wait for message edit
- wait for no event of type X within Y ms
-
-This gives stable tests without custom polling code in every scenario.
-
-## QA orchestrator design
-
-The orchestrator should own scenario planning and artifact collection.
-
-Start host-side. Later, OpenClaw can orchestrate parts of it.
-
-This is the chosen v1 direction.
-
-Why:
-
- simpler to iterate while the transport and scenario protocol are still moving
- easier access to the repo, logs, Docker, and test fixtures
- easier artifact collection and report generation
- avoids over-coupling the first version to subagent behavior before the QA
-  protocol itself is stable
-
-## Inputs
-
- docs pages
- channel capability discovery
- configured provider/model lane
- scenario catalog
- repo/test metadata
-
-## Outputs
-
- structured protocol report
- scenario transcript
- captured channel state
- gateway logs
- failure packets
-
-For v1, the primary output is a Markdown report.
-
-Suggested report sections:
-
- suite summary
- environment
- provider/model matrix
- scenarios passed
- scenarios failed
- flaky or inconclusive scenarios
- captured evidence links or inline excerpts
- suspected ownership or file hints
- follow-up recommendations
-
-## Scenario format
-
-Use a data-driven scenario spec.
-
-Suggested shape:
-
-```json
-{
-  "id": "thread-memory-recall",
-  "lane": "deterministic",
-  "preconditions": ["qa-channel", "memory-enabled"],
-  "steps": [
-    {
-      "type": "injectMessage",
-      "to": "qa:dm:user-a",
-      "text": "Remember that the deploy key is kiwi."
-    },
-    { "type": "waitForOutbound", "match": { "textIncludes": "kiwi" } },
-    { "type": "injectMessage", "to": "qa:dm:user-a", "text": "What was the deploy key?" },
-    { "type": "waitForOutbound", "match": { "textIncludes": "kiwi" } }
-  ],
-  "assertions": [{ "type": "outboundTextIncludes", "value": "kiwi" }]
-}
-```
-
-Keep the execution engine generic and the scenario catalog declarative.
-
-## Feature discovery
-
-The orchestrator can discover candidate scenarios from three sources.
-
-1. Docs
-
- channel docs
- testing docs
- gateway docs
- subagents docs
- cron docs
-
-2. Runtime capability discovery
-
- channel `message` action discovery
- plugin status and channel capabilities
- configured providers/models
-
-3. Code hints
-
- known action names
- channel-specific feature flags
- config schema
-
-This should produce a proposed protocol with:
-
- must-test
- can-test
- blocked
- unsupported
-
-## Scenario classes
-
-Recommended catalog:
-
- transport basics
-  - DM send and reply
-  - channel send
-  - thread create and reply
-  - reaction add and read
-  - edit and delete
- policy
-  - allowlist
-  - pairing
-  - group mention gating
- shared `message` tool
-  - read
-  - search
-  - reply
-  - react
-  - upload and download
- agent quality
-  - follows channel context
-  - obeys thread semantics
-  - uses memory across turns
-  - switches model when instructed
- automation
-  - cron add and run
-  - cron delivery into channel
-  - scheduled reminders
- subagents
-  - spawn
-  - announce
-  - threaded follow-up
-  - nested orchestration when enabled
- failure handling
-  - unsupported action
-  - timeout
-  - malformed target
-  - policy denial
-
-## OpenClaw as orchestrator
-
-Longer-term, OpenClaw itself can coordinate the QA run.
-
-Suggested architecture:
-
- one controller session
- N worker subagents
- each worker owns one scenario or scenario shard
- workers report structured results back to controller
-
-Good fits for existing OpenClaw primitives:
-
- `sessions_spawn`
- `subagents`
- cron-based wakeups for long-running suites
- thread-bound sessions for scenario-local follow-up
-
-Best near-term use:
-
- controller generates the plan
- workers execute scenarios in parallel
- controller synthesizes report
-
-Avoid making the controller also own host Git operations in the first version.
-
-Chosen direction:
-
- v1: host-side controller
- v2+: OpenClaw-native orchestration once the scenario protocol and transport
-  model are stable
-
-## Auto-fix workflow
-
-The system should emit a structured bug packet when a scenario fails.
-
-Suggested bug packet:
-
- scenario id
- lane
- failure kind
- minimal repro steps
- channel event transcript
- gateway transcript
- logs
- suspected files
- confidence
-
-Host-side fix worker flow:
-
-1. receive bug packet
-2. create detached worktree
-3. launch coding agent in worktree
-4. write failing regression first when practical
-5. implement fix
-6. run scoped verification
-7. open PR
-
-This should remain host-side at first because it needs:
-
- repo write access
- worktree hygiene
- git credentials
- GitHub auth
-
-Chosen direction:
-
- do not auto-open PRs in v1
- emit Markdown reports and structured failure packets first
- add host-side worktree + PR automation later
-
-## Rollout plan
-
-## Phase 0: bootstrap on existing synthetic ingress
-
-Build a first QA runner without a new channel:
-
- use `chat.send` with admin-scoped synthetic originating-route fields
- run deterministic scenarios against routing, memory, cron, subagents, and ACP
- validate protocol format and artifact collection
-
-Exit criteria:
-
- scenario runner exists
- structured protocol report exists
- failure artifacts exist
-
-## Phase 1: MVP `qa-channel`
-
-Build the plugin and bus with:
-
- DM
- channels
- threads
- read
- reply
- react
- edit
- delete
- search
-
-Target semantics:
-
- Slack-class transport behavior
- not full Teams-class parity yet
-
-Exit criteria:
-
- OpenClaw in Docker can talk to `qa-bus`
- harness can inject + inspect
- one green end-to-end suite across message transport and agent behavior
-
-## Phase 2: protocol expansion
-
-Add:
-
- attachments
- polls
- pins
- richer policy tests
- quality lane with real provider/model matrix
-
-Exit criteria:
-
- scenario matrix covers major built-in features
- deterministic and quality lanes are separated
-
-## Phase 3: subagent-driven QA
-
-Add:
-
- controller agent
- worker subagents
- scenario discovery from docs + capability discovery
- parallel execution
-
-Exit criteria:
-
- one controller can fan out and synthesize a suite report
-
-## Phase 4: auto-fix loop
-
-Add:
-
- bug packet emission
- host-side worktree runner
- PR creation
-
-Exit criteria:
-
- selected failures can auto-produce draft PRs
-
-## Risks
-
-## Risk: too much magic in one layer
-
-If the QA channel, bus, and orchestrator all become smart at once, debugging
-will be painful.
-
-Mitigation:
-
- keep `qa-channel` transport-focused
- keep `qa-bus` state-focused
- keep orchestrator separate
-
-## Risk: flaky assertions from model variance
-
-Mitigation:
-
- deterministic lane
- quality lane
- different pass criteria
-
-## Risk: test-only branches leaking into core
-
-Mitigation:
-
- no core special cases for `qa-channel`
- use normal plugin seams
- use admin synthetic ingress only as bootstrap
-
-## Risk: auto-fix overreach
-
-Mitigation:
-
- keep fix worker host-side
- require explicit policy for when PRs can open automatically
- gate with scoped tests
-
-## Risk: building a fake platform nobody uses
-
-Mitigation:
-
- emulate Slack/Discord/Teams semantics, not an abstract transport
- prioritize features that stress shared OpenClaw boundaries
-
-## MVP recommendation
-
-If building this now, start with this exact order.
-
-1. Host-side scenario runner using existing synthetic originating-route support.
-2. `qa-bus` sidecar with state, events, reset, and wait APIs.
-3. `extensions/qa-channel` MVP with DMs, channels, threads, reply, read, react,
-   edit, delete, and search.
-4. Markdown report generator for suite + matrix output.
-5. One deterministic end-to-end suite:
-   - inject inbound DM
-   - verify reply
-   - create thread
-   - verify follow-up in thread
-   - verify memory recall on later turn
-6. Add curated real-model matrix quality lane.
-7. Add controller subagent orchestration.
-8. Add host-side auto-fix worktree runner.
-
-This order gets real value quickly without requiring the full grand design to
-land before the first useful signal appears.
-
-## Current product decisions
-
- `qa-bus` lives inside this repo
- the first controller is host-side
- Slack-class behavior is the MVP target
- the quality lane uses a curated matrix
- first version produces Markdown reports, not PRs
- OpenClaw-native orchestration is a later phase, not a v1 requirement
+- [Testing](/help/testing)
+- [QA Channel](/channels/qa-channel)
+- [Dashboard](/web/dashboard)