mirror of
https://github.com/moltbot/moltbot.git
synced 2026-05-11 04:48:05 +00:00
658 lines
33 KiB
Markdown
---
summary: "Plan for reducing OpenClaw's dependency on external PI packages while moving agent state toward SQLite, VFS scratch storage, and worker isolation"
title: "Refactoring"
read_when:
  - Planning work to internalize PI runtime pieces
  - Moving session, transcript, or agent scratch state from JSON files to SQLite
  - Designing agent filesystem boundaries or VFS-backed scratch storage
  - Evaluating Node workers for agent runtime isolation or parallelism
---

This is a planning document for issue
[openclaw/openclaw#78096](https://github.com/openclaw/openclaw/issues/78096).

The goal is not to delete PI in one large rewrite. The goal is to make OpenClaw
own the runtime boundary, state model, filesystem capabilities, and parallel
execution shape so PI can become an implementation detail and eventually be
internalized or replaced in slices.

## Current Shape

OpenClaw currently embeds PI directly. The main loop still imports
`@mariozechner/pi-coding-agent`, `@mariozechner/pi-agent-core`, `@mariozechner/pi-ai`,
and `@mariozechner/pi-tui` across agent runtime, tool, provider, transcript, and
TUI paths. See [PI integration architecture](/pi).

Before this refactor, session and runtime state was split across several
persistence mechanisms:

- Gateway session index: `sessions.json`
- Session transcripts: `*.jsonl`
- Auth profiles: `auth-profiles.json`
- Config: `openclaw.json`
- Task registry: SQLite
- Plugin state: SQLite
- Memory indexes: SQLite or QMD-owned SQLite
- Plugin-specific JSON and JSONL sidecars

That mix was workable, but it created duplicated read, write, migration,
locking, maintenance, and diagnostics code. The branch now moves canonical
runtime state into the shared SQLite database and treats old JSON files as
doctor migration inputs, not runtime compatibility stores.

## Current Implementation Status

This plan has started landing in slices:

- Shared state database exists at `~/.openclaw/state/openclaw.sqlite` with
  WAL, shared schema migration, session, transcript, VFS, and artifact tables.
  The shared `kv` table now has a small typed helper for scoped JSON-compatible
  values so low-risk JSON sidecars can move behind the same SQLite connection
  without each feature reimplementing read/write/delete glue.
- Canonical per-agent session stores use SQLite by default. The `openclaw doctor`
  fix mode imports legacy `sessions.json` indexes into SQLite and removes the
  JSON index after import, instead of keeping a startup migration or parallel
  compatibility/export store. Runtime session reads and writes normalize and
  persist only: no JSON import, row pruning, capping, archive cleanup, or
  disk-budget cleanup runs on the hot path. The old maintenance write options
  and explicit session cleanup command have been removed from the session-store
  API; doctor owns legacy import. Status and discovery now use the primary
  session-store loader instead of a duplicated read-only JSON parser, and
  SQLite-backed agent session directories remain discoverable after doctor
  deletes the legacy `sessions.json` file. The legacy JSON session-store
  object/serialized cache is gone; JSON fallback reads now parse directly while
  canonical SQLite stores avoid that path. The cron timer no longer runs a
  dedicated session reaper.
- Transcript events are SQLite-primary. OpenClaw-owned append paths require
  agent/session scope and write `transcript_events` directly; `*.jsonl` is no
  longer a runtime mirror for those paths. JSONL is now an explicit
  import/export/debug boundary shape only. The OpenClaw transcript session manager,
  Gateway-injected assistant messages, CLI transcript persistence, Codex
  app-server mirroring, compaction successor transcripts, manual compaction
  boundary rewrites, and reset/header creation all persist through SQLite.
  Scoped latest/tail assistant reads, delivery-mirror idempotency/latest-match
  checks, `/export-session`, `before_reset` hook payloads, silent rotation
  replay, chat/TUI history, restart/subagent recovery, managed media indexing,
  token estimation, title/preview/usage helpers, runtime transcript repair,
  bootstrap completion checks, and bounded inspection all use the scoped SQLite
  transcript. Legacy JSONL import is doctor/import/debug only:
  `openclaw doctor --fix` builds the transcript database from old files and removes the JSONL
  sources after successful import. Runtime paths do not import, prune, or repair
  JSONL files. Pre-compaction checkpoints are SQLite transcript snapshots, not
  `.checkpoint.*.jsonl` copies; branch/restore and checkpoint pruning now work
  against snapshot rows. The old PI session-manager cache/prewarm layer is gone.
- `AgentFilesystem` and `SqliteVirtualAgentFs` exist for scratch storage, with
  `disk`, `vfs-scratch`, and `vfs-only` filesystem modes at the runtime
  boundary. VFS contents can be listed and exported for support bundles. When
  child-process execution is available, VFS-only `exec` projects scratch
  contents into a temporary disk view, runs foreground commands there, and syncs
  created, edited, and deleted files back into SQLite scratch storage.
  Worker-backed PI runs now receive the mode-aware `AgentFilesystem` through
  the rehydrated run params, and the PI attempt consumes the runtime-provided
  artifact store before falling back to the legacy inline SQLite constructor.
  When that runtime filesystem has no host workspace capability, read, write,
  edit, apply_patch, and foreground exec operate on the SQLite scratch VFS when
  allowed; process stays unavailable because background sessions still require a
  real process registry and follow-up polling path.
- `tool_artifacts` has a SQLite store primitive for generated artifact staging,
  export, and per-run cleanup. Runtime trajectory capture now mirrors the
  bounded `*.trajectory.jsonl` sidecar into run-scoped SQLite artifacts while
  retaining the disk sidecar for compatibility. Tool execution now records
  media-result manifests for generated or captured tool media in the same
  run-scoped artifact store while keeping delivery files on disk.
- Managed outgoing image attachment metadata now uses the shared SQLite `kv`
  store as the primary record path. Older per-attachment JSON files are imported
  and removed by `openclaw doctor --fix`; runtime media reads only SQLite.
- Cron job definitions, runtime schedule state, and run history now use the
  shared SQLite state database. `openclaw doctor --fix` imports legacy
  `jobs.json`, `jobs-state.json`, and `cron/runs/*.jsonl` files into SQLite and
  removes those file sources after a successful import. Runtime cron paths no
  longer write job-definition, schedule-state, or run-history JSON files.
- The subagent run registry now uses the shared SQLite `kv` store as the
  primary record path. `openclaw doctor --fix` imports legacy
  `subagents/runs.json` files into SQLite and removes them after import.
  Runtime paths no longer import or delete that JSON file.
- Sandbox container and browser registries now use the shared SQLite
  `sandbox_registry_entries` table as the primary record path. Legacy
  monolithic and sharded registry JSON migrates only through
  `openclaw doctor --fix`; runtime reads and writes no longer touch registry
  JSON.
- OpenRouter model capability cache now uses the shared SQLite `kv` store as
  the primary persistent cache. The older
  `cache/openrouter-models.json` file is imported and removed by
  `openclaw doctor --fix`, not by runtime cache reads.
- Codex app-server thread bindings now use the shared SQLite `kv` store as the
  only runtime record path. The old per-session
  `.codex-app-server.json` sidecar reader/writer has been removed from runtime
  and tests now seed the binding store directly. `openclaw doctor --fix`
  imports old sidecars into SQLite and removes the JSON source.
- TUI last-session restore pointers now use the shared SQLite `kv` store as the
  primary record path. The older `tui/last-session.json` file is imported and
  removed by `openclaw doctor --fix`; runtime TUI reads only SQLite.
- Auth profile runtime routing state now uses the shared SQLite `kv` store as
  the primary record path. Older per-agent `auth-state.json` files are imported
  and removed by `openclaw doctor --fix`; `auth-profiles.json` still owns
  credentials and stays file-backed.
- Device identity, local device auth tokens, bootstrap tokens, device/node
  pairing ledgers, channel pairing requests/allowlists, inferred commitment
  records, subagent run records, TUI restore pointers, auth routing state,
  OpenRouter model cache, web push subscriptions/VAPID keys, APNs registration
  state, and update-check state now use the shared SQLite `kv` store.
  `openclaw doctor --fix` imports the legacy
  `identity/*.json`, `devices/*.json`, `nodes/*.json`,
  `credentials/*-pairing.json`, `credentials/*-allowFrom.json`,
  `commitments/commitments.json`, `subagents/runs.json`,
  `tui/last-session.json`, per-agent `auth-state.json`,
  `cache/openrouter-models.json`, `push/*.json`, and `update-check.json` files
  into SQLite and removes those files after a successful import. Runtime paths
  no longer read or write those JSON ledgers.
- `AgentRuntimeBackend`, `PreparedAgentRun`, and the Node worker runner exist
  for serializable prepared runs. `RunEventBus` owns serial parent event
  delivery for worker event streams. The worker runner enforces prepared-run
  timeouts, terminates on parent abort signals, and flushes async parent event
  handlers in worker message order before resolving the result. The worker entry
  constructs mode-aware filesystem capabilities: `disk` and `vfs-scratch` keep
  host workspace access, while `vfs-only` exposes only SQLite scratch/artifact
  storage. The harness layer can reduce a live attempt into a
  structured-cloneable `PreparedAgentRun` descriptor with prepared delivery
  policy decisions, and the same reducer now works at the higher-level
  `runEmbeddedPiAgent` params boundary before model/auth/registry setup creates
  live objects. That high-level reducer also keeps a sanitized serializable
  `runParams` snapshot so channel routing, sender metadata, images, prompt/tool
  policy, and other data-only fields can cross the worker boundary without
  cloning parent callbacks, abort refs, enqueue functions, or reply-operation
  handles. A worker-side rehydration helper turns that snapshot back into
  `runEmbeddedPiAgent` params and installs callback shims that emit worker
  events for the parent bridge. A PI worker backend module now exists as the
  runnable worker target for that rehydrated high-level path, and a parent-side
  runner can execute that backend through the generic worker runner while
  preserving the full embedded run result. Parent-owned streaming callbacks,
  reply refs, user-message persistence callbacks, and abort signals now have a
  worker event bridge so those functions can stay in the Gateway process instead
  of crossing the worker boundary. Both late harness attempts and higher-level
  `runEmbeddedPiAgent` params now build a single worker-launch request that
  bundles the prepared run, parent event sink, abort signal, and permission
  profile. `runEmbeddedPiAgent` now has a guarded high-level launch point before
  queueing: unset mode defaults to `auto`, explicit `inline` keeps production
  inline, `auto` uses the worker when the run is serializable and falls back
  inline when parent-only blockers remain, and forced `worker` mode dispatches
  through the high-level PI worker backend or fails closed. Worker dispatch runs
  under the existing parent session/global queue envelope. Parent-owned
  reply operations attach a parent backend handle while the worker runs, so
  cancellation, streaming-state checks, and steering messages stay in the
  Gateway process while the live reply-operation object itself is not sent to
  the worker. The worker entry also installs a child-owned abort signal in the
  runtime context and aborts it when parent control sends a cancel message, so
  rehydrated PI run params receive a real local signal instead of an undefined
  placeholder. The PI worker runner is covered by an actual worker-thread smoke
  test that exercises the launch request, event bridge, and embedded result
  extraction together. Default production PI runs now prefer workers for
  serializable turns and keep the inline fallback for blocked turns while live
  parity coverage expands.
- Worker permission profile construction exists as a disabled-by-default
  Node-permission seatbelt helper. It grants runtime and SQLite state access,
  grants workspace access only for disk-backed filesystem modes, and does not
  allow nested workers, child processes, native addons, or WASI unless explicitly
  requested. High-level PI worker launches keep permissions off by default for
  disk-backed modes, but `OPENCLAW_AGENT_WORKER_FILESYSTEM_MODE=vfs-only`
  defaults the worker permission mode to `enforce` unless
  `OPENCLAW_AGENT_WORKER_PERMISSION_MODE=off|audit|enforce` overrides it.
- `OPENCLAW_AGENT_WORKER_MODE=inline|auto|worker` controls the worker launch
  path. The default is `auto`, which runs serializable high-level PI turns in a
  worker and falls back inline for blocked turns; explicit `inline` preserves
  the legacy path; forced worker mode fails closed until the high-level PI run
  params are serializable and all live parent-owned callbacks are either
  stripped or bridged.
- Common transcript, model registry, and agent-core types have OpenClaw-owned
  facades. `@mariozechner/pi-coding-agent` package-root imports now route
  through `src/agents/pi-coding-agent-contract.ts` outside test mocks and module
  augmentation. `@mariozechner/pi-agent-core` imports now route through
  `src/agents/agent-core-contract.ts` and the public
  `openclaw/plugin-sdk/agent-core` type facade outside module augmentation.
  The agent-core facade now also carries the small runtime values still needed
  by compatibility tests, such as `Agent` and `runAgentLoop`, so those tests no
  longer import the PI package directly. `@mariozechner/pi-ai` OpenAI response
  stream subpaths have narrow OpenClaw-owned facades for the remaining thinking
  contract coverage.
  `@mariozechner/pi-ai` package-root imports across core now route through
  `src/agents/pi-ai-contract.ts` outside test mocks; production OAuth and
  OpenAI completion conversion subpaths route through narrow OpenClaw facades.
  TUI imports route through `src/agents/pi-tui-contract.ts`, with
  `src/tui/pi-tui-contract.ts` left as a local compatibility re-export.
- Transcript header, entry, tree, parser, legacy migration, context
  builder, and session-manager structural types are now defined by OpenClaw's
  transcript contract. The parser, migration, and context builder runtime
  helpers have one OpenClaw-owned implementation under `src/agents/transcript`
  instead of duplicated facade/file-state logic. OpenClaw also owns a
  synchronous SQLite-backed transcript session manager that implements the live
  `SessionManager` shape over `TranscriptState`, including header creation,
  append persistence, tree, label, branch, session name, branch-summary,
  in-memory, create/open, list/listAll, and fork APIs. Live embedded runs,
  compaction, compatibility tests, and gateway checkpoint helpers now use that
  OpenClaw-owned manager instead of PI's concrete `SessionManager` value. CLI
  budget compaction reads transcript branches through the OpenClaw-owned
  transcript state instead of opening PI `SessionManager` for read-only
  branch extraction. The PI coding-agent facade no longer re-exports transcript
  parser, migration, context, version, entry, or `SessionManager` symbols; those
  now come from the OpenClaw transcript contract.
- Extension, session, tool-definition, and skill structural types are now
  defined by OpenClaw's agent extension contract. Context pruning, compaction
  hooks, embedded subscription, system-prompt assembly, skill formatting, and
  client/tool adapters no longer type against PI's coding-agent package for
  those shapes. The PI coding-agent facade is now limited to runtime values
  still provided by PI plus the `CreateAgentSessionOptions` compatibility type.
- Bundled provider plugin production code now imports provider AI helpers via
  OpenClaw-owned Plugin SDK facades (`openclaw/plugin-sdk/provider-ai` and
  `openclaw/plugin-sdk/provider-ai-oauth`) instead of importing PI packages
  directly.
- The core extension facade boundary test now prevents new direct PI package
  imports from production `src/**` files outside the OpenClaw-owned facade and
  module-augmentation files.
- Provider runtime contract, compaction hook, OAuth profile, BTW, CLI, gateway,
  media, trajectory, tool, token-estimation, and spawn workspace tests now mock
  or type against OpenClaw facades instead of PI packages directly. The facade
  boundary test now scans core PI package-name strings so new direct test mocks
  fail unless they live in a facade, module augmentation, package-graph test, or
  explicit PI compatibility test.
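
The worker launch modes described in the status list above can be read as a small decision function. The following is a minimal sketch under stated assumptions, not the OpenClaw implementation: `parseWorkerMode`, `resolveWorkerLaunch`, `LaunchDecision`, and the `blockers` list are hypothetical names. Only the `inline|auto|worker` values, the `auto` default, and the fail-closed behavior of forced `worker` mode come from this document; treating unknown values as `auto` is an assumption.

```typescript
// Hypothetical sketch of the inline/auto/worker launch decision.
type WorkerMode = "inline" | "auto" | "worker";

type LaunchDecision =
  | { kind: "inline"; reason: string }
  | { kind: "worker" }
  | { kind: "fail"; reason: string };

// Parse a raw OPENCLAW_AGENT_WORKER_MODE value. Unset values default to
// "auto"; mapping unknown values to "auto" as well is an assumption.
function parseWorkerMode(raw: string | undefined): WorkerMode {
  return raw === "inline" || raw === "worker" ? raw : "auto";
}

// `blockers` names parent-only values (callbacks, live handles) that could
// not be stripped or bridged, which makes the run non-serializable.
function resolveWorkerLaunch(mode: WorkerMode, blockers: string[]): LaunchDecision {
  if (mode === "inline") {
    return { kind: "inline", reason: "explicit inline mode" };
  }
  if (blockers.length === 0) {
    return { kind: "worker" };
  }
  if (mode === "worker") {
    // Forced worker mode fails closed instead of silently running inline.
    return { kind: "fail", reason: `not serializable: ${blockers.join(", ")}` };
  }
  return { kind: "inline", reason: `auto fallback: ${blockers.join(", ")}` };
}
```

Returning a tagged decision instead of a boolean keeps the fallback reason available for diagnostics, which may help while live parity coverage expands.
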

## Target Shape

Use three explicit layers:

```text
agent runtime boundary      OpenClaw-owned interface, PI as one backend
agent state database        SQLite primary store, doctor-only legacy JSON import
agent filesystem boundary   VFS scratch plus host capability filesystem
```

Workers sit around the runtime boundary:

```text
Gateway process
  owns config, channels, HTTP, routing, state DB, policy

Agent worker
  owns one turn or one runtime session lane
  receives a prepared run request
  emits lifecycle, stream, tool, usage, and final events
```

Node permission flags may be useful as defense in depth, but they are not the
security boundary. Node's permission model is process launch policy, not a
rooted filesystem capability API, and it has documented limitations around
workers, symlinks, existing file descriptors, native modules, and loadable
extensions.

## Non Goals

- Do not replace `fs-safe` or pinned filesystem helpers with Node permissions.
- Do not make VFS the only model for workspace edits.
- Do not migrate all agent execution to Platformatic, Regina, or another
  external orchestrator.
- Do not remove Python helper paths until an equally safe portable replacement
  exists.
- Do not hide config and credentials in SQLite before export, doctor, backup,
  and manual repair flows are strong.

## Workstream 0: Remove Duplicate Ownership

Treat duplicated code as a symptom of unclear ownership. The first refactor
should not move bytes between files; it should decide which layer owns each
operation.

Consolidate these repeated patterns behind shared primitives:

- JSON read, write, atomic replace, backup, import, and export helpers.
- Session index lookup, locking, cleanup, and diagnostics.
- Transcript event append, replay, compaction, and support bundle export.
- PI message, tool result, and provider adapter shapes.
- Tool scratch file creation, artifact staging, and cleanup.

Target primitives:

```text
StateStore            durable Gateway and agent state
SessionStoreBackend   session index and metadata ownership
TranscriptStore       append-only event history plus export
AgentRuntimeBackend   PI or future runtime implementation
AgentFilesystem       host capability filesystem plus VFS scratch
RunEventBus           serializable worker to parent event stream
```

Measure progress by deleting repeated helper code, not by adding wrappers. Each
phase should name the old code path it replaces and keep at most one adapter for
compatibility.
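
To make one of the target primitives concrete, here is a minimal sketch of what a `TranscriptStore` with "append-only event history plus export" could look like. It is an illustration under assumptions, not the OpenClaw API: `InMemoryTranscriptStore` and its method names are hypothetical, and a `Map` stands in for the SQLite-backed `transcript_events` table so the sketch runs without a database.

```typescript
// Hypothetical in-memory stand-in for a TranscriptStore.
type TranscriptEvent = { seq: number; event: unknown };

class InMemoryTranscriptStore {
  private readonly sessions = new Map<string, TranscriptEvent[]>();

  private scopeKey(agentId: string, sessionId: string): string {
    // JSON-encode the pair so ids that contain separators cannot collide.
    return JSON.stringify([agentId, sessionId]);
  }

  // Appends require agent/session scope and return the assigned sequence number.
  append(agentId: string, sessionId: string, event: unknown): number {
    const key = this.scopeKey(agentId, sessionId);
    const events = this.sessions.get(key) ?? [];
    const seq = events.length + 1;
    events.push({ seq, event });
    this.sessions.set(key, events);
    return seq;
  }

  // Return up to `limit` of the most recent events, in append order.
  tail(agentId: string, sessionId: string, limit: number): TranscriptEvent[] {
    const events = this.sessions.get(this.scopeKey(agentId, sessionId)) ?? [];
    return limit <= 0 ? [] : events.slice(-limit);
  }

  // JSONL is an export shape only: one JSON document per line.
  exportJsonl(agentId: string, sessionId: string): string {
    const events = this.sessions.get(this.scopeKey(agentId, sessionId)) ?? [];
    return events.map((entry) => JSON.stringify(entry.event)).join("\n");
  }
}
```

The point of the sketch is the ownership shape: one append path, one scoped read path, and one export path, so callers do not each carry their own JSONL read/write code.
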

## Workstream 1: Own The PI Boundary

Start by shrinking direct PI imports, not by forking PI.

1. Add an OpenClaw-owned runtime facade above `src/agents/harness/*`.
2. Move PI imports into a small adapter package or directory.
3. Keep `agentRuntime.id: "pi"` stable and compatible.
4. Convert common OpenClaw code to use OpenClaw types instead of PI types.
5. Internalize PI functionality in this order:
   - Tool result and message types.
   - Tool adapter and tool loop contracts.
   - Session manager and transcript mutation.
   - Model registry and provider abstractions.
   - TUI pieces, only if still needed after Control UI and CLI paths settle.

Early success means most files outside the adapter no longer import
`@mariozechner/pi-*` directly.
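
That success criterion can be checked mechanically. The sketch below shows one possible shape for such a check, assuming source files are already loaded into a map of path to text. `findDirectPiReferences` and the allowlist parameter are hypothetical names, and the regular expression is a simplification that looks for quoted PI package names rather than parsing imports.

```typescript
// Hypothetical boundary check: report files that reference PI packages
// directly instead of going through an OpenClaw-owned facade.
type PiReference = { file: string; pkg: string };

function findDirectPiReferences(
  files: Record<string, string>,
  allowedFiles: ReadonlySet<string>,
): PiReference[] {
  const violations: PiReference[] = [];
  for (const file of Object.keys(files).sort()) {
    if (allowedFiles.has(file)) continue; // facade and augmentation files
    const source = files[file] ?? "";
    // Matches quoted package names, including subpath imports.
    const pattern = /["'](@mariozechner\/pi-[a-z-]+)(?:\/[^"']*)?["']/g;
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(source)) !== null) {
      violations.push({ file, pkg: match[1] ?? "" });
    }
  }
  return violations;
}
```

Because it scans quoted strings, this kind of check also flags direct test mocks and dynamic imports, not only static `import` statements.
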

## Workstream 2: Consolidate State In SQLite

OpenClaw already has a shared SQLite state layer. Task, Task Flow, and plugin
state runtime writes use `~/.openclaw/state/openclaw.sqlite`; legacy sidecars
are doctor-import inputs. That layer already uses:

- `node:sqlite`
- WAL mode
- `synchronous = NORMAL`
- `busy_timeout`
- `0o700` directory mode
- `0o600` database and sidecar mode
- explicit close paths for tests and Windows cleanup

Create one shared state layer for agent and gateway state. Suggested path:
`~/.openclaw/state/openclaw.sqlite`.

Suggested tables:

```text
schema_migrations(version, applied_at)
kv(scope, key, value_json, updated_at)
agents(agent_id, config_json, created_at, updated_at)
session_entries(agent_id, session_key, entry_json, updated_at)
transcript_events(agent_id, session_id, seq, event_json, created_at)
transcript_files(agent_id, session_id, path, imported_at, exported_at)
vfs_entries(agent_id, namespace, path, kind, content_blob, metadata_json, updated_at)
tool_artifacts(agent_id, run_id, artifact_id, kind, metadata_json, blob, created_at)
```
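
As an illustration of how the `kv(scope, key, value_json, updated_at)` row shape supports a small typed helper, here is a minimal sketch. It is not the OpenClaw helper: `ScopedKv`, `KvRow`, and `JsonValue` are hypothetical names, and a `Map` stands in for the SQLite table so the example runs without a database.

```typescript
// Hypothetical typed helper over rows shaped like kv(scope, key, value_json, updated_at).
type JsonValue = string | number | boolean | null | JsonValue[] | { [key: string]: JsonValue };

type KvRow = { scope: string; key: string; valueJson: string; updatedAt: number };

class ScopedKv<T extends JsonValue> {
  private readonly rows: Map<string, KvRow>;
  private readonly scope: string;
  private readonly now: () => number;

  constructor(rows: Map<string, KvRow>, scope: string, now: () => number = () => Date.now()) {
    this.rows = rows;
    this.scope = scope;
    this.now = now;
  }

  // (scope, key) acts as the primary key of the stand-in table.
  private rowId(key: string): string {
    return JSON.stringify([this.scope, key]);
  }

  read(key: string): T | undefined {
    const row = this.rows.get(this.rowId(key));
    return row ? (JSON.parse(row.valueJson) as T) : undefined;
  }

  write(key: string, value: T): void {
    const row: KvRow = { scope: this.scope, key, valueJson: JSON.stringify(value), updatedAt: this.now() };
    this.rows.set(this.rowId(key), row);
  }

  delete(key: string): boolean {
    return this.rows.delete(this.rowId(key));
  }

  // List keys in this scope only, sorted for stable output.
  list(): string[] {
    const keys: string[] = [];
    this.rows.forEach((row) => {
      if (row.scope === this.scope) keys.push(row.key);
    });
    return keys.sort();
  }
}
```

Scoping by a fixed `scope` string per feature keeps sidecar-style records (for example TUI restore pointers and auth routing state) from colliding on key names while still sharing one table and one connection.
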

Migration order:

1. Add shared SQLite connection and migration helpers. Done.
2. Move task registry, Task Flow, and plugin state into shared SQLite. Runtime
   writes are done; legacy sidecar import remains in doctor.
3. Move `sessions.json` behind a `SessionStoreBackend` interface. Done for
   canonical per-agent stores.
4. Make SQLite primary for session entries. Done for canonical per-agent
   stores.
5. Import old `sessions.json` only from `openclaw doctor --fix`, then remove the
   JSON index after SQLite has the rows. Done for session indexes.
6. Import old `*.jsonl` transcripts only from `openclaw doctor --fix`, then
   remove the JSONL source after SQLite has the events. Done for canonical
   transcript files.
7. Keep JSONL export as explicit debug/support output only.
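
Steps 5 and 6 share one ordering rule: write the rows first, and remove the legacy source only after the import succeeded. The sketch below shows that rule in isolation. It is an assumption-laden illustration, not the doctor implementation: `importLegacySessions`, `LegacySource`, and `SessionRows` are hypothetical, and the `sessions.json` shape (an object keyed by session key with an optional numeric `updatedAt`) is assumed for the example.

```typescript
// Hypothetical doctor-style import: rows go into the store first, and the
// legacy source is removed only after every entry was written.
type SessionEntry = { sessionKey: string; updatedAt: number };

interface LegacySource {
  read(): string | undefined; // raw legacy JSON text, if the file exists
  remove(): void;
}

interface SessionRows {
  upsert(agentId: string, entry: SessionEntry): void;
}

type ImportResult = { imported: number; removedSource: boolean; error?: string };

function readUpdatedAt(value: unknown): number {
  if (typeof value === "object" && value !== null) {
    const candidate = (value as { updatedAt?: unknown }).updatedAt;
    if (typeof candidate === "number") return candidate;
  }
  return 0;
}

function importLegacySessions(agentId: string, source: LegacySource, rows: SessionRows): ImportResult {
  const raw = source.read();
  if (raw === undefined) return { imported: 0, removedSource: false };

  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    // Leave the file in place so an operator can inspect or repair it.
    return { imported: 0, removedSource: false, error: "invalid JSON" };
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return { imported: 0, removedSource: false, error: "unexpected shape" };
  }

  const index = parsed as Record<string, unknown>;
  const keys = Object.keys(index);
  for (const sessionKey of keys) {
    // If upsert throws, the exception propagates and the source is kept.
    rows.upsert(agentId, { sessionKey, updatedAt: readUpdatedAt(index[sessionKey]) });
  }
  source.remove();
  return { imported: keys.length, removedSource: true };
}
```

A real import would presumably wrap the upserts in a single SQLite transaction so a partial import never becomes visible; the sketch leaves that out to stay database-free.
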

Keep `openclaw.json` and `auth-profiles.json` file-backed until operator
repair, secret audit, and backup flows can handle the SQLite layout naturally.

## Workstream 3: Add VFS Scratch Storage

The filesystem model should distinguish scratch state from real host files.

```text
VirtualAgentFs
  SQLite-backed scratch filesystem
  used for temporary tool files, generated artifacts, staging, diagnostics

HostCapabilityFs
  real host filesystem access
  backed by fs-safe or pinned helpers
  used for workspace edits, media imports, archive extraction, user files
```

Agent tools should receive capability objects, not raw path strings where
possible:

```ts
type AgentFilesystem = {
  scratch: VirtualAgentFs;
  workspace?: HostCapabilityFs;
};
```
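
One property the `scratch` side needs before anything is stored is a canonical path form, so that two spellings of the same path map to the same entry and nothing resolves outside the scratch root. The following is a minimal sketch of such a normalizer. `normalizeVfsPath` is a hypothetical name, and the specific rules (POSIX-style output, backslashes treated as separators, `..` above the root rejected) are assumptions for illustration.

```typescript
// Hypothetical normalizer for scratch VFS paths: output is POSIX-style,
// rooted at "/", and cannot escape the scratch root through "..".
function normalizeVfsPath(input: string): string {
  if (input.includes("\0")) {
    throw new Error("VFS path must not contain NUL bytes");
  }
  const segments: string[] = [];
  // Treating backslashes as separators is an assumption of this sketch.
  for (const part of input.replace(/\\/g, "/").split("/")) {
    if (part === "" || part === ".") continue;
    if (part === "..") {
      if (segments.length === 0) {
        throw new Error(`VFS path escapes scratch root: ${input}`);
      }
      segments.pop();
      continue;
    }
    segments.push(part);
  }
  return "/" + segments.join("/");
}
```

Rejecting an escaping path, instead of clamping it to the root, makes a confused caller fail loudly rather than read or write an unintended scratch entry.
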

Default policy:

- `read`, `write`, `edit`, and `apply_patch` continue to operate on the real
  workspace unless the run is explicitly VFS-only.
- Scratch artifacts use VFS by default.
- Shell commands run on disk when host workspace or sandbox access is granted.
- In VFS-only mode, foreground `exec` may run against an explicit projected
  temporary disk view and sync the result back into VFS. `process` stays
  disk/sandbox-only until background sessions have a VFS-aware lifecycle.

Runtime filesystem modes:

| Mode          | Workspace writes                         | Scratch writes | Shell working directory                   | Primary use                                |
| ------------- | ---------------------------------------- | -------------- | ----------------------------------------- | ------------------------------------------ |
| `disk`        | Host capability FS                       | SQLite VFS     | Real workspace or sandbox root            | Current default with safer scratch storage |
| `vfs-scratch` | Host capability FS                       | SQLite VFS     | Real workspace or sandbox root            | Default target after VFS lands             |
| `vfs-only`    | SQLite VFS unless host grant is explicit | SQLite VFS     | Projected temporary disk view or no shell | Isolated agents, previews, replay, tests   |

The parent process chooses the mode before worker launch and records it in the
run policy. Workers should not be able to upgrade themselves from VFS-only to
host filesystem access.
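
The mode table can be expressed as a function the parent calls once before launch. This is a minimal sketch, not the OpenClaw implementation: `resolveFilesystemCapabilities` and `FilesystemCapabilities` are hypothetical names, and the explicit host grant exception for `vfs-only` from the table is deliberately left out to keep the example small.

```typescript
type FilesystemMode = "disk" | "vfs-scratch" | "vfs-only";

// Hypothetical, data-only description of what a run may touch.
type FilesystemCapabilities = {
  mode: FilesystemMode;
  scratch: "sqlite-vfs";
  workspace: "host" | "none";
  shell: "workspace" | "projected-temp" | "none";
};

// The parent resolves capabilities once, before worker launch. The result is
// frozen so code holding it cannot widen access by mutating the object.
function resolveFilesystemCapabilities(
  mode: FilesystemMode,
  childProcessAvailable: boolean = false,
): Readonly<FilesystemCapabilities> {
  const caps: FilesystemCapabilities =
    mode === "vfs-only"
      ? {
          mode,
          scratch: "sqlite-vfs",
          workspace: "none",
          // Foreground exec needs a projected temporary disk view.
          shell: childProcessAvailable ? "projected-temp" : "none",
        }
      : { mode, scratch: "sqlite-vfs", workspace: "host", shell: "workspace" };
  return Object.freeze(caps);
}
```

Freezing the object only guards against accidental mutation. The actual guarantee that a worker cannot upgrade itself has to come from the parent never handing a host capability to a VFS-only worker in the first place.
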

Good first candidates for VFS:

- Tool temporary files.
- Model diagnostic payloads. Runtime trajectory capture now has a SQLite
  artifact mirror.
- Generated artifact staging. Tool media result manifests now land in SQLite;
  binary delivery files remain on disk until channel delivery supports
  claim-check reads from VFS/artifacts.
- Memory upload batches.
- QA and scenario summaries.
- Plugin scratch state that does not need operator editing.

Poor first candidates:

- User workspaces.
- Git repositories.
- Media files users expect to find on disk.
- Config and credentials.
- Any integration whose dependency requires real paths.

## Workstream 4: Run Agents In Workers

Workerization should improve isolation and parallelism without moving Gateway
ownership into workers.

Initial architecture:

1. Parent Gateway builds a `PreparedAgentRun`.
2. Parent records session routing and policy in SQLite.
3. Parent starts or leases an agent worker.
4. Worker runs the selected harness attempt.
5. Worker streams events back to parent.
6. Parent persists state, delivers channel replies, and enforces lifecycle.

Worker payloads must be serializable. Do not pass live DB handles, plugin API
objects, process handles, or mutable config references into workers.
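
The serializability rule above can be enforced when the parent reduces run params into a snapshot. The sketch below keeps only plain data and reports what it dropped, so the caller can bridge those fields or fall back to inline execution. It is an illustration under assumptions: `snapshotRunParams` and `isPlainData` are hypothetical names, and the check is intentionally stricter than the structured clone algorithm (for example it rejects `Date`, `Map`, and any object referenced twice).

```typescript
// Hypothetical reducer from live run params to a data-only snapshot.
type RunParamSnapshot = { data: Record<string, unknown>; dropped: string[] };

function isPlainData(value: unknown, seen: Set<object> = new Set()): boolean {
  if (value === null) return true;
  const kind = typeof value;
  if (kind === "string" || kind === "number" || kind === "boolean") return true;
  if (kind !== "object") return false; // functions, symbols, undefined, bigint
  const obj = value as object;
  // Conservative: cycles and repeated references are both rejected.
  if (seen.has(obj)) return false;
  seen.add(obj);
  if (Array.isArray(obj)) return obj.every((item) => isPlainData(item, seen));
  const proto = Object.getPrototypeOf(obj);
  if (proto !== Object.prototype && proto !== null) return false; // class instances, handles
  const record = obj as Record<string, unknown>;
  return Object.keys(record).every((key) => isPlainData(record[key], seen));
}

function snapshotRunParams(params: Record<string, unknown>): RunParamSnapshot {
  const data: Record<string, unknown> = {};
  const dropped: string[] = [];
  for (const key of Object.keys(params)) {
    const value = params[key];
    if (value === undefined) continue; // absent, nothing to carry or report
    if (isPlainData(value)) {
      data[key] = value;
    } else {
      dropped.push(key);
    }
  }
  return { data, dropped: dropped.sort() };
}
```

Reporting `dropped` keys, rather than silently omitting them, gives the launch path a concrete list of parent-only blockers to bridge or to use as a reason for running inline.
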

Start with one worker per active agent run. Later, add a pool keyed by:

- runtime id
- agent id
- model provider
- workspace or sandbox root
- permission profile
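
A pool keyed this way only reuses a worker when every listed dimension matches. The following minimal sketch shows one way to derive such a key and lease idle workers by it. `WorkerPoolKey`, `workerPoolKey`, and `IdleWorkerPool` are hypothetical names, and the generic `W` stands in for whatever worker handle the runner uses.

```typescript
// Hypothetical pool key built from the dimensions listed above.
type WorkerPoolKey = {
  runtimeId: string;
  agentId: string;
  modelProvider: string;
  root: string; // workspace or sandbox root
  permissionProfile: string;
};

function workerPoolKey(key: WorkerPoolKey): string {
  // A JSON array keeps field boundaries unambiguous, unlike joining with ":".
  return JSON.stringify([key.runtimeId, key.agentId, key.modelProvider, key.root, key.permissionProfile]);
}

class IdleWorkerPool<W> {
  private readonly idle = new Map<string, W[]>();

  // Reuse an idle worker with an identical key, otherwise create a new one.
  lease(key: WorkerPoolKey, create: () => W): W {
    const bucket = this.idle.get(workerPoolKey(key));
    const reused = bucket?.pop();
    return reused !== undefined ? reused : create();
  }

  // Return a worker to the pool once its run has finished.
  release(key: WorkerPoolKey, worker: W): void {
    const id = workerPoolKey(key);
    const bucket = this.idle.get(id) ?? [];
    bucket.push(worker);
    this.idle.set(id, bucket);
  }
}
```

Including the permission profile in the key matters because a worker launched with broader permissions should never be leased to a run that was prepared with a narrower profile.
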

Use worker threads first for lower overhead. Add process mode when the run needs
stronger isolation, different Node permission flags, native module separation,
or cleaner crash containment.

## Node Permissions Policy

Use Node permissions only as a seatbelt:

- grant read access to code and required runtime files
- grant read/write to the agent workspace or sandbox root when needed
- grant worker creation only in trusted parent code
- avoid exposing worker creation to model-controlled tools
- keep subprocess and native addon permissions disabled unless the runtime
  profile needs them

Do not treat Node permissions as a substitute for `HostCapabilityFs`.
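
For process mode, the seatbelt rules above map onto Node's permission-model CLI flags. The sketch below builds a flag list from a profile. It is an assumption-laden illustration: `PermissionProfile` and `buildPermissionFlags` are hypothetical names, not the OpenClaw helper, and the flag spellings follow recent Node releases (older releases use `--experimental-permission` instead of `--permission`), so check the documentation for the Node version in use.

```typescript
// Hypothetical profile describing what a launched Node process may do.
type PermissionProfile = {
  readPaths: string[]; // code and required runtime files
  readWritePaths: string[]; // agent workspace or sandbox root, when needed
  allowWorker?: boolean;
  allowChildProcess?: boolean;
  allowAddons?: boolean;
  allowWasi?: boolean;
};

// Build Node CLI flags for a process launch. Everything not granted here
// stays denied once the permission model is enabled.
function buildPermissionFlags(profile: PermissionProfile): string[] {
  const flags: string[] = ["--permission"];
  for (const path of profile.readPaths.concat(profile.readWritePaths)) {
    flags.push(`--allow-fs-read=${path}`);
  }
  for (const path of profile.readWritePaths) {
    flags.push(`--allow-fs-write=${path}`);
  }
  if (profile.allowWorker) flags.push("--allow-worker");
  if (profile.allowChildProcess) flags.push("--allow-child-process");
  if (profile.allowAddons) flags.push("--allow-addons");
  if (profile.allowWasi) flags.push("--allow-wasi");
  return flags;
}
```

A VFS-only profile would pass an empty `readWritePaths` list apart from the SQLite state location, which is how "VFS-only path denial" can be expressed at the flag level. The flags remain defense in depth only.
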

## Dependency Policy

Before adding `@platformatic/vfs`, Platformatic Runtime, `@cocalc/openat2`, or
similar dependencies:

1. Prototype behind a feature flag.
2. Measure install size and native surface.
3. Check package health, license, and release cadence.
4. Keep dependency ownership local to the feature owner.
5. Avoid root dependencies unless core imports the package at runtime.

Likely choices:

- SQLite VFS can start as an OpenClaw-owned minimal implementation.
- `@platformatic/vfs` can be evaluated as an adapter, not adopted as the core
  contract immediately.
- `@cocalc/openat2` can be an optional Linux fast path inside `fs-safe`, not the
  portable baseline.

## Test Plan

Add tests before each migration step:

- Duplicate adapter deletion checks for PI imports, JSON state helpers, and
  filesystem scratch helpers.
- Session store JSON import to SQLite.
- SQLite to JSON export for support bundles.
- Scoped JSON-compatible KV helper read, list, write, and delete behavior.
- Concurrent session entry updates from multiple workers.
- WAL recovery after simulated crash.
- Transcript JSONL compatibility while PI still owns transcripts.
- VFS path normalization, read, write, rename, remove, and directory listing.
- VFS projection to temporary disk and sync-back of command-side creates,
  edits, deletes, and nested workdirs.
- Host filesystem traversal, symlink, hardlink, rename, copy, remove, and
  time-of-check to time-of-use races.
- Worker lifecycle, cancellation, stream event ordering, and crash recovery.
- Worker prepared-run timeout enforcement, abort handling, and parent event
  flush ordering.
- Worker parent callback bridge for streaming replies, tool output, generic
  agent events, aborts, and reply refs.
- High-level run-param snapshot and worker rehydration for preserving
  serializable channel/tool/prompt policy across the worker boundary.
- Parent-side PI worker runner that preserves `EmbeddedPiRunResult` instead of
  collapsing worker completion to plain text.
- Run-level worker dispatch that preserves parent queue ordering and parent
  reply-operation cancellation, streaming state, and steering messages without
  cloning the live operation into the worker.
- Worker-entry cancellation signal rehydration from parent control messages.
- Worker permission profile construction, including VFS-only path denial.
- Disk, VFS scratch, and VFS-only filesystem mode behavior.
- Plugin state and task registry coexistence with the shared state DB.
- Managed outgoing media record import from legacy JSON, legacy file removal
  after import, plus SQLite-primary serving without JSON exports.
- Subagent run registry import from legacy `subagents/runs.json` during doctor,
  legacy file removal after import, and restore from SQLite without JSON
  exports.
- Sandbox container and browser registry reads from SQLite, while legacy
  monolithic registry migration stays an explicit doctor repair operation.
- OpenRouter model capability cache reads from SQLite, with old cache JSON
  imported and removed only by doctor.
- TUI last-session restore pointers read from SQLite without JSON exports,
  import legacy JSON only through doctor, and clear stale pointers from SQLite.
- Auth profile runtime state reads from SQLite, imports legacy JSON only through
  doctor, and deletes SQLite state when runtime state is empty.

## Rollout Plan

Phase 0: inventory and contracts

- Count direct PI imports by package.
- Count duplicate JSON, transcript, and scratch helper implementations.
- Inventory JSON and JSONL state files.
- Define `AgentRuntimeBackend`, `SessionStoreBackend`, and `AgentFilesystem`.
- Document host path versus VFS-only operations.

Phase 1: SQLite session index

- Add shared state DB helper.
- Add a doctor migration that imports `sessions.json` into SQLite and removes
  the JSON index.
- Move canonical session entries to SQLite by default.
- Prove current session list, patch, reset, cleanup, and UI flows.
- Remove load-time/startup session JSON migration, write-time pruning, and
  migration-era maintenance options from the runtime store path.
- Remove the duplicate status-only session JSON reader and stop requiring a
  physical `sessions.json` file for discovered SQLite-backed agent stores.
- Remove the legacy JSON session-store cache layer.
- Remove the dedicated cron timer session reaper and `cron.sessionRetention`
  config; explicit session cleanup owns row pruning.

Phase 2: VFS scratch

- Add SQLite-backed VFS for scratch artifacts.
- Move low-risk scratch files first.
- Keep real workspace tools on host capability FS.
- Add support bundle export for VFS contents.

Phase 3: PI adapter shrink

- Centralize PI imports.
- Replace PI-exposed types across core with OpenClaw-owned types.
- Keep PI as the implementation of the default harness.

Phase 4: workerized runs

- Run one PI harness attempt inside a worker behind a feature flag.
- Stream events back through the parent.
- Keep parent-owned session and delivery writes authoritative.
- Add cancellation and crash recovery.

Phase 5: transcript ownership

- Move transcript mutation behind OpenClaw APIs.
- Store transcript events in SQLite.
- Import legacy JSONL through doctor only; export JSONL for debugging/support.
- Remove direct PI `SessionManager` usage from non-adapter code.
- Remove file-backed compaction checkpoint copies and the session-manager
  cache/prewarm layer.
- Move Codex app-server binding state from per-session JSON sidecars to the
  shared SQLite `kv` table.

Phase 6: internalize or replace PI pieces

- Internalize the pieces that still force root PI dependencies.
- Keep public runtime behavior and docs stable.
- Remove PI packages only when all runtime, TUI, provider, and transcript users
  have migrated.

## Open Questions

- Which current JSON files must remain human-editable long term?
- Should a VFS-only agent be a separate runtime profile or a per-run filesystem
  mode?
- Should shell commands ever run directly against VFS, or only against projected
  temporary disk views?
- How much transcript history should stay queryable through SQL versus exported
  support bundles?
- What is the minimum useful worker boundary: per turn, per session, or per
  agent?
- Which plugin SDK APIs should expose filesystem capabilities first?

## Done Criteria

This refactor is successful when:

- Core code no longer imports PI packages outside the runtime adapter.
- Repeated JSON, transcript, PI adapter, and scratch filesystem logic has one
  owner each.
- `sessions.json` is a doctor-migrated legacy input, not a compatibility store.
- Scratch state and tool artifacts can live in SQLite-backed VFS.
- Agents can run in disk, VFS scratch, and VFS-only filesystem modes.
- Real workspace writes still use capability-safe host filesystem operations.
- Agent turns can run in workers with preserved streaming, cancellation,
  compaction, tool hooks, and channel delivery.
- Existing users can upgrade without losing sessions, config, credentials, or
  workspaces.