moltbot/docs/refactor/piless.md
2026-05-08 10:29:07 +01:00

summary: Plan for reducing OpenClaw's dependency on external PI packages while moving agent state toward SQLite, VFS scratch storage, and worker isolation
title: Refactoring
read_when:
  • Planning work to internalize PI runtime pieces
  • Moving session, transcript, or agent scratch state from JSON files to SQLite
  • Designing agent filesystem boundaries or VFS-backed scratch storage
  • Evaluating Node workers for agent runtime isolation or parallelism

This is a planning document for issue openclaw/openclaw#78096.

The goal is not to delete PI in one large rewrite. The goal is to make OpenClaw own the runtime boundary, state model, filesystem capabilities, and parallel execution shape so PI can become an implementation detail and eventually be internalized or replaced in slices.

Current Shape

OpenClaw currently embeds PI directly. The main loop still imports @mariozechner/pi-coding-agent, @mariozechner/pi-agent-core, @mariozechner/pi-ai, and @mariozechner/pi-tui across agent runtime, tool, provider, transcript, and TUI paths. See PI integration architecture.

Before this refactor, session and runtime state was split across several persistence mechanisms:

  • Gateway session index: sessions.json
  • Session transcripts: *.jsonl
  • Auth profiles: auth-profiles.json
  • Config: openclaw.json
  • Task registry: SQLite
  • Plugin state: SQLite
  • Memory indexes: SQLite or QMD-owned SQLite
  • Plugin-specific JSON and JSONL sidecars

That mix was workable, but it created duplicated read, write, migration, locking, maintenance, and diagnostics code. The branch now moves canonical runtime state into the shared SQLite database and treats old JSON files as doctor migration inputs, not runtime compatibility stores.

Current Implementation Status

This plan has started landing in slices:

  • Shared state database exists at ~/.openclaw/state/openclaw.sqlite with WAL, shared schema migration, session, transcript, VFS, and artifact tables. The shared kv table now has a small typed helper for scoped JSON-compatible values so low-risk JSON sidecars can move behind the same SQLite connection without each feature reimplementing read/write/delete glue.
  • Canonical per-agent session stores use SQLite by default. The openclaw doctor fix mode imports legacy sessions.json indexes into SQLite and removes the JSON index after import, instead of keeping a startup migration or parallel compatibility/export store. Runtime session reads and writes normalize and persist only: no JSON import, row pruning, capping, archive cleanup, or disk-budget cleanup runs on the hot path. The old maintenance write options and explicit session cleanup command have been removed from the session-store API; doctor owns legacy import. Status and discovery now use the primary session-store loader instead of a duplicated read-only JSON parser, and SQLite-backed agent session directories remain discoverable after doctor deletes the legacy sessions.json file. The legacy JSON session-store object/serialized cache is gone; JSON fallback reads now parse directly while canonical SQLite stores avoid that path. The cron timer no longer runs a dedicated session reaper.
  • Transcript events are SQLite-primary. OpenClaw-owned append paths require agent/session scope and write transcript_events directly; *.jsonl is no longer a runtime mirror for those paths. JSONL is now an explicit import/export/debug boundary shape only. The OpenClaw transcript session manager, Gateway-injected assistant messages, CLI transcript persistence, Codex app-server mirroring, compaction successor transcripts, manual compaction boundary rewrites, and reset/header creation all persist through SQLite. Scoped latest/tail assistant reads, delivery-mirror idempotency/latest-match checks, /export-session, before_reset hook payloads, silent rotation replay, chat/TUI history, restart/subagent recovery, managed media indexing, token estimation, title/preview/usage helpers, runtime transcript repair, bootstrap completion checks, and bounded inspection all use the scoped SQLite transcript. Legacy JSONL import is doctor/import/debug only: openclaw doctor --fix builds the transcript database from old files and removes the JSONL sources after successful import. Runtime paths do not import, prune, or repair JSONL files. Pre-compaction checkpoints are SQLite transcript snapshots, not .checkpoint.*.jsonl copies; branch/restore and checkpoint pruning now work against snapshot rows. The old PI session-manager cache/prewarm layer is gone.
  • AgentFilesystem and SqliteVirtualAgentFs exist for scratch storage, with disk, vfs-scratch, and vfs-only filesystem modes at the runtime boundary. VFS contents can be listed and exported for support bundles. When child-process execution is available, VFS-only exec projects scratch contents into a temporary disk view, runs foreground commands there, and syncs created, edited, and deleted files back into SQLite scratch storage. Worker-backed PI runs now receive the mode-aware AgentFilesystem through the rehydrated run params, and the PI attempt consumes the runtime-provided artifact store before falling back to the legacy inline SQLite constructor. When that runtime filesystem has no host workspace capability, read, write, edit, apply_patch, and foreground exec operate on the SQLite scratch VFS when allowed; process stays unavailable because background sessions still require a real process registry and follow-up polling path.
  • tool_artifacts has a SQLite store primitive for generated artifact staging, export, and per-run cleanup. Runtime trajectory capture now mirrors the bounded *.trajectory.jsonl sidecar into run-scoped SQLite artifacts while retaining the disk sidecar for compatibility. Tool execution now records media-result manifests for generated or captured tool media in the same run-scoped artifact store while keeping delivery files on disk.
  • Managed outgoing image attachment metadata now uses the shared SQLite kv store as the primary record path. Older per-attachment JSON files are imported and removed by openclaw doctor --fix; runtime media reads only SQLite.
  • Cron job definitions, runtime schedule state, and run history now use the shared SQLite state database. openclaw doctor --fix imports legacy jobs.json, jobs-state.json, and cron/runs/*.jsonl files into SQLite and removes those file sources after a successful import. Runtime cron paths no longer write job-definition, schedule-state, or run-history JSON files.
  • The subagent run registry now uses the shared SQLite kv store as the primary record path. openclaw doctor --fix imports legacy subagents/runs.json files into SQLite and removes them after import. Runtime paths no longer import or delete that JSON file.
  • Sandbox container and browser registries now use the shared SQLite sandbox_registry_entries table as the primary record path. Legacy monolithic and sharded registry JSON migrates only through openclaw doctor --fix; runtime reads and writes no longer touch registry JSON.
  • OpenRouter model capability cache now uses the shared SQLite kv store as the primary persistent cache. The older cache/openrouter-models.json file is imported and removed by openclaw doctor --fix, not by runtime cache reads.
  • Codex app-server thread bindings now use the shared SQLite kv store as the only runtime record path. The old per-session .codex-app-server.json sidecar reader/writer has been removed from runtime and tests now seed the binding store directly. openclaw doctor --fix imports old sidecars into SQLite and removes the JSON source.
  • TUI last-session restore pointers now use the shared SQLite kv store as the primary record path. The older tui/last-session.json file is imported and removed by openclaw doctor --fix; runtime TUI reads only SQLite.
  • Auth profile runtime routing state now uses the shared SQLite kv store as the primary record path. Older per-agent auth-state.json files are imported and removed by openclaw doctor --fix; auth-profiles.json still owns credentials and stays file-backed.
  • Device identity, local device auth tokens, bootstrap tokens, device/node pairing ledgers, channel pairing requests/allowlists, inferred commitment records, subagent run records, TUI restore pointers, auth routing state, OpenRouter model cache, web push subscriptions/VAPID keys, APNs registration state, and update-check state now use the shared SQLite kv store. openclaw doctor --fix imports the legacy identity/*.json, devices/*.json, nodes/*.json, credentials/*-pairing.json, credentials/*-allowFrom.json, commitments/commitments.json, subagents/runs.json, tui/last-session.json, per-agent auth-state.json, cache/openrouter-models.json, push/*.json, and update-check.json files into SQLite and removes those files after a successful import. Runtime paths no longer read or write those JSON ledgers.
  • AgentRuntimeBackend, PreparedAgentRun, and the Node worker runner exist for serializable prepared runs. RunEventBus owns serial parent event delivery for worker event streams. The worker runner enforces prepared-run timeouts, terminates on parent abort signals, and flushes async parent event handlers in worker message order before resolving the result. The worker entry constructs mode-aware filesystem capabilities: disk and vfs-scratch keep host workspace access, while vfs-only exposes only SQLite scratch/artifact storage. The harness layer can reduce a live attempt into a structured-cloneable PreparedAgentRun descriptor with prepared delivery policy decisions, and the same reducer now works at the higher-level runEmbeddedPiAgent params boundary before model/auth/registry setup creates live objects. That high-level reducer also keeps a sanitized serializable runParams snapshot so channel routing, sender metadata, images, prompt/tool policy, and other data-only fields can cross the worker boundary without cloning parent callbacks, abort refs, enqueue functions, or reply-operation handles. A worker-side rehydration helper turns that snapshot back into runEmbeddedPiAgent params and installs callback shims that emit worker events for the parent bridge. A PI worker backend module now exists as the runnable worker target for that rehydrated high-level path, and a parent-side runner can execute that backend through the generic worker runner while preserving the full embedded run result. Parent-owned streaming callbacks, reply refs, user-message persistence callbacks, and abort signals now have a worker event bridge so those functions can stay in the Gateway process instead of crossing the worker boundary. Both late harness attempts and higher-level runEmbeddedPiAgent params now build a single worker-launch request that bundles the prepared run, parent event sink, abort signal, and permission profile. 
    runEmbeddedPiAgent now has a guarded high-level launch point before queueing: unset mode defaults to auto, explicit inline keeps production inline, auto uses the worker when the run is serializable and falls back inline when parent-only blockers remain, and forced worker mode dispatches through the high-level PI worker backend or fails closed. Worker dispatch runs under the existing parent session/global queue envelope. Parent-owned reply operations attach a parent backend handle while the worker runs, so cancellation, streaming-state checks, and steering messages stay in the Gateway process while the live reply-operation object itself is not sent to the worker. The worker entry also installs a child-owned abort signal in the runtime context and aborts it when parent control sends a cancel message, so rehydrated PI run params receive a real local signal instead of an undefined placeholder. The PI worker runner is covered by an actual worker-thread smoke test that exercises the launch request, event bridge, and embedded result extraction together. Default production PI runs now prefer workers for serializable turns and keep the inline fallback for blocked turns while live parity coverage expands.
  • Worker permission profile construction exists as a disabled-by-default Node-permission seatbelt helper. It grants runtime and SQLite state access, grants workspace access only for disk-backed filesystem modes, and does not allow nested workers, child processes, native addons, or WASI unless explicitly requested. High-level PI worker launches keep permissions off by default for disk-backed modes, but OPENCLAW_AGENT_WORKER_FILESYSTEM_MODE=vfs-only defaults the worker permission mode to enforce unless OPENCLAW_AGENT_WORKER_PERMISSION_MODE=off|audit|enforce overrides it.
  • OPENCLAW_AGENT_WORKER_MODE=inline|auto|worker controls the worker launch path. The default is auto, which runs serializable high-level PI turns in a worker and falls back inline for blocked turns; explicit inline preserves the legacy path; forced worker mode fails closed unless the high-level PI run params are serializable and all live parent-owned callbacks are either stripped or bridged.
  • Common transcript, model registry, and agent-core types have OpenClaw-owned facades. @mariozechner/pi-coding-agent package-root imports now route through src/agents/pi-coding-agent-contract.ts outside test mocks and module augmentation. @mariozechner/pi-agent-core imports now route through src/agents/agent-core-contract.ts and the public openclaw/plugin-sdk/agent-core type facade outside module augmentation. The agent-core facade now also carries the small runtime values still needed by compatibility tests, such as Agent and runAgentLoop, so those tests no longer import the PI package directly. @mariozechner/pi-ai OpenAI response stream subpaths have narrow OpenClaw-owned facades for the remaining thinking contract coverage. @mariozechner/pi-ai package-root imports across core now route through src/agents/pi-ai-contract.ts outside test mocks; production OAuth and OpenAI completion conversion subpaths route through narrow OpenClaw facades. TUI imports route through src/agents/pi-tui-contract.ts, with src/tui/pi-tui-contract.ts left as a local compatibility re-export.
  • Transcript header, entry, tree, parser, legacy migration, context builder, and session-manager structural types are now defined by OpenClaw's transcript contract. The parser, migration, and context builder runtime helpers have one OpenClaw-owned implementation under src/agents/transcript instead of duplicated facade/file-state logic. OpenClaw also owns a synchronous SQLite-backed transcript session manager that implements the live SessionManager shape over TranscriptState, including header creation, append persistence, tree, label, branch, session name, branch-summary, in-memory, create/open, list/listAll, and fork APIs. Live embedded runs, compaction, compatibility tests, and gateway checkpoint helpers now use that OpenClaw-owned manager instead of PI's concrete SessionManager value. CLI budget compaction reads transcript branches through the OpenClaw-owned transcript state instead of opening PI SessionManager for read-only branch extraction. The PI coding-agent facade no longer re-exports transcript parser, migration, context, version, entry, or SessionManager symbols; those now come from the OpenClaw transcript contract.
  • Extension, session, tool-definition, and skill structural types are now defined by OpenClaw's agent extension contract. Context pruning, compaction hooks, embedded subscription, system-prompt assembly, skill formatting, and client/tool adapters no longer type against PI's coding-agent package for those shapes. The PI coding-agent facade is now limited to runtime values still provided by PI plus the CreateAgentSessionOptions compatibility type.
  • Bundled provider plugin production code now imports provider AI helpers via OpenClaw-owned Plugin SDK facades (openclaw/plugin-sdk/provider-ai and openclaw/plugin-sdk/provider-ai-oauth) instead of importing PI packages directly.
  • The core extension facade boundary test now prevents new direct PI package imports from production src/** files outside the OpenClaw-owned facade and module-augmentation files.
  • Provider runtime contract, compaction hook, OAuth profile, BTW, CLI, gateway, media, trajectory, tool, token-estimation, and spawn workspace tests now mock or type against OpenClaw facades instead of PI packages directly. The facade boundary test now scans core PI package-name strings so new direct test mocks fail unless they live in a facade, module augmentation, package-graph test, or explicit PI compatibility test.
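The launch-mode rules described above for OPENCLAW_AGENT_WORKER_MODE can be sketched as a pure decision function. This is an illustration under stated assumptions, not the OpenClaw implementation: resolveWorkerLaunch, parseWorkerMode, LaunchDecision, and the serializable flag are hypothetical names, and treating unrecognized values like an unset value is an assumption. Only the three mode values and the fallback rules come from the text.

```typescript
type WorkerMode = "inline" | "auto" | "worker";

type LaunchDecision =
  | { kind: "inline"; reason: string }
  | { kind: "worker" }
  | { kind: "fail-closed"; reason: string };

// Unset mode defaults to auto. Mapping unrecognized values to auto is an assumption.
function parseWorkerMode(raw: string | undefined): WorkerMode {
  return raw === "inline" || raw === "worker" ? raw : "auto";
}

// `serializable` stands in for "no parent-only blockers remain" (hypothetical flag).
function resolveWorkerLaunch(raw: string | undefined, serializable: boolean): LaunchDecision {
  const mode = parseWorkerMode(raw);
  if (mode === "inline") {
    return { kind: "inline", reason: "explicit inline mode" };
  }
  if (serializable) {
    return { kind: "worker" };
  }
  // auto falls back inline; forced worker mode fails closed.
  return mode === "auto"
    ? { kind: "inline", reason: "parent-only blockers remain" }
    : { kind: "fail-closed", reason: "forced worker mode with a non-serializable run" };
}
```

Keeping the decision separate from the dispatch code makes the fail-closed branch easy to cover in unit tests without starting a worker.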

Target Shape

Use three explicit layers:

agent runtime boundary       OpenClaw-owned interface, PI as one backend
agent state database         SQLite primary store, doctor-only legacy JSON import
agent filesystem boundary    VFS scratch plus host capability filesystem

Workers sit around the runtime boundary:

Gateway process
  owns config, channels, HTTP, routing, state DB, policy

Agent worker
  owns one turn or one runtime session lane
  receives a prepared run request
  emits lifecycle, stream, tool, usage, and final events
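
The five event categories in the diagram above could be modeled as a discriminated union that the parent folds into a run summary. The field names below (delta, inputTokens, and so on) and the summarizeRun helper are hypothetical; the plan only names the categories.

```typescript
// Hypothetical event shapes; only the five categories come from the plan.
type RunEvent =
  | { type: "lifecycle"; phase: "start" | "end" }
  | { type: "stream"; delta: string }
  | { type: "tool"; name: string; ok: boolean }
  | { type: "usage"; inputTokens: number; outputTokens: number }
  | { type: "final"; text: string };

type RunSummary = {
  streamed: string;
  toolCalls: number;
  totalTokens: number;
  finalText?: string;
};

// Parent-side fold: the worker only emits data, the parent owns the result.
function summarizeRun(events: readonly RunEvent[]): RunSummary {
  const summary: RunSummary = { streamed: "", toolCalls: 0, totalTokens: 0 };
  for (const event of events) {
    switch (event.type) {
      case "stream":
        summary.streamed += event.delta;
        break;
      case "tool":
        summary.toolCalls += 1;
        break;
      case "usage":
        summary.totalTokens += event.inputTokens + event.outputTokens;
        break;
      case "final":
        summary.finalText = event.text;
        break;
      case "lifecycle":
        break; // lifecycle events carry no payload in this sketch
    }
  }
  return summary;
}
```

Because every variant is plain data, events of this shape can cross a worker boundary without carrying callbacks or handles.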

Node permission flags may be useful as defense in depth, but they are not the security boundary. Node's permission model is process launch policy, not a rooted filesystem capability API, and it has documented limitations around workers, symlinks, existing file descriptors, native modules, and loadable extensions.

Non-Goals

  • Do not replace fs-safe or pinned filesystem helpers with Node permissions.
  • Do not make VFS the only model for workspace edits.
  • Do not migrate all agent execution to Platformatic, Regina, or another external orchestrator.
  • Do not remove Python helper paths until an equally safe portable replacement exists.
  • Do not hide config and credentials in SQLite before export, doctor, backup, and manual repair flows are strong.

Workstream 0: Remove Duplicate Ownership

Treat duplicated code as a symptom of unclear ownership. The first refactor should not move bytes between files; it should decide which layer owns each operation.

Consolidate these repeated patterns behind shared primitives:

  • JSON read, write, atomic replace, backup, import, and export helpers.
  • Session index lookup, locking, cleanup, and diagnostics.
  • Transcript event append, replay, compaction, and support bundle export.
  • PI message, tool result, and provider adapter shapes.
  • Tool scratch file creation, artifact staging, and cleanup.

Target primitives:

StateStore              durable Gateway and agent state
SessionStoreBackend     session index and metadata ownership
TranscriptStore         append-only event history plus export
AgentRuntimeBackend     PI or future runtime implementation
AgentFilesystem         host capability filesystem plus VFS scratch
RunEventBus             serializable worker to parent event stream

Measure progress by deleting repeated helper code, not by adding wrappers. Each phase should name the old code path it replaces and keep at most one adapter for compatibility.
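
As one example of a single-owner primitive, the TranscriptStore contract listed above (append-only event history plus export) might look like the in-memory sketch below. InMemoryTranscriptStore and its method names are hypothetical, and a Map stands in for the transcript_events table so the example stays self-contained.

```typescript
type TranscriptEvent = { agentId: string; sessionId: string; seq: number; event: unknown };

class InMemoryTranscriptStore {
  // One ordered list per agent/session scope.
  private readonly rows = new Map<string, TranscriptEvent[]>();

  private scopeKey(agentId: string, sessionId: string): string {
    return JSON.stringify([agentId, sessionId]);
  }

  // Append-only: returns the 1-based sequence number assigned within the scope.
  append(agentId: string, sessionId: string, event: unknown): number {
    const key = this.scopeKey(agentId, sessionId);
    const list = this.rows.get(key) ?? [];
    const seq = list.length + 1;
    list.push({ agentId, sessionId, seq, event });
    this.rows.set(key, list);
    return seq;
  }

  // Bounded tail read, newest events last.
  tail(agentId: string, sessionId: string, count: number): TranscriptEvent[] {
    if (count <= 0) return [];
    const list = this.rows.get(this.scopeKey(agentId, sessionId)) ?? [];
    return list.slice(-count);
  }

  // JSONL is an export shape only: one JSON document per line.
  exportJsonl(agentId: string, sessionId: string): string {
    const list = this.rows.get(this.scopeKey(agentId, sessionId)) ?? [];
    return list.map((row) => JSON.stringify(row.event)).join("\n");
  }
}
```

Requiring agent and session scope on every call mirrors the scoped append rule in the status list, and keeps export a read-only view over the same rows.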

Workstream 1: Own The PI Boundary

Start by shrinking direct PI imports, not by forking PI.

  1. Add an OpenClaw-owned runtime facade above src/agents/harness/*.
  2. Move PI imports into a small adapter package or directory.
  3. Keep agentRuntime.id: "pi" stable and compatible.
  4. Convert common OpenClaw code to use OpenClaw types instead of PI types.
  5. Internalize PI functionality in this order:
    • Tool result and message types.
    • Tool adapter and tool loop contracts.
    • Session manager and transcript mutation.
    • Model registry and provider abstractions.
    • TUI pieces, only if still needed after Control UI and CLI paths settle.

Early success means most files outside the adapter no longer import @mariozechner/pi-* directly.
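
A minimal sketch of the boundary these steps aim for: core code resolves a backend by id and only sees OpenClaw-owned types. The fields on PreparedAgentRun and AgentRunResult, the RuntimeRegistry class, and the synchronous run signature are simplifying assumptions; the stand-in "pi" backend does not call any PI package.

```typescript
// Hypothetical data-only shapes; the real PreparedAgentRun carries far more.
type PreparedAgentRun = { runId: string; prompt: string };
type AgentRunResult = { runId: string; text: string };

interface AgentRuntimeBackend {
  readonly id: string;
  run(prepared: PreparedAgentRun): AgentRunResult;
}

class RuntimeRegistry {
  private readonly backends = new Map<string, AgentRuntimeBackend>();

  register(backend: AgentRuntimeBackend): void {
    if (this.backends.has(backend.id)) {
      throw new Error(`runtime backend already registered: ${backend.id}`);
    }
    this.backends.set(backend.id, backend);
  }

  resolve(id: string): AgentRuntimeBackend {
    const backend = this.backends.get(id);
    if (!backend) throw new Error(`unknown runtime backend: ${id}`);
    return backend;
  }
}

// Stand-in for the PI adapter. In the plan, the adapter is the only place
// allowed to import @mariozechner/pi-* packages; this fake just echoes.
const piBackend: AgentRuntimeBackend = {
  id: "pi",
  run: (prepared) => ({ runId: prepared.runId, text: `echo: ${prepared.prompt}` }),
};
```

With this shape, keeping agentRuntime.id: "pi" stable is a registry concern, so the implementation behind the id can change without touching callers.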

Workstream 2: Consolidate State In SQLite

OpenClaw already has a shared SQLite state layer. Task, Task Flow, and plugin state runtime writes use ~/.openclaw/state/openclaw.sqlite; legacy sidecars are doctor-import inputs. The layer uses these settings:

  • node:sqlite
  • WAL mode
  • synchronous = NORMAL
  • busy_timeout
  • 0o700 directory mode
  • 0o600 database and sidecar mode
  • explicit close paths for tests and Windows cleanup
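
The connection settings in this list could be applied through a small helper like the one below. It is a sketch: applyStatePragmas, SqlExecutor, and the 5000 ms busy timeout are assumptions (the plan names busy_timeout but gives no value). The helper only needs an exec(sql) method, which node:sqlite's DatabaseSync provides, so a fake executor is enough for tests.

```typescript
// Minimal surface this helper needs; node:sqlite's DatabaseSync has exec(sql).
interface SqlExecutor {
  exec(sql: string): void;
}

// File modes named in the plan, kept as constants for whoever creates the files.
const STATE_DIR_MODE = 0o700;
const STATE_FILE_MODE = 0o600;

// busyTimeoutMs default is an assumed value, not one taken from the plan.
function applyStatePragmas(db: SqlExecutor, busyTimeoutMs: number = 5000): void {
  if (!Number.isInteger(busyTimeoutMs) || busyTimeoutMs < 0) {
    throw new RangeError(`invalid busy timeout: ${busyTimeoutMs}`);
  }
  db.exec("PRAGMA journal_mode = WAL;");
  db.exec("PRAGMA synchronous = NORMAL;");
  db.exec(`PRAGMA busy_timeout = ${busyTimeoutMs};`);
}
```

Validating the timeout before interpolating it keeps the only dynamic part of the SQL to a checked integer.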

Create one shared state layer for agent and gateway state. Suggested path: ~/.openclaw/state/openclaw.sqlite.

Suggested tables:

schema_migrations(version, applied_at)
kv(scope, key, value_json, updated_at)
agents(agent_id, config_json, created_at, updated_at)
session_entries(agent_id, session_key, entry_json, updated_at)
transcript_events(agent_id, session_id, seq, event_json, created_at)
transcript_files(agent_id, session_id, path, imported_at, exported_at)
vfs_entries(agent_id, namespace, path, kind, content_blob, metadata_json, updated_at)
tool_artifacts(agent_id, run_id, artifact_id, kind, metadata_json, blob, created_at)
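
To show how the kv(scope, key, value_json, updated_at) row shape supports a small typed helper, here is a sketch that keeps rows in a Map instead of SQLite. ScopedKv, KvRow, JsonValue, and the scope names used in the usage note are hypothetical; the read, list, write, and delete operations follow the helper behavior named in the test plan.

```typescript
type JsonValue = null | boolean | number | string | JsonValue[] | { [key: string]: JsonValue };

type KvRow = { scope: string; key: string; valueJson: string; updatedAt: number };

class ScopedKv<T extends JsonValue> {
  private readonly rows: Map<string, KvRow>;
  private readonly scope: string;

  constructor(rows: Map<string, KvRow>, scope: string) {
    this.rows = rows;
    this.scope = scope;
  }

  // Composite primary key (scope, key), encoded unambiguously.
  private id(key: string): string {
    return JSON.stringify([this.scope, key]);
  }

  write(key: string, value: T, now: number): void {
    this.rows.set(this.id(key), {
      scope: this.scope,
      key,
      valueJson: JSON.stringify(value),
      updatedAt: now,
    });
  }

  read(key: string): T | undefined {
    const row = this.rows.get(this.id(key));
    return row ? (JSON.parse(row.valueJson) as T) : undefined;
  }

  // Keys in this scope only, sorted for stable output.
  list(): string[] {
    return Array.from(this.rows.values())
      .filter((row) => row.scope === this.scope)
      .map((row) => row.key)
      .sort();
  }

  delete(key: string): boolean {
    return this.rows.delete(this.id(key));
  }
}
```

Two features can share one backing table while staying isolated, for example `new ScopedKv<string[]>(rows, "model-cache")` next to `new ScopedKv<{ sessionId: string }>(rows, "tui")`.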

Migration order:

  1. Add shared SQLite connection and migration helpers. Done.
  2. Move task registry, Task Flow, and plugin state into shared SQLite. Runtime writes are done; legacy sidecar import remains in doctor.
  3. Move sessions.json behind a SessionStoreBackend interface. Done for canonical per-agent stores.
  4. Make SQLite primary for session entries. Done for canonical per-agent stores.
  5. Import old sessions.json only from openclaw doctor --fix, then remove the JSON index after SQLite has the rows. Done for session indexes.
  6. Import old *.jsonl transcripts only from openclaw doctor --fix, then remove the JSONL source after SQLite has the events. Done for canonical transcript files.
  7. Keep JSONL export as explicit debug/support output only.
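
The import-then-remove ordering in steps 5 and 6 can be sketched as follows. Maps stand in for the disk and for the session_entries table, importLegacySessions is a hypothetical name, and the assumption that sessions.json is a JSON object keyed by session key is not confirmed by this plan.

```typescript
type LegacyFiles = Map<string, string>; // path -> file contents (stand-in for disk)
type SessionRows = Map<string, string>; // session key -> entry JSON (stand-in for SQLite)

type ImportReport = { imported: number; removedSource: boolean };

function importLegacySessions(files: LegacyFiles, rows: SessionRows, path: string): ImportReport {
  const raw = files.get(path);
  if (raw === undefined) {
    return { imported: 0, removedSource: false }; // nothing to migrate, safe to re-run
  }
  // JSON.parse throws on corrupt input, which leaves the source file in place.
  const parsed: unknown = JSON.parse(raw);
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    throw new Error(`unexpected legacy index shape in ${path}`);
  }
  const index = parsed as Record<string, unknown>;
  // Stage every row first so a serialization failure cannot half-import.
  const staged = Object.keys(index).map((key) => [key, JSON.stringify(index[key])] as const);
  for (const [key, json] of staged) {
    rows.set(key, json);
  }
  // Remove the legacy source only after the rows are stored.
  files.delete(path);
  return { imported: staged.length, removedSource: true };
}
```

Running the import a second time is a no-op, which is the property a doctor fix mode needs if it is interrupted and retried.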

Keep openclaw.json and auth-profiles.json file-backed until operator repair, secret audit, and backup flows can handle the SQLite layout naturally.

Workstream 3: Add VFS Scratch Storage

The filesystem model should distinguish scratch state from real host files.

VirtualAgentFs
  SQLite-backed scratch filesystem
  used for temporary tool files, generated artifacts, staging, diagnostics

HostCapabilityFs
  real host filesystem access
  backed by fs-safe or pinned helpers
  used for workspace edits, media imports, archive extraction, user files

Agent tools should receive capability objects, not raw path strings where possible:

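type AgentFilesystem = {
  // SQLite-backed scratch storage, present in every filesystem mode.
  scratch: VirtualAgentFs;
  // Host workspace capability, absent when the run has no host grant.
  workspace?: HostCapabilityFs;
};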
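
A scratch filesystem behind such a capability object mostly needs path normalization plus a handful of operations. The sketch below keeps files in a Map rather than SQLite; normalizeScratchPath and InMemoryScratchFs are hypothetical names, and it assumes POSIX-style forward-slash paths only.

```typescript
// Normalize to a rooted POSIX-style path and reject escapes above the scratch root.
function normalizeScratchPath(input: string): string {
  const parts: string[] = [];
  for (const segment of input.split("/")) {
    if (segment === "" || segment === ".") continue;
    if (segment === "..") {
      if (parts.length === 0) throw new Error(`path escapes scratch root: ${input}`);
      parts.pop();
      continue;
    }
    parts.push(segment);
  }
  return "/" + parts.join("/");
}

class InMemoryScratchFs {
  private readonly files = new Map<string, string>();

  write(path: string, content: string): void {
    this.files.set(normalizeScratchPath(path), content);
  }

  read(path: string): string | undefined {
    return this.files.get(normalizeScratchPath(path));
  }

  rename(from: string, to: string): void {
    const source = normalizeScratchPath(from);
    const content = this.files.get(source);
    if (content === undefined) throw new Error(`no such scratch file: ${source}`);
    this.files.delete(source);
    this.files.set(normalizeScratchPath(to), content);
  }

  remove(path: string): boolean {
    return this.files.delete(normalizeScratchPath(path));
  }

  // Recursive listing of every file under the directory, sorted.
  list(dir: string): string[] {
    const root = normalizeScratchPath(dir);
    const prefix = root === "/" ? "/" : root + "/";
    return Array.from(this.files.keys()).filter((p) => p.startsWith(prefix)).sort();
  }
}
```

Because every operation normalizes first, `tmp/out.txt` and `/tmp/./out.txt` address the same entry, and `../` can never reach outside the scratch root.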

Default policy:

  • read, write, edit, and apply_patch continue to operate on the real workspace unless the run is explicitly VFS-only.
  • Scratch artifacts use VFS by default.
  • Shell commands run on disk when host workspace or sandbox access is granted.
  • In VFS-only mode, foreground exec may run against an explicit projected temporary disk view and sync the result back into VFS. process stays disk/sandbox-only until background sessions have a VFS-aware lifecycle.
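
The sync-back half of the VFS-only exec path in the last bullet can be reduced to a diff between the projected files before and after the command. This is a sketch with hypothetical names (diffProjection, applySyncPlan, SyncPlan); it treats file contents as strings and ignores directories, permissions, and binary data.

```typescript
type Snapshot = ReadonlyMap<string, string>; // path -> content

type SyncPlan = { created: string[]; edited: string[]; deleted: string[] };

// Compare the projected view before and after the foreground command ran.
function diffProjection(before: Snapshot, after: Snapshot): SyncPlan {
  const plan: SyncPlan = { created: [], edited: [], deleted: [] };
  after.forEach((content, path) => {
    const previous = before.get(path);
    if (previous === undefined) {
      plan.created.push(path);
    } else if (previous !== content) {
      plan.edited.push(path);
    }
  });
  before.forEach((_content, path) => {
    if (!after.has(path)) plan.deleted.push(path);
  });
  return plan;
}

// Apply the plan to scratch storage (a Map stands in for SQLite scratch rows).
function applySyncPlan(scratch: Map<string, string>, after: Snapshot, plan: SyncPlan): void {
  for (const path of [...plan.created, ...plan.edited]) {
    const content = after.get(path);
    if (content !== undefined) scratch.set(path, content);
  }
  for (const path of plan.deleted) {
    scratch.delete(path);
  }
}
```

Computing the plan before applying it gives the parent a place to log or reject unexpected changes before they reach scratch storage.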

Runtime filesystem modes:

Mode | Workspace writes | Scratch writes | Shell working directory | Primary use
disk | Host capability FS | SQLite VFS | Real workspace or sandbox root | Current default with safer scratch storage
vfs-scratch | Host capability FS | SQLite VFS | Real workspace or sandbox root | Default target after VFS lands
vfs-only | SQLite VFS unless host grant is explicit | SQLite VFS | Projected temporary disk view or no shell | Isolated agents, previews, replay, tests

The parent process chooses the mode before worker launch and records it in the run policy. Workers should not be able to upgrade themselves from VFS-only to host filesystem access.
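
One way to picture the parent-chosen policy is a frozen, data-only record derived from the mode. RunFilesystemPolicy, buildFilesystemPolicy, and the explicitHostGrant parameter are hypothetical. Freezing is only illustrative: the actual guarantee comes from the parent recording the policy and constructing the worker's capabilities from it.

```typescript
type FilesystemMode = "disk" | "vfs-scratch" | "vfs-only";

type RunFilesystemPolicy = Readonly<{
  mode: FilesystemMode;
  hostWorkspace: boolean; // whether the worker receives a host workspace capability
  shell: "host" | "projected-or-none"; // where foreground commands may run
}>;

// Parent-side: derive the policy once, before worker launch.
function buildFilesystemPolicy(
  mode: FilesystemMode,
  explicitHostGrant: boolean = false,
): RunFilesystemPolicy {
  // disk and vfs-scratch keep host workspace access; vfs-only needs an explicit grant.
  const hostWorkspace = mode !== "vfs-only" || explicitHostGrant;
  const policy: RunFilesystemPolicy = {
    mode,
    hostWorkspace,
    shell: hostWorkspace ? "host" : "projected-or-none",
  };
  return Object.freeze(policy);
}
```

Since the record is plain data, it can be sent to the worker as part of the prepared run while the parent keeps the authoritative copy.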

Good first candidates for VFS:

  • Tool temporary files.
  • Model diagnostic payloads. Runtime trajectory capture now has a SQLite artifact mirror.
  • Generated artifact staging. Tool media result manifests now land in SQLite; binary delivery files remain on disk until channel delivery supports claim-check reads from VFS/artifacts.
  • Memory upload batches.
  • QA and scenario summaries.
  • Plugin scratch state that does not need operator editing.

Poor first candidates:

  • User workspaces.
  • Git repositories.
  • Media files users expect to find on disk.
  • Config and credentials.
  • Any integration whose dependency requires real paths.

Workstream 4: Run Agents In Workers

Workerization should improve isolation and parallelism without moving Gateway ownership into workers.

Initial architecture:

  1. Parent Gateway builds a PreparedAgentRun.
  2. Parent records session routing and policy in SQLite.
  3. Parent starts or leases an agent worker.
  4. Worker runs the selected harness attempt.
  5. Worker streams events back to parent.
  6. Parent persists state, delivers channel replies, and enforces lifecycle.

Worker payloads must be serializable. Do not pass live DB handles, plugin API objects, process handles, or mutable config references into workers.
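
A sketch of how a parent might split run params into a data-only snapshot and a list of blocked keys before launch. snapshotRunParams and isPlainData are hypothetical names. The check is deliberately stricter than structured clone (it rejects Map, Date, and class instances) and assumes the input has no cycles.

```typescript
type RunParamSnapshot = { data: Record<string, unknown>; blocked: string[] };

// Accept only JSON-like data: primitives, arrays, and plain objects.
// Assumes acyclic input; a cyclic object would recurse without end.
function isPlainData(value: unknown): boolean {
  if (value === null) return true;
  const kind = typeof value;
  if (kind === "string" || kind === "number" || kind === "boolean" || kind === "undefined") {
    return true;
  }
  if (Array.isArray(value)) {
    return value.every((item) => isPlainData(item));
  }
  if (typeof value === "object") {
    const proto = Object.getPrototypeOf(value);
    // Class instances are treated as live handles and rejected.
    if (proto !== Object.prototype && proto !== null) return false;
    const record = value as Record<string, unknown>;
    return Object.keys(record).every((key) => isPlainData(record[key]));
  }
  return false; // functions, symbols, bigint
}

// Split params into what may cross the worker boundary and what must stay behind.
function snapshotRunParams(params: Record<string, unknown>): RunParamSnapshot {
  const data: Record<string, unknown> = {};
  const blocked: string[] = [];
  for (const key of Object.keys(params).sort()) {
    const value = params[key];
    if (isPlainData(value)) {
      data[key] = value;
    } else {
      blocked.push(key);
    }
  }
  return { data, blocked };
}
```

An empty blocked list corresponds to a serializable run; any remaining entries are the parent-only blockers that must be stripped or bridged before a worker launch.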

Start with one worker per active agent run. Later, add a pool keyed by:

  • runtime id
  • agent id
  • model provider
  • workspace or sandbox root
  • permission profile
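
The pool key above could be derived and used as in the sketch below. workerPoolKey, WorkerLeasePool, and the field names are hypothetical, and the pool is a plain idle list per key with no size limits, health checks, or eviction.

```typescript
type WorkerPoolKeyParts = {
  runtimeId: string;
  agentId: string;
  modelProvider: string;
  root: string; // workspace or sandbox root
  permissionProfile: string;
};

// JSON array encoding avoids the ambiguity a plain delimiter join could introduce.
function workerPoolKey(parts: WorkerPoolKeyParts): string {
  return JSON.stringify([
    parts.runtimeId,
    parts.agentId,
    parts.modelProvider,
    parts.root,
    parts.permissionProfile,
  ]);
}

class WorkerLeasePool<W> {
  private readonly idle = new Map<string, W[]>();
  private readonly create: (key: string) => W;

  constructor(create: (key: string) => W) {
    this.create = create;
  }

  // Reuse an idle worker with an identical key, otherwise create one.
  lease(parts: WorkerPoolKeyParts): W {
    const key = workerPoolKey(parts);
    const reused = this.idle.get(key)?.pop();
    return reused ?? this.create(key);
  }

  release(parts: WorkerPoolKeyParts, worker: W): void {
    const key = workerPoolKey(parts);
    const queue = this.idle.get(key) ?? [];
    queue.push(worker);
    this.idle.set(key, queue);
  }
}
```

Including the permission profile in the key means a worker launched with one set of grants is never reused for a run that expects another.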

Use worker threads first for lower overhead. Add process mode when the run needs stronger isolation, different Node permission flags, native module separation, or cleaner crash containment.

Node Permissions Policy

Use Node permissions only as a seatbelt:

  • grant read access to code and required runtime files
  • grant read/write to the agent workspace or sandbox root when needed
  • grant worker creation only in trusted parent code
  • avoid exposing worker creation to model-controlled tools
  • keep subprocess and native addon permissions disabled unless the runtime profile needs them

Do not treat Node permissions as a substitute for HostCapabilityFs.
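
The seatbelt defaults described in this plan can be sketched as a profile builder. resolvePermissionMode, buildPermissionProfile, PermissionProfile, and the option names are hypothetical, and treating an unrecognized override as unset is an assumption. The sketch only builds a data record; it does not emit Node command-line flags.

```typescript
type PermissionMode = "off" | "audit" | "enforce";
type FilesystemMode = "disk" | "vfs-scratch" | "vfs-only";

// vfs-only defaults to enforce, disk-backed modes default to off, an override wins.
function resolvePermissionMode(
  filesystemMode: FilesystemMode,
  override: string | undefined,
): PermissionMode {
  if (override === "off" || override === "audit" || override === "enforce") return override;
  return filesystemMode === "vfs-only" ? "enforce" : "off";
}

type PermissionProfile = {
  mode: PermissionMode;
  readPaths: string[];
  writePaths: string[];
  allowWorkers: boolean;
  allowChildProcess: boolean;
  allowAddons: boolean;
  allowWasi: boolean;
};

function buildPermissionProfile(options: {
  filesystemMode: FilesystemMode;
  runtimeRoot: string;
  stateDir: string;
  workspaceRoot: string | undefined;
  override: string | undefined;
}): PermissionProfile {
  const readPaths = [options.runtimeRoot, options.stateDir];
  const writePaths = [options.stateDir];
  // Workspace access only for disk-backed modes (disk and vfs-scratch).
  if (options.filesystemMode !== "vfs-only" && options.workspaceRoot !== undefined) {
    readPaths.push(options.workspaceRoot);
    writePaths.push(options.workspaceRoot);
  }
  return {
    mode: resolvePermissionMode(options.filesystemMode, options.override),
    readPaths,
    writePaths,
    // Nothing below is granted unless a runtime profile explicitly requests it.
    allowWorkers: false,
    allowChildProcess: false,
    allowAddons: false,
    allowWasi: false,
  };
}
```

Even with such a profile in enforce mode, workspace reads and writes would still go through HostCapabilityFs; the profile only narrows what a mistake can reach.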

Dependency Policy

Before adding @platformatic/vfs, Platformatic Runtime, @cocalc/openat2, or similar dependencies:

  1. Prototype behind a feature flag.
  2. Measure install size and native surface.
  3. Check package health, license, and release cadence.
  4. Keep dependency ownership local to the feature owner.
  5. Avoid root dependencies unless core imports the package at runtime.

Likely choices:

  • SQLite VFS can start as an OpenClaw-owned minimal implementation.
  • @platformatic/vfs can be evaluated as an adapter, not adopted as the core contract immediately.
  • @cocalc/openat2 can be an optional Linux fast path inside fs-safe, not the portable baseline.

Test Plan

Add tests before each migration step:

  • Duplicate adapter deletion checks for PI imports, JSON state helpers, and filesystem scratch helpers.
  • Session store JSON import to SQLite.
  • SQLite to JSON export for support bundles.
  • Scoped JSON-compatible KV helper read, list, write, and delete behavior.
  • Concurrent session entry updates from multiple workers.
  • WAL recovery after simulated crash.
  • Transcript JSONL compatibility while PI still owns transcripts.
  • VFS path normalization, read, write, rename, remove, and directory listing.
  • VFS projection to temporary disk and sync-back of command-side creates, edits, deletes, and nested workdirs.
  • Host filesystem traversal, symlink, hardlink, rename, copy, remove, and time-of-check to time-of-use races.
  • Worker lifecycle, cancellation, stream event ordering, and crash recovery.
  • Worker prepared-run timeout enforcement, abort handling, and parent event flush ordering.
  • Worker parent callback bridge for streaming replies, tool output, generic agent events, aborts, and reply refs.
  • High-level run-param snapshot and worker rehydration for preserving serializable channel/tool/prompt policy across the worker boundary.
  • Parent-side PI worker runner that preserves EmbeddedPiRunResult instead of collapsing worker completion to plain text.
  • Run-level worker dispatch that preserves parent queue ordering and parent reply-operation cancellation, streaming state, and steering messages without cloning the live operation into the worker.
  • Worker-entry cancellation signal rehydration from parent control messages.
  • Worker permission profile construction, including VFS-only path denial.
  • Disk, VFS scratch, and VFS-only filesystem mode behavior.
  • Plugin state and task registry coexistence with the shared state DB.
  • Managed outgoing media record import from legacy JSON, legacy file removal after import, plus SQLite-primary serving without JSON exports.
  • Subagent run registry import from legacy subagents/runs.json during doctor, legacy file removal after import, and restore from SQLite without JSON exports.
  • Sandbox container and browser registry reads from SQLite, while legacy monolithic registry migration stays an explicit doctor repair operation.
  • OpenRouter model capability cache reads from SQLite, with old cache JSON imported and removed only by doctor.
  • TUI last-session restore pointers read from SQLite without JSON exports, import legacy JSON only through doctor, and clear stale pointers from SQLite.
  • Auth profile runtime state reads from SQLite, imports legacy JSON only through doctor, and deletes SQLite state when runtime state is empty.

Rollout Plan

Phase 0: inventory and contracts

  • Count direct PI imports by package.
  • Count duplicate JSON, transcript, and scratch helper implementations.
  • Inventory JSON and JSONL state files.
  • Define AgentRuntimeBackend, SessionStoreBackend, and AgentFilesystem.
  • Document host path versus VFS-only operations.

Phase 1: SQLite session index

  • Add shared state DB helper.
  • Add a doctor migration that imports sessions.json into SQLite and removes the JSON index.
  • Move canonical session entries to SQLite by default.
  • Prove current session list, patch, reset, cleanup, and UI flows.
  • Remove load-time/startup session JSON migration, write-time pruning, and migration-era maintenance options from the runtime store path.
  • Remove the duplicate status-only session JSON reader and stop requiring a physical sessions.json file for discovered SQLite-backed agent stores.
  • Remove the legacy JSON session-store cache layer.
  • Remove the dedicated cron timer session reaper and cron.sessionRetention config; explicit session cleanup owns row pruning.

Phase 2: VFS scratch

  • Add SQLite-backed VFS for scratch artifacts.
  • Move low-risk scratch files first.
  • Keep real workspace tools on host capability FS.
  • Add support bundle export for VFS contents.

Phase 3: PI adapter shrink

  • Centralize PI imports.
  • Replace PI-exposed types across core with OpenClaw-owned types.
  • Keep PI as the implementation of the default harness.

Phase 4: workerized runs

  • Run one PI harness attempt inside a worker behind a feature flag.
  • Stream events back through the parent.
  • Keep parent-owned session and delivery writes authoritative.
  • Add cancellation and crash recovery.

Phase 5: transcript ownership

  • Move transcript mutation behind OpenClaw APIs.
  • Store transcript events in SQLite.
  • Import legacy JSONL through doctor only; export JSONL for debugging/support.
  • Remove direct PI SessionManager usage from non-adapter code.
  • Remove file-backed compaction checkpoint copies and the session-manager cache/prewarm layer.
  • Move Codex app-server binding state from per-session JSON sidecars to the shared SQLite kv table.

Phase 6: internalize or replace PI pieces

  • Internalize the pieces that still force root PI dependencies.
  • Keep public runtime behavior and docs stable.
  • Remove PI packages only when all runtime, TUI, provider, and transcript users have migrated.

Open Questions

  • Which current JSON files must remain human-editable long term?
  • Should a VFS-only agent be a separate runtime profile or a per-run filesystem mode?
  • Should shell commands ever run directly against VFS, or only against projected temporary disk views?
  • How much transcript history should stay queryable through SQL versus exported support bundles?
  • What is the minimum useful worker boundary: per turn, per session, or per agent?
  • Which plugin SDK APIs should expose filesystem capabilities first?

Done Criteria

This refactor is successful when:

  • Core code no longer imports PI packages outside the runtime adapter.
  • Repeated JSON, transcript, PI adapter, and scratch filesystem logic has one owner each.
  • sessions.json is a doctor-migrated legacy input, not a compatibility store.
  • Scratch state and tool artifacts can live in SQLite-backed VFS.
  • Agents can run in disk, VFS scratch, and VFS-only filesystem modes.
  • Real workspace writes still use capability-safe host filesystem operations.
  • Agent turns can run in workers with preserved streaming, cancellation, compaction, tool hooks, and channel delivery.
  • Existing users can upgrade without losing sessions, config, credentials, or workspaces.