* fix(process): skip kill-tree group kill when child wasn't detached (#71662)
When the supervisor spawns a child with detached:false (service-managed
runtime under launchd/systemd), the child shares the gateway's process
group. On session abort or SIGKILL, killProcessTree was unconditionally
issuing process.kill(-pid, 'SIGTERM') — which targets the entire process
GROUP (negative pid is POSIX group-kill semantics) and therefore
SIGTERMs the gateway parent along with the child.
Reporter saw this on macOS (LaunchAgent + KeepAlive=true): aborting a
claude-cli/claude-opus-4-7 session caused the gateway to receive
SIGTERM, then auto-restart, dropping all in-flight sessions. Switching
the primary model to a non-cli provider eliminated it because the
non-cli paths don't go through this kill-tree call. Did not occur on
Linux VPS where the gateway runs detached, because there
useDetached === true and the child got its own process group.
Fix:
- killProcessTree now accepts opts.detached?: boolean. When detached:false,
killProcessTreeUnix skips the `-pid` group-kill and goes straight to
direct-pid SIGTERM/SIGKILL. Group-kill default (detached:true) is
preserved so all existing callers behave exactly as before.
- supervisor/adapters/child.ts:286 now threads the spawn-time `useDetached`
flag into killProcessTree, so the kill-tree path matches the spawn-time
detachment decision (line 45 of the same file already computes
useDetached = process.platform !== 'win32' && !isServiceManagedRuntime()).
Tests:
- new: detached:false skips group kill and uses direct pid SIGTERM only.
- new: default behaviour (detached:true) still uses group kill (regression
guard so the existing test case isn't accidentally weakened).
Existing tests still pass (6/6 in kill-tree.test.ts). Lint clean.
Out of scope: other killProcessTree callers (mcp-stdio-transport,
bash-tools.process, etc.) keep the default group-kill behaviour because
those processes are typically detached from the gateway. Only the
supervisor/adapters/child.ts path threads `detached` through, since it's
the path that knows whether the child was actually spawned detached.
* fixup(process): also gate kill-tree group-kill on the no-detach spawn fallback (#71662)
Greptile review on the original PR caught a P1 gap: when
spawnWithFallback's initial detached spawn fails and it retries with the
no-detach fallback (label: "no-detach", options.detached: false), the
child runs detached:false but my variable useDetached was still true.
The kill closure then passed `detached: useDetached` = true to
killProcessTree, which still group-killed the gateway — same bug, just
on the fallback path.
Compute the actual detachment as
`useDetached && !spawned.usedFallback` after spawn returns, and pass
that through. This closes the gap: the kill path now correctly skips
group-kill in BOTH:
1. Service-managed runtime (useDetached=false from the start, original case)
2. Detached-spawn fallback to no-detach (useDetached=true at intent
time but spawned.usedFallback=true)
Tests:
- existing 'uses process-tree kill for default SIGKILL' updated to
assert the new {detached} option is forwarded.
- new: passes detached:false to killProcessTree when spawn fell back.
- new: passes detached:false in service-managed mode (regression guard
for the original fix).
11/11 tests pass in child.test.ts. 6/6 in kill-tree.test.ts.
Raise eligible Linux child processes own oom_score_adj from a child-side /bin/sh exec shim so cgroup memory pressure prefers transient workers over the long-lived gateway. Cover supervisor children, PTY shells, MCP stdio servers, and OpenClaw-launched browser processes through the shared process runtime seam.
Harden the wrapper for distroless images, shell startup env, per-child and process-level opt-outs, dash-compatible exec, and leading-dash command names. Document Linux verification and OOM behavior.
Fixes#70404.
Co-authored-by: Neerav Makwana <261249544+neeravmakwana@users.noreply.github.com>
After a SIGUSR1 in-process restart following an npm upgrade from v2026.4.2
to v2026.4.5, the globalThis singleton created by the old code version
lacks the activeTaskWaiters field added in v2026.4.5. resolveGlobalSingleton
returns the stale object as-is, causing notifyActiveTaskWaiters() to call
Array.from(undefined) and crash the gateway in a loop.
Add a schema migration step in getQueueState() that patches the missing
field on legacy singleton objects. Add a regression test that plants a
v2026.4.2-shaped state object and verifies resetAllLanes() and
waitForActiveTasks() succeed without throwing.
Fixes#61905