Files
moltbot/src/process/kill-tree.ts
hcl 4a72e1b990 fix(process): skip kill-tree group kill when child wasn't detached (#71662) (#71681)
* fix(process): skip kill-tree group kill when child wasn't detached (#71662)

When the supervisor spawns a child with detached:false (service-managed
runtime under launchd/systemd), the child shares the gateway's process
group. On session abort or SIGKILL, killProcessTree was unconditionally
issuing process.kill(-pid, 'SIGTERM') — which targets the entire process
GROUP (negative pid is POSIX group-kill semantics) and therefore
SIGTERMs the gateway parent along with the child.

Reporter saw this on macOS (LaunchAgent + KeepAlive=true): aborting a
claude-cli/claude-opus-4-7 session caused the gateway to receive
SIGTERM, then auto-restart, dropping all in-flight sessions. Switching
the primary model to a non-cli provider eliminated it because the
non-cli paths don't go through this kill-tree call. Did not occur on
Linux VPS where the gateway runs detached, because there
useDetached === true and the child got its own process group.

Fix:
- killProcessTree now accepts opts.detached?: boolean. When detached:false,
  killProcessTreeUnix skips the `-pid` group-kill and goes straight to
  direct-pid SIGTERM/SIGKILL. Group-kill default (detached:true) is
  preserved so all existing callers behave exactly as before.
- supervisor/adapters/child.ts:286 now threads the spawn-time `useDetached`
  flag into killProcessTree, so the kill-tree path matches the spawn-time
  detachment decision (line 45 of the same file already computes
  useDetached = process.platform !== 'win32' && !isServiceManagedRuntime()).

Tests:
- new: detached:false skips group kill and uses direct pid SIGTERM only.
- new: default behaviour (detached:true) still uses group kill (regression
  guard so the existing test case isn't accidentally weakened).

Existing tests still pass (6/6 in kill-tree.test.ts). Lint clean.

Out of scope: other killProcessTree callers (mcp-stdio-transport,
bash-tools.process, etc.) keep the default group-kill behaviour because
those processes are typically detached from the gateway. Only the
supervisor/adapters/child.ts path threads `detached` through, since it's
the path that knows whether the child was actually spawned detached.

* fixup(process): also gate kill-tree group-kill on the no-detach spawn fallback (#71662)

Greptile review on the original PR caught a P1 gap: when
spawnWithFallback's initial detached spawn fails and it retries with the
no-detach fallback (label: "no-detach", options.detached: false), the
child runs detached:false but my variable useDetached was still true.
The kill closure then passed `detached: useDetached` = true to
killProcessTree, which still group-killed the gateway — same bug, just
on the fallback path.

Compute the actual detachment as
`useDetached && !spawned.usedFallback` after spawn returns, and pass
that through. This closes the gap: the kill path now correctly skips
group-kill in BOTH:
1. Service-managed runtime (useDetached=false from the start, original case)
2. Detached-spawn fallback to no-detach (useDetached=true at intent
   time but spawned.usedFallback=true)

Tests:
- existing 'uses process-tree kill for default SIGKILL' updated to
  assert the new {detached} option is forwarded.
- new: passes detached:false to killProcessTree when spawn fell back.
- new: passes detached:false in service-managed mode (regression guard
  for the original fix).

11/11 tests pass in child.test.ts. 6/6 in kill-tree.test.ts.
2026-04-25 17:08:53 -04:00

125 lines
3.4 KiB
TypeScript

import { spawn } from "node:child_process";
const DEFAULT_GRACE_MS = 3000;
const MAX_GRACE_MS = 60_000;
/**
* Best-effort process-tree termination with graceful shutdown.
* - Windows: use taskkill /T to include descendants. Sends SIGTERM-equivalent
* first (without /F), then force-kills if process survives.
* - Unix: send SIGTERM to process group first, wait grace period, then SIGKILL.
*
* This gives child processes a chance to clean up (close connections, remove
* temp files, terminate their own children) before being hard-killed.
*
* When the child was spawned with `detached: false` (e.g. service-managed
* runtime under launchd/systemd), pass `detached: false` to skip the Unix
* `process.kill(-pid, ...)` group-kill — it would otherwise target the
* gateway's own process group and SIGTERM the gateway itself. (#71662)
*/
export function killProcessTree(
pid: number,
opts?: { graceMs?: number; detached?: boolean },
): void {
if (!Number.isFinite(pid) || pid <= 0) {
return;
}
const graceMs = normalizeGraceMs(opts?.graceMs);
if (process.platform === "win32") {
killProcessTreeWindows(pid, graceMs);
return;
}
killProcessTreeUnix(pid, graceMs, opts?.detached !== false);
}
function normalizeGraceMs(value?: number): number {
if (typeof value !== "number" || !Number.isFinite(value)) {
return DEFAULT_GRACE_MS;
}
return Math.max(0, Math.min(MAX_GRACE_MS, Math.floor(value)));
}
function isProcessAlive(pid: number): boolean {
try {
process.kill(pid, 0);
return true;
} catch {
return false;
}
}
function killProcessTreeUnix(pid: number, graceMs: number, useGroupKill: boolean): void {
// Step 1: Try graceful SIGTERM. Prefer process-group kill (`-pid`) when the
// child was spawned detached so it has its own group; otherwise stick to the
// direct pid to avoid SIGTERMing our own process group (the gateway). (#71662)
if (useGroupKill) {
try {
process.kill(-pid, "SIGTERM");
} catch {
// Process group doesn't exist or we lack permission - try direct
try {
process.kill(pid, "SIGTERM");
} catch {
// Already gone
return;
}
}
} else {
try {
process.kill(pid, "SIGTERM");
} catch {
// Already gone
return;
}
}
// Step 2: Wait grace period, then SIGKILL if still alive
setTimeout(() => {
if (useGroupKill && isProcessAlive(-pid)) {
try {
process.kill(-pid, "SIGKILL");
return;
} catch {
// Fall through to direct pid kill
}
}
if (!isProcessAlive(pid)) {
return;
}
try {
process.kill(pid, "SIGKILL");
} catch {
// Process exited between liveness check and kill
}
}, graceMs).unref(); // Don't block event loop exit
}
function runTaskkill(args: string[]): void {
try {
spawn("taskkill", args, {
stdio: "ignore",
detached: true,
windowsHide: true,
});
} catch {
// Ignore taskkill spawn failures
}
}
function killProcessTreeWindows(pid: number, graceMs: number): void {
// Step 1: Try graceful termination (taskkill without /F)
runTaskkill(["/T", "/PID", String(pid)]);
// Step 2: Wait grace period, then force kill only if pid still exists.
// This avoids unconditional delayed /F kills after graceful shutdown.
setTimeout(() => {
if (!isProcessAlive(pid)) {
return;
}
runTaskkill(["/F", "/T", "/PID", String(pid)]);
}, graceMs).unref(); // Don't block event loop exit
}