Commit Graph

236 Commits

Author SHA1 Message Date
Peter Steinberger
2f0c9358b1 refactor: hide shared constants 2026-05-02 08:29:21 +01:00
Peter Steinberger
ad1e14af53 refactor: delete unused test helper code 2026-05-01 13:11:42 +01:00
Alex Knight
e1a7c5b860 fix: handle EPIPE errors on child process stdin writes (#75602)
Fix three child-process stdin write paths that let async EPIPE errors
escape to uncaughtException and crash the gateway.

extensions/imessage/src/client.ts (the actual #75438 crash path):
- Add child.stdin.on('error') listener in start() to catch async EPIPE
  and reject all pending requests via failAll().
- Add write callback to request() stdin.write() that rejects the
  specific pending request on error, instead of leaving it hanging
  until timeout.

src/agents/mcp-stdio-transport.ts:
- Fix write callback race in send(): previously resolved the promise
  immediately when write() returned true, then the write callback with
  EPIPE would fire after the promise was already fulfilled. Now always
  settles the promise from the write callback so the outcome is known
  before resolving.

src/process/exec.ts:
- Add stdin.on('error') before writing input so EPIPE from a
  prematurely-exited child is swallowed — the process exit handler
  reports the real status.

One reporter observed a gateway crash after 10.5 hours of stable
uptime — a single EPIPE on an iMessage RPC child process stdin write
killed the gateway with code 1.

Fixes: #75438
2026-05-01 21:45:12 +10:00
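
A minimal sketch of the two stdin patterns described in e1a7c5b860, assuming Node.js streams. The helper names `writeSettled`, `guardStdin`, and `brokenPipe` are hypothetical and do not come from the repository. It shows settling the promise from the write callback, plus an 'error' listener that catches the asynchronous EPIPE.

```typescript
import { Writable } from "node:stream";

// Hypothetical helper: settle only from the write callback, so a late EPIPE
// rejects the promise instead of arriving after it was already resolved.
export function writeSettled(stdin: Writable, chunk: string): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    stdin.write(chunk, (err) => (err ? reject(err) : resolve()));
  });
}

// Hypothetical helper: attach an 'error' listener so an asynchronous EPIPE is
// routed to a handler rather than escaping as an uncaught exception.
export function guardStdin(stdin: Writable, onError: (err: Error) => void): void {
  stdin.on("error", onError);
}

// Stand-in for the stdin of a child process that has already exited:
// every write fails with an EPIPE-shaped error.
export function brokenPipe(): Writable {
  return new Writable({
    write(_chunk, _encoding, callback) {
      callback(Object.assign(new Error("write EPIPE"), { code: "EPIPE" }));
    },
  });
}
```

In a real caller the `onError` handler would reject every pending request (the commit calls this failAll()), and the rejection from `writeSettled` would fail only the request that issued the write.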
Peter Steinberger
42d73fd955 refactor: remove dead private helpers 2026-05-01 06:55:26 +01:00
Peter Steinberger
470098bd26 fix: keep embedded run lanes from wedging 2026-04-29 21:37:17 +01:00
Peter Steinberger
f5e7557c70 fix(heartbeat): defer during cron and nested lane pressure 2026-04-29 10:08:48 +01:00
Peter Steinberger
14e8a2d00b chore: remove unused internal dead code 2026-04-29 09:34:40 +01:00
Peter Steinberger
c500e8704f fix(gateway): recover stale session lanes 2026-04-28 20:37:29 +01:00
Peter Steinberger
e1acb61317 refactor: expose SDK test helper subpaths 2026-04-28 03:28:17 +01:00
Peter Steinberger
7d74c29dcc fix: isolate cron nested lane concurrency 2026-04-27 09:39:10 +01:00
Vincent Koc
e6d2c9b080 fix(process): decode Windows command output with console codepage awareness (#72393)
* fix(process): decode Windows command output with console codepage awareness

* fix(clownfish): address review for ghcrawl-199248-agentic-merge (1)
2026-04-26 23:10:59 -07:00
hcl
4a72e1b990 fix(process): skip kill-tree group kill when child wasn't detached (#71662) (#71681)
* fix(process): skip kill-tree group kill when child wasn't detached (#71662)

When the supervisor spawns a child with detached:false (service-managed
runtime under launchd/systemd), the child shares the gateway's process
group. On session abort or SIGKILL, killProcessTree was unconditionally
issuing process.kill(-pid, 'SIGTERM') — which targets the entire process
GROUP (negative pid is POSIX group-kill semantics) and therefore
SIGTERMs the gateway parent along with the child.

Reporter saw this on macOS (LaunchAgent + KeepAlive=true): aborting a
claude-cli/claude-opus-4-7 session caused the gateway to receive
SIGTERM, then auto-restart, dropping all in-flight sessions. Switching
the primary model to a non-cli provider eliminated it because the
non-cli paths don't go through this kill-tree call. Did not occur on
Linux VPS where the gateway runs detached, because there
useDetached === true and the child got its own process group.

Fix:
- killProcessTree now accepts opts.detached?: boolean. When detached:false,
  killProcessTreeUnix skips the `-pid` group-kill and goes straight to
  direct-pid SIGTERM/SIGKILL. Group-kill default (detached:true) is
  preserved so all existing callers behave exactly as before.
- supervisor/adapters/child.ts:286 now threads the spawn-time `useDetached`
  flag into killProcessTree, so the kill-tree path matches the spawn-time
  detachment decision (line 45 of the same file already computes
  useDetached = process.platform !== 'win32' && !isServiceManagedRuntime()).

Tests:
- new: detached:false skips group kill and uses direct pid SIGTERM only.
- new: default behaviour (detached:true) still uses group kill (regression
  guard so the existing test case isn't accidentally weakened).

Existing tests still pass (6/6 in kill-tree.test.ts). Lint clean.

Out of scope: other killProcessTree callers (mcp-stdio-transport,
bash-tools.process, etc.) keep the default group-kill behaviour because
those processes are typically detached from the gateway. Only the
supervisor/adapters/child.ts path threads `detached` through, since it's
the path that knows whether the child was actually spawned detached.

* fixup(process): also gate kill-tree group-kill on the no-detach spawn fallback (#71662)

Greptile review on the original PR caught a P1 gap: when
spawnWithFallback's initial detached spawn fails and it retries with the
no-detach fallback (label: "no-detach", options.detached: false), the
child runs detached:false but my variable useDetached was still true.
The kill closure then passed `detached: useDetached` = true to
killProcessTree, which still group-killed the gateway — same bug, just
on the fallback path.

Compute the actual detachment as
`useDetached && !spawned.usedFallback` after spawn returns, and pass
that through. This closes the gap: the kill path now correctly skips
group-kill in BOTH:
1. Service-managed runtime (useDetached=false from the start, original case)
2. Detached-spawn fallback to no-detach (useDetached=true at intent
   time but spawned.usedFallback=true)

Tests:
- existing 'uses process-tree kill for default SIGKILL' updated to
  assert the new {detached} option is forwarded.
- new: passes detached:false to killProcessTree when spawn fell back.
- new: passes detached:false in service-managed mode (regression guard
  for the original fix).

11/11 tests pass in child.test.ts. 6/6 in kill-tree.test.ts.
2026-04-25 17:08:53 -04:00
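
A simplified sketch of the gating described in 4a72e1b990, not the repository's implementation. `killProcessTreeSketch`, `effectiveDetached`, and the injectable `kill` parameter are hypothetical names, and the SIGKILL escalation is omitted. It shows that the negative-pid group kill is only attempted when the child really has its own process group.

```typescript
type KillFn = (pid: number, signal: NodeJS.Signals) => void;

// The child is only truly detached if that was the intent AND the spawn
// did not fall back to the no-detach path.
export function effectiveDetached(useDetached: boolean, usedFallback: boolean): boolean {
  return useDetached && !usedFallback;
}

export function killProcessTreeSketch(
  pid: number,
  opts: { detached?: boolean } = {},
  // Injectable for testing; defaults to the real process.kill.
  kill: KillFn = (p, s) => {
    process.kill(p, s);
  },
): void {
  const detached = opts.detached ?? true; // default preserves group-kill behaviour
  if (detached) {
    try {
      // Negative pid targets the whole process group (POSIX semantics).
      kill(-pid, "SIGTERM");
      return;
    } catch {
      // Group kill failed; fall through to the direct-pid signal.
    }
  }
  try {
    kill(pid, "SIGTERM");
  } catch {
    // Process already exited; nothing to do.
  }
}
```

With `detached: false` the group kill is skipped, so a gateway that shares the child's process group never receives the signal.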
Vincent Koc
ec1f72b6c5 fix(gateway): preserve restart drain for active runs
Fixes https://github.com/openclaw/openclaw/issues/65485
2026-04-25 01:35:47 -07:00
Peter Steinberger
01bc49c88c test: move pty cleanup coverage to adapter 2026-04-24 11:09:55 +01:00
Peter Steinberger
cc9dcd3d69 fix(gateway): prefer linux child OOM victims
Raise eligible Linux child processes' own oom_score_adj from a child-side /bin/sh exec shim so cgroup memory pressure prefers transient workers over the long-lived gateway. Cover supervisor children, PTY shells, MCP stdio servers, and OpenClaw-launched browser processes through the shared process runtime seam.

Harden the wrapper for distroless images, shell startup env, per-child and process-level opt-outs, dash-compatible exec, and leading-dash command names. Document Linux verification and OOM behavior.

Fixes #70404.

Co-authored-by: Neerav Makwana <261249544+neeravmakwana@users.noreply.github.com>
2026-04-23 05:23:40 +01:00
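
A rough sketch of the exec-shim idea in cc9dcd3d69, under stated assumptions. The function name `wrapWithOomScoreAdj`, its options, and the default score of 1000 are hypothetical. The hardening the commit mentions (distroless images, opt-outs, leading-dash command names) is not reproduced here. The wrapper makes the child raise its own /proc/self/oom_score_adj and then exec the real command in the same process.

```typescript
export interface WrappedCommand {
  command: string;
  args: string[];
}

export function wrapWithOomScoreAdj(
  command: string,
  args: string[],
  opts: { platform?: NodeJS.Platform; score?: number } = {},
): WrappedCommand {
  const platform = opts.platform ?? process.platform;
  const score = opts.score ?? 1000; // assumed value; the commit does not state it
  if (platform !== "linux") {
    return { command, args }; // oom_score_adj is a Linux-only knob
  }
  if (!Number.isInteger(score) || score < -1000 || score > 1000) {
    throw new RangeError("oom_score_adj must be an integer in [-1000, 1000]");
  }
  // With `sh -c script cmd a b`, $0 is cmd and "$@" is a b, so the real command
  // never passes through shell parsing. A failed write is ignored on purpose.
  const script =
    `{ echo ${score} > /proc/self/oom_score_adj; } 2>/dev/null; exec "$0" "$@"`;
  return { command: "/bin/sh", args: ["-c", script, command, ...args] };
}
```

Because the shim ends in `exec`, the shell is replaced by the real command, so the pid the supervisor tracks is unchanged.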
Peter Steinberger
0195da6b0e refactor: cache optional runtime imports 2026-04-18 20:45:26 +01:00
Peter Steinberger
4fa961d4f1 refactor(lint): enable map spread rule 2026-04-18 20:37:12 +01:00
Peter Steinberger
c035c5c0d2 refactor: cache lazy runtime imports 2026-04-18 16:18:26 +01:00
Vincent Koc
b22bbf5660 test(process): share shimmed windows success assertions 2026-04-12 09:37:06 +01:00
Vincent Koc
e1e20c424b test(process): share supervisor sigkill wait assertions 2026-04-12 04:52:29 +01:00
Vincent Koc
d262b1c688 fix(logging): split queue diagnostic runtime 2026-04-12 03:45:35 +01:00
Vincent Koc
b9a0052dd0 fix(cycles): split embedded runner and setup leaf types 2026-04-11 14:49:48 +01:00
Peter Steinberger
ebfd468ee0 refactor: simplify typed conversions 2026-04-11 01:01:30 +01:00
Vincent Koc
78d2e9e2a8 fix(ci): repair main type drift 2026-04-10 08:13:02 +01:00
Ayaan Zaidi
c003e982a2 fix(process): drain Windows stdio before exit fallback settle 2026-04-10 10:09:25 +05:30
Ayaan Zaidi
063049c0d4 fix(process): wait for close after Windows exit fallback 2026-04-10 10:09:25 +05:30
Ayaan Zaidi
4b6b1a3ed3 fix(process): settle Windows supervisor waits from exit state 2026-04-10 10:09:25 +05:30
Peter Steinberger
552b5d3859 test: speed up cli and process tests 2026-04-08 00:30:22 +01:00
Peter Steinberger
c3074bd513 refactor: dedupe path lowercase helpers 2026-04-07 15:53:50 +01:00
Peter Steinberger
a20d96ae31 test: stabilize isolated runtime and config suites 2026-04-07 11:41:02 +01:00
Peter Steinberger
371c4147f3 fix: restore ci after rebase drift 2026-04-07 07:36:11 +01:00
Peter Steinberger
0a6fd459f9 refactor: dedupe channel and cli readers 2026-04-07 07:36:11 +01:00
openperf
e777a2b230 fix(process): migrate legacy command-queue singleton missing activeTaskWaiters
After a SIGUSR1 in-process restart following an npm upgrade from v2026.4.2
to v2026.4.5, the globalThis singleton created by the old code version
lacks the activeTaskWaiters field added in v2026.4.5.  resolveGlobalSingleton
returns the stale object as-is, causing notifyActiveTaskWaiters() to call
Array.from(undefined) and crash the gateway in a loop.

Add a schema migration step in getQueueState() that patches the missing
field on legacy singleton objects.  Add a regression test that plants a
v2026.4.2-shaped state object and verifies resetAllLanes() and
waitForActiveTasks() succeed without throwing.

Fixes #61905
2026-04-06 15:41:14 +01:00
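
A minimal sketch of the migration step described in e777a2b230. The state shape, the global key, and the function bodies are assumptions for illustration; only the names getQueueState, notifyActiveTaskWaiters, and activeTaskWaiters come from the commit message. It shows patching a missing field on a singleton that an older code version left on globalThis.

```typescript
// Assumed shape; the real queue state likely carries more fields.
interface QueueState {
  lanes: Map<string, unknown>;
  activeTaskWaiters: Set<() => void>;
}

// Hypothetical global key, for this example only.
export const QUEUE_STATE_KEY = "__exampleCommandQueueState";

export function getQueueState(): QueueState {
  const store = globalThis as unknown as Record<string, Partial<QueueState> | undefined>;
  const state = (store[QUEUE_STATE_KEY] ??= {});
  // Migration: an object created by an older version may lack newer fields,
  // so patch them in place instead of trusting the stale shape.
  if (!state.lanes) state.lanes = new Map();
  if (!state.activeTaskWaiters) state.activeTaskWaiters = new Set();
  return state as QueueState;
}

export function notifyActiveTaskWaiters(): void {
  // Without the migration above, a legacy singleton would make this
  // Array.from(undefined), which throws.
  for (const waiter of Array.from(getQueueState().activeTaskWaiters)) {
    waiter();
  }
}
```

Patching in place rather than replacing the object keeps any references the old code version still holds pointing at the same state.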
Peter Steinberger
edab013e51 fix: support corepack cmd shim on windows 2026-04-06 03:48:47 +01:00
Peter Steinberger
f4fa53de3f fix(ci): repair zalouser sdk path and exec timeout kill 2026-04-04 04:51:33 +01:00
Peter Steinberger
ab318de8b7 test(plugins): finish moving contract coverage 2026-04-04 00:11:39 +01:00
Peter Steinberger
5b29483ab1 test(ci): type-safe exec timeout stub 2026-04-03 22:14:59 +01:00
Peter Steinberger
5a94909654 test(ci): stabilize exec timeout tests 2026-04-03 22:12:08 +01:00
Peter Steinberger
0204b8dd28 fix: stabilize live and docker test lanes 2026-04-03 21:43:36 +01:00
Peter Steinberger
fa6e6603fa test(ci): harden cli and exec tests for shared workers 2026-04-03 21:30:47 +01:00
Vincent Koc
0464435777 fix(ci): align windows builtin mock types 2026-04-04 03:57:48 +09:00
Peter Steinberger
bc23db501b test: trim more core importOriginal usage 2026-04-03 19:49:43 +01:00
Peter Steinberger
3edfc494df test: expand builtin mock helper usage 2026-04-03 18:53:34 +01:00
Peter Steinberger
636a23b73e test: extract node builtin mock helpers 2026-04-03 18:40:28 +01:00
Peter Steinberger
e0580e6863 test: harden shared-worker runtime setup 2026-04-03 18:18:56 +01:00
Vincent Koc
f575bc2bfe test(ci): harden proxy-sensitive and timeout unit tests 2026-04-04 02:12:00 +09:00
Shakker
2fa3a09137 test: harden command queue timer cleanup 2026-04-04 01:07:28 +09:00
Peter Steinberger
ffd34f8896 test: reduce agent test import churn 2026-04-03 04:41:09 +01:00
Peter Steinberger
f03d7c5a4c refactor: centralize Windows exec invocation 2026-04-02 18:27:53 +01:00
lawrence3699
2fd7f7ca52 fix(exec): hide windows console windows 2026-04-03 02:19:32 +09:00