OpenClaw merged a P1 reliability fix tonight for a failure mode that could make group and channel sessions go silent after a provider-side stream abort. PR #90908 changes how model fallback treats AbortError exceptions raised when an LLM provider drops a stream, and it forwards caller abort signals through the compaction and cron execution paths.
The short version: OpenClaw now separates "the provider connection died" from "the user or gateway intentionally cancelled this run." That distinction is small in code and large in production behavior.
The Failure Mode
The bug appeared when an LLM API closed the connection mid-stream and the fetch layer surfaced AbortError("This operation was aborted") without an external abort signal. OpenClaw's previous guard did not cleanly separate that provider-side abort from a caller-driven cancellation.
According to the PR, those errors entered the fallback path but were not retried correctly. In group or channel sessions, the failure could bubble into a silent reply token, leaving the topic with no visible response.
That is especially rough for shared channels. A user sees activity begin, the provider connection drops, and then the conversation simply stops producing a useful reply. If fallback models are configured, provider-side stream aborts should be one of the cases where fallback earns its keep.
What Changed
The core fix lands in OpenClaw's model fallback logic. Before normalizing the error for failover, OpenClaw now checks whether the error is an abort and whether the caller's abort signal is actually aborted.
If the caller really cancelled the run, OpenClaw stops immediately. If the abort came from the provider side and no external abort signal is active, the error can fall through to the next fallback candidate.
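The two-part check described above can be sketched in a few lines of TypeScript. This is only an illustration under stated assumptions, not OpenClaw's actual code: the names isAbortError, classifyAbort, AbortSignalLike, and AbortDecision are hypothetical, and the real guard may be structured differently.

```typescript
// Hypothetical sketch of the guard; none of these names come from the PR.

// Minimal structural stand-in for AbortSignal, so the sketch has no DOM dependency.
type AbortSignalLike = { readonly aborted: boolean };

type AbortDecision = "rethrow" | "try-next-candidate" | "not-an-abort";

// True when the value looks like an abort raised by fetch or a stream reader.
function isAbortError(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { name?: unknown }).name === "AbortError"
  );
}

// Decide what fallback should do with an error before normalizing it for failover.
function classifyAbort(err: unknown, callerSignal?: AbortSignalLike): AbortDecision {
  if (!isAbortError(err)) return "not-an-abort";
  // The caller (user, gateway, cron timeout) really cancelled: stop immediately.
  if (callerSignal?.aborted) return "rethrow";
  // Abort with no active external signal: treat it as a provider-side failure.
  return "try-next-candidate";
}

// Example: a provider-side abort, shaped like the error described in the article.
const providerAbort = Object.assign(new Error("This operation was aborted"), {
  name: "AbortError",
});
classifyAbort(providerAbort, { aborted: false }); // "try-next-candidate"
classifyAbort(providerAbort, { aborted: true }); // "rethrow"
```

The point of the sketch is that the error object alone cannot settle the question. Both cases carry the same AbortError name, so the caller's signal state is what distinguishes them.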
The PR also threads abort signals into two important call paths:
- Compaction runs now pass their abort signal into runWithModelFallback.
- Cron executor runs now pass their abort signal into the same fallback path.
That second change matters for scheduled agents. A cron timeout or explicit cancellation should not keep retrying across fallback candidates as if it were a flaky provider. Conversely, a provider-side abort during a cron run should still be eligible for recovery when fallback is configured.
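To show why forwarding the signal matters, here is a rough sketch of a fallback loop and a cron-style caller. It assumes a simplified shape: runWithFallbackSketch, runCronJobSketch, RunAttempt, and the model names are all hypothetical, and the real runWithModelFallback signature is not described in the article.

```typescript
// Hypothetical sketch; not the signature of OpenClaw's runWithModelFallback.

type AbortSignalLike = { readonly aborted: boolean };
type RunAttempt<T> = (model: string, signal?: AbortSignalLike) => Promise<T>;

function isAbortError(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { name?: unknown }).name === "AbortError"
  );
}

// Try each candidate in order; stop only when the caller's own signal is aborted.
async function runWithFallbackSketch<T>(opts: {
  candidates: string[];
  signal?: AbortSignalLike;
  run: RunAttempt<T>;
}): Promise<T> {
  let lastError: unknown = new Error("no fallback candidates configured");
  for (const model of opts.candidates) {
    try {
      return await opts.run(model, opts.signal);
    } catch (err) {
      // Caller-driven cancellation: rethrow instead of cycling through models.
      if (isAbortError(err) && opts.signal?.aborted) throw err;
      // Anything else, including a provider-side abort, moves to the next candidate.
      lastError = err;
    }
  }
  throw lastError;
}

// A cron-style caller forwards its timeout/cancellation signal into the loop.
// Without the forwarded signal, the loop could not tell a timeout from a flaky provider.
async function runCronJobSketch(
  signal: AbortSignalLike,
  run: RunAttempt<string>,
): Promise<string> {
  return runWithFallbackSketch({
    candidates: ["primary-model", "backup-model"],
    signal,
    run,
  });
}
```

In this sketch, a compaction caller would look the same as the cron wrapper: it passes along whatever signal it was given rather than starting the fallback loop without one.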
Why It Matters
OpenClaw increasingly runs in long-lived channels, scheduled jobs, background workflows, and multi-provider setups. In that environment, abort semantics are not just a code hygiene issue. They decide whether a failed provider stream becomes a recoverable retry, a correct cancellation, or a confusing silent stop.
The new behavior is easier to reason about:
- Provider-side abort with no external signal: try the next fallback candidate.
- User or gateway cancellation: stop immediately.
- Cron timeout or cancellation: stop instead of silently cycling models.
- Compaction cancellation: respect the caller's abort state.
For operators, that should reduce one of the more frustrating classes of failures: a channel appears active, consumes work, and then says nothing.
Validation
The PR reports 152 additions across five changed files, including regression coverage in model-fallback.test.ts and cron model override forwarding tests. The real-behavior proof was run against a patched production distribution module on macOS after restarting the OpenClaw gateway.
The proof confirmed that provider-side aborts fall through for retry, user cancellations rethrow immediately, and both compaction and cron forward their abort signals. The PR notes that a full live reverse-proxy idle eviction was not reproduced end to end; instead, the guard logic was verified against the production module code.
This is the kind of reliability fix that may not show up as a flashy feature, but it tightens a critical runtime promise: fallback should help when a provider fails, and cancellation should still mean stop.
