Error Handling & Crash Recovery
A production-grade architectural pattern for intercepting, isolating, and recovering from Web Worker failures without blocking the main thread or corrupting application state. This is part of the broader Debugging, Profiling & Production Optimization discipline — effective crash recovery starts with the same instrumentation you use for profiling.
When workers fail silently in production, the missing link is usually structured telemetry. Once you have the recovery patterns here in place, Production Error Telemetry covers how to route those serialized errors to Sentry and custom endpoints so failures surface before users report them.
Prerequisites
- Workers created with `{ type: 'module' }` for strict scope isolation and source-map support.
- `onerror` and `unhandledrejection` listeners registered inside every worker script before any async code runs.
- A state-tracking variable on the main thread to prevent duplicate recovery sequences.
1. Architecting the Resilient Worker Lifecycle
Establish a fault-tolerant initialization sequence that decouples worker creation from execution. Unlike traditional synchronous error handling, cross-thread failures require explicit message routing and lifecycle hooks.
Implementation Steps:
- Define a worker factory with explicit state tracking (`IDLE`, `RUNNING`, `RECOVERING`, `TERMINATED`).
- Attach global `onerror` and `unhandledrejection` listeners in the worker script before executing any payload.
- Implement a heartbeat mechanism to detect silent hangs before they escalate into crashes.
// main-thread: worker-factory.js
const WORKER_STATES = Object.freeze({
  IDLE: 'IDLE',
  RUNNING: 'RUNNING',
  RECOVERING: 'RECOVERING',
  TERMINATED: 'TERMINATED'
});

export class ResilientWorker {
  constructor(workerUrl) {
    this.workerUrl = workerUrl;
    this.state = WORKER_STATES.IDLE;
    this.id = crypto.randomUUID();
    this.retries = 0;
    this.heartbeatInterval = null;
    this.worker = new Worker(workerUrl, { type: 'module' });
    this.worker.onmessage = this.handleMessage.bind(this);
    this.worker.onerror = this.handleError.bind(this);
    this.worker.onmessageerror = this.handleDeserializationError.bind(this);
  }

  startHeartbeat(intervalMs = 2000) {
    this.heartbeatInterval = setInterval(() => {
      if (this.state === WORKER_STATES.RUNNING) {
        this.worker.postMessage({ type: 'PING' });
      }
    }, intervalMs);
  }

  terminate() {
    this.state = WORKER_STATES.TERMINATED;
    clearInterval(this.heartbeatInterval);
    this.worker.terminate();
  }

  handleMessage(event) {
    // Override in subclass or pass a handler
    console.log('[ResilientWorker] Message:', event.data);
  }

  handleError(event) {
    console.error('[ResilientWorker] Error:', event.message);
  }

  handleDeserializationError(event) {
    console.warn('[ResilientWorker] Deserialization failure:', event);
  }
}
The `onmessageerror` handler fires when the receiving side cannot deserialize an incoming message, for example a `SharedArrayBuffer` that the receiving context is not permitted to share. (A value that cannot be cloned at all fails earlier: `postMessage` throws a `DataCloneError` on the sender.) Without this handler, deserialization failures are completely silent. Always bind it alongside `onerror`.
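The `startHeartbeat` method above sends `PING` messages, but nothing in the listing reacts when replies stop arriving. A minimal sketch of the missing half follows. The `HeartbeatMonitor` class, the `PONG` message type, and the 6000 ms default are assumptions made for illustration, not part of the original design.

```javascript
// Hypothetical helper: tracks the time of the last PONG for one worker.
class HeartbeatMonitor {
  constructor({ timeoutMs = 6000, now = () => Date.now() } = {}) {
    this.timeoutMs = timeoutMs;
    this.now = now; // injectable clock, which keeps the class easy to test
    this.lastPong = now();
  }

  // Call whenever a { type: 'PONG' } message arrives from the worker.
  recordPong() {
    this.lastPong = this.now();
  }

  // True once no PONG has arrived for longer than timeoutMs.
  isStalled() {
    return this.now() - this.lastPong > this.timeoutMs;
  }
}

// Wiring sketch (browser only):
//   worker side:  self.addEventListener('message', (e) => {
//                   if (e.data.type === 'PING') self.postMessage({ type: 'PONG' });
//                 });
//   main thread:  call monitor.recordPong() from handleMessage on 'PONG', and
//                 check monitor.isStalled() inside the setInterval callback.
```

A worker stuck in a synchronous loop cannot answer `PING`, so a stalled monitor is the main thread's cue to terminate and recover it.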
2. Intercepting & Routing Uncaught Exceptions
Standard try/catch blocks cannot capture asynchronous or top-level thread crashes. You must explicitly bind to the worker’s error event and parse the structured error payload. For dedicated execution contexts, Fixing Uncaught Exceptions in Dedicated Workers provides the exact event mapping required to prevent silent thread termination.
Implementation Steps:
- Capture `event.message`, `event.filename`, and `event.lineno` from the error event.
- Serialize the stack trace as a plain object (not an `Error` instance) and transmit it via a dedicated error channel.
- Trigger a graceful teardown sequence to release allocated buffers.
// main-thread: error-routing.js
// Assumes ResilientWorker and WORKER_STATES from worker-factory.js are in scope.
class WorkerManager extends ResilientWorker {
  handleError(event) {
    event.preventDefault(); // Prevent default browser console output in production
    const errorPayload = {
      type: 'FATAL',
      workerId: this.id,
      message: event.message,
      filename: event.filename,
      lineno: event.lineno,
      colno: event.colno,
      timestamp: performance.now() // ms since the page's time origin, not wall-clock time
    };
    console.error('[Worker Crash]', errorPayload);
    this.state = WORKER_STATES.RECOVERING;
    clearInterval(this.heartbeatInterval); // Stop pinging the dead worker
    this.worker.terminate();
    this.initiateRecovery(errorPayload);
  }

  handleDeserializationError(event) {
    console.warn('[Worker Deserialization Failure]', event);
    this.terminate(); // Inherited: clears the heartbeat and marks the worker TERMINATED
  }

  initiateRecovery(errorPayload) {
    // Implement recovery pipeline (see Section 3)
    console.log('Initiating recovery after:', errorPayload.message);
  }
}
Calling `event.preventDefault()` on the `ErrorEvent` marks the error as handled, which stops the browser from also reporting it to the console. That is desirable in production. In development, omit it so DevTools still shows the original error with its source location.
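The listing above covers the main-thread side only. The worker-side half of the steps (global listeners plus plain-object serialization) might look like the sketch below. `serializeError`, `installGlobalHandlers`, and the `WORKER_ERROR` message type are hypothetical names chosen for this example.

```javascript
// Convert any thrown value into a plain object that is safe to post.
function serializeError(value) {
  if (value instanceof Error) {
    return { name: value.name, message: value.message, stack: value.stack ?? null };
  }
  // Promises can reject with non-Error values such as strings.
  return { name: 'NonError', message: String(value), stack: null };
}

// Register both listeners on `scope` (pass `self` inside a worker) and
// forward every report through `post`.
function installGlobalHandlers(scope, post) {
  scope.addEventListener('error', (event) => {
    const error = event.error != null
      ? serializeError(event.error)
      : { name: 'Error', message: String(event.message), stack: null };
    post({ type: 'WORKER_ERROR', origin: 'error', error });
  });

  scope.addEventListener('unhandledrejection', (event) => {
    post({
      type: 'WORKER_ERROR',
      origin: 'unhandledrejection',
      error: serializeError(event.reason)
    });
  });
}

// In a worker script, before any async work starts:
//   installGlobalHandlers(self, (msg) => self.postMessage(msg));
```

Passing `scope` and `post` as parameters instead of reaching for `self` directly keeps the function usable outside a worker, for instance in a unit test with a plain `EventTarget`.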
3. Automatic Restart & State Hydration
Crash recovery requires deterministic state restoration. Implement an exponential backoff retry loop that rehydrates the worker with a serialized snapshot of the last known good state. Monitor heap allocations during restart cycles to avoid compounding memory pressure, as detailed in Identifying Memory Leaks in Workers.
Implementation Steps:
- Maintain a circular buffer of the last N computation checkpoints.
- On crash, spawn a replacement worker and inject the latest checkpoint.
- Validate state integrity before resuming message processing.
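The first and third steps above could be sketched as a small ring buffer that keeps the last N checkpoints and returns the newest one that passes a caller-supplied integrity check. `CheckpointRing` and its method names are hypothetical, and the default capacity of 5 is an arbitrary choice for this example.

```javascript
// Hypothetical fixed-capacity ring buffer of state checkpoints.
class CheckpointRing {
  constructor(capacity = 5) {
    if (!Number.isInteger(capacity) || capacity < 1) {
      throw new RangeError('capacity must be a positive integer');
    }
    this.capacity = capacity;
    this.slots = new Array(capacity);
    this.count = 0; // total checkpoints ever pushed
  }

  push(state) {
    // Clone so later mutation of `state` cannot alter the stored checkpoint.
    this.slots[this.count % this.capacity] = {
      seq: this.count,
      state: structuredClone(state)
    };
    this.count += 1;
  }

  // Newest checkpoint whose state satisfies `isValid`, or null if none does.
  latestValid(isValid = () => true) {
    const stored = Math.min(this.count, this.capacity);
    for (let i = 1; i <= stored; i += 1) {
      const entry = this.slots[(this.count - i) % this.capacity];
      if (isValid(entry.state)) return entry;
    }
    return null;
  }
}
```

Walking backwards from the newest slot means a corrupted final checkpoint costs only one step of progress rather than forcing a cold start.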
// main-thread: recovery-manager.js
const MAX_RETRIES = 5;

async function spawnWithHydration(workerUrl, lastKnownState, retryCount = 0) {
  if (retryCount >= MAX_RETRIES) throw new Error('Worker recovery exhausted');
  const worker = new Worker(workerUrl, { type: 'module' });
  // Snapshot the state now so later main-thread mutations cannot leak into the
  // checkpoint (postMessage clones it again when sending)
  const checkpoint = structuredClone(lastKnownState);
  return new Promise((resolve, reject) => {
    const timeout = setTimeout(() => {
      worker.terminate();
      reject(new Error('Hydration timeout'));
    }, 5000);
    worker.onmessage = (e) => {
      if (e.data.type === 'HYDRATION_ACK') {
        clearTimeout(timeout);
        resolve(worker);
      }
    };
    worker.onerror = (event) => {
      clearTimeout(timeout);
      worker.terminate();
      // Reject with a real Error rather than the raw ErrorEvent
      reject(new Error(event.message || 'Worker failed during hydration'));
    };
    worker.postMessage({ type: 'HYDRATE', payload: checkpoint });
  });
}

// Exponential backoff wrapper
async function retrySpawn(workerUrl, state, retries = 0) {
  try {
    return await spawnWithHydration(workerUrl, state, retries);
  } catch (err) {
    // Give up before sleeping if the next attempt would exceed the limit
    if (retries + 1 >= MAX_RETRIES) throw new Error('Max retries exceeded');
    const delay = Math.min(1000 * Math.pow(2, retries), 10000);
    console.warn(`Retry ${retries + 1} in ${delay}ms`);
    await new Promise(r => setTimeout(r, delay));
    return retrySpawn(workerUrl, state, retries + 1);
  }
}
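The main-thread code above waits for a `HYDRATION_ACK`, but the worker-side handler that produces it is not shown. One possible shape is sketched below. The `version` field, the `isValidCheckpoint` rule, and the `HYDRATION_REJECTED` message type are assumptions for illustration, since the real integrity rules depend on the application's state shape.

```javascript
// Hypothetical integrity check: accept only objects with an integer version.
function isValidCheckpoint(payload) {
  return payload !== null
    && typeof payload === 'object'
    && Number.isInteger(payload.version);
}

let workerState = null;

// Handles one incoming message; `post` stands in for self.postMessage.
// Returns true when the message was a HYDRATE request.
function handleHydrate(message, post) {
  if (!message || message.type !== 'HYDRATE') return false;

  if (!isValidCheckpoint(message.payload)) {
    post({ type: 'HYDRATION_REJECTED', reason: 'Checkpoint failed integrity check' });
    return true;
  }

  workerState = message.payload;
  post({ type: 'HYDRATION_ACK', version: workerState.version });
  return true;
}

// Worker wiring:
//   self.addEventListener('message', (e) => {
//     handleHydrate(e.data, (msg) => self.postMessage(msg));
//   });
```

Because `spawnWithHydration` only listens for `HYDRATION_ACK`, a rejected checkpoint would surface there as the hydration timeout unless a matching branch is added.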
4. Debugging Recovery Flows in Production
Isolating the exact failure point during automated restarts requires targeted profiling. Use the browser’s multi-threaded inspector to trace message queues and monitor thread suspension events. Refer to Chrome DevTools Worker Debugging for configuring source maps and pausing on unhandled rejections across detached threads.
Implementation Steps:
- Enable “Pause on caught exceptions” in the Sources panel.
- Attach a remote debugger to the worker thread by selecting it in the Threads dropdown.
- Record timeline metrics to correlate restart latency with main thread jank.
// worker-thread: debug-instrumentation.js
//# sourceMappingURL=worker.js.map
self.addEventListener('message', async (e) => {
  if (e.data.type === 'RECOVERY_START') {
    performance.mark('worker-recovery-start');
  }
  try {
    await processPayload(e.data);
  } catch (err) {
    // In DevTools with "Pause on caught exceptions" enabled, this halts here
    throw err; // Re-throw so the unhandledrejection boundary can catch it
  } finally {
    const marks = performance.getEntriesByName('worker-recovery-start');
    if (marks.length > 0) {
      performance.mark('worker-recovery-end');
      performance.measure('recovery-latency', 'worker-recovery-start', 'worker-recovery-end');
      performance.clearMarks('worker-recovery-start');
      performance.clearMarks('worker-recovery-end');
    }
  }
});
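The handler above records `recovery-latency` measures but never reads them back. A small sketch of how they might be summarised before being reported is shown here. `summarizeLatency` and the `RECOVERY_METRICS` message type are hypothetical, and the percentiles use a simple index-based pick rather than interpolation.

```javascript
// Summarise PerformanceMeasure-like entries (anything with a numeric `duration`).
function summarizeLatency(entries) {
  const durations = entries.map((e) => e.duration).sort((a, b) => a - b);
  if (durations.length === 0) return null;

  // Index-based percentile, clamped to the last element.
  const pick = (q) =>
    durations[Math.min(durations.length - 1, Math.floor(q * durations.length))];

  return {
    count: durations.length,
    p50: pick(0.5),
    p95: pick(0.95),
    max: durations[durations.length - 1]
  };
}

// Inside the worker, after one or more recoveries:
//   const entries = performance.getEntriesByName('recovery-latency', 'measure');
//   const summary = summarizeLatency(entries);
//   if (summary) self.postMessage({ type: 'RECOVERY_METRICS', summary });
```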
5. Sandboxing & Boundary Isolation Strategies
When integrating external computation scripts, untrusted code can trigger cascading failures. Wrap third-party logic in a strict execution boundary that validates inputs and caps resource consumption.
Implementation Steps:
- Validate incoming `postMessage` payloads against a strict schema before execution.
- Enforce execution timeouts using `AbortController` and `setTimeout`.
- Route boundary violations to a quarantine queue instead of crashing.
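The first and third steps above could be sketched as a per-type field schema plus an in-memory quarantine list. The message types (`RESIZE`, `HASH`), their fields, and the names `SCHEMA`, `validateAgainstSchema`, and `admit` are all invented for this example.

```javascript
// Hypothetical schema: message type -> required field -> predicate.
const SCHEMA = {
  RESIZE: { width: Number.isInteger, height: Number.isInteger },
  HASH: { input: (v) => typeof v === 'string' }
};

function validateAgainstSchema(data, schema = SCHEMA) {
  if (data === null || typeof data !== 'object') {
    return { ok: false, reason: 'Payload is not an object' };
  }
  // Own-property lookup so names like "__proto__" are treated as unknown types.
  const known = typeof data.type === 'string'
    && Object.prototype.hasOwnProperty.call(schema, data.type);
  if (!known) {
    return { ok: false, reason: `Unknown message type: ${String(data.type)}` };
  }
  for (const [name, check] of Object.entries(schema[data.type])) {
    if (!check(data[name])) return { ok: false, reason: `Invalid field: ${name}` };
  }
  return { ok: true, value: data };
}

// Rejected payloads are recorded here instead of throwing.
const quarantine = [];

function admit(data) {
  const verdict = validateAgainstSchema(data);
  if (!verdict.ok) {
    quarantine.push({ reason: verdict.reason, receivedAt: Date.now() });
  }
  return verdict.ok;
}
```

Returning a verdict object instead of throwing keeps a malformed message from ever reaching the worker's global error handlers.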
// worker-thread: boundary-guard.js
const MAX_EXECUTION_MS = 8000;

function validatePayload(data) {
  if (!data || typeof data !== 'object' || !('type' in data)) {
    throw new TypeError('Invalid payload structure');
  }
  return data;
}

self.addEventListener('message', (e) => {
  const controller = new AbortController();
  const timeout = setTimeout(() => controller.abort(), MAX_EXECUTION_MS);
  try {
    const payload = validatePayload(e.data);
    executeTask(payload, controller.signal)
      .then(result => {
        clearTimeout(timeout);
        self.postMessage({ type: 'SUCCESS', result });
      })
      .catch(err => {
        clearTimeout(timeout);
        const reason = err.name === 'AbortError' ? 'EXECUTION_TIMEOUT' : err.message;
        self.postMessage({ type: 'QUARANTINE', reason });
      });
  } catch (validationErr) {
    clearTimeout(timeout);
    self.postMessage({ type: 'QUARANTINE', reason: validationErr.message });
  }
});
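`executeTask` is not defined in the listing above, and the timeout only works if the task cooperates: an `AbortSignal` cannot interrupt code that never checks it or never yields. A minimal sketch of a cooperative task follows. The summing workload, the `payload.values` field, and the `chunkSize` parameter are placeholders invented for this example.

```javascript
// Hypothetical task: sums payload.values in chunks, checking the signal
// before each chunk and yielding to the event loop after it.
async function executeTask(payload, signal, chunkSize = 1000) {
  let total = 0;
  for (let i = 0; i < payload.values.length; i += chunkSize) {
    if (signal.aborted) {
      const err = new Error('Task aborted');
      err.name = 'AbortError'; // matches the err.name check in the guard
      throw err;
    }
    for (const v of payload.values.slice(i, i + chunkSize)) total += v;
    // Yield so pending timers, including the abort timeout, get a chance to run.
    await new Promise((resolve) => setTimeout(resolve, 0));
  }
  return total;
}
```

A task that runs one long synchronous loop would block the worker's own timers, so in that case only the main thread can enforce the limit, by calling `worker.terminate()`.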
6. Performance & Serialization Trade-offs
Robust error handling introduces measurable overhead. Checkpointing state requires deep cloning or transferring buffers, which impacts both latency and memory. Structured cloning is safe but CPU-intensive for large datasets. Transferable objects eliminate copy costs but render the original buffer unusable on the sender thread. Balance recovery granularity with serialization costs by implementing incremental snapshots.
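The difference between cloning and transferring can be seen without a worker, because `structuredClone` accepts the same `transfer` option that `postMessage` uses. The snippet below is a small demonstration of the semantics, not a benchmark.

```javascript
const original = new Uint8Array([1, 2, 3, 4]).buffer;

// Clone: the bytes are duplicated and both buffers stay usable.
const copied = structuredClone(original);

// Transfer: ownership moves to the new buffer and the source is detached.
const moved = structuredClone(original, { transfer: [original] });

console.log(copied.byteLength);   // 4
console.log(moved.byteLength);    // 4
console.log(original.byteLength); // 0 (detached)

// worker.postMessage(buffer, [buffer]) applies the same transfer semantics
// across threads.
```

A checkpoint that is transferred therefore cannot also be kept on the sending thread, which is why full-state checkpoints usually end up cloned.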
// worker-thread: delta-checkpoint.js
let previousState = null;

// Shallow diff: returns the top-level keys whose values changed.
// Nested objects always count as changed, and deleted keys are not reported.
function computeDiff(current, previous) {
  if (!previous) return current;
  const delta = {};
  for (const key in current) {
    if (current[key] !== previous[key]) {
      delta[key] = current[key];
    }
  }
  return delta;
}

let lastCheckpointTime = 0;
const CHECKPOINT_INTERVAL_MS = 500;

self.addEventListener('message', (e) => {
  const now = performance.now();
  const result = processData(e.data);
  // Throttle: emit at most one snapshot per interval
  if (now - lastCheckpointTime > CHECKPOINT_INTERVAL_MS) {
    const delta = computeDiff(result, previousState);
    self.postMessage({ type: 'DELTA_SNAPSHOT', data: delta });
    previousState = structuredClone(result);
    lastCheckpointTime = now;
  }
});
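On the receiving side, the `DELTA_SNAPSHOT` messages have to be folded back into a full state before they can serve as a checkpoint. A minimal sketch, assuming flat state objects like the ones `computeDiff` compares; `applyDelta` and `replay` are hypothetical helpers.

```javascript
// Merge one shallow delta into the previous full snapshot without mutating it.
function applyDelta(snapshot, delta) {
  return { ...(snapshot ?? {}), ...delta };
}

// Rebuild the latest state from an ordered list of deltas.
function replay(deltas) {
  return deltas.reduce(applyDelta, null);
}

// Main-thread wiring:
//   let latest = null;
//   worker.onmessage = (e) => {
//     if (e.data.type === 'DELTA_SNAPSHOT') latest = applyDelta(latest, e.data.data);
//   };
```

Because the diff never reports removed keys, a key the worker deletes would linger in the merged state. This sketch assumes keys are only added or updated.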
Browser Compatibility
| Feature | Chrome | Firefox | Safari | Edge |
|---|---|---|---|---|
| `worker.onerror` | 4+ | 3.5+ | 4+ | 12+ |
| `worker.onmessageerror` | 60+ | 57+ | 12+ | 18+ |
| `unhandledrejection` in worker | 66+ | 58+ | 11+ | 79+ |
| `AbortController` in worker | 66+ | 57+ | 11.1+ | 16+ |
| `structuredClone()` | 98+ | 94+ | 15.4+ | 98+ |
| `performance.mark()` in worker | 43+ | 40+ | 11+ | 79+ |