Error Handling & Crash Recovery

A production-grade architectural pattern for intercepting, isolating, and recovering from Web Worker failures without blocking the main thread or corrupting application state. This is part of the broader Debugging, Profiling & Production Optimization discipline — effective crash recovery starts with the same instrumentation you use for profiling.

When workers fail silently in production, the missing link is usually structured telemetry. Once you have the recovery patterns here in place, Production Error Telemetry covers how to route those serialized errors to Sentry and custom endpoints so failures surface before users report them.

Prerequisites

  • Workers created with { type: 'module' } for strict scope isolation and source-map support.
  • onerror and unhandledrejection listeners registered inside every worker script before any async code runs.
  • A state-tracking variable on the main thread to prevent duplicate recovery sequences.
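
The second prerequisite can be sketched as follows. This is a minimal sketch, not a definitive implementation: the file name, the `WORKER_ERROR` message type, and the helper names `serializeError` and `installErrorForwarding` are assumptions made for illustration. The worker scope is passed in as a parameter so the logic can be exercised with a stub outside a worker.

```javascript
// worker-thread: error-forwarding.js (hypothetical file name)

// Convert any thrown value into a plain, structured-cloneable object.
function serializeError(value) {
  const isError = value instanceof Error;
  return {
    name: isError ? value.name : 'NonErrorThrown',
    message: isError ? value.message : String(value),
    stack: isError && value.stack ? value.stack : null
  };
}

// `scope` is the worker global (`self`); it is a parameter so the function
// can be called with a stub object in tests.
function installErrorForwarding(scope) {
  scope.addEventListener('error', (event) => {
    // event.error can be null, so fall back to the message string
    const payload = serializeError(event.error ?? event.message);
    scope.postMessage({ type: 'WORKER_ERROR', source: 'error', error: payload });
  });

  scope.addEventListener('unhandledrejection', (event) => {
    scope.postMessage({
      type: 'WORKER_ERROR',
      source: 'unhandledrejection',
      error: serializeError(event.reason)
    });
  });
}

// Register before any async work starts (only when running inside a worker).
if (typeof WorkerGlobalScope !== 'undefined' && self instanceof WorkerGlobalScope) {
  installErrorForwarding(self);
}
```

Placing the registration at the top of the worker script, ahead of any imports that start async work, keeps early failures from slipping past the listeners.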

1. Architecting the Resilient Worker Lifecycle

Establish a fault-tolerant initialization sequence that decouples worker creation from execution. Unlike traditional synchronous error handling, cross-thread failures require explicit message routing and lifecycle hooks.

Implementation Steps:

  1. Define a worker factory with explicit state tracking (IDLE, RUNNING, RECOVERING, TERMINATED).
  2. Attach global onerror and unhandledrejection listeners in the worker script before executing any payload.
  3. Implement a heartbeat mechanism to detect silent hangs before they escalate into crashes.
// main-thread: worker-factory.js
const WORKER_STATES = Object.freeze({
  IDLE: 'IDLE',
  RUNNING: 'RUNNING',
  RECOVERING: 'RECOVERING',
  TERMINATED: 'TERMINATED'
});

export class ResilientWorker {
  constructor(workerUrl) {
    this.workerUrl = workerUrl;
    this.state = WORKER_STATES.IDLE;
    this.id = crypto.randomUUID();
    this.retries = 0;
    this.heartbeatInterval = null;

    this.worker = new Worker(workerUrl, { type: 'module' });
    this.worker.onmessage = this.handleMessage.bind(this);
    this.worker.onerror = this.handleError.bind(this);
    this.worker.onmessageerror = this.handleDeserializationError.bind(this);
  }

  startHeartbeat(intervalMs = 2000) {
    this.heartbeatInterval = setInterval(() => {
      if (this.state === WORKER_STATES.RUNNING) {
        this.worker.postMessage({ type: 'PING' });
      }
    }, intervalMs);
  }

  terminate() {
    this.state = WORKER_STATES.TERMINATED;
    clearInterval(this.heartbeatInterval);
    this.worker.terminate();
  }

  handleMessage(event) {
    // Override in subclass or pass a handler
    console.log('[ResilientWorker] Message:', event.data);
  }

  handleError(event) {
    console.error('[ResilientWorker] Error:', event.message);
  }

  handleDeserializationError(event) {
    console.warn('[ResilientWorker] Deserialization failure:', event);
  }
}
onmessageerror is often overlooked

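The `onmessageerror` handler fires when a message arrives but cannot be deserialized on the receiving side, for example when it contains a SharedArrayBuffer that crosses an agent-cluster boundary. (Posting a non-cloneable value such as a function fails earlier: postMessage throws a DataCloneError on the sender.) Without this handler, deserialization failures are completely silent. Always bind it alongside `onerror`.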
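
The heartbeat in the factory above only sends PING messages; detecting a hang also requires noticing that replies have stopped. A minimal sketch of that bookkeeping follows. The `HeartbeatWatchdog` class, the `PONG` reply type, and the injectable clock are assumptions for illustration, not part of the factory as written.

```javascript
// Tracks the last PONG and decides when a worker should be treated as hung.
class HeartbeatWatchdog {
  constructor(timeoutMs, now = () => performance.now()) {
    this.timeoutMs = timeoutMs;
    this.now = now; // injectable clock, useful for tests
    this.lastPongAt = now();
  }

  // Call from the main-thread message handler when { type: 'PONG' } arrives.
  recordPong() {
    this.lastPongAt = this.now();
  }

  // True once no PONG has been seen for longer than timeoutMs.
  isHung() {
    return this.now() - this.lastPongAt > this.timeoutMs;
  }
}

// Worker-side counterpart (assumed protocol), shown as a comment because it
// needs a worker scope:
// self.addEventListener('message', (e) => {
//   if (e.data?.type === 'PING') self.postMessage({ type: 'PONG' });
// });
```

One way to wire this in is to call `isHung()` inside the heartbeat interval before sending the next PING, and to route a positive result into the same recovery path as a crash. A worker stuck in a synchronous loop cannot process PING, so the missing PONG is the signal.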

2. Intercepting & Routing Uncaught Exceptions

A try/catch on the main thread cannot catch an error thrown inside a worker: the exception is raised on another thread and surfaces only as an error event on the Worker object. You must explicitly bind to the worker’s error event and parse the structured error payload. For dedicated execution contexts, Fixing Uncaught Exceptions in Dedicated Workers provides the exact event mapping required to prevent silent thread termination.

Implementation Steps:

  1. Capture event.message, event.filename, and event.lineno from the error event.
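  2. Serialize the stack trace as a plain object (not an Error instance) and transmit it via a dedicated error channel. Capture it inside the worker, because the error event delivered to the main thread carries only the message and source position, not the stack.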
  3. Trigger a graceful teardown sequence to release allocated buffers.
// main-thread: error-routing.js
class WorkerManager extends ResilientWorker {
  handleError(event) {
    event.preventDefault(); // Prevent default browser console output in production

    const errorPayload = {
      type: 'FATAL',
      workerId: this.id,
      message: event.message,
      filename: event.filename,
      lineno: event.lineno,
      colno: event.colno,
      timestamp: performance.now()
    };

    console.error('[Worker Crash]', errorPayload);

    this.state = WORKER_STATES.RECOVERING;
    clearInterval(this.heartbeatInterval); // Stop the heartbeat timer for the dead worker
    this.worker.terminate();
    this.initiateRecovery(errorPayload);
  }

  handleDeserializationError(event) {
    console.warn('[Worker Deserialization Failure]', event);
    // Inherited terminate() sets TERMINATED, clears the heartbeat, and stops the thread
    this.terminate();
  }

  initiateRecovery(errorPayload) {
    // Implement recovery pipeline (see Section 3)
    console.log('Initiating recovery after:', errorPayload.message);
  }
}
Performance

Calling event.preventDefault() on the `ErrorEvent` marks the error as handled, which stops Chrome from also reporting it as an uncaught error in the console. That is desirable in production. In development, omit it so DevTools still shows the original error with its source location.
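
The payload built in handleError assumes the event always carries message, filename, lineno, and colno. When the worker script cannot be fetched, the error event is a plain Event without those fields, so a defensive builder can help. The sketch below is one possible shape: the `buildErrorPayload` name and the fallback message are assumptions, and it uses Date.now() instead of performance.now() so timestamps are comparable across threads.

```javascript
// Build a plain payload from whatever the Worker's error event provides.
function buildErrorPayload(event, workerId) {
  return {
    type: 'FATAL',
    workerId,
    // A script that fails to load dispatches an event without these fields
    message: event.message ?? 'Worker script failed to load',
    filename: event.filename ?? null,
    lineno: event.lineno ?? null,
    colno: event.colno ?? null,
    timestamp: Date.now() // wall-clock time, comparable across threads
  };
}
```

Inside handleError this would replace the inline object literal, for example `const errorPayload = buildErrorPayload(event, this.id);`.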

3. Automatic Restart & State Hydration

Crash recovery requires deterministic state restoration. Implement an exponential backoff retry loop that rehydrates the worker with a serialized snapshot of the last known good state. Monitor heap allocations during restart cycles to avoid compounding memory pressure, as detailed in Identifying Memory Leaks in Workers.

Implementation Steps:

  1. Maintain a circular buffer of the last N computation checkpoints.
  2. On crash, spawn a replacement worker and inject the latest checkpoint.
  3. Validate state integrity before resuming message processing.
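
Step 1 can be sketched as a small fixed-capacity ring buffer. The `CheckpointRing` class and its method names are hypothetical; the checkpoints themselves are treated as opaque values.

```javascript
// Fixed-capacity ring buffer that keeps only the most recent checkpoints.
class CheckpointRing {
  constructor(capacity) {
    if (!Number.isInteger(capacity) || capacity < 1) {
      throw new RangeError('capacity must be a positive integer');
    }
    this.slots = new Array(capacity);
    this.count = 0; // total number of checkpoints ever pushed
  }

  push(checkpoint) {
    // Overwrites the oldest slot once the buffer is full
    this.slots[this.count % this.slots.length] = checkpoint;
    this.count += 1;
  }

  // Most recent checkpoint, or undefined if nothing was stored yet
  latest() {
    if (this.count === 0) return undefined;
    return this.slots[(this.count - 1) % this.slots.length];
  }

  // Stored checkpoints ordered from oldest to newest
  toArray() {
    const size = Math.min(this.count, this.slots.length);
    const result = [];
    for (let i = this.count - size; i < this.count; i++) {
      result.push(this.slots[i % this.slots.length]);
    }
    return result;
  }
}
```

Keeping more than one checkpoint allows falling back to an older snapshot if the newest one fails validation during hydration.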
// main-thread: recovery-manager.js
async function spawnWithHydration(workerUrl, lastKnownState, retryCount = 0) {
  const MAX_RETRIES = 5;
  if (retryCount >= MAX_RETRIES) throw new Error('Worker recovery exhausted');

  const worker = new Worker(workerUrl, { type: 'module' });
  // Deep clone to ensure thread-safe isolation of the main thread state
  const checkpoint = structuredClone(lastKnownState);

  return new Promise((resolve, reject) => {
    const timeout = setTimeout(() => {
      worker.terminate();
      reject(new Error('Hydration timeout'));
    }, 5000);

    worker.onmessage = (e) => {
      if (e.data.type === 'HYDRATION_ACK') {
        clearTimeout(timeout);
        resolve(worker);
      }
    };

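    worker.onerror = (event) => {
      clearTimeout(timeout);
      worker.terminate();
      // Reject with a real Error; the raw event has no message when the script fails to load
      reject(new Error(event.message || 'Worker failed during hydration'));
    };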

    worker.postMessage({ type: 'HYDRATE', payload: checkpoint });
  });
}

// Exponential backoff wrapper
async function retrySpawn(workerUrl, state, retries = 0) {
  try {
    return await spawnWithHydration(workerUrl, state, retries);
  } catch (err) {
    const MAX_RETRIES = 5;
    if (retries >= MAX_RETRIES) throw new Error('Max retries exceeded');
    const delay = Math.min(1000 * Math.pow(2, retries), 10000);
    console.warn(`Retry ${retries + 1} in ${delay}ms`);
    await new Promise(r => setTimeout(r, delay));
    return retrySpawn(workerUrl, state, retries + 1);
  }
}
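
spawnWithHydration expects the worker to answer a HYDRATE message with HYDRATION_ACK. The worker side is not shown above; one possible sketch follows. The file name, the `HYDRATION_REJECTED` message type, and the checkpoint shape (a plain object with an integer `version`) are assumptions for illustration only.

```javascript
// worker-thread: hydration-handler.js (hypothetical file name)

// Assumed checkpoint shape: a plain object with an integer `version` field.
function isValidCheckpoint(checkpoint) {
  return checkpoint !== null &&
    typeof checkpoint === 'object' &&
    Number.isInteger(checkpoint.version);
}

let restoredState = null;

// `post` stands in for self.postMessage so the handler can run outside a worker.
// Returns true when the message was a HYDRATE request.
function handleHydrate(message, post) {
  if (message?.type !== 'HYDRATE') return false;

  if (!isValidCheckpoint(message.payload)) {
    // No ACK is sent, so the main thread's hydration timeout triggers a retry
    post({ type: 'HYDRATION_REJECTED', reason: 'Invalid checkpoint' });
    return true;
  }

  restoredState = message.payload;
  post({ type: 'HYDRATION_ACK' });
  return true;
}

// In the worker:
// self.addEventListener('message', (e) => handleHydrate(e.data, (m) => self.postMessage(m)));
```

Validating before acknowledging covers step 3: the worker resumes message processing only after the checkpoint has passed the integrity check.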

4. Debugging Recovery Flows in Production

Isolating the exact failure point during automated restarts requires targeted profiling. Use the browser’s multi-threaded inspector to trace message queues and monitor thread suspension events. Refer to Chrome DevTools Worker Debugging for configuring source maps and pausing on unhandled rejections across detached threads.

Implementation Steps:

  1. Enable “Pause on caught exceptions” in the Sources panel.
  2. Attach a remote debugger to the worker thread by selecting it in the Threads dropdown.
  3. Record timeline metrics to correlate restart latency with main thread jank.
// worker-thread: debug-instrumentation.js
//# sourceMappingURL=worker.js.map

self.addEventListener('message', async (e) => {
  if (e.data.type === 'RECOVERY_START') {
    performance.mark('worker-recovery-start');
  }

  try {
    await processPayload(e.data);
  } catch (err) {
    // In DevTools with "Pause on caught exceptions" enabled, this halts here
    throw err; // Re-throw so the unhandledrejection boundary can catch it
  } finally {
    const marks = performance.getEntriesByName('worker-recovery-start');
    if (marks.length > 0) {
      performance.mark('worker-recovery-end');
      performance.measure('recovery-latency', 'worker-recovery-start', 'worker-recovery-end');
      performance.clearMarks('worker-recovery-start');
      performance.clearMarks('worker-recovery-end');
    }
  }
});
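
The instrumentation above records a recovery-latency measure but never reads it. A small helper can close the interval and return the duration so it can be reported. This is a sketch under assumptions: the `measureRecoveryLatency` name and the `RECOVERY_LATENCY` message type are hypothetical, and the helper expects a `worker-recovery-start` mark to exist already (performance.measure throws if it does not).

```javascript
// Close the recovery interval and return its duration in milliseconds.
function measureRecoveryLatency() {
  performance.mark('worker-recovery-end');
  // Recent engines return the PerformanceMeasure; older ones return undefined,
  // so fall back to looking the entry up by name.
  const entry =
    performance.measure('recovery-latency', 'worker-recovery-start', 'worker-recovery-end') ??
    performance.getEntriesByName('recovery-latency').pop();

  // Clean up so the next recovery cycle starts from a fresh set of entries
  performance.clearMarks('worker-recovery-start');
  performance.clearMarks('worker-recovery-end');
  performance.clearMeasures('recovery-latency');
  return entry.duration;
}

// In the worker, after processing the first payload of a recovery cycle:
// self.postMessage({ type: 'RECOVERY_LATENCY', ms: measureRecoveryLatency() });
```

Posting the number to the main thread makes it possible to line restart latency up against main-thread timeline recordings.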

5. Sandboxing & Boundary Isolation Strategies

When integrating external computation scripts, untrusted code can trigger cascading failures. Wrap third-party logic in a strict execution boundary that validates inputs and caps resource consumption.

Implementation Steps:

  1. Validate incoming postMessage payloads against a strict schema before execution.
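  2. Enforce execution timeouts using AbortController and setTimeout. The task must check the signal and yield to the event loop, because a timer cannot interrupt synchronous code running on the same thread.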
  3. Route boundary violations to a quarantine queue instead of crashing.
// worker-thread: boundary-guard.js
const MAX_EXECUTION_MS = 8000;

function validatePayload(data) {
  if (!data || typeof data !== 'object' || !('type' in data)) {
    throw new TypeError('Invalid payload structure');
  }
  return data;
}

self.addEventListener('message', (e) => {
  const controller = new AbortController();
  const timeout = setTimeout(() => controller.abort(), MAX_EXECUTION_MS);

  try {
    const payload = validatePayload(e.data);

    executeTask(payload, controller.signal)
      .then(result => {
        clearTimeout(timeout);
        self.postMessage({ type: 'SUCCESS', result });
      })
      .catch(err => {
        clearTimeout(timeout);
        const reason = err.name === 'AbortError' ? 'EXECUTION_TIMEOUT' : err.message;
        self.postMessage({ type: 'QUARANTINE', reason });
      });
  } catch (validationErr) {
    clearTimeout(timeout);
    self.postMessage({ type: 'QUARANTINE', reason: validationErr.message });
  }
});
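
The guard above calls executeTask without defining it, and the timeout only works if that function cooperates. The sketch below shows one hypothetical shape: it sums `payload.items` in chunks, checks the signal at each chunk boundary, and yields to the event loop so the abort timer gets a chance to run. The payload shape and the `chunkSize` parameter are assumptions for illustration.

```javascript
// Hypothetical executeTask: sums payload.items in chunks, yielding between chunks.
async function executeTask(payload, signal, chunkSize = 1000) {
  const items = payload.items ?? [];
  let total = 0;

  for (let start = 0; start < items.length; start += chunkSize) {
    // Cooperative cancellation point
    if (signal.aborted) {
      throw new DOMException('Task aborted', 'AbortError');
    }
    for (const value of items.slice(start, start + chunkSize)) {
      total += value;
    }
    // Yield to the event loop so a pending abort timer can actually fire
    await new Promise((resolve) => setTimeout(resolve, 0));
  }
  return total;
}
```

A task that runs as one long synchronous loop would never reach the abort check, and the setTimeout callback in the guard could not run until the loop had already finished.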

6. Performance & Serialization Trade-offs

Robust error handling introduces measurable overhead. Checkpointing state requires deep cloning or transferring buffers, which impacts both latency and memory. Structured cloning is safe but CPU-intensive for large datasets. Transferable objects eliminate copy costs but render the original buffer unusable on the sender thread. Balance recovery granularity with serialization costs by implementing incremental snapshots.
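
The difference between copying and transferring can be observed directly with structuredClone, which follows the same rules as postMessage. A short illustration (variable names are arbitrary):

```javascript
// Copy vs. transfer of a 1 MiB buffer.
const copiedSource = new ArrayBuffer(1024 * 1024);
const copy = structuredClone(copiedSource);
// Copy: both buffers stay usable, so memory use doubles
const sizesAfterCopy = [copiedSource.byteLength, copy.byteLength];

const transferSource = new ArrayBuffer(1024 * 1024);
const moved = structuredClone(transferSource, { transfer: [transferSource] });
// Transfer: the sender's buffer is detached and reports a length of 0
const sizesAfterTransfer = [transferSource.byteLength, moved.byteLength];

console.log(sizesAfterCopy);     // [ 1048576, 1048576 ]
console.log(sizesAfterTransfer); // [ 0, 1048576 ]
```

For checkpointing, this means a transferred buffer cannot double as the worker's live working copy; the worker would need to allocate a fresh buffer or receive the old one back.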

// worker-thread: delta-checkpoint.js
let previousState = null;

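// Shallow, top-level diff: nested objects are compared by reference,
// and keys removed from `current` are not recorded in the delta.
function computeDiff(current, previous) {
  if (!previous) return current; // First checkpoint: send the full state
  const delta = {};
  for (const key of Object.keys(current)) {
    if (current[key] !== previous[key]) {
      delta[key] = current[key];
    }
  }
  return delta;
}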

let lastCheckpointTime = 0;
const CHECKPOINT_INTERVAL_MS = 500;

self.addEventListener('message', (e) => {
  const now = performance.now();
  const result = processData(e.data);

  if (now - lastCheckpointTime > CHECKPOINT_INTERVAL_MS) {
    const delta = computeDiff(result, previousState);
    self.postMessage({ type: 'DELTA_SNAPSHOT', data: delta });
    previousState = structuredClone(result);
    lastCheckpointTime = now;
  }
});
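
On the receiving side, the deltas have to be folded back into a full checkpoint before they can be used for hydration. A minimal sketch, assuming the worker's state is a flat plain object and that the main thread keeps a single `latestCheckpoint` variable (both are assumptions; the file and function names are hypothetical):

```javascript
// main-thread: apply-delta.js (hypothetical file name)

// Merge a shallow delta into the last full checkpoint without mutating it.
function applyDelta(base, delta) {
  return { ...(base ?? {}), ...delta };
}

let latestCheckpoint = null;

// Intended to be called from worker.onmessage on the main thread.
function onWorkerMessage(event) {
  if (event.data?.type === 'DELTA_SNAPSHOT') {
    latestCheckpoint = applyDelta(latestCheckpoint, event.data.data);
  }
}
```

Because the diff is shallow, a key deleted in the worker would persist in the merged checkpoint. If deletions matter, the delta format would need an explicit list of removed keys.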

Browser Compatibility

Feature                        Chrome   Firefox   Safari   Edge
worker.onerror                 4+       3.5+      4+       12+
worker.onmessageerror          60+      57+       12+      18+
unhandledrejection in worker   66+      58+       11+      79+
AbortController in worker      66+      57+       11.1+    16+
structuredClone()              98+      94+       15.4+    98+
performance.mark() in worker   43+      40+       11+      79+
Figure: Worker crash and recovery state machine. States: IDLE, RUNNING, RECOVERING, TERMINATED. Transitions: postMessage, onerror / crash, backoff + respawn, max retries, terminate().
Worker lifecycle state machine: crashes transition to RECOVERING, where exponential backoff spawns a replacement; exhausting retries moves to TERMINATED.

Frequently Asked Questions

Why don't unhandled promise rejections in workers propagate to the main thread?
Workers run in isolated event loops. An unhandled rejection inside a worker only fires the worker’s own unhandledrejection event. The main thread never sees it unless you explicitly listen with self.addEventListener('unhandledrejection', ...) inside the worker and forward the serialized error over postMessage.

How do I implement exponential backoff for worker respawning?
Keep a retry counter incremented on each crash. Compute delay as Math.min(1000 * Math.pow(2, retries), 10000) so delays cap at 10 seconds. After exceeding a maximum retry count, throw or emit an unrecoverable error event to the main thread rather than looping indefinitely.

What is the right way to serialize an Error object across a thread boundary?
Avoid relying on passing Error instances directly. Structured clone drops custom properties and subclass prototypes, and whether the stack survives varies by browser. Instead send a plain object: { message: err.message, stack: err.stack, name: err.name, timestamp: performance.now() }. The receiving thread reconstructs an Error if needed.

How can I ship worker error details to Sentry or another APM tool?
Serialize the error payload as a plain object in the worker and route it over postMessage to the main thread, where your telemetry SDK runs. For structured patterns including stack normalization, see Production Error Telemetry.
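
Reconstructing an Error on the receiving thread, as mentioned in the serialization answer, can be sketched like this. The `reviveError` name is hypothetical, and the payload is assumed to have the { name, message, stack } shape described there.

```javascript
// Rebuild an Error from the plain payload sent by a worker.
function reviveError(payload) {
  const error = new Error(payload.message);
  error.name = payload.name ?? 'Error';
  // Keep the worker-side stack; the stack created here would point at reviveError
  if (payload.stack) error.stack = payload.stack;
  return error;
}
```

Telemetry SDKs generally expect Error instances, so reviving the payload before reporting keeps the worker-side stack attached to the event.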
