
Incidents

When a single check fails, WatchDeck doesn’t immediately page anyone. It waits for the failure streak to reach the endpoint’s threshold, then opens an incident — and notifications fire off the incident, not off every individual failed check. The thresholds keep a single transient blip from waking anyone up.

Lifecycle

Incidents have just two states:

Status     Meaning
active     Open and still firing. The endpoint’s current_incident_id points at it.
resolved   Closed — the recovery streak met its threshold. resolved_at and duration_seconds are populated.

There’s no acknowledged state and no manual resolve. Incidents are opened and closed entirely by the check engine — see What’s not in the product.
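
For orientation, here is a rough sketch of the incident row, inferred from the fields this page references (the column types, defaults, and the endpoint foreign key are assumptions, not the real DDL):

    -- Illustrative shape only; the actual schema may differ.
    create table mx_incidents (
      id               uuid primary key default gen_random_uuid(),
      endpoint_id      uuid not null,       -- FK to the endpoints table (name not documented here)
      status           text not null default 'active',      -- 'active' | 'resolved'
      cause            text not null,       -- 'endpoint_down' | 'endpoint_degraded'
      cause_detail     text,                -- probe error_message or status_reason
      timeline         jsonb not null default '[]'::jsonb,  -- array of {event, detail, at}
      started_at       timestamptz not null default now(),
      resolved_at      timestamptz,         -- set when the recovery streak fires
      duration_seconds integer              -- populated alongside resolved_at
    );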

How failures are grouped

A Postgres trigger watches every new check row and decides whether to open, append to, or close an incident.
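
A minimal sketch of how that trigger might be wired up (the checks table name mx_checks and the function name are assumptions):

    -- Hypothetical wiring; table and function names are not documented here.
    create trigger trg_group_checks_into_incidents
      after insert on mx_checks
      for each row
      execute function handle_check_for_incidents();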

Opening

When consecutive_failures reaches the endpoint’s failure_threshold (default 3) and there is no active incident:

  • A new mx_incidents row is inserted with status='active'.
  • cause is set to endpoint_down (status was down) or endpoint_degraded (status was degraded).
  • cause_detail captures the probe’s error_message or status_reason — what you’ll see as the headline on the detail page.
  • Initial timeline event: { event: 'opened', detail: <cause_detail>, at: <now> }.
  • The endpoint’s current_incident_id is set to the new incident.
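
Inside the trigger function, the opening step might look roughly like this. Assume ep is the endpoint row loaded earlier in the function and v_incident_id is a declared uuid; the mx_endpoints name and the NEW.* check columns are assumptions, while the mx_incidents columns come from this page:

    -- Fragment of a plpgsql trigger body: open an incident when the failure
    -- streak hits the threshold and nothing is already open.
    if ep.consecutive_failures >= ep.failure_threshold
       and ep.current_incident_id is null then
      insert into mx_incidents (endpoint_id, status, cause, cause_detail, timeline)
      values (
        new.endpoint_id,
        'active',
        case when new.status = 'down'
             then 'endpoint_down' else 'endpoint_degraded' end,
        coalesce(new.error_message, new.status_reason),
        jsonb_build_array(jsonb_build_object(
          'event',  'opened',
          'detail', coalesce(new.error_message, new.status_reason),
          'at',     now())))
      returning id into v_incident_id;

      -- Point the endpoint at its new incident.
      update mx_endpoints
      set current_incident_id = v_incident_id
      where id = new.endpoint_id;
    end if;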

Appending

While an incident is open, every subsequent check appends a 'check' event to the incident’s timeline array:

Probe outcome   Timeline detail
Failure         <status> — <error_message> or <status> — <code> — <ms>ms
Healthy         healthy — <code> — <ms>ms (X/Y) — shows recovery progress

Healthy probes appear in the timeline so you can see the recovery streak building.
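
Appending is a single jsonb concatenation; a sketch assuming check columns http_code and response_ms, with ep as the endpoint row loaded earlier in the trigger:

    -- Append a 'check' event to the open incident's timeline. The detail
    -- string mirrors the failure format in the table above.
    update mx_incidents
    set timeline = timeline || jsonb_build_object(
          'event',  'check',
          'detail', format('%s — %s — %sms', new.status, new.http_code, new.response_ms),
          'at',     now())
    where id = ep.current_incident_id;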

Resolving

When consecutive_healthy reaches the endpoint’s recovery_threshold (default 2):

  • status flips to resolved.
  • resolved_at and duration_seconds are populated.
  • A final 'resolved' timeline event is appended: Recovered after N consecutive healthy checks.
  • The endpoint’s current_incident_id is cleared.
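
A matching sketch of the close step, under the same assumed names as above:

    -- Fragment: resolve once the recovery streak meets the threshold.
    if ep.consecutive_healthy >= ep.recovery_threshold
       and ep.current_incident_id is not null then
      update mx_incidents
      set status           = 'resolved',
          resolved_at      = now(),
          duration_seconds = extract(epoch from now() - started_at)::int,
          timeline         = timeline || jsonb_build_object(
            'event',  'resolved',
            'detail', format('Recovered after %s consecutive healthy checks',
                             ep.recovery_threshold),
            'at',     now())
      where id = ep.current_incident_id;

      -- Clear the pointer so the next failure streak opens a fresh incident.
      update mx_endpoints
      set current_incident_id = null
      where id = new.endpoint_id;
    end if;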

Cause types

Only two values are emitted today:

Cause               Triggered by
endpoint_down       Probe verdict was down — connection error, port refused, status mismatch.
endpoint_degraded   Probe verdict was degraded — slow response or a degraded-severity assertion fired.

Notification severity follows the cause: endpoint_down becomes critical, endpoint_degraded becomes warning. A resolution becomes success.
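
Since the mapping is one-to-one, it reduces to a CASE expression (a sketch; the severity lives on the notification, not the incident row):

    -- Severity of the opening notification, derived from the cause.
    -- The matching resolution notification is always 'success'.
    select id,
           case cause
             when 'endpoint_down'     then 'critical'
             when 'endpoint_degraded' then 'warning'
           end as open_severity
    from mx_incidents;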

Recovery alerts

By default, an incident’s resolution fires its own notification (severity success). If you’d rather only hear about openings, switch off Recovery alert under the endpoint’s Alerts section. The dispatcher short-circuits before any per-channel processing, so no log row is written either.
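
What that short-circuit might look like at the top of the dispatcher, sketched in plpgsql (the recovery_alert_enabled column and v_event variable are assumptions):

    -- Bail out before any per-channel work, so no notification log row
    -- is ever written for a suppressed recovery alert.
    if v_event = 'resolved' and not ep.recovery_alert_enabled then
      return;
    end if;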

Detail page

Open any incident from the history list to see:

  • Header — endpoint name, status pill, “down for” counter (ticks live every second while the incident is active).
  • Summary strip — duration, alerts sent, related incident count.
  • Why it opened — the cause_detail from the trigger plus the endpoint’s threshold context.
  • Response chart — response-time area chart for the window [startedAt − 30m, (resolvedAt | now) + 30m], capped at 2000 points.
  • Timeline — every event written to the incident’s timeline JSONB array.
  • Notifications log — every dispatch attempt for this incident (sent / failed / suppressed).
  • Checks log — the underlying check rows, filterable.
  • Side rail — endpoint config snapshot, related incidents (last 30 days, capped at 20).
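
The chart window translates directly into a query; a sketch assuming a checks table mx_checks with checked_at and response_ms columns (:incident_id is a placeholder parameter):

    -- [started_at - 30m, coalesce(resolved_at, now()) + 30m], capped at 2000 points.
    select c.checked_at, c.response_ms
    from mx_incidents i
    join mx_checks c
      on c.endpoint_id = i.endpoint_id
     and c.checked_at between i.started_at - interval '30 minutes'
                          and coalesce(i.resolved_at, now()) + interval '30 minutes'
    where i.id = :incident_id
    order by c.checked_at
    limit 2000;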

Incident history

/incidents is the chronological list. Filters:

Filter     Options
Status     active · resolved · all
Severity   critical · warning · success · all
Endpoint   Any endpoint in your account
Cause      endpoint_down · endpoint_degraded
Range      24h · 7d · 30d · all
Search     Endpoint name / URL, cause label, cause detail

The list subscribes to realtime Postgres changes — new incidents and resolutions appear without a refresh.
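
The filters map directly onto a query; a sketch with placeholder parameters, assuming mx_endpoints carries name and url columns:

    -- One example combination: status/cause filters, the 7d range, and a search term.
    select i.*
    from mx_incidents i
    join mx_endpoints e on e.id = i.endpoint_id
    where (:status = 'all' or i.status = :status)
      and (:cause is null or i.cause = :cause)
      and i.started_at >= now() - interval '7 days'
      and (:q is null
           or e.name ilike '%' || :q || '%'
           or e.url  ilike '%' || :q || '%'
           or i.cause_detail ilike '%' || :q || '%')
    order by i.started_at desc;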

Retention

Incidents are kept indefinitely. There’s no scheduled cleanup. Related data has its own clock:

Data               Retention
Incidents          Indefinite
Daily summaries    Indefinite
Hourly summaries   30 days
Notification log   90 days
Check rows         48 hours

So while an incident from a year ago is still in the history list, the underlying check rows and the per-incident notification log entries may have aged out. The timeline events are part of the incident row itself, so they survive.
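
If you want to reason about what has aged out, the sweeps would look something like this (purely illustrative; none of these table or column names are documented here, and incidents themselves are never deleted):

    -- Hypothetical retention sweeps for the shorter-lived tables.
    delete from mx_checks           where checked_at < now() - interval '48 hours';
    delete from mx_notification_log where created_at < now() - interval '90 days';
    delete from mx_hourly_summaries where bucket_at  < now() - interval '30 days';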

What’s not in the product

Things people often look for, each called out explicitly as absent so you don’t go hunting:

  • Acknowledgement. Incidents go directly from active to resolved once the recovery streak fires. There’s no manual ack.
  • Manual resolve. Same reason — only the streak engine can close an incident. If your endpoint is genuinely back but is being marked degraded by an assertion you no longer want, fix the assertion (or pause the endpoint) rather than trying to close the incident by hand.
  • Escalation. The endpoint form has an Escalation channel + delay and the schema has a mx_scheduled_escalations table — but no code path fires escalation messages today. Anything you configure there is stored but inert. Treat it as “coming soon.”
  • Coalescing. The notification log schema has coalesced_into_log_id and the UI can render a coalesced summary, but no dispatcher writes coalesced rows.
