
Incidents

When a single check fails, WatchDeck doesn’t immediately page anyone. It waits for the failure streak to reach the endpoint’s threshold, then opens an incident — and notifications fire off the incident, not off every individual failed check. The thresholds keep a single transient blip from waking anyone up.

Lifecycle

Incidents have just two states:

Status     Meaning
active     Open and still firing. The endpoint’s current_incident_id points at it.
resolved   Closed — the recovery streak met its threshold. resolved_at and duration_seconds are populated.

There’s no acknowledged state and no manual resolve. Incidents are opened and closed entirely by the check engine — see What’s not in the product.
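
For orientation, here is a rough sketch of the incident row, inferred from the fields this page references (the column types, defaults, and the endpoint foreign key are assumptions, not the real DDL):

    -- Illustrative shape only; the actual schema may differ.
    create table mx_incidents (
      id               uuid primary key default gen_random_uuid(),
      endpoint_id      uuid not null,       -- FK to the endpoints table (name not documented here)
      status           text not null default 'active',      -- 'active' | 'resolved'
      cause            text not null,       -- 'endpoint_down' | 'endpoint_degraded'
      cause_detail     text,                -- probe error_message or status_reason
      timeline         jsonb not null default '[]'::jsonb,  -- array of {event, detail, at}
      started_at       timestamptz not null default now(),
      resolved_at      timestamptz,         -- set when the recovery streak fires
      duration_seconds integer              -- populated alongside resolved_at
    );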

How failures are grouped

A Postgres trigger watches every new check row and decides whether to open, append to, or close an incident.
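
A minimal sketch of how that trigger might be wired up (the checks table name mx_checks and the function name are assumptions):

    -- Hypothetical wiring; table and function names are not documented here.
    create trigger trg_group_checks_into_incidents
      after insert on mx_checks
      for each row
      execute function handle_check_for_incidents();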

Opening

When consecutive_failures reaches the endpoint’s failure_threshold (default 3) and there is no active incident:

  • A new mx_incidents row is inserted with status='active'.
  • cause is set to endpoint_down (status was down) or endpoint_degraded (status was degraded).
  • cause_detail captures the probe’s error_message or status_reason — what you’ll see as the headline on the detail page.
  • Initial timeline event: { event: 'opened', detail: <cause_detail>, at: <now> }.
  • The endpoint’s current_incident_id is set to the new incident.
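
Inside the trigger function, the opening step might look roughly like this. Assume ep is the endpoint row loaded earlier in the function and v_incident_id is a declared uuid; the mx_endpoints name and the NEW.* check columns are assumptions, while the mx_incidents columns come from this page:

    -- Fragment of a plpgsql trigger body: open an incident when the failure
    -- streak hits the threshold and nothing is already open.
    if ep.consecutive_failures >= ep.failure_threshold
       and ep.current_incident_id is null then
      insert into mx_incidents (endpoint_id, status, cause, cause_detail, timeline)
      values (
        new.endpoint_id,
        'active',
        case when new.status = 'down'
             then 'endpoint_down' else 'endpoint_degraded' end,
        coalesce(new.error_message, new.status_reason),
        jsonb_build_array(jsonb_build_object(
          'event',  'opened',
          'detail', coalesce(new.error_message, new.status_reason),
          'at',     now())))
      returning id into v_incident_id;

      -- Point the endpoint at its new incident.
      update mx_endpoints
      set current_incident_id = v_incident_id
      where id = new.endpoint_id;
    end if;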

Appending

While an incident is open, every subsequent check appends a 'check' event to the incident’s timeline array:

Probe outcome   Timeline detail
Failure         <status> — <error_message> or <status> — <code> — <ms>ms
Healthy         healthy — <code> — <ms>ms (X/Y) — shows recovery progress

Healthy probes appear in the timeline so you can see the recovery streak building.
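
Appending is a single jsonb concatenation; a sketch assuming check columns http_code and response_ms, with ep as the endpoint row loaded earlier in the trigger:

    -- Append a 'check' event to the open incident's timeline. The detail
    -- string mirrors the failure format in the table above.
    update mx_incidents
    set timeline = timeline || jsonb_build_object(
          'event',  'check',
          'detail', format('%s — %s — %sms', new.status, new.http_code, new.response_ms),
          'at',     now())
    where id = ep.current_incident_id;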

Resolving

When consecutive_healthy reaches the endpoint’s recovery_threshold (default 2):

  • status flips to resolved.
  • resolved_at and duration_seconds are populated.
  • A final 'resolved' timeline event is appended: Recovered after N consecutive healthy checks.
  • The endpoint’s current_incident_id is cleared.
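
A matching sketch of the close step, under the same assumed names as above:

    -- Fragment: resolve once the recovery streak meets the threshold.
    if ep.consecutive_healthy >= ep.recovery_threshold
       and ep.current_incident_id is not null then
      update mx_incidents
      set status           = 'resolved',
          resolved_at      = now(),
          duration_seconds = extract(epoch from now() - started_at)::int,
          timeline         = timeline || jsonb_build_object(
            'event',  'resolved',
            'detail', format('Recovered after %s consecutive healthy checks',
                             ep.recovery_threshold),
            'at',     now())
      where id = ep.current_incident_id;

      -- Clear the pointer so the next failure streak opens a fresh incident.
      update mx_endpoints
      set current_incident_id = null
      where id = new.endpoint_id;
    end if;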

Cause types

Only two values are emitted today:

Cause               Triggered by
endpoint_down       Probe verdict was down — connection error, port refused, status mismatch.
endpoint_degraded   Probe verdict was degraded — slow response or a degraded-severity assertion fired.

Notification severity follows the cause: endpoint_down becomes critical, endpoint_degraded becomes warning. A resolution becomes success.
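
Since the mapping is one-to-one, it reduces to a CASE expression (a sketch; the severity lives on the notification, not the incident row):

    -- Severity of the opening notification, derived from the cause.
    -- The matching resolution notification is always 'success'.
    select id,
           case cause
             when 'endpoint_down'     then 'critical'
             when 'endpoint_degraded' then 'warning'
           end as open_severity
    from mx_incidents;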

Recovery alerts

By default, an incident’s resolution fires its own notification (severity success). If you’d rather only hear about openings, switch off Recovery alert under the endpoint’s Alerts section. The dispatcher short-circuits before any per-channel processing, so no log row is written either.
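
What that short-circuit might look like at the top of the dispatcher, sketched in plpgsql (the recovery_alert_enabled column and v_event variable are assumptions):

    -- Bail out before any per-channel work, so no notification log row
    -- is ever written for a suppressed recovery alert.
    if v_event = 'resolved' and not ep.recovery_alert_enabled then
      return;
    end if;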

Detail page

Open any incident from the history list to see:

  • Header — endpoint name, status pill, “down for” counter (ticks live every second while the incident is active).
  • Summary strip — duration, alerts sent, related incident count.
  • Why it opened — the cause_detail from the trigger plus the endpoint’s threshold context.
  • Response chart — response-time area chart for the window [startedAt − 30m, (resolvedAt | now) + 30m], capped at 2000 points.
  • Timeline — every event written to the incident’s timeline JSONB array.
  • Notifications log — every dispatch attempt for this incident (sent / failed / suppressed).
  • Checks log — the underlying check rows, filterable.
  • Side rail — endpoint config snapshot, related incidents (last 30 days, capped at 20).
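
The chart window translates directly into a query; a sketch assuming a checks table mx_checks with checked_at and response_ms columns (:incident_id is a placeholder parameter):

    -- [started_at - 30m, coalesce(resolved_at, now()) + 30m], capped at 2000 points.
    select c.checked_at, c.response_ms
    from mx_incidents i
    join mx_checks c
      on c.endpoint_id = i.endpoint_id
     and c.checked_at between i.started_at - interval '30 minutes'
                          and coalesce(i.resolved_at, now()) + interval '30 minutes'
    where i.id = :incident_id
    order by c.checked_at
    limit 2000;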

Incident history

/incidents is the chronological list. Filters:

Filter     Options
Status     active · resolved · all
Severity   critical · warning · success · all
Endpoint   Any endpoint in your account
Cause      endpoint_down · endpoint_degraded
Range      24h · 7d · 30d · all
Search     Endpoint name / URL, cause label, cause detail

The list subscribes to realtime Postgres changes — new incidents and resolutions appear without a refresh.
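
The filters map directly onto a query; a sketch with placeholder parameters, assuming mx_endpoints carries name and url columns:

    -- One example combination: status/cause filters, the 7d range, and a search term.
    select i.*
    from mx_incidents i
    join mx_endpoints e on e.id = i.endpoint_id
    where (:status = 'all' or i.status = :status)
      and (:cause is null or i.cause = :cause)
      and i.started_at >= now() - interval '7 days'
      and (:q is null
           or e.name ilike '%' || :q || '%'
           or e.url  ilike '%' || :q || '%'
           or i.cause_detail ilike '%' || :q || '%')
    order by i.started_at desc;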

Retention

Incidents are kept indefinitely. There’s no scheduled cleanup. Related data has its own clock:

Data               Retention
Incidents          Indefinite
Daily summaries    Indefinite
Hourly summaries   30 days
Notification log   90 days
Check rows         48 hours

So while an incident from a year ago is still in the history list, the underlying check rows and the per-incident notification log entries may have aged out. The timeline events are part of the incident row itself, so they survive.
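
If you want to reason about what has aged out, the sweeps would look something like this (purely illustrative; none of these table or column names are documented here, and incidents themselves are never deleted):

    -- Hypothetical retention sweeps for the shorter-lived tables.
    delete from mx_checks           where checked_at < now() - interval '48 hours';
    delete from mx_notification_log where created_at < now() - interval '90 days';
    delete from mx_hourly_summaries where bucket_at  < now() - interval '30 days';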

What’s not in the product

Things people often look for, each called out explicitly as absent so you don’t go hunting:

  • Acknowledgement. Incidents go directly from active to resolved once the recovery streak fires. There’s no manual ack.
  • Manual resolve. Same reason — only the streak engine can close an incident. If your endpoint is genuinely back but is being marked degraded by an assertion you no longer want, fix the assertion (or pause the endpoint) rather than trying to close the incident by hand.
  • Escalation. The endpoint form has an Escalation channel + delay and the schema has a mx_scheduled_escalations table — but no code path fires escalation messages today. Anything you configure there is stored but inert. Treat it as “coming soon.”
  • Coalescing. The notification log schema has coalesced_into_log_id and the UI can render a coalesced summary, but no dispatcher writes coalesced rows.
