Incidents
When a single check fails, WatchDeck doesn’t immediately page anyone. It waits for the failure streak to reach the endpoint’s threshold, then opens an incident — and notifications fire off the incident, not off every individual failed check. The thresholds keep a single transient blip from waking anyone up.
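The streak gate can be sketched in a few lines. This is illustrative only — the real gate is a Postgres trigger, not application code — and the names `results` and `should_open_incident` are made up for the example; `failure_threshold` follows the doc:

```python
# Hypothetical sketch of the failure-streak gate. The real logic lives in a
# Postgres trigger; this just shows why a transient blip never pages anyone.

def should_open_incident(results, failure_threshold=3):
    """Return the index of the check that would open an incident, or None."""
    streak = 0
    for i, ok in enumerate(results):
        streak = 0 if ok else streak + 1  # any healthy check resets the streak
        if streak == failure_threshold:
            return i  # incident opens here; later failures append, not re-open
    return None

# One transient blip never reaches the threshold, so nobody is paged:
print(should_open_incident([True, False, True, True]))         # None
# Three consecutive failures open a single incident at the third one:
print(should_open_incident([True, False, False, False, False]))  # 3
```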
Lifecycle
Incidents have just two states:
| Status | Meaning |
|---|---|
| active | Open and still firing. The endpoint’s current_incident_id points at it. |
| resolved | Closed — the recovery streak met its threshold. resolved_at and duration_seconds are populated. |
There’s no acknowledged state and no manual resolve. Incidents are opened and closed entirely by the check engine — see What’s not in the product.
How failures are grouped
A Postgres trigger watches every new check row and decides whether to open, append to, or close an incident.
Opening
When consecutive_failures reaches the endpoint’s failure_threshold (default 3) and there is no active incident:
- A new mx_incidents row is inserted with status='active'. cause is set to endpoint_down (status was down) or endpoint_degraded (status was degraded). cause_detail captures the probe’s error_message or status_reason — what you’ll see as the headline on the detail page.
- Initial timeline event: { event: 'opened', detail: <cause_detail>, at: <now> }.
- The endpoint’s current_incident_id is set to the new incident.
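A hedged sketch of what the opening step derives. Field names (cause, cause_detail, error_message, status_reason) follow the doc; the function itself is illustrative, not the actual trigger code:

```python
from datetime import datetime, timezone

def opening_fields(status, error_message=None, status_reason=None):
    """Derive cause, cause_detail, and the initial timeline event (sketch)."""
    cause = "endpoint_down" if status == "down" else "endpoint_degraded"
    cause_detail = error_message or status_reason  # the detail-page headline
    event = {"event": "opened", "detail": cause_detail,
             "at": datetime.now(timezone.utc).isoformat()}
    return cause, cause_detail, event

cause, detail, event = opening_fields("down", error_message="connection refused")
print(cause, detail)  # endpoint_down connection refused
```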
Appending
While an incident is open, every subsequent check appends a 'check' event to the incident’s timeline array:
| Probe outcome | Timeline detail |
|---|---|
| Failure | <status> — <error_message> or <status> — <code> — <ms>ms |
| Healthy | healthy — <code> — <ms>ms (X/Y) — shows recovery progress |
Healthy probes appear in the timeline so you can see the recovery streak building.
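The detail formats from the table can be written out as a small formatter. The format strings are taken from the doc; the helper and its parameter names are hypothetical:

```python
def timeline_detail(status, code=None, ms=None, error_message=None,
                    healthy_streak=None, recovery_threshold=None):
    """Build the 'check' event detail string per the formats in the table."""
    if status == "healthy":
        # healthy — <code> — <ms>ms (X/Y), where X/Y is the recovery streak
        return f"healthy — {code} — {ms}ms ({healthy_streak}/{recovery_threshold})"
    if error_message:
        return f"{status} — {error_message}"
    return f"{status} — {code} — {ms}ms"

print(timeline_detail("down", error_message="connection refused"))
print(timeline_detail("healthy", code=200, ms=84,
                      healthy_streak=1, recovery_threshold=2))
```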
Resolving
When consecutive_healthy reaches the endpoint’s recovery_threshold (default 2):
- status flips to resolved. resolved_at and duration_seconds are populated.
- A final 'resolved' timeline event is appended: Recovered after N consecutive healthy checks.
- The endpoint’s current_incident_id is cleared.
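The resolving step, as a minimal sketch. Column names follow the doc; the dict shape standing in for the mx_incidents row is an assumption:

```python
from datetime import datetime, timezone

def resolve(incident, consecutive_healthy, recovery_threshold=2):
    """Close an active incident once the healthy streak meets the threshold."""
    if consecutive_healthy < recovery_threshold:
        return incident  # streak not met yet; stays active
    now = datetime.now(timezone.utc)
    incident["status"] = "resolved"
    incident["resolved_at"] = now
    incident["duration_seconds"] = int((now - incident["started_at"]).total_seconds())
    incident["timeline"].append({
        "event": "resolved",
        "detail": f"Recovered after {consecutive_healthy} consecutive healthy checks",
        "at": now.isoformat(),
    })
    return incident

incident = {"status": "active", "timeline": [],
            "started_at": datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)}
resolve(incident, consecutive_healthy=2)
print(incident["status"])  # resolved
```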
Cause types
Only two values are emitted today:
| Cause | Triggered by |
|---|---|
| endpoint_down | Probe verdict was down — connection error, port refused, status mismatch. |
| endpoint_degraded | Probe verdict was degraded — slow response or a degraded-severity assertion fired. |
Notification severity follows the cause: endpoint_down becomes critical, endpoint_degraded becomes warning. A resolution becomes success.
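The severity mapping above, written as a lookup. Note that success is the severity of resolution notifications rather than a third cause, so it is handled by a flag here; the function name is made up:

```python
CAUSE_SEVERITY = {
    "endpoint_down": "critical",
    "endpoint_degraded": "warning",
}

def notification_severity(cause, resolved=False):
    """Map an incident's cause (or its resolution) to notification severity."""
    return "success" if resolved else CAUSE_SEVERITY[cause]

print(notification_severity("endpoint_down"))                     # critical
print(notification_severity("endpoint_degraded"))                 # warning
print(notification_severity("endpoint_down", resolved=True))      # success
```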
Recovery alerts
By default, an incident’s resolution fires its own notification (severity success). If you’d rather only hear about openings, switch off Recovery alert under the endpoint’s Alerts section. The dispatcher short-circuits before any per-channel processing, so no log row is written either.
Detail page
Open any incident from the history list to see:
- Header — endpoint name, status pill, “down for” counter (live ticks every second on active incidents).
- Summary strip — duration, alerts sent, related incident count.
- Why it opened — the cause_detail from the trigger plus the endpoint’s threshold context.
- Response chart — response-time area chart for the window [startedAt − 30m, (resolvedAt | now) + 30m], capped at 2000 points.
- Timeline — every event written to the incident’s timeline JSONB array.
- Notifications log — every dispatch attempt for this incident (sent / failed / suppressed).
- Checks log — the underlying check rows, filterable.
- Side rail — endpoint config snapshot, related incidents (last 30 days, capped at 20).
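The response-chart window from the list above can be computed directly. The 30-minute padding is from the doc; the function and variable names are assumptions for illustration:

```python
from datetime import datetime, timedelta, timezone

def chart_window(started_at, resolved_at=None, pad=timedelta(minutes=30)):
    """Window [startedAt - 30m, (resolvedAt | now) + 30m] for the area chart."""
    end = resolved_at or datetime.now(timezone.utc)  # active incidents use now
    return started_at - pad, end + pad

start = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
end = datetime(2024, 5, 1, 13, 0, tzinfo=timezone.utc)
lo, hi = chart_window(start, end)
print(lo, hi)  # 11:30 to 13:30 UTC
```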
Incident history
/incidents is the chronological list. Filters:
| Filter | Options |
|---|---|
| Status | active · resolved · all |
| Severity | critical · warning · success · all |
| Endpoint | Any endpoint in your account |
| Cause | endpoint_down · endpoint_degraded |
| Range | 24h · 7d · 30d · all |
| Search | Endpoint name / URL, cause label, cause detail |
The list subscribes to realtime Postgres changes — new incidents and resolutions appear without a refresh.
Retention
Incidents are kept indefinitely. There’s no scheduled cleanup. Related data has its own clock:
| Data | Retention |
|---|---|
| Incidents | Indefinite |
| Daily summaries | Indefinite |
| Hourly summaries | 30 days |
| Notification log | 90 days |
| Check rows | 48 hours |
So while an incident from a year ago is still in the history list, the underlying check rows and the per-incident notification log entries may have aged out. The timeline events are part of the incident row itself, so they survive.
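As a quick illustration of the retention table, here is a hypothetical helper that answers "is this data still around for an incident of a given age?" (the table values are from the doc; the helper itself is not product code):

```python
from datetime import timedelta

RETENTION = {
    "incident": None,                        # indefinite
    "daily_summary": None,                   # indefinite
    "hourly_summary": timedelta(days=30),
    "notification_log": timedelta(days=90),
    "check_rows": timedelta(hours=48),
}

def still_available(kind, age):
    """True if data of this kind survives for an incident of the given age."""
    limit = RETENTION[kind]
    return limit is None or age <= limit

age = timedelta(days=365)  # the year-old incident from the text
print(still_available("incident", age))    # True  — still in the history list
print(still_available("check_rows", age))  # False — underlying checks aged out
```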
What’s not in the product
Things people often look for, with explicit “not there” so you don’t go hunting:
- Acknowledgement. Incidents go directly from active to resolved once the recovery streak fires. There’s no manual ack.
- Manual resolve. Same reason — only the streak engine can close an incident. If your endpoint is genuinely back but is being marked degraded by an assertion you no longer want, fix the assertion (or pause the endpoint) rather than trying to close the incident by hand.
- Escalation. The endpoint form has an Escalation channel + delay and the schema has a mx_scheduled_escalations table — but no code path fires escalation messages today. Anything you configure there is stored but inert. Treat it as “coming soon.”
- Coalescing. The notification log schema has coalesced_into_log_id and the UI can render a coalesced summary, but no dispatcher writes coalesced rows.