Concepts
WatchDeck has a small vocabulary. Five nouns cover almost everything you’ll see in the UI. Learn them once and the rest of the product reads cleanly.
Endpoint
The thing you’re monitoring, and its configuration. A single row in mx_endpoints holds:
- Identity: name, optional description, type (http or port).
- Target: url for HTTP, or host + port for TCP.
- Probe config: method, expected status codes, custom headers, assertions.
- Cadence & thresholds: check interval, timeout, latency threshold, SSL warning days, failure / recovery thresholds.
- Routing: which notification channels to use, plus an optional escalation channel and delay.
- Live state: last status, last check time, last response time, consecutive failure / healthy streaks, and the current open incident (if any).
There’s no separate “check config” record. Adding an endpoint is configuring its check.
Lifecycle status
An endpoint’s status is one of:
| Status | Behaviour |
|---|---|
| active | Default. The scheduler runs probes on the configured interval. |
| paused | Scheduler skips it. No probes, no incidents, no notifications. History kept. |
| archived | Hidden from the list view. Reserved for future use. |
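
As a rough illustration of how the lifecycle status gates probing, here is a minimal sketch of a "is this endpoint due?" decision. The `Endpoint` class, its field names, and the `is_due` helper are hypothetical and do not mirror the real mx_endpoints columns or the actual scheduler; treating archived like paused is also an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Endpoint:
    # Hypothetical field names; the real mx_endpoints columns may differ.
    name: str
    status: str                       # "active", "paused", or "archived"
    interval_seconds: int             # configured check interval
    last_check_at: Optional[datetime] = None


def is_due(endpoint: Endpoint, now: datetime) -> bool:
    """Return True if a scheduler following the table above would probe now."""
    if endpoint.status != "active":
        # Paused endpoints are skipped; archived is treated the same (assumption).
        return False
    if endpoint.last_check_at is None:
        return True  # never probed yet, so probe immediately
    elapsed = now - endpoint.last_check_at
    return elapsed >= timedelta(seconds=endpoint.interval_seconds)
```

Pausing an endpoint in this model only changes the answer of `is_due`; nothing about its stored history is touched.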
Check
A single recorded run of an endpoint’s probe. Each row in mx_checks captures one verdict:
- status: healthy, degraded, down, or inconclusive.
- status_reason: short human string (e.g. "HTTP 502 — expected 200").
- response_time, status_code, ssl_days_remaining, port_open, body_bytes.
- assertion_result: per-rule pass/fail breakdown.
Checks are append-only. Once written, a check row never changes.
In conversation, people use “check” loosely: sometimes they mean the probe configuration, sometimes a single run. The database treats them as distinct: the endpoint row holds the config; each mx_checks row holds one run.
Run statuses
| Status | When it happens |
|---|---|
| healthy | Probe succeeded and every assertion passed. |
| degraded | Probe succeeded but a degraded-severity rule fired: latency over budget, SSL inside the warning window, soft-fail assertion. |
| down | Status code mismatch, network or connection error, port refused, or a down-severity assertion failed. |
| inconclusive | The run couldn’t reach a verdict (rare, typically a scheduling or storage error). Doesn’t move the streak counters. |
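
The table above can be read as a precedence order: down conditions win over degraded ones, and healthy is what remains. The sketch below shows that ordering for an HTTP probe only. `ProbeResult`, `classify`, and all parameter names are hypothetical; treating "inside the warning window" as `<=` is an assumption; and inconclusive is left out because it describes a run that never produced a result to classify.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class ProbeResult:
    # Hypothetical shape of a raw HTTP probe outcome, before a verdict is assigned.
    error: Optional[str] = None               # network or connection error, if any
    status_code: Optional[int] = None
    response_time_ms: Optional[float] = None
    ssl_days_remaining: Optional[int] = None
    failed_down_assertions: int = 0           # down-severity assertions that failed
    failed_degraded_assertions: int = 0       # soft-fail assertions that failed


def classify(result: ProbeResult, expected_codes: Set[int],
             latency_threshold_ms: float, ssl_warning_days: int) -> str:
    """Map a probe result to "down", "degraded", or "healthy"."""
    # Down-severity conditions are checked first.
    if result.error is not None:
        return "down"
    if result.status_code not in expected_codes:
        return "down"
    if result.failed_down_assertions > 0:
        return "down"
    # The probe succeeded; look for degraded-severity rules.
    if (result.response_time_ms is not None
            and result.response_time_ms > latency_threshold_ms):
        return "degraded"
    if (result.ssl_days_remaining is not None
            and result.ssl_days_remaining <= ssl_warning_days):
        return "degraded"
    if result.failed_degraded_assertions > 0:
        return "degraded"
    return "healthy"
```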
Incident
A grouped failure window. When an endpoint’s consecutive_failures reaches its failure_threshold, an incident opens. When consecutive_healthy reaches recovery_threshold, it resolves. The thresholds keep a single transient blip from paging anyone.
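
The streak logic described above can be sketched as a small state machine. This is an illustration under stated assumptions, not WatchDeck's implementation: `EndpointState` and `apply_check` are hypothetical names, and counting both down and degraded checks toward the failure streak is an assumption.

```python
from dataclasses import dataclass


@dataclass
class EndpointState:
    failure_threshold: int
    recovery_threshold: int
    consecutive_failures: int = 0
    consecutive_healthy: int = 0
    incident_open: bool = False


def apply_check(state: EndpointState, status: str) -> str:
    """Update streaks for one check; return "opened", "resolved", or "none"."""
    if status == "inconclusive":
        # Inconclusive runs leave the streak counters untouched.
        return "none"
    if status == "healthy":
        state.consecutive_healthy += 1
        state.consecutive_failures = 0
        if state.incident_open and state.consecutive_healthy >= state.recovery_threshold:
            state.incident_open = False
            return "resolved"
        return "none"
    # Assumption: both "down" and "degraded" count toward the failure streak.
    state.consecutive_failures += 1
    state.consecutive_healthy = 0
    if not state.incident_open and state.consecutive_failures >= state.failure_threshold:
        state.incident_open = True
        return "opened"
    return "none"
```

With `failure_threshold=3`, a single failed check followed by a healthy one resets the streak and never opens an incident, which is the blip-suppression behaviour the thresholds exist for.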
Incident status
Incidents have just two states:
| Status | Meaning |
|---|---|
| active | Open and still firing. The endpoint’s current_incident_id points at it. |
| resolved | Closed: the recovery streak met its threshold. resolved_at and duration_seconds are populated. |
There’s no acknowledged state. Incidents go straight from active to resolved once the recovery streak fires.
Each incident carries a cause (endpoint_down or endpoint_degraded), a timeline JSONB array of every check that contributed to it, and a notifications-sent counter. See Incidents for the full lifecycle.
Channel
A destination for notifications. Channels live independently of endpoints. You create them once under Notifications and attach them to as many endpoints as you like.
| Type | Notes |
|---|---|
| email | One or more recipients per channel. |
| slack | Webhook into a Slack channel. |
| discord | Webhook into a Discord channel. |
| webhook | Generic JSON POST to any URL. |
Each channel carries its own filters (severity, event type), delivery priority, quiet hours, and rate limit. See Notifications for how those compose with per-endpoint routing.
Mute
A scoped silence. Mutes suppress channel delivery without changing endpoint state. Incidents still open and resolve, but no message goes out.
| Scope | Effect |
|---|---|
| endpoint | Silences notifications for one endpoint. |
| channel | Silences one channel across all endpoints it serves. |
| global | Silences everything. |
Each mute has an optional expires_at and a free-text reason, useful for planned maintenance windows.
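
To make the three scopes concrete, here is a minimal sketch of the check a delivery step might run before sending a message. The `Mute` class, the `target_id` field, and `is_muted` are hypothetical names; only the scope values, expires_at, and reason come from the description above. Treating a mute as expired once expires_at is at or before the current time is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass
class Mute:
    scope: str                             # "endpoint", "channel", or "global"
    target_id: Optional[str] = None        # hypothetical: endpoint or channel id
    expires_at: Optional[datetime] = None  # None means the mute never expires
    reason: str = ""


def is_muted(mutes: Iterable[Mute], endpoint_id: str,
             channel_id: str, now: datetime) -> bool:
    """Return True if any unexpired mute covers this endpoint/channel pair."""
    for mute in mutes:
        if mute.expires_at is not None and mute.expires_at <= now:
            continue  # expired mutes no longer suppress delivery
        if mute.scope == "global":
            return True
        if mute.scope == "endpoint" and mute.target_id == endpoint_id:
            return True
        if mute.scope == "channel" and mute.target_id == channel_id:
            return True
    return False
```

Because the check sits at delivery time, incident state is untouched: a muted endpoint still opens and resolves incidents, it just sends nothing.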
How they fit together
Endpoint ──── runs ────► Check (1:many)
│ (one row per probe)
│
├──── may open ────► Incident (1:many)
│ (grouped failure window)
│
└──── routes to ────► Channel (many:many)
│
└──── may be silenced by ──► Mute

Everything else in the product (the dashboard tiles, the catalogue, the notification log) is built on top of these five nouns.