Concepts

WatchDeck has a small vocabulary. Five nouns cover almost everything you’ll see in the UI. Learn them once and the rest of the product reads cleanly.

Endpoint

The thing you’re monitoring — and its configuration. A single row in mx_endpoints holds:

  • Identity — name, optional description, type (http or port).
  • Target — url for HTTP, or host + port for TCP.
  • Probe config — method, expected status codes, custom headers, assertions.
  • Cadence & thresholds — check interval, timeout, latency threshold, SSL warning days, failure / recovery thresholds.
  • Routing — which notification channels to use, plus an optional escalation channel and delay.
  • Live state — last status, last check time, last response time, consecutive failure / healthy streaks, and the current open incident (if any).

There’s no separate “check config” record — adding an endpoint is configuring its check.
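To make the shape concrete, here is a minimal Python sketch of one endpoint record that carries both configuration and live state. It is not the actual mx_endpoints schema: apart from the type values (http, port) and the names consecutive_failures, consecutive_healthy, failure_threshold, recovery_threshold and current_incident_id, every field name and default below is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class Endpoint:
    # Identity
    name: str
    type: str  # "http" or "port"
    description: Optional[str] = None
    # Target: url for HTTP, host + port for TCP
    url: Optional[str] = None
    host: Optional[str] = None
    port: Optional[int] = None
    # Probe config (hypothetical field names)
    method: str = "GET"
    expected_status_codes: list[int] = field(default_factory=lambda: [200])
    # Cadence and thresholds (hypothetical defaults)
    check_interval_seconds: int = 60
    timeout_seconds: int = 10
    failure_threshold: int = 3
    recovery_threshold: int = 2
    # Routing
    channel_ids: list[str] = field(default_factory=list)
    # Live state, updated after each check
    status: str = "active"
    last_status: Optional[str] = None
    last_check_at: Optional[datetime] = None
    consecutive_failures: int = 0
    consecutive_healthy: int = 0
    current_incident_id: Optional[str] = None

    def validate(self) -> None:
        """Check that the target fields match the endpoint type."""
        if self.type == "http":
            if not self.url:
                raise ValueError("http endpoints need a url")
        elif self.type == "port":
            if not self.host or self.port is None:
                raise ValueError("port endpoints need host and port")
        else:
            raise ValueError(f"unknown endpoint type: {self.type!r}")
```

Keeping config and live state on one record mirrors the point above: there is nothing separate to create before an endpoint can be probed.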

Lifecycle status

An endpoint’s status is one of:

| Status | Behaviour |
| --- | --- |
| active | Default. The scheduler runs probes on the configured interval. |
| paused | Scheduler skips it. No probes, no incidents, no notifications. History kept. |
| archived | Hidden from the list view. Reserved for future use. |
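As a rough illustration of how lifecycle status gates probing, the sketch below decides whether an endpoint is due for a check. The function name is_due and its parameters are hypothetical; the only behaviour taken from this page is that active endpoints are probed on their interval and every other status is skipped.

```python
from datetime import datetime, timedelta
from typing import Optional


def is_due(
    status: str,
    last_check_at: Optional[datetime],
    interval_seconds: int,
    now: datetime,
) -> bool:
    """Return True if the scheduler should probe this endpoint now."""
    # Only active endpoints are probed; paused and archived are skipped.
    if status != "active":
        return False
    # An endpoint that has never been checked is due immediately.
    if last_check_at is None:
        return True
    return now - last_check_at >= timedelta(seconds=interval_seconds)
```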

Check

A single recorded run of an endpoint’s probe. Each row in mx_checks captures one verdict:

  • status — healthy, degraded, down, or inconclusive.
  • status_reason — short human string (e.g. "HTTP 502 — expected 200").
  • response_time, status_code, ssl_days_remaining, port_open, body_bytes.
  • assertion_result — per-rule pass/fail breakdown.

Checks are append-only. Once written, a check row never changes.

In conversation people use “check” loosely — sometimes the probe configuration, sometimes the run. The database treats them as distinct: the endpoint row holds the config; each mx_checks row holds one run.
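The append-only rule can be sketched in Python with a frozen dataclass: each run becomes an immutable value that is appended to a log and never edited. The result fields follow the list above; endpoint_id, record_check and the in-memory list are assumptions for illustration, not the real storage layer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)  # frozen: attributes cannot be reassigned after creation
class Check:
    endpoint_id: str  # hypothetical link back to the endpoint row
    status: str  # "healthy", "degraded", "down", or "inconclusive"
    status_reason: str
    response_time: Optional[float] = None  # units are an assumption here
    status_code: Optional[int] = None
    ssl_days_remaining: Optional[int] = None
    port_open: Optional[bool] = None
    body_bytes: Optional[int] = None


def record_check(log: list, check: Check) -> None:
    """Append one run to the log; existing entries are never modified."""
    log.append(check)


log: list = []
record_check(log, Check("ep-1", "down", "HTTP 502, expected 200", status_code=502))
record_check(log, Check("ep-1", "healthy", "HTTP 200", status_code=200))
```

Trying to assign to a field of an existing Check raises dataclasses.FrozenInstanceError, which is the in-memory analogue of a row that never changes.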

Run statuses

| Status | When it happens |
| --- | --- |
| healthy | Probe succeeded and every assertion passed. |
| degraded | Probe succeeded but a degraded-severity rule fired: latency over budget, SSL inside the warning window, or a soft-fail assertion. |
| down | Status code mismatch, network or connection error, port refused, or a down-severity assertion failed. |
| inconclusive | The run couldn’t reach a verdict (rare; typically a scheduling or storage error). Doesn’t move the streak counters. |
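One way to read the run statuses above is as an ordered set of rules for an HTTP probe. The Python sketch below is an interpretation, not WatchDeck's implementation: the function classify, its parameter names, the default of 14 warning days, and the choice to test down conditions before degraded ones are all assumptions. It does not model inconclusive, which the page attributes to scheduling or storage errors rather than to the probe result.

```python
from typing import Optional, Sequence


def classify(
    *,
    error: Optional[str],
    status_code: Optional[int],
    expected_codes: Sequence[int],
    response_time_ms: float,
    latency_threshold_ms: float,
    ssl_days_remaining: Optional[int] = None,
    ssl_warning_days: int = 14,
    failed_assertion_severities: Sequence[str] = (),
) -> tuple[str, str]:
    """Return (status, status_reason) for one HTTP probe result."""
    # Down: network or connection error.
    if error is not None:
        return "down", error
    # Down: status code mismatch.
    if status_code not in expected_codes:
        expected = ", ".join(str(code) for code in expected_codes)
        return "down", f"HTTP {status_code}, expected {expected}"
    # Down: a down-severity assertion failed.
    if "down" in failed_assertion_severities:
        return "down", "down-severity assertion failed"
    # Degraded: latency over budget.
    if response_time_ms > latency_threshold_ms:
        return "degraded", (
            f"latency {response_time_ms:.0f} ms over "
            f"{latency_threshold_ms:.0f} ms budget"
        )
    # Degraded: certificate inside the warning window.
    if ssl_days_remaining is not None and ssl_days_remaining <= ssl_warning_days:
        return "degraded", f"SSL certificate expires in {ssl_days_remaining} days"
    # Degraded: a soft-fail assertion fired.
    if "degraded" in failed_assertion_severities:
        return "degraded", "soft-fail assertion failed"
    return "healthy", "all assertions passed"
```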

Incident

A grouped failure window. When an endpoint’s consecutive_failures reaches its failure_threshold, an incident opens. When consecutive_healthy reaches recovery_threshold, it resolves. The thresholds keep a single transient blip from paging anyone.
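The threshold logic can be sketched as a small state machine over the two streak counters. This is a minimal illustration under stated assumptions: the names EndpointState, incident_open and apply_check are hypothetical, degraded runs are assumed to count toward the failure streak alongside down runs, and each streak is assumed to reset when a run of the opposite kind arrives.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EndpointState:
    failure_threshold: int
    recovery_threshold: int
    consecutive_failures: int = 0
    consecutive_healthy: int = 0
    incident_open: bool = False


def apply_check(state: EndpointState, status: str) -> Optional[str]:
    """Update the streaks for one check; return "opened", "resolved", or None."""
    # Inconclusive runs do not move the streak counters.
    if status == "inconclusive":
        return None
    if status == "healthy":
        state.consecutive_healthy += 1
        state.consecutive_failures = 0
        if state.incident_open and state.consecutive_healthy >= state.recovery_threshold:
            state.incident_open = False
            return "resolved"
    else:  # "down" or "degraded" (assumed to both count as failures)
        state.consecutive_failures += 1
        state.consecutive_healthy = 0
        if not state.incident_open and state.consecutive_failures >= state.failure_threshold:
            state.incident_open = True
            return "opened"
    return None
```

With failure_threshold=3, a single down run followed by a healthy one never opens an incident, which is the transient-blip case the thresholds are there to absorb.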

Incident status

Incidents have just two states:

| Status | Meaning |
| --- | --- |
| active | Open and still firing. The endpoint’s current_incident_id points at it. |
| resolved | Closed: the recovery streak met its threshold. resolved_at and duration_seconds are populated. |

There’s no acknowledged state: incidents go straight from active to resolved once consecutive_healthy reaches recovery_threshold.

Each incident carries a cause (endpoint_down or endpoint_degraded), a timeline JSONB array of every check that contributed to it, and a notifications-sent counter. See Incidents for the full lifecycle.
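As a sketch of the two-state lifecycle, the Python below models an incident that moves from active to resolved exactly once and fills in resolved_at and duration_seconds at that point. The opened_at field, the add_check and resolve methods, and the shape of timeline entries are assumptions; the real timeline is a JSONB array whose structure is not described on this page.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Incident:
    cause: str  # "endpoint_down" or "endpoint_degraded"
    opened_at: datetime  # hypothetical field name
    status: str = "active"
    timeline: list[dict] = field(default_factory=list)
    notifications_sent: int = 0
    resolved_at: Optional[datetime] = None
    duration_seconds: Optional[int] = None

    def add_check(self, at: datetime, status: str) -> None:
        """Record a contributing check on the timeline (entry shape is assumed)."""
        self.timeline.append({"at": at.isoformat(), "status": status})

    def resolve(self, now: datetime) -> None:
        """Close the incident; there is no intermediate acknowledged state."""
        if self.status != "active":
            raise ValueError("incident is already resolved")
        self.status = "resolved"
        self.resolved_at = now
        self.duration_seconds = int((now - self.opened_at).total_seconds())


opened = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
incident = Incident(cause="endpoint_down", opened_at=opened)
incident.add_check(opened, "down")
incident.resolve(datetime(2024, 1, 1, 12, 5, 0, tzinfo=timezone.utc))
```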

Channel

A destination for notifications. Channels live independently of endpoints — you create them once under Notifications and attach them to as many endpoints as you like.

| Type | Notes |
| --- | --- |
| email | One or more recipients per channel. |
| slack | Webhook into a Slack channel. |
| discord | Webhook into a Discord channel. |
| webhook | Generic JSON POST to any URL. |

Each channel carries its own filters (severity, event type), delivery priority, quiet hours, and rate limit — see Notifications for how those compose with per-endpoint routing.

Mute

A scoped silence. Mutes suppress channel delivery without changing endpoint state — incidents still open and resolve, but no message goes out.

| Scope | Effect |
| --- | --- |
| endpoint | Silences notifications for one endpoint. |
| channel | Silences one channel across all endpoints it serves. |
| global | Silences everything. |

Each mute has an optional expires_at and a free-text reason — useful for planned maintenance windows.
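A delivery-time mute check might look like the Python sketch below. It is an illustration only: the target_id field and the is_muted function are hypothetical, and the rule that a mute stops applying once expires_at has passed is an assumption about how the optional expiry behaves.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass
class Mute:
    scope: str  # "endpoint", "channel", or "global"
    target_id: Optional[str] = None  # hypothetical: id of the muted endpoint or channel
    expires_at: Optional[datetime] = None  # None means no expiry
    reason: str = ""


def is_muted(
    mutes: Iterable[Mute],
    endpoint_id: str,
    channel_id: str,
    now: datetime,
) -> bool:
    """Return True if any unexpired mute covers this endpoint/channel pair."""
    for mute in mutes:
        # Skip mutes whose expiry has already passed.
        if mute.expires_at is not None and mute.expires_at <= now:
            continue
        if mute.scope == "global":
            return True
        if mute.scope == "endpoint" and mute.target_id == endpoint_id:
            return True
        if mute.scope == "channel" and mute.target_id == channel_id:
            return True
    return False
```

The check only gates delivery. Endpoint state and incident handling run as usual, so incidents still open and resolve while the mute is in effect.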

How they fit together

Endpoint ──── runs ────► Check (1:many)
   │                     (one row per probe)
   ├──── may open ────► Incident (1:many)
   │                     (grouped failure window)
   └──── routes to ────► Channel (many:many)
                            └──── may be silenced by ──► Mute

Everything else in the product — the dashboard tiles, the catalogue, the notification log — is built on top of these five nouns.
