Troubleshooting
A reference page of the most common “wait, why is it doing this?” situations. Skim until you find yours.
Checks
“My check says down but I can hit the endpoint from my browser”
Likely culprits, roughly by frequency:
- Different network path. WatchDeck probes from cloud infrastructure, not your laptop. If your endpoint is behind an IP allowlist or only resolves on your VPN, the probe will fail. Whitelist WatchDeck’s egress or expose a public health route.
- Status code outside the expected list. Your endpoint might return 204 while your check expects [200]. Open the endpoint’s check log and look at the status_code column.
- A Host or User-Agent filter on your side. WatchDeck sends User-Agent: WatchDeck/1.0 by default. Some WAFs block unrecognised UAs; whitelist the WatchDeck UA or override it via a custom header — see HTTP checks.
- Redirect loop or more than 5 redirects. WatchDeck follows up to 5 redirects, then gives up. If your endpoint chains through too many, configure a shorter chain or point the check at the final URL.
- The latency threshold is firing as degraded. A response that arrives just past your latency budget reads as degraded, not down. Check the responseTime against your latencyThreshold — see Checks → Defaults.
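Two of the culprits above (the expected status list and the latency threshold) reduce to a small classification order. A minimal sketch under stated assumptions: ProbeResult and classify are hypothetical names, and the logic is illustrative, not WatchDeck's implementation.

```python
from dataclasses import dataclass

@dataclass
class ProbeResult:
    status_code: int
    response_time_ms: float

def classify(result: ProbeResult, expected_statuses: list, latency_threshold_ms: float) -> str:
    # A wrong status code reads as down; a slow but correct response
    # reads as degraded, not down (hypothetical logic that mirrors the
    # behaviour described above).
    if result.status_code not in expected_statuses:
        return "down"
    if result.response_time_ms > latency_threshold_ms:
        return "degraded"
    return "up"

print(classify(ProbeResult(204, 80.0), [200], 500))   # down: 204 is not in [200]
print(classify(ProbeResult(200, 900.0), [200], 500))  # degraded: past the latency budget
```

So a 204 against an expected list of [200] reads as down even if the response was fast, while a slow 200 reads as degraded.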
Use Test now in the Assertions section to see exactly what WatchDeck sees — status code, latency, headers, body — without writing to history. It’s the single fastest way to debug.
“Test now passes but the live check fails”
The probe paths are identical, so the divergence is almost always one of:
- The check ran at a different time. Maybe your service is flaky and Test happened to land between failures.
- Assertions are evaluated on the live check but skipped on Test if the status check failed first. Open the failing live run in the check log and look at status_reason — it will name which evaluator killed it.
- The endpoint behaves differently when called repeatedly — rate limit, session, cold cache.
“SSL warning fires but my cert is fine”
WatchDeck only checks days until expiry, derived from the cert’s validTo field. If a warning fires:
- The default warning window is 14 days. If your cert renews automatically a week before expiry, you’ll see the warning appear ~21 days before renewal. Lower the window to 7 days (or to Off) on endpoints with tight auto-renewal.
- WatchDeck does not validate the chain, the hostname, or revocation. So an SSL warning here is purely about expiry — see SSL & certificates.
- An ssl assertion overrides the Monitoring-tab warning window. If you’ve added one and it’s tighter than the warning window, that’s where the warning is coming from.
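Because the warning is purely expiry math, you can reason about the window in a few lines. A sketch under stated assumptions: days_until_expiry and should_warn are hypothetical helpers, not WatchDeck code; only the documented cutoffs are taken from the page above.

```python
from datetime import datetime, timezone

def days_until_expiry(valid_to: datetime, now: datetime) -> int:
    # Only the gap to validTo matters; chain, hostname, and
    # revocation are never examined (per the docs above).
    return (valid_to - now).days

def should_warn(valid_to: datetime, now: datetime, warning_window_days: int = 14) -> bool:
    return days_until_expiry(valid_to, now) <= warning_window_days

now = datetime(2025, 6, 1, tzinfo=timezone.utc)
valid_to = datetime(2025, 6, 13, tzinfo=timezone.utc)  # cert expires in 12 days

# Suppose auto-renewal lands 7 days before expiry, so it has not happened yet:
print(should_warn(valid_to, now, 14))  # True: the 14-day window fires early
print(should_warn(valid_to, now, 7))   # False: a 7-day window stays quiet until renewal
```

This is why lowering the window to 7 days silences the warning on endpoints whose certs auto-renew a week before expiry.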
“The check ran late or hasn’t run yet”
- The endpoint may be paused (status='paused') — see Endpoints → Pausing. The dashboard’s status column shows a paused chip.
- The minimum interval is 60 seconds, enforced both in the form and at the database. If a third-party tool wrote a row with a shorter interval, the trigger will reject it.
- Cron isn’t perfectly punctual — runs land within a small jitter window of the scheduled time, not on the exact second.
“The check log is missing rows from a few days ago”
Check rows are pruned at 48 hours by the cleanup cron. Beyond that, you have hourly summaries (30 days) and daily summaries (indefinite) — but not the per-probe rows. The corresponding incident is kept indefinitely if one was opened, but the underlying check rows for it may be gone.
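The retention tiers above can be summed up as a function of data age. This is a sketch for reasoning only; available_granularity is a hypothetical helper, and the cutoffs are the documented ones.

```python
from datetime import timedelta

def available_granularity(age: timedelta) -> list:
    # Which data tiers still exist for check data of a given age.
    tiers = []
    if age <= timedelta(hours=48):
        tiers.append("per-probe rows")
    if age <= timedelta(days=30):
        tiers.append("hourly summaries")
    tiers.append("daily summaries")  # kept indefinitely
    return tiers

print(available_granularity(timedelta(hours=3)))
print(available_granularity(timedelta(days=5)))   # raw rows already pruned
print(available_granularity(timedelta(days=90)))  # daily summaries only
```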
Notifications
“I’m not getting notifications”
Walk the dispatch gates in order — the first skip short-circuits everything after it. The full chain is documented at Notifications → What gates a dispatch. Quick checklist:
- Channel enabled? A disabled channel is silently skipped.
- Endpoint paused this channel? Open the endpoint’s Notifications tab — each channel has Wired and Paused toggles. A channel that’s wired-but-paused on this endpoint won’t fire here even if the channel itself is enabled and reachable.
- Recovery alert off? If you’re missing the all-clear but getting the open, the endpoint’s recovery_alert toggle is off.
- Global mute set? Settings → Notifications → Global mute. While the mute is in the future, every dispatch is suppressed (and a delivery_status='suppressed' row is written to the log).
- Severity filter excludes it? A channel set to critical won’t fire on warning-severity opens.
- Event filter excludes it? A channel with sendOpen off won’t fire on opens; same for sendResolved.
- Provider rejected it? Check the Delivery log under Notifications — it shows every dispatch attempt, with failure_reason for failures.
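To see the short-circuit order in one place, here is a hedged sketch of the gate chain. The gate order is the one documented above; the field names min_severity and paused_channels are illustrative stand-ins, not WatchDeck's actual schema.

```python
SEVERITY_RANK = {"warning": 1, "critical": 2}

def dispatch_gate(channel, endpoint, event, global_mute_until, now):
    # Walk the gates in order; the first skip short-circuits the rest.
    if not channel["enabled"]:
        return "skipped: channel disabled"
    if channel["id"] in endpoint["paused_channels"]:
        return "skipped: paused on this endpoint"
    if event["type"] == "resolved" and not endpoint["recovery_alert"]:
        return "skipped: recovery_alert off"
    if global_mute_until is not None and now < global_mute_until:
        return "suppressed: global mute"
    # Recovery alerts carry severity "success" and bypass this filter.
    if event["severity"] in SEVERITY_RANK and \
            SEVERITY_RANK[event["severity"]] < SEVERITY_RANK[channel["min_severity"]]:
        return "suppressed: severity filter"
    if event["type"] == "open" and not channel["sendOpen"]:
        return "suppressed: event filter"
    if event["type"] == "resolved" and not channel["sendResolved"]:
        return "suppressed: event filter"
    return "dispatch"

channel = {"id": "slack", "enabled": True, "min_severity": "critical",
           "sendOpen": True, "sendResolved": True}
endpoint = {"paused_channels": set(), "recovery_alert": True}

print(dispatch_gate(channel, endpoint,
                    {"type": "open", "severity": "warning"}, None, 0))
# suppressed: severity filter (critical-only channel, warning open)
print(dispatch_gate(channel, endpoint,
                    {"type": "resolved", "severity": "success"}, None, 0))
# dispatch (recovery bypasses the severity filter)
```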
If your channel was working last week and isn’t now, the Test button on the channel card is the fastest way to confirm the provider’s still reachable. A failed test sets is_connected=false on the card.
“I configured rate limit / cooldown / quiet hours / scoped mutes — they don’t seem to do anything”
Correct — they’re stored but not enforced today. See Notifications → Stored-but-inert fields. Setting them is harmless; they’ll begin to take effect once the dispatcher learns to consult them.
For now:
- To silence one endpoint — use the Pause toggle on the endpoint’s Notifications tab.
- To silence one channel everywhere — set the channel’s enabled flag to false.
- To silence everything for a window — set the Global mute under Settings → Notifications.
“Recovery alert never arrived”
Recovery alerts use severity success and bypass the channel’s severity filter. So if you’re not getting them:
- The endpoint’s recovery_alert toggle is off (the dispatcher short-circuits before any per-channel processing — there won’t even be a log row).
- sendResolved is off on the channel.
- The incident is still active — recovery only fires when consecutive_healthy >= recovery_threshold. Check the incident’s timeline for the recovery progress (healthy — <code> — <ms>ms (X/Y)).
“My Slack / Discord message landed in the wrong channel”
Incoming webhooks are bound to one channel at creation time. To retarget, generate a new webhook in the destination channel and update the URL on the WatchDeck channel.
“Webhook receiver returns 401 / 403”
WatchDeck does not sign payloads today (see Webhooks → Signing). If your receiver requires a token, paste it into the channel’s Headers map as Authorization: Bearer <token>. Headers are stored in plaintext, so treat any token here as a CI-style secret.
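On the receiver side, that header reduces to a plain token comparison. A minimal sketch of what a token-guarded receiver might do; the token value and the authorize helper are hypothetical, not part of WatchDeck.

```python
EXPECTED = "Bearer s3cret"  # hypothetical token, mirrored in the channel's Headers map

def authorize(headers: dict) -> int:
    # The HTTP status a token-guarded receiver would answer with.
    got = headers.get("Authorization")
    if got is None:
        return 401  # no credentials at all
    if got != EXPECTED:
        return 403  # credentials present but wrong
    return 200

print(authorize({}))                                 # 401
print(authorize({"Authorization": "Bearer wrong"}))  # 403
print(authorize({"Authorization": EXPECTED}))        # 200
```

A 401 usually means the Headers map entry is missing entirely; a 403 usually means the token value doesn't match what the receiver expects.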
“I see suppressed entries in the Delivery log but no failure”
suppressed rows are written by the dispatcher when a gate fires before the provider was called — global mute, severity filter, event filter, channel disabled. The suppressed_reason column tells you which gate. These are working as designed; they exist so you can audit why a notification didn’t go out.
Incidents
“I want to acknowledge an incident — where’s the button?”
There isn’t one. Incidents only have two states: active and resolved. They open and close based on the failure / recovery streaks. See Incidents → Lifecycle for the full mechanics.
“My endpoint is fixed but the incident won’t close”
consecutive_healthy has to reach recovery_threshold (default 2) before the trigger closes the incident. Open the incident detail page → the timeline shows recovery progress (healthy — ... (X/Y)). If you’ve just deployed a fix, wait for two more probe intervals (default 60s each).
If recovery is stuck because an assertion is failing on an otherwise-healthy probe, the assertion is the culprit — not the endpoint. Adjust the assertion (or temporarily soften its severity) and the next probe pair will close it.
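The closing condition can be sketched in a few lines. close_after is a hypothetical helper that walks probe results the way the streak counter is described above; it is not WatchDeck's trigger.

```python
def close_after(statuses, recovery_threshold=2):
    # Returns the index of the probe at which the incident closes
    # (consecutive_healthy reaches the threshold), or None if it stays open.
    consecutive_healthy = 0
    for i, healthy in enumerate(statuses):
        consecutive_healthy = consecutive_healthy + 1 if healthy else 0
        if consecutive_healthy >= recovery_threshold:
            return i
    return None

print(close_after([True, True]))         # 1: closes on the second healthy probe
print(close_after([True, False, True]))  # None: the failure resets the streak
```

This is also why a flapping assertion keeps an incident open: each failed probe resets consecutive_healthy to zero.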
“Old incident pages show no chart data”
The response chart pulls from the underlying check rows in [startedAt − 30m, (resolvedAt | now) + 30m], capped at 2000 points. Those rows are pruned at 48 hours, so an incident from a week ago will have an empty chart even though the incident itself is still in the history list with its full timeline preserved.
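The window itself is simple arithmetic. A sketch for reasoning only; chart_window is a hypothetical helper, and the 2000-point cap and 48-hour pruning are applied elsewhere, not modeled here.

```python
from datetime import datetime, timedelta, timezone

def chart_window(started_at, resolved_at, now):
    # [startedAt - 30m, (resolvedAt | now) + 30m], per the docs above.
    end = resolved_at if resolved_at is not None else now
    return (started_at - timedelta(minutes=30), end + timedelta(minutes=30))

now = datetime(2025, 6, 1, 12, 0, tzinfo=timezone.utc)
start = datetime(2025, 6, 1, 9, 0, tzinfo=timezone.utc)
lo, hi = chart_window(start, None, now)  # still-active incident
print(lo.isoformat(), hi.isoformat())
```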
Dashboard
“The dashboard hasn’t updated”
The Overview page polls — it doesn’t use realtime. Three things refresh independently:
- Charts refetch when you change the range or endpoint selection — and via the Refresh button.
- In-window memos recompute every minute (right edge of charts slides forward).
- “Updated Xs ago” ticks every 5 seconds.
For the most-current snapshot, hit Refresh in the filter bar.
“An endpoint shows as healthy on the dashboard but I just got a down alert”
The dashboard reads each endpoint’s last_status mirror column on the endpoint row. The down alert was triggered by an incident-open event from the trigger. There can be a small lag between the check row being written (which fires the trigger and emits the alert) and the last_status column being updated for the dashboard. A page refresh will reconcile.
Quotas, plans, and demo mode
“Adding an endpoint shows error PT403”
You’re at your endpoint cap. The Add button should already be disabled — if you got here through a script, the database raised PT403 from the quota trigger. See Pricing for tier limits and how to lift them.
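The gate behaves like a pre-insert cap check. A sketch of the shape of that check: in WatchDeck it lives in a database trigger, not application code, and QuotaError here is hypothetical; only the PT403 code is documented.

```python
class QuotaError(Exception):
    code = "PT403"

def assert_under_cap(current_count: int, cap: int) -> None:
    # Mirrors the documented behaviour: writes past the cap are rejected.
    if current_count >= cap:
        raise QuotaError(f"PT403: endpoint cap of {cap} reached")

assert_under_cap(4, 5)  # one slot left, no error
try:
    assert_under_cap(5, 5)
except QuotaError as exc:
    print(exc)  # PT403: endpoint cap of 5 reached
```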
“Everything in the demo account is greyed out”
That’s by design. The public demo account is read-only at three layers (database RLS, server actions, UI button gating). You’re meant to browse, not write. Sign in with your own account to make changes — see Settings → Demo accounts.
“Where do I see what plan I’m on?”
Plan tier is surfaced in:
- The Quota banner on the Endpoints and Notifications list pages, which shows your current count against your cap.
- A small tier note on the Settings → Retention panel.
There’s no dedicated plan page in the UI today. See Pricing for the published tiers.
Catalogue
“I clicked a code in an incident and a side panel opened — what is this?”
Anywhere a status code or error code appears in the app (incidents, check logs, dashboard), clicking it opens the Catalogue side panel with the full reference entry — without leaving the page you were on. The same data backs the Catalogue page itself.
“A code I encountered isn’t in the catalogue”
If toggling the curated only filter off shows nothing either, the code is genuinely missing. Open an issue on the WatchDeck repository with the code, the family it belongs to, and what triggered it.
Still stuck?
Open an issue on the WatchDeck repository with:
- The endpoint URL (or a redacted version).
- The check ID or incident ID if relevant.
- A short description of expected vs actual.
- Whether Test now reproduces the issue.
The faster you give us a reproduction surface, the faster we can debug.