Alerts

dockmesh evaluates alert rules against the metrics pipeline every 30 seconds. When a rule fires, notifications go out via the configured channels.

A rule has:

| Field | Example |
| --- | --- |
| Name | Prod CPU > 80% |
| Metric | cpu_percent, memory_percent, container_restarts, disk_used_percent, … |
| Scope | Container / Stack / Host / Host tag |
| Condition | > 80, < 10, == 0, increased by 5 in 10m |
| Window | 5m, 15m, 1h |
| Severity | info, warning, critical |
| Cooldown | Time between re-alerts on the same resource |
| Channels | Where to send |

Alerts → Rules → New rule walks through the fields. A live preview shows how many resources currently match the scope and would trigger.

Example: “Alert if any container in stacks tagged prod restarts more than 3 times in 10 minutes, notify Slack + PagerDuty, cooldown 30m, severity critical.”
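Condition and window work together: a rule fires only when the condition holds across the whole window, not just at the latest sample. A minimal sketch of that evaluation (function and type names are hypothetical, not dockmesh internals):

```python
from dataclasses import dataclass

@dataclass
class Sample:
    ts: float      # unix timestamp
    value: float   # metric value at that time

def fires(samples: list[Sample], threshold: float, window_s: float, now: float) -> bool:
    """True if the metric exceeded `threshold` for the entire window.

    Mirrors a rule like "cpu_percent > 80 for 5m": every sample inside
    the window must violate the threshold.
    """
    recent = [s for s in samples if s.ts >= now - window_s]
    return bool(recent) and all(s.value > threshold for s in recent)

def increased_by(samples: list[Sample], delta: float, window_s: float, now: float) -> bool:
    """Mirrors "increased by N in 10m": compare oldest vs newest in the window."""
    recent = sorted((s for s in samples if s.ts >= now - window_s), key=lambda s: s.ts)
    return len(recent) >= 2 and recent[-1].value - recent[0].value >= delta
```

The sustained-condition form explains why a brief CPU spike does not page anyone: one sample above the threshold is not enough.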

Each severity has its own icon, color, and default cooldown:

| Level | Color | Default cooldown |
| --- | --- | --- |
| Info | Blue | 4h |
| Warning | Amber | 30m |
| Critical | Red | 5m |

Channels can be filtered by severity — e.g. Slack gets all, PagerDuty only critical.
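Per-channel severity filtering amounts to a minimum-severity check at notify time. A sketch of that routing (the channel config shape here is illustrative, not dockmesh's actual schema):

```python
SEVERITIES = ["info", "warning", "critical"]

# Hypothetical config: each channel declares the lowest severity it accepts.
CHANNELS = {
    "slack":     "info",      # gets everything
    "pagerduty": "critical",  # only pages on critical
}

def channels_for(severity: str) -> list[str]:
    """Return the channels that should receive an alert of this severity."""
    rank = SEVERITIES.index(severity)
    return [name for name, floor in CHANNELS.items()
            if rank >= SEVERITIES.index(floor)]
```

With this config a warning reaches Slack only, while a critical alert fans out to both channels.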

Built-in channels (Settings → Channels):

  • Email — SMTP host + credentials, supports STARTTLS
  • Slack — incoming-webhook URL
  • Discord — webhook URL
  • Microsoft Teams — Incoming Webhook connector URL
  • ntfy.sh — topic URL, optional auth
  • Gotify — server URL + app token
  • Generic webhook — POST JSON to any URL
  • PagerDuty — Events API v2 integration key. Dedup-key is derived from rule+container so repeated fires fold into one PD incident.
  • Pushover — app_token + user_key, optional device + sound. Critical alerts map to priority 1 (visual alert); emergency priority 2 is intentionally not exposed (would need ack-handling the UI doesn’t have yet).
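The PagerDuty folding behaviour relies on the Events API v2 `dedup_key` field: repeated events carrying the same key update one incident instead of opening new ones. The exact derivation dockmesh uses is not documented, so the hashing scheme below is an assumption; the payload shape is the standard Events API v2 trigger:

```python
import hashlib

def dedup_key(rule_name: str, container_id: str) -> str:
    # Stable key: the same rule firing on the same container always
    # maps to the same incident. (Derivation scheme is an assumption.)
    return hashlib.sha256(f"{rule_name}:{container_id}".encode()).hexdigest()[:32]

def pd_trigger_event(rule_name: str, container_id: str,
                     summary: str, routing_key: str) -> dict:
    """Build a PagerDuty Events API v2 "trigger" payload."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "dedup_key": dedup_key(rule_name, container_id),
        "payload": {
            "summary": summary,
            "source": container_id,
            "severity": "critical",
        },
    }
```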

See individual integration guides for per-channel setup. Telegram is not built-in — use the generic webhook against the Telegram Bot API if you need it.
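One way to bridge the generic webhook to Telegram is a tiny relay that reshapes dockmesh's JSON into a Bot API `sendMessage` call. The incoming field names below are assumptions (the generic-webhook payload schema isn't documented here); the Telegram URL format is the standard Bot API one:

```python
import json
import urllib.request

def relay_to_telegram(alert: dict, bot_token: str, chat_id: str) -> urllib.request.Request:
    # `alert` is the JSON body dockmesh POSTs; these field names are assumed.
    text = (f"[{alert.get('severity', '?').upper()}] "
            f"{alert.get('rule', 'unknown rule')}: "
            f"{alert.get('resource', '?')} = {alert.get('value', '?')}")
    body = json.dumps({"chat_id": chat_id, "text": text}).encode()
    return urllib.request.Request(
        f"https://api.telegram.org/bot{bot_token}/sendMessage",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# urllib.request.urlopen(req) would actually send the message; omitted here.
```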

Without cooldown, a container stuck in a crash loop would generate an alert on every 30-second evaluation. Cooldown suppresses duplicate alerts on the same resource for the configured window. When the underlying condition clears and later fires again, you get a new alert.

Alerts → Mute rules lets you temporarily silence alerts matching a filter:

  • Mute everything on host prod-01 for 2 hours (during maintenance)
  • Mute warnings from stack experimental-app until Friday
  • Mute a specific rule indefinitely (similar in effect to disabling it)

Mutes are a separate concept from disabling — disabled rules don’t evaluate at all; muted rules evaluate but don’t notify.
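That distinction can be sketched as a short evaluation flow. All names are hypothetical, and whether muted fires are recorded in History isn't stated above, so that part is an assumption:

```python
def handle_rule(rule: dict, fired: bool, history: list, notify) -> None:
    # Disabled: skipped entirely — no evaluation, no history entry.
    if rule["disabled"]:
        return
    if not fired:
        return
    history.append(rule["name"])  # assumed: muted fires still land in history
    if not rule["muted"]:
        notify(rule["name"])      # only unmuted rules reach channels
```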

Alerts → History shows every alert fired, with:

  • Timestamp
  • Rule name
  • Resource (container / stack / host)
  • Severity
  • Value that tripped the threshold
  • Which channels received it
  • Resolution time (if auto-resolved)

Export to CSV for compliance or post-mortems.
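For a post-mortem, the export can be sliced with the standard csv module. The header names below mirror the history fields listed above, but the exact spelling in the real export is an assumption:

```python
import csv
import io

SAMPLE_EXPORT = """\
timestamp,rule,resource,severity,value,channels,resolved_at
2024-05-01T10:00:00Z,Prod CPU > 80%,web-1,critical,93.2,slack;pagerduty,2024-05-01T10:12:00Z
2024-05-01T11:00:00Z,Container memory > 90%,db-1,warning,91.0,slack,
"""

def critical_alerts(csv_text: str) -> list[dict]:
    """Filter an exported alert history down to critical fires."""
    return [row for row in csv.DictReader(io.StringIO(csv_text))
            if row["severity"] == "critical"]
```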

dockmesh ships with four container-level defaults on first install so every new deployment has coverage from day one:

| Rule | Metric | Threshold | Duration | Severity |
| --- | --- | --- | --- | --- |
| Container CPU > 90% (sustained) | cpu_percent | > 90 | 5 min | warning |
| Container CPU > 95% (critical) | cpu_percent | > 95 | 15 min | critical |
| Container memory > 90% | mem_percent | > 90 | 5 min | warning |
| Container memory > 98% (near-OOM) | mem_percent | > 98 | 60 s | critical |

Built-in rules are flagged with a “built-in” badge in the Alerts table. They can be edited (change threshold, duration, mute, attach channels) and disabled, but not deleted — disabling them is the supported way to opt out. Deletion returns 409 Conflict from the API.

Host-level rules (disk, agent-offline, backup-job-failed) need per-host metrics that aren’t emitted yet — they’ll ship with follow-up slices that add the collectors.