Alarm Rules

Built-in alarm rules ship pre-configured with mipo and cover scanner health, infrastructure liveness, and core-service availability. Each rule has a trigger event, a severity, and default routing. From the Alarm Rules admin page, operators can override a rule's severity, mute it, or change its notification channels. Rules cannot be created or deleted: they are defined in code and seeded automatically on every boot.

Fields & Columns

Name Description
Enabled Whether this rule creates alarms. Disabled rules ignore matching events.
Name Unique rule identifier (e.g., scanner-down, tls-expiring)
Description Human-readable explanation of what the rule detects
Severity Current severity for new alarms. Can be overridden from the default.
Auto-Resolve Event type that auto-resolves matching alarms (null = manual only)

Built-in Rules

Rule Description
scanner-down Availability: fires on scanner_down (default: high), auto-resolves on scanner_up. Scanner sent no heartbeat within the 2-minute timeout.
ingest-down Availability: fires on ingest_down (default: critical), auto-resolves on ingest_up. Ingest API node stopped heartbeating.
db-down Availability: fires on db_down (default: critical), auto-resolves on db_up. Database connection pool test failed.
proxy-down Availability: fires on proxy_down (default: critical), auto-resolves on proxy_up. Traefik reverse proxy health check failed.
dns-down Availability: fires on dns_down (default: high), auto-resolves on dns_up. Public FQDN DNS resolution failed.
backup-failed Operations: fires on backup_failed (default: high), manual resolve only. Manual or scheduled backup completed with errors.
db-disk-high Threshold: fires on db_disk_high (default: warning), auto-resolves on db_disk_normal. Database disk usage exceeded threshold.
db-connections-high Threshold: fires on db_connections_high (default: warning), auto-resolves on db_connections_normal. Database connection pool has sustained waiting queries.
tls-expiring Deadline: fires on tls_expiring (default: warning), auto-resolves on tls_renewed. TLS certificate expires within 30 days.
scanner-load-high Resource: fires on scanner_load_high (default: warning), auto-resolves on scanner_load_normal. Scanner load average exceeds 80% of CPU capacity.
scanner-memory-high Resource: fires on scanner_memory_high (default: warning), auto-resolves on scanner_memory_normal. Scanner available memory below 10%.
db-sessions-high Database: fires on db_sessions_high (default: warning), auto-resolves on db_sessions_normal. Active database sessions exceed 80% of max_connections.
db-long-queries Database: fires on db_long_queries_high (default: warning), auto-resolves on db_long_queries_normal. Database query running longer than 60 seconds.
auth-failures-high Security: fires on auth_failures_high (default: high), auto-resolves on auth_failures_normal. More than 10 authentication failures in 5 minutes.
session-ip-spread Security: fires on session_ip_spread_high (default: warning), manual resolve only. One user account has active sessions from too many distinct IP addresses.
scan-stuck Stall: fires on scan_stuck (default: warning), manual resolve only. Scan is still marked as running although all of its jobs have finished.
scanner-heartbeat-failed Security: fires on scanner_heartbeat_failed (default: warning), manual resolve only. Scanner heartbeat authentication failed — invalid API key or IP binding violation. Requires manual acknowledgement; a scanner can resume heartbeating successfully while prior auth failures remain in scope.
dispatcher-down Availability: fires on dispatcher_down (default: critical), auto-resolves on dispatcher_up. Dispatcher service stopped heartbeating for more than 2 minutes. When the dispatcher is down, no new scan jobs are dispatched and active scans stall.
backup-down Availability: fires on backup_down (default: warning), auto-resolves on backup_up. Backup container is unreachable via DNS resolution. Typically indicates the backup container has stopped or failed to start.
manager-memory-high Resource: fires on manager_memory_high (default: warning), auto-resolves on manager_memory_normal. Manager heap usage exceeded 80% of the 512 MB container memory limit. Sustained high memory may precede an OOM termination; consider restarting the manager or investigating large in-flight requests.
container-recovery-failed Infrastructure: fires on container_recovery_failed (default: critical), manual resolve only. A container did not recover within 2 minutes after the watchdog attempted a restart. Indicates a persistent crash loop or configuration error that automatic healing cannot fix.
container-restart-storm Infrastructure: fires on container_restart_storm (default: high), manual resolve only. A container restarted more than 3 times within a 15-minute window. Suggests a crash loop; investigate container logs before allowing further restarts.
container-down Infrastructure: fires on container_down (default: high), auto-resolves on container_up. Container is running but not reachable over the Docker internal network — health check connections are refused or timing out.
container-restarted Infrastructure: fires on container_restarted (default: info), manual resolve only. Container was restarted by the watchdog autoheal mechanism. Informational — the restart itself is the resolution of the underlying fault. Review logs if restarts become frequent.
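
Several rules above are described as "more than N occurrences in a time window" (auth-failures-high: more than 10 failures in 5 minutes; container-restart-storm: more than 3 restarts in 15 minutes). The following is a minimal sketch of how such a check could be evaluated with a sliding window. It is not mipo's implementation: the class name SlidingWindowCounter and its fields are hypothetical, and the real detector may count differently (for example, with fixed buckets).

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class SlidingWindowCounter:
    """Hypothetical helper: counts events inside a trailing time window."""

    window_seconds: float
    threshold: int
    _timestamps: deque = field(default_factory=deque)

    def record(self, now: float) -> bool:
        """Record one event at time `now` (seconds).

        Returns True when the number of events still inside the window
        is strictly greater than the threshold ("more than N").
        """
        self._timestamps.append(now)
        # Drop events that have fallen out of the trailing window.
        while self._timestamps and self._timestamps[0] <= now - self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) > self.threshold


# auth-failures-high: more than 10 authentication failures in 5 minutes.
auth_failures = SlidingWindowCounter(window_seconds=300, threshold=10)

# Eleven failures, one per second: only the 11th crosses the limit.
fired = [auth_failures.record(now=float(t)) for t in range(11)]
```

A strict "greater than" comparison is used because the rule text says "more than"; whether the product treats the boundary the same way is an assumption.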

How To

Disable an alarm type

  1. Find the rule in the table
  2. Click the enabled toggle to turn it off
  3. New events of this type will be ignored (existing alarms remain)

Override severity

  1. Find the rule in the table
  2. Use the severity dropdown to change the level
  3. New alarms will use the overridden severity

Reset severity to default

  1. Find the rule with a non-default severity (shown in parentheses)
  2. Click the "Reset" button
  3. Severity reverts to the code-defined default

Gotchas

  1. Disabling a rule does not resolve existing alarms — it only prevents new ones.
  2. Severity overrides apply to new alarms only. Existing alarms keep their original severity.
  3. Rules are re-seeded on boot. New rules appear automatically after a code update.
  4. The default severity is immutable — it reflects the code-defined importance of the fault.
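
The gotchas above can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the product's code: the Rule and Alarm classes, their field names, and handle_event are all hypothetical. It also assumes that a disabled rule ignores its auto-resolve event as well as its trigger event, and it does not model alarm deduplication.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    name: str
    trigger_event: str
    default_severity: str                     # code-defined, never modified here
    auto_resolve_event: Optional[str] = None  # None = manual resolve only
    enabled: bool = True
    severity_override: Optional[str] = None

    @property
    def severity(self) -> str:
        """Severity applied to new alarms: the override if set, else the default."""
        return self.severity_override or self.default_severity


@dataclass
class Alarm:
    rule_name: str
    severity: str  # copied at creation time, so later overrides do not change it
    resolved: bool = False


def handle_event(event_type: str, rules: list[Rule], alarms: list[Alarm]) -> None:
    """Match one event against every rule (hypothetical matching logic)."""
    for rule in rules:
        if not rule.enabled:
            continue  # assumption: disabled rules ignore all matching events
        if event_type == rule.trigger_event:
            alarms.append(Alarm(rule.name, rule.severity))
        elif event_type == rule.auto_resolve_event:
            for alarm in alarms:
                if alarm.rule_name == rule.name:
                    alarm.resolved = True


rules = [Rule("scanner-down", "scanner_down", "high", "scanner_up")]
alarms: list[Alarm] = []

handle_event("scanner_down", rules, alarms)  # opens an alarm at "high"
rules[0].severity_override = "critical"      # affects new alarms only
handle_event("scanner_down", rules, alarms)  # opens a second alarm at "critical"
rules[0].enabled = False                     # existing alarms stay open
handle_event("scanner_down", rules, alarms)  # ignored: rule is disabled
```

Setting severity_override back to None corresponds to the "Reset" action: the rule falls back to default_severity, which is never written to.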

API Calls (3)

Method Path Description
GET /api/admin/alerting/alarm-rules List all built-in alarm rules
PATCH /api/admin/alerting/alarm-rules/:id Toggle enabled or override severity
POST /api/admin/alerting/alarm-rules/:id/reset Reset severity to default
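
As a hedged illustration of the three calls above, the sketch below builds the requests with Python's standard urllib without sending them. Only the methods and paths come from the table. The host in BASE_URL, the JSON field names "enabled" and "severity", and the use of the rule name as the :id value are assumptions; authentication is omitted because this page does not describe it.

```python
import json
import urllib.request

BASE_URL = "https://mipo.example.com"  # hypothetical host
RULES_PATH = "/api/admin/alerting/alarm-rules"


def list_rules_request() -> urllib.request.Request:
    """GET: list all built-in alarm rules."""
    return urllib.request.Request(f"{BASE_URL}{RULES_PATH}", method="GET")


def patch_rule_request(rule_id: str, **changes) -> urllib.request.Request:
    """PATCH: toggle enabled or override severity.

    The body keys (e.g. enabled=False, severity="critical") are assumed
    field names, not confirmed by the documentation.
    """
    body = json.dumps(changes).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}{RULES_PATH}/{rule_id}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PATCH",
    )


def reset_rule_request(rule_id: str) -> urllib.request.Request:
    """POST: reset severity to the code-defined default."""
    return urllib.request.Request(
        f"{BASE_URL}{RULES_PATH}/{rule_id}/reset", method="POST"
    )


# Build (but do not send) a request that disables the scanner-down rule.
req = patch_rule_request("scanner-down", enabled=False)
# To send it against a real deployment: urllib.request.urlopen(req)
```

Keeping request construction separate from sending makes it possible to inspect the method, URL, and body before anything reaches a live instance.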

Related Pages

  • Alarms — Rules create alarms when matching events arrive
  • Events — Events are matched against rules to create alarms
  • Notification Policies — Policies can be scoped to specific alarm rule names