Metrics

Prometheus metrics catalog

Metrics

| Metric | Type | Description |
|---|---|---|
| vigil_tainted_nodes | Gauge | Nodes currently waiting for DaemonSet readiness |
| vigil_taint_removal_duration_seconds | Histogram | Time from node creation to taint removal |
| vigil_successful_removals_total | Counter | Taint removals after all DaemonSets Ready |
| vigil_timeout_removals_total | Counter | Taint removals due to timeout |
| vigil_expected_daemonsets | Gauge (by node) | Expected DaemonSets per node |
| vigil_ready_daemonsets | Gauge (by node) | Ready DaemonSet pods per node |
| vigil_reconcile_errors_total | Counter | Reconciliation errors |
| vigil_discovery_duration_seconds | Histogram | Time to evaluate scheduling rules |
| vigil_timeout_blocking_daemonset_total | Counter (by ds) | Which DaemonSet blocked at timeout |
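
The metrics above can be combined into a few derived series. The recording rules below are a sketch, not part of Vigil itself: the `record:` names are hypothetical, the `_bucket` suffix assumes the histograms are exposed in the standard Prometheus histogram format, and the subtraction assumes both per-node gauges carry identical label sets.

```yaml
# 95th percentile of time from node creation to taint removal (seconds)
- record: vigil:taint_removal_duration_seconds:p95   # hypothetical rule name
  expr: |
    histogram_quantile(
      0.95,
      sum by (le) (rate(vigil_taint_removal_duration_seconds_bucket[15m]))
    )

# Per-node gap between expected and ready DaemonSets
# (assumes both gauges share the same labels so the series match one-to-one)
- record: vigil:pending_daemonsets                   # hypothetical rule name
  expr: vigil_expected_daemonsets - vigil_ready_daemonsets

# DaemonSets that most often blocked taint removal at timeout in the last hour
- record: vigil:top_blocking_daemonsets:1h           # hypothetical rule name
  expr: topk(5, increase(vigil_timeout_blocking_daemonset_total[1h]))
```

The expressions can also be pasted directly into the Prometheus expression browser or a dashboard panel without the `record:` wrapper.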

Alerting

Recommended alert rules:

# Alert if more than 10% of taint removals are timeouts
- alert: VigilTimeoutRate
  expr: |
    rate(vigil_timeout_removals_total[15m])
    / (
      rate(vigil_successful_removals_total[15m])
      + rate(vigil_timeout_removals_total[15m])
    )
    > 0.1
  for: 5m

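# Alert if the controller is down or missing from the scrape targets.
# Filtering on == 1 makes the rule fire when the target reports up == 0,
# not only when the up series disappears entirely.
- alert: VigilControllerDown
  expr: absent(up{job="vigil-controller"} == 1)
  for: 5m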
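
Prometheus loads alert rules from rule files, where they sit under a named group. The sketch below shows that wrapper with one additional illustrative rule built on `vigil_reconcile_errors_total`. The group name, the alert name, the 15m/10m windows, the `severity` label, and the annotation text are all assumptions to adapt, not recommendations from Vigil.

```yaml
groups:
  - name: vigil                        # hypothetical group name
    rules:
      # Illustrative rule: fire if reconciliation errors keep occurring
      - alert: VigilReconcileErrors    # hypothetical alert name
        expr: sum(rate(vigil_reconcile_errors_total[15m])) > 0
        for: 10m                       # assumed window; tune to taste
        labels:
          severity: warning            # assumed routing label
        annotations:
          summary: "Vigil controller is reporting reconciliation errors"
      # The recommended rules above go here as further list items.
```

A file in this shape can be validated with `promtool check rules <file>` before it is referenced from `rule_files` in the Prometheus configuration.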