Metrics

Prometheus metrics exposed by the Veneer controller.

Veneer exposes Prometheus metrics on the metrics endpoint (default :8080/metrics, configurable via metricsBindAddress). All metrics use the veneer_ namespace prefix.
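As a quick check that the endpoint serves what this page describes, the following minimal Python sketch parses Prometheus text-format output and keeps only samples in the veneer_ namespace. The URL http://localhost:8080/metrics assumes the default bind address reached through a local port-forward. The helper names are hypothetical, and the parser handles only simple label values (no escaped quotes inside values).

```python
import re
from urllib.request import urlopen

# Assumption: default metricsBindAddress, reached via a local port-forward.
DEFAULT_URL = "http://localhost:8080/metrics"

# name, optional {labels}, value; enough for simple exposition lines.
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_veneer_samples(text):
    """Return (name, labels, value) tuples for metrics in the veneer_ namespace."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments
            continue
        match = _SAMPLE.match(line)
        if not match or not match.group(1).startswith("veneer_"):
            continue
        labels = dict(_LABEL.findall(match.group(2) or ""))
        samples.append((match.group(1), labels, float(match.group(3))))
    return samples


def fetch_veneer_samples(url=DEFAULT_URL):
    """Scrape the endpoint once and parse the response body."""
    with urlopen(url, timeout=5) as response:
        return parse_veneer_samples(response.read().decode("utf-8"))
```

Calling fetch_veneer_samples() against a running controller should return one tuple per exported series; parse_veneer_samples can also be run on a saved scrape for offline inspection.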

Veneer intentionally does not duplicate Lumina metrics (which are already in Prometheus). Instead, it focuses on what Veneer decided and what actions it took.

Metrics at a Glance

| Metric | Type | Description |
| --- | --- | --- |
| veneer_reconciliation_duration_seconds | Histogram | Duration of reconciliation cycles |
| veneer_reconciliation_total | Counter | Total reconciliation cycles |
| veneer_lumina_data_freshness_seconds | Gauge | Age of Lumina data |
| veneer_lumina_data_available | Gauge | Whether Lumina data is fresh |
| veneer_decision_total | Counter | Decisions made by the engine |
| veneer_reserved_instance_data_available | Gauge | Whether RI metrics are available |
| veneer_reserved_instance_count | Gauge | RI count by type and region |
| veneer_savings_plan_utilization_percent | Gauge | SP utilization percentage |
| veneer_savings_plan_remaining_capacity_dollars | Gauge | SP remaining capacity ($/hr) |
| veneer_overlay_operations_total | Counter | Total overlay operations |
| veneer_overlay_operation_errors_total | Counter | Total overlay operation errors |
| veneer_overlay_count | Gauge | Current overlay count |
| veneer_prometheus_query_duration_seconds | Histogram | Prometheus query duration |
| veneer_prometheus_query_errors_total | Counter | Prometheus query errors |
| veneer_prometheus_query_result_count | Gauge | Prometheus query result count |
| veneer_config_overlays_disabled | Gauge | Whether overlays are disabled |
| veneer_config_utilization_threshold_percent | Gauge | Configured utilization threshold |
| veneer_info | Gauge | Controller version info |

Reconciliation Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| veneer_reconciliation_duration_seconds | Histogram | (none) | Duration of metrics reconciliation cycles. Buckets: 0.1s to ~51s (exponential). |
| veneer_reconciliation_total | Counter | result | Total number of reconciliation cycles. The result label is either success or error. |
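The stated bucket range is consistent with ten buckets that start at 0.1 s and double at each step (0.1 * 2^9 = 51.2). That start, factor, and count are inferred from the description above, not confirmed from the Veneer source. The Python sketch below reproduces the arithmetic; the function mirrors the behavior of ExponentialBuckets in the Prometheus Go client.

```python
def exponential_buckets(start, factor, count):
    """Upper bounds start, start*factor, ..., start*factor**(count-1)."""
    if count < 1 or start <= 0 or factor <= 1:
        raise ValueError("need count >= 1, start > 0, factor > 1")
    return [start * factor**i for i in range(count)]


# Assumed parameters: inferred from "0.1s to ~51s", not taken from Veneer's code.
buckets = exponential_buckets(0.1, 2, 10)
print([round(b, 1) for b in buckets])
# -> [0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 6.4, 12.8, 25.6, 51.2]
```

If these parameters hold, any reconciliation slower than about 51 seconds lands in the +Inf bucket, which limits the resolution of high quantiles computed from this histogram.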

Data Source Health Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| veneer_lumina_data_freshness_seconds | Gauge | (none) | Age of Lumina data in seconds. |
| veneer_lumina_data_available | Gauge | (none) | 1 if Lumina data is available and fresh, 0 if stale or unavailable. |

Decision Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| veneer_decision_total | Counter | capacity_type, should_exist, reason | Total decisions made by the decision engine. |

Label values for veneer_decision_total:

| Label | Values | Description |
| --- | --- | --- |
| capacity_type | compute_savings_plan, ec2_instance_savings_plan, reserved_instance, preference | Type of AWS pre-paid capacity |
| should_exist | true, false | Whether an overlay should exist based on the decision |
| reason | capacity_available, utilization_above_threshold, no_capacity, ri_available, ri_not_found, unknown | Reason for the decision |

Reserved Instance Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| veneer_reserved_instance_data_available | Gauge | (none) | 1 if Lumina is exposing RI metrics, 0 if not. |
| veneer_reserved_instance_count | Gauge | instance_type, region | Number of Reserved Instances detected by instance type and region. |

Savings Plan Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| veneer_savings_plan_utilization_percent | Gauge | type, instance_family, region | Savings Plan utilization percentage. |
| veneer_savings_plan_remaining_capacity_dollars | Gauge | type, instance_family, region | Savings Plan remaining capacity in dollars per hour. |

Label values:

| Label | Values | Description |
| --- | --- | --- |
| type | SP type identifier | Type of Savings Plan |
| instance_family | Family name or all | Instance family (or all for Compute SPs) |
| region | AWS region or global | Region scope |

NodeOverlay Lifecycle Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| veneer_overlay_operations_total | Counter | operation, capacity_type | Total NodeOverlay operations. |
| veneer_overlay_operation_errors_total | Counter | operation, error_type | Total NodeOverlay operation errors. |
| veneer_overlay_count | Gauge | capacity_type | Current number of NodeOverlays managed by Veneer. |

Label values:

| Label | Values | Description |
| --- | --- | --- |
| operation | create, update, delete | Type of overlay operation |
| capacity_type | compute_savings_plan, ec2_instance_savings_plan, reserved_instance, preference | Capacity type the overlay targets |
| error_type | validation, api, not_found | Type of error encountered |

Prometheus Query Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| veneer_prometheus_query_duration_seconds | Histogram | query_type | Duration of Prometheus queries to Lumina. Uses default Prometheus buckets. |
| veneer_prometheus_query_errors_total | Counter | query_type | Total Prometheus query errors. |
| veneer_prometheus_query_result_count | Gauge | query_type | Number of results returned by the last Prometheus query. |

Label values for query_type:

| Value | Description |
| --- | --- |
| sp_utilization | Savings Plan utilization query |
| sp_capacity | Savings Plan remaining capacity query |
| ri | Reserved Instance count query |
| data_freshness | Lumina data freshness check |

Configuration Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| veneer_config_overlays_disabled | Gauge | (none) | 1 if overlay creation is disabled (dry-run mode), 0 if enabled. |
| veneer_config_utilization_threshold_percent | Gauge | (none) | Configured utilization threshold for overlay deletion. |

Info Metric

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| veneer_info | Gauge | version, disabled_mode | Controller information. Always set to 1. |

Example PromQL Queries

Reconciliation Health

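# Reconciliation error ratio (last 5 minutes)
# sum() drops the result label so that both sides of the division match;
# without it, the error series is divided only by itself and the result is always 1.
sum(rate(veneer_reconciliation_total{result="error"}[5m]))
/ sum(rate(veneer_reconciliation_total[5m]))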

# Average reconciliation duration
rate(veneer_reconciliation_duration_seconds_sum[5m])
/ rate(veneer_reconciliation_duration_seconds_count[5m])
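
To evaluate queries like these from a script rather than the Prometheus UI, the instant-query endpoint of the Prometheus HTTP API (/api/v1/query) can be used. The following is a minimal Python sketch; the server address http://localhost:9090 and the helper names are assumptions for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: a Prometheus server that scrapes Veneer, reachable locally.
PROMETHEUS_URL = "http://localhost:9090"

ERROR_RATIO = (
    'sum(rate(veneer_reconciliation_total{result="error"}[5m]))'
    " / sum(rate(veneer_reconciliation_total[5m]))"
)


def instant_query_url(expr, base=PROMETHEUS_URL):
    """Build the URL for an instant query against /api/v1/query."""
    return f"{base}/api/v1/query?{urlencode({'query': expr})}"


def vector_values(payload):
    """Extract (labels, value) pairs from an instant-vector response."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected vector, got {data['resultType']}")
    # Each sample's value is [unix_timestamp, "string_value"].
    return [(s["metric"], float(s["value"][1])) for s in data["result"]]


def run_query(expr, base=PROMETHEUS_URL):
    """Execute the query and return parsed samples."""
    with urlopen(instant_query_url(expr, base), timeout=10) as response:
        return vector_values(json.load(response))
```

For example, run_query(ERROR_RATIO) returns a list with one (labels, value) pair when reconciliations have occurred in the window, and an empty list when there is no data.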

Data Source Health

# Alert if Lumina data is unavailable
veneer_lumina_data_available == 0

# Data freshness in minutes
veneer_lumina_data_freshness_seconds / 60

Overlay Activity

# Overlay creation rate by capacity type
sum by (capacity_type) (rate(veneer_overlay_operations_total{operation="create"}[1h]))

# Current overlay count
veneer_overlay_count

# Overlay operation error rate
rate(veneer_overlay_operation_errors_total[5m])

Savings Plans Monitoring

# SP utilization across all types
veneer_savings_plan_utilization_percent

# Remaining SP capacity ($/hour)
veneer_savings_plan_remaining_capacity_dollars

Grafana Dashboard

You can build a Grafana dashboard using these metrics. Key panels to include:

  1. Reconciliation Status – Success/error rate over time
  2. Lumina Data Freshness – Gauge showing data age
  3. Overlay Count – Breakdown by capacity type
  4. Decision Activity – Decisions over time, split by should_exist (overlay should exist vs should not)
  5. Prometheus Query Performance – Query latency and error rates
  6. SP Utilization – Per-type utilization percentages
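
One possible mapping from these panels to queries is sketched below in Python. The expressions use only metrics and labels documented on this page, but the specific aggregations, the 5-minute windows, and the 0.95 quantile are illustrative choices. The structure is a plain dictionary for reference, not Grafana's dashboard JSON model.

```python
# Hypothetical starting point: panel title -> PromQL expression.
PANEL_QUERIES = {
    "Reconciliation Status":
        "sum by (result) (rate(veneer_reconciliation_total[5m]))",
    "Lumina Data Freshness":
        "veneer_lumina_data_freshness_seconds",
    "Overlay Count":
        "sum by (capacity_type) (veneer_overlay_count)",
    "Decision Activity":
        "sum by (should_exist) (rate(veneer_decision_total[5m]))",
    "Prometheus Query Performance":
        "histogram_quantile(0.95, sum by (le, query_type) "
        "(rate(veneer_prometheus_query_duration_seconds_bucket[5m])))",
    "SP Utilization":
        "veneer_savings_plan_utilization_percent",
}

# Print the mapping so each expression can be pasted into a panel editor.
for title, expr in PANEL_QUERIES.items():
    print(f"{title}: {expr}")
```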