Troubleshooting
Common Issues
No NodeOverlays Created
Symptom: Veneer is running but no NodeOverlays appear.
Check data availability:
# Check if Lumina data is available
kubectl port-forward -n veneer-system svc/veneer-metrics 8080:8080
curl -s http://localhost:8080/metrics | grep veneer_lumina_data_available
# Expected: veneer_lumina_data_available 1
If veneer_lumina_data_available is 0:
- Verify Lumina is running:
kubectl get pods -n lumina-system
- Verify Prometheus is scraping Lumina: check the Prometheus targets UI
- Verify the Prometheus URL is correct in Veneer’s configuration
Check utilization threshold:
curl -s http://localhost:8080/metrics | grep veneer_savings_plan_utilization
If utilization is at or above the configured threshold (default 95%), overlays will not be created because the pre-paid capacity is fully consumed.
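The threshold decision can be sketched as follows. The metric line and threshold below are hypothetical stand-ins for what the curl command above returns and for Veneer's configured value:

```shell
# Hypothetical values: in practice, parse the metric from the curl output
# above and take the threshold from Veneer's configuration (default 0.95).
metric_line='veneer_savings_plan_utilization 0.97'
threshold=0.95

value=$(echo "$metric_line" | awk '{print $2}')

# awk handles the floating-point comparison that POSIX shell cannot
if awk -v v="$value" -v t="$threshold" 'BEGIN { exit !(v >= t) }'; then
  echo "utilization ${value} >= ${threshold}: no overlay expected"
else
  echo "utilization ${value} < ${threshold}: overlay should be created"
fi
```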
Check disabled mode:
curl -s http://localhost:8080/metrics | grep veneer_config_overlays_disabled
# Expected: veneer_config_overlays_disabled 0
If 1, Veneer is in disabled mode. Overlays are created but with an impossible requirement so they never match. See the NodeOverlay CRD reference for details on disabled mode overlays.
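As a rough illustration of disabled mode, such an overlay might carry a requirement no real node can ever satisfy. The field names and values below are assumptions for illustration only; consult the NodeOverlay CRD reference for the actual schema:

```yaml
# Sketch only - field names and values are assumptions, not the real schema.
apiVersion: karpenter.sh/v1alpha1
kind: NodeOverlay
metadata:
  name: example-disabled-overlay
spec:
  requirements:
    # An instance type that does not exist, so the overlay never matches
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["disabled-no-match"]
```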
x86 Selected Despite ARM64 Preference
Symptom: You configured a preference for ARM64 but Karpenter still provisions x86 instances.
This can happen for several reasons:
Bin-packing filtered out ARM64 – If the aggregate CPU requirement falls in the 97-128 vCPU range, only x86 32xlarge instances are available (Graviton has no 32xlarge). See Bin-Packing and NodeOverlay.
ARM64 spot capacity exhausted – Even with lower Priority values, AWS will select x86 if ARM64 spot pools lack capacity. Check CloudTrail for InsufficientInstanceCapacity errors.
NodeOverlay not applied – Verify the overlay exists and matches the instance types:
kubectl get nodeoverlays -l veneer.io/type=preference
See the NodeOverlay CRD reference for label and requirement details.
See the Bin-Packing page for diagnostic steps and solutions.
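The range check behind the first reason can be sketched as follows; the 97-128 vCPU boundary is taken from the Bin-Packing page, and the helper name is illustrative:

```shell
# Illustrative sketch: per the Bin-Packing page, aggregate requirements in
# the 97-128 vCPU range can only be satisfied by x86 32xlarge instances.
arm64_filtered() {
  # success (exit 0) when ARM64 is filtered out before preferences apply
  [ "$1" -ge 97 ] && [ "$1" -le 128 ]
}

if arm64_filtered 112; then
  echo "112 vCPU: only x86 32xlarge fits, ARM64 preference cannot apply"
fi
```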
“Failed to Query Data Freshness” Errors
Symptom: Log errors about Prometheus connectivity.
# Check Prometheus connectivity
curl http://<prometheus-url>:9090/-/healthy
# Check Prometheus targets
curl -s http://<prometheus-url>:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
# Verify Lumina metrics exist
curl -s 'http://<prometheus-url>:9090/api/v1/query?query=savings_plan_remaining_capacity' | jq '.data.result | length'
If running locally with port-forward:
# Verify port-forward is active
lsof -i:9090
# Re-establish if needed
kubectl port-forward -n lumina-system svc/lumina-prometheus 9090:9090
“Context Deadline Exceeded” Errors
Symptom: Timeout errors when querying Prometheus.
- Check that Lumina is running and healthy:
kubectl get pods -n lumina-system
- Verify Prometheus has scraped recent metrics
- Check Prometheus query performance – some queries may be slow on large datasets
Port Already in Use
Symptom: Veneer fails to start with a bind error.
# Find and kill the process using the port
lsof -ti:8081 | xargs kill -9
# Or change the port in config (see Configuration Reference)
healthProbeBindAddress: ":8082"
Port values are configurable in the configuration file (see the Configuration Reference) or via Helm chart values.
Overlays Created But Karpenter Ignores Them
Symptom: NodeOverlays exist but provisioning behavior doesn’t change.
Verify Karpenter supports NodeOverlay:
kubectl get crd nodeoverlays.karpenter.sh
Check overlay requirements match instance types: The requirements in the overlay must match instances that Karpenter is considering. Verify with:
kubectl get nodeoverlay <name> -o yaml
Check allocation strategy in CloudTrail: Look for capacity-optimized-prioritized (spot) or prioritized (on-demand) in CreateFleet requests. If you see price-capacity-optimized or lowest-price, NodeOverlay is not being applied.
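A quick way to classify a CreateFleet event is to match on the strategy string. The JSON excerpt below is a hypothetical stand-in for a real CloudTrail record:

```shell
# Hypothetical CreateFleet excerpt; in practice this comes from CloudTrail.
request='{"SpotOptions":{"AllocationStrategy":"capacity-optimized-prioritized"}}'

case "$request" in
  *capacity-optimized-prioritized*|*'"prioritized"'*)
    result="overlay priorities honored" ;;
  *price-capacity-optimized*|*lowest-price*)
    result="NodeOverlay not applied" ;;
  *)
    result="unknown strategy" ;;
esac
echo "$result"
```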
“No Matching Capacity” Warnings
Symptom: Veneer can’t match Savings Plans utilization with capacity data.
- Check that Lumina is exposing both utilization and capacity metrics
- Verify ARNs match between metrics
- Enable debug logging to see actual query results:
logLevel: "debug"
Data Freshness
Veneer checks Lumina data freshness before each reconciliation. If data is stale (older than expected), Veneer skips the reconciliation cycle to avoid making decisions based on outdated information.
Monitor freshness:
curl -s http://localhost:8080/metrics | grep veneer_lumina_data_freshness_seconds
The veneer_lumina_data_available metric reports whether data is fresh enough to act on. See the Metrics reference for the full list of available metrics.
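The skip decision can be sketched as follows. Both values are hypothetical: the first stands in for veneer_lumina_data_freshness_seconds, the second for Veneer's configured staleness limit:

```shell
# Hypothetical values: freshness comes from the metric above, and the
# maximum acceptable age from Veneer's configuration.
freshness_seconds=420
max_age_seconds=300

if [ "$freshness_seconds" -gt "$max_age_seconds" ]; then
  echo "Lumina data is stale: skipping reconciliation cycle"
else
  echo "data is fresh: reconciling"
fi
```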
Common causes of stale data:
- Lumina controller is not running or is unhealthy
- Prometheus is not scraping Lumina
- Network connectivity issues between Veneer and Prometheus
Debugging with Logs
Enable Debug Logging
# config.yaml
logLevel: "debug"
Or via environment variable:
export VENEER_LOG_LEVEL=debug
Key Log Messages
| Log Message | Meaning |
|---|---|
| Starting metrics reconciler | Controller started successfully |
| Reconciliation complete | A reconciliation cycle finished |
| Lumina data is stale | Data freshness check failed, skipping cycle |
| Creating NodeOverlay | An overlay is being created |
| Deleting NodeOverlay | An overlay is being removed |
| SP utilization at/above threshold | SP is fully utilized, no overlay needed |
| SP utilization below threshold | SP has remaining capacity, overlay created |
Useful kubectl Commands
# View Veneer logs
kubectl logs -n veneer-system -l app.kubernetes.io/name=veneer --tail=100
# Follow logs in real-time
kubectl logs -n veneer-system -l app.kubernetes.io/name=veneer -f
# List all Veneer-managed overlays
kubectl get nodeoverlays -l app.kubernetes.io/managed-by=veneer
# Describe a specific overlay
kubectl describe nodeoverlay cost-aware-ec2-sp-m5-us-west-2
# Check Veneer metrics
kubectl port-forward -n veneer-system svc/veneer-metrics 8080:8080
curl -s http://localhost:8080/metrics | grep veneer_