Troubleshooting
Controller Not Starting
Symptoms
- Pod is in CrashLoopBackOff
- Logs show configuration errors
Resolution
Check logs for configuration validation errors:
kubectl logs -n lumina-system deployment/lumina-controller-manager
Verify the ConfigMap has valid YAML:
kubectl get configmap -n lumina-system lumina-config -o yaml
Common config issues (see the Configuration Reference for valid formats):
- Account IDs must be exactly 12 digits
- IAM role ARNs must match the format arn:aws:iam::<account-id>:role/<role-name>
- ARN account ID must match the configured accountId
- Duration values must be valid Go durations (e.g., "5m", "1h", "24h")
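The rules in the list above can be checked before a ConfigMap is applied. The following is a minimal sketch, not the controller's own validation: the function names are hypothetical, and the duration pattern only approximates the syntax accepted by Go's time.ParseDuration.

```python
import re

# Exactly 12 digits, per the account ID rule above.
ACCOUNT_ID_RE = re.compile(r"[0-9]{12}")
# arn:aws:iam::<account-id>:role/<role-name>
ROLE_ARN_RE = re.compile(r"arn:aws:iam::([0-9]{12}):role/(.+)")
# Approximation of Go duration syntax: one or more <number><unit> pairs,
# such as "5m", "1h30m" or "1.5h". A bare number without a unit is rejected.
GO_DURATION_RE = re.compile(
    r"[+-]?(?:(?:[0-9]+(?:\.[0-9]*)?|\.[0-9]+)(?:ns|us|µs|μs|ms|s|m|h))+"
)


def check_account(account_id: str, role_arn: str) -> list[str]:
    """Return the problems found; an empty list means none were detected."""
    problems = []
    if not ACCOUNT_ID_RE.fullmatch(account_id):
        problems.append(f"account ID {account_id!r} is not exactly 12 digits")
    arn = ROLE_ARN_RE.fullmatch(role_arn)
    if arn is None:
        problems.append(f"role ARN {role_arn!r} does not match the expected format")
    elif arn.group(1) != account_id:
        problems.append("ARN account ID does not match the configured accountId")
    return problems


def is_go_duration(value: str) -> bool:
    """True if value looks like a Go duration string (Go also accepts a bare "0")."""
    return value == "0" or GO_DURATION_RE.fullmatch(value) is not None
```

For example, `check_account("123456789012", "arn:aws:iam::123456789012:role/lumina-readonly")` returns an empty list, while `is_go_duration("5 minutes")` returns `False`.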
AWS Account Access Failures
Symptoms
- lumina_account_validation_status == 0 in Prometheus
- Logs show AssumeRole errors
Resolution
Check the account validation metric:
lumina_account_validation_status
Verify the controller's service account has permission to assume the target role:
# If using IRSA, check the annotation
kubectl get sa -n lumina-system lumina-controller -o yaml
Verify the target IAM role trust policy allows the controller role to assume it. See the Installation Guide for the required trust relationship.
Test AssumeRole manually:
aws sts assume-role --role-arn arn:aws:iam::123456789012:role/lumina-readonly --role-session-name test
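If the manual AssumeRole call fails, it can help to inspect the target role's trust policy offline. The sketch below assumes the trust policy has been saved as JSON and loaded into a dict (for example from the output of `aws iam get-role`). The function name is hypothetical and the check is deliberately simplified: it ignores Condition blocks and NotPrincipal, so a `True` result does not guarantee that AssumeRole succeeds.

```python
def _as_list(value):
    """IAM policy fields may hold a single value or a list; normalize to a list."""
    return value if isinstance(value, list) else [value]


def trust_policy_allows(policy: dict, controller_role_arn: str) -> bool:
    """Return True if an Allow statement lets controller_role_arn call sts:AssumeRole.

    Simplified on purpose: Condition blocks and NotPrincipal are not evaluated.
    """
    # A principal naming the whole account (root ARN or bare account ID)
    # also covers roles that live in that account.
    account_id = controller_role_arn.split(":")[4]
    accepted = {controller_role_arn, f"arn:aws:iam::{account_id}:root", account_id}

    for stmt in _as_list(policy.get("Statement", [])):
        if stmt.get("Effect") != "Allow":
            continue
        actions = _as_list(stmt.get("Action", []))
        if not any(a in ("sts:AssumeRole", "sts:*", "*") for a in actions):
            continue
        principal = stmt.get("Principal", {})
        if not isinstance(principal, dict):
            continue  # e.g. Principal: "*", which this sketch does not handle
        if any(p in accepted for p in _as_list(principal.get("AWS", []))):
            return True
    return False
```

A `False` result for the controller role's ARN suggests the trust relationship described in the Installation Guide is missing or names a different principal.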
No Cost Metrics
Symptoms
- ec2_instance_hourly_cost metrics are missing or all zero
- Controller is running but no cost data
Resolution
Check data freshness metrics:
lumina_data_freshness_seconds
If values are very high, data collection may be failing.
Check data collection status:
lumina_data_last_success == 0
Verify EC2 instances are in cache using the debug endpoints:
curl http://localhost:8080/debug/cache/stats | jq
Verify on-demand pricing is loaded:
curl http://localhost:8080/debug/cache/pricing/ondemand | jq '. | keys'
If pricing is missing, check that the controller can access the AWS Pricing API (requires the pricing:GetProducts permission). See the IAM setup for the required policy.
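The two symptoms, missing metrics and all-zero metrics, usually point at different causes, so it can be useful to tell them apart. The sketch below classifies a metric in a saved copy of Prometheus text-format output. The function name and the `instance_id` label in the usage note are hypothetical, and where the text is scraped from is left to the reader.

```python
import re


def cost_metric_state(exposition_text: str, metric: str = "ec2_instance_hourly_cost") -> str:
    """Classify a metric in Prometheus text-format output.

    Returns "missing" if no sample exists, "all_zero" if every sample is 0,
    and "ok" otherwise.
    """
    # Matches "<metric> <value>" and "<metric>{labels} <value>".
    # HELP/TYPE comment lines start with '#' and therefore never match.
    pattern = re.compile(r"^" + re.escape(metric) + r"(?:\{[^}]*\})?\s+(\S+)")
    values = []
    for line in exposition_text.splitlines():
        m = pattern.match(line.strip())
        if m:
            values.append(float(m.group(1)))
    if not values:
        return "missing"
    if all(v == 0 for v in values):
        return "all_zero"
    return "ok"
```

For a sample such as `ec2_instance_hourly_cost{instance_id="i-1"} 0`, the function returns `"all_zero"`. A `"missing"` result suggests checking instance discovery first, while `"all_zero"` suggests checking whether pricing is loaded.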
Savings Plan Not Applying to Instances
Symptoms
- Savings Plans exist but instances show on-demand pricing
- SP utilization is 0%
Resolution
Verify the SP is discovered using the debug endpoints:
curl http://localhost:8080/debug/cache/risp | jq '.savings_plans'
Check if SP rates are cached:
curl http://localhost:8080/debug/cache/pricing/sp | jq
If rates are missing, wait 1-2 minutes for the SP Rates Reconciler to fetch them. Check data freshness:
lumina_data_freshness_seconds{data_type="sp_rates"}
Verify the SP type matches your instances (see Cost Calculation for details):
- EC2 Instance SPs only cover the specified instance family and region
- Compute SPs cover any instance family and region
- Neither applies to spot instances
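The three coverage rules above can be written as a small predicate, which helps when reasoning about why a given instance still shows on-demand pricing. This is an illustrative sketch only: the class names, field names, and the `"ec2_instance"` / `"compute"` type labels are hypothetical and do not reflect Lumina's internal data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SavingsPlan:
    sp_type: str                  # "ec2_instance" or "compute" (hypothetical labels)
    family: Optional[str] = None  # e.g. "m5"; only meaningful for EC2 Instance SPs
    region: Optional[str] = None  # e.g. "us-west-2"; only meaningful for EC2 Instance SPs


@dataclass
class Instance:
    instance_type: str  # e.g. "m5.xlarge"
    region: str
    is_spot: bool = False


def sp_covers(sp: SavingsPlan, inst: Instance) -> bool:
    """Apply the coverage rules listed above to one SP and one instance."""
    if inst.is_spot:
        return False  # neither SP type applies to spot instances
    if sp.sp_type == "compute":
        return True   # any instance family, any region
    if sp.sp_type == "ec2_instance":
        family = inst.instance_type.split(".")[0]  # "m5.xlarge" -> "m5"
        return family == sp.family and inst.region == sp.region
    return False
```

An EC2 Instance SP for `m5` in `us-west-2` covers an `m5.xlarge` in that region, but not a `c5.large` or an `m5.xlarge` in `us-east-1`.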
Metric Duplication in Multi-Cluster Setup
Symptoms
- Cost metrics are doubled/tripled in Prometheus
- Aggregate cost queries return inflated values
Resolution
Set metrics.disableInstanceMetrics: true on all clusters except the management cluster (see Configuration Reference):
config:
  metrics:
    disableInstanceMetrics: true
This disables ec2_instance, ec2_instance_count, and ec2_instance_hourly_cost on worker clusters while keeping Savings Plans and Reserved Instance metrics enabled.
Stale Data
Symptoms
- lumina_data_freshness_seconds is very high for some data types
- Costs appear outdated
Resolution
Check which data type is stale using the data freshness metric:
lumina_data_freshness_seconds > 600
Expected freshness by data type:
| Data Type | Expected Freshness | Alert Threshold |
|---|---|---|
| ec2_instances | ~5 minutes | >10 minutes |
| reserved_instances | ~1 hour | >2 hours |
| savings_plans | ~1 hour | >2 hours |
| pricing | ~24 hours | >48 hours |
| sp_rates | ~2 minutes | >10 minutes |
| spot_pricing | ~1 hour | >2 hours |
If a specific data type is consistently stale, check:
- AWS API rate limiting (check logs for throttling errors)
- Network connectivity to AWS endpoints
- IAM permissions for the specific API (see Installation Guide)
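A single cutoff of 600 seconds also flags data types that refresh hourly or daily, so a per-type comparison is more precise. The sketch below encodes the alert thresholds from the table; the function name is hypothetical, and the input is assumed to be a mapping from the `data_type` label to the current value of lumina_data_freshness_seconds.

```python
# Alert thresholds in seconds, taken from the "Alert Threshold" column above.
ALERT_THRESHOLDS = {
    "ec2_instances": 10 * 60,
    "reserved_instances": 2 * 3600,
    "savings_plans": 2 * 3600,
    "pricing": 48 * 3600,
    "sp_rates": 10 * 60,
    "spot_pricing": 2 * 3600,
}


def stale_data_types(freshness_seconds: dict[str, float]) -> list[str]:
    """Return, sorted by name, the data types whose age exceeds their threshold.

    Data types that are not in the table are ignored.
    """
    return sorted(
        data_type
        for data_type, age in freshness_seconds.items()
        if data_type in ALERT_THRESHOLDS and age > ALERT_THRESHOLDS[data_type]
    )
```

With `{"ec2_instances": 700, "pricing": 90000, "sp_rates": 120}` only `ec2_instances` is reported, because 90000 seconds (25 hours) is still inside the 48-hour threshold for pricing.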
Pricing Accuracy Issues
Symptoms
- Some instances show pricing_accuracy="estimated" instead of "accurate"
Resolution
This is normal during cache warming (first few minutes after startup). The SP Rates Reconciler needs to fetch rates for all instance type/region combinations.
Monitor cache effectiveness using the cost metrics:
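# Percentage of total hourly cost that comes from instances with accurate pricing
# (use count() instead of sum() to get the percentage of instances)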
sum(ec2_instance_hourly_cost{pricing_accuracy="accurate"}) /
sum(ec2_instance_hourly_cost) * 100
If estimated pricing persists:
- Check SP rates cache via the debug endpoints:
  curl http://localhost:8080/debug/cache/pricing/sp | jq
- Look for specific missing rates:
  curl "http://localhost:8080/debug/cache/pricing/sp/lookup?instance_type=<type>&region=<region>&sp=<arn>" | jq
- Some rate combinations may not exist (e.g., Windows rate for a Linux-only SP), which is indicated by sentinel values.
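Savings Plan ARNs contain `:` and `/`, so percent-encoding the query string is a safe default when scripting the lookup. The sketch below builds the URL with the standard library. The helper name is hypothetical, the query parameter names are taken from the curl example, and whether the endpoint actually requires encoded values is an assumption.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8080"  # same host and port as the curl examples


def sp_rate_lookup_url(instance_type: str, region: str, sp_arn: str) -> str:
    """Build the SP rate lookup URL with every query value percent-encoded."""
    query = urlencode({"instance_type": instance_type, "region": region, "sp": sp_arn})
    return f"{BASE_URL}/debug/cache/pricing/sp/lookup?{query}"
```

The resulting string can be passed to curl in double quotes so that the shell does not interpret the `&` separators.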
High Memory Usage
Symptoms
- Controller OOMKilled
- Memory usage growing over time
Resolution
Reduce pricing data by limiting operating systems (see Pricing Configuration):
config:
  pricing:
    operatingSystems:
      - "Linux"  # Only load Linux pricing
Reduce the number of configured regions if not all are needed.
Increase memory limits:
resources:
  limits:
    memory: 1Gi