Troubleshooting

Common issues and debugging techniques for the Lumina controller

Controller Not Starting

Symptoms

- Pod is in CrashLoopBackOff
- Logs show configuration errors

Resolution

Check logs for configuration validation errors:

```
kubectl logs -n lumina-system deployment/lumina-controller-manager
```

Verify the ConfigMap has valid YAML:

```
kubectl get configmap -n lumina-system lumina-config -o yaml
```

Common config issues (see the Configuration Reference for valid formats):
- Account IDs must be exactly 12 digits
- IAM role ARNs must match the format arn:aws:iam::<account-id>:role/<role-name>
- ARN account ID must match the configured accountId
- Duration values must be valid Go durations (e.g., "5m", "1h", "24h")
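
The rules above can be checked before a ConfigMap is applied. The following is a minimal Python sketch, not Lumina's actual validator: the function names are hypothetical, and the duration pattern is a simplified approximation of Go's `time.ParseDuration` syntax that does not cover every form Go accepts (for example, a leading decimal point).

```python
import re

# Patterns taken from the rules listed above.
ACCOUNT_ID_RE = re.compile(r"[0-9]{12}")
ROLE_ARN_RE = re.compile(r"arn:aws:iam::([0-9]{12}):role/(.+)")
# Assumption: simplified approximation of Go duration syntax (e.g. "5m", "1h30m").
GO_DURATION_RE = re.compile(r"[+-]?(?:[0-9]+(?:\.[0-9]+)?(?:ns|us|µs|ms|s|m|h))+")


def check_account(account_id: str, role_arn: str) -> list[str]:
    """Return the problems found; an empty list means the entry looks valid."""
    problems = []
    if ACCOUNT_ID_RE.fullmatch(account_id) is None:
        problems.append("accountId must be exactly 12 digits")
    arn = ROLE_ARN_RE.fullmatch(role_arn)
    if arn is None:
        problems.append("role ARN must match arn:aws:iam::<account-id>:role/<role-name>")
    elif arn.group(1) != account_id:
        problems.append("ARN account ID does not match accountId")
    return problems


def is_go_duration(value: str) -> bool:
    """Approximate check that a string would parse as a Go duration."""
    return value == "0" or GO_DURATION_RE.fullmatch(value) is not None
```

A value such as "1d" is a common mistake: Go durations have no day unit, so `is_go_duration("1d")` is false and "24h" should be used instead.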

AWS Account Access Failures

Symptoms

- lumina_account_validation_status == 0 in Prometheus
- Logs show AssumeRole errors

Resolution

Check the account validation metric:
```
lumina_account_validation_status
```

Verify the controller’s service account has permission to assume the target role:

```
# If using IRSA, check the eks.amazonaws.com/role-arn annotation
kubectl get sa -n lumina-system lumina-controller -o yaml
```

Verify the target IAM role trust policy allows the controller role to assume it. See the Installation Guide for the required trust relationship.

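Test AssumeRole manually. Run the command with the same credentials the controller uses, and replace the example ARN with your target role; a call that succeeds under different credentials does not show that the controller can assume the role: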

```
aws sts assume-role --role-arn arn:aws:iam::123456789012:role/lumina-readonly --role-session-name test
```

No Cost Metrics

Symptoms

- ec2_instance_hourly_cost metrics are missing or all zero
- Controller is running but reports no cost data

Resolution

Check data freshness metrics:
```
lumina_data_freshness_seconds
```
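If values are far above the expected freshness for a data type (see the table under Stale Data), data collection may be failing.
Check data collection status. Any series returned by this query identifies a data type whose collection is not succeeding: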
```
lumina_data_last_success == 0
```
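Verify EC2 instances are in cache using the debug endpoints. The curl commands in this guide assume that port 8080 of the controller is reachable on localhost (for example, through kubectl port-forward):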
```
curl http://localhost:8080/debug/cache/stats | jq
```

Verify on-demand pricing is loaded:

```
curl http://localhost:8080/debug/cache/pricing/ondemand | jq 'keys'
```

If pricing is missing, check that the controller can access the AWS Pricing API (requires pricing:GetProducts permission). See the IAM setup for the required policy.

Savings Plan Not Applying to Instances

Symptoms

- Savings Plans exist but instances show on-demand pricing
- SP utilization is 0%

Resolution

Verify the SP is discovered using the debug endpoints:

```
curl http://localhost:8080/debug/cache/risp | jq '.savings_plans'
```

Check if SP rates are cached:

```
curl http://localhost:8080/debug/cache/pricing/sp | jq
```

If rates are missing, wait 1-2 minutes for the SP Rates Reconciler to fetch them. Check data freshness:
```
lumina_data_freshness_seconds{data_type="sp_rates"}
```
Verify the SP type matches your instances (see Cost Calculation for details):
- EC2 Instance SPs only cover the specified instance family and region
- Compute SPs cover any instance family and region
- Neither applies to spot instances
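
The three coverage rules above can be expressed as a small predicate. This is a simplified Python sketch under stated assumptions: the `Instance` and `SavingsPlan` classes, their field names, and the type strings are hypothetical, and the function only encodes the rules listed here, not Lumina's full cost calculation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instance:
    instance_type: str  # e.g. "m5.xlarge"
    region: str         # e.g. "us-west-2"
    lifecycle: str      # "on-demand" or "spot"


@dataclass
class SavingsPlan:
    sp_type: str                  # "ec2_instance" or "compute" (hypothetical labels)
    family: Optional[str] = None  # set only for EC2 Instance SPs, e.g. "m5"
    region: Optional[str] = None  # set only for EC2 Instance SPs


def sp_can_cover(sp: SavingsPlan, inst: Instance) -> bool:
    """Return True if the Savings Plan type is eligible to cover the instance."""
    if inst.lifecycle == "spot":
        return False  # neither SP type applies to spot instances
    if sp.sp_type == "compute":
        return True   # any instance family and region
    if sp.sp_type == "ec2_instance":
        family = inst.instance_type.split(".", 1)[0]
        return family == sp.family and inst.region == sp.region
    return False
```

For example, an EC2 Instance SP for the m5 family in us-west-2 is eligible for an on-demand m5.xlarge in us-west-2, but not for a c5.large in the same region or for any spot instance.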

Metric Duplication in Multi-Cluster Setup

Symptoms

- Cost metrics are doubled or tripled in Prometheus
- Aggregate cost queries return inflated values

Resolution

Set metrics.disableInstanceMetrics: true on all clusters except the management cluster (see Configuration Reference):

```
config:
  metrics:
    disableInstanceMetrics: true
```

This disables ec2_instance, ec2_instance_count, and ec2_instance_hourly_cost on worker clusters while keeping Savings Plans and Reserved Instance metrics enabled.

Stale Data

Symptoms

- lumina_data_freshness_seconds is well above the expected value for some data types
- Costs appear outdated

Resolution

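Check which data type is stale using the data freshness metric. The 600-second cutoff in this query suits ec2_instances and sp_rates; data types that refresh hourly or daily normally exceed it, so compare those against the thresholds in the table: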

```
lumina_data_freshness_seconds > 600
```

Expected freshness by data type:

| Data Type | Expected Freshness | Alert Threshold |
|---|---|---|
| `ec2_instances` | ~5 minutes | >10 minutes |
| `reserved_instances` | ~1 hour | >2 hours |
| `savings_plans` | ~1 hour | >2 hours |
| `pricing` | ~24 hours | >48 hours |
| `sp_rates` | ~2 minutes | >10 minutes |
| `spot_pricing` | ~1 hour | >2 hours |
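
Because each data type has its own threshold, a single cutoff does not work for all of them. The sketch below shows one way to apply the alert thresholds from the table to a set of freshness readings. It is an illustration only: the function name and the input dictionary (data type mapped to the current lumina_data_freshness_seconds value) are assumptions, not part of Lumina.

```python
# Alert thresholds from the table above, converted to seconds.
ALERT_THRESHOLD_SECONDS = {
    "ec2_instances": 10 * 60,
    "reserved_instances": 2 * 3600,
    "savings_plans": 2 * 3600,
    "pricing": 48 * 3600,
    "sp_rates": 10 * 60,
    "spot_pricing": 2 * 3600,
}


def stale_data_types(freshness_seconds: dict[str, float]) -> list[str]:
    """Return the data types whose age exceeds their alert threshold, sorted by name.

    Data types that are not in the table are ignored.
    """
    return sorted(
        data_type
        for data_type, age in freshness_seconds.items()
        if data_type in ALERT_THRESHOLD_SECONDS
        and age > ALERT_THRESHOLD_SECONDS[data_type]
    )


# Hypothetical readings: pricing at 25 hours is within its 48-hour threshold,
# while ec2_instances at 15 minutes is past its 10-minute threshold.
readings = {"ec2_instances": 900, "pricing": 90000, "sp_rates": 120}
print(stale_data_types(readings))
```

With these sample readings only ec2_instances is reported, even though the pricing age is far larger in absolute terms.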

If a specific data type is consistently stale, check:

- AWS API rate limiting (check logs for throttling errors)
- Network connectivity to AWS endpoints
- IAM permissions for the specific API (see Installation Guide)

Pricing Accuracy Issues

Symptoms

Some instances show pricing_accuracy="estimated" instead of "accurate"

Resolution

This is normal during cache warming (first few minutes after startup). The SP Rates Reconciler needs to fetch rates for all instance type/region combinations.

Monitor cache effectiveness using the cost metrics:

```
# Percentage of total hourly cost that is based on accurate pricing
sum(ec2_instance_hourly_cost{pricing_accuracy="accurate"}) /
sum(ec2_instance_hourly_cost) * 100
```
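
The query above divides summed costs, so the result is weighted by hourly cost rather than by instance count: one expensive instance with estimated pricing lowers the figure more than several cheap ones. The following Python sketch illustrates the difference between the two ratios on made-up data. The function name and input shape are hypothetical.

```python
def accurate_share(instances: list[tuple[float, str]]) -> tuple[float, float]:
    """Compute the share of accurate pricing in two ways.

    instances: (hourly_cost, pricing_accuracy) pairs.
    Returns (cost_weighted_percent, instance_count_percent).
    """
    total_cost = sum(cost for cost, _ in instances)
    accurate_cost = sum(cost for cost, acc in instances if acc == "accurate")
    accurate_count = sum(1 for _, acc in instances if acc == "accurate")
    cost_pct = accurate_cost / total_cost * 100 if total_cost else 0.0
    count_pct = accurate_count / len(instances) * 100 if instances else 0.0
    return cost_pct, count_pct


# Made-up example: one accurate instance at 3.0/h, one estimated at 1.0/h.
cost_pct, count_pct = accurate_share([(3.0, "accurate"), (1.0, "estimated")])
print(cost_pct, count_pct)  # 75.0 50.0
```

If the share of instances is what you want to track, a count-based ratio (PromQL `count` in place of `sum`) would be the equivalent query.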

If estimated pricing persists:

Check SP rates cache via the debug endpoints:

```
curl http://localhost:8080/debug/cache/pricing/sp | jq
```

Look for specific missing rates:

```
curl "http://localhost:8080/debug/cache/pricing/sp/lookup?instance_type=<type>&region=<region>&sp=<arn>" | jq
```

Some rate combinations may not exist (e.g., Windows rate for a Linux-only SP), which is indicated by sentinel values.

High Memory Usage

Symptoms

- Controller OOMKilled
- Memory usage growing over time

Resolution

Reduce pricing data by limiting operating systems (see Pricing Configuration):

```
config:
  pricing:
    operatingSystems:
      - "Linux"  # Only load Linux pricing
```

Reduce the number of configured regions if not all are needed.
Increase memory limits:
```
resources:
  limits:
    memory: 1Gi
```