# S3 Health States
KafScale brokers continuously monitor S3 availability and publish a health state based on latency and error-rate sampling. This state controls broker behavior and provides observability into storage health.
## State definitions
| State | Value | Condition | Broker behavior |
|---|---|---|---|
| Healthy | 0 | Latency < warn AND error rate < warn | Normal operation |
| Degraded | 1 | Latency >= warn OR error rate >= warn | Accepts requests, emits warnings |
| Unavailable | 2 | Latency >= crit OR error rate >= crit | Rejects produces, serves cached fetches |
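In code, this table reduces to a simple threshold check: crit takes precedence over warn, and either signal (latency or error rate) is enough to escalate. The sketch below is illustrative only; the type and function names are not part of KafScale, and the thresholds come from the configuration section that follows.
```go
package s3health

// State mirrors the numeric values exposed by the kafscale_s3_health_state gauge.
type State int

const (
	Healthy     State = 0
	Degraded    State = 1
	Unavailable State = 2
)

// Thresholds holds the warn/crit limits (see the configuration table below).
type Thresholds struct {
	LatencyWarnMs, LatencyCritMs float64
	ErrorRateWarn, ErrorRateCrit float64
}

// Classify maps latency and error rate sampled over the health window to a
// state: crit thresholds take precedence, and either condition alone escalates.
func Classify(latencyMs, errorRate float64, t Thresholds) State {
	switch {
	case latencyMs >= t.LatencyCritMs || errorRate >= t.ErrorRateCrit:
		return Unavailable
	case latencyMs >= t.LatencyWarnMs || errorRate >= t.ErrorRateWarn:
		return Degraded
	default:
		return Healthy
	}
}
```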
## Threshold configuration
| Variable | Default | Description |
|---|---|---|
| `KAFSCALE_S3_LATENCY_WARN_MS` | 500 | Latency threshold (ms) for the degraded state |
| `KAFSCALE_S3_LATENCY_CRIT_MS` | 2000 | Latency threshold (ms) for the unavailable state |
| `KAFSCALE_S3_ERROR_RATE_WARN` | 0.01 | Error rate threshold for the degraded state (1%) |
| `KAFSCALE_S3_ERROR_RATE_CRIT` | 0.05 | Error rate threshold for the unavailable state (5%) |
| `KAFSCALE_S3_HEALTH_WINDOW_SEC` | 60 | Sampling window (seconds) for health calculation |
Tune these based on your S3 region and latency expectations. Cross-region S3 access may require higher latency thresholds.
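For illustration, here is how a loader might apply these documented defaults when a variable is unset. This continues the illustrative package above (it reuses the `Thresholds` type) and is not KafScale's actual configuration code.
```go
package s3health

import (
	"os"
	"strconv"
)

// envFloat reads an environment variable as a float64, falling back to def
// when the variable is unset or unparsable. Illustrative helper only.
func envFloat(key string, def float64) float64 {
	if v, ok := os.LookupEnv(key); ok {
		if f, err := strconv.ParseFloat(v, 64); err == nil {
			return f
		}
	}
	return def
}

// LoadThresholds applies the documented defaults for each threshold variable.
func LoadThresholds() Thresholds {
	return Thresholds{
		LatencyWarnMs: envFloat("KAFSCALE_S3_LATENCY_WARN_MS", 500),
		LatencyCritMs: envFloat("KAFSCALE_S3_LATENCY_CRIT_MS", 2000),
		ErrorRateWarn: envFloat("KAFSCALE_S3_ERROR_RATE_WARN", 0.01),
		ErrorRateCrit: envFloat("KAFSCALE_S3_ERROR_RATE_CRIT", 0.05),
	}
}
```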
## Metrics
| Metric | Type | Description |
|---|---|---|
| `kafscale_s3_health_state` | Gauge | Current health state (0, 1, or 2) |
| `kafscale_s3_latency_ms_avg` | Gauge | Average S3 operation latency (ms) over the sampling window |
| `kafscale_s3_latency_ms_p99` | Gauge | p99 S3 operation latency (ms) |
| `kafscale_s3_error_rate` | Gauge | Error rate over the sampling window (0.0 to 1.0) |
| `kafscale_s3_state_duration_seconds` | Gauge | Time (seconds) spent in the current state |
| `kafscale_s3_state_transitions_total` | Counter | Total state transitions (labels: `from`, `to`) |
| `kafscale_s3_operations_total` | Counter | Total S3 operations (labels: `operation`, `status`) |
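To make the metric shapes concrete, here is how metrics with these names, types, and labels could be declared with the prometheus/client_golang library. This is a sketch of equivalent declarations, not KafScale's instrumentation code.
```go
package s3health

import "github.com/prometheus/client_golang/prometheus"

var (
	// Gauge holding the numeric state (0 healthy, 1 degraded, 2 unavailable).
	s3HealthState = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "kafscale_s3_health_state",
		Help: "Current S3 health state (0, 1, or 2).",
	})

	// Counter of state transitions, labeled by the states involved.
	s3StateTransitions = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "kafscale_s3_state_transitions_total",
		Help: "Total S3 health state transitions.",
	}, []string{"from", "to"})

	// Counter of S3 operations, labeled by operation and outcome.
	s3Operations = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "kafscale_s3_operations_total",
		Help: "Total S3 operations.",
	}, []string{"operation", "status"})
)

func init() {
	prometheus.MustRegister(s3HealthState, s3StateTransitions, s3Operations)
}
```
A transition would then be recorded with something like `s3StateTransitions.WithLabelValues("degraded", "unavailable").Inc()`; the label value format here is assumed, not documented.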
## Alerting rules
Wire S3 health into Prometheus Alertmanager:
```yaml
groups:
  - name: kafscale-s3-health
    rules:
      - alert: KafScaleS3Unavailable
        expr: kafscale_s3_health_state == 2
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "KafScale S3 unavailable"
          description: "Broker {{ $labels.pod }} cannot reach S3. Produces are rejected."
      - alert: KafScaleS3Degraded
        expr: kafscale_s3_health_state == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "KafScale S3 degraded"
          description: "Broker {{ $labels.pod }} S3 latency or error rate elevated for 5+ minutes."
      - alert: KafScaleS3LatencyHigh
        expr: kafscale_s3_latency_ms_avg > 300
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "KafScale S3 latency elevated"
          description: "Average S3 latency {{ $value }}ms on {{ $labels.pod }}."
      - alert: KafScaleS3ErrorRateHigh
        expr: kafscale_s3_error_rate > 0.005
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "KafScale S3 error rate elevated"
          description: "S3 error rate {{ $value | humanizePercentage }} on {{ $labels.pod }}."
```
## Behavior by state
### Healthy
Normal operation. All produce and fetch requests are processed.
### Degraded
Brokers continue to accept requests but emit warning logs and increment `kafscale_s3_degraded_requests_total`. Monitor this state to catch issues before they escalate.
### Unavailable
Brokers protect data integrity by rejecting produce requests with a retriable error code. Clients should retry with exponential backoff (a minimal sketch follows the error example below). Fetch requests are served from cache when possible.
```text
# Client sees this error during unavailable state
ERROR: [kafka] produce failed: KAFKA_STORAGE_ERROR (retriable)
```
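If your Kafka client library does not already retry retriable errors for you, a backoff loop around the produce call is enough. The helper below is a generic sketch; `produce` stands in for whatever send call your client exposes.
```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// retryProduce retries a produce call with exponential backoff, which is the
// recommended client behavior while the broker reports S3 as unavailable.
func retryProduce(produce func() error, attempts int, base time.Duration) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = produce(); err == nil {
			return nil
		}
		backoff := base * time.Duration(1<<i) // 100ms, 200ms, 400ms, ...
		time.Sleep(backoff)
	}
	return fmt.Errorf("produce failed after %d attempts: %w", attempts, err)
}

func main() {
	// Simulated client call that always fails, mimicking the retriable
	// KAFKA_STORAGE_ERROR returned while S3 is unavailable.
	produce := func() error { return errors.New("KAFKA_STORAGE_ERROR (retriable)") }

	if err := retryProduce(produce, 5, 100*time.Millisecond); err != nil {
		fmt.Println(err)
	}
}
```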
## Ops API endpoints
Query health state via the ops API:
```bash
# Get current health state
curl http://localhost:9093/ops/health/s3
```
Response:
```json
{
  "state": "healthy",
  "state_value": 0,
  "latency_ms_avg": 87,
  "latency_ms_p99": 142,
  "error_rate": 0.0,
  "state_duration_seconds": 3847,
  "window_seconds": 60
}
```
Query S3 health history:
```bash
# Get health history (last hour)
curl "http://localhost:9093/ops/health/s3/history?minutes=60"
```
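For automation (for example, a pre-deploy or readiness check), the response can be decoded directly. The struct below simply mirrors the JSON fields shown above; the endpoint URL assumes the default ops port used in these examples.
```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// s3Health mirrors the JSON returned by /ops/health/s3.
type s3Health struct {
	State                string  `json:"state"`
	StateValue           int     `json:"state_value"`
	LatencyMsAvg         float64 `json:"latency_ms_avg"`
	LatencyMsP99         float64 `json:"latency_ms_p99"`
	ErrorRate            float64 `json:"error_rate"`
	StateDurationSeconds float64 `json:"state_duration_seconds"`
	WindowSeconds        int     `json:"window_seconds"`
}

func main() {
	resp, err := http.Get("http://localhost:9093/ops/health/s3")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var h s3Health
	if err := json.NewDecoder(resp.Body).Decode(&h); err != nil {
		panic(err)
	}

	// Treat anything other than healthy (0) as a reason to investigate.
	fmt.Printf("S3 health: %s (avg latency %.0f ms, error rate %.3f)\n",
		h.State, h.LatencyMsAvg, h.ErrorRate)
	if h.StateValue != 0 {
		fmt.Println("warning: S3 is not healthy")
	}
}
```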
## Tuning recommendations
| Scenario | Recommended thresholds |
|---|---|
| Same-region S3 | warn: 500ms / 1%, crit: 2000ms / 5% (defaults) |
| Cross-region S3 | warn: 1000ms / 1%, crit: 5000ms / 5% |
| High-throughput | warn: 300ms / 0.5%, crit: 1000ms / 2% |
| Cost-optimized (S3 Standard-IA) | warn: 800ms / 1%, crit: 3000ms / 5% |
## Next steps
- Operations for monitoring and scaling
- Runtime Settings for all environment variables
- Metrics for complete metrics reference