S3 Health States
KafScale brokers publish a live S3 health state based on latency and error-rate sampling windows.
State definitions
- HEALTHY: S3 latency and error rate below warning thresholds.
- DEGRADED: Latency or error rate above its warning threshold, with neither above its critical threshold.
- UNAVAILABLE: Latency or error rate above critical thresholds.
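The definitions above amount to a two-tier threshold check in which the critical tier takes precedence. A minimal Python sketch of that reading follows. The `Thresholds` class, the `classify` function, and the example numbers are all hypothetical and are not taken from KafScale's code. The sketch also assumes that a value exactly equal to a threshold does not exceed it, which the definitions leave unspecified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical container for the four warning/critical thresholds."""
    latency_warn_ms: float
    latency_crit_ms: float
    error_rate_warn: float
    error_rate_crit: float


def classify(latency_ms_avg: float, error_rate: float, t: Thresholds) -> str:
    """Map one window's averages to a state name, checking critical first."""
    if latency_ms_avg > t.latency_crit_ms or error_rate > t.error_rate_crit:
        return "UNAVAILABLE"
    if latency_ms_avg > t.latency_warn_ms or error_rate > t.error_rate_warn:
        return "DEGRADED"
    return "HEALTHY"


# Illustrative values only; these are assumptions, not KafScale defaults.
EXAMPLE = Thresholds(
    latency_warn_ms=500.0,
    latency_crit_ms=2000.0,
    error_rate_warn=0.01,
    error_rate_crit=0.05,
)
```

Because either signal alone can trip a tier, a window with low latency but a high error rate is still reported as UNAVAILABLE under this reading.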
Metrics and thresholds
Metrics:
- kafscale_s3_health_state (label state)
- kafscale_s3_latency_ms_avg
- kafscale_s3_error_rate
- kafscale_s3_state_duration_seconds
Thresholds (configurable):
- KAFSCALE_S3_LATENCY_WARN_MS
- KAFSCALE_S3_LATENCY_CRIT_MS
- KAFSCALE_S3_ERROR_RATE_WARN
- KAFSCALE_S3_ERROR_RATE_CRIT
- KAFSCALE_S3_HEALTH_WINDOW_SEC
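Before rolling out new threshold values, it can help to sanity-check them outside the broker. The Python sketch below reads the five variables listed above from an environment mapping and rejects a warning threshold that exceeds its critical counterpart. The variable names come from this page; the `load_thresholds` helper and the fallback numbers are hypothetical placeholders, not KafScale's real defaults, and treating every value as a plain number is an assumption about the expected format.

```python
import os
from typing import Dict, Mapping

# Placeholder fallbacks for illustration only; NOT KafScale's actual defaults.
FALLBACKS: Dict[str, float] = {
    "KAFSCALE_S3_LATENCY_WARN_MS": 500.0,
    "KAFSCALE_S3_LATENCY_CRIT_MS": 2000.0,
    "KAFSCALE_S3_ERROR_RATE_WARN": 0.01,
    "KAFSCALE_S3_ERROR_RATE_CRIT": 0.05,
    "KAFSCALE_S3_HEALTH_WINDOW_SEC": 60.0,
}


def load_thresholds(env: Mapping[str, str] = os.environ) -> Dict[str, float]:
    """Parse the threshold variables as floats and check warn <= crit."""
    values = {
        name: float(env.get(name, fallback))
        for name, fallback in FALLBACKS.items()
    }
    for kind in ("LATENCY", "ERROR_RATE"):
        suffix = "_MS" if kind == "LATENCY" else ""
        warn = values[f"KAFSCALE_S3_{kind}_WARN{suffix}"]
        crit = values[f"KAFSCALE_S3_{kind}_CRIT{suffix}"]
        if warn > crit:
            raise ValueError(f"{kind}: warning threshold {warn} exceeds critical {crit}")
    return values
```

Calling `load_thresholds()` with no argument checks the current process environment; passing a plain dict makes the same check usable in a deployment script or test.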
Alerting integration
Wire S3 health transitions into Prometheus rules or your alerting stack. A common pattern:
- Alert on state="unavailable" immediately.
- Alert on state="degraded" if sustained for more than a few minutes.
- Track latency/error trends to tune thresholds per region.
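The first two rules of this pattern can be sketched as a small decision function in Python. It assumes you already have the current state label and the time spent in that state (for example, scraped from kafscale_s3_state_duration_seconds). The `alert_for` name, the "page" and "warn" severities, and the 300-second hold are hypothetical choices standing in for "a few minutes"; in a Prometheus deployment the same logic would normally live in an alerting rule rather than in application code.

```python
from typing import Optional

# Assumed hold time standing in for "a few minutes"; tune for your environment.
DEGRADED_HOLD_SECONDS = 300.0


def alert_for(
    state: str,
    state_duration_seconds: float,
    degraded_hold_seconds: float = DEGRADED_HOLD_SECONDS,
) -> Optional[str]:
    """Return a hypothetical severity for the current state, or None."""
    if state == "unavailable":
        # Fire immediately, regardless of how long the state has lasted.
        return "page"
    if state == "degraded" and state_duration_seconds >= degraded_hold_seconds:
        # Fire only once the degraded state has been sustained.
        return "warn"
    return None
```

Holding off on degraded alerts filters out short latency spikes, while unavailable is treated as urgent from the first sample.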