
S3 Health States

KafScale brokers continuously monitor S3 availability and publish a health state based on latency and error-rate sampling. This state controls broker behavior and provides observability into storage health.


State definitions

| State | Value | Condition | Broker behavior |
|-------|-------|-----------|-----------------|
| Healthy | 0 | Latency < warn AND error rate < warn | Normal operation |
| Degraded | 1 | Latency >= warn OR error rate >= warn | Accepts requests, emits warnings |
| Unavailable | 2 | Latency >= crit OR error rate >= crit | Rejects produces, serves cached fetches |
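
The mapping from samples to states fits in a few lines of code. The sketch below is illustrative only, not KafScale's implementation; it assumes the broker compares latency and error rate averaged over the health window against the warn/crit thresholds listed under Threshold configuration.

package main

import "fmt"

// HealthState mirrors the documented state values.
type HealthState int

const (
	Healthy     HealthState = 0
	Degraded    HealthState = 1
	Unavailable HealthState = 2
)

// Thresholds holds the warn/crit limits described under "Threshold configuration".
type Thresholds struct {
	LatencyWarnMS, LatencyCritMS float64 // milliseconds
	ErrorRateWarn, ErrorRateCrit float64 // fraction, 0.0 to 1.0
}

// classify maps sampled latency and error rate to a health state.
// Crit thresholds win over warn thresholds; either metric alone can lower the state.
func classify(latencyMS, errorRate float64, t Thresholds) HealthState {
	switch {
	case latencyMS >= t.LatencyCritMS || errorRate >= t.ErrorRateCrit:
		return Unavailable
	case latencyMS >= t.LatencyWarnMS || errorRate >= t.ErrorRateWarn:
		return Degraded
	default:
		return Healthy
	}
}

func main() {
	defaults := Thresholds{LatencyWarnMS: 500, LatencyCritMS: 2000, ErrorRateWarn: 0.01, ErrorRateCrit: 0.05}
	fmt.Println(classify(650, 0.002, defaults)) // prints 1 (Degraded: latency over warn)
}
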
State transitions
Transition summary: exceeding a warn threshold moves the broker from Healthy (0) to Degraded (1), exceeding a crit threshold moves it to Unavailable (2), and the broker returns to Healthy once metrics recover below the thresholds.

Threshold configuration

| Variable | Default | Description |
|----------|---------|-------------|
| KAFSCALE_S3_LATENCY_WARN_MS | 500 | Latency threshold for degraded state |
| KAFSCALE_S3_LATENCY_CRIT_MS | 2000 | Latency threshold for unavailable state |
| KAFSCALE_S3_ERROR_RATE_WARN | 0.01 | Error rate threshold for degraded (1%) |
| KAFSCALE_S3_ERROR_RATE_CRIT | 0.05 | Error rate threshold for unavailable (5%) |
| KAFSCALE_S3_HEALTH_WINDOW_SEC | 60 | Sampling window for health calculation (seconds) |

Tune these based on your S3 region and latency expectations. Cross-region S3 access may require higher latency thresholds.
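
For example, a cross-region deployment could raise the latency thresholds through environment variables. The snippet below is a hypothetical Kubernetes container spec fragment; only the variable names and the cross-region values come from the tables in this page.

# Hypothetical broker container env fragment
env:
  - name: KAFSCALE_S3_LATENCY_WARN_MS
    value: "1000"
  - name: KAFSCALE_S3_LATENCY_CRIT_MS
    value: "5000"
  - name: KAFSCALE_S3_ERROR_RATE_WARN
    value: "0.01"
  - name: KAFSCALE_S3_ERROR_RATE_CRIT
    value: "0.05"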


Metrics

| Metric | Type | Description |
|--------|------|-------------|
| kafscale_s3_health_state | Gauge | Current health state (0, 1, or 2) |
| kafscale_s3_latency_ms_avg | Gauge | Average S3 operation latency over window |
| kafscale_s3_latency_ms_p99 | Gauge | p99 S3 operation latency |
| kafscale_s3_error_rate | Gauge | Error rate over sampling window (0.0 to 1.0) |
| kafscale_s3_state_duration_seconds | Gauge | Time in current state |
| kafscale_s3_state_transitions_total | Counter | Total state transitions (labels: from, to) |
| kafscale_s3_operations_total | Counter | Total S3 operations (labels: operation, status) |
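
The transition counter is also useful for spotting flapping between states. A PromQL query along these lines (a sketch, not a shipped rule; the pod label matches the one used in the alert descriptions below) highlights brokers that changed state repeatedly in the last hour:

# Brokers with more than 3 S3 health state transitions in the past hour
sum by (pod) (increase(kafscale_s3_state_transitions_total[1h])) > 3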

Alerting rules

Wire S3 health into Prometheus alerting and route notifications through Alertmanager:

groups:
  - name: kafscale-s3-health
    rules:
      - alert: KafScaleS3Unavailable
        expr: kafscale_s3_health_state == 2
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "KafScale S3 unavailable"
          description: "Broker {{ $labels.pod }} cannot reach S3. Produces are rejected."

      - alert: KafScaleS3Degraded
        expr: kafscale_s3_health_state == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "KafScale S3 degraded"
          description: "Broker {{ $labels.pod }} S3 latency or error rate elevated for 5+ minutes."

      - alert: KafScaleS3LatencyHigh
        expr: kafscale_s3_latency_ms_avg > 300
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "KafScale S3 latency elevated"
          description: "Average S3 latency {{ $value }}ms on {{ $labels.pod }}."

      - alert: KafScaleS3ErrorRateHigh
        expr: kafscale_s3_error_rate > 0.005
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "KafScale S3 error rate elevated"
          description: "S3 error rate {{ $value | humanizePercentage }} on {{ $labels.pod }}."

Behavior by state

Healthy

Normal operation. All produce and fetch requests are processed.

Degraded

Brokers continue to accept requests but emit warning logs and increment kafscale_s3_degraded_requests_total. Monitor this state to catch issues before they escalate.
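
A quick way to watch that counter while degraded (a PromQL sketch; adjust the label set to your deployment):

# Rate of requests served while S3 is degraded, per broker
sum by (pod) (rate(kafscale_s3_degraded_requests_total[5m]))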

Unavailable

Brokers protect data integrity by rejecting produce requests with a retriable error code. Clients should retry with exponential backoff. Fetch requests are served from cache when possible.

# Client sees this error during unavailable state
ERROR: [kafka] produce failed: KAFKA_STORAGE_ERROR (retriable)
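
Most Kafka clients retry retriable errors automatically. If you drive produces yourself, a backoff loop along these lines is enough; this is a generic sketch, and produce and isRetriable are placeholders for whatever your client library exposes.

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errStorage stands in for the retriable KAFKA_STORAGE_ERROR a client surfaces
// while the broker reports S3 as unavailable.
var errStorage = errors.New("KAFKA_STORAGE_ERROR (retriable)")

// produce is a placeholder for your client library's produce call.
func produce(ctx context.Context, payload []byte) error { return errStorage }

// isRetriable is a placeholder; real clients expose this on the error type.
func isRetriable(err error) bool { return errors.Is(err, errStorage) }

// produceWithBackoff retries retriable produce failures with exponential backoff,
// giving up when the context is cancelled or the error is not retriable.
func produceWithBackoff(ctx context.Context, payload []byte) error {
	backoff := 100 * time.Millisecond
	const maxBackoff = 10 * time.Second
	for attempt := 1; ; attempt++ {
		err := produce(ctx, payload)
		if err == nil || !isRetriable(err) {
			return err
		}
		fmt.Printf("attempt %d failed (%v), retrying in %s\n", attempt, err, backoff)
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	if err := produceWithBackoff(ctx, []byte("hello")); err != nil {
		fmt.Println("gave up:", err)
	}
}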

Ops API endpoints

Query health state via the ops API:

# Get current health state
curl http://localhost:9093/ops/health/s3

Response:

{
  "state": "healthy",
  "state_value": 0,
  "latency_ms_avg": 87,
  "latency_ms_p99": 142,
  "error_rate": 0.0,
  "state_duration_seconds": 3847,
  "window_seconds": 60
}

Query S3 health history:

# Get health history (last hour)
curl "http://localhost:9093/ops/health/s3/history?minutes=60"
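
For automation, such as gating a rollout or feeding a dashboard, the health endpoint is easy to poll and decode. A minimal Go sketch, assuming the ops API listens on localhost:9093 as in the curl examples and returns the JSON shape shown above:

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// s3Health mirrors the fields returned by /ops/health/s3.
type s3Health struct {
	State                string  `json:"state"`
	StateValue           int     `json:"state_value"`
	LatencyMSAvg         float64 `json:"latency_ms_avg"`
	LatencyMSP99         float64 `json:"latency_ms_p99"`
	ErrorRate            float64 `json:"error_rate"`
	StateDurationSeconds float64 `json:"state_duration_seconds"`
	WindowSeconds        int     `json:"window_seconds"`
}

// fetchS3Health queries the ops API and decodes the health response.
func fetchS3Health(baseURL string) (*s3Health, error) {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(baseURL + "/ops/health/s3")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %s", resp.Status)
	}
	var h s3Health
	if err := json.NewDecoder(resp.Body).Decode(&h); err != nil {
		return nil, err
	}
	return &h, nil
}

func main() {
	h, err := fetchS3Health("http://localhost:9093")
	if err != nil {
		fmt.Println("health check failed:", err)
		return
	}
	fmt.Printf("S3 %s (value %d), avg latency %.0f ms, error rate %.2f%%\n",
		h.State, h.StateValue, h.LatencyMSAvg, h.ErrorRate*100)
}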

Tuning recommendations

| Scenario | Recommended thresholds |
|----------|------------------------|
| Same-region S3 | warn: 500ms / 1%, crit: 2000ms / 5% (defaults) |
| Cross-region S3 | warn: 1000ms / 1%, crit: 5000ms / 5% |
| High-throughput | warn: 300ms / 0.5%, crit: 1000ms / 2% |
| Cost-optimized (S3 Standard-IA) | warn: 800ms / 1%, crit: 3000ms / 5% |

Next steps