Metrics and Dashboards

KafScale exposes Prometheus metrics on /metrics from both brokers and the operator. The console UI and Grafana dashboard templates are built on the same metrics.

Endpoints

  • Broker metrics: http://<broker-host>:9093/metrics
  • Operator metrics: http://<operator-host>:8080/metrics
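
As a rough illustration of reading these endpoints outside Prometheus, the sketch below fetches a /metrics page and parses it into (name, labels, value) samples. It assumes the endpoints serve the standard Prometheus text exposition format; parse_metrics and fetch_metrics are hypothetical helper names, not part of KafScale.

```python
import re
from urllib.request import urlopen

# Matches sample lines such as: metric_name{label="value"} 1.0
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# Matches one label pair inside the braces, allowing escaped characters.
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from Prometheus text output."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comments.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, raw_value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


def fetch_metrics(url, timeout=5.0):
    """Fetch a /metrics endpoint (hypothetical helper) and parse its samples."""
    with urlopen(url, timeout=timeout) as response:
        return parse_metrics(response.read().decode("utf-8"))
```

For instance, calling fetch_metrics("http://localhost:9093/metrics") would return the broker samples if a broker were reachable on localhost, which is an assumption about your environment.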

In local development, the console can scrape broker metrics if you set KAFSCALE_CONSOLE_BROKER_METRICS_URL.

ISR Terminology

The broker advertises ISR (in-sync replica) values as part of metadata responses and exposes related counts in metrics/UI. This reflects KafScale’s logical replica set in etcd metadata, not Kafka’s internal LeaderAndIsr protocol. We do not implement Kafka ISR management or any internal replication APIs; ISR here is a metadata indicator used for client compatibility and visibility.

Broker Metrics

Broker metrics are emitted directly by the broker process.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| kafscale_s3_health_state | Gauge | state | 1 for the active S3 health state (healthy, degraded, unavailable). |
| kafscale_s3_latency_ms_avg | Gauge | - | Average S3 latency (ms) over the sliding window. |
| kafscale_s3_error_rate | Gauge | - | Fraction of failed S3 operations in the sliding window. |
| kafscale_s3_state_duration_seconds | Gauge | - | Seconds spent in the current S3 health state. |
| kafscale_produce_rps | Gauge | - | Produce requests per second (sliding window). |
| kafscale_fetch_rps | Gauge | - | Fetch requests per second (sliding window). |
| kafscale_admin_requests_total | Counter | api | Count of admin API requests by API name. |
| kafscale_admin_request_errors_total | Counter | api | Count of admin API errors by API name. |
| kafscale_admin_request_latency_ms_avg | Gauge | api | Average admin API latency (ms). |
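
Because kafscale_s3_health_state is reported per state label, a consumer has to pick the series whose value is 1. A minimal sketch of that lookup follows. It assumes inactive states are either exported as 0 or not exported at all, and it takes samples as (name, labels, value) tuples; active_s3_state is a hypothetical helper name.

```python
def active_s3_state(samples):
    """Return the active S3 health state, or None if it cannot be determined.

    samples: iterable of (name, labels, value) tuples.
    Assumption: exactly one state series carries the value 1.
    """
    active = [
        labels.get("state")
        for name, labels, value in samples
        if name == "kafscale_s3_health_state" and value == 1.0
    ]
    # Zero or multiple active states means the input is ambiguous.
    if len(active) != 1:
        return None
    return active[0]
```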

Admin API label values are human-readable for common ops APIs (DescribeGroups, ListGroups, OffsetForLeaderEpoch, DescribeConfigs, AlterConfigs). Less common keys show as api_<id> (for example, api_37 for CreatePartitions).
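
The sketch below shows one way a dashboard or script might handle both label forms and derive a per-API error ratio from the two counters. The KAFKA_API_NAMES table is a small illustrative subset of Kafka protocol API keys, and display_api and error_ratio are hypothetical helper names, not KafScale functions.

```python
import re

# Illustrative subset of Kafka protocol API keys; extend as needed.
KAFKA_API_NAMES = {19: "CreateTopics", 20: "DeleteTopics", 37: "CreatePartitions"}

API_ID_RE = re.compile(r"^api_(\d+)$")


def display_api(label):
    """Map an api_<id> label to a readable name when the id is known."""
    match = API_ID_RE.match(label)
    if match is None:
        return label  # already human-readable, e.g. ListGroups
    return KAFKA_API_NAMES.get(int(match.group(1)), label)


def error_ratio(requests_by_api, errors_by_api):
    """Return errors / requests per API from cumulative counter values.

    Both arguments map the raw api label to the counter value.
    APIs with no recorded requests are skipped to avoid dividing by zero.
    """
    ratios = {}
    for api, total in requests_by_api.items():
        if total > 0:
            ratios[display_api(api)] = errors_by_api.get(api, 0.0) / total
    return ratios
```

These ratios cover the whole lifetime of the counters. For a ratio over a recent window, computing it from counter rates in Prometheus is usually the better fit.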

Operator Metrics

Operator metrics are exported by the controller runtime metrics server.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| kafscale_operator_clusters | Gauge | - | Count of managed KafScaleCluster resources. |
| kafscale_operator_snapshot_publish_total | Counter | result | Snapshot publish attempts (success or error). |
| kafscale_operator_etcd_snapshot_age_seconds | Gauge | cluster | Seconds since last successful etcd snapshot upload. |
| kafscale_operator_etcd_snapshot_last_success_timestamp | Gauge | cluster | Unix timestamp of last successful snapshot upload. |
| kafscale_operator_etcd_snapshot_last_schedule_timestamp | Gauge | cluster | Unix timestamp of last scheduled snapshot job. |
| kafscale_operator_etcd_snapshot_stale | Gauge | cluster | 1 when the snapshot age exceeds the staleness threshold. |
| kafscale_operator_etcd_snapshot_success | Gauge | cluster | 1 if at least one successful snapshot was recorded. |
| kafscale_operator_etcd_snapshot_access_ok | Gauge | cluster | 1 if the snapshot bucket preflight succeeds. |

The cluster label uses namespace/name.
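
As a sketch of how the cluster label and the snapshot gauges might be combined in a health check, the helpers below split a namespace/name label and list the problems indicated by one cluster's gauge values. The function names and the problem messages are hypothetical. The gauge semantics follow the table above.

```python
def split_cluster_label(label):
    """Split a namespace/name cluster label into its two parts."""
    namespace, sep, name = label.partition("/")
    if not sep or not namespace or not name:
        raise ValueError(f"expected namespace/name, got {label!r}")
    return namespace, name


def snapshot_problems(gauges):
    """List snapshot problems for a single cluster.

    gauges: dict mapping metric name to its current value for that cluster.
    Metrics missing from the dict are treated as unknown and are not reported.
    """
    problems = []
    if gauges.get("kafscale_operator_etcd_snapshot_access_ok") == 0:
        problems.append("snapshot bucket preflight failing")
    if gauges.get("kafscale_operator_etcd_snapshot_success") == 0:
        problems.append("no successful snapshot recorded")
    if gauges.get("kafscale_operator_etcd_snapshot_stale") == 1:
        problems.append("snapshot is stale")
    return problems
```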

Grafana Dashboard

The Grafana template lives in docs/grafana/broker-dashboard.json. It expects Prometheus to scrape both broker and operator metrics endpoints.

Metric Coverage

Metric names and behavior evolve as the platform grows. When in doubt, consult the /metrics endpoint in your environment to see the current exported series.
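
One way to make that check repeatable is to compare the names a dashboard or alert relies on with what an endpoint currently exports. The sketch below reads metric names from the # TYPE comment lines of a /metrics response. It assumes the endpoint emits those lines, as standard Prometheus client libraries do. exported_names and missing_metrics are hypothetical helper names.

```python
def exported_names(metrics_text, prefix="kafscale_"):
    """Collect metric names declared in '# TYPE <name> <type>' lines."""
    names = set()
    for line in metrics_text.splitlines():
        parts = line.split()
        if len(parts) == 4 and parts[0] == "#" and parts[1] == "TYPE":
            if parts[2].startswith(prefix):
                names.add(parts[2])
    return names


def missing_metrics(metrics_text, expected):
    """Return the expected metric names that the endpoint does not export."""
    return sorted(set(expected) - exported_names(metrics_text))
```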