Metrics and Dashboards
KafScale exposes Prometheus metrics on /metrics from both brokers and the operator.
The console UI and Grafana dashboard templates are built on the same metrics.
Endpoints
- Broker metrics: http://<broker-host>:9093/metrics
- Operator metrics: http://<operator-host>:8080/metrics
In local development, the console can scrape broker metrics if you set
KAFSCALE_CONSOLE_BROKER_METRICS_URL.
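Both endpoints return the Prometheus text exposition format, so they can be inspected without a full Prometheus setup. The following is a minimal sketch, not part of KafScale: the helper names `parse_samples` and `fetch_samples` are hypothetical, the sample payload values are made up, and the assumption that inactive S3 states are exported with value 0 is illustrative only.

```python
import re
from urllib.request import urlopen

# One sample line: metric name, optional {labels}, then the value.
# Any trailing timestamp is ignored.
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# One label pair inside the braces: key="value" (escapes are kept as-is).
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if match is None:
            continue  # skip anything that is not a sample line
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def fetch_samples(url, timeout=5.0):
    """Download a /metrics page and parse it (requires network access)."""
    with urlopen(url, timeout=timeout) as response:
        return parse_samples(response.read().decode("utf-8"))


# Illustrative payload only; real values come from your environment.
SAMPLE_TEXT = """\
# TYPE kafscale_s3_health_state gauge
kafscale_s3_health_state{state="healthy"} 1
kafscale_s3_health_state{state="degraded"} 0
kafscale_produce_rps 12.5
"""

samples = parse_samples(SAMPLE_TEXT)
```

Against a running broker, `fetch_samples("http://<broker-host>:9093/metrics")` would return the same tuple shape, with the placeholder host replaced by a real one.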
ISR Terminology
The broker advertises ISR (in-sync replica) values as part of metadata responses
and exposes related counts in metrics and the UI. These values reflect KafScale’s
logical replica set in etcd metadata, not Kafka’s internal LeaderAndIsr protocol.
KafScale does not implement Kafka ISR management or any internal replication APIs;
ISR here is a metadata indicator used for client compatibility and visibility.
Broker Metrics
Broker metrics are emitted directly by the broker process.
| Metric | Type | Labels | Description |
|---|---|---|---|
| kafscale_s3_health_state | Gauge | state | 1 for the active S3 health state (healthy, degraded, unavailable). |
| kafscale_s3_latency_ms_avg | Gauge | - | Average S3 latency (ms) over the sliding window. |
| kafscale_s3_error_rate | Gauge | - | Fraction of failed S3 operations in the sliding window. |
| kafscale_s3_state_duration_seconds | Gauge | - | Seconds spent in the current S3 health state. |
| kafscale_produce_rps | Gauge | - | Produce requests per second (sliding window). |
| kafscale_fetch_rps | Gauge | - | Fetch requests per second (sliding window). |
| kafscale_admin_requests_total | Counter | api | Count of admin API requests by API name. |
| kafscale_admin_request_errors_total | Counter | api | Count of admin API errors by API name. |
| kafscale_admin_request_latency_ms_avg | Gauge | api | Average admin API latency (ms). |
Values of the api label are human-readable for common operational APIs
(DescribeGroups, ListGroups, OffsetForLeaderEpoch, DescribeConfigs,
AlterConfigs). Less common APIs appear as api_<id>, where <id> is the numeric
Kafka API key (for example, api_37 for CreatePartitions).
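When building dashboards or alerts, it can help to turn an api_<id> label back into a readable name. The sketch below is one possible approach and is not part of KafScale: `describe_api_label` is a hypothetical helper, and the lookup table covers only the APIs named in this section, using their keys from the Kafka protocol.

```python
# Kafka protocol API keys for the APIs mentioned in this section.
KAFKA_API_NAMES = {
    15: "DescribeGroups",
    16: "ListGroups",
    23: "OffsetForLeaderEpoch",
    32: "DescribeConfigs",
    33: "AlterConfigs",
    37: "CreatePartitions",
}


def describe_api_label(label):
    """Map an `api` label value to a display string.

    Labels that are already human-readable are returned unchanged;
    labels of the form api_<id> are resolved through KAFKA_API_NAMES.
    """
    prefix = "api_"
    if label.startswith(prefix):
        suffix = label[len(prefix):]
        if suffix.isdigit():
            key = int(suffix)
            name = KAFKA_API_NAMES.get(key)
            if name is None:
                return f"unknown API (key {key})"
            return f"{name} (key {key})"
    return label
```

For example, `describe_api_label("api_37")` resolves to a CreatePartitions display string, while `describe_api_label("ListGroups")` is passed through as-is.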
Operator Metrics
Operator metrics are exported by the controller-runtime metrics server.
| Metric | Type | Labels | Description |
|---|---|---|---|
| kafscale_operator_clusters | Gauge | - | Count of managed KafScaleCluster resources. |
| kafscale_operator_snapshot_publish_total | Counter | result | Snapshot publish attempts (success or error). |
| kafscale_operator_etcd_snapshot_age_seconds | Gauge | cluster | Seconds since last successful etcd snapshot upload. |
| kafscale_operator_etcd_snapshot_last_success_timestamp | Gauge | cluster | Unix timestamp of last successful snapshot upload. |
| kafscale_operator_etcd_snapshot_last_schedule_timestamp | Gauge | cluster | Unix timestamp of last scheduled snapshot job. |
| kafscale_operator_etcd_snapshot_stale | Gauge | cluster | 1 when the snapshot age exceeds the staleness threshold. |
| kafscale_operator_etcd_snapshot_success | Gauge | cluster | 1 if at least one successful snapshot was recorded. |
| kafscale_operator_etcd_snapshot_access_ok | Gauge | cluster | 1 if the snapshot bucket preflight succeeds. |
The cluster label uses namespace/name.
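The snapshot gauges relate to each other in a simple way: the age is the time elapsed since the last successful upload, and the stale flag compares that age to a threshold. The sketch below illustrates that relationship and how to split the cluster label. It is not the operator's code: the function names are hypothetical, and the threshold is a caller-supplied value because the operator's actual threshold is not stated here.

```python
import time


def split_cluster_label(label):
    """Split a `cluster` label of the form namespace/name."""
    namespace, sep, name = label.partition("/")
    if not sep or not namespace or not name:
        raise ValueError(f"expected namespace/name, got {label!r}")
    return namespace, name


def snapshot_age_seconds(last_success_timestamp, now=None):
    """Seconds elapsed since the last successful snapshot upload."""
    current = time.time() if now is None else now
    # Clamp at zero in case of small clock skew between hosts.
    return max(0.0, current - last_success_timestamp)


def is_stale(age_seconds, threshold_seconds):
    """True when the age exceeds the (assumed, caller-supplied) threshold."""
    return age_seconds > threshold_seconds
```

A check for a single cluster could then read `is_stale(snapshot_age_seconds(ts), threshold)`, where `ts` comes from kafscale_operator_etcd_snapshot_last_success_timestamp and `threshold` is whatever limit your alerting policy uses.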
Grafana Dashboard
The Grafana template lives in docs/grafana/broker-dashboard.json. It expects
Prometheus to scrape both broker and operator metrics endpoints.
Metric Coverage
Metric names and behavior evolve as the platform grows. When in doubt, consult
the /metrics endpoint in your environment to see the current exported series.
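One way to see which series a given build exports is to read the `# TYPE` comment lines of a /metrics response. The sketch below assumes the standard Prometheus text exposition format; `list_metric_families` is a hypothetical helper, and the sample payload is illustrative rather than real output.

```python
def list_metric_families(text, prefix="kafscale_"):
    """Return {metric_name: type} for TYPE lines whose name starts with prefix."""
    families = {}
    for line in text.splitlines():
        parts = line.split()
        # TYPE comments have the shape: "# TYPE <name> <type>"
        if len(parts) == 4 and parts[0] == "#" and parts[1] == "TYPE":
            name, kind = parts[2], parts[3]
            if name.startswith(prefix):
                families[name] = kind
    return families


# Illustrative payload only; fetch the real text from your /metrics endpoint.
EXAMPLE_TEXT = """\
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 42
# TYPE kafscale_fetch_rps gauge
kafscale_fetch_rps 3
# TYPE kafscale_admin_requests_total counter
kafscale_admin_requests_total{api="ListGroups"} 7
"""

families = list_metric_families(EXAMPLE_TEXT)
```

Comparing the resulting names against the tables in this document shows which metrics have been added, renamed, or removed in your version.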