# Metrics and Dashboards
KafScale exposes Prometheus metrics on /metrics from both brokers and the operator.
The console UI and Grafana dashboard templates are built on the same metrics.
## Endpoints

- Broker metrics – `http://<broker-host>:9093/metrics`
- Operator metrics – `http://<operator-host>:8080/metrics`

In local development, the console can scrape broker metrics if you set
`KAFSCALE_CONSOLE_BROKER_METRICS_URL`. Operator metrics can be wired into the
console via `KAFSCALE_CONSOLE_OPERATOR_METRICS_URL`.
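Outside Kubernetes (see the ServiceMonitor section below for in-cluster scraping), a plain Prometheus scrape configuration pointed at these two ports collects everything listed in the tables that follow. A minimal sketch, assuming the default ports above; the job names and hostnames are placeholders:

```yaml
scrape_configs:
  - job_name: kafscale-brokers
    metrics_path: /metrics
    static_configs:
      - targets: ["broker-host:9093"]
  - job_name: kafscale-operator
    metrics_path: /metrics
    static_configs:
      - targets: ["operator-host:8080"]
```

Keeping the broker job named `kafscale-brokers` also lines up with the `up{job="kafscale-brokers"}` expression used in the alert rules below.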
## ISR Terminology
The broker advertises ISR (in-sync replica) values as part of metadata responses
and exposes related counts in metrics and the console UI. This reflects KafScale’s logical
replica set in etcd metadata, not Kafka’s internal LeaderAndIsr protocol. We
do not implement Kafka ISR management or any internal replication APIs; ISR here
is a metadata indicator used for client compatibility and visibility.
## Broker Metrics
Broker metrics are emitted directly by the broker process.
| Metric | Type | Labels | Description |
|---|---|---|---|
| `kafscale_s3_health_state` | Gauge | `state` | 1 for the active S3 health state (`healthy`, `degraded`, `unavailable`). |
| `kafscale_s3_latency_ms_avg` | Gauge | - | Average S3 latency (ms) over the sliding window. |
| `kafscale_s3_error_rate` | Gauge | - | Fraction of failed S3 operations in the sliding window. |
| `kafscale_s3_state_duration_seconds` | Gauge | - | Seconds spent in the current S3 health state. |
| `kafscale_produce_rps` | Gauge | - | Produce requests per second (sliding window). |
| `kafscale_fetch_rps` | Gauge | - | Fetch requests per second (sliding window). |
| `kafscale_produce_latency_ms` | Histogram | - | Produce request latency distribution (use p95 in PromQL). |
| `kafscale_consumer_lag` | Histogram | - | Consumer lag distribution (use p95 in PromQL). |
| `kafscale_consumer_lag_max` | Gauge | - | Maximum observed consumer lag. |
| `kafscale_broker_uptime_seconds` | Gauge | - | Seconds since broker start. |
| `kafscale_broker_cpu_percent` | Gauge | - | Process CPU usage percent between scrapes. |
| `kafscale_broker_mem_alloc_bytes` | Gauge | - | Allocated heap bytes. |
| `kafscale_broker_mem_sys_bytes` | Gauge | - | Memory obtained from the OS. |
| `kafscale_broker_heap_inuse_bytes` | Gauge | - | Heap in-use bytes. |
| `kafscale_broker_goroutines` | Gauge | - | Number of goroutines. |
| `kafscale_admin_requests_total` | Counter | `api` | Count of admin API requests by API name. |
| `kafscale_admin_request_errors_total` | Counter | `api` | Count of admin API errors by API name. |
| `kafscale_admin_request_latency_ms_avg` | Gauge | `api` | Average admin API latency (ms). |
| `kafscale_authz_denied_total` | Counter | `action`, `resource` | Count of authorization denials by action/resource. |
Admin API label values are human-readable for common ops APIs
(DescribeGroups, ListGroups, OffsetForLeaderEpoch, DescribeConfigs,
AlterConfigs). Less common keys show as api_<id> (for example,
api_37 for CreatePartitions).
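The two histogram metrics are intended to be queried as quantiles. One way to do that is with Prometheus recording rules; this is a sketch, assuming the standard `_bucket` series that Prometheus histograms expose (the `kafscale_produce_latency_ms_bucket` name also appears in the alert rules below, while `kafscale_consumer_lag_bucket` and the rule names are assumptions):

```yaml
groups:
  - name: kafscale-broker-quantiles
    rules:
      # p95 produce latency (ms) over a 5-minute window, per broker instance.
      - record: kafscale:produce_latency_ms:p95
        expr: histogram_quantile(0.95, sum(rate(kafscale_produce_latency_ms_bucket[5m])) by (le, instance))
      # p95 consumer lag over the same window.
      - record: kafscale:consumer_lag:p95
        expr: histogram_quantile(0.95, sum(rate(kafscale_consumer_lag_bucket[5m])) by (le, instance))
```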
## Operator Metrics
Operator metrics are exported by the controller runtime metrics server.
| Metric | Type | Labels | Description |
|---|---|---|---|
| `kafscale_operator_clusters` | Gauge | - | Count of managed KafScaleCluster resources. |
| `kafscale_operator_snapshot_publish_total` | Counter | `result` | Snapshot publish attempts (success or error). |
| `kafscale_operator_etcd_snapshot_age_seconds` | Gauge | `cluster` | Seconds since last successful etcd snapshot upload. |
| `kafscale_operator_etcd_snapshot_last_success_timestamp` | Gauge | `cluster` | Unix timestamp of last successful snapshot upload. |
| `kafscale_operator_etcd_snapshot_last_schedule_timestamp` | Gauge | `cluster` | Unix timestamp of last scheduled snapshot job. |
| `kafscale_operator_etcd_snapshot_stale` | Gauge | `cluster` | 1 when the snapshot age exceeds the staleness threshold. |
| `kafscale_operator_etcd_snapshot_success` | Gauge | `cluster` | 1 if at least one successful snapshot was recorded. |
| `kafscale_operator_etcd_snapshot_access_ok` | Gauge | `cluster` | 1 if the snapshot bucket preflight succeeds. |
The `cluster` label takes the form `namespace/name`.
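For ad-hoc checks, snapshot freshness can also be derived directly from the last-success timestamp; the recording rule below is a sketch, and its name is illustrative rather than something KafScale ships:

```yaml
groups:
  - name: kafscale-operator-snapshots
    rules:
      # Seconds since the last successful snapshot upload, per cluster.
      # Should track kafscale_operator_etcd_snapshot_age_seconds closely.
      - record: kafscale:etcd_snapshot_age_seconds
        expr: time() - kafscale_operator_etcd_snapshot_last_success_timestamp
```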
## Kubernetes ServiceMonitor

If you’re using the Prometheus Operator, create ServiceMonitors for the brokers and the operator:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kafscale-brokers
  labels:
    app: kafscale
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: kafscale
      app.kubernetes.io/component: broker
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kafscale-operator
  labels:
    app: kafscale
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: kafscale
      app.kubernetes.io/component: operator
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
```
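These ServiceMonitors assume broker and operator Services that carry the `app.kubernetes.io/name: kafscale` and matching `app.kubernetes.io/component` labels and expose a port named `metrics`. If your deployment doesn’t already include such a Service, a broker-side sketch might look like the following (the Service name and selector labels here are illustrative; adjust them to whatever your install actually creates):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: kafscale-broker-metrics   # illustrative name
  labels:
    app.kubernetes.io/name: kafscale
    app.kubernetes.io/component: broker
spec:
  selector:
    app.kubernetes.io/name: kafscale
    app.kubernetes.io/component: broker
  ports:
    - name: metrics   # must match the ServiceMonitor's `port: metrics`
      port: 9093
      targetPort: 9093
```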
## Recommended Alert Rules
Example Prometheus alerting rules for production deployments:
```yaml
groups:
  - name: kafscale
    rules:
      - alert: KafScaleS3Unhealthy
        expr: kafscale_s3_health_state{state="healthy"} != 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "KafScale S3 connection unhealthy"
          description: "Broker {{ $labels.instance }} has been in non-healthy S3 state for 2+ minutes."
      - alert: KafScaleHighProduceLatency
        expr: histogram_quantile(0.95, rate(kafscale_produce_latency_ms_bucket[5m])) > 500
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "KafScale produce latency elevated"
          description: "P95 produce latency on {{ $labels.instance }} exceeds 500ms."
      - alert: KafScaleConsumerLagHigh
        expr: kafscale_consumer_lag_max > 100000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "KafScale consumer lag high"
          description: "Consumer lag on {{ $labels.instance }} exceeds 100k messages."
      - alert: KafScaleEtcdSnapshotStale
        expr: kafscale_operator_etcd_snapshot_stale == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "KafScale etcd snapshot stale"
          description: "Cluster {{ $labels.cluster }} has not had a successful snapshot recently."
      - alert: KafScaleS3ErrorRateHigh
        expr: kafscale_s3_error_rate > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "KafScale S3 error rate elevated"
          description: "S3 error rate on {{ $labels.instance }} exceeds 5%."
      - alert: KafScaleBrokerDown
        expr: up{job="kafscale-brokers"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "KafScale broker unreachable"
          description: "Broker {{ $labels.instance }} is not responding to scrapes."
```
Tune thresholds to your workload. KafScale’s S3-native architecture means produce
latencies are inherently higher than with traditional Kafka (typically 200-400 ms),
so adjust `KafScaleHighProduceLatency` accordingly.
## Grafana Dashboard

The Grafana template lives in `docs/grafana/broker-dashboard.json`. It expects
Prometheus to scrape both broker and operator metrics endpoints.
Import the dashboard via Grafana UI or provision it in your Grafana deployment:
```yaml
# Example provisioning config
apiVersion: 1
providers:
  - name: KafScale
    folder: KafScale
    type: file
    options:
      path: /var/lib/grafana/dashboards/kafscale
```
## Metric Coverage
Metric names and behavior evolve as the platform grows. When in doubt, consult
the /metrics endpoint in your environment to see the current exported series.