
Metrics and Dashboards

KafScale exposes Prometheus metrics on /metrics from both brokers and the operator. The console UI and Grafana dashboard templates are built on the same metrics.

Endpoints

  • Broker metrics: http://<broker-host>:9093/metrics
  • Operator metrics: http://<operator-host>:8080/metrics
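
Outside Kubernetes, these endpoints can be scraped with a static Prometheus configuration. The following is a minimal sketch, not a shipped config: the host names are placeholders, and the job names are assumptions chosen so that the job label lines up with the up{job="kafscale-brokers"} alert expression used later on this page.

```yaml
# prometheus.yml (fragment): static scrape targets for KafScale.
# Host names are placeholders; replace them with your own broker and operator addresses.
scrape_configs:
  - job_name: kafscale-brokers
    metrics_path: /metrics
    scrape_interval: 15s
    static_configs:
      - targets:
          - "broker-0.example.internal:9093"
          - "broker-1.example.internal:9093"
  - job_name: kafscale-operator
    metrics_path: /metrics
    scrape_interval: 30s
    static_configs:
      - targets:
          - "operator.example.internal:8080"
```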

In local development, the console can scrape broker metrics if you set KAFSCALE_CONSOLE_BROKER_METRICS_URL. Operator metrics can be wired into the console via KAFSCALE_CONSOLE_OPERATOR_METRICS_URL.

ISR Terminology

The broker advertises ISR (in-sync replica) values in metadata responses and exposes related counts in metrics and the console UI. These values reflect KafScale’s logical replica set in etcd metadata, not Kafka’s internal LeaderAndIsr protocol. KafScale does not implement Kafka ISR management or any internal replication APIs; ISR here is a metadata indicator used for client compatibility and visibility.

Broker Metrics

Broker metrics are emitted directly by the broker process.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| kafscale_s3_health_state | Gauge | state | 1 for the active S3 health state (healthy, degraded, unavailable). |
| kafscale_s3_latency_ms_avg | Gauge | - | Average S3 latency (ms) over the sliding window. |
| kafscale_s3_error_rate | Gauge | - | Fraction of failed S3 operations in the sliding window. |
| kafscale_s3_state_duration_seconds | Gauge | - | Seconds spent in the current S3 health state. |
| kafscale_produce_rps | Gauge | - | Produce requests per second (sliding window). |
| kafscale_fetch_rps | Gauge | - | Fetch requests per second (sliding window). |
| kafscale_produce_latency_ms | Histogram | - | Produce request latency distribution (use p95 in PromQL). |
| kafscale_consumer_lag | Histogram | - | Consumer lag distribution (use p95 in PromQL). |
| kafscale_consumer_lag_max | Gauge | - | Maximum observed consumer lag. |
| kafscale_broker_uptime_seconds | Gauge | - | Seconds since broker start. |
| kafscale_broker_cpu_percent | Gauge | - | Process CPU usage percent between scrapes. |
| kafscale_broker_mem_alloc_bytes | Gauge | - | Allocated heap bytes. |
| kafscale_broker_mem_sys_bytes | Gauge | - | Memory obtained from the OS. |
| kafscale_broker_heap_inuse_bytes | Gauge | - | Heap in-use bytes. |
| kafscale_broker_goroutines | Gauge | - | Number of goroutines. |
| kafscale_admin_requests_total | Counter | api | Count of admin API requests by API name. |
| kafscale_admin_request_errors_total | Counter | api | Count of admin API errors by API name. |
| kafscale_admin_request_latency_ms_avg | Gauge | api | Average admin API latency (ms). |
| kafscale_authz_denied_total | Counter | action, resource | Count of authorization denials by action/resource. |

Values of the api label are human-readable for common operational APIs (DescribeGroups, ListGroups, OffsetForLeaderEpoch, DescribeConfigs, AlterConfigs). Less common API keys appear as api_<id>, where <id> is the Kafka API key (for example, api_37 for CreatePartitions).
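
The histogram and counter metrics above are usually queried through rate-based expressions rather than read directly. As a rough sketch, the recording rules below precompute p95 values and a per-API error ratio. The rule names and the 5m window are illustrative assumptions, and the expressions assume the histograms expose the standard Prometheus _bucket series.

```yaml
# Prometheus recording rules (sketch). Rule names are hypothetical.
groups:
  - name: kafscale-recording-examples
    rules:
      # p95 produce latency per broker instance, in milliseconds.
      - record: kafscale:produce_latency_ms:p95
        expr: histogram_quantile(0.95, sum by (le, instance) (rate(kafscale_produce_latency_ms_bucket[5m])))
      # p95 of the consumer lag distribution per broker instance.
      - record: kafscale:consumer_lag:p95
        expr: histogram_quantile(0.95, sum by (le, instance) (rate(kafscale_consumer_lag_bucket[5m])))
      # Fraction of admin API requests that failed, grouped by the api label.
      # Evaluates to NaN for APIs that received no requests in the window.
      - record: kafscale:admin_request_error_ratio:rate5m
        expr: |
          sum by (api) (rate(kafscale_admin_request_errors_total[5m]))
            /
          sum by (api) (rate(kafscale_admin_requests_total[5m]))
```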

Operator Metrics

Operator metrics are exported by the controller runtime metrics server.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| kafscale_operator_clusters | Gauge | - | Count of managed KafScaleCluster resources. |
| kafscale_operator_snapshot_publish_total | Counter | result | Snapshot publish attempts (success or error). |
| kafscale_operator_etcd_snapshot_age_seconds | Gauge | cluster | Seconds since last successful etcd snapshot upload. |
| kafscale_operator_etcd_snapshot_last_success_timestamp | Gauge | cluster | Unix timestamp of last successful snapshot upload. |
| kafscale_operator_etcd_snapshot_last_schedule_timestamp | Gauge | cluster | Unix timestamp of last scheduled snapshot job. |
| kafscale_operator_etcd_snapshot_stale | Gauge | cluster | 1 when the snapshot age exceeds the staleness threshold. |
| kafscale_operator_etcd_snapshot_success | Gauge | cluster | 1 if at least one successful snapshot was recorded. |
| kafscale_operator_etcd_snapshot_access_ok | Gauge | cluster | 1 if the snapshot bucket preflight succeeds. |

The cluster label uses namespace/name.
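
Because the cluster label carries namespace/name, alerts can either report it in annotations or filter on a single cluster. The rules below are a sketch under two assumptions: that kafscale_operator_etcd_snapshot_access_ok reports 0 when the preflight fails, and that a cluster named kafka/prod exists (a hypothetical value). The 2-hour age limit is likewise only an example.

```yaml
# Operator alert rules keyed on the cluster label (sketch).
groups:
  - name: kafscale-operator-examples
    rules:
      # Assumes the gauge reports 0 when the bucket preflight check fails.
      - alert: KafScaleSnapshotBucketAccessFailing
        expr: kafscale_operator_etcd_snapshot_access_ok == 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "KafScale snapshot bucket preflight failing"
          description: "Cluster {{ $labels.cluster }} cannot access its snapshot bucket."

      # Scoped to one cluster; "kafka/prod" is a hypothetical namespace/name value.
      - alert: KafScaleProdSnapshotTooOld
        expr: kafscale_operator_etcd_snapshot_age_seconds{cluster="kafka/prod"} > 7200
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "KafScale etcd snapshot older than 2 hours"
          description: "Last successful snapshot for {{ $labels.cluster }} is more than 2 hours old."
```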

Kubernetes ServiceMonitor

If you’re using the Prometheus Operator, create a ServiceMonitor to scrape KafScale:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kafscale-brokers
  labels:
    app: kafscale
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: kafscale
      app.kubernetes.io/component: broker
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kafscale-operator
  labels:
    app: kafscale
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: kafscale
      app.kubernetes.io/component: operator
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
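
A ServiceMonitor selects Services by label and refers to the scrape port by its name, so the brokers need a Service that carries the matching labels and a port named metrics. Your installation may already create one. The manifest below only sketches the shape the broker ServiceMonitor above expects: the Service name and the pod selector are assumptions.

```yaml
# Sketch of a Service that the kafscale-brokers ServiceMonitor would match.
apiVersion: v1
kind: Service
metadata:
  name: kafscale-brokers            # assumed name; becomes the default job label
  labels:
    app.kubernetes.io/name: kafscale
    app.kubernetes.io/component: broker
spec:
  selector:                         # assumed pod labels; match your broker pods
    app.kubernetes.io/name: kafscale
    app.kubernetes.io/component: broker
  ports:
    - name: metrics                 # must equal "port: metrics" in the ServiceMonitor
      port: 9093
      targetPort: 9093
```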

Example Prometheus alerting rules for production deployments:

groups:
  - name: kafscale
    rules:
      - alert: KafScaleS3Unhealthy
        expr: kafscale_s3_health_state{state="healthy"} != 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "KafScale S3 connection unhealthy"
          description: "Broker {{ $labels.instance }} has been in non-healthy S3 state for 2+ minutes."

      - alert: KafScaleHighProduceLatency
        expr: histogram_quantile(0.95, rate(kafscale_produce_latency_ms_bucket[5m])) > 500
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "KafScale produce latency elevated"
          description: "P95 produce latency on {{ $labels.instance }} exceeds 500ms."

      - alert: KafScaleConsumerLagHigh
        expr: kafscale_consumer_lag_max > 100000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "KafScale consumer lag high"
          description: "Consumer lag on {{ $labels.instance }} exceeds 100k messages."

      - alert: KafScaleEtcdSnapshotStale
        expr: kafscale_operator_etcd_snapshot_stale == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "KafScale etcd snapshot stale"
          description: "Cluster {{ $labels.cluster }} has not had a successful snapshot recently."

      - alert: KafScaleS3ErrorRateHigh
        expr: kafscale_s3_error_rate > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "KafScale S3 error rate elevated"
          description: "S3 error rate on {{ $labels.instance }} exceeds 5%."

      - alert: KafScaleBrokerDown
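        # Assumes the scrape job is named kafscale-brokers. With the Prometheus Operator,
        # the job label defaults to the Service name unless jobLabel is set.
        expr: up{job="kafscale-brokers"} == 0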
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "KafScale broker unreachable"
          description: "Broker {{ $labels.instance }} is not responding to scrapes."

Tune thresholds to your workload. Because KafScale writes to S3, produce latencies are inherently higher than in traditional Kafka (typically 200-400 ms), so adjust the KafScaleHighProduceLatency threshold accordingly.
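
One way to tune the latency alert is to split it into a warning tier and a critical tier. The sketch below does that with thresholds of 800 ms and 2000 ms. These numbers are purely illustrative and should be replaced with values derived from your own baseline.

```yaml
# Two-tier variant of the produce latency alert (sketch; thresholds are examples).
groups:
  - name: kafscale-latency-tuned
    rules:
      - alert: KafScaleHighProduceLatency
        expr: histogram_quantile(0.95, sum by (le, instance) (rate(kafscale_produce_latency_ms_bucket[5m]))) > 800
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "KafScale produce latency elevated"
          description: "P95 produce latency on {{ $labels.instance }} exceeds 800ms."

      - alert: KafScaleVeryHighProduceLatency
        expr: histogram_quantile(0.95, sum by (le, instance) (rate(kafscale_produce_latency_ms_bucket[5m]))) > 2000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "KafScale produce latency critical"
          description: "P95 produce latency on {{ $labels.instance }} exceeds 2s."
```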

Grafana Dashboard

The Grafana template lives in docs/grafana/broker-dashboard.json. It expects Prometheus to scrape both broker and operator metrics endpoints.

Import the dashboard via the Grafana UI, or provision it in your Grafana deployment:

# Example provisioning config
apiVersion: 1
providers:
  - name: KafScale
    folder: KafScale
    type: file
    options:
      path: /var/lib/grafana/dashboards/kafscale
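
Since the dashboard reads from Prometheus, a provisioned Grafana instance also needs a Prometheus data source. The following is a minimal sketch of a data source provisioning file. The data source name and URL are placeholders, and the dashboard JSON may expect a particular data source name, so check it before relying on this.

```yaml
# Example data source provisioning (sketch; name and URL are placeholders)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus.example.internal:9090
    isDefault: true
```

With both files in place, copy broker-dashboard.json into the directory named by options.path so that the file provider picks it up.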

Metric Coverage

Metric names and behavior evolve as the platform grows. When in doubt, consult the /metrics endpoint in your environment to see the current exported series.