
Development Guide

This document describes the steps needed to work on Kafscale. It complements the architecture spec in kafscale-spec.md.

Quickstart (Local)

make build
make test

Prerequisites

  • Go 1.25+ (the module currently targets Go 1.25)
  • buf (https://buf.build/docs/installation/) for protobuf builds
  • protoc plus the protoc-gen-go and protoc-gen-go-grpc plugins (buf can run these as remote plugins, so local installs may not be required)
  • Docker + Kubernetes CLI tools if you plan to iterate on the operator
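
A quick way to sanity-check the toolchain before building (reported versions will vary):

go version                # should report go1.25 or newer
buf --version
protoc --version
docker version
kubectl version --client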

Repository Layout

  • cmd/broker, cmd/operator: binary entry points
  • pkg/: Go libraries (protocol, storage, broker, operator)
  • proto/: protobuf definitions for metadata and internal control plane APIs
  • pkg/gen/: auto-generated protobuf + gRPC Go code (ignored until buf generate runs)
  • docs/: specs and this guide
  • test/: integration + e2e suites
  • docs/storage.md: deeper design notes for the storage subsystem, including S3 client expectations

Refer to kafscale-spec.md for the detailed package-by-package breakdown.

Common Commands

Command Purpose
make build Compile all Go binaries.
make test Run unit tests (includes go vet and race detector).
make test-produce-consume MinIO + franz-go produce/consume e2e suite.
make test-consumer-group Consumer group persistence e2e (embedded etcd + memory S3).
make test-ops-api Ops/admin API e2e (embedded etcd + memory S3).
make test-multi-segment-durability Multi-segment restart durability e2e (embedded etcd + MinIO).
make test-full Unit tests plus local e2e suites.
make test-operator Operator envtest + optional kind-based integration.
make demo Local demo with broker + console + embedded etcd.
make demo-platform Kind-based demo (operator HA + managed etcd + console).
make docker-build Build broker/operator/console images locally.
make docker-clean Remove dev images + Docker caches.
make stop-containers Stop leftover MinIO/kind containers.
make tidy Tidy go.mod/go.sum.
make lint Run golangci-lint (requires installation).
make help List all Makefile targets.

Generating Protobuf Code

We use buf to manage protobuf builds. All metadata schemas and control-plane RPCs live under proto/.

brew install buf      # or equivalent
make proto            # runs `buf generate`

The generated Go code goes into pkg/gen/{metadata,control}. Do not edit generated files manually—re-run make proto whenever the .proto sources change.
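
After regenerating, it is worth confirming the generated packages still compile and reviewing what changed; a minimal check, assuming the pkg/gen layout above:

make proto
go build ./pkg/gen/...    # confirm the regenerated code compiles
git diff --stat pkg/gen   # review the regenerated files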

Release Workflow

We publish container images and GitHub releases from tags. This keeps release artifacts reproducible and aligned with the Helm chart.

  • Tag format: vX.Y.Z for stable releases, vX.Y.Z-rc.N or vX.Y.Z-dev.N for prereleases (see the example after this list).
  • Tag push triggers the Docker workflow to build and push kafscale-broker, kafscale-operator, and kafscale-console images to GHCR.
  • The workflow also creates a GitHub release with autogenerated notes.
  • The Helm chart defaults image tags to appVersion, so bump deploy/helm/kafscale/Chart.yaml version and appVersion for each release. Users can override operator.image.tag, console.image.tag, and operator.brokerImage.tag to pin a specific version, or set operator.image.useLatest=true, console.image.useLatest=true, and operator.brokerImage.useLatest=true for dev/latest installs.
  • Release notes live in docs/releases/ and should include a human-readable summary plus a “Security fixes” section listing any known CVEs addressed (or “None”).
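
A typical release flow, using a hypothetical version number:

git tag v0.4.0              # or v0.4.0-rc.1 / v0.4.0-dev.1 for a prerelease
git push origin v0.4.0      # triggers the Docker workflow and GitHub release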

Testing Expectations

Pull requests must include thorough test coverage for the changes they introduce. At a minimum:

  • Add or extend unit tests for all non-trivial logic.
  • Run the relevant e2e suite(s); changes to broker behavior should run make test-produce-consume and any related e2e tests.
  • Extend e2e coverage when you fix bugs so regressions are caught earlier.

To add license headers to new files, run python3 hack/license_headers.py.

Test Workflows (Details)

Local MinIO / S3 setup

make test-produce-consume assumes the broker can reach an S3 endpoint, so we keep a local MinIO container (kafscale-minio) running to exercise a production-like S3 stack.

Default MinIO settings (used when KAFSCALE_USE_MEMORY_S3=1 is not set):

Setting Value
Endpoint http://127.0.0.1:9000
Bucket kafscale
Region us-east-1
Addressing Path-style

To point at a different S3-compatible endpoint, set the KAFSCALE_S3_* variables listed under Environment Variables. To skip MinIO entirely, set KAFSCALE_USE_MEMORY_S3=1; the broker then uses the in-memory S3 client for faster, deterministic runs.
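
For example, to run the suite with the default MinIO settings made explicit (the value format for KAFSCALE_S3_PATH_STYLE is an assumption; check the broker's config parsing):

export KAFSCALE_S3_ENDPOINT=http://127.0.0.1:9000
export KAFSCALE_S3_BUCKET=kafscale
export KAFSCALE_S3_REGION=us-east-1
export KAFSCALE_S3_PATH_STYLE=true
make test-produce-consume

KAFSCALE_USE_MEMORY_S3=1 make test-produce-consume   # or skip MinIO entirely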

Related targets:

  • make test-produce-consume runs the MinIO-backed produce/consume suite.
  • make test-produce-consume-debug adds Kafka trace logging (KAFSCALE_LOG_LEVEL=debug, KAFSCALE_TRACE_KAFKA=true).
  • make test-consumer-group and make test-ops-api use embedded etcd + in-memory S3.
  • make test-multi-segment-durability uses MinIO and restarts the broker across multiple segment flushes.
  • make stop-containers stops leftover MinIO/kind helper containers before re-running tests.

Demo workflow

Need an interactive run? make demo boots embedded etcd plus the broker + console, opens http://127.0.0.1:48080/ui/, and keeps everything running until you hit Ctrl+C. It is the quickest way to click around the UI while real messages flow through the broker/MinIO stack.

The demo wires KAFSCALE_CONSOLE_BROKER_METRICS_URL=http://127.0.0.1:39093/metrics so the console scrapes broker Prometheus metrics and populates the S3/metrics cards with live data. Any other process starting cmd/console can set the same env var (for example, go run ./cmd/console) to render broker-reported S3 state/latency instead of the mock placeholders.
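
For example, running the console standalone against an already-running local broker:

KAFSCALE_CONSOLE_BROKER_METRICS_URL=http://127.0.0.1:39093/metrics go run ./cmd/console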

The broker exports live throughput gauges (kafscale_produce_rps / kafscale_fetch_rps) so the UI can show messages-per-second alongside S3 latency. The sliding window defaults to 60 seconds; override it with KAFSCALE_THROUGHPUT_WINDOW_SEC before starting the broker if you want a shorter (spikier) or longer (smoother) view.
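
For a spikier view during demos, shorten the window when launching the broker (assuming the broker's other required env vars are already set):

KAFSCALE_THROUGHPUT_WINDOW_SEC=10 go run ./cmd/broker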

Operator envtest (no kind)

TestOperatorManagedEtcdResources uses controller-runtime envtest to validate operator reconciliation without spinning up kind. Install envtest assets with setup-envtest or set KUBEBUILDER_ASSETS to a directory containing kube-apiserver and etcd binaries.
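
One way to install the assets and run the suite (setup-envtest picks a recent Kubernetes version if none is specified):

go install sigs.k8s.io/controller-runtime/tools/setup-envtest@latest
export KUBEBUILDER_ASSETS="$(setup-envtest use -p path)"
make test-operator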

Broker logging levels

The broker reads KAFSCALE_LOG_LEVEL at start-up. If the variable is unset we operate in warning-and-above mode, which keeps regular e2e/test runs quiet. Set KAFSCALE_LOG_LEVEL=info or debug (optionally together with KAFSCALE_TRACE_KAFKA=true) when you need additional visibility; the test-produce-consume-debug target wires those env vars up for you.
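
Manually, that is roughly equivalent to:

KAFSCALE_LOG_LEVEL=debug KAFSCALE_TRACE_KAFKA=true make test-produce-consume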

Environment Variables

Test and E2E

  • KAFSCALE_E2E – Enable e2e tests.
  • KAFSCALE_E2E_KIND – Enable kind-based e2e.
  • KAFSCALE_KIND_CLUSTER – Kind cluster name.
  • KAFSCALE_KIND_RECREATE – Force kind cluster recreation.
  • KAFSCALE_E2E_DEMO – Run demo stack test.
  • KAFSCALE_E2E_OPEN_UI – Open UI during demo test.
  • KAFSCALE_E2E_DEBUG – Enable debug output in e2e tests.
  • KAFSCALE_BROKER_IMAGE, KAFSCALE_OPERATOR_IMAGE, KAFSCALE_CONSOLE_IMAGE – Image overrides for kind e2e.
  • KAFSCALE_LOCAL_FRANZ – Use local franz-go build in tests.
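
A hypothetical direct invocation combining several of these (the Makefile targets remain the supported entry points):

KAFSCALE_E2E=1 KAFSCALE_E2E_KIND=1 KAFSCALE_KIND_CLUSTER=kafscale-e2e go test ./test/...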

Demo workload tuning

  • KAFSCALE_DEMO_BROKER_ADDR
  • KAFSCALE_DEMO_TOPICS
  • KAFSCALE_DEMO_MESSAGES_PER_SEC
  • KAFSCALE_DEMO_GROUP
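
For example, a heavier demo run (topic names and value formats are illustrative; comma-separated topics is an assumption):

KAFSCALE_DEMO_TOPICS=orders,payments KAFSCALE_DEMO_MESSAGES_PER_SEC=50 make demo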

Broker / S3

  • KAFSCALE_S3_BUCKET
  • KAFSCALE_S3_REGION
  • KAFSCALE_S3_NAMESPACE
  • KAFSCALE_S3_ENDPOINT
  • KAFSCALE_S3_PATH_STYLE
  • KAFSCALE_S3_KMS_ARN
  • KAFSCALE_S3_ACCESS_KEY
  • KAFSCALE_S3_SECRET_KEY
  • KAFSCALE_S3_SESSION_TOKEN
  • KAFSCALE_USE_MEMORY_S3
  • KAFSCALE_THROUGHPUT_WINDOW_SEC

Broker logging

  • KAFSCALE_LOG_LEVEL
  • KAFSCALE_TRACE_KAFKA

Operator / etcd

  • KAFSCALE_ETCD_ENDPOINTS
  • KAFSCALE_ETCD_USERNAME
  • KAFSCALE_ETCD_PASSWORD

Kafka Compatibility Tracking

To stay Kafka-compatible we track every protocol key + version that upstream exposes. Upstream Kafka 3.7.0 currently advertises the following highest ApiVersions (see kafka-protocol docs):

API Key Name Kafka 3.7 Version Kafscale Status
0 Produce 9 ✅ Implemented
1 Fetch 13 ✅ Implemented
2 ListOffsets 7 ✅ Implemented (v0 only)
3 Metadata 12 ✅ Implemented
4 LeaderAndIsr 5 ❌ Not needed (internal)
5 StopReplica 3 ❌ Not needed (internal)
6 UpdateMetadata 7 ❌ Not needed (internal)
7 ControlledShutdown 3 ❌ Replaced by Kubernetes rollouts
8 OffsetCommit 3 ✅ Implemented
9 OffsetFetch 5 ✅ Implemented
10 FindCoordinator 3 ✅ Implemented
11 JoinGroup 4 ✅ Implemented
12 Heartbeat 4 ✅ Implemented
13 LeaveGroup 4 ✅ Implemented
14 SyncGroup 4 ✅ Implemented
15 DescribeGroups 5 ✅ Implemented
16 ListGroups 5 ✅ Implemented
17 SaslHandshake 1 ❌ Authentication not in scope yet
18 ApiVersions 3 ✅ Implemented (v0 only)
19 CreateTopics 7 ✅ Implemented (v0 only)
20 DeleteTopics 6 ✅ Implemented (v0 only)
21 DeleteRecords 2 ❌ Rely on S3 lifecycle
22 InitProducerId 4 ❌ Transactions out of scope
23 OffsetForLeaderEpoch 3 ✅ Implemented
24 AddPartitionsToTxn 3 ❌ Transactions out of scope
25 AddOffsetsToTxn 3 ❌ Transactions out of scope
26 EndTxn 3 ❌ Transactions out of scope
27 WriteTxnMarkers 0 ❌ Transactions out of scope
28 TxnOffsetCommit 3 ❌ Transactions out of scope
29 DescribeAcls 1 ❌ Auth not in v1
30 CreateAcls 1 ❌ Auth not in v1
31 DeleteAcls 1 ❌ Auth not in v1
32 DescribeConfigs 4 ✅ Implemented
33 AlterConfigs 1 ✅ Implemented
34 AlterReplicaLogDirs 1 ❌ Not relevant (S3 backed)
35 DescribeLogDirs 1 ❌ Not relevant (S3 backed)
36 SaslAuthenticate 2 ❌ Auth not in v1
37 CreatePartitions 0-3 ✅ Implemented
38 CreateDelegationToken 2 ❌ Auth not in v1
39 RenewDelegationToken 2 ❌ Auth not in v1
40 ExpireDelegationToken 2 ❌ Auth not in v1
41 DescribeDelegationToken 2 ❌ Auth not in v1
42 DeleteGroups 0-2 ✅ Implemented

We revisit this table each milestone. Anything marked ❌ has a pointer in the spec backlog so we can track when to bring it online (e.g., SASL authentication, transactions).
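
A quick external smoke check of the implemented surface is to point a standard Kafka CLI at a running broker; for example with kcat, assuming the broker listens on the conventional Kafka port (adjust the address to your setup):

kcat -b 127.0.0.1:9092 -L    # exercises Metadata to list brokers, topics, partitions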

Coding Standards

  • Document all new code in kafscale-spec.md or cross-link back to the spec.
  • Favor context-rich structured logging (zerolog) and Prometheus metrics.
  • Protobufs should remain backward compatible; prefer adding optional fields over rewriting existing ones.
  • No stream processing primitives in the broker; hand those workloads off to Flink/Wayang or equivalent engines.
  • Every change must land with unit tests, smoke/integration coverage, and regression tests where appropriate; skipping tests requires an explicit TODO anchored to a tracking issue.
  • Secrets live only in Kubernetes; never write S3 or etcd credentials into source control or etcd. Reference them via credentialsSecretRef and let the operator project them at runtime.
  • When testing against etcd locally, set KAFSCALE_ETCD_ENDPOINTS (comma-separated), plus KAFSCALE_ETCD_USERNAME / KAFSCALE_ETCD_PASSWORD if auth is enabled. The broker will fall back to the in-memory store when those vars are absent.
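
A minimal local setup sketch (username and password values are placeholders; keep real credentials out of source control):

export KAFSCALE_ETCD_ENDPOINTS=127.0.0.1:2379    # comma-separated for multiple endpoints
export KAFSCALE_ETCD_USERNAME=kafscale           # only if etcd auth is enabled
export KAFSCALE_ETCD_PASSWORD=change-me          # placeholder; never commit real secrets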