Development Guide
This document tracks the steps needed to work on Kafscale. It complements the architecture spec in kafscale-spec.md.
Quickstart (Local)
```sh
make build
make test
```
Prerequisites
- Go 1.22+ (the module currently targets Go 1.25)
- [buf](https://buf.build/docs/installation/) for protobuf builds
- `protoc` plus the `protoc-gen-go` and `protoc-gen-go-grpc` plugins (installed automatically by `buf` if you use the managed mode; a manual install sketch follows this list)
- Docker + Kubernetes CLI tools if you plan to iterate on the operator
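If you skip buf's managed mode, the two Go plugins can be installed manually. A minimal sketch using the standard upstream module paths:

```sh
# Manual plugin install; unnecessary when buf's managed mode is enabled.
go install google.golang.org/protobuf/cmd/protoc-gen-go@latest
go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@latest
```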
Repository Layout
- `cmd/broker`, `cmd/operator`: binary entry points
- `pkg/`: Go libraries (protocol, storage, broker, operator)
- `proto/`: protobuf definitions for metadata and internal control plane APIs
- `pkg/gen/`: auto-generated protobuf + gRPC Go code (ignored until `buf generate` runs)
- `docs/`: specs and this guide
- `test/`: integration + e2e suites
- `docs/storage.md`: deeper design notes for the storage subsystem, including S3 client expectations
Refer to kafscale-spec.md for the detailed package-by-package breakdown.
Common Commands
| Command | Purpose |
|---|---|
| `make build` | Compile all Go binaries. |
| `make test` | Run unit tests (includes `go vet` and race detector). |
| `make test-produce-consume` | MinIO + Franz produce/consume e2e suite. |
| `make test-consumer-group` | Consumer group persistence e2e (embedded etcd + memory S3). |
| `make test-ops-api` | Ops/admin API e2e (embedded etcd + memory S3). |
| `make test-multi-segment-durability` | Multi-segment restart durability e2e (embedded etcd + MinIO). |
| `make test-full` | Unit tests plus local e2e suites. |
| `make test-operator` | Operator envtest + optional kind-based integration. |
| `make demo` | Local demo with broker + console + embedded etcd. |
| `make demo-platform` | Kind-based demo (operator HA + managed etcd + console). |
| `make docker-build` | Build broker/operator/console images locally. |
| `make docker-clean` | Remove dev images + Docker caches. |
| `make stop-containers` | Stop leftover MinIO/kind containers. |
| `make tidy` | Clean `go.mod`/`go.sum`. |
| `make lint` | Run `golangci-lint` (requires installation). |
| `make help` | List all Makefile targets. |
Generating Protobuf Code
We use buf to manage protobuf builds. All metadata schemas and control-plane RPCs live under proto/.
```sh
brew install buf   # or equivalent
make proto         # runs `buf generate`
```
The generated Go code goes into `pkg/gen/{metadata,control}`. Do not edit generated files manually; re-run `make proto` whenever the `.proto` sources change.
Release Workflow
We publish container images and GitHub releases from tags. This keeps release artifacts reproducible and aligned with the Helm chart.
- Tag format: `vX.Y.Z` for stable releases, `vX.Y.Z-rc.N` or `vX.Y.Z-dev.N` for prereleases (see the example after this list).
- Tag push triggers the Docker workflow to build and push `kafscale-broker`, `kafscale-operator`, and `kafscale-console` images to GHCR.
- The workflow also creates a GitHub release with autogenerated notes.
- The Helm chart defaults image tags to `appVersion`, so bump the `deploy/helm/kafscale/Chart.yaml` `version` and `appVersion` for each release. Users can override `operator.image.tag`, `console.image.tag`, and `operator.brokerImage.tag` to pin a specific version, or set `operator.image.useLatest=true`, `console.image.useLatest=true`, and `operator.brokerImage.useLatest=true` for dev/latest installs.
- Release notes live in `docs/releases/` and should include a human-readable summary plus a “Security fixes” section listing any known CVEs addressed (or “None”).
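For example, cutting a release candidate looks roughly like this (the tag name is illustrative):

```sh
# Pushing a tag triggers the Docker workflow and the GitHub release.
git tag v0.3.0-rc.1
git push origin v0.3.0-rc.1
```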
Testing Expectations
Pull requests must include strict test coverage for the changes they introduce. At a minimum:
- Add or extend unit tests for all non-trivial logic.
- Run the relevant e2e suite(s); changes to broker behavior should run `make test-produce-consume` and any related e2e tests.
- Extend e2e coverage when you fix bugs so regressions are caught earlier.
To add license headers to new files, run `python3 hack/license_headers.py`.
Test Workflows (Details)
Local MinIO / S3 setup
`make test-produce-consume` assumes an S3 endpoint is available for the broker to write to, so we keep a local MinIO container (`kafscale-minio`) running to exercise a production-like S3 stack.
Default MinIO settings (used when KAFSCALE_USE_MEMORY_S3=1 is not set):
| Setting | Value |
|---|---|
| Endpoint | http://127.0.0.1:9000 |
| Bucket | kafscale |
| Region | us-east-1 |
| Addressing | Path-style |
To point at a different S3-compatible endpoint, set the `KAFSCALE_S3_*` variables listed under Environment Variables. To skip MinIO entirely, set `KAFSCALE_USE_MEMORY_S3=1`; the broker then uses the in-memory S3 client for faster, deterministic runs.
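For instance, a run against a non-default endpoint might look like the sketch below; every value is a placeholder, and the boolean format for `KAFSCALE_S3_PATH_STYLE` is an assumption:

```sh
# Point the e2e suite at another S3-compatible endpoint (placeholder values).
export KAFSCALE_S3_ENDPOINT=http://127.0.0.1:9000
export KAFSCALE_S3_BUCKET=kafscale
export KAFSCALE_S3_REGION=us-east-1
export KAFSCALE_S3_PATH_STYLE=true   # value format is an assumption
make test-produce-consume

# Or skip external S3 entirely:
KAFSCALE_USE_MEMORY_S3=1 make test-produce-consume
```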
Related targets:
- `make test-produce-consume` runs the MinIO-backed produce/consume suite.
- `make test-produce-consume-debug` adds Kafka trace logging (`KAFSCALE_LOG_LEVEL=debug`, `KAFSCALE_TRACE_KAFKA=true`).
- `make test-consumer-group` and `make test-ops-api` use embedded etcd + in-memory S3.
- `make test-multi-segment-durability` uses MinIO and restarts the broker across multiple segment flushes.
- `make stop-containers` stops leftover MinIO/kind helper containers before re-running tests.
Demo workflow
Need an interactive run? make demo boots embedded etcd plus the broker + console, opens http://127.0.0.1:48080/ui/, and keeps everything running until you hit Ctrl+C. It is the quickest way to click around the UI while real messages flow through the broker/MinIO stack.
The demo wires KAFSCALE_CONSOLE_BROKER_METRICS_URL=http://127.0.0.1:39093/metrics so the console scrapes broker Prometheus metrics and populates the S3/metrics cards with live data. Any other process starting cmd/console can set the same env var (for example, go run ./cmd/console) to render broker-reported S3 state/latency instead of the mock placeholders.
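A minimal standalone invocation, reusing the demo's default metrics port:

```sh
# Console scrapes the broker's Prometheus endpoint instead of mock placeholders.
KAFSCALE_CONSOLE_BROKER_METRICS_URL=http://127.0.0.1:39093/metrics go run ./cmd/console
```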
The broker exports live throughput gauges (kafscale_produce_rps / kafscale_fetch_rps) so the UI can show messages-per-second alongside S3 latency. The sliding window defaults to 60 seconds; override it with KAFSCALE_THROUGHPUT_WINDOW_SEC before starting the broker if you want a shorter (spikier) or longer (smoother) view.
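For example, assuming `make demo` passes the environment through to the broker:

```sh
# A 15-second window gives a spikier view; the value is illustrative.
KAFSCALE_THROUGHPUT_WINDOW_SEC=15 make demo
```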
Operator envtest (no kind)
TestOperatorManagedEtcdResources uses controller-runtime envtest to validate operator reconciliation without spinning up kind. Install envtest assets with setup-envtest or set KUBEBUILDER_ASSETS to a directory containing kube-apiserver and etcd binaries.
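A typical setup sketched with `setup-envtest` (the Kubernetes version is an example):

```sh
# Download envtest binaries (kube-apiserver + etcd) and expose them to the tests.
go install sigs.k8s.io/controller-runtime/tools/setup-envtest@latest
export KUBEBUILDER_ASSETS="$(setup-envtest use 1.29.x -p path)"
make test-operator
```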
Broker logging levels
The broker reads KAFSCALE_LOG_LEVEL at start-up. If the variable is unset we operate in warning-and-above mode, which keeps regular e2e/test runs quiet. Set KAFSCALE_LOG_LEVEL=info or debug (optionally together with KAFSCALE_TRACE_KAFKA=true) when you need additional visibility; the test-produce-consume-debug target wires those env vars up for you.
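The manual equivalent of the debug target:

```sh
# Same env vars that make test-produce-consume-debug wires up.
KAFSCALE_LOG_LEVEL=debug KAFSCALE_TRACE_KAFKA=true make test-produce-consume
```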
Environment Variables
Test and E2E
- `KAFSCALE_E2E` – Enable e2e tests.
- `KAFSCALE_E2E_KIND` – Enable kind-based e2e.
- `KAFSCALE_KIND_CLUSTER` – Kind cluster name.
- `KAFSCALE_KIND_RECREATE` – Force kind cluster recreation.
- `KAFSCALE_E2E_DEMO` – Run demo stack test.
- `KAFSCALE_E2E_OPEN_UI` – Open UI during demo test.
- `KAFSCALE_E2E_DEBUG` – Enable debug output in e2e tests.
- `KAFSCALE_BROKER_IMAGE`, `KAFSCALE_OPERATOR_IMAGE`, `KAFSCALE_CONSOLE_IMAGE` – Image overrides for kind e2e.
- `KAFSCALE_LOCAL_FRANZ` – Use local franz-go build in tests.
Demo workload tuning
- `KAFSCALE_DEMO_BROKER_ADDR`
- `KAFSCALE_DEMO_TOPICS`
- `KAFSCALE_DEMO_MESSAGES_PER_SEC`
- `KAFSCALE_DEMO_GROUP`
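An illustrative invocation; the value formats are assumptions, so check the demo source for specifics:

```sh
# Placeholder values for tuning the demo workload.
export KAFSCALE_DEMO_MESSAGES_PER_SEC=200
export KAFSCALE_DEMO_GROUP=demo-readers
make demo
```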
Broker / S3
- `KAFSCALE_S3_BUCKET`
- `KAFSCALE_S3_REGION`
- `KAFSCALE_S3_NAMESPACE`
- `KAFSCALE_S3_ENDPOINT`
- `KAFSCALE_S3_PATH_STYLE`
- `KAFSCALE_S3_KMS_ARN`
- `KAFSCALE_S3_ACCESS_KEY`
- `KAFSCALE_S3_SECRET_KEY`
- `KAFSCALE_S3_SESSION_TOKEN`
- `KAFSCALE_USE_MEMORY_S3`
- `KAFSCALE_THROUGHPUT_WINDOW_SEC`
Broker logging
- `KAFSCALE_LOG_LEVEL`
- `KAFSCALE_TRACE_KAFKA`
Operator / etcd
- `KAFSCALE_ETCD_ENDPOINTS`
- `KAFSCALE_ETCD_USERNAME`
- `KAFSCALE_ETCD_PASSWORD`
Kafka Compatibility Tracking
To stay Kafka-compatible we track every protocol key + version that upstream exposes. Upstream Kafka 3.7.0 currently advertises the following highest ApiVersions (see kafka-protocol docs):
| API Key | Name | Kafka 3.7 Version | Kafscale Status |
|---|---|---|---|
| 0 | Produce | 9 | ✅ Implemented |
| 1 | Fetch | 13 | ✅ Implemented |
| 2 | ListOffsets | 7 | ✅ Implemented (v0 only) |
| 3 | Metadata | 12 | ✅ Implemented |
| 4 | LeaderAndIsr | 5 | ❌ Not needed (internal) |
| 5 | StopReplica | 3 | ❌ Not needed (internal) |
| 6 | UpdateMetadata | 7 | ❌ Not needed (internal) |
| 7 | ControlledShutdown | 3 | ❌ Replaced by Kubernetes rollouts |
| 8 | OffsetCommit | 3 | ✅ Implemented |
| 9 | OffsetFetch | 5 | ✅ Implemented |
| 10 | FindCoordinator | 3 | ✅ Implemented |
| 11 | JoinGroup | 4 | ✅ Implemented |
| 12 | Heartbeat | 4 | ✅ Implemented |
| 13 | LeaveGroup | 4 | ✅ Implemented |
| 14 | SyncGroup | 4 | ✅ Implemented |
| 15 | DescribeGroups | 5 | ✅ Implemented |
| 16 | ListGroups | 5 | ✅ Implemented |
| 17 | SaslHandshake | 1 | ❌ Authentication not in scope yet |
| 18 | ApiVersions | 3 | ✅ Implemented (v0 only) |
| 19 | CreateTopics | 7 | ✅ Implemented (v0 only) |
| 20 | DeleteTopics | 6 | ✅ Implemented (v0 only) |
| 21 | DeleteRecords | 2 | ❌ Rely on S3 lifecycle |
| 22 | InitProducerId | 4 | ❌ Transactions out of scope |
| 23 | OffsetForLeaderEpoch | 3 | ✅ Implemented |
| 24 | AddPartitionsToTxn | 3 | ❌ Transactions out of scope |
| 25 | AddOffsetsToTxn | 3 | ❌ Transactions out of scope |
| 26 | EndTxn | 3 | ❌ Transactions out of scope |
| 27 | WriteTxnMarkers | 0 | ❌ Transactions out of scope |
| 28 | TxnOffsetCommit | 3 | ❌ Transactions out of scope |
| 29 | DescribeAcls | 1 | ❌ Auth not in v1 |
| 30 | CreateAcls | 1 | ❌ Auth not in v1 |
| 31 | DeleteAcls | 1 | ❌ Auth not in v1 |
| 32 | DescribeConfigs | 4 | ✅ Implemented |
| 33 | AlterConfigs | 1 | ✅ Implemented |
| 34 | AlterReplicaLogDirs | 1 | ❌ Not relevant (S3 backed) |
| 35 | DescribeLogDirs | 1 | ❌ Not relevant (S3 backed) |
| 36 | SaslAuthenticate | 2 | ❌ Auth not in v1 |
| 37 | CreatePartitions | 0-3 | ✅ Implemented |
| 38 | CreateDelegationToken | 2 | ❌ Auth not in v1 |
| 39 | RenewDelegationToken | 2 | ❌ Auth not in v1 |
| 40 | ExpireDelegationToken | 2 | ❌ Auth not in v1 |
| 41 | DescribeDelegationToken | 2 | ❌ Auth not in v1 |
| 42 | DeleteGroups | 0-2 | ✅ Implemented |
We revisit this table each milestone. Anything marked ❌, or implemented only for limited versions, has a pointer in the spec backlog so we can track when to bring it online (e.g., DescribeGroups/ListGroups for Kafka UI parity, OffsetForLeaderEpoch for catch-up tooling).
Coding Standards
- Keep all new code documented in `kafscale-spec.md` or cross-link back to the spec.
- Favor context-rich structured logging (zerolog) and Prometheus metrics.
- Protobufs should remain backward compatible; prefer adding optional fields over rewriting existing ones.
- No stream processing primitives in the broker; hand those workloads off to Flink/Wayang or equivalent engines.
- Every change must land with unit tests, smoke/integration coverage, and regression tests where appropriate; skipping tests requires an explicit TODO anchored to a tracking issue.
- Secrets live only in Kubernetes; never write S3 or etcd credentials into source control or etcd. Reference them via `credentialsSecretRef` and let the operator project them at runtime.
- When testing against etcd locally, set `KAFSCALE_ETCD_ENDPOINTS` (comma-separated), plus `KAFSCALE_ETCD_USERNAME`/`KAFSCALE_ETCD_PASSWORD` if auth is enabled (see the sketch below). The broker will fall back to the in-memory store when those vars are absent.
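A sketch of that local etcd setup (endpoint and credentials are placeholders):

```sh
# The broker falls back to the in-memory store when these are unset.
export KAFSCALE_ETCD_ENDPOINTS=127.0.0.1:2379
export KAFSCALE_ETCD_USERNAME=kafscale    # only when etcd auth is enabled
export KAFSCALE_ETCD_PASSWORD=changeme    # only when etcd auth is enabled
make test
```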