Operations Guide
This guide covers day-to-day operations for KafScale clusters in production. For environment variable reference, see Runtime Settings. For metrics details, see Metrics Reference. For the admin API, see Ops API.
Prerequisites
Before operating a production cluster:
- etcd cluster (3+ nodes recommended, odd quorum)
- S3 bucket with appropriate IAM permissions
- Kubernetes cluster with the KafScale operator installed
- kubectl and helm CLI tools
Security & Hardening
RBAC
The Helm chart creates a scoped service account and RBAC role so the operator only touches its CRDs, Secrets, and Deployments inside the release namespace.
S3 credentials
Credentials live in user-managed Kubernetes secrets. The operator never writes them to etcd. Snapshot jobs map KAFSCALE_S3_ACCESS_KEY/KAFSCALE_S3_SECRET_KEY into AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY.
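A minimal sketch of such a Secret, assuming the Secret keys mirror the KAFSCALE_S3_* variable names and using the kafscale-s3-credentials name referenced by the CRD example later in this guide; values are placeholders:
```yaml
# Hypothetical user-managed Secret; key names assume a 1:1 mapping onto the
# KAFSCALE_S3_* environment variables described above.
apiVersion: v1
kind: Secret
metadata:
  name: kafscale-s3-credentials
  namespace: kafscale
type: Opaque
stringData:
  KAFSCALE_S3_ACCESS_KEY: "<access-key-id>"
  KAFSCALE_S3_SECRET_KEY: "<secret-access-key>"
```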
Console auth
The UI requires KAFSCALE_UI_USERNAME and KAFSCALE_UI_PASSWORD. In Helm, set console.auth.username and console.auth.password.
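For example, a values snippet with placeholder credentials (keep real values out of version control, for instance via --set-file or an external secrets tool):
```yaml
console:
  auth:
    username: ops-admin            # example value
    password: "<strong-password>"  # example value
```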
TLS
Terminate TLS at your ingress, load balancer, or service mesh. Broker/console TLS env flags are not wired in v1.x. See Proxy TLS via LoadBalancer for the recommended external TLS pattern.
Admin APIs
Create/Delete Topics are enabled by default. Set KAFSCALE_ALLOW_ADMIN_APIS=false on broker pods to disable them, and gate external access via mTLS, ingress auth, or network policies.
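A sketch of disabling the admin APIs through Helm, assuming the chart passes broker.env entries straight to broker pods (the same pattern used in the Debug mode example at the end of this guide):
```yaml
broker:
  env:
    KAFSCALE_ALLOW_ADMIN_APIS: "false"   # disables Create/Delete Topics on brokers
```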
Network policies
Allow the operator + brokers to reach etcd and S3 endpoints and lock everything else down.
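A minimal broker egress policy sketch. The app=kafscale-broker label appears elsewhere in this guide; the etcd label, S3 CIDR, and DNS rule are assumptions to adapt to your environment:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: kafscale-broker-egress
  namespace: kafscale
spec:
  podSelector:
    matchLabels:
      app: kafscale-broker        # broker pod label used by this guide's kubectl examples
  policyTypes:
    - Egress
  egress:
    - to:                         # etcd client port
        - podSelector:
            matchLabels:
              app: kafscale-etcd  # assumed label on etcd pods
      ports:
        - port: 2379
    - to:                         # S3/HTTPS endpoints (narrow to a VPC endpoint CIDR if possible)
        - ipBlock:
            cidr: 0.0.0.0/0
      ports:
        - port: 443
    - to:                         # cluster DNS
        - namespaceSelector: {}
      ports:
        - port: 53
          protocol: UDP
```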
ACLs (v1.5+)
Optional basic ACL enforcement is available at the broker. Identity comes from Kafka client.id until SASL is introduced.
Configuration variables:
| Variable | Description |
|---|---|
| KAFSCALE_ACL_ENABLED | Enable ACL enforcement (true/false) |
| KAFSCALE_ACL_JSON | Inline JSON ACL configuration |
| KAFSCALE_ACL_FILE | Path to ACL configuration file |
| KAFSCALE_ACL_FAIL_OPEN | Allow traffic when ACL config is missing/invalid (default: false, fail-closed) |
Principal source: Set KAFSCALE_PRINCIPAL_SOURCE to control how client identity is derived:
| Value | Description |
|---|---|
| client_id | Use Kafka client.id (default) |
| remote_addr | Use client IP address |
| proxy_addr | Use address from PROXY protocol header (requires KAFSCALE_PROXY_PROTOCOL=true) |
Example ACL configuration (Helm):
```yaml
operator:
  acl:
    enabled: true
    configJson: |
      {"default_policy":"deny","principals":[
        {"name":"analytics","allow":[{"action":"fetch","resource":"topic","name":"orders-*"}]}
      ]}
  auth:
    principalSource: "proxy_addr"
    proxyProtocol: true
```
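The same ACL can also be supplied directly to broker pods through environment variables. A sketch, assuming the broker.env pass-through shown in the Debug mode example:
```yaml
broker:
  env:
    KAFSCALE_ACL_ENABLED: "true"
    KAFSCALE_ACL_JSON: >-
      {"default_policy":"deny","principals":[
        {"name":"analytics","allow":[{"action":"fetch","resource":"topic","name":"orders-*"}]}
      ]}
```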
Auth denials: Broker logs emit a rate-limited authorization denied entry with principal/action/resource context.
PROXY protocol (v1.5+)
Use KAFSCALE_PROXY_PROTOCOL=true with KAFSCALE_PRINCIPAL_SOURCE=proxy_addr to derive principals from a trusted TCP proxy (PROXY protocol v1/v2).
Trust boundary: Only enable proxy_addr/PROXY protocol when brokers are reachable only through a trusted LB or sidecar that injects the header. Do not expose brokers directly, or clients can spoof identity.
Behavior:
- Fail-closed: When KAFSCALE_PROXY_PROTOCOL=true, brokers reject connections that do not include a valid PROXY header.
- Header limits: PROXY v1 headers are capped at 256 bytes; oversized headers are rejected.
- Health checks: PROXY v2 LOCAL connections are accepted (no identity); ensure LB health checks don’t rely on ACL-protected operations.
Health/metrics
Prometheus can scrape /metrics on brokers and operator for early detection of S3 pressure or degraded nodes. The operator exposes metrics on port 8080 and the Helm chart can create a metrics Service, ServiceMonitor, and PrometheusRule.
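If you run the Prometheus Operator but manage scrape objects yourself instead of using the chart's ServiceMonitor, a minimal sketch follows; the selector label and port name are assumptions to match against the metrics Service the chart creates:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kafscale-operator
  namespace: kafscale
spec:
  selector:
    matchLabels:
      app: kafscale-operator     # assumed label on the metrics Service
  endpoints:
    - port: metrics              # assumed port name; targets :8080/metrics
      path: /metrics
      interval: 30s
```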
Startup gating
Broker pods exit immediately if they cannot read metadata or write a probe object to S3 during startup, so Kubernetes restarts them rather than leaving a stuck listener in place.
Leader IDs
Each broker advertises a numeric NodeID in etcd. In a single-node demo you’ll always see Leader=0 in the Console’s topic detail because the only broker has ID 0. In real clusters those IDs align with the broker addresses the operator published; if you see Leader=3, look for the broker with NodeID 3 in the metadata payload.
External Broker Access
By default, brokers advertise the in-cluster service DNS name. That works for clients running inside Kubernetes, but external clients must connect to a reachable address. Configure both the broker Service exposure and the advertised address so clients learn the external endpoint from metadata responses.
See Runtime Settings — External Broker Access for all CRD fields.
Kafka Proxy (recommended for external access)
For external clients plus broker churn, deploy the Kafka-aware proxy. It answers Metadata/FindCoordinator requests with a single stable endpoint (the proxy service), then forwards all other Kafka requests to the brokers. This keeps clients connected even as broker pods scale or rotate.
Recommended settings:
- Run 2+ proxy replicas behind a LoadBalancer service
- Point the proxy at etcd via KAFSCALE_PROXY_ETCD_ENDPOINTS
- Set KAFSCALE_PROXY_ADVERTISED_HOST/KAFSCALE_PROXY_ADVERTISED_PORT to the public DNS name + port
Example (HA proxy + external access):
helm upgrade --install kafscale deploy/helm/kafscale \
--namespace kafscale --create-namespace \
--set proxy.enabled=true \
--set proxy.replicaCount=2 \
--set proxy.service.type=LoadBalancer \
--set proxy.service.port=9092 \
--set proxy.advertisedHost=kafka.example.com \
--set proxy.advertisedPort=9092 \
--set proxy.etcdEndpoints[0]=http://kafscale-etcd-client.kafscale.svc.cluster.local:2379
Direct broker exposure
Use direct broker Service settings when you intentionally expose dedicated brokers (for example, isolating traffic or pinning producers to specific nodes). This requires explicit endpoint management.
Example (GKE/AWS/Azure load balancer):
```yaml
apiVersion: kafscale.io/v1alpha1
kind: KafScaleCluster
metadata:
  name: kafscale
  namespace: kafscale
spec:
  brokers:
    advertisedHost: kafka.example.com
    advertisedPort: 9092
    service:
      type: LoadBalancer
      annotations:
        networking.gke.io/load-balancer-type: "External"
      loadBalancerSourceRanges:
        - 203.0.113.0/24
  s3:
    bucket: kafscale
    region: us-east-1
    credentialsSecretRef: kafscale-s3-credentials
  etcd:
    endpoints: []
```
Proxy TLS via LoadBalancer (recommended)
The proxy is the external Kafka entrypoint. For TLS in v1.5, terminate TLS at the cloud LoadBalancer by supplying Service annotations in the Helm values. This keeps broker traffic plaintext inside the cluster while enabling TLS for external clients.
```yaml
proxy:
  enabled: true
  service:
    type: LoadBalancer
    port: 9092
    annotations:
      # Add your cloud provider TLS annotations here (ACM / GCP / Azure, etc.)
    loadBalancerSourceRanges:
      - 203.0.113.0/24
```
If you need in-cluster ACME/Let’s Encrypt support, use a TCP-capable gateway (Traefik, etc.) with cert-manager. Keep it off by default to avoid extra operational dependencies.
AWS NLB TLS
```yaml
proxy:
  service:
    annotations:
      service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
      service.beta.kubernetes.io/aws-load-balancer-ssl-cert: "arn:aws:acm:REGION:ACCOUNT:certificate/ID"
      service.beta.kubernetes.io/aws-load-balancer-ssl-ports: "9092"
      service.beta.kubernetes.io/aws-load-balancer-backend-protocol: "tcp"
```
GCP / Azure
L4 LoadBalancers typically do TCP pass-through; TLS termination often requires a provider gateway/ingress (or a TCP-capable ingress controller). If you terminate TLS outside the Service, keep the proxy Service as plain TCP.
Note: Annotation keys vary by provider and feature. Always validate against your cloud provider docs.
TLS termination (cert-manager)
For custom certificate management without cloud provider TLS:
```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: kafscale-kafka-cert
  namespace: kafscale
spec:
  secretName: kafscale-kafka-tls
  dnsNames:
    - kafka.example.com
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
```
Monitoring
Prometheus endpoints
| Component | Endpoint | Default Port |
|---|---|---|
| Broker | http://<broker-host>:9093/metrics | 9093 |
| Operator | http://<operator-host>:8080/metrics | 8080 |
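For a plain Prometheus setup without the Operator, a static scrape config sketch using the endpoints above; the target hostnames are placeholders for your Service DNS names:
```yaml
scrape_configs:
  - job_name: kafscale-broker
    metrics_path: /metrics
    static_configs:
      - targets: ["kafscale-broker.kafscale.svc:9093"]     # placeholder host
  - job_name: kafscale-operator
    metrics_path: /metrics
    static_configs:
      - targets: ["kafscale-operator.kafscale.svc:8080"]   # placeholder host
```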
Grafana dashboards
Import the pre-built dashboards from the repository:
kubectl apply -f https://raw.githubusercontent.com/KafScale/platform/main/docs/grafana/broker-dashboard.json
kubectl apply -f https://raw.githubusercontent.com/KafScale/platform/main/docs/grafana/operator-dashboard.json
For the full metrics catalog, see Metrics.
Ops API Examples
KafScale exposes Kafka admin APIs for operator workflows. See Ops API for the full reference.
# List consumer groups
kafka-consumer-groups.sh --bootstrap-server <broker> --list
# Describe a consumer group
kafka-consumer-groups.sh --bootstrap-server <broker> --describe --group <group-id>
# Delete a consumer group
kafka-consumer-groups.sh --bootstrap-server <broker> --delete --group <group-id>
# Read topic configs
kafka-configs.sh --bootstrap-server <broker> --describe --entity-type topics --entity-name <topic>
# Increase partition count (additive only)
kafka-topics.sh --bootstrap-server <broker> --alter --topic <topic> --partitions <count>
# Update topic retention
kafka-configs.sh --bootstrap-server <broker> --alter --entity-type topics --entity-name <topic> \
--add-config retention.ms=604800000
Scaling
Horizontal scaling
Brokers are stateless and scale horizontally. No partition rebalancing required—S3 is the source of truth.
kubectl scale deployment kafscale-broker --replicas=5
HPA (CPU-based)
kubectl autoscale deployment kafscale-broker \
--cpu-percent=70 \
--min=3 \
--max=12
HPA (custom metrics)
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: kafscale-broker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: kafscale-broker
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: kafscale_produce_rps
        target:
          type: AverageValue
          averageValue: "1000"
```
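Pods-type metrics require a custom metrics adapter (for example prometheus-adapter) that exposes kafscale_produce_rps through the custom metrics API. A sketch of one adapter rule; the metric name comes from the HPA above, everything else is an assumption about your Prometheus labels:
```yaml
rules:
  - seriesQuery: 'kafscale_produce_rps{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      as: "kafscale_produce_rps"
    metricsQuery: 'avg(kafscale_produce_rps{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```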
Scaling etcd
etcd should maintain an odd number of nodes (3, 5, 7). For scaling:
- Add the new etcd member to the cluster
- Update KAFSCALE_ETCD_ENDPOINTS in broker configuration
- Perform a rolling restart of brokers
Do not scale etcd down during active traffic. Always ensure quorum (n/2 + 1 nodes).
etcd Operations
KafScale depends on etcd for metadata and offsets. Treat it as a production datastore.
Best practices
- Run a dedicated etcd cluster (do not share the Kubernetes control-plane etcd)
- Use SSD-backed disks for data and WAL volumes
- Deploy an odd number of members (3 for most clusters, 5 for higher fault tolerance)
- Spread members across zones/racks to survive single-AZ failures
- Enable compaction/defragmentation and monitor fsync/proposal latency
Operator-managed etcd
If no etcd endpoints are supplied, the operator provisions a 3-node etcd StatefulSet. Recommended settings:
- Use an SSD-capable StorageClass for the etcd PVCs
- Set a PodDisruptionBudget so only one etcd pod can be evicted at a time
- Pin etcd pods across zones with topology spread or anti-affinity
- Enable snapshot backups to a dedicated S3 bucket
- Monitor leader changes, fsync latency, and disk usage
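A PodDisruptionBudget sketch for the PDB recommendation above; the label selector is an assumption to match against the StatefulSet the operator creates:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kafscale-etcd
  namespace: kafscale
spec:
  maxUnavailable: 1              # allow at most one etcd pod to be evicted at a time
  selector:
    matchLabels:
      app: kafscale-etcd         # assumed label on the managed etcd pods
```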
Endpoint resolution order
The operator resolves etcd endpoints in this order:
1. KafScaleCluster.spec.etcd.endpoints
2. KAFSCALE_OPERATOR_ETCD_ENDPOINTS
3. Managed etcd (operator creates a 3-node StatefulSet)
Availability signals
When etcd is unavailable, brokers reject producer/admin/consumer-group operations with REQUEST_TIMED_OUT. Producers see per-partition errors in the Produce response; admin and group APIs return the same code.
Fetch requests for cached segments continue to work during etcd outages.
S3 Health Gating
Brokers monitor S3 health and reject requests when S3 is degraded.
| State | Value | Behavior |
|---|---|---|
| healthy | 0 | Normal operation |
| degraded | 1 | Elevated latency, reduced throughput |
| unavailable | 2 | Rejects produce requests, serves cached fetches only |
# Check current state
curl -s http://broker:9093/metrics | grep kafscale_s3_health_state
When S3 is unavailable:
- Produce requests return KAFKA_STORAGE_ERROR
- Fetch requests serve from cache if available
- Brokers automatically recover when S3 returns
Backup and Disaster Recovery
etcd snapshots to S3
The operator uploads etcd snapshots to a dedicated S3 bucket (separate from broker segment storage). See Runtime Settings for the full variable reference.
Key defaults:
| Variable | Default |
|---|---|
| KAFSCALE_OPERATOR_ETCD_SNAPSHOT_BUCKET | kafscale-etcd-<namespace>-<cluster> |
| KAFSCALE_OPERATOR_ETCD_SNAPSHOT_PREFIX | etcd-snapshots |
| KAFSCALE_OPERATOR_ETCD_SNAPSHOT_SCHEDULE | 0 * * * * (hourly) |
| KAFSCALE_OPERATOR_ETCD_SNAPSHOT_ETCDCTL_IMAGE | kubesphere/etcd:3.6.4-0 |
| KAFSCALE_OPERATOR_ETCD_SNAPSHOT_IMAGE | amazon/aws-cli:2.15.0 |
| KAFSCALE_OPERATOR_ETCD_SNAPSHOT_STALE_AFTER_SEC | 7200 |
The operator performs an S3 write preflight before enabling snapshots. If the check fails, the EtcdSnapshotAccess condition is set to False.
Snapshot restore (managed etcd)
When the operator manages etcd, each pod runs restore init containers before etcd starts:
- A snapshot download container pulls the latest .db snapshot
- A restore container runs etcdctl snapshot restore if the data directory is empty
- If no snapshot is available, etcd starts fresh
Consumer Offsets After Restore
Etcd restores recover committed consumer offsets. If a consumer has no committed offsets, it may start at the end and see zero records even though data exists in S3. In production:
- Ensure consumers commit offsets (default for most Kafka clients).
- Set auto.offset.reset=earliest as a safety net for new or uncommitted consumers.
Snapshot alerts
| Alert | Condition | Severity |
|---|---|---|
| KafScaleSnapshotAccessFailed | Snapshot writes failing | Critical |
| KafScaleSnapshotStale | Last snapshot > threshold | Warning |
| KafScaleSnapshotNeverSucceeded | No successful snapshots | Critical |
Manual snapshot
# Via ops API
curl -X POST http://operator:8080/api/v1/etcd/snapshot
# Via etcdctl
ETCDCTL_API=3 etcdctl snapshot save /tmp/kafscale-snapshot.db \
--endpoints=$ETCD_ENDPOINTS
Restore from snapshot
# Scale down brokers
kubectl scale deployment kafscale-broker --replicas=0
# Restore etcd
ETCDCTL_API=3 etcdctl snapshot restore /tmp/kafscale-snapshot.db \
--data-dir=/var/lib/etcd-restore
# Restart operator
kubectl rollout restart deployment kafscale-operator
# Scale up brokers
kubectl scale deployment kafscale-broker --replicas=3
S3 data durability
S3 segment data does not need backup—S3 provides 11 9’s durability. Ensure:
- S3 bucket versioning is enabled
- Cross-region replication if required for DR
- Lifecycle policies match retention requirements
Multi-Region S3 (CRR)
KafScale writes to a primary bucket and can read from a replica bucket in the broker’s region. With S3 Cross-Region Replication (CRR), objects are asynchronously copied to replica buckets. Brokers read from the local replica and fall back to primary on miss.
Setup
- Create buckets in each region and enable versioning (required for CRR)
- Configure CRR rules from primary to each replica
- Update cluster spec:
```yaml
spec:
  s3:
    bucket: kafscale-prod-us-east-1
    region: us-east-1
    readBucket: kafscale-prod-eu-west-1
    readRegion: eu-west-1
```
IAM for read replicas
- Primary bucket: Allow PutObject, DeleteObject, ListBucket from the writer cluster
- Replica buckets: Allow only GetObject and ListBucket; deny writes
Verify CRR
# Write test object
aws s3 cp test.txt s3://kafscale-prod-us-east-1/crr-test/
# Check replication status
aws s3api head-object \
--bucket kafscale-prod-us-east-1 \
--key crr-test/test.txt \
--query 'ReplicationStatus'
# Confirm in replica
aws s3 ls s3://kafscale-prod-eu-west-1/crr-test/
Monitor CRR
| Metric | Alert threshold |
|---|---|
| kafscale_s3_replica_fallback_total | Rate > 0.1/s for 10m |
| kafscale_s3_read_latency_ms | p99 > 200ms |
| kafscale_s3_replica_miss_ratio | > 5% sustained |
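A PrometheusRule sketch for the first threshold above; the alert name, labels, and annotation text are illustrative:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kafscale-crr
  namespace: kafscale
spec:
  groups:
    - name: kafscale-crr
      rules:
        - alert: KafScaleReplicaFallbackHigh      # illustrative name
          expr: rate(kafscale_s3_replica_fallback_total[5m]) > 0.1
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: Brokers are frequently falling back to the primary S3 bucket
```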
Upgrades
Helm upgrade
helm upgrade kafscale KafScale/platform \
--namespace kafscale \
--set broker.image.tag=v1.5.0 \
--set operator.image.tag=v1.5.0
kubectl rollout status deployment/kafscale-broker -n kafscale
The operator drains brokers through the gRPC control plane before restarting pods.
Rolling restart
kubectl rollout restart deployment kafscale-broker -n kafscale
Rollback
helm history kafscale -n kafscale
helm rollback kafscale -n kafscale
Capacity & Cost
S3 cost estimation
Assumptions: 100 GB/day ingestion, 7-day retention, 4 MB segments, 3 brokers
| Item | Calculation | Cost |
|---|---|---|
| Storage | 700 GB x $0.023/GB | $16.10 |
| PUT requests | 25,000/day x 30 x $0.005/1000 | $3.75 |
| GET requests | 100,000/day x 30 x $0.0004/1000 | $1.20 |
| Data transfer (in-region) | Free | $0 |
| Total S3 | | ~$21/month |
Durability vs cost tradeoff
See Runtime Settings — Durability Settings for the KAFSCALE_PRODUCE_SYNC_FLUSH tradeoff between durability and S3 write costs.
Troubleshooting
Brokers not starting
kubectl get pods -l app=kafscale-broker
kubectl logs -l app=kafscale-broker --tail=100
Common causes:
- etcd endpoints unreachable
- S3 credentials invalid
- Insufficient memory/CPU
High produce latency
- Check the kafscale_s3_latency_ms_avg metric
- Verify the S3 bucket is in the same region as the brokers
- Check broker CPU/memory utilization
- Consider increasing KAFSCALE_SEGMENT_BYTES for larger batches
Consumer group rebalancing
- Check kafscale_consumer_group_members for instability
- Verify the consumer session.timeout.ms is appropriate
- Check network connectivity
etcd connection errors
ETCDCTL_API=3 etcdctl endpoint health --endpoints=$ETCD_ENDPOINTS
kubectl exec -it kafscale-broker-0 -- env | grep ETCD
ACL denials (v1.5+)
If clients are unexpectedly rejected:
- Check broker logs for authorization denied entries
- Verify KAFSCALE_PRINCIPAL_SOURCE matches your identity strategy
- Confirm the ACL config is valid JSON (invalid config = fail-closed by default)
- For debugging, temporarily set KAFSCALE_ACL_FAIL_OPEN=true
PROXY protocol issues (v1.5+)
If connections fail with PROXY protocol enabled:
- Ensure load balancer is configured to inject PROXY headers
- Check that brokers are not directly exposed (trust boundary violation)
- Verify PROXY v1 headers don’t exceed 256 bytes
- For health checks, use PROXY v2 LOCAL connections or exclude ACL-protected operations
Debug mode
helm upgrade kafscale KafScale/platform \
--set broker.env.KAFSCALE_LOG_LEVEL=debug \
--set broker.env.KAFSCALE_TRACE_KAFKA=true
For the complete environment variable reference, see Runtime Settings.