Storage Format

KafScale stores all message data in S3 as immutable segment files. This page covers the binary formats, caching strategy, and retention configuration.

S3 key layout

s3://{bucket}/{namespace}/{topic}/{partition}/segment-{base_offset}.kfs
s3://{bucket}/{namespace}/{topic}/{partition}/segment-{base_offset}.index

Example:

s3://kafscale-data/production/orders/0/segment-00000000000000000000.kfs
s3://kafscale-data/production/orders/0/segment-00000000000000000000.index

The 20-digit zero-padded offset ensures lexicographic sorting matches offset order.
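For illustration, here is a minimal Go sketch of building such a key; the segmentKey helper is hypothetical, not part of KafScale's API:

package main

import "fmt"

// segmentKey builds the object key for a segment file within the bucket.
// %020d zero-pads the base offset to 20 digits so that lexicographic
// key order matches numeric offset order.
func segmentKey(namespace, topic string, partition int, baseOffset int64) string {
	return fmt.Sprintf("%s/%s/%d/segment-%020d.kfs", namespace, topic, partition, baseOffset)
}

func main() {
	fmt.Println(segmentKey("production", "orders", 0, 0))
	// Output: production/orders/0/segment-00000000000000000000.kfs
}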

Segment file format

Each .kfs segment is a self-contained file with a header, message batches, and a footer.

Segment header (32 bytes)

Field              Size     Description
Magic number       4 bytes  0x4B414653 ("KAFS")
Version            2 bytes  Format version (1)
Flags              2 bytes  Compression codec, etc.
Base offset        8 bytes  First offset in segment
Message count      4 bytes  Number of messages
Created timestamp  8 bytes  Unix milliseconds
Reserved           4 bytes  Future use
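
As a sketch of how a reader might decode this header, assuming big-endian field encoding as in the Kafka wire protocol (the type and function names here are illustrative):

package kafs

import (
	"encoding/binary"
	"fmt"
)

const segmentMagic = 0x4B414653 // "KAFS"

// SegmentHeader mirrors the 32-byte layout above.
type SegmentHeader struct {
	Version      uint16
	Flags        uint16
	BaseOffset   int64
	MessageCount uint32
	CreatedMs    int64
}

// parseSegmentHeader decodes the fixed 32-byte header, assuming
// big-endian encoding as in the Kafka wire protocol.
func parseSegmentHeader(b []byte) (SegmentHeader, error) {
	if len(b) < 32 {
		return SegmentHeader{}, fmt.Errorf("header too short: %d bytes", len(b))
	}
	if magic := binary.BigEndian.Uint32(b[0:4]); magic != segmentMagic {
		return SegmentHeader{}, fmt.Errorf("bad magic 0x%08X", magic)
	}
	return SegmentHeader{
		Version:      binary.BigEndian.Uint16(b[4:6]),
		Flags:        binary.BigEndian.Uint16(b[6:8]),
		BaseOffset:   int64(binary.BigEndian.Uint64(b[8:16])),
		MessageCount: binary.BigEndian.Uint32(b[16:20]),
		CreatedMs:    int64(binary.BigEndian.Uint64(b[20:28])),
		// b[28:32] is reserved
	}, nil
}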

Segment body (variable)

Field            Size      Description
Message batch 1  variable  Kafka RecordBatch format
Message batch 2  variable  Kafka RecordBatch format
...              variable  More batches until segment sealed

Segment footer (16 bytes)

Field         Size     Description
CRC32         4 bytes  Checksum of all batches
Last offset   8 bytes  Last offset in segment
Footer magic  4 bytes  0x454E4421 ("END!")

Message batch format

Batches are Kafka-compatible (magic byte 2) for client interoperability.

Batch header (61 bytes)

Field                   Size     Description
Base offset             8 bytes  First offset in batch
Batch length            4 bytes  Total bytes in batch
Partition leader epoch  4 bytes  Leader epoch
Magic                   1 byte   2 (Kafka v2 format)
CRC32                   4 bytes  Checksum of batch
Attributes              2 bytes  Compression, timestamp type
Last offset delta       4 bytes  Last record offset - base
First timestamp         8 bytes  Timestamp of first record
Max timestamp           8 bytes  Max timestamp in batch
Producer ID             8 bytes  -1 (no idempotence)
Producer epoch          2 bytes  -1
Base sequence           4 bytes  -1
Record count            4 bytes  Number of records

Individual record format

Each record within a batch uses varint encoding for compactness.

Field            Size    Description
Length           varint  Total record size
Attributes       1 byte  Unused (0)
Timestamp delta  varint  Delta from batch first timestamp
Offset delta     varint  Delta from batch base offset
Key length       varint  -1 for null, else byte count
Key              bytes   Message key (optional)
Value length     varint  Message value byte count
Value            bytes   Message payload
Headers count    varint  Number of headers
Headers          bytes   Key-value header pairs
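
Because these are zig-zag varints, as in the Kafka v2 record format, Go's binary.Varint decodes them directly. A minimal sketch with illustrative names and only light bounds checking; headers are skipped:

package kafs

import (
	"encoding/binary"
	"fmt"
)

// readRecord decodes the key, value, and offset delta of a single
// record. Kafka record varints are zig-zag encoded, which is the
// encoding binary.Varint expects. Headers are skipped for brevity.
func readRecord(b []byte) (key, value []byte, offsetDelta int64, err error) {
	pos := 0
	next := func() int64 {
		v, n := binary.Varint(b[pos:])
		if n <= 0 {
			err = fmt.Errorf("truncated varint at byte %d", pos)
			return 0
		}
		pos += n
		return v
	}
	_ = next()           // record length
	pos++                // attributes byte (unused, always 0)
	_ = next()           // timestamp delta
	offsetDelta = next() // offset delta
	keyLen := next()
	if err != nil {
		return nil, nil, 0, err
	}
	if keyLen >= 0 { // -1 means a null key
		key = b[pos : pos+int(keyLen)]
		pos += int(keyLen)
	}
	valueLen := next()
	if err != nil {
		return nil, nil, 0, err
	}
	if valueLen >= 0 { // -1 means a null value (tombstone)
		value = b[pos : pos+int(valueLen)]
	}
	return key, value, offsetDelta, nil
}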

Index file format

The index is sparse, for fast offset-to-position lookups: one entry is written per N messages, where N is the Interval field in the header.

Index header (16 bytes)

Field        Size     Description
Magic        4 bytes  0x494458 ("IDX")
Version      2 bytes  1
Entry count  4 bytes  Number of index entries
Interval     4 bytes  Messages between entries
Reserved     2 bytes  Future use

Index entries (12 bytes each)

Field     Size     Description
Offset    8 bytes  Message offset
Position  4 bytes  Byte position in segment file

To locate offset N: binary-search the index for the last entry with offset <= N, then scan forward through the segment from that entry's byte position.
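
A sketch of that lookup in Go, assuming the entries are already decoded into a slice (IndexEntry and findPosition are illustrative names):

package main

import (
	"fmt"
	"sort"
)

// IndexEntry is one decoded 12-byte index record.
type IndexEntry struct {
	Offset   int64  // message offset
	Position uint32 // byte position in the segment file
}

// findPosition returns the byte position to start scanning from:
// the position of the last index entry with Offset <= target.
func findPosition(entries []IndexEntry, target int64) uint32 {
	// sort.Search finds the first entry with Offset > target;
	// the entry just before it is the nearest preceding index point.
	i := sort.Search(len(entries), func(i int) bool {
		return entries[i].Offset > target
	})
	if i == 0 {
		return 0 // target precedes the first index entry
	}
	return entries[i-1].Position
}

func main() {
	idx := []IndexEntry{{0, 0}, {100, 8192}, {200, 16500}}
	fmt.Println(findPosition(idx, 150)) // 8192; scan forward from there
}

Because the index is sparse, the forward scan after the seek touches at most Interval messages.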

Cache architecture

Multi-layer cache
L1: hot segment cache. Holds the last N segments per partition; LRU eviction; 1-4 GB; <1 ms latency.
L2: index cache. Holds all indexes for assigned partitions; refreshed on segment roll; 100-500 MB; <1 ms latency.
S3: source of truth. Unbounded capacity; 50-100 ms latency.

Read path: check L1; on a miss, check L2; on a miss, fetch from S3, populate the caches, and return to the client.
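
In code, the read-through flow could look roughly like the following sketch; the interfaces are illustrative, not KafScale's actual API, and the L2 index lookup is omitted for brevity:

package kafs

// segmentCache is an illustrative stand-in for the L1 hot cache.
type segmentCache interface {
	Get(key string) ([]byte, bool)
	Put(key string, data []byte)
}

// objectStore is an illustrative stand-in for the S3 client.
type objectStore interface {
	Fetch(key string) ([]byte, error) // 50-100 ms round trip
}

// readSegment checks the L1 hot cache, falling back to S3 and
// populating the cache on the way back (read-through caching).
func readSegment(l1 segmentCache, s3 objectStore, key string) ([]byte, error) {
	if data, ok := l1.Get(key); ok {
		return data, nil // L1 hit: <1 ms
	}
	data, err := s3.Fetch(key) // miss: S3 is the source of truth
	if err != nil {
		return nil, err
	}
	l1.Put(key, data) // populate so the next read is a hit
	return data, nil
}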

Cache configuration

Variable                     Default  Description
KAFSCALE_CACHE_SIZE          1GB      L1 hot segment cache size
KAFSCALE_INDEX_CACHE_SIZE    256MB    L2 index cache size
KAFSCALE_READAHEAD_SEGMENTS  2        Segments to prefetch

Flush triggers

Segments are sealed and flushed to S3 when any condition is met:

Trigger                 Default  Variable
Buffer size threshold   4 MB     KAFSCALE_SEGMENT_BYTES
Time since last flush   500 ms   KAFSCALE_FLUSH_INTERVAL_MS
Explicit flush request  n/a      Admin API or graceful shutdown

Flush sequence

  1. Seal current buffer (no more writes accepted)
  2. Compress batches (Snappy by default)
  3. Build sparse index file
  4. Upload segment + index to S3 (both must succeed)
  5. Update etcd with new segment metadata
  6. Ack waiting producers (if acks=all)
  7. Clear flushed data from buffer
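
A hedged Go sketch of steps 2 through 5; every type and name here is illustrative, and compression, index building, and acking are stubbed out:

package kafs

import "context"

// buffer is an illustrative stand-in for a sealed in-memory segment.
type buffer struct {
	BaseOffset, LastOffset int64
}

type objectStore interface {
	Put(ctx context.Context, key string, data []byte) error
}

type metaStore interface {
	PutSegmentMeta(ctx context.Context, base, last int64) error
}

// flushSealed uploads a sealed buffer as a segment plus index, then
// commits metadata. The buffer must already be sealed (step 1).
func flushSealed(ctx context.Context, s3 objectStore, etcd metaStore,
	buf *buffer, segKey, idxKey string) error {
	data := compressBatches(buf)   // 2. Snappy by default
	index := buildSparseIndex(buf) // 3. one entry per N messages
	// 4. both uploads must succeed before metadata is touched
	if err := s3.Put(ctx, segKey, data); err != nil {
		return err
	}
	if err := s3.Put(ctx, idxKey, index); err != nil {
		return err
	}
	// 5. the segment becomes visible only once etcd commits
	return etcd.PutSegmentMeta(ctx, buf.BaseOffset, buf.LastOffset)
}

func compressBatches(buf *buffer) []byte  { return nil } // elided
func buildSparseIndex(buf *buffer) []byte { return nil } // elided

The ordering matters: because etcd is updated only after both uploads succeed, a crash mid-flush leaves at worst an orphaned object in S3, never metadata pointing at a missing segment.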

S3 lifecycle configuration

Use bucket lifecycle rules to automatically expire old segments. Align with your topic retention settings.

Example: 7-day retention

{
  "Rules": [
    {
      "ID": "kafscale-retention-7d",
      "Filter": {
        "Prefix": "production/"
      },
      "Status": "Enabled",
      "Expiration": {
        "Days": 7
      }
    }
  ]
}

AWS CLI setup

aws s3api put-bucket-lifecycle-configuration \
  --bucket kafscale-data \
  --lifecycle-configuration file://lifecycle.json

Terraform example

resource "aws_s3_bucket_lifecycle_configuration" "kafscale" {
  bucket = aws_s3_bucket.kafscale_data.id

  rule {
    id     = "kafscale-retention"
    status = "Enabled"

    filter {
      prefix = "production/"
    }

    expiration {
      days = 7
    }
  }
}

Per-topic retention

For different retention per topic, use prefix-based rules:

{
  "Rules": [
    {
      "ID": "logs-1d",
      "Filter": { "Prefix": "production/logs/" },
      "Status": "Enabled",
      "Expiration": { "Days": 1 }
    },
    {
      "ID": "events-30d",
      "Filter": { "Prefix": "production/events/" },
      "Status": "Enabled",
      "Expiration": { "Days": 30 }
    },
    {
      "ID": "default-7d",
      "Filter": { "Prefix": "production/" },
      "Status": "Enabled",
      "Expiration": { "Days": 7 }
    }
  ]
}

Note: when multiple expiration rules match an object, S3 applies the rule that expires it earliest, not the one with the most specific prefix. In the example above, the default-7d rule also matches production/events/ and would expire events after 7 days. To keep a topic longer than the default, drop the catch-all rule and give every topic prefix its own rule, or scope rules with tag-based filters.

Compression

KafScale supports batch-level compression using Kafka-compatible codecs.

Codec   ID  Notes
None    0   No compression
Snappy  2   Default: fast, moderate ratio
LZ4     3   Faster decompression
ZSTD    4   Best ratio, slower

Set the codec globally via KAFSCALE_COMPRESSION_CODEC, or per topic in the KafscaleTopic CRD:

apiVersion: kafscale.io/v1alpha1
kind: KafscaleTopic
metadata:
  name: logs
spec:
  partitions: 6
  retention: 24h
  compression: zstd  # Better ratio for logs