
Key Design Decisions

Deep technical dives into the major subsystems of HyperbyteDB. Each section explains the design rationale, algorithm details, and implementation specifics.


Write Path

The write path is split into two phases:

Phase 1 (synchronous, client-blocking): HTTP request → line protocol parsing → metadata registration → WAL append → 204 response. Data is durable at this point.

Phase 2 (asynchronous, background): Flush service reads WAL entries → groups by (db, rp, measurement) → partitions by hour → converts to Arrow RecordBatches → writes Parquet files → registers in metadata → truncates WAL.

WAL Design

The WAL uses RocksDB with two column families: wal (entries keyed by big-endian u64 sequence numbers) and wal_meta (single key last_seq). This encoding preserves numerical ordering in RocksDB's lexicographic key space. Entries are serialized with bincode for compact binary encoding.
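
The ordering property can be checked with a small sketch (function names are illustrative, not the actual HyperbyteDB API): big-endian encoding makes bytewise comparison agree with numeric comparison, which little-endian encoding does not.

```rust
// Sketch: big-endian u64 encoding keeps numeric order under RocksDB's
// lexicographic (bytewise) key comparison. Names are illustrative.
fn wal_key(seq: u64) -> [u8; 8] {
    seq.to_be_bytes()
}

fn keys_sorted_like_numbers(seqs: &[u64]) -> bool {
    // Bytewise order of the encoded keys must match numeric order.
    seqs.windows(2)
        .all(|w| (w[0] < w[1]) == (wal_key(w[0]) < wal_key(w[1])))
}
```

With little-endian encoding, 256 (`[0x00, 0x01, …]`) would sort before 255 (`[0xFF, 0x00, …]`), breaking sequential WAL iteration.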

The BatchingWal decorator provides optional group commit by batching multiple appends through a channel, reducing the number of RocksDB write operations under high concurrency.
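
A minimal sketch of the group-commit idea, using a std channel and thread in place of the real async machinery (types and names here are illustrative, not the actual BatchingWal):

```rust
use std::sync::mpsc;

// Sketch of group commit: many appenders funnel entries through a channel;
// a single committer drains whatever is queued and persists it as one
// batched write, then acknowledges every caller in the batch.
type Entry = Vec<u8>;

struct Append {
    entry: Entry,
    ack: mpsc::Sender<()>, // signalled once the batch is durable
}

fn run_committer(rx: mpsc::Receiver<Append>, mut persist_batch: impl FnMut(&[Entry])) {
    while let Ok(first) = rx.recv() {
        // Group commit: take everything already queued into one batch.
        let mut batch = vec![first];
        while let Ok(next) = rx.try_recv() {
            batch.push(next);
        }
        let entries: Vec<Entry> = batch.iter().map(|a| a.entry.clone()).collect();
        persist_batch(&entries); // one write instead of batch.len() writes
        for a in batch {
            let _ = a.ack.send(());
        }
    }
}
```

Under high concurrency the committer observes many queued appends per wakeup, so the number of underlying RocksDB writes drops well below the number of callers.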

Metadata Registration

Before WAL append, the ingestion service registers schema information: field types (enforced on subsequent writes — type conflicts return HTTP 400), tag keys, tag values (for SHOW TAG VALUES), and cardinality limits. An in-memory IngestSchemaCache reduces repeated metadata lookups on the hot path.

Flush Pipeline Parallelism

WAL entries → [Sequential] Group, sort, partition
            → [Parallel spawn_blocking] Parquet conversion (CPU-bound)
            → [Wait all] Collect results
            → [Parallel tokio::spawn] Write + register (I/O-bound)
            → [Wait all] Confirm writes
            → [Sequential] Truncate WAL

Memory-aware batch sizing: when max_points_per_batch = 0, the system reads /proc/meminfo on Linux and uses 25% of available memory, estimating 512 bytes per point, clamped to [10K, 500K] points.
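
The sizing rule above reduces to a few lines; this sketch hardcodes the constants from the text (25% budget, 512 bytes/point, [10K, 500K] clamp) and takes available memory as a parameter rather than reading /proc/meminfo:

```rust
// Sketch of the auto-sizing rule: 25% of available memory, at an assumed
// 512 bytes per point, clamped to [10_000, 500_000] points.
const BYTES_PER_POINT: u64 = 512;

fn auto_batch_size(available_mem_bytes: u64) -> u64 {
    let budget = available_mem_bytes / 4; // 25% of available memory
    (budget / BYTES_PER_POINT).clamp(10_000, 500_000)
}
```

On an 8 GiB box the raw figure (~4.2M points) hits the upper clamp; on a 1 MiB budget the lower clamp keeps batches from degenerating.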


Read Path

InfluxQL queries are processed through a multi-stage pipeline:

  1. Parse — Hand-rolled recursive descent parser produces an AST.
  2. Dispatch — SHOW/DDL statements execute directly against metadata. SELECT statements proceed to translation.
  3. File resolution — Query metadata for Parquet files overlapping the time range.
  4. Translation — InfluxQL AST → ClickHouse SQL using file() table function over Parquet globs.
  5. Tombstone injection — AND NOT (predicate) appended for each active tombstone.
  6. Execution — chDB executes the SQL in spawn_blocking, returning JSONEachRow.
  7. Result parsing — JSONEachRow → InfluxDB v1 series format (grouped by tag combination).
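
Step 5 is string-level rewriting of the translated WHERE clause. A sketch (the predicate strings and function name are hypothetical examples, not HyperbyteDB's real tombstone format):

```rust
// Illustrative sketch of tombstone injection: each active tombstone
// contributes an AND NOT (...) clause to the translated WHERE clause,
// so deleted rows are filtered out at query time.
fn inject_tombstones(mut where_clause: String, tombstones: &[&str]) -> String {
    for predicate in tombstones {
        where_clause.push_str(&format!(" AND NOT ({})", predicate));
    }
    where_clause
}
```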

Chunked Execution

For large datasets (>50 Parquet files spanning a wide time range), the query service splits the time range into chunks and executes them in parallel, then concatenates and sorts results.
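
The splitting step might look like this sketch (names are illustrative; timestamps are nanosecond i64 as elsewhere in the schema):

```rust
// Sketch: split [start, end) into n roughly equal, contiguous time chunks
// so each chunk can be executed as an independent query in parallel.
fn split_time_range(start: i64, end: i64, n: i64) -> Vec<(i64, i64)> {
    assert!(end > start && n > 0);
    let step = (end - start + n - 1) / n; // ceiling division: chunks cover the range
    let mut chunks = Vec::new();
    let mut lo = start;
    while lo < end {
        let hi = (lo + step).min(end);
        chunks.push((lo, hi));
        lo = hi;
    }
    chunks
}
```

Because the chunks are disjoint and contiguous, concatenating per-chunk results and sorting yields the same answer as one monolithic query.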

InfluxQL → ClickHouse Translation

Key mappings: MEAN → avg, FIRST → argMin(f, time), LAST → argMax(f, time), PERCENTILE(f, N) → quantile(N/100.0)(f). Transform functions (DERIVATIVE, MOVING_AVERAGE, etc.) use ClickHouse window functions (lagInFrame, windowed avg, sum).

Time buckets use toStartOfInterval(time, INTERVAL N UNIT) AS __time. The internal __time alias avoids collision with the raw time column.

Fill modes: fill(null) → WITH FILL, fill(previous) → INTERPOLATE (col AS col), fill(linear) → INTERPOLATE (col AS col USING LINEAR).
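
The aggregate mapping reduces to a lookup; this sketch covers the mappings listed above (the function name and signature are illustrative, not the real translator's API):

```rust
// Sketch of the InfluxQL → ClickHouse aggregate mapping table. Returns the
// ClickHouse expression for an InfluxQL aggregate over field `f`; `arg`
// carries PERCENTILE's N. Unknown functions return None.
fn translate_aggregate(func: &str, f: &str, arg: Option<f64>) -> Option<String> {
    match func.to_ascii_uppercase().as_str() {
        "MEAN" => Some(format!("avg({f})")),
        "FIRST" => Some(format!("argMin({f}, time)")),
        "LAST" => Some(format!("argMax({f}, time)")),
        "PERCENTILE" => arg.map(|n| format!("quantile({})({f})", n / 100.0)),
        _ => None,
    }
}
```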


Parquet Storage

File Layout

{data_dir}/{db}/{rp}/{measurement}/{YYYY-MM-DD}/{HH}[_{uuid}].parquet

UUID suffix prevents overwrites from concurrent flushes. Compacted files use _c{uuid} suffix.
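
Path construction under this layout is a straightforward format; a sketch (parameter names are illustrative, and compacted-file naming is omitted):

```rust
// Sketch of the hour-partitioned layout above:
// {data_dir}/{db}/{rp}/{measurement}/{YYYY-MM-DD}/{HH}[_{uuid}].parquet
fn parquet_path(data_dir: &str, db: &str, rp: &str, measurement: &str,
                date: &str, hour: u8, uuid: Option<&str>) -> String {
    match uuid {
        // UUID suffix keeps concurrent flushes from overwriting each other.
        Some(u) => format!("{data_dir}/{db}/{rp}/{measurement}/{date}/{hour:02}_{u}.parquet"),
        None => format!("{data_dir}/{db}/{rp}/{measurement}/{date}/{hour:02}.parquet"),
    }
}
```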

Arrow Schema

Position   Column            Type                         Nullable
0          time              Timestamp(Nanosecond, UTC)   No
1..N       Tags (sorted)     Utf8                         Yes
N+1..M     Fields (sorted)   Varies                       Yes

Writer Properties

  • ZSTD level 1 compression (good ratio, fast decompression)
  • Row group size of 65,536 rows
  • Page-level statistics (enables predicate pushdown in chDB)

Compaction

Streaming K-Way Merge (Default Path)

Assumes each input file is sorted by time (true for WAL flush output and prior compactions). Opens one ParquetRecordBatchReader per file, builds a unified schema, and performs a min-heap merge ordered by time. Flushes accumulated rows when estimated size exceeds target_file_size_bytes.
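
The heap-driven merge can be sketched with std's BinaryHeap; here each "file" is just a time-sorted vector of timestamps rather than a stream of RecordBatches:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Sketch of the streaming k-way merge: each input is a time-sorted iterator;
// a min-heap keyed by the head timestamp yields a globally sorted output
// without materializing all inputs at once.
fn k_way_merge(files: Vec<Vec<i64>>) -> Vec<i64> {
    let mut iters: Vec<_> = files.into_iter().map(|f| f.into_iter()).collect();
    let mut heap = BinaryHeap::new();
    for (i, it) in iters.iter_mut().enumerate() {
        if let Some(t) = it.next() {
            heap.push(Reverse((t, i))); // Reverse turns the max-heap into a min-heap
        }
    }
    let mut out = Vec::new();
    while let Some(Reverse((t, i))) = heap.pop() {
        out.push(t);
        if let Some(next) = iters[i].next() {
            heap.push(Reverse((next, i))); // refill from the file that was popped
        }
    }
    out
}
```

Only one head row per file lives in the heap at a time, which is what keeps memory bounded regardless of total input size.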

Legacy Full Sort (Fallback)

If per-file time order is violated, falls back to reading all files into memory, concat_batches, sort_to_indices on time, then write_batch_with_limit.

Schema Unification

When merging files with different schemas (fields added over time), the unified schema is the union of all columns. Type widening: Int64 + Float64 → Float64, Int64 + UInt64 → Int64.
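
A sketch of the widening rule; the two mixed-type cases mirror the text, while the UInt64 + Float64 arm is an assumption (the text does not state it) and the enum itself is illustrative:

```rust
// Sketch of compaction type widening. Int64 + Float64 and Int64 + UInt64
// follow the rules stated above; UInt64 + Float64 → Float64 is an assumed
// extension, not documented behavior.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ColType { Int64, UInt64, Float64 }

fn widen(a: ColType, b: ColType) -> ColType {
    use ColType::*;
    match (a, b) {
        (Int64, Float64) | (Float64, Int64) => Float64,
        (Int64, UInt64) | (UInt64, Int64) => Int64,
        (UInt64, Float64) | (Float64, UInt64) => Float64, // assumption
        (x, _) => x, // identical types: no widening needed
    }
}
```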

Tombstone Non-Application

Tombstones are not applied during compaction. This keeps compaction simple and idempotent. Tombstones are applied only at query time.


Clustering

Hybrid Replication Model

  • Data writes — master-master async replication via HTTP. Fire-and-forget with retry. Hinted handoff for unreachable peers.
  • Schema mutations — Raft consensus (OpenRaft) for consistent ordering. All nodes apply mutations in the same order.

Node State Machine

Joining → Syncing → Active → Draining → Leaving
                      └── Disconnected

Anti-Entropy

Merkle trees per (db, rp, measurement) with hourly buckets. SHA-256 hashing of sorted file metadata. Generation-gated cache avoids rebuilds when data hasn't changed. Background task compares with peers every anti_entropy_interval_secs.
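
The shape of the comparison can be sketched as follows. DefaultHasher stands in for the SHA-256 the real system uses, the bucket-hash fold flattens the tree to a single root, and all types and names are illustrative:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Sketch of anti-entropy hashing: hash each hourly bucket's sorted file
// metadata, then fold bucket hashes into a single root two peers can
// compare cheaply. Mismatched roots trigger a per-bucket drill-down.
fn bucket_hash(mut files: Vec<(String, u64)>) -> u64 { // (file name, size)
    files.sort(); // sort metadata so the hash is independent of listing order
    let mut h = DefaultHasher::new();
    files.hash(&mut h);
    h.finish()
}

fn merkle_root(bucket_hashes: &[u64]) -> u64 {
    let mut h = DefaultHasher::new();
    bucket_hashes.hash(&mut h);
    h.finish()
}
```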

Self-Repair

Compaction-driven Parquet self-repair:

  1. Verified compaction — For cold buckets, compare per-origin content hashes with the authoring peer via GET /internal/bucket-hash. Replace on mismatch.
  2. Membership-driven repair — Scan peer manifests for buckets where the peer authored data but this node has nothing. Fetch missing slices.

WAL Truncation Safety

In cluster mode, the flush service uses min(chunk_max_seq, min_wal_ack_across_peers) as the safe truncation point, ensuring peers that are catching up can still read needed entries.
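
The rule itself is a one-liner; this sketch also covers the single-node case with no peers (names are illustrative):

```rust
// Sketch of the safe truncation point: never truncate past what the
// slowest peer has acknowledged reading from the WAL.
fn safe_truncation_seq(chunk_max_seq: u64, peer_acks: &[u64]) -> u64 {
    match peer_acks.iter().copied().min() {
        Some(min_ack) => chunk_max_seq.min(min_ack),
        None => chunk_max_seq, // no peers: truncate the whole flushed chunk
    }
}
```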

Graceful Drain

Sets Draining state (rejects writes), flushes WAL completely, waits for peer acks (up to 60s), Merkle verifies with a peer, notifies peers of departure, sets Leaving state.


Replication Wire Format

Data replication uses Content-Type: application/vnd.hyperbytedb.replicate+line.v1 with line protocol body. Database, RP, and precision in X-Hyperbytedb-* headers. No JSON for data replication; mutations use JSON on /internal/replicate-mutation.

Hinted handoff hints are stored in the h1 column family as binary payloads.


Authentication

Passwords hashed with Argon2id (random salt via SaltString::generate(OsRng)). Credential extraction order: query parameters (u/p) → HTTP Basic → Token header. Minimal hand-rolled Base64 decoder for Basic auth (no external dependency). Short TTL verification cache to avoid repeated Argon2 computations.
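
A minimal standard-alphabet Base64 decoder of the kind described is sketched below. This is a generic implementation for illustration, not HyperbyteDB's actual code, and it skips strict validation of padding placement:

```rust
// Sketch of a minimal hand-rolled Base64 decoder (standard alphabet),
// enough to turn a Basic auth payload back into "user:password" bytes.
fn b64_val(c: u8) -> Option<u8> {
    match c {
        b'A'..=b'Z' => Some(c - b'A'),
        b'a'..=b'z' => Some(c - b'a' + 26),
        b'0'..=b'9' => Some(c - b'0' + 52),
        b'+' => Some(62),
        b'/' => Some(63),
        _ => None, // reject anything outside the alphabet
    }
}

fn b64_decode(input: &str) -> Option<Vec<u8>> {
    let bytes: Vec<u8> = input.bytes().filter(|&c| c != b'=').collect();
    let mut out = Vec::new();
    for chunk in bytes.chunks(4) {
        if chunk.len() == 1 {
            return None; // a lone 6-bit symbol cannot form a byte
        }
        let vals: Vec<u8> = chunk.iter().map(|&c| b64_val(c)).collect::<Option<_>>()?;
        // Pack the 6-bit values into a 24-bit accumulator, left-aligned.
        let mut acc: u32 = 0;
        for &v in &vals {
            acc = (acc << 6) | v as u32;
        }
        acc <<= 6 * (4 - vals.len()) as u32;
        // Each 4-symbol group encodes up to 3 bytes.
        for i in 0..(vals.len() * 6 / 8) {
            out.push(((acc >> (16 - 8 * i)) & 0xff) as u8);
        }
    }
    Some(out)
}
```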

