
HyperbyteDB system architecture

This document is a long-form internal design walkthrough (module layout, paths, storage). It complements the shorter Architecture page and the deep dive series. If anything disagrees with src/ or a deep dive, prefer the code and the focused deep dive.

It is intended for contributors who need to understand how HyperbyteDB works under the hood.


Table of Contents

  1. Architecture Overview
  2. Hexagonal Architecture
  3. Module Structure
  4. Write Path
  5. Query Path
  6. WAL (Write-Ahead Log)
  7. Parquet Storage
  8. Metadata Store
  9. Flush Pipeline
  10. Compaction
  11. Query Engine (chDB)
  12. InfluxQL Parser
  13. ClickHouse SQL Translator
  14. Retention Enforcement
  15. DELETE and Tombstones
  16. Continuous Queries
  17. Authentication Internals
  18. Clustering and Replication
  19. Background Services
  20. Error Handling
  21. Observability
  22. Concurrency Model
  23. Dependencies
  24. Statement Summary
  25. Debug Binary
  26. Kubernetes Operator
  27. Deterministic Simulation Testing (HyperSim)

1. Architecture Overview

HyperbyteDB combines four storage/compute engines into one binary:

 Client (Telegraf, Grafana, curl)
 ┌─────────────────────────────┐
 │    HTTP Layer (axum)        │  Line protocol, InfluxQL, auth, gzip
 └──────────────┬──────────────┘
 ┌─────────────────────────────┐
 │   Application Services      │  Ingestion, Query, Flush, Compaction,
 │                              │  Retention, Continuous Queries
 └──────────────┬──────────────┘
 ┌─────────────────────────────┐
 │   Port Traits (interfaces)  │  WalPort, StoragePort, QueryPort,
 │                              │  MetadataPort, IngestionPort
 └──────────────┬──────────────┘
       ┌────────┼────────┬───────────┐
       ▼        ▼        ▼           ▼
 ┌───────┐ ┌──────────┐ ┌──────┐ ┌──────┐
 │RocksDB│ │ Parquet  │ │ chDB │ │ S3/  │
 │(WAL + │ │ (Arrow)  │ │(Click│ │Local │
 │ meta) │ │          │ │House)│ │  FS  │
 └───────┘ └──────────┘ └──────┘ └──────┘

RocksDB provides the WAL (durable, ordered write log) and metadata store (databases, measurements, schemas, users, tombstones, CQ definitions, Parquet file registry).

Apache Parquet is the long-term storage format. Points are batched, converted to Arrow RecordBatches, and written as ZSTD-compressed Parquet files (ZSTD level 1, row groups of 65,536 rows, page-level statistics enabled).

chDB (embedded ClickHouse) is the query engine. InfluxQL is transpiled to ClickHouse SQL that uses the file() table function to scan Parquet files directly.

object_store abstracts local filesystem and S3-compatible storage for Parquet files.


2. Hexagonal Architecture

HyperbyteDB uses the hexagonal (ports and adapters) pattern. Business logic lives in the application and domain layers and depends only on port traits, never on concrete implementations.

                  ┌────────────────────────────────────┐
                  │            Domain Layer             │
                  │  Point, FieldValue, Database,       │
                  │  RetentionPolicy, Precision,        │
                  │  QueryResponse, SeriesResult        │
                  └────────────────────────────────────┘
                  ┌────────────────────────────────────┐
                  │         Application Services        │
                  │  IngestionService, QueryService,    │
                  │  FlushService, CompactionService,   │
                  │  RetentionService, CQService        │
                  └──────────┬────────────┬────────────┘
                             │  depends   │
                             │  only on   │
                             ▼  ports     ▼
           ┌─────────────────────────────────────────────┐
           │              Port Traits                     │
           │  WalPort   StoragePort   QueryPort           │
           │  MetadataPort   IngestionPort                │
           └────────────┬────────────────┬───────────────┘
                        │                │
          ┌─────────────▼──┐    ┌────────▼───────────┐
          │   Adapters     │    │   Adapters          │
          │ (inbound)      │    │ (outbound)          │
          │ HTTP handlers, │    │ RocksDB WAL,        │
          │ Peer handlers  │    │ RocksDB Metadata,   │
          │                │    │ Parquet Writer,      │
          │                │    │ Local/S3 Storage,    │
          │                │    │ chDB Query Adapter   │
          └────────────────┘    └─────────────────────┘

This means:

  • Swapping RocksDB for another WAL requires only implementing WalPort.
  • Swapping chDB for another SQL engine requires only implementing QueryPort.
  • The HTTP layer can be replaced without touching business logic.
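As a minimal sketch of this dependency rule (hypothetical types, not the real definitions in src/ports/): a service holds a boxed trait object and never names the concrete adapter, so swapping the backend means writing a new impl.

```rust
// Sketch of the ports-and-adapters rule: services depend on the WalPort trait,
// never on a concrete backend. Types here are simplified stand-ins.
use std::sync::Mutex;

struct WalEntry {
    database: String,
    points: usize,
}

trait WalPort {
    fn append(&self, entry: WalEntry) -> u64;
    fn last_sequence(&self) -> u64;
}

// An in-memory stand-in adapter; the real adapter is RocksDB-backed.
struct InMemoryWal {
    entries: Mutex<Vec<WalEntry>>,
}

impl WalPort for InMemoryWal {
    fn append(&self, entry: WalEntry) -> u64 {
        let mut entries = self.entries.lock().unwrap();
        entries.push(entry);
        entries.len() as u64 // sequence number = entry count in this sketch
    }

    fn last_sequence(&self) -> u64 {
        self.entries.lock().unwrap().len() as u64
    }
}

fn main() {
    // The "service" side only ever sees Box<dyn WalPort>.
    let wal: Box<dyn WalPort> = Box::new(InMemoryWal { entries: Mutex::new(Vec::new()) });
    let seq = wal.append(WalEntry { database: "mydb".into(), points: 3 });
    assert_eq!(seq, 1);
    assert_eq!(wal.last_sequence(), 1);
}
```

Replacing RocksDB would mean implementing the same trait over a different store; no service code changes.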


3. Module Structure

src/
├── main.rs                          CLI, server bootstrap, graceful shutdown
├── lib.rs                           Top-level module declarations
├── bootstrap.rs                     Composition root: builds all adapters and services
├── config.rs                        Figment-based configuration loading
├── error.rs                         HyperbytedbError enum + From impls
├── domain/
│   ├── point.rs                     Point, FieldValue (Float/Integer/UInteger/String/Boolean)
│   ├── series.rs                    SeriesKey (measurement + sorted tags)
│   ├── database.rs                  Database, RetentionPolicy, Precision
│   ├── query_result.rs              QueryResponse, StatementResult, SeriesResult
│   ├── wal.rs                       WalEntry struct
│   ├── measurement.rs               MeasurementMeta
│   ├── user.rs                      StoredUser (auth)
│   ├── continuous_query.rs          ContinuousQueryDef
│   └── storage_layout.rs            StorageLayout path helpers
├── ports/
│   ├── ingestion.rs                 trait IngestionPort
│   ├── query.rs                     trait QueryPort, trait QueryService
│   ├── storage.rs                   trait StoragePort
│   ├── wal.rs                       trait WalPort
│   ├── metadata.rs                  trait MetadataPort
│   └── auth.rs                      trait AuthPort
├── adapters/
│   ├── http/
│   │   ├── router.rs                Route definitions, AppState, build_router()
│   │   ├── write.rs                 POST /write (line protocol, gzip)
│   │   ├── query.rs                 GET/POST /query (InfluxQL, bind params, CSV, chunked)
│   │   ├── ping.rs                  /ping, /health
│   │   ├── metrics.rs               /metrics (Prometheus)
│   │   ├── auth_middleware.rs        Auth extraction & verification
│   │   ├── middleware.rs             Version headers, Request-Id
│   │   ├── error.rs                 HyperbytedbError → HTTP response mapping
│   │   ├── response.rs              InfluxDB error JSON formatting
│   │   ├── cluster.rs               /cluster/metrics, /cluster/nodes
│   │   ├── peer_handlers.rs         /internal/replicate, sync, membership, drain
│   │   ├── raft_handlers.rs         /internal/raft/vote, append, snapshot, membership
│   │   └── statements.rs            /api/v1/statements (statement summary)
│   │
│   ├── chdb/
│   │   ├── query_adapter.rs         Single-session chDB wrapper (QueryPort)
│   │   └── pool.rs                  Multi-session pool with concurrency limit
│   │
│   ├── storage/
│   │   ├── layout.rs                Re-exports StorageLayout from domain
│   │   ├── parquet_writer.rs        Arrow schema + RecordBatch → Parquet bytes (ZSTD)
│   │   ├── local.rs                 Local filesystem StoragePort
│   │   └── s3.rs                    S3-compatible StoragePort (object_store)
│   │
│   ├── wal/
│   │   └── rocksdb_wal.rs           RocksDB WAL implementation
│   │
│   ├── metadata/
│   │   └── rocksdb_meta.rs          RocksDB metadata implementation
│   │
│   └── auth.rs                      MetadataAuthAdapter
├── application/
│   ├── ingestion_service.rs         Line protocol → metadata → WAL
│   ├── peer_ingestion_service.rs    Same + fan-out replication to peers
│   ├── query_service.rs             InfluxQL dispatch, chDB execution, result formatting
│   ├── peer_query_service.rs        Same + Raft-based mutation replication
│   ├── flush_service.rs             Background WAL → Parquet flush
│   ├── compaction_service.rs        Compaction orchestration: periodic merge, `compact_all`, verified compaction, self-repair, orphan cleanup
│   ├── compaction_merge.rs          Parquet merge engine: Arrow streaming k-way merge by `time`, legacy full-buffer sort fallback
│   ├── retention_service.rs         Background expired-data cleanup
│   ├── continuous_query_service.rs  Background CQ scheduler
│   └── statement_summary.rs         Query statement tracking
├── influxql/
│   ├── ast.rs                       All AST node types (Statement, SelectStatement, Expr, etc.)
│   ├── parser.rs                    Hand-rolled recursive descent InfluxQL parser
│   ├── to_clickhouse.rs             AST → ClickHouse SQL translation
│   └── digest.rs                    Query digest/canonicalization
├── cluster/
│   ├── bootstrap.rs                 ClusterBootstrap: startup sync, Raft init
│   ├── membership.rs                ClusterMembership, SharedMembership, NodeState, NodeInfo
│   ├── peer_client.rs               HTTP client for write/mutation replication
│   ├── sync_client.rs               Sync client for startup/full/reconnect sync
│   ├── sync.rs                      SyncManifest, VerifyRequest/Response, DivergentBucket
│   ├── merkle.rs                    Merkle tree construction for anti-entropy
│   ├── replication_log.rs           RocksDB-backed replication/mutation ack tracking
│   ├── drain.rs                     DrainService: graceful node shutdown
│   ├── types.rs                     MutationRequest (replication uses line protocol + headers)
│   └── raft/
│       ├── mod.rs                   TypeConfig, HyperbytedbRaft type alias
│       ├── types.rs                 ClusterRequest, ClusterResponse
│       ├── log_store.rs             RaftStore: RocksDB-backed Raft log/state
│       ├── network.rs               HTTP-based Raft network transport
│       └── state_machine.rs         Raft state machine, schema mutation application
└── bin/
    └── hyperbytedb_debug.rs            CLI for cluster status, topology, Raft, metrics

4. Write Path

The write path is optimized for low-latency ingestion. Data is durable the moment the WAL append returns, and becomes queryable after the next flush cycle. For an exhaustive treatment of every step, see Deep Dive: Write Path.

 Client POST /write?db=mydb&precision=ns
 ┌─────────────────────────────────────────┐
 │ write.rs: handle_write()                │
 │  1. Extract db, rp, precision params    │
 │  2. Gzip decompress if needed           │
 │  3. Call IngestionPort.ingest()         │
 └──────────────┬──────────────────────────┘
 ┌─────────────────────────────────────────┐
 │ ingestion_service.rs                    │
 │  1. Verify database exists (metadata)   │
 │  2. Parse line protocol body            │
 │  3. Convert ParsedLine → Vec<Point>     │
 │     - Apply precision to timestamps     │
 │     - Default to current time if absent │
 │  4. Register field types + tag keys     │
 │     in metadata                         │
 │  5. Check cardinality limits            │
 │  6. Store tag values for SHOW queries   │
 │  7. Append WalEntry to WAL             │
 └──────────────┬──────────────────────────┘
 ┌─────────────────────────────────────────┐
 │ rocksdb_wal.rs: append()               │
 │  1. Atomic fetch_add sequence number    │
 │  2. bincode::serialize(WalEntry)        │
 │  3. WriteBatch: put to "wal" CF +       │
 │     update "last_seq" in "wal_meta" CF  │
 │  4. Return sequence number             │
 └─────────────────────────────────────────┘

In cluster mode, PeerIngestionService wraps the base service and, after the local WAL append succeeds, fires off async HTTP POST requests to all peers via PeerClient.replicate_write(). The local write returns 204 immediately without waiting for replication.

Data types

The Point struct carries:

  • measurement: String — measurement name
  • tags: BTreeMap<String, String> — sorted tag key-value pairs
  • fields: BTreeMap<String, FieldValue> — field key-value pairs
  • timestamp: i64 — nanoseconds since Unix epoch

FieldValue has five variants:

Variant         Discriminant  Arrow type  Parquet type
Float(f64)      0             Float64     DOUBLE
Integer(i64)    1             Int64       INT64
UInteger(u64)   2             UInt64      INT64 (unsigned)
String(String)  3             Utf8        BYTE_ARRAY
Boolean(bool)   4             Boolean     BOOLEAN

Field types are registered on first write and enforced on subsequent writes. A write that sends an integer where a float was previously registered returns a FieldTypeConflict error (HTTP 400).
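The register-on-first-write rule can be sketched as follows (hypothetical function and error shape; the real check lives in the ingestion service and maps to an HTTP 400):

```rust
// Sketch of first-write field type registration and the conflict check:
// the first write fixes the type; later writes must match or are rejected.
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum FieldType {
    Float = 0,
    Integer = 1,
    UInteger = 2,
    Str = 3,
    Boolean = 4,
}

fn register_field(
    types: &mut HashMap<String, FieldType>,
    field: &str,
    incoming: FieldType,
) -> Result<(), String> {
    match types.get(field) {
        // First write: register the type.
        None => {
            types.insert(field.to_string(), incoming);
            Ok(())
        }
        // Same type as registered: accepted.
        Some(t) if *t == incoming => Ok(()),
        // Mismatch: rejected (surfaces as FieldTypeConflict / HTTP 400).
        Some(t) => Err(format!(
            "field type conflict on '{field}': registered {t:?}, got {incoming:?}"
        )),
    }
}

fn main() {
    let mut types = HashMap::new();
    assert!(register_field(&mut types, "value", FieldType::Float).is_ok());
    assert!(register_field(&mut types, "value", FieldType::Float).is_ok());
    assert!(register_field(&mut types, "value", FieldType::Integer).is_err());
}
```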


5. Query Path

For an exhaustive treatment of every step, see Deep Dive: Read Path.

 Client GET /query?db=mydb&q=SELECT mean("value") FROM "cpu" WHERE time > now() - 1h GROUP BY time(5m)
 ┌────────────────────────────────────────────────────────────────┐
 │ query.rs: handle_query_impl()                                  │
 │  1. Extract q, db, epoch, pretty, chunked, params              │
 │  2. Substitute $param bind parameters if present               │
 │  3. Call QueryService.execute_query() with timeout wrapper      │
 │  4. Format response as JSON, CSV, or chunked                   │
 └──────────────┬─────────────────────────────────────────────────┘
 ┌────────────────────────────────────────────────────────────────┐
 │ query_service.rs: execute_query()                              │
 │  1. tokio::time::timeout wraps entire execution                │
 │  2. Parse InfluxQL string → Vec<Statement>                     │
 │  3. For each statement, dispatch:                              │
 │     ┌─────────────────────────────────────────────┐            │
 │     │ SHOW DATABASES  → metadata.list_databases() │            │
 │     │ SHOW MEASUREMENTS → metadata.list_meas()    │            │
 │     │ SHOW TAG KEYS   → metadata.get_meas()       │            │
 │     │ SHOW TAG VALUES → metadata.list_tag_values() │           │
 │     │ SHOW FIELD KEYS → metadata.get_meas()       │            │
 │     │ CREATE DATABASE → metadata.create_database() │           │
 │     │ DROP DATABASE   → metadata.drop_database()   │           │
 │     │ DELETE          → metadata.store_tombstone()  │           │
 │     │ SELECT          → see SELECT flow below       │           │
 │     └─────────────────────────────────────────────┘            │
 └──────────────┬─────────────────────────────────────────────────┘
                │ (SELECT only)
 ┌────────────────────────────────────────────────────────────────┐
 │ handle_select()                                                │
 │  1. Extract measurement name from FROM clause                  │
 │  2. Handle regex measurements: query metadata for matches,     │
 │     execute UNION ALL across all matching measurements         │
 │  3. Handle subqueries: translate inner SELECT first            │
 │  4. Resolve default retention policy from metadata             │
 │  5. Determine time range from WHERE clause                     │
 │  6. Generate Parquet glob pattern via StorageLayout             │
 │  7. Load tombstones for the measurement                        │
 │  8. Translate InfluxQL AST → ClickHouse SQL                    │
 │     (to_clickhouse::translate)                                 │
 │  9. Execute SQL via chDB (QueryPort.execute_sql)               │
 │ 10. Parse JSONEachRow output → SeriesResult[]                  │
 │     - Group by tag combinations                                │
 │     - Apply epoch formatting to timestamps                     │
 │ 11. Handle SLIMIT/SOFFSET (series-level pagination)            │
 │ 12. Handle INTO clause (write results to target measurement)   │
 └────────────────────────────────────────────────────────────────┘

Result formatting

chDB returns data in JSONEachRow format (one JSON object per row):

{"__time":"2024-01-15 10:00:00","host":"server01","mean_usage_idle":42.5}
{"__time":"2024-01-15 10:05:00","host":"server01","mean_usage_idle":38.2}

The query service transforms this into InfluxDB v1 series format:

  1. Parse each line as a JSON object.
  2. Rename __time back to time.
  3. Convert ClickHouse datetime strings to nanosecond timestamps.
  4. Apply the epoch parameter (convert to ns/us/ms/s integers or leave as RFC3339 strings).
  5. Group rows by tag combination into separate SeriesResult objects.
  6. Each SeriesResult gets a name (measurement), tags map, columns list, and values array.
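The tag-grouping step can be sketched like this (hypothetical row shape; the real code works on parsed JSON rows with arbitrary columns):

```rust
// Sketch of grouping flat result rows by tag combination into per-series
// value lists — one SeriesResult per distinct tag map.
use std::collections::BTreeMap;

type Tags = BTreeMap<String, String>;

fn group_by_tags(rows: Vec<(Tags, i64, f64)>) -> BTreeMap<Tags, Vec<(i64, f64)>> {
    let mut series: BTreeMap<Tags, Vec<(i64, f64)>> = BTreeMap::new();
    for (tags, time, value) in rows {
        // Rows sharing a tag combination accumulate into one series.
        series.entry(tags).or_default().push((time, value));
    }
    series
}

fn main() {
    let tag = |h: &str| BTreeMap::from([("host".to_string(), h.to_string())]);
    let rows = vec![
        (tag("server01"), 1, 42.5),
        (tag("server02"), 1, 38.2),
        (tag("server01"), 2, 41.0),
    ];
    let series = group_by_tags(rows);
    assert_eq!(series.len(), 2); // one SeriesResult per host
    assert_eq!(series[&tag("server01")].len(), 2);
}
```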


6. WAL (Write-Ahead Log)

The WAL provides crash-safe durability for incoming writes before they're flushed to Parquet.

Implementation: RocksDbWal

  • Backing store: RocksDB with two column families.
  • Location: Configured via storage.wal_dir (default ./wal).
Column Family  Purpose
wal            Ordered WAL entries. Keys are big-endian u64 sequence numbers (8 bytes). Values are bincode-serialized WalEntry.
wal_meta       Single key last_seq storing the current sequence number as big-endian u64.

WalEntry structure

struct WalEntry {
    database: String,
    retention_policy: String,
    points: Vec<Point>,
}

Serialized with bincode for compact binary encoding.

Operations

Operation                 Description
append(entry)             Atomically increments sequence, serializes entry, writes to wal CF and updates last_seq in a single WriteBatch. Returns sequence number.
read_from(seq)            Forward iterator from seq to end. Returns Vec<(u64, WalEntry)>.
read_range(start, count)  Reads up to count entries starting at start. Used by flush service for chunked reads.
truncate_before(seq)      Deletes all entries with sequence < seq using delete_range_cf.
last_sequence()           Returns the current sequence number from the atomic counter.

Key encoding

Sequence numbers are encoded as big-endian u64 so that RocksDB's lexicographic ordering preserves numerical order. This allows efficient range scans and truncation.
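The property is easy to demonstrate: to_be_bytes makes byte-wise (lexicographic) comparison agree with numeric comparison, which is exactly what RocksDB's default comparator needs.

```rust
// Why big-endian: RocksDB compares keys as raw bytes, and big-endian encoding
// makes byte order agree with numeric order, so range scans and
// truncate_before's delete_range work correctly.
fn wal_key(seq: u64) -> [u8; 8] {
    seq.to_be_bytes()
}

fn main() {
    assert!(wal_key(1) < wal_key(2));
    // Little-endian would get this wrong: LE(255) = [255,0,..] > LE(256) = [0,1,..].
    assert!(wal_key(255) < wal_key(256));
    assert!(wal_key(u64::MAX - 1) < wal_key(u64::MAX));
}
```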


7. Parquet Storage

File layout

{data_dir}/{database}/{retention_policy}/{measurement}/{YYYY-MM-DD}/{HH}[_{uuid}].parquet

Example:

/var/lib/hyperbytedb/data/mydb/autogen/cpu/2024-01-15/10.parquet
/var/lib/hyperbytedb/data/mydb/autogen/cpu/2024-01-15/10_a3f2b1c8.parquet

The UUID suffix is appended when multiple flushes write to the same hour bucket, preventing overwrites.
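A sketch of how the date/hour bucket can be derived from a nanosecond timestamp (the real helper is StorageLayout in src/domain/storage_layout.rs; the date math here uses the standard civil-from-days algorithm rather than a date library):

```rust
// Sketch: nanosecond timestamp → {YYYY-MM-DD}/{HH} Parquet path.
// civil_from_days converts a day count since 1970-01-01 into (year, month, day).
fn civil_from_days(z: i64) -> (i64, u32, u32) {
    let z = z + 719_468;
    let era = if z >= 0 { z } else { z - 146_096 } / 146_097;
    let doe = z - era * 146_097;
    let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365;
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100);
    let mp = (5 * doy + 2) / 153;
    let d = (doy - (153 * mp + 2) / 5 + 1) as u32;
    let m = if mp < 10 { mp + 3 } else { mp - 9 } as u32;
    let y = yoe + era * 400 + if m <= 2 { 1 } else { 0 };
    (y, m, d)
}

fn parquet_path(
    data_dir: &str,
    db: &str,
    rp: &str,
    meas: &str,
    ts_ns: i64,
    uuid: Option<&str>,
) -> String {
    let secs = ts_ns.div_euclid(1_000_000_000);
    let days = secs.div_euclid(86_400);
    let hour = secs.rem_euclid(86_400) / 3_600;
    let (y, m, d) = civil_from_days(days);
    // UUID suffix only when a second flush hits the same hour bucket.
    let suffix = uuid.map(|u| format!("_{u}")).unwrap_or_default();
    format!("{data_dir}/{db}/{rp}/{meas}/{y:04}-{m:02}-{d:02}/{hour:02}{suffix}.parquet")
}

fn main() {
    // 1705312800000000000 ns = 2024-01-15 10:00:00 UTC (the example above).
    assert_eq!(
        parquet_path("/var/lib/hyperbytedb/data", "mydb", "autogen", "cpu",
                     1_705_312_800_000_000_000, None),
        "/var/lib/hyperbytedb/data/mydb/autogen/cpu/2024-01-15/10.parquet"
    );
}
```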

Arrow schema

Each Parquet file has a consistent Arrow schema:

Column        Arrow Type                          Nullable  Notes
time          Timestamp(Nanosecond, Some("UTC"))  No        Always present, always first
<tag_key>     Utf8                                Yes       One column per tag key
<field_name>  Varies by type                      Yes       Float64, Int64, UInt64, Utf8, Boolean

Schema is constructed from the measurement's MeasurementMeta (tag keys + field types) stored in the metadata store.

Parquet properties

  • Compression: ZSTD level 1 (Compression::ZSTD(ZstdLevel::try_new(1)))
  • Row group size: 65,536 rows (set_max_row_group_size(65536))
  • Statistics: Page-level (EnabledStatistics::Page)
  • Encoding: Default Parquet encodings (dictionary for strings, delta for integers)

Storage backends

Two StoragePort implementations:

LocalStorage — Writes to and reads from the local filesystem. Creates intermediate directories as needed.

S3Storage — Uses the object_store crate to read/write Parquet files to any S3-compatible service (AWS S3, MinIO, Cloudflare R2). Configured via [storage.s3].

Both backends are transparent to the rest of the system — the flush service and compaction service interact only through the StoragePort trait.


8. Metadata Store

Implementation: RocksDbMetadata

  • Backing store: RocksDB with one column family metadata.
  • Location: Configured via storage.meta_dir (default ./meta).
  • Serialization: JSON for all values (human-inspectable with RocksDB tools).

Key schema

Key Pattern                         Value                                  Description
db:{name}                           JSON {"database": Database}            Database definition with retention policies
meas:{db}:{name}                    JSON MeasurementMeta                   Field types (name → discriminant), tag keys
tag_val:{db}:{meas}:{key}:{value}   Empty                                  Tag value existence (for SHOW TAG VALUES)
pq:{db}:{rp}:{meas}:{min_time_hex}  JSON {"path", "min_time", "max_time"}  Parquet file registry
user:{username}                     JSON StoredUser                        Username, password_hash, admin flag
tombstone:{db}:{meas}:{uuid}        JSON {"predicate_sql", "created_at"}   DELETE tombstone records
cq:{db}:{name}                      JSON ContinuousQueryDef                Continuous query definitions

MeasurementMeta

struct MeasurementMeta {
    name: String,
    field_types: HashMap<String, u8>,  // field_name → FieldValue discriminant
    tag_keys: Vec<String>,
}

Field types use the FieldValue discriminant (0=Float, 1=Integer, 2=UInteger, 3=String, 4=Boolean) to reconstruct Arrow schemas during flush.

Parquet file registry

Each Parquet file is registered with its time range:

{
  "path": "/var/lib/hyperbytedb/data/mydb/autogen/cpu/2024-01-15/10_abc123.parquet",
  "min_time": 1705312800000000000,
  "max_time": 1705316399999999999
}

The key includes min_time as a zero-padded hex string for efficient time-range prefix scanning.
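A sketch of that key (assuming a 16-digit zero-padded hex encoding, which keeps byte order equal to numeric time order for prefix scans; the exact pad width is an illustration):

```rust
// Sketch of the Parquet registry key: zero-padded hex min_time means a plain
// lexicographic prefix scan visits files in time order.
fn registry_key(db: &str, rp: &str, meas: &str, min_time_ns: u64) -> String {
    format!("pq:{db}:{rp}:{meas}:{min_time_ns:016x}")
}

fn main() {
    let a = registry_key("mydb", "autogen", "cpu", 1_705_312_800_000_000_000);
    let b = registry_key("mydb", "autogen", "cpu", 1_705_316_400_000_000_000);
    // String (byte) order matches numeric time order thanks to the fixed pad.
    assert!(a < b);
    assert!(a.starts_with("pq:mydb:autogen:cpu:"));
}
```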


9. Flush Pipeline

The flush service is the bridge between the WAL and Parquet storage. It runs as a background Tokio task.

Lifecycle

  1. Timer tick every flush.interval_secs (default 10s).
  2. Read WAL entries from last_flushed_seq + 1 in chunks of 5,000 entries.
  3. Group all points by (database, retention_policy, measurement).
  4. Partition each group by hour bucket (using the point's timestamp).
  5. Sub-batch large groups by max_points_per_batch (auto-detected from available memory, clamped to 10K–500K).
  6. For each sub-batch, in parallel via spawn_blocking:
     a. Look up MeasurementMeta from metadata.
     b. Build Arrow schema.
     c. Convert Vec<Point> to RecordBatch.
     d. Write RecordBatch to Parquet bytes (ZSTD level 1 compressed, 65,536-row groups).
  7. Write all Parquet files in parallel via tokio::spawn (I/O-parallel).
  8. Register each file in the metadata Parquet file registry.
  9. Truncate WAL entries up to the max flushed sequence.

Memory-aware batching

When max_points_per_batch = 0 (default), the flush service calls auto_detect_batch_size():

  • On Linux: reads /proc/meminfo for MemAvailable, uses 25% of that.
  • Elsewhere: defaults to 100,000 points.
  • Clamps to [MIN_BATCH_POINTS=10_000, MAX_BATCH_POINTS=500_000].

Each point is estimated at 512 bytes for memory budgeting.
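The sizing rule above reduces to a few lines (a sketch with the documented constants; `batch_size_for` is an illustrative name, not the real function signature):

```rust
// Sketch of memory-aware batch sizing: 25% of available memory, 512 bytes
// estimated per point, clamped to [10_000, 500_000] points.
const MIN_BATCH_POINTS: usize = 10_000;
const MAX_BATCH_POINTS: usize = 500_000;
const EST_BYTES_PER_POINT: usize = 512;

fn batch_size_for(mem_available_bytes: usize) -> usize {
    let budget = mem_available_bytes / 4; // use 25% of MemAvailable
    (budget / EST_BYTES_PER_POINT).clamp(MIN_BATCH_POINTS, MAX_BATCH_POINTS)
}

fn main() {
    // Plenty of RAM: hits the upper clamp.
    assert_eq!(batch_size_for(8 * 1024 * 1024 * 1024), 500_000);
    // Tiny budget (16 MiB): hits the lower clamp.
    assert_eq!(batch_size_for(16 * 1024 * 1024), 10_000);
    // 512 MiB available → 128 MiB budget / 512 B per point.
    assert_eq!(batch_size_for(512 * 1024 * 1024), 262_144);
}
```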

Parallelism

  • CPU: Parquet conversion (schema building, RecordBatch creation, compression) runs in spawn_blocking tasks on the Tokio blocking threadpool.
  • I/O: File writes run in concurrent tokio::spawn tasks.
  • Sequencing: All CPU work completes before I/O begins, ensuring no partial writes.

10. Compaction

Compaction merges small Parquet files for a measurement's time bucket into fewer, larger files. Orchestration lives in src/application/compaction_service.rs (when to run, metadata, storage callbacks, cluster verification). The actual Parquet merge is implemented in src/application/compaction_merge.rs using Apache Arrow only (no chDB). For behavior detail, see Deep Dive: Compaction.

Trigger conditions

  • Runs every compaction.interval_secs (default 30s).
  • Scans metadata for (db, rp, measurement) triples.
  • Groups Parquet files by time bucket (bucket_duration, default 1 hour).
  • Triggers when a measurement has >= min_files_to_compact total files (default 2), then compacts each bucket with >= 2 files.

Process

Implementation: compaction_merge::merge_parquet_files (called from compaction_service).

  1. Open inputs for streaming: For each path, StoragePort::prepare_parquet_stream returns a filesystem path (local) or a temp file (S3 download), avoiding holding full Parquet bytes in memory on the happy path.
  2. Merge (default): merge_parquet_streaming opens one ParquetRecordBatchReader per file, builds a unified schema from Parquet metadata, conforms batches, and performs a k-way merge ordered by time (min-heap), assuming each input file is already sorted by time (true for WAL flush and prior compacts). Accumulator batches flush when row count or estimated size hits limits, then write_batch_with_limit emits one or more output Parquet blobs (ZSTD).
  3. Fallback: If per-file or global time order is violated, the code re-reads full file bytes and runs the legacy path: concat_batches + sort_to_indices / take + write_batch_with_limit.
  4. Register and delete: Register new files in metadata, unregister old paths, delete old objects from storage (still orchestrated in compaction_service).
  5. Orphan cleanup: Remove on-disk Parquet with no metadata entry (in compaction_service).

Steps 2–3 use the same schema-unification, conformance, and size-splitting ideas as before; see Deep Dive: Compaction for the full algorithm.
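The heart of step 2 is a standard k-way merge over already-sorted inputs. A sketch over plain integers (the real code in compaction_merge.rs merges Arrow RecordBatches, not i64 vectors):

```rust
// Sketch of the min-heap k-way merge by `time`: each input is assumed sorted,
// the heap holds one cursor per input, and popping yields globally sorted output.
use std::cmp::Reverse;
use std::collections::BinaryHeap;

fn kway_merge_by_time(inputs: Vec<Vec<i64>>) -> Vec<i64> {
    let mut iters: Vec<_> = inputs.into_iter().map(|v| v.into_iter()).collect();
    let mut heap = BinaryHeap::new();
    // Seed the heap with the first timestamp from each input file.
    for (i, it) in iters.iter_mut().enumerate() {
        if let Some(t) = it.next() {
            heap.push(Reverse((t, i))); // Reverse → min-heap on (time, source index)
        }
    }
    let mut out = Vec::new();
    while let Some(Reverse((t, i))) = heap.pop() {
        out.push(t);
        // Refill from the same source the popped value came from.
        if let Some(next) = iters[i].next() {
            heap.push(Reverse((next, i)));
        }
    }
    out
}

fn main() {
    let merged = kway_merge_by_time(vec![vec![1, 4, 7], vec![2, 5], vec![3, 6]]);
    assert_eq!(merged, vec![1, 2, 3, 4, 5, 6, 7]);
}
```

If an input turns out not to be sorted, this invariant breaks, which is exactly why the legacy full-buffer sort fallback in step 3 exists.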

Tombstones are not applied during compaction. They are applied at query time only, by injecting AND NOT (predicate) clauses into generated SQL.

Safety

Compaction is idempotent. If it crashes mid-way, old files remain intact and the next compaction cycle retries. New files are written with UUID-suffixed names ({HH}_c{8-char-uuid}.parquet) to prevent collisions.

Cluster mode: hash verification and Parquet self-repair

When clustering is enabled and the compaction service is constructed with shared membership (src/bootstrap.rs), each compaction interval, after ordinary merges, also runs verified compaction and, optionally, membership-driven self-repair:

  • Verified compaction hashes per-origin Parquet in cold buckets (older than compaction.verified_compaction_age_secs) against the authoring peer via GET /internal/bucket-hash, using the same bucket_duration as local compaction. On mismatch it re-downloads origin-filtered files from the peer's GET /internal/sync/manifest + /internal/sync/parquet/....
  • Self-repair walks each active peer's manifest (capped by max_repair_checks_per_cycle) to find buckets where that peer authored data but this node's hash differs (including empty vs non-empty), then fetches the same way.

Sync manifest rows carry origin_node_id for precise filtering; older binaries omit the field and peers fall back to _n{id}_ filename tags.

For protocol detail, metrics, and configuration, see Deep Dive: Self-Repair.


11. Query Engine (chDB)

HyperbyteDB uses chDB (embedded ClickHouse) as its query engine. chDB provides the full ClickHouse SQL dialect including window functions, aggregates, and the file() table function.

Session management

chDB::Session is:

  • Send but not Sync — requires Mutex wrapping.
  • Synchronous — all calls block the thread, so they must run in spawn_blocking.

Single session (pool_size = 1)

struct ChdbQueryAdapter {
    session: Arc<Mutex<Session>>,
}

Each query:

  1. Clones the Arc<Mutex<Session>>.
  2. Inside spawn_blocking, calls blocking_lock() and then session.execute(sql, JSONEachRow).
  3. Returns the UTF-8 result string.

Session pool (pool_size > 1)

struct ChdbPool {
    sessions: Vec<Arc<Mutex<Session>>>,
    next: AtomicUsize,
}

Round-robin assignment. Each session has its own session_data_path subdirectory ({base}/pool_{i}). Multiple sessions allow concurrent queries without contending on a single Mutex.
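The round-robin selection is a one-liner over an atomic counter (a sketch with simplified types; the real pool holds Arc<Mutex<Session>> values):

```rust
// Sketch of round-robin session selection: a relaxed fetch_add modulo the pool
// size spreads concurrent queries across sessions without any locking.
use std::sync::atomic::{AtomicUsize, Ordering};

struct Pool {
    size: usize,
    next: AtomicUsize,
}

impl Pool {
    fn pick(&self) -> usize {
        // Relaxed is fine: we only need a distinct-ish counter, not ordering
        // with respect to other memory operations.
        self.next.fetch_add(1, Ordering::Relaxed) % self.size
    }
}

fn main() {
    let pool = Pool { size: 3, next: AtomicUsize::new(0) };
    let picks: Vec<usize> = (0..4).map(|_| pool.pick()).collect();
    assert_eq!(picks, vec![0, 1, 2, 0]); // wraps around after the last session
}
```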

Output format

All queries use OutputFormat::JSONEachRow — one JSON object per result row. This is parsed by the query service into InfluxDB v1 series format.


12. InfluxQL Parser

The parser is a hand-rolled recursive descent parser (no parser generator). It lives in src/influxql/parser.rs.

Parse flow

Input: "SELECT mean(\"value\") FROM \"cpu\" WHERE time > now() - 1h; SHOW DATABASES"
                        split_statements(";")
                        ┌───────────┼───────────┐
                        ▼                       ▼
              parse_statement()        parse_statement()
              first token = "SELECT"   first token = "SHOW"
                        │                       │
                        ▼                       ▼
                parse_select()         parse_show()
                        │                       │
                        ▼                       ▼
              SelectStatement          Statement::ShowDatabases

Statement dispatch

The parser examines the first keyword (case-insensitive) and dispatches:

First token  Handler
SELECT       parse_select()
SHOW         parse_show() → further dispatch by second/third token
CREATE       parse_create() → CREATE DATABASE, CREATE RETENTION POLICY, CREATE USER, CREATE CONTINUOUS QUERY
DROP         parse_drop() → DROP DATABASE, DROP MEASUREMENT, etc.
DELETE       parse_delete()
ALTER        parse_alter()
SET          parse_set_password()
GRANT        parse_grant()
REVOKE       parse_revoke()

Expression parsing

The SELECT field list and WHERE clause use a precedence-climbing expression parser:

Precedence   Operators
1 (lowest)   OR
2            AND
3            =, !=, <>, <, <=, >, >=, =~, !~
4            +, -
5            *, /, %
6 (highest)  Unary -, NOT

Atoms include: identifiers ("column" or bare), string literals ('value'), integer/float literals, duration literals (1h, 30s), now(), function calls (mean(...), derivative(...)), *, regex (/pattern/), and subqueries.

Duration parsing

Suffix  Duration
ns      Nanoseconds
u       Microseconds
ms      Milliseconds
s       Seconds
m       Minutes
h       Hours
d       Days
w       Weeks
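A sketch of the suffix table as code (illustrative function name; note that the multi-character suffixes ns and ms must be tried before s and m, or "1ms" would parse as "1m" plus junk):

```rust
// Sketch of duration-literal parsing per the table above, producing
// nanoseconds. Longest suffixes are matched first (ns/ms before s/m).
fn parse_duration_ns(s: &str) -> Option<i64> {
    let table: [(&str, i64); 8] = [
        ("ns", 1),
        ("ms", 1_000_000),
        ("u", 1_000),
        ("s", 1_000_000_000),
        ("m", 60 * 1_000_000_000),
        ("h", 3_600 * 1_000_000_000),
        ("d", 86_400 * 1_000_000_000),
        ("w", 7 * 86_400 * 1_000_000_000),
    ];
    for (suffix, scale) in table {
        if let Some(num) = s.strip_suffix(suffix) {
            return num.parse::<i64>().ok().map(|n| n * scale);
        }
    }
    None
}

fn main() {
    assert_eq!(parse_duration_ns("1h"), Some(3_600_000_000_000));
    assert_eq!(parse_duration_ns("30s"), Some(30_000_000_000));
    assert_eq!(parse_duration_ns("5ms"), Some(5_000_000));
    assert_eq!(parse_duration_ns("oops"), None);
}
```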

AST types

Key AST nodes (in src/influxql/ast.rs):

  • Statement — enum of all statement types
  • SelectStatement — fields, from, into, condition, group_by, order_by, limit, offset, slimit, soffset, fill, timezone
  • Field — expression + optional alias
  • Expr — recursive expression tree (identifiers, literals, function calls, binary/unary ops, subqueries)
  • FunctionCall — name + args
  • GroupBy — list of Dimension (Time, Tag, Regex)
  • FillOption — Null, None, Previous, Linear, Value(f64)
  • Measurement — optional database, optional RP, name or regex

13. ClickHouse SQL Translator

The translator (src/influxql/to_clickhouse.rs) converts an InfluxQL SelectStatement AST into ClickHouse SQL that operates on Parquet files.

Translation pipeline

 SelectStatement
 ┌─────────────────────────────────────┐
 │ SELECT clause                       │
 │  - Time bucket: toStartOfInterval() │
 │  - GROUP BY tags added to SELECT    │
 │  - Aggregate function mapping       │
 │  - Transform → window functions     │
 │  - fill(N) → ifNull() wrapping      │
 │  - Arithmetic expressions           │
 │  - Default aliases (mean_field)     │
 └──────────────┬──────────────────────┘
 ┌──────────────▼──────────────────────┐
 │ FROM clause                          │
 │  - file('{glob}', Parquet)           │
 │  - Subqueries become inline SELECTs │
 └──────────────┬──────────────────────┘
 ┌──────────────▼──────────────────────┐
 │ WHERE clause                         │
 │  - now() → now64()                   │
 │  - Duration → INTERVAL              │
 │  - Epoch literals → fromUnixTimestamp│
 │  - Regex =~ → match()               │
 │  - Tombstone predicates appended    │
 │  - String comparisons preserved     │
 └──────────────┬──────────────────────┘
 ┌──────────────▼──────────────────────┐
 │ GROUP BY clause                      │
 │  - time(5m) → toStartOfInterval()   │
 │  - Tag dimensions                    │
 └──────────────┬──────────────────────┘
 ┌──────────────▼──────────────────────┐
 │ ORDER BY clause                      │
 │  - WITH FILL for fill modes          │
 │  - INTERPOLATE for fill(previous)    │
 │    and fill(linear)                  │
 └──────────────┬──────────────────────┘
 ┌──────────────▼──────────────────────┐
 │ LIMIT / OFFSET                       │
 └─────────────────────────────────────┘

Function mapping

InfluxQL          ClickHouse
MEAN(f)           avg(f)
MEDIAN(f)         median(f)
COUNT(f)          count(f)
SUM(f)            sum(f)
MIN(f)            min(f)
MAX(f)            max(f)
FIRST(f)          argMin(f, time)
LAST(f)           argMax(f, time)
PERCENTILE(f, N)  quantile(N/100.0)(f)
SPREAD(f)         (max(f) - min(f))
STDDEV(f)         stddevPop(f)
MODE(f)           topKWeighted(1)(f, 1)
DISTINCT(f)       DISTINCT f

Transform function translation

Transforms use ClickHouse window functions:

| InfluxQL | ClickHouse |
|---|---|
| DERIVATIVE(f, 1s) | (f - lagInFrame(f, 1) OVER (ORDER BY __time)) / nullIf(dateDiff('second', lagInFrame(__time, 1) OVER ..., __time), 0) * scale |
| NON_NEGATIVE_DERIVATIVE(...) | Same as above, wrapped in greatest(..., 0) |
| DIFFERENCE(f) | f - lagInFrame(f, 1) OVER (ORDER BY __time) |
| MOVING_AVERAGE(f, N) | avg(f) OVER (ORDER BY __time ROWS BETWEEN N-1 PRECEDING AND CURRENT ROW) |
| CUMULATIVE_SUM(f) | sum(f) OVER (ORDER BY __time ROWS UNBOUNDED PRECEDING) |
| ELAPSED(f, unit) | dateDiff('unit', lagInFrame(__time, 1) OVER ..., __time) |
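As a reference for the intended semantics, DIFFERENCE subtracts each row's predecessor, which is exactly what lagInFrame(f, 1) provides on the ClickHouse side. A tiny Rust model (not the translator itself, just the row-level behavior):

```rust
/// Reference semantics of DIFFERENCE(f): each row minus the previous row,
/// mirroring f - lagInFrame(f, 1) OVER (ORDER BY __time). The first row has
/// no predecessor and produces no output, matching InfluxQL.
fn difference(values: &[f64]) -> Vec<f64> {
    values.windows(2).map(|w| w[1] - w[0]).collect()
}

fn main() {
    println!("{:?}", difference(&[1.0, 4.0, 9.0])); // [3.0, 5.0]
}
```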

Time bucket translation

-- InfluxQL: GROUP BY time(5m)
-- ClickHouse:
toStartOfInterval(time, INTERVAL 5 MINUTE) AS __time

-- InfluxQL: GROUP BY time(1h, 15m)  -- offset
-- ClickHouse:
toStartOfInterval(time - INTERVAL 15 MINUTE, INTERVAL 1 HOUR) + INTERVAL 15 MINUTE AS __time

The internal alias __time avoids collision with the raw time column. It's renamed back to time in the result parser.
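The rewrite above can be sketched as a small string builder. This is a simplified version handling only minute-granularity intervals (the real translator supports all InfluxQL duration units); `time_bucket` is a hypothetical name:

```rust
/// Render GROUP BY time(interval[, offset]) as a ClickHouse expression,
/// following the two examples above. Sketch restricted to minutes.
fn time_bucket(interval_min: u32, offset_min: u32) -> String {
    if offset_min == 0 {
        format!("toStartOfInterval(time, INTERVAL {interval_min} MINUTE) AS __time")
    } else {
        // Shift by the offset, bucket, then shift back, so buckets are
        // aligned to the offset rather than to the epoch.
        format!(
            "toStartOfInterval(time - INTERVAL {offset_min} MINUTE, INTERVAL {interval_min} MINUTE) + INTERVAL {offset_min} MINUTE AS __time"
        )
    }
}

fn main() {
    println!("{}", time_bucket(5, 0));
}
```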

Fill translation

| Fill mode | ClickHouse |
|---|---|
| fill(null) | ORDER BY __time WITH FILL FROM ... TO ... STEP INTERVAL ... |
| fill(none) | No WITH FILL |
| fill(0) | ifNull(agg, 0) + WITH FILL |
| fill(previous) | WITH FILL + INTERPOLATE (col AS col) |
| fill(linear) | WITH FILL + INTERPOLATE (col AS col USING LINEAR) |

14. Retention Enforcement

The RetentionService runs every 60 seconds:

  1. Lists all databases from metadata.
  2. For each database, iterates retention policies.
  3. For RPs with a finite duration, calculates the cutoff time: now - duration.
  4. Queries the Parquet file registry for files with max_time < cutoff.
  5. Deletes those files from storage and removes their metadata entries.

Data in the WAL that has not yet been flushed is not affected by retention. It will be filtered at flush time if the timestamps fall outside the retention window.
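Step 4 is a simple age filter over the file registry. A sketch under the assumption that the registry entry carries a max_time in epoch seconds (`ParquetEntry` and `expired` are illustrative names):

```rust
/// Sketch of retention step 4: select registry entries whose newest point
/// is older than the cutoff (now - duration). Timestamps in epoch seconds.
struct ParquetEntry {
    path: String,
    max_time: u64,
}

fn expired(files: &[ParquetEntry], now: u64, duration_secs: u64) -> Vec<&str> {
    let cutoff = now.saturating_sub(duration_secs);
    files
        .iter()
        .filter(|f| f.max_time < cutoff) // strictly older than the cutoff
        .map(|f| f.path.as_str())
        .collect()
}

fn main() {
    let files = vec![
        ParquetEntry { path: "old.parquet".into(), max_time: 100 },
        ParquetEntry { path: "new.parquet".into(), max_time: 900 },
    ];
    println!("{:?}", expired(&files, 1000, 500)); // ["old.parquet"]
}
```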


15. DELETE and Tombstones

DELETE uses a tombstone-based approach:

On DELETE execution

  1. Parse the DELETE statement to extract measurement name and WHERE predicate.
  2. Convert the WHERE predicate to a ClickHouse SQL fragment.
  3. Store a tombstone record in metadata:
    tombstone:{db}:{measurement}:{uuid} → {"predicate_sql": "time < ...", "created_at": "..."}
    
  4. In cluster mode, replicate the DELETE mutation to all peers.

On query execution

Before translating a SELECT query, the query service loads all tombstones for the measurement and passes them to the translator. The translator appends AND NOT (predicate) for each tombstone to the WHERE clause.
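The predicate rewrite itself is mechanical: each tombstone's SQL fragment is negated and conjoined onto the existing WHERE clause. A minimal sketch (`apply_tombstones` is an illustrative name, not the actual translator API):

```rust
/// Append AND NOT (...) for each tombstone predicate to a WHERE clause,
/// as the translator does at query time. Names are illustrative.
fn apply_tombstones(where_sql: &str, tombstones: &[&str]) -> String {
    let mut out = where_sql.to_string();
    for pred in tombstones {
        // Parenthesize each predicate so operator precedence inside the
        // tombstone SQL cannot leak into the surrounding clause.
        out.push_str(&format!(" AND NOT ({pred})"));
    }
    out
}

fn main() {
    println!("{}", apply_tombstones("host = 'a'", &["time < 100"]));
    // host = 'a' AND NOT (time < 100)
}
```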

On compaction

Tombstones are not applied during compaction. The compaction service merges Parquet files as-is without filtering rows. Tombstones remain in metadata and continue to be applied at query time until they are manually removed. This design keeps compaction simple and idempotent.


16. Continuous Queries

Storage

CQ definitions are stored in metadata under cq:{db}:{name} keys:

struct ContinuousQueryDef {
    name: String,
    database: String,
    query_text: String,          // The full SELECT ... INTO ... statement
    resample_every_secs: u64,    // Execution interval
    resample_for_secs: u64,      // Look-back window
    created_at: String,          // RFC3339 timestamp
}

Execution

The ContinuousQueryService runs a loop with a 10-second tick:

  1. Load all CQ definitions from metadata across all databases.
  2. For each CQ, check if resample_every_secs has elapsed since the last execution.
  3. If due, execute the query via the QueryService.
  4. The SELECT ... INTO ... clause in the query writes results to the target measurement.
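Step 2's due check reduces to comparing elapsed time against the resample interval. A sketch in epoch seconds (`is_due` is an illustrative name):

```rust
/// Sketch of the CQ scheduler's due check: a CQ fires when at least
/// resample_every_secs have passed since its last run.
fn is_due(now: u64, last_run: u64, resample_every_secs: u64) -> bool {
    now.saturating_sub(last_run) >= resample_every_secs
}

fn main() {
    assert!(is_due(100, 40, 60));  // 60s elapsed, due
    assert!(!is_due(100, 50, 60)); // only 50s elapsed, not yet
}
```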

Cluster behavior

CQ create/drop operations are replicated to all peers. Each node independently runs its own CQ scheduler. This means CQs may execute on multiple nodes simultaneously, which is safe because writes are idempotent (same timestamps + tags = same series).


17. Authentication Internals

Password storage

Passwords are hashed using Argon2 with random salts via SaltString::generate(OsRng). The resulting hash string (in PHC format) is stored in the metadata store under user:{username}.

Credential extraction order

The auth middleware checks three sources in order:

  1. Query parameters: u and p
  2. HTTP Basic auth: Authorization: Basic <base64(user:pass)>
  3. Token auth: Authorization: Token user:pass

The first match wins. If none match and auth is enabled, the request is rejected with 401.

Base64 decoding

A minimal hand-rolled Base64 decoder is used (no external dependency) for parsing Basic auth headers.
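In the spirit of that decoder, here is a minimal standard-alphabet Base64 decoder over the stdlib only. This is illustrative, not the actual code in src/; it ignores padding and rejects characters outside the standard alphabet:

```rust
/// Minimal standard-alphabet Base64 decoder (illustrative sketch).
/// Returns None on any character outside the alphabet.
fn b64_decode(input: &str) -> Option<Vec<u8>> {
    fn val(c: u8) -> Option<u32> {
        match c {
            b'A'..=b'Z' => Some((c - b'A') as u32),
            b'a'..=b'z' => Some((c - b'a' + 26) as u32),
            b'0'..=b'9' => Some((c - b'0' + 52) as u32),
            b'+' => Some(62),
            b'/' => Some(63),
            _ => None,
        }
    }
    let bytes: Vec<u8> = input.trim_end_matches('=').bytes().collect();
    let mut out = Vec::new();
    for chunk in bytes.chunks(4) {
        // Pack up to four 6-bit symbols into a 24-bit accumulator.
        let mut acc = 0u32;
        for &c in chunk {
            acc = (acc << 6) | val(c)?;
        }
        let n = chunk.len();
        acc <<= 6 * (4 - n) as u32; // left-align a short final chunk
        // n symbols yield n-1 output bytes, taken from the top down.
        for i in 0..n.saturating_sub(1) {
            out.push((acc >> (16 - 8 * i)) as u8);
        }
    }
    Some(out)
}

fn main() {
    let creds = b64_decode("dXNlcjpwYXNz").unwrap();
    println!("{}", String::from_utf8(creds).unwrap()); // user:pass
}
```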

Verification

Argon2::default().verify_password(input_bytes, &stored_hash)

Uses the default Argon2id variant with parameters from the stored hash.


18. Clustering and Replication

Model

HyperbyteDB uses master-master (peer-to-peer) replication for data writes, with Raft consensus (via openraft) for schema mutations. Every node accepts reads and writes. Data writes are replicated asynchronously to all peers. Schema-mutating operations (CREATE/DROP DATABASE, DELETE, user/CQ/RP management) are routed through Raft to ensure consistent ordering across the cluster. For a comprehensive treatment, see Deep Dive: Clustering.

Replicated operations

| Operation | Endpoint | Replication target |
|---|---|---|
| Write (line protocol) | /internal/replicate | All peers |
| CREATE DATABASE | /internal/replicate-mutation | All peers |
| DROP DATABASE | /internal/replicate-mutation | All peers |
| DELETE | /internal/replicate-mutation | All peers |
| CREATE USER | /internal/replicate-mutation | All peers |
| DROP USER | /internal/replicate-mutation | All peers |
| CREATE CONTINUOUS QUERY | /internal/replicate-mutation | All peers |
| DROP CONTINUOUS QUERY | /internal/replicate-mutation | All peers |
| CREATE RETENTION POLICY | /internal/replicate-mutation | All peers |

Replication protocol

  1. Client writes to node A.
  2. Node A persists locally (WAL + metadata).
  3. Node A returns 204 to the client.
  4. Node A spawns an async task that POSTs to each peer's /internal/replicate endpoint.
  5. Each POST includes an X-Hyperbytedb-Replicated: true header.
  6. The receiving node checks for this header; if present, it persists locally but does not re-replicate (preventing loops).
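The loop-prevention check in step 6 can be sketched as a predicate over the incoming request's headers (`should_replicate` is an illustrative name; the real code inspects axum request headers):

```rust
/// Loop-prevention sketch for step 6: a write that already carries the
/// replication header is persisted locally but never forwarded again.
const REPL_HEADER: &str = "X-Hyperbytedb-Replicated";

fn should_replicate(headers: &[(&str, &str)]) -> bool {
    // Header names are case-insensitive per HTTP semantics.
    !headers
        .iter()
        .any(|(k, v)| k.eq_ignore_ascii_case(REPL_HEADER) && *v == "true")
}

fn main() {
    assert!(should_replicate(&[]));
    assert!(!should_replicate(&[("x-hyperbytedb-replicated", "true")]));
}
```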

MutationRequest types

enum MutationRequest {
    CreateDatabase(String),
    DropDatabase(String),
    CreateRetentionPolicy { db, rp },
    CreateUser { username, password_hash, admin },
    DropUser(String),
    Delete { database, measurement, predicate_sql },
    CreateContinuousQuery { database, name, definition },
    DropContinuousQuery { database, name },
}

Failure handling

  • Replication is fire-and-forget with logging. If a peer is unreachable, the error is logged at WARN level.
  • There is no retry queue or WAL replay for failed replications.
  • On peer recovery, data can be re-synchronized via backup/restore from a healthy node.

Network requirements

  • All nodes must be reachable by all other nodes on their cluster_addr and port.
  • The peers list should not include the node's own address (filtered at startup).
  • HTTP timeout for replication requests: 10 seconds.

19. Background Services

HyperbyteDB runs five background services as Tokio tasks:

| Service | Interval | Purpose |
|---|---|---|
| Flush | flush.interval_secs (default 10s) | WAL → Parquet |
| Compaction | compaction.interval_secs (default 30s) | Merge small Parquet files; in cluster mode, the same tick may run verified compaction and membership self-repair |
| Retention | 60s (fixed) | Delete expired Parquet files |
| Continuous Query | 10s (fixed) | Execute CQ schedules |
| Cluster Heartbeat | 60s (fixed, cluster mode only) | Log cluster status |

All services listen on a watch::Receiver<bool> for graceful shutdown. On ctrl+c:

  1. The shutdown signal is sent via watch::channel.
  2. Each service finishes its current iteration.
  3. The flush service performs one final flush.
  4. The main task awaits all service handles.
  5. Logs "HyperbyteDB shut down cleanly".

20. Error Handling

HyperbytedbError

All internal errors are represented by the HyperbytedbError enum:

| Variant | HTTP status | When |
|---|---|---|
| DatabaseNotFound(name) | 404 | Query or write to a non-existent DB |
| RetentionPolicyNotFound(name) | 404 | Reference to a non-existent RP |
| FieldTypeConflict{field, measurement, got, expected} | 400 | Write sends the wrong type for an existing field |
| LineProtocolParse{line, reason} | 400 | Malformed line protocol |
| QueryParse(msg) | 400 | Invalid InfluxQL syntax |
| AuthFailed | 401 | Bad credentials |
| DatabaseRequired | 400 | /write without a db parameter |
| MissingParameter(name) | 400 | /query without a q parameter |
| CardinalityExceeded{...} | 422 | Tag cardinality limit hit |
| QueryTimeout | 408 | Query exceeded query_timeout_secs |
| Wal(msg) | 500 | RocksDB WAL error |
| Storage(msg) | 500 | File I/O or S3 error |
| Chdb(msg) | 500 | chDB execution error |
| Metadata(msg) | 500 | RocksDB metadata error |
| Internal(msg) | 500 | Serialization or other internal error |

Error responses follow InfluxDB v1 format:

{"error": "database not found: \"nonexistent\""}

21. Observability

Metrics

Uses the metrics crate with metrics-exporter-prometheus:

| Metric | Type | Description |
|---|---|---|
| hyperbytedb_write_requests_total | counter | Write requests received |
| hyperbytedb_query_requests_total | counter | Query requests received |
| hyperbytedb_query_errors_total | counter | Failed queries |
| hyperbytedb_query_duration_seconds | histogram | Query latency distribution |
| hyperbytedb_ingestion_points_total | counter | Points ingested |
| hyperbytedb_flush_duration_seconds | histogram | Flush cycle duration |
| hyperbytedb_flush_points_total | counter | Points flushed to Parquet |

Tracing

Uses tracing + tracing-subscriber with configurable filter levels via the RUST_LOG environment variable or the [logging] config section.

Structured JSON logging is available with format = "json".

Health endpoint

GET /health returns:

{"status": "pass", "message": "ready for queries and writes"}

Always returns 200 as long as the HTTP server is running.


22. Concurrency Model

HyperbyteDB is built on Tokio with a multi-threaded runtime (#[tokio::main] with features = ["full"]).

Thread usage

| Work | Thread type | Notes |
|---|---|---|
| HTTP request handling | Tokio async workers | Non-blocking |
| InfluxQL parsing | Tokio async workers | CPU-bound but fast |
| WAL operations | Tokio async workers | RocksDB ops are synchronous but fast |
| chDB query execution | spawn_blocking pool | chDB Session is synchronous |
| Parquet conversion | spawn_blocking pool | CPU-intensive Arrow operations |
| Parquet I/O | tokio::spawn async tasks | Concurrent file writes |
| Peer replication | tokio::spawn async tasks | Non-blocking HTTP POSTs |

Synchronization

| Resource | Mechanism |
|---|---|
| WAL sequence number | AtomicU64 (lock-free) |
| chDB Session | tokio::sync::Mutex (one per session) |
| Last flushed sequence | tokio::sync::Mutex<u64> |
| Shutdown signal | tokio::sync::watch channel |

23. Dependencies

Core runtime

| Crate | Version | Purpose |
|---|---|---|
| tokio | 1.x | Async runtime |
| axum | 0.8 | HTTP framework |
| axum-server | 0.7 | TLS support |
| tower / tower-http | 0.5 / 0.6 | Middleware (tracing, CORS, timeout) |
| hyper | 1.x | HTTP transport |

Storage

| Crate | Version | Purpose |
|---|---|---|
| rocksdb | 0.22 | WAL and metadata store |
| arrow | 54 | In-memory columnar format |
| parquet | 54 | Columnar file format |
| object_store | 0.12 | Local + S3 filesystem abstraction |
| chdb-rust | 1.3 | Embedded ClickHouse query engine |

Serialization

| Crate | Version | Purpose |
|---|---|---|
| serde / serde_json | 1.x | JSON serialization |
| bincode | 1.x | Binary WAL entry serialization |
| serde_urlencoded | 0.7 | Form-encoded POST body parsing |

Parsing and protocol

| Crate | Version | Purpose |
|---|---|---|
| influxdb-line-protocol | 2.x | Line protocol parsing |
| regex | 1.x | Regular expression support |

Configuration

| Crate | Version | Purpose |
|---|---|---|
| figment | 0.10 | Config from TOML + env vars |
| clap | 4.x | CLI argument parsing |

Observability

| Crate | Version | Purpose |
|---|---|---|
| tracing / tracing-subscriber | 0.1 / 0.3 | Structured logging |
| metrics / metrics-exporter-prometheus | 0.24 / 0.16 | Prometheus metrics |

Auth and crypto

| Crate | Version | Purpose |
|---|---|---|
| argon2 | 0.5 | Password hashing |
| rand_core | 0.6 | Cryptographic RNG for salt generation |

Utilities

| Crate | Version | Purpose |
|---|---|---|
| chrono | 0.4 | Date/time handling |
| uuid | 1.x | Request IDs, Parquet file suffixes |
| bytes | 1.x | Zero-copy byte buffers |
| futures | 0.3 | Async stream utilities |
| async-trait | 0.1 | Async trait methods |
| thiserror | 2.x | Error derive macros |
| anyhow | 1.x | Top-level error handling |
| flate2 | 1.x | Gzip decompression |
| reqwest | 0.12 | HTTP client for peer replication |
| openraft | 0.10 | Raft consensus for schema mutations |
| indexmap | 2.x | Insertion-ordered maps |
| crc32fast | 1.x | CRC32 checksums for Parquet files |
| sha2 | 0.10 | SHA-256 hashing for Merkle trees |

24. Statement Summary

The StatementSummary service tracks recently executed InfluxQL statements for debugging and observability. When enabled (statement_summary.enabled = true), it records the normalized query text, digest, execution time, and error status for up to max_entries (default 1,000) recent statements in a bounded ring buffer. Results are exposed via GET /api/v1/statements.
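The bounded ring buffer behaves like a VecDeque that evicts its oldest entry once full. A sketch under that assumption (`StatementRing` and its fields are illustrative names, and the real entries carry digest, timing, and error status rather than a bare string):

```rust
use std::collections::VecDeque;

/// Bounded ring buffer in the spirit of the statement summary: once
/// max_entries is reached, the oldest record is dropped. Illustrative.
struct StatementRing {
    max_entries: usize,
    entries: VecDeque<String>,
}

impl StatementRing {
    fn new(max_entries: usize) -> Self {
        Self { max_entries, entries: VecDeque::new() }
    }

    fn record(&mut self, stmt: String) {
        if self.entries.len() == self.max_entries {
            self.entries.pop_front(); // evict the oldest statement
        }
        self.entries.push_back(stmt);
    }
}

fn main() {
    let mut ring = StatementRing::new(2);
    for q in ["a", "b", "c"] {
        ring.record(q.to_string());
    }
    println!("{:?}", ring.entries); // ["b", "c"]
}
```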


25. Debug Binary

The hyperbytedb-debug binary (src/bin/hyperbytedb_debug.rs) provides a CLI for inspecting cluster state without connecting to the HTTP API. Subcommands include cluster status, topology, health checks, replication state, Raft status, manifest inspection, and metrics.


26. Kubernetes Operator

The hyperbytedb-operator/ directory contains a Go-based Kubernetes operator built with Kubebuilder. It defines a HyperbytedbCluster CRD for declarative multi-node cluster management, handling StatefulSet creation, peer configuration, and rolling updates.


27. Deterministic Simulation Testing (HyperSim)

HyperbyteDB is tested with a TigerBeetle-style Deterministic Simulation Testing framework called HyperSim, living in the sibling crate hypersim/. HyperSim runs the real HyperbyteDB domain code (write path, WAL sequencing, compaction, retention, query routing) inside a single-threaded, tick-driven simulator with the network, storage, clock, and process lifecycle replaced by deterministic in-process models.

The port-adapter boundary is the simulation seam

The hexagonal architecture described in § 2 is what makes simulation viable. HyperSim binds alternative adapters to the same port traits:

| Port | Production adapter | Simulation adapter (#[cfg(feature = "sim")]) |
|---|---|---|
| WalPort | RocksDbWal | MemoryWal (BTreeMap-backed) |
| StoragePort | Parquet + object_store | MemoryStorage (BTreeMap<String, Bytes>) |
| HttpTransport | ReqwestTransport | SimTransport (routes through SimNetwork) |
| Wall clock | SystemTime | VirtualClock |

RocksDB cannot run inside a deterministic simulator — its background threads, wall-clock rate limiters, and direct syscalls break every DST guarantee. The in-memory adapters are not test mocks; they are first-class adapters whose role is "drive the simulator." This mirrors exactly how TigerBeetle's VOPR replaces the storage engine at the port boundary.

Port-contract tests pin the two adapter families together

hyperbytedb/tests/wal_port_contract.rs runs an arbitrary seeded op stream against both MemoryWal and RocksDbWal and asserts byte-for-byte output equivalence after every operation. This ensures HyperSim findings transfer to the production binary — any divergence caught here is a bug in one adapter or the other. A StoragePort contract test is planned as a follow-up.

CI integration

On every PR, CI runs 100 deterministic simulations (1 virtual hour each, ~5 minutes wall time on 8 cores), alternating safety-mode and liveness-mode fault profiles. Every run is reproducible from a single seed; failure traces are uploaded as GitHub artifacts for offline replay with hypersim replay.

See ../../../hypersim/README.md for the full architecture, fault model, CLI reference, and determinism checklist for contributors.


Deep Dive Documents

For detailed technical documentation on specific subsystems, see: