HyperbyteDB system architecture¶
This document is a long-form internal design walkthrough (module layout, paths, storage). It complements the shorter Architecture page and the deep dive series. If anything disagrees with src/ or a deep dive, prefer the code and the focused deep dive.
It is intended for contributors who need to understand how HyperbyteDB works under the hood.
Table of Contents¶
- Architecture Overview
- Hexagonal Architecture
- Module Structure
- Write Path
- Query Path
- WAL (Write-Ahead Log)
- Parquet Storage
- Metadata Store
- Flush Pipeline
- Compaction
- Query Engine (chDB)
- InfluxQL Parser
- ClickHouse SQL Translator
- Retention Enforcement
- DELETE and Tombstones
- Continuous Queries
- Authentication Internals
- Clustering and Replication
- Background Services
- Error Handling
- Observability
- Concurrency Model
- Dependencies
- Statement Summary
- Debug Binary
- Kubernetes Operator
- Deterministic Simulation Testing (HyperSim)
1. Architecture Overview¶
HyperbyteDB combines four storage/compute engines into one binary:
Client (Telegraf, Grafana, curl)
│
▼
┌─────────────────────────────┐
│ HTTP Layer (axum) │ Line protocol, InfluxQL, auth, gzip
└──────────────┬──────────────┘
│
▼
┌─────────────────────────────┐
│ Application Services │ Ingestion, Query, Flush, Compaction,
│ │ Retention, Continuous Queries
└──────────────┬──────────────┘
│
▼
┌─────────────────────────────┐
│ Port Traits (interfaces) │ WalPort, StoragePort, QueryPort,
│ │ MetadataPort, IngestionPort
└──────────────┬──────────────┘
│
┌────────┼────────┬───────────┐
▼ ▼ ▼ ▼
┌───────┐ ┌──────────┐ ┌──────┐ ┌──────┐
│RocksDB│ │ Parquet  │ │ chDB │ │ S3/  │
│(WAL + │ │ (Arrow)  │ │(Click│ │Local │
│ meta) │ │          │ │House)│ │ FS   │
└───────┘ └──────────┘ └──────┘ └──────┘
RocksDB provides the WAL (durable, ordered write log) and metadata store (databases, measurements, schemas, users, tombstones, CQ definitions, Parquet file registry).
Apache Parquet is the long-term storage format. Points are batched, converted to Arrow RecordBatches, and written as ZSTD-compressed Parquet files (ZSTD level 1, row groups of 65,536 rows, page-level statistics enabled).
chDB (embedded ClickHouse) is the query engine. InfluxQL is transpiled to ClickHouse SQL that uses the file() table function to scan Parquet files directly.
object_store abstracts local filesystem and S3-compatible storage for Parquet files.
2. Hexagonal Architecture¶
HyperbyteDB uses the hexagonal (ports and adapters) pattern. Business logic lives in the application and domain layers and depends only on port traits, never on concrete implementations.
┌────────────────────────────────────┐
│ Domain Layer │
│ Point, FieldValue, Database, │
│ RetentionPolicy, Precision, │
│ QueryResponse, SeriesResult │
└────────────────────────────────────┘
│
┌────────────────────────────────────┐
│ Application Services │
│ IngestionService, QueryService, │
│ FlushService, CompactionService, │
│ RetentionService, CQService │
└──────────┬────────────┬────────────┘
│ depends │
│ only on │
▼ ports ▼
┌─────────────────────────────────────────────┐
│ Port Traits │
│ WalPort StoragePort QueryPort │
│ MetadataPort IngestionPort │
└────────────┬────────────────┬───────────────┘
│ │
┌─────────────▼──┐ ┌────────▼───────────┐
│ Adapters │ │ Adapters │
│ (inbound) │ │ (outbound) │
│ HTTP handlers, │ │ RocksDB WAL, │
│ Peer handlers │ │ RocksDB Metadata, │
│ │ │ Parquet Writer, │
│ │ │ Local/S3 Storage, │
│ │ │ chDB Query Adapter │
└────────────────┘ └─────────────────────┘
This means:
- Swapping RocksDB for another WAL requires only implementing WalPort.
- Swapping chDB for another SQL engine requires only implementing QueryPort.
- The HTTP layer can be replaced without touching business logic.
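The swap-the-adapter idea can be sketched in a few lines. This is an illustrative reduction, not the real WalPort signature: the method names and the in-memory adapter are assumptions for the example.

```rust
// Minimal ports-and-adapters sketch. The trait is what services depend on;
// the in-memory adapter stands in for the RocksDB one.
use std::collections::BTreeMap;

trait WalPort {
    fn append(&mut self, entry: Vec<u8>) -> u64;
    fn last_sequence(&self) -> u64;
}

struct InMemoryWal {
    entries: BTreeMap<u64, Vec<u8>>,
    seq: u64,
}

impl WalPort for InMemoryWal {
    fn append(&mut self, entry: Vec<u8>) -> u64 {
        self.seq += 1;
        self.entries.insert(self.seq, entry);
        self.seq
    }
    fn last_sequence(&self) -> u64 {
        self.seq
    }
}

// A service sees only the trait, never the concrete adapter.
fn ingest(wal: &mut dyn WalPort, body: &[u8]) -> u64 {
    wal.append(body.to_vec())
}

fn main() {
    let mut wal = InMemoryWal { entries: BTreeMap::new(), seq: 0 };
    let seq = ingest(&mut wal, b"cpu,host=a usage=1 1700000000000000000");
    assert_eq!(seq, 1);
    assert_eq!(wal.last_sequence(), 1);
}
```

Replacing RocksDB here means writing another `impl WalPort`; `ingest` is untouched.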
3. Module Structure¶
src/
├── main.rs CLI, server bootstrap, graceful shutdown
├── lib.rs Top-level module declarations
├── bootstrap.rs Composition root: builds all adapters and services
├── config.rs Figment-based configuration loading
├── error.rs HyperbytedbError enum + From impls
│
├── domain/
│ ├── point.rs Point, FieldValue (Float/Integer/UInteger/String/Boolean)
│ ├── series.rs SeriesKey (measurement + sorted tags)
│ ├── database.rs Database, RetentionPolicy, Precision
│ ├── query_result.rs QueryResponse, StatementResult, SeriesResult
│ ├── wal.rs WalEntry struct
│ ├── measurement.rs MeasurementMeta
│ ├── user.rs StoredUser (auth)
│ ├── continuous_query.rs ContinuousQueryDef
│ └── storage_layout.rs StorageLayout path helpers
│
├── ports/
│ ├── ingestion.rs trait IngestionPort
│ ├── query.rs trait QueryPort, trait QueryService
│ ├── storage.rs trait StoragePort
│ ├── wal.rs trait WalPort
│ ├── metadata.rs trait MetadataPort
│ └── auth.rs trait AuthPort
│
├── adapters/
│ ├── http/
│ │ ├── router.rs Route definitions, AppState, build_router()
│ │ ├── write.rs POST /write (line protocol, gzip)
│ │ ├── query.rs GET/POST /query (InfluxQL, bind params, CSV, chunked)
│ │ ├── ping.rs /ping, /health
│ │ ├── metrics.rs /metrics (Prometheus)
│ │ ├── auth_middleware.rs Auth extraction & verification
│ │ ├── middleware.rs Version headers, Request-Id
│ │ ├── error.rs HyperbytedbError → HTTP response mapping
│ │ ├── response.rs InfluxDB error JSON formatting
│ │ ├── cluster.rs /cluster/metrics, /cluster/nodes
│ │ ├── peer_handlers.rs /internal/replicate, sync, membership, drain
│ │ ├── raft_handlers.rs /internal/raft/vote, append, snapshot, membership
│ │ └── statements.rs /api/v1/statements (statement summary)
│ │
│ ├── chdb/
│ │ ├── query_adapter.rs Single-session chDB wrapper (QueryPort)
│ │ └── pool.rs Multi-session pool with concurrency limit
│ │
│ ├── storage/
│ │ ├── layout.rs Re-exports StorageLayout from domain
│ │ ├── parquet_writer.rs Arrow schema + RecordBatch → Parquet bytes (ZSTD)
│ │ ├── local.rs Local filesystem StoragePort
│ │ └── s3.rs S3-compatible StoragePort (object_store)
│ │
│ ├── wal/
│ │ └── rocksdb_wal.rs RocksDB WAL implementation
│ │
│ ├── metadata/
│ │ └── rocksdb_meta.rs RocksDB metadata implementation
│ │
│ └── auth.rs MetadataAuthAdapter
│
├── application/
│ ├── ingestion_service.rs Line protocol → metadata → WAL
│ ├── peer_ingestion_service.rs Same + fan-out replication to peers
│ ├── query_service.rs InfluxQL dispatch, chDB execution, result formatting
│ ├── peer_query_service.rs Same + Raft-based mutation replication
│ ├── flush_service.rs Background WAL → Parquet flush
│ ├── compaction_service.rs Compaction orchestration: periodic merge, `compact_all`, verified compaction, self-repair, orphan cleanup
│ ├── compaction_merge.rs Parquet merge engine: Arrow streaming k-way merge by `time`, legacy full-buffer sort fallback
│ ├── retention_service.rs Background expired-data cleanup
│ ├── continuous_query_service.rs Background CQ scheduler
│ └── statement_summary.rs Query statement tracking
│
├── influxql/
│ ├── ast.rs All AST node types (Statement, SelectStatement, Expr, etc.)
│ ├── parser.rs Hand-rolled recursive descent InfluxQL parser
│ ├── to_clickhouse.rs AST → ClickHouse SQL translation
│ └── digest.rs Query digest/canonicalization
│
├── cluster/
│ ├── bootstrap.rs ClusterBootstrap: startup sync, Raft init
│ ├── membership.rs ClusterMembership, SharedMembership, NodeState, NodeInfo
│ ├── peer_client.rs HTTP client for write/mutation replication
│ ├── sync_client.rs Sync client for startup/full/reconnect sync
│ ├── sync.rs SyncManifest, VerifyRequest/Response, DivergentBucket
│ ├── merkle.rs Merkle tree construction for anti-entropy
│ ├── replication_log.rs RocksDB-backed replication/mutation ack tracking
│ ├── drain.rs DrainService: graceful node shutdown
│ ├── types.rs MutationRequest (replication uses line protocol + headers)
│ └── raft/
│ ├── mod.rs TypeConfig, HyperbytedbRaft type alias
│ ├── types.rs ClusterRequest, ClusterResponse
│ ├── log_store.rs RaftStore: RocksDB-backed Raft log/state
│ ├── network.rs HTTP-based Raft network transport
│ └── state_machine.rs Raft state machine, schema mutation application
│
└── bin/
└── hyperbytedb_debug.rs CLI for cluster status, topology, Raft, metrics
4. Write Path¶
The write path is optimized for low-latency ingestion. Data is durable the moment the WAL append returns, and becomes queryable after the next flush cycle. For an exhaustive treatment of every step, see Deep Dive: Write Path.
Client POST /write?db=mydb&precision=ns
│
▼
┌─────────────────────────────────────────┐
│ write.rs: handle_write() │
│ 1. Extract db, rp, precision params │
│ 2. Gzip decompress if needed │
│ 3. Call IngestionPort.ingest() │
└──────────────┬──────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ ingestion_service.rs │
│ 1. Verify database exists (metadata) │
│ 2. Parse line protocol body │
│ 3. Convert ParsedLine → Vec<Point> │
│ - Apply precision to timestamps │
│ - Default to current time if absent │
│ 4. Register field types + tag keys │
│ in metadata │
│ 5. Check cardinality limits │
│ 6. Store tag values for SHOW queries │
│ 7. Append WalEntry to WAL │
└──────────────┬──────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ rocksdb_wal.rs: append() │
│ 1. Atomic fetch_add sequence number │
│ 2. bincode::serialize(WalEntry) │
│ 3. WriteBatch: put to "wal" CF + │
│ update "last_seq" in "wal_meta" CF │
│ 4. Return sequence number │
└─────────────────────────────────────────┘
In cluster mode, PeerIngestionService wraps the base service and, after the local WAL append succeeds, fires off async HTTP POST requests to all peers via PeerClient.replicate_write(). The local write returns 204 immediately without waiting for replication.
Data types¶
The Point struct carries:
- measurement: String — measurement name
- tags: BTreeMap<String, String> — sorted tag key-value pairs
- fields: BTreeMap<String, FieldValue> — field key-value pairs
- timestamp: i64 — nanoseconds since Unix epoch
FieldValue has five variants:
| Variant | Discriminant | Arrow type | Parquet type |
|---|---|---|---|
Float(f64) | 0 | Float64 | DOUBLE |
Integer(i64) | 1 | Int64 | INT64 |
UInteger(u64) | 2 | UInt64 | INT64 (unsigned) |
String(String) | 3 | Utf8 | BYTE_ARRAY |
Boolean(bool) | 4 | Boolean | BOOLEAN |
Field types are registered on first write and enforced on subsequent writes. A write that sends an integer where a float was previously registered returns a FieldTypeConflict error (HTTP 400).
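The first-write-wins typing rule can be sketched with the discriminants from the table above. The helper name and error shape here are illustrative, not the real ingestion-service code.

```rust
// Sketch of first-write-wins field typing.
// FieldValue discriminants: 0=Float, 1=Integer, 2=UInteger, 3=String, 4=Boolean.
use std::collections::HashMap;

fn register_field(
    field_types: &mut HashMap<String, u8>,
    name: &str,
    discriminant: u8,
) -> Result<(), String> {
    match field_types.get(name) {
        // First write registers the type.
        None => {
            field_types.insert(name.to_string(), discriminant);
            Ok(())
        }
        // Subsequent writes must match it.
        Some(&d) if d == discriminant => Ok(()),
        Some(&d) => Err(format!(
            "FieldTypeConflict: field {name} registered as {d}, got {discriminant}"
        )),
    }
}

fn main() {
    let mut types = HashMap::new();
    assert!(register_field(&mut types, "value", 0).is_ok()); // first write: Float
    assert!(register_field(&mut types, "value", 0).is_ok()); // same type: OK
    assert!(register_field(&mut types, "value", 1).is_err()); // Integer now conflicts
}
```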
5. Query Path¶
For an exhaustive treatment of every step, see Deep Dive: Read Path.
Client GET /query?db=mydb&q=SELECT mean("value") FROM "cpu" WHERE time > now() - 1h GROUP BY time(5m)
│
▼
┌────────────────────────────────────────────────────────────────┐
│ query.rs: handle_query_impl() │
│ 1. Extract q, db, epoch, pretty, chunked, params │
│ 2. Substitute $param bind parameters if present │
│ 3. Call QueryService.execute_query() with timeout wrapper │
│ 4. Format response as JSON, CSV, or chunked │
└──────────────┬─────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────────┐
│ query_service.rs: execute_query() │
│ 1. tokio::time::timeout wraps entire execution │
│ 2. Parse InfluxQL string → Vec<Statement> │
│ 3. For each statement, dispatch: │
│ ┌─────────────────────────────────────────────┐ │
│ │ SHOW DATABASES → metadata.list_databases() │ │
│ │ SHOW MEASUREMENTS → metadata.list_meas() │ │
│ │ SHOW TAG KEYS → metadata.get_meas() │ │
│ │ SHOW TAG VALUES → metadata.list_tag_values() │ │
│ │ SHOW FIELD KEYS → metadata.get_meas() │ │
│ │ CREATE DATABASE → metadata.create_database() │ │
│ │ DROP DATABASE → metadata.drop_database() │ │
│ │ DELETE → metadata.store_tombstone() │ │
│ │ SELECT → see SELECT flow below │ │
│ └─────────────────────────────────────────────┘ │
└──────────────┬─────────────────────────────────────────────────┘
│ (SELECT only)
▼
┌────────────────────────────────────────────────────────────────┐
│ handle_select() │
│ 1. Extract measurement name from FROM clause │
│ 2. Handle regex measurements: query metadata for matches, │
│ execute UNION ALL across all matching measurements │
│ 3. Handle subqueries: translate inner SELECT first │
│ 4. Resolve default retention policy from metadata │
│ 5. Determine time range from WHERE clause │
│ 6. Generate Parquet glob pattern via StorageLayout │
│ 7. Load tombstones for the measurement │
│ 8. Translate InfluxQL AST → ClickHouse SQL │
│ (to_clickhouse::translate) │
│ 9. Execute SQL via chDB (QueryPort.execute_sql) │
│ 10. Parse JSONEachRow output → SeriesResult[] │
│ - Group by tag combinations │
│ - Apply epoch formatting to timestamps │
│ 11. Handle SLIMIT/SOFFSET (series-level pagination) │
│ 12. Handle INTO clause (write results to target measurement) │
└────────────────────────────────────────────────────────────────┘
Result formatting¶
chDB returns data in JSONEachRow format (one JSON object per row):
{"__time":"2024-01-15 10:00:00","host":"server01","mean_usage_idle":42.5}
{"__time":"2024-01-15 10:05:00","host":"server01","mean_usage_idle":38.2}
The query service transforms this into InfluxDB v1 series format:
1. Parse each line as a JSON object.
2. Rename __time back to time.
3. Convert ClickHouse datetime strings to nanosecond timestamps.
4. Apply the epoch parameter (convert to ns/us/ms/s integers or leave as RFC3339 strings).
5. Group rows by tag combination into separate SeriesResult objects.
6. Each SeriesResult gets a name (measurement), tags map, columns list, and values array.
6. WAL (Write-Ahead Log)¶
The WAL provides crash-safe durability for incoming writes before they're flushed to Parquet.
Implementation: RocksDbWal¶
- Backing store: RocksDB with two column families.
- Location: Configured via storage.wal_dir (default ./wal).
| Column Family | Purpose |
|---|---|
wal | Ordered WAL entries. Keys are big-endian u64 sequence numbers (8 bytes). Values are bincode-serialized WalEntry. |
wal_meta | Single key last_seq storing the current sequence number as big-endian u64. |
WalEntry structure¶
Serialized with bincode for compact binary encoding.
Operations¶
| Operation | Description |
|---|---|
append(entry) | Atomically increments sequence, serializes entry, writes to wal CF and updates last_seq in a single WriteBatch. Returns sequence number. |
read_from(seq) | Forward iterator from seq to end. Returns Vec<(u64, WalEntry)>. |
read_range(start, count) | Reads up to count entries starting at start. Used by flush service for chunked reads. |
truncate_before(seq) | Deletes all entries with sequence < seq using delete_range_cf. |
last_sequence() | Returns the current sequence number from the atomic counter. |
Key encoding¶
Sequence numbers are encoded as big-endian u64 so that RocksDB's lexicographic ordering preserves numerical order. This allows efficient range scans and truncation.
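A two-line check shows why the encoding matters: byte-wise comparison of big-endian keys agrees with numeric order, while little-endian keys do not.

```rust
// Big-endian u64 keys sort byte-wise in the same order as the numbers,
// which is what RocksDB's lexicographic key comparison requires.
fn encode_seq(seq: u64) -> [u8; 8] {
    seq.to_be_bytes()
}

fn main() {
    // 255 < 256 numerically, and the big-endian keys preserve that:
    // [..,0,255] < [..,1,0] byte-wise.
    assert!(encode_seq(255) < encode_seq(256));
    // Little-endian would break it: [255,0,..] > [0,1,..].
    assert!(255u64.to_le_bytes() > 256u64.to_le_bytes());
}
```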
7. Parquet Storage¶
File layout¶
Example:
/var/lib/hyperbytedb/data/mydb/autogen/cpu/2024-01-15/10.parquet
/var/lib/hyperbytedb/data/mydb/autogen/cpu/2024-01-15/10_a3f2b1c8.parquet
The UUID suffix is appended when multiple flushes write to the same hour bucket, preventing overwrites.
Arrow schema¶
Each Parquet file has a consistent Arrow schema:
| Column | Arrow Type | Nullable | Notes |
|---|---|---|---|
time | Timestamp(Nanosecond, Some("UTC")) | No | Always present, always first |
<tag_key> | Utf8 | Yes | One column per tag key |
<field_name> | Varies by type | Yes | Float64, Int64, UInt64, Utf8, Boolean |
Schema is constructed from the measurement's MeasurementMeta (tag keys + field types) stored in the metadata store.
Parquet properties¶
- Compression: ZSTD level 1 (Compression::ZSTD(ZstdLevel::try_new(1)))
- Row group size: 65,536 rows (set_max_row_group_size(65536))
- Statistics: Page-level (EnabledStatistics::Page)
- Encoding: Default Parquet encodings (dictionary for strings, delta for integers)
Storage backends¶
Two StoragePort implementations:
LocalStorage — Writes to and reads from the local filesystem. Creates intermediate directories as needed.
S3Storage — Uses the object_store crate to read/write Parquet files to any S3-compatible service (AWS S3, MinIO, Cloudflare R2). Configured via [storage.s3].
Both backends are transparent to the rest of the system — the flush service and compaction service interact only through the StoragePort trait.
8. Metadata Store¶
Implementation: RocksDbMetadata¶
- Backing store: RocksDB with one column family, metadata.
- Location: Configured via storage.meta_dir (default ./meta).
- Serialization: JSON for all values (human-inspectable with RocksDB tools).
Key schema¶
| Key Pattern | Value | Description |
|---|---|---|
db:{name} | JSON {"database": Database} | Database definition with retention policies |
meas:{db}:{name} | JSON MeasurementMeta | Field types (name → discriminant), tag keys |
tag_val:{db}:{meas}:{key}:{value} | Empty | Tag value existence (for SHOW TAG VALUES) |
pq:{db}:{rp}:{meas}:{min_time_hex} | JSON {"path", "min_time", "max_time"} | Parquet file registry |
user:{username} | JSON StoredUser | Username, password_hash, admin flag |
tombstone:{db}:{meas}:{uuid} | JSON {"predicate_sql", "created_at"} | DELETE tombstone records |
cq:{db}:{name} | JSON ContinuousQueryDef | Continuous query definitions |
MeasurementMeta¶
struct MeasurementMeta {
name: String,
field_types: HashMap<String, u8>, // field_name → FieldValue discriminant
tag_keys: Vec<String>,
}
Field types use the FieldValue discriminant (0=Float, 1=Integer, 2=UInteger, 3=String, 4=Boolean) to reconstruct Arrow schemas during flush.
Parquet file registry¶
Each Parquet file is registered with its time range:
{
"path": "/var/lib/hyperbytedb/data/mydb/autogen/cpu/2024-01-15/10_abc123.parquet",
"min_time": 1705312800000000000,
"max_time": 1705316399999999999
}
The key includes min_time as a zero-padded hex string for efficient time-range prefix scanning.
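The padding requirement is easy to see in isolation. This sketch assumes 16 hex digits of zero-padding; the exact width in the real key builder may differ.

```rust
// Sketch of the Parquet registry key. Zero-padding min_time keeps
// lexicographic key order aligned with numeric time order.
fn registry_key(db: &str, rp: &str, meas: &str, min_time: i64) -> String {
    format!("pq:{db}:{rp}:{meas}:{:016x}", min_time)
}

fn main() {
    let early = registry_key("mydb", "autogen", "cpu", 9);
    let late = registry_key("mydb", "autogen", "cpu", 16);
    // Unpadded, "9" > "10" lexicographically, so time 9 would sort after
    // time 16. Padded ("...0009" vs "...0010"), the order is correct.
    assert!(early < late);
}
```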
9. Flush Pipeline¶
The flush service is the bridge between the WAL and Parquet storage. It runs as a background Tokio task.
Lifecycle¶
- Timer tick every flush.interval_secs (default 10s).
- Read WAL entries from last_flushed_seq + 1 in chunks of 5,000 entries.
- Group all points by (database, retention_policy, measurement).
- Partition each group by hour bucket (using the point's timestamp).
- Sub-batch large groups by max_points_per_batch (auto-detected from available memory, clamped to 10K–500K).
- For each sub-batch, in parallel via spawn_blocking:
  - Look up MeasurementMeta from metadata.
  - Build Arrow schema.
  - Convert Vec<Point> to RecordBatch.
  - Write RecordBatch to Parquet bytes (ZSTD level 1 compressed, 65,536-row groups).
- Write all Parquet files in parallel via tokio::spawn (I/O-parallel).
- Register each file in the metadata Parquet file registry.
- Truncate WAL entries up to the max flushed sequence.
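The grouping and hour-bucket partitioning above reduce to simple integer arithmetic on the nanosecond timestamp. A minimal sketch (the bucket key and map shape are illustrative):

```rust
// Partition timestamps into hour buckets since the Unix epoch.
use std::collections::HashMap;

const NS_PER_HOUR: i64 = 3_600 * 1_000_000_000;

fn hour_bucket(ts_ns: i64) -> i64 {
    ts_ns.div_euclid(NS_PER_HOUR)
}

fn main() {
    let mut buckets: HashMap<(String, i64), Vec<i64>> = HashMap::new();
    let points = [
        ("cpu", 1_705_312_800_000_000_000i64), // 2024-01-15 10:00 UTC
        ("cpu", 1_705_314_600_000_000_000),    // 10:30, same hour bucket
        ("cpu", 1_705_316_400_000_000_000),    // 11:00, next bucket
    ];
    for (meas, ts) in points {
        buckets
            .entry((meas.to_string(), hour_bucket(ts)))
            .or_default()
            .push(ts);
    }
    // Two hour buckets, so two Parquet files would be produced.
    assert_eq!(buckets.len(), 2);
}
```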
Memory-aware batching¶
When max_points_per_batch = 0 (default), the flush service calls auto_detect_batch_size():
- On Linux: reads /proc/meminfo for MemAvailable and uses 25% of that.
- Elsewhere: defaults to 100,000 points.
- Clamps to [MIN_BATCH_POINTS = 10_000, MAX_BATCH_POINTS = 500_000].
Each point is estimated at 512 bytes for memory budgeting.
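Putting the three rules together, the sizing logic amounts to the arithmetic below (a sketch of auto_detect_batch_size, not the actual function body):

```rust
// 25% of available memory at ~512 bytes/point, clamped to documented bounds.
const MIN_BATCH_POINTS: usize = 10_000;
const MAX_BATCH_POINTS: usize = 500_000;
const BYTES_PER_POINT: usize = 512;

fn batch_size_for(available_bytes: usize) -> usize {
    let budget = available_bytes / 4; // use 25% of MemAvailable
    (budget / BYTES_PER_POINT).clamp(MIN_BATCH_POINTS, MAX_BATCH_POINTS)
}

fn main() {
    // 8 GiB available -> 2 GiB budget -> 4M points, clamped to the ceiling.
    assert_eq!(batch_size_for(8 * 1024 * 1024 * 1024), MAX_BATCH_POINTS);
    // 64 MiB available -> 16 MiB budget -> 32,768 points, within bounds.
    assert_eq!(batch_size_for(64 * 1024 * 1024), 32_768);
    // A tiny machine clamps up to the floor.
    assert_eq!(batch_size_for(1024 * 1024), MIN_BATCH_POINTS);
}
```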
Parallelism¶
- CPU: Parquet conversion (schema building, RecordBatch creation, compression) runs in spawn_blocking tasks on the Tokio blocking threadpool.
- I/O: File writes run in concurrent tokio::spawn tasks.
- Sequencing: All CPU work completes before I/O begins, ensuring no partial writes.
10. Compaction¶
Compaction merges small Parquet files for a measurement's time bucket into fewer, larger files. Orchestration lives in src/application/compaction_service.rs (when to run, metadata, storage callbacks, cluster verification). The actual Parquet merge is implemented in src/application/compaction_merge.rs using Apache Arrow only (no chDB). For behavior detail, see Deep Dive: Compaction.
Trigger conditions¶
- Runs every compaction.interval_secs (default 30s).
- Scans metadata for (db, rp, measurement) triples.
- Groups Parquet files by time bucket (bucket_duration, default 1 hour).
- Triggers when a measurement has >= min_files_to_compact total files (default 2), then compacts each bucket with >= 2 files.
Process¶
Implementation: compaction_merge::merge_parquet_files (called from compaction_service).
1. Open inputs for streaming: For each path, StoragePort::prepare_parquet_stream returns a filesystem path (local) or a temp file (S3 download), avoiding holding full Parquet bytes in memory on the happy path.
2. Merge (default): merge_parquet_streaming opens one ParquetRecordBatchReader per file, builds a unified schema from Parquet metadata, conforms batches, and performs a k-way merge ordered by time (min-heap), assuming each input file is already sorted by time (true for WAL flushes and prior compactions). Accumulator batches flush when row count or estimated size hits limits, then write_batch_with_limit emits one or more output Parquet blobs (ZSTD).
3. Fallback: If per-file or global time order is violated, the code re-reads full file bytes and runs the legacy path: concat_batches + sort_to_indices/take + write_batch_with_limit.
4. Register and delete: Register new files in metadata, unregister old paths, delete old objects from storage (still orchestrated in compaction_service).
5. Orphan cleanup: Remove on-disk Parquet files with no metadata entry (in compaction_service).
Steps 2–3 use the same schema-unification, conformance, and size-splitting ideas as before; see Deep Dive: Compaction for the full algorithm.
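The min-heap merge at the core of the streaming path can be shown in miniature. Real inputs are Arrow batches keyed by the time column; plain i64 timestamps stand in here.

```rust
// K-way merge of already-sorted inputs via a min-heap over each input's head.
use std::cmp::Reverse;
use std::collections::BinaryHeap;

fn k_way_merge(inputs: Vec<Vec<i64>>) -> Vec<i64> {
    let mut heap = BinaryHeap::new();
    // Seed the heap with (timestamp, source index, offset) per input.
    for (src, input) in inputs.iter().enumerate() {
        if let Some(&ts) = input.first() {
            heap.push(Reverse((ts, src, 0usize)));
        }
    }
    let mut out = Vec::new();
    while let Some(Reverse((ts, src, idx))) = heap.pop() {
        out.push(ts);
        // Advance the source that yielded the smallest timestamp.
        if let Some(&next) = inputs[src].get(idx + 1) {
            heap.push(Reverse((next, src, idx + 1)));
        }
    }
    out
}

fn main() {
    let merged = k_way_merge(vec![vec![1, 4, 9], vec![2, 3, 10], vec![5]]);
    assert_eq!(merged, vec![1, 2, 3, 4, 5, 9, 10]);
}
```

This is why the streaming path only needs one batch per input in memory at a time; the full-buffer sort fallback exists for inputs that violate the sorted-by-time precondition.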
Tombstones are not applied during compaction. They are applied at query time only, by injecting AND NOT (predicate) clauses into generated SQL.
Safety¶
Compaction is idempotent. If it crashes mid-way, old files remain intact and the next compaction cycle retries. New files are written with UUID-suffixed names ({HH}_c{8-char-uuid}.parquet) to prevent collisions.
Cluster mode: hash verification and Parquet self-repair¶
When clustering is enabled and the compaction service is constructed with shared membership (src/bootstrap.rs), each compaction interval runs verified compaction after the ordinary merges, plus optional membership-driven self-repair:
- Verified compaction hashes per-origin Parquet in cold buckets (older than compaction.verified_compaction_age_secs) against the authoring peer via GET /internal/bucket-hash, using the same bucket_duration as local compaction. On mismatch it re-downloads origin-filtered files from the peer's GET /internal/sync/manifest + /internal/sync/parquet/....
- Self-repair walks each active peer's manifest (capped by max_repair_checks_per_cycle) to find buckets where that peer authored data but this node's hash differs (including empty vs non-empty), then fetches the same way.
Sync manifest rows carry origin_node_id for precise filtering; older binaries omit the field and peers fall back to _n{id}_ filename tags.
For protocol detail, metrics, and configuration, see Deep Dive: Self-Repair.
11. Query Engine (chDB)¶
HyperbyteDB uses chDB (embedded ClickHouse) as its query engine. chDB provides the full ClickHouse SQL dialect including window functions, aggregates, and the file() table function.
Session management¶
chDB::Session is:
- Send but not Sync: requires Mutex wrapping.
- Synchronous: all calls block the thread, so they must run in spawn_blocking.
Single session (pool_size = 1)¶
Each query:
1. Clones the Arc<Mutex<Session>>.
2. spawn_blocking → blocking_lock() → session.execute(sql, JSONEachRow).
3. Returns the UTF-8 result string.
Session pool (pool_size > 1)¶
Round-robin assignment. Each session has its own session_data_path subdirectory ({base}/pool_{i}). Multiple sessions allow concurrent queries without contending on a single Mutex.
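Round-robin selection over the pool is a single atomic counter. A sketch (the real pool hands out Arc<Mutex<Session>> handles; bare indices stand in here):

```rust
// Round-robin index selection with an atomic counter.
use std::sync::atomic::{AtomicUsize, Ordering};

struct Pool {
    size: usize,
    next: AtomicUsize,
}

impl Pool {
    fn pick(&self) -> usize {
        // fetch_add wraps on overflow, which is harmless for an index counter.
        self.next.fetch_add(1, Ordering::Relaxed) % self.size
    }
}

fn main() {
    let pool = Pool { size: 3, next: AtomicUsize::new(0) };
    let picks: Vec<usize> = (0..6).map(|_| pool.pick()).collect();
    assert_eq!(picks, vec![0, 1, 2, 0, 1, 2]);
}
```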
Output format¶
All queries use OutputFormat::JSONEachRow — one JSON object per result row. This is parsed by the query service into InfluxDB v1 series format.
12. InfluxQL Parser¶
The parser is a hand-rolled recursive descent parser (no parser generator). It lives in src/influxql/parser.rs.
Parse flow¶
Input: "SELECT mean(\"value\") FROM \"cpu\" WHERE time > now() - 1h; SHOW DATABASES"
│
▼
split_statements(";")
│
┌───────────┼───────────┐
▼ ▼
parse_statement() parse_statement()
first token = "SELECT" first token = "SHOW"
│ │
▼ ▼
parse_select() parse_show()
│ │
▼ ▼
SelectStatement Statement::ShowDatabases
Statement dispatch¶
The parser examines the first keyword (case-insensitive) and dispatches:
| First token | Handler |
|---|---|
SELECT | parse_select() |
SHOW | parse_show() → further dispatch by second/third token |
CREATE | parse_create() → CREATE DATABASE, CREATE RETENTION POLICY, CREATE USER, CREATE CONTINUOUS QUERY |
DROP | parse_drop() → DROP DATABASE, DROP MEASUREMENT, etc. |
DELETE | parse_delete() |
ALTER | parse_alter() |
SET | parse_set_password() |
GRANT | parse_grant() |
REVOKE | parse_revoke() |
Expression parsing¶
The SELECT field list and WHERE clause use a precedence-climbing expression parser:
| Precedence | Operators |
|---|---|
| 1 (lowest) | OR |
| 2 | AND |
| 3 | =, !=, <>, <, <=, >, >=, =~, !~ |
| 4 | +, - |
| 5 | *, /, % |
| 6 (highest) | Unary -, NOT |
Atoms include: identifiers ("column" or bare), string literals ('value'), integer/float literals, duration literals (1h, 30s), now(), function calls (mean(...), derivative(...)), *, regex (/pattern/), and subqueries.
Duration parsing¶
| Suffix | Duration |
|---|---|
ns | Nanoseconds |
u | Microseconds |
ms | Milliseconds |
s | Seconds |
m | Minutes |
h | Hours |
d | Days |
w | Weeks |
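The suffix table above translates directly into a small parser. This is a sketch of the idea, not the parser's actual code; note that two-character suffixes must be tried before one-character ones ("ns" before "s", "ms" before "m").

```rust
// Parse an InfluxQL duration literal into nanoseconds.
fn parse_duration_ns(lit: &str) -> Option<i64> {
    // Longer suffixes first so "5ms" is not read as "5m" + trailing "s".
    let suffixes: &[(&str, i64)] = &[
        ("ns", 1),
        ("ms", 1_000_000),
        ("u", 1_000),
        ("s", 1_000_000_000),
        ("m", 60 * 1_000_000_000),
        ("h", 3_600 * 1_000_000_000),
        ("d", 86_400 * 1_000_000_000),
        ("w", 7 * 86_400 * 1_000_000_000),
    ];
    for (suffix, scale) in suffixes {
        if let Some(num) = lit.strip_suffix(suffix) {
            return num.parse::<i64>().ok().map(|n| n * scale);
        }
    }
    None
}

fn main() {
    assert_eq!(parse_duration_ns("1h"), Some(3_600_000_000_000));
    assert_eq!(parse_duration_ns("30s"), Some(30_000_000_000));
    assert_eq!(parse_duration_ns("5ms"), Some(5_000_000));
    assert_eq!(parse_duration_ns("bad"), None);
}
```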
AST types¶
Key AST nodes (in src/influxql/ast.rs):
- Statement — enum of all statement types
- SelectStatement — fields, from, into, condition, group_by, order_by, limit, offset, slimit, soffset, fill, timezone
- Field — expression + optional alias
- Expr — recursive expression tree (identifiers, literals, function calls, binary/unary ops, subqueries)
- FunctionCall — name + args
- GroupBy — list of Dimension (Time, Tag, Regex)
- FillOption — Null, None, Previous, Linear, Value(f64)
- Measurement — optional database, optional RP, name or regex
13. ClickHouse SQL Translator¶
The translator (src/influxql/to_clickhouse.rs) converts an InfluxQL SelectStatement AST into ClickHouse SQL that operates on Parquet files.
Translation pipeline¶
SelectStatement
│
▼
┌─────────────────────────────────────┐
│ SELECT clause │
│ - Time bucket: toStartOfInterval() │
│ - GROUP BY tags added to SELECT │
│ - Aggregate function mapping │
│ - Transform → window functions │
│ - fill(N) → ifNull() wrapping │
│ - Arithmetic expressions │
│ - Default aliases (mean_field) │
└──────────────┬──────────────────────┘
│
┌──────────────▼──────────────────────┐
│ FROM clause │
│ - file('{glob}', Parquet) │
│ - Subqueries become inline SELECTs │
└──────────────┬──────────────────────┘
│
┌──────────────▼──────────────────────┐
│ WHERE clause │
│ - now() → now64() │
│ - Duration → INTERVAL │
│ - Epoch literals → fromUnixTimestamp│
│ - Regex =~ → match() │
│ - Tombstone predicates appended │
│ - String comparisons preserved │
└──────────────┬──────────────────────┘
│
┌──────────────▼──────────────────────┐
│ GROUP BY clause │
│ - time(5m) → toStartOfInterval() │
│ - Tag dimensions │
└──────────────┬──────────────────────┘
│
┌──────────────▼──────────────────────┐
│ ORDER BY clause │
│ - WITH FILL for fill modes │
│ - INTERPOLATE for fill(previous) │
│ and fill(linear) │
└──────────────┬──────────────────────┘
│
┌──────────────▼──────────────────────┐
│ LIMIT / OFFSET │
└─────────────────────────────────────┘
Function mapping¶
| InfluxQL | ClickHouse |
|---|---|
MEAN(f) | avg(f) |
MEDIAN(f) | median(f) |
COUNT(f) | count(f) |
SUM(f) | sum(f) |
MIN(f) | min(f) |
MAX(f) | max(f) |
FIRST(f) | argMin(f, time) |
LAST(f) | argMax(f, time) |
PERCENTILE(f, N) | quantile(N/100.0)(f) |
SPREAD(f) | (max(f) - min(f)) |
STDDEV(f) | stddevPop(f) |
MODE(f) | topKWeighted(1)(f, 1) |
DISTINCT(f) | DISTINCT f |
Transform function translation¶
Transforms use ClickHouse window functions:
| InfluxQL | ClickHouse |
|---|---|
DERIVATIVE(f, 1s) | (f - lagInFrame(f, 1) OVER (ORDER BY __time)) / nullIf(dateDiff('second', lagInFrame(__time, 1) OVER ..., __time), 0) * scale |
NON_NEGATIVE_DERIVATIVE(...) | Same as above, wrapped in greatest(..., 0) |
DIFFERENCE(f) | f - lagInFrame(f, 1) OVER (ORDER BY __time) |
MOVING_AVERAGE(f, N) | avg(f) OVER (ORDER BY __time ROWS BETWEEN N-1 PRECEDING AND CURRENT ROW) |
CUMULATIVE_SUM(f) | sum(f) OVER (ORDER BY __time ROWS UNBOUNDED PRECEDING) |
ELAPSED(f, unit) | dateDiff('unit', lagInFrame(__time, 1) OVER ..., __time) |
Time bucket translation¶
-- InfluxQL: GROUP BY time(5m)
-- ClickHouse:
toStartOfInterval(time, INTERVAL 5 MINUTE) AS __time
-- InfluxQL: GROUP BY time(1h, 15m) -- offset
-- ClickHouse:
toStartOfInterval(time - INTERVAL 15 MINUTE, INTERVAL 1 HOUR) + INTERVAL 15 MINUTE AS __time
The internal alias __time avoids collision with the raw time column. It's renamed back to time in the result parser.
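The bucketing that toStartOfInterval performs (with the offset trick from the second example) reduces to floor arithmetic. A sketch in nanoseconds:

```rust
// GROUP BY time(interval, offset): shift by the offset, floor to the
// interval, shift back.
fn time_bucket(ts: i64, interval: i64, offset: i64) -> i64 {
    (ts - offset).div_euclid(interval) * interval + offset
}

fn main() {
    let m = 60 * 1_000_000_000i64; // one minute in ns
    // time(5m): 12:07 floors to 12:05.
    assert_eq!(time_bucket(727 * m, 5 * m, 0), 725 * m);
    // time(1h, 15m): 12:07 floors to 11:15 (hour buckets anchored at :15).
    assert_eq!(time_bucket(727 * m, 60 * m, 15 * m), 675 * m);
}
```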
Fill translation¶
| Fill mode | ClickHouse |
|---|---|
fill(null) | ORDER BY __time WITH FILL FROM ... TO ... STEP INTERVAL ... |
fill(none) | No WITH FILL |
fill(0) | ifNull(agg, 0) + WITH FILL |
fill(previous) | WITH FILL + INTERPOLATE (col AS col) |
fill(linear) | WITH FILL + INTERPOLATE (col AS col USING LINEAR) |
14. Retention Enforcement¶
The RetentionService runs every 60 seconds:
- Lists all databases from metadata.
- For each database, iterates retention policies.
- For RPs with a finite duration, calculates the cutoff time: now - duration.
- Queries the Parquet file registry for files with max_time < cutoff.
- Deletes those files from storage and removes their metadata entries.
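The file-selection rule is worth making explicit: a file is expired only when its entire time range is past the cutoff. A sketch (the FileEntry shape is illustrative; the registry stores path/min_time/max_time as shown in section 8):

```rust
// Select registry entries whose whole range predates now - duration.
struct FileEntry {
    path: &'static str,
    max_time: i64, // ns since epoch
}

fn expired<'a>(files: &'a [FileEntry], now: i64, duration: i64) -> Vec<&'a str> {
    let cutoff = now - duration;
    files
        .iter()
        .filter(|f| f.max_time < cutoff) // partially-covered files survive
        .map(|f| f.path)
        .collect()
}

fn main() {
    let day = 86_400 * 1_000_000_000i64;
    let now = 100 * day;
    let files = [
        FileEntry { path: "old.parquet", max_time: 90 * day },
        FileEntry { path: "new.parquet", max_time: 99 * day },
    ];
    // 7-day retention: only the file wholly older than now - 7d is deleted.
    assert_eq!(expired(&files, now, 7 * day), vec!["old.parquet"]);
}
```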
Data in the WAL that has not yet been flushed is not affected by retention. It will be filtered at flush time if the timestamps fall outside the retention window.
15. DELETE and Tombstones¶
DELETE uses a tombstone-based approach:
On DELETE execution¶
- Parse the DELETE statement to extract measurement name and WHERE predicate.
- Convert the WHERE predicate to a ClickHouse SQL fragment.
- Store a tombstone record in metadata under the tombstone:{db}:{meas}:{uuid} key.
- In cluster mode, replicate the DELETE mutation to all peers.
On query execution¶
Before translating a SELECT query, the query service loads all tombstones for the measurement and passes them to the translator. The translator appends AND NOT (predicate) for each tombstone to the WHERE clause.
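The injection itself is plain string composition on the generated WHERE clause. A sketch (the base predicate here is an arbitrary example):

```rust
// Append each tombstone predicate as AND NOT (predicate), so deleted rows
// are filtered at query time without rewriting Parquet.
fn apply_tombstones(where_sql: &str, tombstones: &[&str]) -> String {
    let mut sql = where_sql.to_string();
    for predicate in tombstones {
        sql.push_str(&format!(" AND NOT ({predicate})"));
    }
    sql
}

fn main() {
    let sql = apply_tombstones(
        "time > now64() - INTERVAL 1 HOUR",
        &["host = 'server01'", "region = 'eu'"],
    );
    assert_eq!(
        sql,
        "time > now64() - INTERVAL 1 HOUR AND NOT (host = 'server01') AND NOT (region = 'eu')"
    );
}
```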
On compaction¶
Tombstones are not applied during compaction. The compaction service merges Parquet files as-is without filtering rows. Tombstones remain in metadata and continue to be applied at query time until they are manually removed. This design keeps compaction simple and idempotent.
16. Continuous Queries¶
Storage¶
CQ definitions are stored in metadata under cq:{db}:{name} keys:
struct ContinuousQueryDef {
name: String,
database: String,
query_text: String, // The full SELECT ... INTO ... statement
resample_every_secs: u64, // Execution interval
resample_for_secs: u64, // Look-back window
created_at: String, // RFC3339 timestamp
}
Execution¶
The ContinuousQueryService runs a loop with a 10-second tick:
- Load all CQ definitions from metadata across all databases.
- For each CQ, check if resample_every_secs has elapsed since the last execution.
- If due, execute the query via the QueryService.
- The SELECT ... INTO ... clause in the query writes results to the target measurement.
Cluster behavior¶
CQ create/drop operations are replicated to all peers. Each node independently runs its own CQ scheduler. This means CQs may execute on multiple nodes simultaneously, which is safe because writes are idempotent (same timestamps + tags = same series).
17. Authentication Internals¶
Password storage¶
Passwords are hashed using Argon2 with random salts via SaltString::generate(OsRng). The resulting hash string (in PHC format) is stored in the metadata store under user:{username}.
Credential extraction order¶
The auth middleware checks three sources in order:
- Query parameters: u and p
- HTTP Basic auth: Authorization: Basic <base64(user:pass)>
- Token auth: Authorization: Token user:pass
The first match wins. If none match and auth is enabled, the request is rejected with 401.
Base64 decoding¶
A minimal hand-rolled Base64 decoder is used (no external dependency) for parsing Basic auth headers.
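A decoder in the same spirit fits in a few dozen lines. This sketch handles the standard alphabet with '=' padding; it is illustrative, not the project's actual decoder.

```rust
// Minimal Base64 decoder (standard alphabet, '=' padding).
fn b64_val(c: u8) -> Option<u32> {
    match c {
        b'A'..=b'Z' => Some((c - b'A') as u32),
        b'a'..=b'z' => Some((c - b'a') as u32 + 26),
        b'0'..=b'9' => Some((c - b'0') as u32 + 52),
        b'+' => Some(62),
        b'/' => Some(63),
        _ => None,
    }
}

fn b64_decode(input: &str) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    let mut buf: u32 = 0;
    let mut bits = 0;
    for &c in input.as_bytes() {
        if c == b'=' {
            break; // padding: remaining partial bits are discarded
        }
        // Accumulate 6 bits per character; mask keeps buf small.
        buf = ((buf << 6) | b64_val(c)?) & 0xFFFFFF;
        bits += 6;
        if bits >= 8 {
            bits -= 8;
            out.push((buf >> bits) as u8);
        }
    }
    Some(out)
}

fn main() {
    // "user:pass" as it would arrive in an Authorization: Basic header.
    let decoded = b64_decode("dXNlcjpwYXNz").unwrap();
    assert_eq!(decoded, b"user:pass");
    assert!(b64_decode("not base64!").is_none());
}
```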
Verification¶
Uses the default Argon2id variant with parameters from the stored hash.
18. Clustering and Replication¶
Model¶
HyperbyteDB uses master-master (peer-to-peer) replication for data writes, with Raft consensus (via openraft) for schema mutations. Every node accepts reads and writes. Data writes are replicated asynchronously to all peers. Schema-mutating operations (CREATE/DROP DATABASE, DELETE, user/CQ/RP management) are routed through Raft to ensure consistent ordering across the cluster. For a comprehensive treatment, see Deep Dive: Clustering.
Replicated operations¶
| Operation | Endpoint | Replication target |
|---|---|---|
| Write (line protocol) | /internal/replicate | All peers |
| CREATE DATABASE | /internal/replicate-mutation | All peers |
| DROP DATABASE | /internal/replicate-mutation | All peers |
| DELETE | /internal/replicate-mutation | All peers |
| CREATE USER | /internal/replicate-mutation | All peers |
| DROP USER | /internal/replicate-mutation | All peers |
| CREATE CONTINUOUS QUERY | /internal/replicate-mutation | All peers |
| DROP CONTINUOUS QUERY | /internal/replicate-mutation | All peers |
| CREATE RETENTION POLICY | /internal/replicate-mutation | All peers |
Replication protocol¶
- Client writes to node A.
- Node A persists locally (WAL + metadata).
- Node A returns 204 to the client.
- Node A spawns an async task that POSTs to each peer's /internal/replicate endpoint.
- Each POST includes an X-Hyperbytedb-Replicated: true header.
- The receiving node checks for this header; if present, it persists locally but does not re-replicate (preventing loops).
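The loop-prevention check on the receiving node can be sketched as follows (a hypothetical, simplified version — the real check lives in the HTTP handler; header names are compared case-insensitively, so they are stored lowercased here):

```rust
use std::collections::HashMap;

/// Header that marks a write as having arrived via replication.
/// HTTP header names are case-insensitive; lowercased for lookup here.
const REPLICATED_HEADER: &str = "x-hyperbytedb-replicated";

/// Hypothetical sketch of the loop-prevention check: a write that arrived
/// via replication is persisted locally but never fanned out again.
fn should_fan_out(headers: &HashMap<String, String>) -> bool {
    headers
        .get(REPLICATED_HEADER)
        .map(|v| v != "true")
        .unwrap_or(true)
}
```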
MutationRequest types¶
enum MutationRequest {
CreateDatabase(String),
DropDatabase(String),
CreateRetentionPolicy { db, rp },
CreateUser { username, password_hash, admin },
DropUser(String),
Delete { database, measurement, predicate_sql },
CreateContinuousQuery { database, name, definition },
DropContinuousQuery { database, name },
}
Failure handling¶
- Replication is fire-and-forget with logging. If a peer is unreachable, the error is logged at WARN level.
- There is no retry queue or WAL replay for failed replications.
- On peer recovery, data can be re-synchronized via backup/restore from a healthy node.
Network requirements¶
- All nodes must be reachable by all other nodes on their cluster_addr and port.
- The peers list should not include the node's own address (filtered at startup).
- HTTP timeout for replication requests: 10 seconds.
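The startup self-address filtering amounts to a one-liner (a hypothetical sketch; the real configuration loading is more involved):

```rust
/// Hypothetical sketch of the startup peer-list filtering: a node must not
/// replicate to itself, so its own cluster_addr is dropped from `peers`.
fn filter_peers(peers: Vec<String>, own_addr: &str) -> Vec<String> {
    peers.into_iter().filter(|p| p != own_addr).collect()
}
```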
19. Background Services¶
HyperbyteDB runs five background services as Tokio tasks:
| Service | Interval | Purpose |
|---|---|---|
| Flush | flush.interval_secs (default 10s) | WAL → Parquet |
| Compaction | compaction.interval_secs (default 30s) | Merge small Parquet files; in cluster mode, same tick may run verified compaction + membership self-repair |
| Retention | 60s (fixed) | Delete expired Parquet files |
| Continuous Query | 10s (fixed) | Execute CQ schedules |
| Cluster Heartbeat | 60s (fixed, cluster mode only) | Log cluster status |
All services listen on a watch::Receiver<bool> for graceful shutdown. On ctrl+c:
- The shutdown signal is sent via watch::channel.
- Each service finishes its current iteration.
- The flush service performs one final flush.
- The main task awaits all service handles.
- Logs "HyperbyteDB shut down cleanly".
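The tick-until-shutdown pattern (including the flush service's final flush) can be sketched with a std-only analogue — the real services select on a tokio watch channel, but mpsc::recv_timeout stands in for the async select here:

```rust
use std::sync::mpsc;
use std::time::Duration;

/// Std-only sketch of the flush service's loop: flush on every tick, and on
/// shutdown perform one final flush before exiting. The real services select
/// on a tokio watch channel; mpsc::recv_timeout stands in for that here.
fn flush_service(shutdown: mpsc::Receiver<()>, tick: Duration) -> u64 {
    let mut flushes = 0u64;
    loop {
        match shutdown.recv_timeout(tick) {
            // Tick elapsed with no shutdown signal: run a periodic flush.
            Err(mpsc::RecvTimeoutError::Timeout) => flushes += 1,
            // Shutdown signalled (or sender dropped): one final flush, exit.
            _ => {
                flushes += 1;
                return flushes;
            }
        }
    }
}
```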
20. Error Handling¶
HyperbytedbError¶
All internal errors are represented by the HyperbytedbError enum:
| Variant | HTTP Status | When |
|---|---|---|
DatabaseNotFound(name) | 404 | Query or write to non-existent DB |
RetentionPolicyNotFound(name) | 404 | Reference to non-existent RP |
FieldTypeConflict{field, measurement, got, expected} | 400 | Write sends wrong type for existing field |
LineProtocolParse{line, reason} | 400 | Malformed line protocol |
QueryParse(msg) | 400 | Invalid InfluxQL syntax |
AuthFailed | 401 | Bad credentials |
DatabaseRequired | 400 | /write without db parameter |
MissingParameter(name) | 400 | /query without q parameter |
CardinalityExceeded{...} | 422 | Tag cardinality limit hit |
QueryTimeout | 408 | Query exceeded query_timeout_secs |
Wal(msg) | 500 | RocksDB WAL error |
Storage(msg) | 500 | File I/O or S3 error |
Chdb(msg) | 500 | chDB execution error |
Metadata(msg) | 500 | RocksDB metadata error |
Internal(msg) | 500 | Serialization or other internal error |
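The variant-to-status mapping above can be sketched as a plain match (an illustrative subset — the real enum carries more variants and structured fields):

```rust
/// Illustrative subset of the HyperbytedbError -> HTTP status mapping from
/// the table above (the real enum has more variants and structured fields).
enum HyperbytedbError {
    DatabaseNotFound(String),
    AuthFailed,
    QueryTimeout,
    Wal(String),
}

fn status_code(err: &HyperbytedbError) -> u16 {
    match err {
        HyperbytedbError::DatabaseNotFound(_) => 404,
        HyperbytedbError::AuthFailed => 401,
        HyperbytedbError::QueryTimeout => 408,
        HyperbytedbError::Wal(_) => 500,
    }
}
```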
Error responses follow the InfluxDB v1 format: a JSON body of the form {"error": "<message>"} with the status code from the table above.
21. Observability¶
Metrics¶
Uses the metrics crate with metrics-exporter-prometheus:
| Metric | Type | Labels | Description |
|---|---|---|---|
hyperbytedb_write_requests_total | counter | — | Write requests received |
hyperbytedb_query_requests_total | counter | — | Query requests received |
hyperbytedb_query_errors_total | counter | — | Failed queries |
hyperbytedb_query_duration_seconds | histogram | — | Query latency distribution |
hyperbytedb_ingestion_points_total | counter | — | Points ingested |
hyperbytedb_flush_duration_seconds | histogram | — | Flush cycle duration |
hyperbytedb_flush_points_total | counter | — | Points flushed to Parquet |
Tracing¶
Uses tracing + tracing-subscriber with configurable filter levels via the RUST_LOG environment variable or the [logging] config section.
Structured JSON logging is available with format = "json".
Health endpoint¶
GET /health always returns 200 as long as the HTTP server is running; it does not probe downstream storage.
22. Concurrency Model¶
HyperbyteDB is built on Tokio with a multi-threaded runtime (#[tokio::main] with features = ["full"]).
Thread usage¶
| Work | Thread type | Notes |
|---|---|---|
| HTTP request handling | Tokio async workers | Non-blocking |
| InfluxQL parsing | Tokio async workers | CPU-bound but fast |
| WAL operations | Tokio async workers | RocksDB ops are synchronous but fast |
| chDB query execution | spawn_blocking pool | chDB Session is synchronous |
| Parquet conversion | spawn_blocking pool | CPU-intensive Arrow operations |
| Parquet I/O | tokio::spawn async tasks | Concurrent file writes |
| Peer replication | tokio::spawn async tasks | Non-blocking HTTP POSTs |
Synchronization¶
| Resource | Mechanism |
|---|---|
| WAL sequence number | AtomicU64 (lock-free) |
| chDB Session | tokio::sync::Mutex (one per session) |
| Last flushed sequence | tokio::sync::Mutex<u64> |
| Shutdown signal | tokio::sync::watch channel |
23. Dependencies¶
Core runtime¶
| Crate | Version | Purpose |
|---|---|---|
tokio | 1.x | Async runtime |
axum | 0.8 | HTTP framework |
axum-server | 0.7 | TLS support |
tower / tower-http | 0.5 / 0.6 | Middleware (tracing, CORS, timeout) |
hyper | 1.x | HTTP transport |
Storage¶
| Crate | Version | Purpose |
|---|---|---|
rocksdb | 0.22 | WAL and metadata store |
arrow | 54 | In-memory columnar format |
parquet | 54 | Columnar file format |
object_store | 0.12 | Local + S3 filesystem abstraction |
chdb-rust | 1.3 | Embedded ClickHouse query engine |
Serialization¶
| Crate | Version | Purpose |
|---|---|---|
serde / serde_json | 1.x | JSON serialization |
bincode | 1.x | Binary WAL entry serialization |
serde_urlencoded | 0.7 | Form-encoded POST body parsing |
Parsing and protocol¶
| Crate | Version | Purpose |
|---|---|---|
influxdb-line-protocol | 2.x | Line protocol parsing |
regex | 1.x | Regular expression support |
Configuration¶
| Crate | Version | Purpose |
|---|---|---|
figment | 0.10 | Config from TOML + env vars |
clap | 4.x | CLI argument parsing |
Observability¶
| Crate | Version | Purpose |
|---|---|---|
tracing / tracing-subscriber | 0.1 / 0.3 | Structured logging |
metrics / metrics-exporter-prometheus | 0.24 / 0.16 | Prometheus metrics |
Auth and crypto¶
| Crate | Version | Purpose |
|---|---|---|
argon2 | 0.5 | Password hashing |
rand_core | 0.6 | Cryptographic RNG for salt generation |
Utilities¶
| Crate | Version | Purpose |
|---|---|---|
chrono | 0.4 | Date/time handling |
uuid | 1.x | Request IDs, Parquet file suffixes |
bytes | 1.x | Zero-copy byte buffers |
futures | 0.3 | Async stream utilities |
async-trait | 0.1 | Async trait methods |
thiserror | 2.x | Error derive macros |
anyhow | 1.x | Top-level error handling |
flate2 | 1.x | Gzip decompression |
reqwest | 0.12 | HTTP client for peer replication |
openraft | 0.10 | Raft consensus for schema mutations |
indexmap | 2.x | Insertion-ordered maps |
crc32fast | 1.x | CRC32 checksums for Parquet files |
sha2 | 0.10 | SHA-256 hashing for Merkle trees |
24. Statement Summary¶
The StatementSummary service tracks recently executed InfluxQL statements for debugging and observability. When enabled (statement_summary.enabled = true), it records the normalized query text, digest, execution time, and error status for up to max_entries (default 1,000) recent statements in a bounded ring buffer. Results are exposed via GET /api/v1/statements.
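The bounded ring buffer can be sketched with a VecDeque (field and type names here are assumptions based on the description, not the actual implementation):

```rust
use std::collections::VecDeque;

/// Hypothetical sketch of the bounded ring buffer behind StatementSummary:
/// once max_entries is reached, the oldest entry is evicted. Field names
/// are assumptions based on the description above.
struct StatementEntry {
    normalized_text: String,
    duration_ms: u64,
    is_error: bool,
}

struct StatementSummary {
    max_entries: usize,
    entries: VecDeque<StatementEntry>,
}

impl StatementSummary {
    fn record(&mut self, entry: StatementEntry) {
        if self.entries.len() == self.max_entries {
            // Buffer full: evict the oldest recorded statement.
            self.entries.pop_front();
        }
        self.entries.push_back(entry);
    }
}
```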
25. Debug Binary¶
The hyperbytedb-debug binary (src/bin/hyperbytedb_debug.rs) provides a CLI for inspecting cluster state without connecting to the HTTP API. Subcommands include cluster status, topology, health checks, replication state, Raft status, manifest inspection, and metrics.
26. Kubernetes Operator¶
The hyperbytedb-operator/ directory contains a Go-based Kubernetes operator built with Kubebuilder. It defines a HyperbytedbCluster CRD for declarative multi-node cluster management, handling StatefulSet creation, peer configuration, and rolling updates.
27. Deterministic Simulation Testing (HyperSim)¶
HyperbyteDB is tested with a TigerBeetle-style Deterministic Simulation Testing framework called HyperSim, living in the sibling crate hypersim/. HyperSim runs the real HyperbyteDB domain code (write path, WAL sequencing, compaction, retention, query routing) inside a single-threaded, tick-driven simulator with the network, storage, clock, and process lifecycle replaced by deterministic in-process models.
The port-adapter boundary is the simulation seam¶
The hexagonal architecture described in § 2 is what makes simulation viable. HyperSim binds alternative adapters to the same port traits:
| Port | Production adapter | Simulation adapter (#[cfg(feature = "sim")]) |
|---|---|---|
WalPort | RocksDbWal | MemoryWal (BTreeMap-backed) |
StoragePort | Parquet + object_store | MemoryStorage (BTreeMap<String, Bytes>) |
HttpTransport | ReqwestTransport | SimTransport (routes through SimNetwork) |
| Wall clock | SystemTime | VirtualClock |
RocksDB cannot run inside a deterministic simulator — its background threads, wall-clock rate limiters, and direct syscalls break every DST guarantee. The in-memory adapters are not test mocks; they are first-class adapters whose role is "drive the simulator." This mirrors exactly how TigerBeetle's VOPR replaces the storage engine at the port boundary.
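The port/adapter seam can be sketched as follows (signatures simplified and synchronous for illustration — the real WalPort is async and considerably richer):

```rust
use std::collections::BTreeMap;

/// Illustrative sketch of the port/adapter seam. Signatures are simplified
/// and synchronous; the real WalPort is async and richer. The simulator
/// binds MemoryWal to the same trait the RocksDB adapter implements.
trait WalPort {
    fn append(&mut self, seq: u64, entry: Vec<u8>);
    fn read(&self, seq: u64) -> Option<&[u8]>;
}

/// Deterministic in-memory adapter, sequence-ordered like the real WAL.
struct MemoryWal {
    entries: BTreeMap<u64, Vec<u8>>,
}

impl WalPort for MemoryWal {
    fn append(&mut self, seq: u64, entry: Vec<u8>) {
        self.entries.insert(seq, entry);
    }
    fn read(&self, seq: u64) -> Option<&[u8]> {
        self.entries.get(&seq).map(|v| v.as_slice())
    }
}
```

Because both adapter families implement the same trait, domain code compiled for the simulator is byte-identical to production code above the port boundary.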
Port-contract tests pin the two adapter families together¶
hyperbytedb/tests/wal_port_contract.rs runs an arbitrary seeded op stream against both MemoryWal and RocksDbWal and asserts byte-for-byte output equivalence after every operation. This ensures HyperSim findings transfer to the production binary — any divergence caught here is a bug in one adapter or the other. A StoragePort contract test is planned as a follow-up.
CI integration¶
On every PR, CI runs 100 deterministic simulations (1 virtual hour each, ~5 minutes wall time on 8 cores), alternating safety-mode and liveness-mode fault profiles. Every run is reproducible from a single seed; failure traces are uploaded as GitHub artifacts for offline replay with hypersim replay.
See ../../../hypersim/README.md for the full architecture, fault model, CLI reference, and determinism checklist for contributors.
Deep Dive Documents¶
For detailed technical documentation on specific subsystems, see:
- Deep Dive: Write Path -- line protocol ingestion through Parquet file creation
- Deep Dive: Read Path -- InfluxQL parsing, ClickHouse SQL translation, and query execution
- Deep Dive: Compaction -- multi-file Parquet merge, schema evolution, orphan cleanup, and cluster verification hook
- Deep Dive: Self-Repair -- bucket hashing, sync manifest provenance, verified compaction, membership-driven slice repair
- Deep Dive: Clustering -- Raft consensus, replication, anti-entropy, and graceful drain
- Developer guide -- contributing, building, testing, and extending HyperbyteDB