HyperbyteDB system architecture¶
This document is a long-form internal design walkthrough (module layout, paths, storage). It complements the shorter Architecture page and the deep dive series. If anything disagrees with src/ or a deep dive, prefer the code and the focused deep dive.
It is intended for contributors who need to understand how HyperbyteDB works under the hood.
Table of Contents¶
- Architecture Overview
- Hexagonal Architecture
- Module Structure
- Write Path
- Query Path
- WAL (Write-Ahead Log)
- Parquet Storage
- Metadata Store
- Flush Pipeline
- Compaction
- Query Engine (chDB)
- InfluxQL Parser
- ClickHouse SQL Translator
- Retention Enforcement
- DELETE and Tombstones
- Continuous Queries
- Authentication Internals
- Clustering and Replication
- Background Services
- Error Handling
- Observability
- Concurrency Model
- Dependencies
- Statement Summary
- Debug Binary
- Kubernetes Operator
- Deterministic Simulation Testing (HyperSim)
1. Architecture Overview¶
HyperbyteDB combines four storage/compute engines into one binary:
Client (Telegraf, Grafana, curl)
│
▼
┌─────────────────────────────┐
│ HTTP Layer (axum) │ Line protocol, InfluxQL, auth, gzip
└──────────────┬──────────────┘
│
▼
┌─────────────────────────────┐
│ Application Services │ Ingestion, Query, Flush, Compaction,
│ │ Retention, Continuous Queries
└──────────────┬──────────────┘
│
▼
┌─────────────────────────────┐
│ Port Traits (interfaces) │ WalPort, StoragePort, QueryPort,
│ │ MetadataPort, IngestionPort
└──────────────┬──────────────┘
│
┌────────┼────────┬───────────┐
▼ ▼ ▼ ▼
┌───────┐ ┌──────────┐ ┌──────┐ ┌──────┐
│RocksDB│ │ Parquet  │ │ chDB │ │ S3/  │
│(WAL + │ │ (Arrow)  │ │(Click│ │Local │
│ meta) │ │          │ │House)│ │ FS   │
└───────┘ └──────────┘ └──────┘ └──────┘
RocksDB provides the WAL (durable, ordered write log) and metadata store (databases, measurements, schemas, users, tombstones, CQ definitions, Parquet file registry).
Apache Parquet is the long-term storage format. Points are batched, converted to Arrow RecordBatches, and written as ZSTD-compressed Parquet files (ZSTD level 1, row groups of 65,536 rows, page-level statistics enabled).
chDB (embedded ClickHouse) is the query engine. InfluxQL is transpiled to ClickHouse SQL that uses the file() table function to scan Parquet files directly.
object_store abstracts local filesystem and S3-compatible storage for Parquet files.
2. Hexagonal Architecture¶
HyperbyteDB uses the hexagonal (ports and adapters) pattern. Business logic lives in the application and domain layers and depends only on port traits, never on concrete implementations.
┌────────────────────────────────────┐
│ Domain Layer │
│ Point, FieldValue, Database, │
│ RetentionPolicy, Precision, │
│ QueryResponse, SeriesResult │
└────────────────────────────────────┘
│
┌────────────────────────────────────┐
│ Application Services │
│ IngestionService, QueryService, │
│ FlushService, CompactionService, │
│ RetentionService, CQService │
└──────────┬────────────┬────────────┘
│ depends │
│ only on │
▼ ports ▼
┌─────────────────────────────────────────────┐
│ Port Traits │
│ WalPort StoragePort QueryPort │
│ MetadataPort IngestionPort │
└────────────┬────────────────┬───────────────┘
│ │
┌─────────────▼──┐ ┌────────▼───────────┐
│ Adapters │ │ Adapters │
│ (inbound) │ │ (outbound) │
│ HTTP handlers, │ │ RocksDB WAL, │
│ Peer handlers │ │ RocksDB Metadata, │
│ │ │ Parquet Writer, │
│ │ │ Local/S3 Storage, │
│ │ │ chDB Query Adapter │
└────────────────┘ └─────────────────────┘
This means:
- Swapping RocksDB for another WAL requires only implementing WalPort.
- Swapping chDB for another SQL engine requires only implementing QueryPort.
- The HTTP layer can be replaced without touching business logic.
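The swap-the-adapter idea can be sketched in a few lines. This is an illustrative reduction, not the real WalPort signature: the method names and the in-memory adapter are assumptions for the example.

```rust
// Minimal ports-and-adapters sketch. The trait is what services depend on;
// the in-memory adapter stands in for the RocksDB one.
use std::collections::BTreeMap;

trait WalPort {
    fn append(&mut self, entry: Vec<u8>) -> u64;
    fn last_sequence(&self) -> u64;
}

struct InMemoryWal {
    entries: BTreeMap<u64, Vec<u8>>,
    seq: u64,
}

impl WalPort for InMemoryWal {
    fn append(&mut self, entry: Vec<u8>) -> u64 {
        self.seq += 1;
        self.entries.insert(self.seq, entry);
        self.seq
    }
    fn last_sequence(&self) -> u64 {
        self.seq
    }
}

// A service sees only the trait, never the concrete adapter.
fn ingest(wal: &mut dyn WalPort, body: &[u8]) -> u64 {
    wal.append(body.to_vec())
}

fn main() {
    let mut wal = InMemoryWal { entries: BTreeMap::new(), seq: 0 };
    let seq = ingest(&mut wal, b"cpu,host=a usage=1 1700000000000000000");
    assert_eq!(seq, 1);
    assert_eq!(wal.last_sequence(), 1);
}
```

Replacing RocksDB here means writing another `impl WalPort`; `ingest` is untouched.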
3. Module Structure¶
src/
├── main.rs CLI, server bootstrap, graceful shutdown
├── lib.rs Top-level module declarations
├── bootstrap.rs Composition root: builds all adapters and services
├── config.rs Figment-based configuration loading
├── error.rs HyperbytedbError enum + From impls
│
├── domain/
│ ├── point.rs Point, FieldValue (Float/Integer/UInteger/String/Boolean)
│ ├── series.rs SeriesKey (measurement + sorted tags)
│ ├── database.rs Database, RetentionPolicy, Precision
│ ├── query_result.rs QueryResponse, StatementResult, SeriesResult
│ ├── wal.rs WalEntry struct
│ ├── measurement.rs MeasurementMeta
│ ├── user.rs StoredUser (auth)
│ ├── continuous_query.rs ContinuousQueryDef
│ └── storage_layout.rs StorageLayout path helpers
│
├── ports/
│ ├── ingestion.rs trait IngestionPort
│ ├── query.rs trait QueryPort, trait QueryService
│ ├── storage.rs trait StoragePort
│ ├── wal.rs trait WalPort
│ ├── metadata.rs trait MetadataPort
│ └── auth.rs trait AuthPort
│
├── adapters/
│ ├── http/
│ │ ├── router.rs Route definitions, AppState, build_router()
│ │ ├── write.rs POST /write (line protocol, gzip)
│ │ ├── query.rs GET/POST /query (InfluxQL, bind params, CSV, chunked)
│ │ ├── ping.rs /ping, /health
│ │ ├── metrics.rs /metrics (Prometheus)
│ │ ├── auth_middleware.rs Auth extraction & verification
│ │ ├── middleware.rs Version headers, Request-Id
│ │ ├── error.rs HyperbytedbError → HTTP response mapping
│ │ ├── response.rs InfluxDB error JSON formatting
│ │ ├── cluster.rs /cluster/metrics, /cluster/nodes
│ │ ├── peer_handlers.rs /internal/replicate, sync, membership, drain
│ │ ├── raft_handlers.rs /internal/raft/vote, append, snapshot, membership
│ │ └── statements.rs /api/v1/statements (statement summary)
│ │
│ ├── chdb/
│ │ ├── query_adapter.rs Single-session chDB wrapper (QueryPort)
│ │ └── pool.rs Multi-session pool with concurrency limit
│ │
│ ├── storage/
│ │ ├── layout.rs Re-exports StorageLayout from domain
│ │ ├── parquet_writer.rs Arrow schema + RecordBatch → Parquet bytes (ZSTD)
│ │ ├── local.rs Local filesystem StoragePort
│ │ └── s3.rs S3-compatible StoragePort (object_store)
│ │
│ ├── wal/
│ │ └── rocksdb_wal.rs RocksDB WAL implementation
│ │
│ ├── metadata/
│ │ └── rocksdb_meta.rs RocksDB metadata implementation
│ │
│ └── auth.rs MetadataAuthAdapter
│
├── application/
│ ├── ingestion_service.rs Line protocol → metadata → WAL
│ ├── peer_ingestion_service.rs Same + fan-out replication to peers
│ ├── query_service.rs InfluxQL dispatch, chDB execution, result formatting
│ ├── peer_query_service.rs Same + Raft-based mutation replication
│ ├── flush_service.rs Background WAL → Parquet flush
│ ├── compaction_service.rs Compaction orchestration: periodic merge, `compact_all`, verified compaction, self-repair, orphan cleanup
│ ├── compaction_merge.rs Parquet merge engine: Arrow streaming k-way merge by `time`, legacy full-buffer sort fallback
│ ├── retention_service.rs Background expired-data cleanup
│ ├── continuous_query_service.rs Background CQ scheduler
│ └── statement_summary.rs Query statement tracking
│
├── influxql/
│ ├── ast.rs All AST node types (Statement, SelectStatement, Expr, etc.)
│ ├── parser.rs Hand-rolled recursive descent InfluxQL parser
│ ├── to_clickhouse.rs AST → ClickHouse SQL translation
│ └── digest.rs Query digest/canonicalization
│
├── cluster/
│ ├── bootstrap.rs ClusterBootstrap: startup sync, Raft init
│ ├── membership.rs ClusterMembership, SharedMembership, NodeState, NodeInfo
│ ├── peer_client.rs HTTP client for write/mutation replication
│ ├── sync_client.rs Sync client for startup/full/reconnect sync
│ ├── sync.rs SyncManifest, VerifyRequest/Response, DivergentBucket
│ ├── merkle.rs Merkle tree construction for anti-entropy
│ ├── replication_log.rs RocksDB-backed replication/mutation ack tracking
│ ├── drain.rs DrainService: graceful node shutdown
│ ├── types.rs MutationRequest (replication uses line protocol + headers)
│ └── raft/
│ ├── mod.rs TypeConfig, HyperbytedbRaft type alias
│ ├── types.rs ClusterRequest, ClusterResponse
│ ├── log_store.rs RaftStore: RocksDB-backed Raft log/state
│ ├── network.rs HTTP-based Raft network transport
│ └── state_machine.rs Raft state machine, schema mutation application
│
└── bin/
└── hyperbytedb_debug.rs CLI for cluster status, topology, Raft, metrics
4. Write Path¶
The write path is optimized for low-latency ingestion. Data is durable the moment the WAL append returns, and becomes queryable after the next flush cycle. For an exhaustive treatment of every step, see Deep Dive: Write Path.
Client POST /write?db=mydb&precision=ns
│
▼
┌─────────────────────────────────────────┐
│ write.rs: handle_write() │
│ 1. Extract db, rp, precision params │
│ 2. Gzip decompress if needed │
│ 3. Call IngestionPort.ingest() │
└──────────────┬──────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ ingestion_service.rs │
│ 1. Verify database exists (metadata) │
│ 2. Parse line protocol body │
│ 3. Convert ParsedLine → Vec<Point> │
│ - Apply precision to timestamps │
│ - Default to current time if absent │
│ 4. Register field types + tag keys │
│ in metadata │
│ 5. Check cardinality limits │
│ 6. Store tag values for SHOW queries │
│ 7. Append WalEntry to WAL │
└──────────────┬──────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ rocksdb_wal.rs: append() │
│ 1. Atomic fetch_add sequence number │
│ 2. bincode::serialize(WalEntry) │
│ 3. WriteBatch: put to "wal" CF + │
│ update "last_seq" in "wal_meta" CF │
│ 4. Return sequence number │
└─────────────────────────────────────────┘
In cluster mode, PeerIngestionService wraps the base service and, after the local WAL append succeeds, fires off async HTTP POST requests to all peers via PeerClient.replicate_write(). The local write returns 204 immediately without waiting for replication.
Data types¶
The Point struct carries:
- measurement: String — measurement name
- tags: BTreeMap<String, String> — sorted tag key-value pairs
- fields: BTreeMap<String, FieldValue> — field key-value pairs
- timestamp: i64 — nanoseconds since Unix epoch
FieldValue has five variants:
| Variant | Discriminant | Arrow type | Parquet type |
|---|---|---|---|
Float(f64) | 0 | Float64 | DOUBLE |
Integer(i64) | 1 | Int64 | INT64 |
UInteger(u64) | 2 | UInt64 | INT64 (unsigned) |
String(String) | 3 | Utf8 | BYTE_ARRAY |
Boolean(bool) | 4 | Boolean | BOOLEAN |
Field types are registered on first write and enforced on subsequent writes. A write that sends an integer where a float was previously registered returns a FieldTypeConflict error (HTTP 400).
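The first-write-wins typing rule can be sketched with the discriminants from the table above. The helper name and error shape here are illustrative, not the real ingestion-service code.

```rust
// Sketch of first-write-wins field typing.
// FieldValue discriminants: 0=Float, 1=Integer, 2=UInteger, 3=String, 4=Boolean.
use std::collections::HashMap;

fn register_field(
    field_types: &mut HashMap<String, u8>,
    name: &str,
    discriminant: u8,
) -> Result<(), String> {
    match field_types.get(name) {
        // First write registers the type.
        None => {
            field_types.insert(name.to_string(), discriminant);
            Ok(())
        }
        // Subsequent writes must match it.
        Some(&d) if d == discriminant => Ok(()),
        Some(&d) => Err(format!(
            "FieldTypeConflict: field {name} registered as {d}, got {discriminant}"
        )),
    }
}

fn main() {
    let mut types = HashMap::new();
    assert!(register_field(&mut types, "value", 0).is_ok()); // first write: Float
    assert!(register_field(&mut types, "value", 0).is_ok()); // same type: OK
    assert!(register_field(&mut types, "value", 1).is_err()); // Integer now conflicts
}
```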
5. Query Path¶
For an exhaustive treatment of every step, see Deep Dive: Read Path.
Client GET /query?db=mydb&q=SELECT mean("value") FROM "cpu" WHERE time > now() - 1h GROUP BY time(5m)
│
▼
┌────────────────────────────────────────────────────────────────┐
│ query.rs: handle_query_impl() │
│ 1. Extract q, db, epoch, pretty, chunked, params │
│ 2. Substitute $param bind parameters if present │
│ 3. Call QueryService.execute_query() with timeout wrapper │
│ 4. Format response as JSON, CSV, or chunked │
└──────────────┬─────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────────┐
│ query_service.rs: execute_query() │
│ 1. tokio::time::timeout wraps entire execution │
│ 2. Parse InfluxQL string → Vec<Statement> │
│ 3. For each statement, dispatch: │
│ ┌─────────────────────────────────────────────┐ │
│ │ SHOW DATABASES → metadata.list_databases() │ │
│ │ SHOW MEASUREMENTS → metadata.list_meas() │ │
│ │ SHOW TAG KEYS → metadata.get_meas() │ │
│ │ SHOW TAG VALUES → metadata.list_tag_values() │ │
│ │ SHOW FIELD KEYS → metadata.get_meas() │ │
│ │ CREATE DATABASE → metadata.create_database() │ │
│ │ DROP DATABASE → metadata.drop_database() │ │
│ │ DELETE → metadata.store_tombstone() │ │
│ │ SELECT → see SELECT flow below │ │
│ └─────────────────────────────────────────────┘ │
└──────────────┬─────────────────────────────────────────────────┘
│ (SELECT only)
▼
┌────────────────────────────────────────────────────────────────┐
│ handle_select() │
│ 1. Extract measurement name from FROM clause │
│ 2. Handle regex measurements: query metadata for matches, │
│ execute UNION ALL across all matching measurements │
│ 3. Handle subqueries: translate inner SELECT first │
│ 4. Resolve default retention policy from metadata │
│ 5. Determine time range from WHERE clause │
│ 6. Generate Parquet glob pattern via StorageLayout │
│ 7. Load tombstones for the measurement │
│ 8. Translate InfluxQL AST → ClickHouse SQL │
│ (to_clickhouse::translate) │
│ 9. Execute SQL via chDB (QueryPort.execute_sql) │
│ 10. Parse JSONEachRow output → SeriesResult[] │
│ - Group by tag combinations │
│ - Apply epoch formatting to timestamps │
│ 11. Handle SLIMIT/SOFFSET (series-level pagination) │
│ 12. Handle INTO clause (write results to target measurement) │
└────────────────────────────────────────────────────────────────┘
Result formatting¶
chDB returns data in JSONEachRow format (one JSON object per row):
{"__time":"2024-01-15 10:00:00","host":"server01","mean_usage_idle":42.5}
{"__time":"2024-01-15 10:05:00","host":"server01","mean_usage_idle":38.2}
The query service transforms this into InfluxDB v1 series format:
1. Parse each line as a JSON object.
2. Rename __time back to time.
3. Convert ClickHouse datetime strings to nanosecond timestamps.
4. Apply the epoch parameter (convert to ns/us/ms/s integers or leave as RFC3339 strings).
5. Group rows by tag combination into separate SeriesResult objects.
6. Each SeriesResult gets a name (measurement), tags map, columns list, and values array.
6. WAL (Write-Ahead Log)¶
The WAL provides crash-safe durability for incoming writes before they're flushed to Parquet.
Implementation: RocksDbWal¶
- Backing store: RocksDB with two column families.
- Location: Configured via storage.wal_dir (default ./wal).
| Column Family | Purpose |
|---|---|
wal | Ordered WAL entries. Keys are big-endian u64 sequence numbers (8 bytes). Values are bincode-serialized WalEntry. |
wal_meta | Single key last_seq storing the current sequence number as big-endian u64. |
WalEntry structure¶
Serialized with bincode for compact binary encoding.
Operations¶
| Operation | Description |
|---|---|
append(entry) | Atomically increments sequence, serializes entry, writes to wal CF and updates last_seq in a single WriteBatch. Returns sequence number. |
read_from(seq) | Forward iterator from seq to end. Returns Vec<(u64, WalEntry)>. |
read_range(start, count) | Reads up to count entries starting at start. Used by flush service for chunked reads. |
truncate_before(seq) | Deletes all entries with sequence < seq using delete_range_cf. |
last_sequence() | Returns the current sequence number from the atomic counter. |
Key encoding¶
Sequence numbers are encoded as big-endian u64 so that RocksDB's lexicographic ordering preserves numerical order. This allows efficient range scans and truncation.
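A two-line check shows why the encoding matters: byte-wise comparison of big-endian keys agrees with numeric order, while little-endian keys do not.

```rust
// Big-endian u64 keys sort byte-wise in the same order as the numbers,
// which is what RocksDB's lexicographic key comparison requires.
fn encode_seq(seq: u64) -> [u8; 8] {
    seq.to_be_bytes()
}

fn main() {
    // 255 < 256 numerically, and the big-endian keys preserve that:
    // [..,0,255] < [..,1,0] byte-wise.
    assert!(encode_seq(255) < encode_seq(256));
    // Little-endian would break it: [255,0,..] > [0,1,..].
    assert!(255u64.to_le_bytes() > 256u64.to_le_bytes());
}
```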
7. Parquet Storage¶
File layout¶
Example:
/var/lib/hyperbytedb/data/mydb/autogen/cpu/2024-01-15/10.parquet
/var/lib/hyperbytedb/data/mydb/autogen/cpu/2024-01-15/10_a3f2b1c8.parquet
The UUID suffix is appended when multiple flushes write to the same hour bucket, preventing overwrites.
Arrow schema¶
Each Parquet file has a consistent Arrow schema:
| Column | Arrow Type | Nullable | Notes |
|---|---|---|---|
time | Timestamp(Nanosecond, Some("UTC")) | No | Always present, always first |
<tag_key> | Utf8 | Yes | One column per tag key |
<field_name> | Varies by type | Yes | Float64, Int64, UInt64, Utf8, Boolean |
Schema is constructed from the measurement's MeasurementMeta (tag keys + field types) stored in the metadata store.
Parquet properties¶
- Compression: ZSTD level 1 (Compression::ZSTD(ZstdLevel::try_new(1)))
- Row group size: 65,536 rows (set_max_row_group_size(65536))
- Statistics: Page-level (EnabledStatistics::Page)
- Encoding: Default Parquet encodings (dictionary for strings, delta for integers)
Storage backends¶
Two StoragePort implementations:
LocalStorage — Writes to and reads from the local filesystem. Creates intermediate directories as needed.
S3Storage — Uses the object_store crate to read/write Parquet files to any S3-compatible service (AWS S3, MinIO, Cloudflare R2). Configured via [storage.s3].
Both backends are transparent to the rest of the system — the flush service and compaction service interact only through the StoragePort trait.
8. Metadata Store¶
Implementation: RocksDbMetadata¶
- Backing store: RocksDB with one column family, metadata.
- Location: Configured via storage.meta_dir (default ./meta).
- Serialization: JSON for all values (human-inspectable with RocksDB tools).
Key schema¶
| Key Pattern | Value | Description |
|---|---|---|
db:{name} | JSON {"database": Database} | Database definition with retention policies |
meas:{db}:{name} | JSON MeasurementMeta | Field types (name → discriminant), tag keys |
tag_val:{db}:{meas}:{key}:{value} | Empty | Tag value existence (for SHOW TAG VALUES) |
pq:{db}:{rp}:{meas}:{min_time_hex} | JSON {"path", "min_time", "max_time"} | Parquet file registry |
user:{username} | JSON StoredUser | Username, password_hash, admin flag |
tombstone:{db}:{meas}:{uuid} | JSON {"predicate_sql", "created_at"} | DELETE tombstone records |
cq:{db}:{name} | JSON ContinuousQueryDef | Continuous query definitions |
MeasurementMeta¶
struct MeasurementMeta {
name: String,
field_types: HashMap<String, u8>, // field_name → FieldValue discriminant
tag_keys: Vec<String>,
}
Field types use the FieldValue discriminant (0=Float, 1=Integer, 2=UInteger, 3=String, 4=Boolean) to reconstruct Arrow schemas during flush.
Parquet file registry¶
Each Parquet file is registered with its time range:
{
"path": "/var/lib/hyperbytedb/data/mydb/autogen/cpu/2024-01-15/10_abc123.parquet",
"min_time": 1705312800000000000,
"max_time": 1705316399999999999
}
The key includes min_time as a zero-padded hex string for efficient time-range prefix scanning.
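The padding requirement is easy to see in isolation. This sketch assumes 16 hex digits of zero-padding; the exact width in the real key builder may differ.

```rust
// Sketch of the Parquet registry key. Zero-padding min_time keeps
// lexicographic key order aligned with numeric time order.
fn registry_key(db: &str, rp: &str, meas: &str, min_time: i64) -> String {
    format!("pq:{db}:{rp}:{meas}:{:016x}", min_time)
}

fn main() {
    let early = registry_key("mydb", "autogen", "cpu", 9);
    let late = registry_key("mydb", "autogen", "cpu", 16);
    // Unpadded, "9" > "10" lexicographically, so time 9 would sort after
    // time 16. Padded ("...0009" vs "...0010"), the order is correct.
    assert!(early < late);
}
```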
9. Flush Pipeline¶
The flush service is the bridge between the WAL and Parquet storage. It runs as a background Tokio task.
Lifecycle¶
- Timer tick every flush.interval_secs (default 10s).
- Read WAL entries from last_flushed_seq + 1 in chunks of 5,000 entries.
- Group all points by (database, retention_policy, measurement).
- Partition each group by hour bucket (using the point's timestamp).
- Sub-batch large groups by max_points_per_batch (auto-detected from available memory, clamped to 10K–500K).
- For each sub-batch, in parallel via spawn_blocking:
  - Look up MeasurementMeta from metadata.
  - Build Arrow schema.
  - Convert Vec<Point> to RecordBatch.
  - Write RecordBatch to Parquet bytes (ZSTD level 1 compressed, 65,536-row groups).
- Write all Parquet files in parallel via tokio::spawn (I/O-parallel).
- Register each file in the metadata Parquet file registry.
- Truncate WAL entries up to the max flushed sequence.
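The grouping and hour-bucket partitioning above reduce to simple integer arithmetic on the nanosecond timestamp. A minimal sketch (the bucket key and map shape are illustrative):

```rust
// Partition timestamps into hour buckets since the Unix epoch.
use std::collections::HashMap;

const NS_PER_HOUR: i64 = 3_600 * 1_000_000_000;

fn hour_bucket(ts_ns: i64) -> i64 {
    ts_ns.div_euclid(NS_PER_HOUR)
}

fn main() {
    let mut buckets: HashMap<(String, i64), Vec<i64>> = HashMap::new();
    let points = [
        ("cpu", 1_705_312_800_000_000_000i64), // 2024-01-15 10:00 UTC
        ("cpu", 1_705_314_600_000_000_000),    // 10:30, same hour bucket
        ("cpu", 1_705_316_400_000_000_000),    // 11:00, next bucket
    ];
    for (meas, ts) in points {
        buckets
            .entry((meas.to_string(), hour_bucket(ts)))
            .or_default()
            .push(ts);
    }
    // Two hour buckets, so two Parquet files would be produced.
    assert_eq!(buckets.len(), 2);
}
```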
Memory-aware batching¶
When max_points_per_batch = 0 (default), the flush service calls auto_detect_batch_size():
- On Linux: reads /proc/meminfo for MemAvailable and uses 25% of that.
- Elsewhere: defaults to 100,000 points.
- Clamps to [MIN_BATCH_POINTS = 10_000, MAX_BATCH_POINTS = 500_000].
Each point is estimated at 512 bytes for memory budgeting.
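Putting the three rules together, the sizing logic amounts to the arithmetic below (a sketch of auto_detect_batch_size, not the actual function body):

```rust
// 25% of available memory at ~512 bytes/point, clamped to documented bounds.
const MIN_BATCH_POINTS: usize = 10_000;
const MAX_BATCH_POINTS: usize = 500_000;
const BYTES_PER_POINT: usize = 512;

fn batch_size_for(available_bytes: usize) -> usize {
    let budget = available_bytes / 4; // use 25% of MemAvailable
    (budget / BYTES_PER_POINT).clamp(MIN_BATCH_POINTS, MAX_BATCH_POINTS)
}

fn main() {
    // 8 GiB available -> 2 GiB budget -> 4M points, clamped to the ceiling.
    assert_eq!(batch_size_for(8 * 1024 * 1024 * 1024), MAX_BATCH_POINTS);
    // 64 MiB available -> 16 MiB budget -> 32,768 points, within bounds.
    assert_eq!(batch_size_for(64 * 1024 * 1024), 32_768);
    // A tiny machine clamps up to the floor.
    assert_eq!(batch_size_for(1024 * 1024), MIN_BATCH_POINTS);
}
```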
Parallelism¶
- CPU: Parquet conversion (schema building, RecordBatch creation, compression) runs in spawn_blocking tasks on the Tokio blocking threadpool.
- I/O: File writes run in concurrent tokio::spawn tasks.
- Sequencing: All CPU work completes before I/O begins, ensuring no partial writes.
10. Compaction¶
Compaction merges small Parquet files for a measurement's time bucket into fewer, larger files. Orchestration lives in src/application/compaction_service.rs (when to run, metadata, storage callbacks, cluster verification). The actual Parquet merge is implemented in src/application/compaction_merge.rs using Apache Arrow only (no chDB). For behavior detail, see Deep Dive: Compaction.
Trigger conditions¶
- Runs every compaction.interval_secs (default 30s).
- Scans metadata for (db, rp, measurement) triples.
- Groups Parquet files by time bucket (bucket_duration, default 1 hour).
- Triggers when a measurement has >= min_files_to_compact total files (default 2), then compacts each bucket with >= 2 files.
Process¶
Implementation: compaction_merge::merge_parquet_files (called from compaction_service).
1. Open inputs for streaming: For each path, StoragePort::prepare_parquet_stream returns a filesystem path (local) or a temp file (S3 download), avoiding holding full Parquet bytes in memory on the happy path.
2. Merge (default): merge_parquet_streaming opens one ParquetRecordBatchReader per file, builds a unified schema from Parquet metadata, conforms batches, and performs a k-way merge ordered by time (min-heap), assuming each input file is already sorted by time (true for WAL flushes and prior compactions). Accumulator batches flush when row count or estimated size hits limits, then write_batch_with_limit emits one or more output Parquet blobs (ZSTD).
3. Fallback: If per-file or global time order is violated, the code re-reads full file bytes and runs the legacy path: concat_batches + sort_to_indices/take + write_batch_with_limit.
4. Register and delete: Register new files in metadata, unregister old paths, delete old objects from storage (still orchestrated in compaction_service).
5. Orphan cleanup: Remove on-disk Parquet files with no metadata entry (in compaction_service).
Steps 2–3 use the same schema-unification, conformance, and size-splitting ideas as before; see Deep Dive: Compaction for the full algorithm.
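The min-heap merge at the core of the streaming path can be shown in miniature. Real inputs are Arrow batches keyed by the time column; plain i64 timestamps stand in here.

```rust
// K-way merge of already-sorted inputs via a min-heap over each input's head.
use std::cmp::Reverse;
use std::collections::BinaryHeap;

fn k_way_merge(inputs: Vec<Vec<i64>>) -> Vec<i64> {
    let mut heap = BinaryHeap::new();
    // Seed the heap with (timestamp, source index, offset) per input.
    for (src, input) in inputs.iter().enumerate() {
        if let Some(&ts) = input.first() {
            heap.push(Reverse((ts, src, 0usize)));
        }
    }
    let mut out = Vec::new();
    while let Some(Reverse((ts, src, idx))) = heap.pop() {
        out.push(ts);
        // Advance the source that yielded the smallest timestamp.
        if let Some(&next) = inputs[src].get(idx + 1) {
            heap.push(Reverse((next, src, idx + 1)));
        }
    }
    out
}

fn main() {
    let merged = k_way_merge(vec![vec![1, 4, 9], vec![2, 3, 10], vec![5]]);
    assert_eq!(merged, vec![1, 2, 3, 4, 5, 9, 10]);
}
```

This is why the streaming path only needs one batch per input in memory at a time; the full-buffer sort fallback exists for inputs that violate the sorted-by-time precondition.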
Tombstones are not applied during compaction. They are applied at query time only, by injecting AND NOT (predicate) clauses into generated SQL.
Safety¶
Compaction is idempotent. If it crashes mid-way, old files remain intact and the next compaction cycle retries. New files are written with UUID-suffixed names ({HH}_c{8-char-uuid}.parquet) to prevent collisions.
Cluster mode: hash verification and Parquet self-repair¶
When clustering is enabled and the compaction service is constructed with shared membership (src/bootstrap.rs), each compaction interval runs verified compaction after the ordinary merges, plus optional membership-driven self-repair:
- Verified compaction hashes per-origin Parquet in cold buckets (older than compaction.verified_compaction_age_secs) against the authoring peer via GET /internal/bucket-hash, using the same bucket_duration as local compaction. On mismatch it re-downloads origin-filtered files from the peer's GET /internal/sync/manifest + /internal/sync/parquet/....
- Self-repair walks each active peer's manifest (capped by max_repair_checks_per_cycle) to find buckets where that peer authored data but this node's hash differs (including empty vs non-empty), then fetches the same way.
Sync manifest rows carry origin_node_id for precise filtering; older binaries omit the field and peers fall back to _n{id}_ filename tags.
For protocol detail, metrics, and configuration, see Deep Dive: Self-Repair.
11. Query Engine (chDB)¶
HyperbyteDB uses chDB (embedded ClickHouse) as its query engine. chDB provides the full ClickHouse SQL dialect including window functions, aggregates, and the file() table function.
Session management¶
chDB::Session is:
- Send but not Sync: requires Mutex wrapping.
- Synchronous: all calls block the thread, so they must run in spawn_blocking.
Single session (pool_size = 1)¶
Each query:
1. Clones the Arc<Mutex<Session>>.
2. spawn_blocking → blocking_lock() → session.execute(sql, JSONEachRow).
3. Returns the UTF-8 result string.
Session pool (pool_size > 1)¶
Round-robin assignment. Each session has its own session_data_path subdirectory ({base}/pool_{i}). Multiple sessions allow concurrent queries without contending on a single Mutex.
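Round-robin selection over the pool is a single atomic counter. A sketch (the real pool hands out Arc<Mutex<Session>> handles; bare indices stand in here):

```rust
// Round-robin index selection with an atomic counter.
use std::sync::atomic::{AtomicUsize, Ordering};

struct Pool {
    size: usize,
    next: AtomicUsize,
}

impl Pool {
    fn pick(&self) -> usize {
        // fetch_add wraps on overflow, which is harmless for an index counter.
        self.next.fetch_add(1, Ordering::Relaxed) % self.size
    }
}

fn main() {
    let pool = Pool { size: 3, next: AtomicUsize::new(0) };
    let picks: Vec<usize> = (0..6).map(|_| pool.pick()).collect();
    assert_eq!(picks, vec![0, 1, 2, 0, 1, 2]);
}
```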
Output format¶
All queries use OutputFormat::JSONEachRow — one JSON object per result row. This is parsed by the query service into InfluxDB v1 series format.
12. InfluxQL Parser¶
The parser is a hand-rolled recursive descent parser (no parser generator). It lives in src/influxql/parser.rs.
Parse flow¶
Input: "SELECT mean(\"value\") FROM \"cpu\" WHERE time > now() - 1h; SHOW DATABASES"
│
▼
split_statements(";")
│
┌───────────┼───────────┐
▼ ▼
parse_statement() parse_statement()
first token = "SELECT" first token = "SHOW"
│ │
▼ ▼
parse_select() parse_show()
│ │
▼ ▼
SelectStatement Statement::ShowDatabases
Statement dispatch¶
The parser examines the first keyword (case-insensitive) and dispatches:
| First token | Handler |
|---|---|
SELECT | parse_select() |
SHOW | parse_show() → further dispatch by second/third token |
CREATE | parse_create() → CREATE DATABASE, CREATE RETENTION POLICY, CREATE USER, CREATE CONTINUOUS QUERY |
DROP | parse_drop() → DROP DATABASE, DROP MEASUREMENT, etc. |
DELETE | parse_delete() |
ALTER | parse_alter() |
SET | parse_set_password() |
GRANT | parse_grant() |
REVOKE | parse_revoke() |
Expression parsing¶
The SELECT field list and WHERE clause use a precedence-climbing expression parser:
| Precedence | Operators |
|---|---|
| 1 (lowest) | OR |
| 2 | AND |
| 3 | =, !=, <>, <, <=, >, >=, =~, !~ |
| 4 | +, - |
| 5 | *, /, % |
| 6 (highest) | Unary -, NOT |
Atoms include: identifiers ("column" or bare), string literals ('value'), integer/float literals, duration literals (1h, 30s), now(), function calls (mean(...), derivative(...)), *, regex (/pattern/), and subqueries.
Duration parsing¶
| Suffix | Duration |
|---|---|
ns | Nanoseconds |
u | Microseconds |
ms | Milliseconds |
s | Seconds |
m | Minutes |
h | Hours |
d | Days |
w | Weeks |
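The suffix table above translates directly into a small parser. This is a sketch of the idea, not the parser's actual code; note that two-character suffixes must be tried before one-character ones ("ns" before "s", "ms" before "m").

```rust
// Parse an InfluxQL duration literal into nanoseconds.
fn parse_duration_ns(lit: &str) -> Option<i64> {
    // Longer suffixes first so "5ms" is not read as "5m" + trailing "s".
    let suffixes: &[(&str, i64)] = &[
        ("ns", 1),
        ("ms", 1_000_000),
        ("u", 1_000),
        ("s", 1_000_000_000),
        ("m", 60 * 1_000_000_000),
        ("h", 3_600 * 1_000_000_000),
        ("d", 86_400 * 1_000_000_000),
        ("w", 7 * 86_400 * 1_000_000_000),
    ];
    for (suffix, scale) in suffixes {
        if let Some(num) = lit.strip_suffix(suffix) {
            return num.parse::<i64>().ok().map(|n| n * scale);
        }
    }
    None
}

fn main() {
    assert_eq!(parse_duration_ns("1h"), Some(3_600_000_000_000));
    assert_eq!(parse_duration_ns("30s"), Some(30_000_000_000));
    assert_eq!(parse_duration_ns("5ms"), Some(5_000_000));
    assert_eq!(parse_duration_ns("bad"), None);
}
```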
AST types¶
Key AST nodes (in src/influxql/ast.rs):
- Statement — enum of all statement types
- SelectStatement — fields, from, into, condition, group_by, order_by, limit, offset, slimit, soffset, fill, timezone
- Field — expression + optional alias
- Expr — recursive expression tree (identifiers, literals, function calls, binary/unary ops, subqueries)
- FunctionCall — name + args
- GroupBy — list of Dimension (Time, Tag, Regex)
- FillOption — Null, None, Previous, Linear, Value(f64)
- Measurement — optional database, optional RP, name or regex
13. ClickHouse SQL Translator¶
The translator (src/influxql/to_clickhouse.rs) converts an InfluxQL SelectStatement AST into ClickHouse SQL that operates on Parquet files.
Translation pipeline¶
SelectStatement
│
▼
┌─────────────────────────────────────┐
│ SELECT clause │
│ - Time bucket: toStartOfInterval() │
│ - GROUP BY tags added to SELECT │
│ - Aggregate function mapping │
│ - Transform → window functions │
│ - fill(N) → ifNull() wrapping │
│ - Arithmetic expressions │
│ - Default aliases (mean_field) │
└──────────────┬──────────────────────┘
│
┌──────────────▼──────────────────────┐
│ FROM clause │
│ - file('{glob}', Parquet) │
│ - Subqueries become inline SELECTs │
└──────────────┬──────────────────────┘
│
┌──────────────▼──────────────────────┐
│ WHERE clause │
│ - now() → now64() │
│ - Duration → INTERVAL │
│ - Epoch literals → fromUnixTimestamp│
│ - Regex =~ → match() │
│ - Tombstone predicates appended │
│ - String comparisons preserved │
└──────────────┬──────────────────────┘
│
┌──────────────▼──────────────────────┐
│ GROUP BY clause │
│ - time(5m) → toStartOfInterval() │
│ - Tag dimensions │
└──────────────┬──────────────────────┘
│
┌──────────────▼──────────────────────┐
│ ORDER BY clause │
│ - WITH FILL for fill modes │
│ - INTERPOLATE for fill(previous) │
│ and fill(linear) │
└──────────────┬──────────────────────┘
│
┌──────────────▼──────────────────────┐
│ LIMIT / OFFSET │
└─────────────────────────────────────┘
Function mapping¶
| InfluxQL | ClickHouse |
|---|---|
MEAN(f) | avg(f) |
MEDIAN(f) | median(f) |
COUNT(f) | count(f) |
SUM(f) | sum(f) |
MIN(f) | min(f) |
MAX(f) | max(f) |
FIRST(f) | argMin(f, time) |
LAST(f) | argMax(f, time) |
PERCENTILE(f, N) | quantile(N/100.0)(f) |
SPREAD(f) | (max(f) - min(f)) |
STDDEV(f) | stddevPop(f) |
MODE(f) | topKWeighted(1)(f, 1) |
DISTINCT(f) | DISTINCT f |
Transform function translation¶
Transforms use ClickHouse window functions:
| InfluxQL | ClickHouse |
|---|---|
DERIVATIVE(f, 1s) | (f - lagInFrame(f, 1) OVER (ORDER BY __time)) / nullIf(dateDiff('second', lagInFrame(__time, 1) OVER ..., __time), 0) * scale |
NON_NEGATIVE_DERIVATIVE(...) | Same as above, wrapped in greatest(..., 0) |
DIFFERENCE(f) | f - lagInFrame(f, 1) OVER (ORDER BY __time) |
MOVING_AVERAGE(f, N) | avg(f) OVER (ORDER BY __time ROWS BETWEEN N-1 PRECEDING AND CURRENT ROW) |
CUMULATIVE_SUM(f) | sum(f) OVER (ORDER BY __time ROWS UNBOUNDED PRECEDING) |
ELAPSED(f, unit) | dateDiff('unit', lagInFrame(__time, 1) OVER ..., __time) |
Time bucket translation¶
-- InfluxQL: GROUP BY time(5m)
-- ClickHouse:
toStartOfInterval(time, INTERVAL 5 MINUTE) AS __time
-- InfluxQL: GROUP BY time(1h, 15m) -- offset
-- ClickHouse:
toStartOfInterval(time - INTERVAL 15 MINUTE, INTERVAL 1 HOUR) + INTERVAL 15 MINUTE AS __time
The internal alias __time avoids collision with the raw time column. It's renamed back to time in the result parser.
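The bucketing that toStartOfInterval performs (with the offset trick from the second example) reduces to floor arithmetic. A sketch in nanoseconds:

```rust
// GROUP BY time(interval, offset): shift by the offset, floor to the
// interval, shift back.
fn time_bucket(ts: i64, interval: i64, offset: i64) -> i64 {
    (ts - offset).div_euclid(interval) * interval + offset
}

fn main() {
    let m = 60 * 1_000_000_000i64; // one minute in ns
    // time(5m): 12:07 floors to 12:05.
    assert_eq!(time_bucket(727 * m, 5 * m, 0), 725 * m);
    // time(1h, 15m): 12:07 floors to 11:15 (hour buckets anchored at :15).
    assert_eq!(time_bucket(727 * m, 60 * m, 15 * m), 675 * m);
}
```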
Fill translation¶
| Fill mode | ClickHouse |
|---|---|
fill(null) | ORDER BY __time WITH FILL FROM ... TO ... STEP INTERVAL ... |
fill(none) | No WITH FILL |
fill(0) | ifNull(agg, 0) + WITH FILL |
fill(previous) | WITH FILL + INTERPOLATE (col AS col) |
fill(linear) | WITH FILL + INTERPOLATE (col AS col USING LINEAR) |
14. Retention Enforcement¶
The RetentionService runs every 60 seconds:
- Lists all databases from metadata.
- For each database, iterates retention policies.
- For RPs with a finite duration, calculates the cutoff time: now - duration.
- Queries the Parquet file registry for files with max_time < cutoff.
- Deletes those files from storage and removes their metadata entries.
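The file-selection rule is worth making explicit: a file is expired only when its entire time range is past the cutoff. A sketch (the FileEntry shape is illustrative; the registry stores path/min_time/max_time as shown in section 8):

```rust
// Select registry entries whose whole range predates now - duration.
struct FileEntry {
    path: &'static str,
    max_time: i64, // ns since epoch
}

fn expired<'a>(files: &'a [FileEntry], now: i64, duration: i64) -> Vec<&'a str> {
    let cutoff = now - duration;
    files
        .iter()
        .filter(|f| f.max_time < cutoff) // partially-covered files survive
        .map(|f| f.path)
        .collect()
}

fn main() {
    let day = 86_400 * 1_000_000_000i64;
    let now = 100 * day;
    let files = [
        FileEntry { path: "old.parquet", max_time: 90 * day },
        FileEntry { path: "new.parquet", max_time: 99 * day },
    ];
    // 7-day retention: only the file wholly older than now - 7d is deleted.
    assert_eq!(expired(&files, now, 7 * day), vec!["old.parquet"]);
}
```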
Data in the WAL that has not yet been flushed is not affected by retention. It will be filtered at flush time if the timestamps fall outside the retention window.
15. DELETE and Tombstones¶
DELETE uses a tombstone-based approach:
On DELETE execution¶
- Parse the DELETE statement to extract measurement name and WHERE predicate.
- Convert the WHERE predicate to a ClickHouse SQL fragment.
- Store a tombstone record in metadata under the tombstone:{db}:{meas}:{uuid} key.
- In cluster mode, replicate the DELETE mutation to all peers.
On query execution¶
Before translating a SELECT query, the query service loads all tombstones for the measurement and passes them to the translator. The translator appends AND NOT (predicate) for each tombstone to the WHERE clause.
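The injection itself is plain string composition on the generated WHERE clause. A sketch (the base predicate here is an arbitrary example):

```rust
// Append each tombstone predicate as AND NOT (predicate), so deleted rows
// are filtered at query time without rewriting Parquet.
fn apply_tombstones(where_sql: &str, tombstones: &[&str]) -> String {
    let mut sql = where_sql.to_string();
    for predicate in tombstones {
        sql.push_str(&format!(" AND NOT ({predicate})"));
    }
    sql
}

fn main() {
    let sql = apply_tombstones(
        "time > now64() - INTERVAL 1 HOUR",
        &["host = 'server01'", "region = 'eu'"],
    );
    assert_eq!(
        sql,
        "time > now64() - INTERVAL 1 HOUR AND NOT (host = 'server01') AND NOT (region = 'eu')"
    );
}
```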
On compaction¶
Tombstones are not applied during compaction. The compaction service merges Parquet files as-is without filtering rows. Tombstones remain in metadata and continue to be applied at query time until they are manually removed. This design keeps compaction simple and idempotent.
16. Continuous Queries¶
Storage¶
CQ definitions are stored in metadata under cq:{db}:{name} keys:
struct ContinuousQueryDef {
name: String,
database: String,
query_text: String, // The full SELECT ... INTO ... statement
resample_every_secs: u64, // Execution interval
resample_for_secs: u64, // Look-back window
created_at: String, // RFC3339 timestamp
}
Execution¶
The ContinuousQueryService runs a loop with a 10-second tick:
- Load all CQ definitions from metadata across all databases.
- For each CQ, check if resample_every_secs has elapsed since the last execution.
- If due, execute the query via the QueryService.
- The SELECT ... INTO ... clause in the query writes results to the target measurement.
Cluster behavior¶
CQ create/drop operations are replicated to all peers. Each node independently runs its own CQ scheduler. This means CQs may execute on multiple nodes simultaneously, which is safe because writes are idempotent (same timestamps + tags = same series).
17. Authentication Internals¶
Password storage¶
Passwords are hashed using Argon2 with random salts via SaltString::generate(OsRng). The resulting hash string (in PHC format) is stored in the metadata store under user:{username}.
Credential extraction order¶
The auth middleware checks three sources in order:
- Query parameters: u and p
- HTTP Basic auth: Authorization: Basic <base64(user:pass)>
- Token auth: Authorization: Token user:pass
The first match wins. If none match and auth is enabled, the request is rejected with 401.
Base64 decoding¶
A minimal hand-rolled Base64 decoder is used (no external dependency) for parsing Basic auth headers.
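A decoder in the same spirit fits in a few dozen lines. This sketch handles the standard alphabet with '=' padding; it is illustrative, not the project's actual decoder.

```rust
// Minimal Base64 decoder (standard alphabet, '=' padding).
fn b64_val(c: u8) -> Option<u32> {
    match c {
        b'A'..=b'Z' => Some((c - b'A') as u32),
        b'a'..=b'z' => Some((c - b'a') as u32 + 26),
        b'0'..=b'9' => Some((c - b'0') as u32 + 52),
        b'+' => Some(62),
        b'/' => Some(63),
        _ => None,
    }
}

fn b64_decode(input: &str) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    let mut buf: u32 = 0;
    let mut bits = 0;
    for &c in input.as_bytes() {
        if c == b'=' {
            break; // padding: remaining partial bits are discarded
        }
        // Accumulate 6 bits per character; mask keeps buf small.
        buf = ((buf << 6) | b64_val(c)?) & 0xFFFFFF;
        bits += 6;
        if bits >= 8 {
            bits -= 8;
            out.push((buf >> bits) as u8);
        }
    }
    Some(out)
}

fn main() {
    // "user:pass" as it would arrive in an Authorization: Basic header.
    let decoded = b64_decode("dXNlcjpwYXNz").unwrap();
    assert_eq!(decoded, b"user:pass");
    assert!(b64_decode("not base64!").is_none());
}
```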
Verification¶
Uses the default Argon2id variant with parameters from the stored hash.
18. Clustering and Replication¶
Model¶
HyperbyteDB uses master-master (peer-to-peer) replication for data writes, with Raft consensus (via openraft) for schema mutations. Every node accepts reads and writes. Data writes are replicated asynchronously to all peers. Schema-mutating operations (CREATE/DROP DATABASE, DELETE, user/CQ/RP management) are routed through Raft to ensure consistent ordering across the cluster. For a comprehensive treatment, see Deep Dive: Clustering.
Replicated operations¶
| Operation | Endpoint | Replication target |
|---|---|---|
| Write (line protocol) | /internal/replicate | All peers |
| CREATE DATABASE | /internal/replicate-mutation | All peers |
| DROP DATABASE | /internal/replicate-mutation | All peers |
| DELETE | /internal/replicate-mutation | All peers |
| CREATE USER | /internal/replicate-mutation | All peers |
| DROP USER | /internal/replicate-mutation | All peers |
| CREATE CONTINUOUS QUERY | /internal/replicate-mutation | All peers |
| DROP CONTINUOUS QUERY | /internal/replicate-mutation | All peers |
| CREATE RETENTION POLICY | /internal/replicate-mutation | All peers |
Replication protocol¶
- Client writes to node A.
- Node A persists locally (WAL + metadata).
- Node A returns 204 to the client.
- Node A spawns an async task that POSTs to each peer's /internal/replicate endpoint.
- Each POST includes an X-Hyperbytedb-Replicated: true header.
- The receiving node checks for this header; if present, it persists locally but does not re-replicate (preventing loops).
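The loop-prevention check on the receiving node can be sketched as follows (a hypothetical, simplified version — the real check lives in the HTTP handler; header names are compared case-insensitively, so they are stored lowercased here):

```rust
use std::collections::HashMap;

/// Header that marks a write as having arrived via replication.
/// HTTP header names are case-insensitive; lowercased for lookup here.
const REPLICATED_HEADER: &str = "x-hyperbytedb-replicated";

/// Hypothetical sketch of the loop-prevention check: a write that arrived
/// via replication is persisted locally but never fanned out again.
fn should_fan_out(headers: &HashMap<String, String>) -> bool {
    headers
        .get(REPLICATED_HEADER)
        .map(|v| v != "true")
        .unwrap_or(true)
}
```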
MutationRequest types¶
enum MutationRequest {
CreateDatabase(String),
DropDatabase(String),
CreateRetentionPolicy { db, rp },
CreateUser { username, password_hash, admin },
DropUser(String),
Delete { database, measurement, predicate_sql },
CreateContinuousQuery { database, name, definition },
DropContinuousQuery { database, name },
}
Failure handling¶
- Replication is fire-and-forget with logging. If a peer is unreachable, the error is logged at WARN level.
- There is no retry queue or WAL replay for failed replications.
- On peer recovery, data can be re-synchronized via backup/restore from a healthy node.
Network requirements¶
- All nodes must be reachable by all other nodes on their cluster_addr and port.
- The peers list should not include the node's own address (filtered at startup).
- HTTP timeout for replication requests: 10 seconds.
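The startup self-address filtering amounts to a one-liner (a hypothetical sketch; the real configuration loading is more involved):

```rust
/// Hypothetical sketch of the startup peer-list filtering: a node must not
/// replicate to itself, so its own cluster_addr is dropped from `peers`.
fn filter_peers(peers: Vec<String>, own_addr: &str) -> Vec<String> {
    peers.into_iter().filter(|p| p != own_addr).collect()
}
```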
19. Background Services¶
HyperbyteDB runs five background services as Tokio tasks:
| Service | Interval | Purpose |
|---|---|---|
| Flush | flush.interval_secs (default 10s) | WAL → Parquet |
| Compaction | compaction.interval_secs (default 30s) | Merge small Parquet files; in cluster mode, same tick may run verified compaction + membership self-repair |
| Retention | 60s (fixed) | Delete expired Parquet files |
| Continuous Query | 10s (fixed) | Execute CQ schedules |
| Cluster Heartbeat | 60s (fixed, cluster mode only) | Log cluster status |
All services listen on a watch::Receiver<bool> for graceful shutdown. On ctrl+c:
- The shutdown signal is sent via watch::channel.
- Each service finishes its current iteration.
- The flush service performs one final flush.
- The main task awaits all service handles.
- Logs "HyperbyteDB shut down cleanly".
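The tick-until-shutdown pattern (including the flush service's final flush) can be sketched with a std-only analogue — the real services select on a tokio watch channel, but mpsc::recv_timeout stands in for the async select here:

```rust
use std::sync::mpsc;
use std::time::Duration;

/// Std-only sketch of the flush service's loop: flush on every tick, and on
/// shutdown perform one final flush before exiting. The real services select
/// on a tokio watch channel; mpsc::recv_timeout stands in for that here.
fn flush_service(shutdown: mpsc::Receiver<()>, tick: Duration) -> u64 {
    let mut flushes = 0u64;
    loop {
        match shutdown.recv_timeout(tick) {
            // Tick elapsed with no shutdown signal: run a periodic flush.
            Err(mpsc::RecvTimeoutError::Timeout) => flushes += 1,
            // Shutdown signalled (or sender dropped): one final flush, exit.
            _ => {
                flushes += 1;
                return flushes;
            }
        }
    }
}
```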
20. Error Handling¶
HyperbytedbError¶
All internal errors are represented by the HyperbytedbError enum:
| Variant | HTTP Status | When |
|---|---|---|
DatabaseNotFound(name) | 404 | Query or write to non-existent DB |
RetentionPolicyNotFound(name) | 404 | Reference to non-existent RP |
FieldTypeConflict{field, measurement, got, expected} | 400 | Write sends wrong type for existing field |
LineProtocolParse{line, reason} | 400 | Malformed line protocol |
QueryParse(msg) | 400 | Invalid InfluxQL syntax |
AuthFailed | 401 | Bad credentials |
DatabaseRequired | 400 | /write without db parameter |
MissingParameter(name) | 400 | /query without q parameter |
CardinalityExceeded{...} | 422 | Tag cardinality limit hit |
QueryTimeout | 408 | Query exceeded query_timeout_secs |
Wal(msg) | 500 | RocksDB WAL error |
Storage(msg) | 500 | File I/O or S3 error |
Chdb(msg) | 500 | chDB execution error |
Metadata(msg) | 500 | RocksDB metadata error |
Internal(msg) | 500 | Serialization or other internal error |
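The variant-to-status mapping above can be sketched as a plain match (an illustrative subset — the real enum carries more variants and structured fields):

```rust
/// Illustrative subset of the HyperbytedbError -> HTTP status mapping from
/// the table above (the real enum has more variants and structured fields).
enum HyperbytedbError {
    DatabaseNotFound(String),
    AuthFailed,
    QueryTimeout,
    Wal(String),
}

fn status_code(err: &HyperbytedbError) -> u16 {
    match err {
        HyperbytedbError::DatabaseNotFound(_) => 404,
        HyperbytedbError::AuthFailed => 401,
        HyperbytedbError::QueryTimeout => 408,
        HyperbytedbError::Wal(_) => 500,
    }
}
```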
Error responses follow the InfluxDB v1 format: a JSON body of the form {"error": "<message>"} with the status code from the table above.
21. Observability¶
Metrics¶
Uses the metrics crate with metrics-exporter-prometheus:
| Metric | Type | Labels | Description |
|---|---|---|---|
hyperbytedb_write_requests_total | counter | — | Write requests received |
hyperbytedb_query_requests_total | counter | — | Query requests received |
hyperbytedb_query_errors_total | counter | — | Failed queries |
hyperbytedb_query_duration_seconds | histogram | — | Query latency distribution |
hyperbytedb_ingestion_points_total | counter | — | Points ingested |
hyperbytedb_flush_duration_seconds | histogram | — | Flush cycle duration |
hyperbytedb_flush_points_total | counter | — | Points flushed to Parquet |
Tracing¶
Uses tracing + tracing-subscriber with configurable filter levels via the RUST_LOG environment variable or the [logging] config section.
Structured JSON logging is available with format = "json".
Health endpoint¶
GET /health always returns 200 as long as the HTTP server is running; it does not probe downstream storage.
22. Concurrency Model¶
HyperbyteDB is built on Tokio with a multi-threaded runtime (#[tokio::main] with features = ["full"]).
Thread usage¶
| Work | Thread type | Notes |
|---|---|---|
| HTTP request handling | Tokio async workers | Non-blocking |
| InfluxQL parsing | Tokio async workers | CPU-bound but fast |
| WAL operations | Tokio async workers | RocksDB ops are synchronous but fast |
| chDB query execution | spawn_blocking pool | chDB Session is synchronous |
| Parquet conversion | spawn_blocking pool | CPU-intensive Arrow operations |
| Parquet I/O | tokio::spawn async tasks | Concurrent file writes |
| Peer replication | tokio::spawn async tasks | Non-blocking HTTP POSTs |
Synchronization¶
| Resource | Mechanism |
|---|---|
| WAL sequence number | AtomicU64 (lock-free) |
| chDB Session | tokio::sync::Mutex (one per session) |
| Last flushed sequence | tokio::sync::Mutex<u64> |
| Shutdown signal | tokio::sync::watch channel |
23. Dependencies¶
Core runtime¶
| Crate | Version | Purpose |
|---|---|---|
tokio | 1.x | Async runtime |
axum | 0.8 | HTTP framework |
axum-server | 0.7 | TLS support |
tower / tower-http | 0.5 / 0.6 | Middleware (tracing, CORS, timeout) |
hyper | 1.x | HTTP transport |
Storage¶
| Crate | Version | Purpose |
|---|---|---|
rocksdb | 0.22 | WAL and metadata store |
arrow | 54 | In-memory columnar format |
parquet | 54 | Columnar file format |
object_store | 0.12 | Local + S3 filesystem abstraction |
chdb-rust | 1.3 | Embedded ClickHouse query engine |
Serialization¶
| Crate | Version | Purpose |
|---|---|---|
serde / serde_json | 1.x | JSON serialization |
bincode | 1.x | Binary WAL entry serialization |
serde_urlencoded | 0.7 | Form-encoded POST body parsing |
Parsing and protocol¶
| Crate | Version | Purpose |
|---|---|---|
influxdb-line-protocol | 2.x | Line protocol parsing |
regex | 1.x | Regular expression support |
Configuration¶
| Crate | Version | Purpose |
|---|---|---|
figment | 0.10 | Config from TOML + env vars |
clap | 4.x | CLI argument parsing |
Observability¶
| Crate | Version | Purpose |
|---|---|---|
tracing / tracing-subscriber | 0.1 / 0.3 | Structured logging |
metrics / metrics-exporter-prometheus | 0.24 / 0.16 | Prometheus metrics |
Auth and crypto¶
| Crate | Version | Purpose |
|---|---|---|
argon2 | 0.5 | Password hashing |
rand_core | 0.6 | Cryptographic RNG for salt generation |
Utilities¶
| Crate | Version | Purpose |
|---|---|---|
chrono | 0.4 | Date/time handling |
uuid | 1.x | Request IDs, Parquet file suffixes |
bytes | 1.x | Zero-copy byte buffers |
futures | 0.3 | Async stream utilities |
async-trait | 0.1 | Async trait methods |
thiserror | 2.x | Error derive macros |
anyhow | 1.x | Top-level error handling |
flate2 | 1.x | Gzip decompression |
reqwest | 0.12 | HTTP client for peer replication |
openraft | 0.10 | Raft consensus for schema mutations |
indexmap | 2.x | Insertion-ordered maps |
crc32fast | 1.x | CRC32 checksums for Parquet files |
sha2 | 0.10 | SHA-256 hashing for Merkle trees |
24. Statement Summary¶
The StatementSummary service tracks recently executed InfluxQL statements for debugging and observability. When enabled (statement_summary.enabled = true), it records the normalized query text, digest, execution time, and error status for up to max_entries (default 1,000) recent statements in a bounded ring buffer. Results are exposed via GET /api/v1/statements.
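The bounded ring buffer can be sketched with a VecDeque (field and type names here are assumptions based on the description, not the actual implementation):

```rust
use std::collections::VecDeque;

/// Hypothetical sketch of the bounded ring buffer behind StatementSummary:
/// once max_entries is reached, the oldest entry is evicted. Field names
/// are assumptions based on the description above.
struct StatementEntry {
    normalized_text: String,
    duration_ms: u64,
    is_error: bool,
}

struct StatementSummary {
    max_entries: usize,
    entries: VecDeque<StatementEntry>,
}

impl StatementSummary {
    fn record(&mut self, entry: StatementEntry) {
        if self.entries.len() == self.max_entries {
            // Buffer full: evict the oldest recorded statement.
            self.entries.pop_front();
        }
        self.entries.push_back(entry);
    }
}
```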
25. Debug Binary¶
The hyperbytedb-debug binary (src/bin/hyperbytedb_debug.rs) provides a CLI for inspecting cluster state without connecting to the HTTP API. Subcommands include cluster status, topology, health checks, replication state, Raft status, manifest inspection, and metrics.
26. Kubernetes Operator¶
The hyperbytedb-operator/ directory contains a Go-based Kubernetes operator built with Kubebuilder. It defines a HyperbytedbCluster CRD for declarative multi-node cluster management, handling StatefulSet creation, peer configuration, and rolling updates.
27. Deterministic Simulation Testing (HyperSim)¶
HyperbyteDB is tested with a TigerBeetle-style Deterministic Simulation Testing framework called HyperSim, living in the sibling crate hypersim/. HyperSim runs the real HyperbyteDB domain code (write path, WAL sequencing, compaction, retention, query routing) inside a single-threaded, tick-driven simulator with the network, storage, clock, and process lifecycle replaced by deterministic in-process models.
The port-adapter boundary is the simulation seam¶
The hexagonal architecture described in § 2 is what makes simulation viable. HyperSim binds alternative adapters to the same port traits:
| Port | Production adapter | Simulation adapter (#[cfg(feature = "sim")]) |
|---|---|---|
WalPort | RocksDbWal | MemoryWal (BTreeMap-backed) |
StoragePort | Parquet + object_store | MemoryStorage (BTreeMap<String, Bytes>) |
HttpTransport | ReqwestTransport | SimTransport (routes through SimNetwork) |
| Wall clock | SystemTime | VirtualClock |
RocksDB cannot run inside a deterministic simulator — its background threads, wall-clock rate limiters, and direct syscalls break every DST guarantee. The in-memory adapters are not test mocks; they are first-class adapters whose role is "drive the simulator." This mirrors exactly how TigerBeetle's VOPR replaces the storage engine at the port boundary.
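The port/adapter seam can be sketched as follows (signatures simplified and synchronous for illustration — the real WalPort is async and considerably richer):

```rust
use std::collections::BTreeMap;

/// Illustrative sketch of the port/adapter seam. Signatures are simplified
/// and synchronous; the real WalPort is async and richer. The simulator
/// binds MemoryWal to the same trait the RocksDB adapter implements.
trait WalPort {
    fn append(&mut self, seq: u64, entry: Vec<u8>);
    fn read(&self, seq: u64) -> Option<&[u8]>;
}

/// Deterministic in-memory adapter, sequence-ordered like the real WAL.
struct MemoryWal {
    entries: BTreeMap<u64, Vec<u8>>,
}

impl WalPort for MemoryWal {
    fn append(&mut self, seq: u64, entry: Vec<u8>) {
        self.entries.insert(seq, entry);
    }
    fn read(&self, seq: u64) -> Option<&[u8]> {
        self.entries.get(&seq).map(|v| v.as_slice())
    }
}
```

Because both adapter families implement the same trait, domain code compiled for the simulator is byte-identical to production code above the port boundary.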
Port-contract tests pin the two adapter families together¶
hyperbytedb/tests/wal_port_contract.rs runs an arbitrary seeded op stream against both MemoryWal and RocksDbWal and asserts byte-for-byte output equivalence after every operation. This ensures HyperSim findings transfer to the production binary — any divergence caught here is a bug in one adapter or the other. A StoragePort contract test is planned as a follow-up.
CI integration¶
On every PR, CI runs 100 deterministic simulations (1 virtual hour each, ~5 minutes wall time on 8 cores), alternating safety-mode and liveness-mode fault profiles. Every run is reproducible from a single seed; failure traces are uploaded as GitHub artifacts for offline replay with hypersim replay.
See ../../../hypersim/README.md for the full architecture, fault model, CLI reference, and determinism checklist for contributors.
Deep Dive Documents¶
For detailed technical documentation on specific subsystems, see:
- Deep Dive: Write Path -- line protocol ingestion through Parquet file creation
- Deep Dive: Read Path -- InfluxQL parsing, ClickHouse SQL translation, and query execution
- Deep Dive: Compaction -- multi-file Parquet merge, schema evolution, orphan cleanup, and cluster verification hook
- Deep Dive: Self-Repair -- bucket hashing, sync manifest provenance, verified compaction, membership-driven slice repair
- Deep Dive: Clustering -- Raft consensus, replication, anti-entropy, and graceful drain
- Developer guide -- contributing, building, testing, and extending HyperbyteDB