Deep Dive: Write Path¶
This document traces the HyperbyteDB write path from HTTP request to durable storage in embedded chDB MergeTree tables. It covers line protocol ingestion, WAL durability, the background flush pipeline, native table writes, and cluster replication.
Table of Contents¶
- Overview
- HTTP Write Handler
- Ingestion Service
- WAL Append
- Flush Pipeline
- chDB Native Sink
- Cluster Write Replication
- Metrics
1. Overview¶
The write path has two phases:
Phase 1 — Synchronous (client-blocking): HTTP handling, line protocol parsing, metadata registration, and WAL append. The client receives 204 No Content once the WAL append completes. Data is durable at this point.
Phase 2 — Asynchronous (background): The flush service reads WAL entries on a timer, groups them by measurement, INSERTs batches into chDB MergeTree tables, and truncates the WAL when safe.
Client POST /write
|
v
+-------------------------------+
| HTTP Handler (write.rs) |
| - Validate params |
| - Decompress gzip |
| - Cluster state check |
+-------------------------------+
|
v
+-------------------------------+
| IngestionService |
| - Parse line protocol |
| - Register metadata |
| - Append to WAL |
+-------------------------------+
|
v (204 returned to client)
|
v (background, every flush.interval_secs)
+-------------------------------+
| FlushService |
| - Read WAL entries |
| - Group by (db, rp, meas) |
| - INSERT via PointsSinkPort |
| - Truncate WAL |
+-------------------------------+
|
v
+-------------------------------+
| ChdbNativeAdapter |
| ReplacingMergeTree tables |
| under chdb.session_data_path |
+-------------------------------+
Key source files: adapters/http/write.rs, application/ingestion_service.rs, application/flush_service.rs, adapters/chdb/native_adapter.rs.
2. HTTP Write Handler¶
File: src/adapters/http/write.rs
Entry point¶
POST /write?db=mydb&rp=autogen&precision=ns
Steps¶
- Validate parameters —
dbis required. Optionalrp(defaults to the database default retention policy) andprecision(nanoseconds by default). - Decompress body — Supports gzip-compressed payloads (
Content-Encoding: gzip). - Cluster gate — In cluster mode, rejects writes when the node is draining or not accepting traffic.
- Delegate to ingestion — Calls
IngestionPort::ingest()with the raw body. In cluster mode,PeerIngestionServicewraps the base service.
Response¶
Returns 204 No Content on success. Errors follow InfluxDB v1 JSON format ({"error": "..."}).
3. Ingestion Service¶
File: src/application/ingestion_service.rs
Parse formats¶
| Format | Parser |
|---|---|
| Line protocol (default) | parse_line_body_to_points() via influxdb-line-protocol |
| MessagePack | parse_msgpack_body_to_points() |
Columnar MessagePack (columnar-ingest feature) | Fast path: metadata from wire batch, then WAL serialization |
Metadata registration¶
Before WAL append, prepare_batch_metadata() (or prepare_columnar_metadata() for columnar ingest):
- Verifies the database exists.
- Registers field types and tag keys for each measurement.
- Enforces cardinality limits (
max_tag_values_per_measurement,max_measurements_per_database). - Uses an in-memory schema cache to avoid redundant metadata reads.
Field types are enforced on subsequent writes — a type conflict returns HTTP 400.
WAL entry construction¶
WalEntry {
database: db.to_string(),
retention_policy: retention_policy.clone(),
points,
origin_node_id: 0, // set by replication apply path on peers
}
4. WAL Append¶
Port: WalPort
Adapter: RocksDbWal (adapters/rocksdb/wal.rs)
Structure¶
| Column Family | Purpose |
|---|---|
wal | Ordered entries keyed by big-endian u64 sequence |
wal_meta | last_seq counter |
Entries are bincode-serialized WalEntry values. Sequence numbers use big-endian encoding so RocksDB lexicographic order matches numeric order.
Operations used by the write path¶
| Operation | Caller | Purpose |
|---|---|---|
append(entry) | IngestionService | Durably store incoming points |
read_range(start, count) | FlushService | Read up to 5,000 entries per chunk |
truncate_before(seq) | FlushService | Remove flushed entries |
last_sequence() | FlushService | Snapshot upper bound for a flush tick |
The WAL provides crash-safe durability between client acknowledgment and chDB INSERT.
5. Flush Pipeline¶
File: src/application/flush_service.rs
Port: FlushPort (used by cluster drain)
Timer¶
Runs every flush.interval_secs (default 10s) as a Tokio background task. Also listens on a shutdown watch channel for graceful stop.
Flush cycle¶
- Snapshot — Read
last_sequence(); skip if nothing new sincelast_flushed. - Read chunk —
read_range(cursor + 1, 5000)up to the snapshot sequence. - Group — Bucket points by
(database, retention_policy, measurement, origin_node_id). - Sub-batch — Split large groups by
max_points_per_batch(10k–500k; auto-detected from available memory when config is 0). - Write — Spawn parallel tasks calling
PointsSinkPort::write_points()for each sub-batch. - Truncate —
truncate_before(safe_seq + 1)wheresafe_seqrespects cluster replication acks.
Cluster-aware truncation¶
When replication is enabled, truncation waits for peer WAL acks so lagging peers can still catch up:
- Uses
ReplicationLog::min_max_wal_ack_for_peers()across active peers. - If some peers have acked and others are still at 0, holds the WAL (returns safe seq 0).
- If all peers are at ack 0, applies
MAX_WAL_RETENTION_ENTRIES(500k) as a safety valve. - Pure replica nodes (no locally originated writes) skip the ack barrier.
- Stale peers (configurable heartbeat policy) can be excluded from the barrier.
Drain¶
FlushPort::drain() loops flush until the WAL is empty — used during graceful cluster drain and shutdown.
6. chDB Native Sink¶
Adapter: ChdbNativeAdapter (adapters/chdb/native_adapter.rs)
Port: PointsSinkPort
Table naming¶
Each (database, retention_policy, measurement) maps to one physical table. Names are sanitised via domain/chdb_naming (for example mydb_autogen_cpu).
Schema management¶
On flush, the adapter:
- Loads
MeasurementMetafrom the metadata store. - Creates or alters the MergeTree table to match registered tag and field columns.
- INSERTs the batch with
ReplacingMergeTreeordering on(time, tag columns…).
Tables live under chdb.session_data_path (configured in [chdb]).
Query visibility¶
Data becomes queryable after flush INSERT completes. Until then it exists only in the WAL. With the default 10s flush interval, wait briefly after writing before querying.
7. Cluster Write Replication¶
Files: application/peer_ingestion_service.rs, adapters/cluster/peer_client.rs
Port: ReplicationPort
In cluster mode, after the local WAL append succeeds:
PeerIngestionServiceserialises the line protocol body.PeerClientPOSTs to each active peer's/internal/replicateendpoint viaReplicationPort.- Peers apply the write through their own ingestion path with
origin_node_idset to the source node.
Replication is asynchronous — the client receives 204 without waiting for peer confirmation. Failed sends are logged; hinted handoff can queue writes for unreachable peers.
WAL truncation on the origin node coordinates with peer acks (see §5).
For the full replication and sync protocol, see Deep Dive: Clustering.
8. Metrics¶
| Metric | Type | Description |
|---|---|---|
hyperbytedb_ingestion_points_total | counter | Points ingested (label: db) |
hyperbytedb_wal_appends_total | counter | WAL append operations |
hyperbytedb_ingest_parse_seconds | histogram | Line protocol parse time |
hyperbytedb_ingest_metadata_register_seconds | histogram | Metadata registration time |
hyperbytedb_ingest_wal_append_seconds | histogram | WAL append time |
hyperbytedb_flush_points_total | counter | Points flushed to chDB |
hyperbytedb_flush_runs_total | counter | Flush cycles completed |
hyperbytedb_flush_duration_seconds | histogram | Flush cycle duration |
hyperbytedb_flush_errors_total | counter | Flush failures |
hyperbytedb_native_rows_written_total | counter | Rows INSERTed by native adapter |
hyperbytedb_wal_last_sequence | gauge | Last flushed WAL sequence |
Related documents¶
- Architecture — hexagonal overview and sequence diagrams
- Read path — query execution
- Clustering — replication, sync, drain