Administration¶
This guide covers operational tasks for running HyperbyteDB in production: monitoring, backup/restore, compaction tuning, cluster management, and the debug CLI.
Monitoring¶
Prometheus Metrics¶
HyperbyteDB exposes a Prometheus-compatible metrics endpoint at GET /metrics on the same port as the API (default 8086). There is no separate metrics port.
Key metrics:
| Metric | Type | Description |
|---|---|---|
| hyperbytedb_write_requests_total | counter | Total write requests received |
| hyperbytedb_write_errors_total | counter | Failed write requests |
| hyperbytedb_write_payload_bytes | histogram | Raw payload size in bytes |
| hyperbytedb_write_duration_seconds | histogram | Write handler latency |
| hyperbytedb_query_requests_total | counter | Total query requests received |
| hyperbytedb_query_errors_total | counter | Failed queries |
| hyperbytedb_query_duration_seconds | histogram | Query execution latency |
| hyperbytedb_ingestion_points_total | counter | Total points ingested |
| hyperbytedb_flush_runs_total | counter | Flush cycles completed |
| hyperbytedb_flush_errors_total | counter | Failed flush cycles |
| hyperbytedb_flush_points_total | counter | Points flushed to Parquet |
| hyperbytedb_flush_duration_seconds | histogram | Flush cycle duration |
| hyperbytedb_parquet_bytes_written_total | counter | Total Parquet bytes written |
| hyperbytedb_parquet_files_written_total | counter | Total Parquet files written |
| hyperbytedb_parquet_files_count | gauge | Current file count per measurement |
| hyperbytedb_wal_last_sequence | gauge | Last flushed WAL sequence |
| hyperbytedb_compaction_runs_total | counter | Compaction cycles completed |
| hyperbytedb_compaction_duration_seconds | histogram | Compaction cycle duration |
| hyperbytedb_compaction_files_merged_total | counter | Input files consumed by compaction |
Cluster-specific metrics:
| Metric | Type | Description |
|---|---|---|
| hyperbytedb_replication_writes_total | counter | Write replication attempts |
| hyperbytedb_replication_errors_total | counter | Failed write replications |
| hyperbytedb_replication_duration_seconds | histogram | Replication latency |
| hyperbytedb_cluster_node_state | gauge | Node state (0=Joining through 5=Leaving) |
| hyperbytedb_cluster_peers_active | gauge | Number of active peers |
| hyperbytedb_uptime_seconds | gauge | Node uptime |
| hyperbytedb_self_repair_hash_mismatch_total | counter | Bucket hash mismatches detected |
| hyperbytedb_self_repair_origin_fetch_total | counter | Successful slice repairs |
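As a sketch of how these counters can be consumed, the snippet below parses Prometheus text format and derives a write error ratio. The sample scrape text is illustrative, not real server output, and the parser handles only plain, unlabeled series:

```python
# Sketch: compute the write error ratio from a /metrics scrape.
# Handles only simple "name value" lines (no labels), which is
# enough for the unlabeled counters listed above.
def parse_counters(text: str) -> dict[str, float]:
    """Parse name/value lines from Prometheus text format."""
    counters = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and # HELP / # TYPE comments
        name, _, value = line.rpartition(" ")
        counters[name] = float(value)
    return counters

# Illustrative scrape fragment:
sample = """
# TYPE hyperbytedb_write_requests_total counter
hyperbytedb_write_requests_total 10000
hyperbytedb_write_errors_total 25
"""

c = parse_counters(sample)
error_ratio = c["hyperbytedb_write_errors_total"] / c["hyperbytedb_write_requests_total"]
print(f"write error ratio: {error_ratio:.4f}")  # 0.0025
```

In practice you would point this at `GET /metrics` on port 8086, or simply let Prometheus scrape and compute `rate()` expressions for you.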
Prometheus scrape configuration¶
```yaml
scrape_configs:
  - job_name: 'hyperbytedb'
    static_configs:
      - targets: ['hyperbytedb:8086']
    metrics_path: /metrics
    scrape_interval: 15s
```
For clusters, scrape each node individually.
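Once the metrics are scraped, alerting rules can be layered on top. The rules below are an illustrative sketch, not recommendations from the HyperbyteDB project; the alert names and thresholds are assumptions you should tune to your workload:

```yaml
groups:
  - name: hyperbytedb
    rules:
      # Any failed flush cycle risks WAL backlog growth.
      - alert: HyperbyteDBFlushFailures
        expr: rate(hyperbytedb_flush_errors_total[5m]) > 0
        for: 10m
        annotations:
          summary: "Flush cycles failing on {{ $labels.instance }}"
      # Unbounded file-count growth usually means compaction is stuck.
      - alert: HyperbyteDBParquetFileGrowth
        expr: hyperbytedb_parquet_files_count > 1000
        for: 30m
        annotations:
          summary: "High Parquet file count; check compaction on {{ $labels.instance }}"
```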
Logging¶
Logs are written to stderr. Control verbosity with the [logging] config section:
| Level | Use case |
|---|---|
| error | Production: errors only |
| warn | Production: errors + warnings |
| info | Default: startup, shutdown, periodic summaries |
| debug | Development: query details, flush activity |
| trace | Deep debugging: all internal operations |
Set format = "json" for structured output compatible with log aggregation tools (Loki, Elasticsearch, etc.).
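Putting the two settings together, a production logging section might look like the following. The file format is assumed to be TOML here, based on the `[logging]` section syntax used throughout these docs:

```toml
[logging]
level = "warn"   # error | warn | info | debug | trace
format = "json"  # structured output for Loki, Elasticsearch, etc.
```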
Health endpoint¶
GET /health returns 200 whenever the HTTP server is running; it is a liveness check, not a readiness check. In cluster mode, a node in the Draining or Leaving state still answers /health with 200 even though it rejects writes.
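Because /health only signals liveness, an orchestrator probe against it detects a dead process but not a draining node. A Kubernetes-style sketch, using the default API port from this guide:

```yaml
# Liveness only: a node that is Draining still passes this probe.
livenessProbe:
  httpGet:
    path: /health
    port: 8086
  periodSeconds: 10
  failureThreshold: 3
```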
Backup and Restore¶
Create a backup¶
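A hypothetical invocation, assuming the backup subcommand mirrors the restore syntax shown later in this section (the `--output` flag name and the dated directory naming are assumptions, not confirmed CLI details):

```shell
# Assumed syntax, mirroring `hyperbytedb restore --input ...`
hyperbytedb backup --output /backups/hyperbytedb-20240115
```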
The backup directory contains:
| Directory | Contents |
|---|---|
| wal/ | RocksDB checkpoint of the WAL |
| meta/ | RocksDB checkpoint of metadata |
| data/ | Copy of all Parquet data files |
| manifest.json | Timestamp, WAL sequence, file list |
Backups can run while HyperbyteDB is serving traffic. RocksDB checkpoints are consistent point-in-time snapshots.
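A quick integrity check is to confirm every file listed in manifest.json is actually present in the backup. The sketch below assumes the file list lives under a `"files"` key; the docs only state that the manifest holds a timestamp, WAL sequence, and file list, so adjust the key name to the real schema:

```python
# Sketch: report files listed in manifest.json that are missing on disk.
# The "files" key name is an assumption about the manifest schema.
import json
from pathlib import Path

def verify_backup(backup_dir: str) -> list[str]:
    """Return relative paths listed in the manifest but absent from the backup."""
    root = Path(backup_dir)
    manifest = json.loads((root / "manifest.json").read_text())
    return [f for f in manifest["files"] if not (root / f).exists()]
```

An empty return value means every listed file is present; anything else names the missing files.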
Restore¶
```shell
# 1. Stop HyperbyteDB
# 2. Restore (overwrites configured directories)
hyperbytedb restore --input /backups/hyperbytedb-20240115
# 3. Start HyperbyteDB
hyperbytedb serve
```
Restore overwrites the configured data_dir, wal_dir, and meta_dir.
Compaction Tuning¶
Compaction merges small Parquet files into larger ones, improving query performance and reducing file count.
Key parameters¶
| Parameter | Default | Tuning guidance |
|---|---|---|
| interval_secs | 30 | Lower for write-heavy workloads (more frequent merges); higher to reduce CPU usage |
| min_files_to_compact | 2 | Higher values delay compaction, allowing more files to accumulate for bigger merges |
| target_file_size_mb | 256 | Larger files improve scan performance but increase memory usage during compaction |
| bucket_duration | "1h" | Set to "1d" for wide time-range query workloads to reduce file count |
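These parameters live in the `[compaction]` config section. Assuming a TOML config file, the defaults from the table would be written as:

```toml
[compaction]
interval_secs = 30
min_files_to_compact = 2
target_file_size_mb = 256
bucket_duration = "1h"  # "1d" reduces file count for wide time-range queries
```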
Trigger on-demand compaction¶
Triggering compaction on demand (for example with the debug CLI's compact subcommand, which posts to /internal/compact) runs compact_all(), compacting every measurement regardless of min_files_to_compact.
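Per the Debug CLI section below, the compact subcommand works through POST /internal/compact, so the endpoint can also be called directly. The localhost address here is illustrative:

```shell
# Direct call to the same endpoint the debug CLI's `compact` uses.
curl -X POST http://localhost:8086/internal/compact
```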
Monitor compaction¶
Watch hyperbytedb_parquet_files_count per measurement. After compaction, file counts should decrease. If they keep growing, check:

- hyperbytedb_compaction_errors_total for failures
- Logs for compaction error details
- Whether min_files_to_compact is set too high
Cluster Operations¶
Debug CLI¶
The hyperbytedb-debug binary queries a running cluster over HTTP for on-call inspection (membership, replication lag, manifests, compaction). Source: src/bin/hyperbytedb_debug.rs. Build: cargo build --release --bin hyperbytedb-debug.
Global options:
| Option | Long | Default | Description |
|---|---|---|---|
| -n | --nodes | (required) | Comma-separated addresses, e.g. host1:8086,host2:8086 |
| | --timeout | 5 | HTTP timeout in seconds |
| | --scheme | http | http or https for all node URLs |
| | --lag-warn | 100 | For replication: lag at or above this (after zero) is yellow |
| | --lag-critical | 1000 | For replication: lag at or above this is red (clamped vs warn) |
--nodes may also be supplied via the HYPERBYTEDB_NODES environment variable.
```shell
hyperbytedb-debug \
  -n "10.0.0.1:8086,10.0.0.2:8086,10.0.0.3:8086" \
  --scheme http --timeout 10 \
  status
```
Subcommands:
| Command | Description |
|---|---|
| status | Cluster status from every node (/cluster/metrics): node id, state, membership version, peer counts |
| topology | Membership as reported by each node |
| health | Liveness/health and latency per node |
| replication | WAL positions, ack lag, mutation log summary per node |
| raft | Raft consensus state (when enabled) |
| manifest | Data manifest from one node (see options below) |
| metrics | Prometheus text from all nodes, filtered (default hyperbytedb_cluster) |
| diff | Compare membership views across nodes |
| compact | Trigger aggressive compaction via POST /internal/compact (see below) |
manifest: --node <addr> chooses the node (default: first in --nodes).
metrics: --filter <substring> (default hyperbytedb_cluster).
compact: --node <addr> compacts only that node; --all runs against every address in --nodes; --compact-timeout (default 300 seconds) caps per-node wait.
```shell
hyperbytedb-debug -n "127.0.0.1:8086" compact
hyperbytedb-debug -n "a:8086,b:8086,c:8086" compact --node b:8086
hyperbytedb-debug -n "a:8086,b:8086,c:8086" compact --all
```
Graceful drain¶
To remove a node from the cluster without data loss:
The drain procedure:

1. Sets node state to Draining (rejects new writes with 503).
2. Flushes all WAL entries to Parquet.
3. Waits for replication acks from all peers (up to 60 seconds).
4. Verifies data consistency via Merkle tree comparison.
5. Notifies peers of departure.
6. Sets state to Leaving.
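Drain progress can be watched through the hyperbytedb_cluster_node_state gauge (0=Joining through 5=Leaving, per the cluster metrics table). A minimal sketch, assuming you fetch the /metrics text yourself:

```python
# Sketch: read the node-state gauge from Prometheus text to watch a drain.
# States per the cluster metrics table: 0=Joining through 5=Leaving.
LEAVING = 5

def node_state(metrics_text: str) -> int:
    """Return the hyperbytedb_cluster_node_state gauge value."""
    for line in metrics_text.splitlines():
        if line.startswith("hyperbytedb_cluster_node_state"):
            return int(float(line.rsplit(" ", 1)[1]))
    raise ValueError("node-state gauge not found in scrape")

# Illustrative scrape fragment, not real server output:
sample = "hyperbytedb_cluster_node_state 5\n"
if node_state(sample) == LEAVING:
    print("drain complete; node is Leaving")
```

In a real script you would poll the draining node's /metrics endpoint on an interval until the gauge reaches 5 before decommissioning the host.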
Self-repair¶
In cluster mode, the compaction service runs automatic Parquet self-repair:

- Verified compaction hashes cold data buckets against authoring peers and replaces divergent files.
- Membership-driven repair discovers missing data slices from peer manifests.
These run on the compaction interval. Tune with verified_compaction_age_secs, self_repair_enabled, and max_repair_checks_per_cycle in the [compaction] config section. See Configuration for details.
Background Services¶
HyperbyteDB runs several background services as Tokio tasks:
| Service | Interval | Purpose |
|---|---|---|
| Flush | flush.interval_secs (10s) | WAL → Parquet |
| Compaction | compaction.interval_secs (30s) | Merge small Parquet files; cluster verification/repair |
| Retention | 60s (fixed) | Delete expired Parquet files |
| Continuous Query | 10s (fixed) | Execute CQ schedules |
| Heartbeat | heartbeat_interval_secs (2s, cluster) | Peer liveness detection |
| Anti-entropy | anti_entropy_interval_secs (60s, cluster) | Merkle tree data verification |
All services shut down gracefully on ctrl+c: the flush service performs a final flush, then all service handles are awaited.
See Also¶
- Configuration — Full reference for all tuning parameters
- Troubleshooting — Diagnosing common issues
- Common workflows — Backup procedures, monitoring setup