
Administration

This guide covers operational tasks for running HyperbyteDB in production: monitoring, backup/restore, compaction tuning, cluster management, and the debug CLI.


Monitoring

Prometheus Metrics

HyperbyteDB exposes a Prometheus-compatible metrics endpoint at GET /metrics on the same port as the API (default 8086). There is no separate metrics port.

Key metrics:

| Metric | Type | Description |
| --- | --- | --- |
| hyperbytedb_write_requests_total | counter | Total write requests received |
| hyperbytedb_write_errors_total | counter | Failed write requests |
| hyperbytedb_write_payload_bytes | histogram | Raw payload size in bytes |
| hyperbytedb_write_duration_seconds | histogram | Write handler latency |
| hyperbytedb_query_requests_total | counter | Total query requests received |
| hyperbytedb_query_errors_total | counter | Failed queries |
| hyperbytedb_query_duration_seconds | histogram | Query execution latency |
| hyperbytedb_ingestion_points_total | counter | Total points ingested |
| hyperbytedb_flush_runs_total | counter | Flush cycles completed |
| hyperbytedb_flush_errors_total | counter | Failed flush cycles |
| hyperbytedb_flush_points_total | counter | Points flushed to Parquet |
| hyperbytedb_flush_duration_seconds | histogram | Flush cycle duration |
| hyperbytedb_parquet_bytes_written_total | counter | Total Parquet bytes written |
| hyperbytedb_parquet_files_written_total | counter | Total Parquet files written |
| hyperbytedb_parquet_files_count | gauge | Current file count per measurement |
| hyperbytedb_wal_last_sequence | gauge | Last flushed WAL sequence |
| hyperbytedb_compaction_runs_total | counter | Compaction cycles completed |
| hyperbytedb_compaction_duration_seconds | histogram | Compaction cycle duration |
| hyperbytedb_compaction_files_merged_total | counter | Input files consumed by compaction |

Cluster-specific metrics:

| Metric | Type | Description |
| --- | --- | --- |
| hyperbytedb_replication_writes_total | counter | Write replication attempts |
| hyperbytedb_replication_errors_total | counter | Failed write replications |
| hyperbytedb_replication_duration_seconds | histogram | Replication latency |
| hyperbytedb_cluster_node_state | gauge | Node state (0=Joining through 5=Leaving) |
| hyperbytedb_cluster_peers_active | gauge | Number of active peers |
| hyperbytedb_uptime_seconds | gauge | Node uptime |
| hyperbytedb_self_repair_hash_mismatch_total | counter | Bucket hash mismatches detected |
| hyperbytedb_self_repair_origin_fetch_total | counter | Successful slice repairs |

Prometheus scrape configuration

scrape_configs:
  - job_name: 'hyperbytedb'
    static_configs:
      - targets: ['hyperbytedb:8086']
    metrics_path: /metrics
    scrape_interval: 15s

For clusters, scrape each node individually.
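Building on the counters above, alerting rules can flag sustained flush or write failures. The rule names and thresholds below are illustrative starting points, not rules shipped with HyperbyteDB:

```yaml
groups:
  - name: hyperbytedb  # illustrative group name
    rules:
      - alert: HyperbyteDBFlushErrors
        expr: rate(hyperbytedb_flush_errors_total[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Flush cycles are failing on {{ $labels.instance }}"
      - alert: HyperbyteDBWriteErrorRatio
        # More than 1% of write requests failing over 5 minutes
        expr: >
          rate(hyperbytedb_write_errors_total[5m])
            / rate(hyperbytedb_write_requests_total[5m]) > 0.01
        for: 10m
        labels:
          severity: critical
```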

Logging

Logs are written to stderr. Control verbosity with the [logging] config section:

| Level | Use case |
| --- | --- |
| error | Production: errors only |
| warn | Production: errors + warnings |
| info | Default: startup, shutdown, periodic summaries |
| debug | Development: query details, flush activity |
| trace | Deep debugging: all internal operations |

Set format = "json" for structured output compatible with log aggregation tools (Loki, Elasticsearch, etc.).
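A minimal [logging] section might look like the sketch below. The format key is described above; the level key name is an assumption, so check your configuration reference:

```toml
[logging]
level = "info"    # error | warn | info | debug | trace (key name assumed)
format = "json"   # structured output for Loki, Elasticsearch, etc.
```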

Health endpoint

GET /health returns:

{"status": "pass", "message": "ready for queries and writes"}

The endpoint returns 200 as long as the HTTP server is running. In cluster mode, a node in the Draining or Leaving state still responds to /health but rejects writes.
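Because /health returns 200 whenever the HTTP server is up, it is best suited to a liveness check rather than a readiness check. An illustrative Kubernetes probe:

```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8086
  periodSeconds: 10
  failureThreshold: 3
```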


Backup and Restore

Create a backup

hyperbytedb backup --output /backups/hyperbytedb-$(date +%Y%m%d)

The backup directory contains:

| Directory | Contents |
| --- | --- |
| wal/ | RocksDB checkpoint of the WAL |
| meta/ | RocksDB checkpoint of metadata |
| data/ | Copy of all Parquet data files |
| manifest.json | Timestamp, WAL sequence, file list |

Backups can run while HyperbyteDB is serving traffic. RocksDB checkpoints are consistent point-in-time snapshots.
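Since backups are safe under live traffic, they are commonly scheduled. A crontab fragment for a nightly 02:00 backup with a 14-day retention sweep might look like this (paths and retention are illustrative; note that % must be escaped in crontab):

```cron
0 2 * * * hyperbytedb backup --output /backups/hyperbytedb-$(date +\%Y\%m\%d) && find /backups -maxdepth 1 -name 'hyperbytedb-*' -mtime +14 -exec rm -rf {} +
```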

Restore

# 1. Stop HyperbyteDB
# 2. Restore (overwrites configured directories)
hyperbytedb restore --input /backups/hyperbytedb-20240115
# 3. Start HyperbyteDB
hyperbytedb serve

Restore overwrites the configured data_dir, wal_dir, and meta_dir.


Compaction Tuning

Compaction merges small Parquet files into larger ones, improving query performance and reducing file count.

Key parameters

| Parameter | Default | Tuning guidance |
| --- | --- | --- |
| interval_secs | 30 | Lower for write-heavy workloads (more frequent merges); higher to reduce CPU usage |
| min_files_to_compact | 2 | Higher values delay compaction, allowing more files to accumulate for bigger merges |
| target_file_size_mb | 256 | Larger files improve scan performance but increase memory usage during compaction |
| bucket_duration | "1h" | Set to "1d" for wide time-range query workloads to reduce file count |

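Pulled together, a [compaction] section tuned for a write-heavy workload could look like this sketch (values are illustrative, not recommendations):

```toml
[compaction]
interval_secs = 15          # merge more frequently under heavy ingest
min_files_to_compact = 4    # let a few files accumulate per merge
target_file_size_mb = 512   # larger files favor scan performance
bucket_duration = "1d"      # wider buckets for long time-range queries
```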
Trigger on-demand compaction

curl -sS -XPOST 'http://localhost:8086/internal/compact'

This runs compact_all(), which compacts every measurement regardless of min_files_to_compact.

Monitor compaction

Watch hyperbytedb_parquet_files_count per measurement. After compaction, file counts should decrease. If they keep growing, check:

- hyperbytedb_compaction_errors_total for failures
- Logs for compaction error details
- Whether min_files_to_compact is set too high
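Useful PromQL starting points for a compaction dashboard (query shapes are illustrative):

```promql
# Current Parquet file count per measurement
hyperbytedb_parquet_files_count

# Compaction throughput: input files merged per second
rate(hyperbytedb_compaction_files_merged_total[5m])

# p95 compaction cycle duration
histogram_quantile(0.95, rate(hyperbytedb_compaction_duration_seconds_bucket[5m]))
```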


Cluster Operations

Debug CLI

The hyperbytedb-debug binary queries a running cluster over HTTP for on-call inspection (membership, replication lag, manifests, compaction). Source: src/bin/hyperbytedb_debug.rs. Build: cargo build --release --bin hyperbytedb-debug.

Global options:

| Option | Long | Default | Description |
| --- | --- | --- | --- |
| -n | --nodes | (required) | Comma-separated addresses, e.g. host1:8086,host2:8086 |
|  | --timeout | 5 | HTTP timeout in seconds |
|  | --scheme | http | http or https for all node URLs |
|  | --lag-warn | 100 | Replication: nonzero lag at or above this is flagged yellow |
|  | --lag-critical | 1000 | Replication: lag at or above this is flagged red (clamped to at least --lag-warn) |

--nodes may also be set via the HYPERBYTEDB_NODES environment variable.

hyperbytedb-debug \
  -n "10.0.0.1:8086,10.0.0.2:8086,10.0.0.3:8086" \
  --scheme http --timeout 10 \
  status
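The --lag-warn / --lag-critical thresholds above amount to the following classification. This is a sketch of the documented behavior, not the CLI's actual code:

```python
def classify_lag(lag: int, warn: int = 100, critical: int = 1000) -> str:
    """Color a replication lag value using the debug CLI's documented
    thresholds: zero lag is green, nonzero lag at or above --lag-warn
    is yellow, and lag at or above --lag-critical is red. The critical
    threshold is clamped so it is never below the warn threshold."""
    critical = max(critical, warn)
    if lag >= critical:
        return "red"
    if 0 < lag and lag >= warn:
        return "yellow"
    return "green"

print(classify_lag(0))      # green
print(classify_lag(250))    # yellow
print(classify_lag(5000))   # red
```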

Subcommands:

| Command | Description |
| --- | --- |
| status | Cluster status from every node (/cluster/metrics): node id, state, membership version, peer counts |
| topology | Membership as reported by each node |
| health | Liveness/health and latency per node |
| replication | WAL positions, ack lag, mutation log summary per node |
| raft | Raft consensus state (when enabled) |
| manifest | Data manifest from one node (see options below) |
| metrics | Prometheus text from all nodes, filtered (default hyperbytedb_cluster) |
| diff | Compare membership views across nodes |
| compact | Trigger aggressive compaction via POST /internal/compact (see below) |

manifest: --node <addr> chooses the node (default: first in --nodes).

metrics: --filter <substring> (default hyperbytedb_cluster).

compact: --node <addr> compacts only that node; --all runs against every address in --nodes; --compact-timeout (default 300 seconds) caps per-node wait.

hyperbytedb-debug -n "127.0.0.1:8086" compact
hyperbytedb-debug -n "a:8086,b:8086,c:8086" compact --node b:8086
hyperbytedb-debug -n "a:8086,b:8086,c:8086" compact --all

Graceful drain

To remove a node from the cluster without data loss:

curl -sS -XPOST 'http://node-to-remove:8086/internal/drain'

The drain procedure:

1. Sets node state to Draining (new writes are rejected with 503).
2. Flushes all WAL entries to Parquet.
3. Waits for replication acks from all peers (up to 60 seconds).
4. Verifies data consistency via Merkle tree comparison.
5. Notifies peers of departure.
6. Sets state to Leaving.
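After issuing a drain, you can poll GET /metrics until the hyperbytedb_cluster_node_state gauge reaches Leaving (5). A minimal parser for the gauge, assuming the standard Prometheus text exposition format:

```python
import urllib.request

def node_state(metrics_text: str):
    """Extract the hyperbytedb_cluster_node_state gauge from Prometheus
    text output; per the docs, 0=Joining through 5=Leaving. Returns None
    if the metric is absent."""
    for line in metrics_text.splitlines():
        if line.startswith("hyperbytedb_cluster_node_state"):
            return int(float(line.rsplit(None, 1)[-1]))
    return None

def fetch_state(addr: str):
    """Fetch /metrics from one node and return its state gauge."""
    with urllib.request.urlopen(f"http://{addr}/metrics", timeout=5) as resp:
        return node_state(resp.read().decode())

# Example: poll until the drained node reports Leaving (5):
#   while fetch_state("node-to-remove:8086") != 5: time.sleep(2)
```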

Self-repair

In cluster mode, the compaction service runs automatic Parquet self-repair:

- Verified compaction hashes cold data buckets against authoring peers and replaces divergent files.
- Membership-driven repair discovers missing data slices from peer manifests.

These run on the compaction interval. Tune with verified_compaction_age_secs, self_repair_enabled, and max_repair_checks_per_cycle in the [compaction] config section. See Configuration for details.
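For reference, the self-repair knobs live alongside the other compaction settings. Values are illustrative and key semantics are inferred from the names; see Configuration for the authoritative meanings:

```toml
[compaction]
self_repair_enabled = true
verified_compaction_age_secs = 3600   # only hash-verify buckets older than this (assumed)
max_repair_checks_per_cycle = 16      # bound repair work per compaction cycle (assumed)
```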


Background Services

HyperbyteDB runs several background services as Tokio tasks:

| Service | Interval | Purpose |
| --- | --- | --- |
| Flush | flush.interval_secs (10s) | WAL → Parquet |
| Compaction | compaction.interval_secs (30s) | Merge small Parquet files; cluster verification/repair |
| Retention | 60s (fixed) | Delete expired Parquet files |
| Continuous Query | 10s (fixed) | Execute CQ schedules |
| Heartbeat | heartbeat_interval_secs (2s, cluster) | Peer liveness detection |
| Anti-entropy | anti_entropy_interval_secs (60s, cluster) | Merkle tree data verification |

All services shut down gracefully on Ctrl-C: the flush service performs a final flush, then all service handles are awaited.


See Also