Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Retention and Garbage Collection

Overview

Each node bounds disk growth by deleting messages older than a fixed RETENTION_WINDOW. Retention is time-based only -- no read-progress or per-chat policy. The rule is symmetric across all nodes, requires no network coordination, and applies uniformly to every chat and every message type.

Rule

A message is eligible for deletion when its HLC stamp's wall-clock component falls past the cutoff:

hlc.physical_ms <= cutoff_ts
cutoff_ts = now_ms - RETENTION_WINDOW

Each node computes cutoff_ts locally from its own clock. No cutoff is persisted in the DB or exchanged over the wire. Storing the HLC in packed form (physical_ms in the upper 48 bits, logical in the lower 16) means the comparison is still a plain millisecond test -- identical in shape to the legacy ts: u64 rule.

Parameters

Defined as constants in crates/node/src/retention.rs:

ConstantDefaultMeaning
RETENTION_WINDOW30 daysMessages whose HLC physical_ms is older than this are deleted
GC_INTERVAL1 hourSteady-state interval between GC cycles
GC_SHORT_INTERVAL60 sFollow-up interval after a cycle that hit the batch limit
GC_BATCH_LIMIT100 000Max msg_ids removed in one cycle (bounds cycle duration)
GC_CHUNK_SIZE1 000msg_ids per xor_batch invocation -- bounds Merkle write-lock contention

RETENTION_WINDOW is hard-coded. Changing it requires recompilation and coordinated rollout across the network (see "Clock skew and divergence" below for why mismatched windows are safe but degrade sync convergence).

GC Cycle

retention::gc_cycle(db, merkle_msgs, now) is the unit of work; the background loop retention::run_gc_loop schedules it.

One cycle does:

  1. cutoff_ts = cutoff_ts_at(now). Updates retention_cutoff_ts_ms Prometheus gauge.
  2. Range-delete in CF messages for every chat enumerated from chats_meta: delete_range_cf([chat_id, HlcTimestamp::ZERO, 0], [chat_id, HlcTimestamp::from_parts(cutoff_ts + 1, 0), 0)). The packed HLC puts physical_ms in the upper 48 bits, so every message with hlc.physical_ms <= cutoff_ts -- regardless of logical or seq -- falls strictly below the end key.
  3. Chunked scan of CF seen_msg: collect up to GC_CHUNK_SIZE msg_ids whose embedded hlc.physical_ms is <= cutoff_ts. For each chunk:
    • delete_seen_msg_batch removes them from seen_msg.
    • MerkleTree::xor_batch cancels their contribution from the in-memory messages tree (one write lock per chunk).
  4. Repeat step 3 until either the scan returns less than the chunk size (nothing more to remove) or removed_count >= GC_BATCH_LIMIT.
  5. Update counters: gc_messages_deleted_total, gc_chats_processed_total, and gc_cycle_duration_seconds.

The cycle returns CycleStats { removed_count, chats_processed, hit_limit }. If hit_limit is true, the loop schedules the next cycle after GC_SHORT_INTERVAL instead of the steady-state GC_INTERVAL -- so backlogs drain quickly without forcing a single cycle to do unbounded work.

Why GC writes via direct Arc, not the DB writer pipeline

Most paths (gossip, HTTP API, sync) converge in process_db_op so a single thread owns Merkle mutations. The GC cycle bypasses that channel and takes the Merkle write lock directly, in chunks of GC_CHUNK_SIZE. This keeps the worst-case put_message latency during GC bounded by one chunk (~100 microseconds for 1 000 ids), because the write lock is released between chunks and the DB-writer/sync consumers can interleave.

Backward-compatibility for legacy seen_msg entries

Older seen_msg entries pre-date the change that stores the messages-CF key as the value (44-byte payload including the 8-byte packed HLC). Legacy entries with empty values are treated as physical_ms = 0 by extract_hlc_physical_from_seen_value, which means they are always picked up on the next GC cycle and naturally reclaimed -- no migration script.

Sync Soft-Filter

The retention boundary is enforced independently on both sides of the sync protocol. This is the property that makes "deleted messages never resurface from peers".

Responder side

When answering Merkle sync requests for SyncDomain::Messages:

  • get_bucket_msg_ids_filtered(db, bucket, cutoff_ts) is used instead of the plain helper. msg_ids whose stored HLC has physical_ms <= cutoff_ts are stripped from the response before the wire send.
  • get_messages_cbor_batch_filtered(db, ids, max_bytes, cutoff_ts) applies the same filter when serving FetchAndPush payloads.

The filter reads the packed HLC from the seen_msg value and pulls out physical_ms (no extra DB lookup) -- O(scanned bucket size).

Receiver side

In sync::handler::store_synced_messages:

  • For every (msg_id, cbor) pair received via FetchAndPush, decode MsgV1, compare msg.hlc.physical_ms() with the local cutoff_ts_now().
  • If physical_ms <= cutoff_ts: drop the message, increment sync_messages_rejected_total, never enqueue DbOp::PutMessage.

The receiver re-checks independently to defend against:

  • Clock skew: peer's cutoff sits a few seconds behind ours.
  • Mismatched parameters: peer compiled with a different RETENTION_WINDOW.
  • Malicious peer: deliberately re-pushes aged-out records.

Why filtering applies only to the Messages domain

SyncDomain::Members and SyncDomain::Identity are mutable CRDT-like domains where records are inherently small and naturally bounded by membership and identity churn. No retention is currently applied to those domains; their sync helpers stay un-filtered.

Merkle Tree Implications

The in-memory Merkle tree for messages is bit-identical to the serialised state of seen_msg:

  • On startup, the tree is rebuilt from for_each_msg_id (yields (msg_id, physical_ms) pairs -- the physical component is currently ignored at startup but is available for future per-cutoff trees).
  • On PutMessage, tree.insert(msg_id) is called once.
  • On retention GC, tree.xor_batch(removed_chunk) cancels the removed msg_ids in one pass per chunk. XOR is its own inverse, so the same primitive used for insert is reused for removal (xor_batch is the symmetric batch form of xor_leaf plus a single recompute_all over the affected level-1 subtrees).

This means after GC the tree root does change. Peers that have not yet GC'd see a different root in the next sync tick; the soft-filter on both sides ensures the discrepancy resolves to "no aged messages were transmitted" rather than "the deleter re-downloads them".

Storage Layout Changes

seen_msg value semantics

The value of CF seen_msg is the 44-byte messages-CF key ([chat_id:32][hlc_packed:u64 BE][seq:u32 BE]); retention uses the extract_hlc_physical_from_seen_value helper to pull the upper 48 bits (physical_ms) out of bytes 32..40. No extra DB lookup is needed -- the read path was already in place for the legacy ts: u64 slot; the HLC migration only changed the interpretation of those 8 bytes.

Physical deletion

range_delete_old_messages(db, chat_id, cutoff_ts) calls delete_range_cf on CF messages from [chat_id, HlcTimestamp::ZERO, 0] to [chat_id, HlcTimestamp::from_parts(cutoff_ts + 1, 0), 0). RocksDB processes this as a tombstone; physical reclamation happens during compaction.

delete_seen_msg_batch(db, msg_ids) removes the matching seen_msg entries in a single WriteBatch.

for_each_chat_id(db, callback) enumerates chats_meta keys so the cycle can issue one range-delete per chat. Empty chats are no-ops in RocksDB and contribute negligible overhead.

Metrics

All under the p2pmes_ Prometheus namespace:

MetricTypeMeaning
gc_cycle_duration_secondsHistogramWall-clock time of one cycle
gc_messages_deleted_totalCounterCumulative msg_ids removed
gc_chats_processed_totalCounterCumulative chats range-deleted
retention_cutoff_ts_msGaugeMost recent cutoff_ts in ms
sync_messages_rejected_totalCounterAged messages dropped by receiver

Only the receiver side of the sync filter produces a metric. The responder side is silent because the underlying DB helpers return only the post-filter result; instrumenting it would require either changing the return shape or doing a double scan.

Clock Skew and Divergence

Each node computes cutoff_ts independently. Skew between nodes leads to a transient window where:

  • Node A (clock 10s fast) considers a message expired.
  • Node B (clock right) still considers it fresh.

What happens in this window:

  • Sync responder filters by local cutoff: A drops the message from bucket responses; B includes it.
  • Sync receiver filters by local cutoff: A rejects the message if B sends it; B accepts it if A sends it (A never will).
  • Eventually consistent: once both clocks reach the message's ts + RETENTION_WINDOW, both nodes agree it is expired and drop it.

No node is ever forced to keep a message it considers expired, and no node is ever forced to discard a message it considers fresh.

A node with a wildly wrong clock simply removes its own data sooner or later than the rest of the network; it cannot poison its peers because the cutoff is never transmitted -- everyone applies their own.

Edge Cases

  • Chats with all-aged history: range-delete reclaims all entries, chats_meta survives with stale last_seq / last_ts -- this is intentional (those fields are monotonic; user_inbox semantics rely on them).
  • Concurrent PutMessage during GC: the DB writer and GC both operate on RocksDB which is thread-safe; the Merkle tree is serialised by its RwLock. PutMessage may briefly block on the Merkle lock during a chunk apply (GC_CHUNK_SIZE bounds this).
  • GC during sync session: bucket responses prepared before the cycle may include msg_ids that GC has since removed; subsequent FetchAndPush will fail to find the payload (None from seen_msg) and the message is simply skipped. Sync converges on the next tick.
  • Node restart mid-cycle: nothing persists about a "current cycle"; the next start-up rebuilds the Merkle tree from current seen_msg, and the next GC tick resumes work. No coordination needed.

See Also

  • docs/SYNC.md -- Merkle protocol, where the soft-filter sits.
  • docs/STORAGE.md -- seen_msg value layout, retention helpers.
  • docs/ARCHITECTURE.md -- background-task layout.
  • crates/node/src/retention.rs -- implementation and unit tests.