Retention and Garbage Collection
Overview
Each node bounds disk growth by deleting messages older than a fixed
RETENTION_WINDOW. Retention is time-based only -- no read-progress
or per-chat policy. The rule is symmetric across all nodes, requires no
network coordination, and applies uniformly to every chat and every
message type.
Rule
A message is eligible for deletion when its HLC stamp's wall-clock component falls past the cutoff:
hlc.physical_ms <= cutoff_ts
cutoff_ts = now_ms - RETENTION_WINDOW
Each node computes cutoff_ts locally from its own clock. No cutoff
is persisted in the DB or exchanged over the wire. Storing the HLC
in packed form (physical_ms in the upper 48 bits, logical in the
lower 16) means the comparison is still a plain millisecond test --
identical in shape to the legacy ts: u64 rule.
Parameters
Defined as constants in crates/node/src/retention.rs:
| Constant | Default | Meaning |
|---|---|---|
RETENTION_WINDOW | 30 days | Messages whose HLC physical_ms is older than this are deleted |
GC_INTERVAL | 1 hour | Steady-state interval between GC cycles |
GC_SHORT_INTERVAL | 60 s | Follow-up interval after a cycle that hit the batch limit |
GC_BATCH_LIMIT | 100 000 | Max msg_ids removed in one cycle (bounds cycle duration) |
GC_CHUNK_SIZE | 1 000 | msg_ids per xor_batch invocation -- bounds Merkle write-lock contention |
RETENTION_WINDOW is hard-coded. Changing it requires recompilation
and coordinated rollout across the network (see "Clock skew and
divergence" below for why mismatched windows are safe but degrade
sync convergence).
GC Cycle
retention::gc_cycle(db, merkle_msgs, now) is the unit of work; the
background loop retention::run_gc_loop schedules it.
One cycle does:
cutoff_ts = cutoff_ts_at(now). Updatesretention_cutoff_ts_msPrometheus gauge.- Range-delete in CF
messagesfor every chat enumerated fromchats_meta:delete_range_cf([chat_id, HlcTimestamp::ZERO, 0], [chat_id, HlcTimestamp::from_parts(cutoff_ts + 1, 0), 0)). The packed HLC putsphysical_msin the upper 48 bits, so every message withhlc.physical_ms <= cutoff_ts-- regardless oflogicalorseq-- falls strictly below the end key. - Chunked scan of CF
seen_msg: collect up toGC_CHUNK_SIZEmsg_ids whose embeddedhlc.physical_msis<= cutoff_ts. For each chunk:delete_seen_msg_batchremoves them fromseen_msg.MerkleTree::xor_batchcancels their contribution from the in-memory messages tree (one write lock per chunk).
- Repeat step 3 until either the scan returns less than the chunk
size (nothing more to remove) or
removed_count >= GC_BATCH_LIMIT. - Update counters:
gc_messages_deleted_total,gc_chats_processed_total, andgc_cycle_duration_seconds.
The cycle returns CycleStats { removed_count, chats_processed, hit_limit }.
If hit_limit is true, the loop schedules the next cycle after
GC_SHORT_INTERVAL instead of the steady-state GC_INTERVAL -- so
backlogs drain quickly without forcing a single cycle to do unbounded
work.
Why GC writes via direct Arc, not the DB writer pipeline
Most paths (gossip, HTTP API, sync) converge in process_db_op so a
single thread owns Merkle mutations. The GC cycle bypasses that channel
and takes the Merkle write lock directly, in chunks of GC_CHUNK_SIZE.
This keeps the worst-case put_message latency during GC bounded by one
chunk (~100 microseconds for 1 000 ids), because the write lock is
released between chunks and the DB-writer/sync consumers can interleave.
Backward-compatibility for legacy seen_msg entries
Older seen_msg entries pre-date the change that stores the
messages-CF key as the value (44-byte payload including the 8-byte
packed HLC). Legacy entries with empty values are treated as
physical_ms = 0 by extract_hlc_physical_from_seen_value, which
means they are always picked up on the next GC cycle and naturally
reclaimed -- no migration script.
Sync Soft-Filter
The retention boundary is enforced independently on both sides of the sync protocol. This is the property that makes "deleted messages never resurface from peers".
Responder side
When answering Merkle sync requests for SyncDomain::Messages:
get_bucket_msg_ids_filtered(db, bucket, cutoff_ts)is used instead of the plain helper. msg_ids whose stored HLC hasphysical_ms <= cutoff_tsare stripped from the response before the wire send.get_messages_cbor_batch_filtered(db, ids, max_bytes, cutoff_ts)applies the same filter when serving FetchAndPush payloads.
The filter reads the packed HLC from the seen_msg value and pulls
out physical_ms (no extra DB lookup) -- O(scanned bucket size).
Receiver side
In sync::handler::store_synced_messages:
- For every (msg_id, cbor) pair received via
FetchAndPush, decodeMsgV1, comparemsg.hlc.physical_ms()with the localcutoff_ts_now(). - If
physical_ms <= cutoff_ts: drop the message, incrementsync_messages_rejected_total, never enqueueDbOp::PutMessage.
The receiver re-checks independently to defend against:
- Clock skew: peer's cutoff sits a few seconds behind ours.
- Mismatched parameters: peer compiled with a different
RETENTION_WINDOW. - Malicious peer: deliberately re-pushes aged-out records.
Why filtering applies only to the Messages domain
SyncDomain::Members and SyncDomain::Identity are mutable CRDT-like
domains where records are inherently small and naturally bounded by
membership and identity churn. No retention is currently applied to
those domains; their sync helpers stay un-filtered.
Merkle Tree Implications
The in-memory Merkle tree for messages is bit-identical to the
serialised state of seen_msg:
- On startup, the tree is rebuilt from
for_each_msg_id(yields(msg_id, physical_ms)pairs -- the physical component is currently ignored at startup but is available for future per-cutoff trees). - On
PutMessage,tree.insert(msg_id)is called once. - On retention GC,
tree.xor_batch(removed_chunk)cancels the removed msg_ids in one pass per chunk. XOR is its own inverse, so the same primitive used for insert is reused for removal (xor_batchis the symmetric batch form ofxor_leafplus a singlerecompute_allover the affected level-1 subtrees).
This means after GC the tree root does change. Peers that have not yet GC'd see a different root in the next sync tick; the soft-filter on both sides ensures the discrepancy resolves to "no aged messages were transmitted" rather than "the deleter re-downloads them".
Storage Layout Changes
seen_msg value semantics
The value of CF seen_msg is the 44-byte messages-CF key
([chat_id:32][hlc_packed:u64 BE][seq:u32 BE]); retention uses the
extract_hlc_physical_from_seen_value helper to pull the upper 48
bits (physical_ms) out of bytes 32..40. No extra DB lookup is
needed -- the read path was already in place for the legacy ts: u64
slot; the HLC migration only changed the interpretation of those 8
bytes.
Physical deletion
range_delete_old_messages(db, chat_id, cutoff_ts) calls
delete_range_cf on CF messages from
[chat_id, HlcTimestamp::ZERO, 0] to
[chat_id, HlcTimestamp::from_parts(cutoff_ts + 1, 0), 0). RocksDB
processes this as a tombstone; physical reclamation happens during
compaction.
delete_seen_msg_batch(db, msg_ids) removes the matching seen_msg
entries in a single WriteBatch.
for_each_chat_id(db, callback) enumerates chats_meta keys so the
cycle can issue one range-delete per chat. Empty chats are no-ops in
RocksDB and contribute negligible overhead.
Metrics
All under the p2pmes_ Prometheus namespace:
| Metric | Type | Meaning |
|---|---|---|
gc_cycle_duration_seconds | Histogram | Wall-clock time of one cycle |
gc_messages_deleted_total | Counter | Cumulative msg_ids removed |
gc_chats_processed_total | Counter | Cumulative chats range-deleted |
retention_cutoff_ts_ms | Gauge | Most recent cutoff_ts in ms |
sync_messages_rejected_total | Counter | Aged messages dropped by receiver |
Only the receiver side of the sync filter produces a metric. The responder side is silent because the underlying DB helpers return only the post-filter result; instrumenting it would require either changing the return shape or doing a double scan.
Clock Skew and Divergence
Each node computes cutoff_ts independently. Skew between nodes leads
to a transient window where:
- Node A (clock 10s fast) considers a message expired.
- Node B (clock right) still considers it fresh.
What happens in this window:
- Sync responder filters by local cutoff: A drops the message from bucket responses; B includes it.
- Sync receiver filters by local cutoff: A rejects the message if B sends it; B accepts it if A sends it (A never will).
- Eventually consistent: once both clocks reach the message's
ts + RETENTION_WINDOW, both nodes agree it is expired and drop it.
No node is ever forced to keep a message it considers expired, and no node is ever forced to discard a message it considers fresh.
A node with a wildly wrong clock simply removes its own data sooner or later than the rest of the network; it cannot poison its peers because the cutoff is never transmitted -- everyone applies their own.
Edge Cases
- Chats with all-aged history: range-delete reclaims all entries,
chats_metasurvives with stalelast_seq/last_ts-- this is intentional (those fields are monotonic;user_inboxsemantics rely on them). - Concurrent PutMessage during GC: the DB writer and GC both
operate on RocksDB which is thread-safe; the Merkle tree is
serialised by its
RwLock. PutMessage may briefly block on the Merkle lock during a chunk apply (GC_CHUNK_SIZEbounds this). - GC during sync session: bucket responses prepared before the
cycle may include msg_ids that GC has since removed; subsequent
FetchAndPush will fail to find the payload (None from
seen_msg) and the message is simply skipped. Sync converges on the next tick. - Node restart mid-cycle: nothing persists about a "current cycle";
the next start-up rebuilds the Merkle tree from current
seen_msg, and the next GC tick resumes work. No coordination needed.
See Also
docs/SYNC.md-- Merkle protocol, where the soft-filter sits.docs/STORAGE.md--seen_msgvalue layout, retention helpers.docs/ARCHITECTURE.md-- background-task layout.crates/node/src/retention.rs-- implementation and unit tests.