Global Picture: Healthy

What we believe is true right now and the questions we need to answer.

Syncing new content breaks my app.

Reads affect Writes, not the other way around.

Reads affect checkpoints, and therefore WAL growth.

Reads read WAL.

A WAL that is not truncated should only affect storage, not contention or pool exhaustion.

Tracking concurrency is expensive.

Observability of concurrent systems is hard.

Why do we have write contention?

Network contention = pool exhaustion.

Request Storm from the Query block.

Feed is being called more often.

We do a lot of overfetching: we want metadata, but the client asks for the whole document graph. We have a get-info call that loads the metadata without replaying all the changes.

There are probably three interacting problems:

Heavy repeated work

Discovery computes local “haves” / store state repeatedly:

discovery A
  peer 1: compute haves
  peer 2: compute haves
  peer 3: compute haves
  ...

But the local DB state is mostly identical for all peers in the same discovery.

So the daemon does expensive read/CPU/I/O work many times.

Heavy work causes slow reads and slow writes

If feed/discovery reads are expensive:

they consume CPU

they hit disk/page cache

they hold DB connections

they increase latency for everything else

Even if reads don’t block WAL writers directly, they still consume shared resources.

Writer contention turns slowdown into failures

While the system is overloaded, writers pile up:

PutMany
peer updates
domain cache
sync writes

Then SQLite’s single writer lock becomes the visible failure point:

SQLITE_BUSY

So SQLITE_BUSY may be a symptom of broader overload, not the only root cause.

Important distinction

Optimizing discovery/feed reduces the amount of work.

Writer queue/admission prevents write storms from causing SQLITE_BUSY.

Both are needed.

Best next technical direction

Cache/reuse local haves per discovery

Compute once per discovery/key/snapshot.

Reuse across peers.

Avoid doing the same DB scan 30 times.
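A minimal sketch of the "compute once, reuse across peers" idea, assuming haves can be keyed by a discovery ID. The names `Haves`, `havesCache`, `Get` and `Forget` are hypothetical and not taken from the daemon's code; the expensive DB scan is represented by the `compute` callback.

```go
package main

import (
	"fmt"
	"sync"
)

// Haves is a hypothetical stand-in for the local store state sent to peers.
type Haves []string

// havesEntry holds one computation; sync.Once guarantees it runs at most once.
type havesEntry struct {
	once  sync.Once
	haves Haves
	err   error
}

// havesCache computes local haves at most once per discovery key.
type havesCache struct {
	mu      sync.Mutex
	entries map[string]*havesEntry
}

func newHavesCache() *havesCache {
	return &havesCache{entries: make(map[string]*havesEntry)}
}

// Get returns the haves for key, running compute only on the first call.
// Concurrent callers for the same key wait for that first computation.
// Note that an error is cached as well, until Forget is called.
func (c *havesCache) Get(key string, compute func() (Haves, error)) (Haves, error) {
	c.mu.Lock()
	e, ok := c.entries[key]
	if !ok {
		e = &havesEntry{}
		c.entries[key] = e
	}
	c.mu.Unlock()

	e.once.Do(func() { e.haves, e.err = compute() })
	return e.haves, e.err
}

// Forget drops the entry so the next discovery recomputes from fresh DB state.
func (c *havesCache) Forget(key string) {
	c.mu.Lock()
	delete(c.entries, key)
	c.mu.Unlock()
}

func main() {
	cache := newHavesCache()
	scans := 0
	compute := func() (Haves, error) {
		scans++ // stands in for the expensive DB scan
		return Haves{"blob-a", "blob-b"}, nil
	}

	var wg sync.WaitGroup
	for peer := 0; peer < 30; peer++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_, _ = cache.Get("discovery-A", compute)
		}()
	}
	wg.Wait()
	fmt.Println("DB scans:", scans) // prints "DB scans: 1"
}
```

Calling `Forget` when a discovery ends keeps the cache scoped to one discovery, so peers in the next discovery do not see stale state.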

Limit discovery fanout

Don’t connect/sync with 30 peers if 5–8 are enough.

Stop early when content is found.

Deduplicate concurrent frontend discovery calls for the same document.

Move best-effort writes out of hot paths

Move the connect() peer DB update into peerWriter.

Domain updates can be delayed/batched.

Add write admission/fairness

Prevent many goroutines from waiting inside BEGIN IMMEDIATE.

Recap

The deeper root is heavy discovery/feed work: CPU + I/O overload makes reads slow, then writer contention turns that overload into SQLITE_BUSY.

SQLITE_BUSY is likely the visible symptom, not the whole disease.

Recomputing “haves” per peer per discovery is likely wasteful.

Fix strategy: cache/dedupe discovery work, reduce fanout, batch best-effort writes, and add writer admission control.
