Data flow profiler

A small profiler that follows one piece of data all the way from the daemon to the screen, so we can see where things get slow or get lost.

Problem

Right now we only see parts of the pipeline. The daemon has eztrc, the desktop app has a console flag. But nothing follows one piece of data from start to finish: from the moment a blob arrives in the daemon, through the channel between the daemon and the desktop, into the window, and finally onto the screen.

So when a new comment takes too long to show up, or it never shows up, we don't know where it died. Was it the daemon? The poll in the desktop? The message to the window? The refetch? The render? We are guessing every time.

I want a small tool that just answers this question by looking at one page.

Solution

The idea is simple. Both sides drop little timestamps along the way, all tagged with the same key. The daemon collects them in memory and shows a page where you can see the full timeline per piece of data.

One shared key

We use hm://<account>/<path>?v=<version_cid>. The daemon already has this when it indexes a blob, and the desktop already has it when it asks for a doc. Same string on both sides. No new IDs.
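A minimal sketch of building that key, so both sides format it the same way. The function name journeyKey and the choice to strip leading slashes from the path are assumptions, not something the existing code is known to do.

```typescript
// Builds the shared key hm://<account>/<path>?v=<version_cid>.
// Assumption: the path may arrive with or without leading slashes, so they
// are stripped to keep the string identical on both sides.
function journeyKey(account: string, path: string, versionCid: string): string {
  const cleanPath = path.replace(/^\/+/, "");
  return `hm://${account}/${cleanPath}?v=${versionCid}`;
}

// Placeholder values, not real account IDs or CIDs.
const exampleKey = journeyKey("alice", "/notes/todo", "bafy123");
console.log(exampleKey);
```

Whatever the final rule for slashes is, the point is that it lives in one helper per side and both helpers agree.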

How the timestamps get to the daemon

One new gRPC method, Telemetry.RecordCheckpoints. The desktop side batches stamps and sends them every second. The daemon keeps them in a ring buffer (in memory only, like trcstats). No database, no disk.
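The desktop-side batching could look roughly like this. It is a sketch under assumptions: CheckpointBatcher and SendBatch are hypothetical names, send stands in for whatever client call ends up wrapping Telemetry.RecordCheckpoints, and dropping a batch on a failed send is one possible policy, not a decided one.

```typescript
// One stamp: (key, stage, ts) and nothing else.
interface Checkpoint {
  key: string;   // the shared hm:// key
  stage: string; // e.g. "renderer.grpc_call_start"
  ts: number;    // milliseconds since the Unix epoch
}

// Hypothetical stand-in for the Telemetry.RecordCheckpoints client call.
type SendBatch = (batch: Checkpoint[]) => Promise<void>;

class CheckpointBatcher {
  private pending: Checkpoint[] = [];
  private readonly send: SendBatch;

  constructor(send: SendBatch) {
    this.send = send;
  }

  // Hot path: one clock read and one array append.
  stamp(key: string, stage: string): void {
    this.pending.push({ key, stage, ts: Date.now() });
  }

  // Sends everything collected so far. Does nothing when the buffer is empty.
  async flush(): Promise<void> {
    if (this.pending.length === 0) return;
    const batch = this.pending;
    this.pending = [];
    try {
      await this.send(batch);
    } catch {
      // Assumption: telemetry must never break the app, so a failed
      // batch is dropped instead of retried.
    }
  }

  // Flushes on a timer (every second by default).
  // Returns a function that stops the timer.
  start(intervalMs = 1000): () => void {
    const timer = setInterval(() => void this.flush(), intervalMs);
    return () => clearInterval(timer);
  }
}
```

The app would create one batcher per process, call start() once, and call stamp() at each checkpoint site.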

On by default, free to turn off

The profiler runs by default. To turn it off, set SEED_PROFILER=0. The hot path is one time.Now() plus one append to a small buffer, so the cost should be close to nothing. If pprof shows otherwise, we fix it before merging.
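To make the "one clock read plus one append" idea concrete, here is a small sketch of the switch and a fixed-size ring buffer. It is written in TypeScript for illustration only, although the daemon itself is Go. The names profilerEnabled, RingBuffer and makeStamper are hypothetical, and the capacity is an arbitrary example value.

```typescript
interface Checkpoint { key: string; stage: string; ts: number; }

// The profiler is on unless SEED_PROFILER is exactly "0".
function profilerEnabled(env: Record<string, string | undefined>): boolean {
  return env["SEED_PROFILER"] !== "0";
}

// Fixed-capacity ring buffer: once full, each new stamp overwrites the oldest.
class RingBuffer {
  private readonly slots: (Checkpoint | undefined)[];
  private next = 0;  // index the next stamp will be written to
  private count = 0; // number of filled slots

  constructor(capacity: number) {
    if (capacity < 1) throw new Error("capacity must be at least 1");
    this.slots = new Array<Checkpoint | undefined>(capacity);
  }

  push(c: Checkpoint): void {
    this.slots[this.next] = c;
    this.next = (this.next + 1) % this.slots.length;
    if (this.count < this.slots.length) this.count++;
  }

  // Returns the stored stamps, oldest first.
  toArray(): Checkpoint[] {
    const out: Checkpoint[] = [];
    const start = (this.next - this.count + this.slots.length) % this.slots.length;
    for (let i = 0; i < this.count; i++) {
      const c = this.slots[(start + i) % this.slots.length];
      if (c !== undefined) out.push(c);
    }
    return out;
  }
}

// Returns the hot-path function. When disabled it returns immediately.
function makeStamper(buf: RingBuffer, enabled: boolean) {
  return (key: string, stage: string): void => {
    if (!enabled) return;
    buf.push({ key, stage, ts: Date.now() });
  };
}
```

Because the buffer never grows, memory stays bounded no matter how many stamps arrive. Old stamps simply fall off the end.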

The checkpoints

This is the list of stamps and where they fire. For each one: the side that does it, and the area of code.

backend.blob_indexed — daemon (Go). Blob indexer / activity feed indexer, where observe_time is set.

backend.feed_emitted — daemon (Go). Activity API ListEvents, for each NewBlobEvent.

backend.grpc_request_received — daemon (Go). gRPC server interceptor (preferred) or entry of GetDocument / GetEntity / GetAccount / capability lookups.

backend.grpc_response_sent — daemon (Go). Same place, on the way out.

main.feed_event_received — desktop, Node side. app-sync.ts fetchNewEvents, per event from the poll.

main.invalidation_broadcast — desktop, Node side. app-sync.ts processEvents, right before each appInvalidateQueries.

renderer.invalidation_received — desktop, window side. app-invalidation.ts, on each window.

renderer.refetch_start — desktop, window side. React Query refetch for useEntity / useAccount / useCapabilities.

renderer.grpc_call_start — desktop, window side. Connect transport interceptor (grpc-client.ts).

renderer.grpc_call_end — desktop, window side. Same interceptor, on response.

renderer.cache_updated — desktop, window side. Hook onSuccess for entity / account / capability queries.

renderer.component_rendered — desktop, window side. useEffect when data first becomes defined, in the doc / profile / capability render components.

The "new blob arrived" path uses every stage from top to bottom. The "user opened a doc" path skips the feed stages and starts at renderer.grpc_call_start.
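One way to keep stage names from drifting between the two sides is a single typed list. This is a sketch, not a decided design: the stage names are the ones listed above, while Side, sideOf and isStage are hypothetical helpers. The array follows the order of the list above, which is grouped by side and is not meant as a strict chronological order.

```typescript
type Side = "daemon" | "main" | "renderer";

// Stage names exactly as listed in the checkpoint list.
const STAGES = [
  "backend.blob_indexed",
  "backend.feed_emitted",
  "backend.grpc_request_received",
  "backend.grpc_response_sent",
  "main.feed_event_received",
  "main.invalidation_broadcast",
  "renderer.invalidation_received",
  "renderer.refetch_start",
  "renderer.grpc_call_start",
  "renderer.grpc_call_end",
  "renderer.cache_updated",
  "renderer.component_rendered",
] as const;

type Stage = (typeof STAGES)[number];

// Derives the side from the stage prefix.
function sideOf(stage: Stage): Side {
  if (stage.indexOf("backend.") === 0) return "daemon";
  if (stage.indexOf("main.") === 0) return "main";
  return "renderer";
}

// Type guard for stamps that arrive as plain strings.
function isStage(name: string): name is Stage {
  return (STAGES as readonly string[]).indexOf(name) !== -1;
}
```

With a union type like Stage, a typo in a stamp site becomes a compile error on the desktop side instead of a silent extra row on the page.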

Refreshes and the same data more than once

People refresh. The app opens the same doc again later. The transport retries by itself. The same hm:// URL shows up again and again. If we just appended new stamps to one timeline per URL, the first attempt (the one that died) would disappear under the next one. And the dead one is exactly the interesting one.

So the daemon does not keep "one timeline per URL". It keeps one timeline per attempt, numbered gen 1, gen 2, and so on. A new attempt opens a new generation. Old ones get closed and stay around so you can still look at them.

A new generation opens when a "starting" stamp comes in:

backend.blob_indexed (daemon got a new blob)

backend.grpc_request_received (daemon got a fresh request)

renderer.grpc_call_start (window started a new call)

Everything else just continues the current generation.

A generation closes in one of three ways:

It reached renderer.component_rendered → complete (green).

A newer attempt for the same resource started before this one finished → coalesced (blue). Normal during bursts.

30 seconds passed with no new stamp → abandoned (red if late stage, grey if early). This is the bug bucket.

You can refresh the same doc five times and see five generations side by side. The dead ones stay visible. Multiple rows for the same URL are already a signal: something is hitting it many times.
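The open and close rules above can be sketched as a small state machine. This is a literal reading of the rules, written in TypeScript for illustration even though the real tracker lives in the Go daemon. JourneyTracker and its methods are hypothetical names. Two things are assumptions: a non-starting stamp with no open generation still opens one so it is not lost, and every starting stamp counts as a new attempt. How starting stamps that arrive in the middle of a journey should be treated is exactly the open question about generation rules.

```typescript
type Status = "open" | "complete" | "coalesced" | "abandoned";

interface Stamp { key: string; stage: string; ts: number; }

interface Generation {
  key: string;
  gen: number;    // 1, 2, 3, ... per key
  status: Status;
  stamps: Stamp[];
  lastTs: number; // time of the most recent stamp
}

const OPENERS = [
  "backend.blob_indexed",
  "backend.grpc_request_received",
  "renderer.grpc_call_start",
];
const FINAL_STAGE = "renderer.component_rendered";
const ABANDON_AFTER_MS = 30 * 1000;

class JourneyTracker {
  private readonly byKey = new Map<string, Generation[]>();

  record(s: Stamp): Generation {
    let gens = this.byKey.get(s.key);
    if (gens === undefined) {
      gens = [];
      this.byKey.set(s.key, gens);
    }
    const last = gens.length > 0 ? gens[gens.length - 1] : undefined;
    const isOpener = OPENERS.indexOf(s.stage) !== -1;

    let current: Generation;
    if (last !== undefined && last.status === "open" && !isOpener) {
      current = last; // everything else continues the current generation
    } else {
      // A newer attempt started before the previous one finished.
      if (last !== undefined && last.status === "open") last.status = "coalesced";
      current = { key: s.key, gen: gens.length + 1, status: "open", stamps: [], lastTs: s.ts };
      gens.push(current);
    }
    current.stamps.push(s);
    current.lastTs = s.ts;
    if (s.stage === FINAL_STAGE) current.status = "complete";
    return current;
  }

  // Marks open generations with no stamp for 30 seconds as abandoned.
  sweep(now: number): void {
    this.byKey.forEach((gens) => {
      for (const g of gens) {
        if (g.status === "open" && now - g.lastTs >= ABANDON_AFTER_MS) g.status = "abandoned";
      }
    });
  }

  generations(key: string): Generation[] {
    return this.byKey.get(key) ?? [];
  }
}
```

Closed generations are never reused, so a refresh always lands in a fresh row and the dead attempt stays on the page.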

The /debug/journeys page

Loopback only. Two views at the top:

All — every trace, every generation. Columns: key, gen, last stage, status, total time, per-stage deltas. Sortable. Gaps over 200ms get highlighted in red.

Broken paths — only abandoned traces, grouped by last_stage. So you see at a glance "12 dead at renderer.refetch_start, 3 at main.invalidation_broadcast". This is the page you open when something is not working. It points straight to the part of the pipeline that is losing data.

A small banner at the top of "Broken paths" counts the most common death sites in the last 5 minutes. A quick look tells you if something is flaky right now.
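Both views boil down to two small computations: per-stage deltas with the 200ms highlight, and a count of abandoned traces per last stage over the last 5 minutes. A rough sketch follows. The type and function names (ClosedJourney, stageDeltas, deathSites) are hypothetical, and the page itself would be rendered by the daemon in Go.

```typescript
interface Stamp { stage: string; ts: number; }

interface ClosedJourney {
  status: string;    // "complete", "coalesced" or "abandoned"
  lastStage: string; // stage of the final stamp
  closedAt: number;  // ms timestamp when the generation closed
}

const SLOW_GAP_MS = 200;

// Per-stage deltas: time between each stamp and the one before it.
// Gaps over 200ms are flagged so the page can highlight them.
function stageDeltas(stamps: Stamp[]): { stage: string; deltaMs: number; slow: boolean }[] {
  const sorted = stamps.slice().sort((a, b) => a.ts - b.ts);
  return sorted.map((s, i) => {
    const deltaMs = i === 0 ? 0 : s.ts - sorted[i - 1].ts;
    return { stage: s.stage, deltaMs, slow: deltaMs > SLOW_GAP_MS };
  });
}

// Counts abandoned journeys per last stage inside the window,
// most common death site first.
function deathSites(
  journeys: ClosedJourney[],
  now: number,
  windowMs = 5 * 60 * 1000,
): [string, number][] {
  const counts = new Map<string, number>();
  for (const j of journeys) {
    if (j.status !== "abandoned" || now - j.closedAt > windowMs) continue;
    counts.set(j.lastStage, (counts.get(j.lastStage) ?? 0) + 1);
  }
  return Array.from(counts.entries()).sort((a, b) => b[1] - a[1]);
}
```

The banner would show the first few entries of deathSites, and the "Broken paths" table would use the same grouping without the time window.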

Scope

One developer, phased:

Proto + telemetry server + ring buffer + /debug/journeys page (with both views).

Backend stamp sites (blob_indexed, feed_emitted, grpc_request_received, grpc_response_sent).

Desktop side: profiler module + Connect interceptor + Node-side stamps + window-side stamps.

End-to-end checks: happy path, abandoned, coalesced, refresh / generation correctness, cost check on pprof.

We start with documents, profiles and capabilities. They share the same flow shape (event → invalidation → refetch → render) so they go together. Comments come next, in a separate step.

Rabbit Holes

Hot-path cost going up by accident. One innocent map allocation in a stamp and the budget is gone. We need a pprof check before merging.

Generation rules under retries. The rule for which stamp opens a new generation and which one just continues must be exact, or timelines will hide each other.

Carrying the key through the invalidation message between the desktop's two sides. The current message doesn't have a place for the URL. We need to add one. Small change but it touches a shared shape.

The Connect interceptor needs a small map of "this method puts the account / path / version in these request fields". Boring, easy to get wrong, but small.

renderer.component_rendered must fire when the data is actually painted, not when the hook returned. Needs a useEffect on data first becoming defined, not on hook call.
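For the Connect interceptor's field map mentioned above, a minimal sketch could look like this. The field names in the GetDocument entry are hypothetical placeholders, since the real request messages may name them differently, and keyFromRequest is a made-up helper. The interceptor wiring itself is left out on purpose.

```typescript
interface KeyFields { account: string; path: string; version: string; }

// Method name -> request fields that carry the pieces of the key.
// Hypothetical field names; the real request messages may differ.
const KEY_FIELDS: Record<string, KeyFields> = {
  GetDocument: { account: "account", path: "path", version: "version" },
  // GetEntity, GetAccount and the capability lookups would get entries too.
};

// Returns the shared hm:// key for a request, or undefined when the
// method is not mapped or a field is missing.
function keyFromRequest(method: string, req: Record<string, unknown>): string | undefined {
  if (!Object.prototype.hasOwnProperty.call(KEY_FIELDS, method)) return undefined;
  const fields = KEY_FIELDS[method];
  const account = req[fields.account];
  const path = req[fields.path];
  const version = req[fields.version];
  if (typeof account !== "string" || typeof path !== "string" || typeof version !== "string") {
    return undefined;
  }
  return `hm://${account}/${path.replace(/^\/+/, "")}?v=${version}`;
}
```

Returning undefined for anything unmapped means a missing entry costs us a stamp, not a crash. A unit test per mapped method is probably the cheapest guard against the "easy to get wrong" part.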

No Gos

No disk or database storage. Ring buffer in memory only. Same as trcstats.

No comments now. They come later.

No public exposure. /debug/journeys stays on loopback.

No free-form attributes on the hot path. Stamps are (key, stage, ts) only.

No alerts or SLO dashboards. Just the page and the broken paths view. If we need alerts later, that's a different project.
