Vol. I — Graph Theory·Chương IV·Graph Theory Ngoài Đời Thật

GT - Chapter IV

Traversal and Reachability: What Touches What,
turning the dependency question from a meeting into a second

A breaking change is ready to deploy, and nobody can answer "which services will be affected" within an hour. Impact analysis, blast radius, data lineage, and permission inheritance are all the same problem class: reachability. The core skill is not BFS but three decisions: edge direction by question, cycles as data, and materialize versus on-demand.

Tech · 11 tháng 6, 2026 · 23 phút đọc

Four o'clock on a Thursday afternoon, a small service is about to receive a breaking change. The code is reviewed, the tests are green, the changelog is written. The person about to press the button is a careful engineer, so before pressing it, he asks the question every decent tech lead asks: which services will be affected?

The room goes quiet for a beat. Then someone opens Slack to ask the neighboring team, someone else says let me grep for it, a third person digs up the wiki page with the architecture diagram, last updated eleven months ago. An hour later, three people have asked seven teams, the list of affected services has nine names on it, plus two question marks and a line reading "billing might call this too, needs checking." Nobody dares sign off. The deploy slips to tomorrow morning.

Not because that room lacked good people. That room was full of good people. The answer was not in anyone's head; it was scattered through code, through config, through old conversations, and partly in the memory of people who left the company last year.

a pause

In the system you work on, how long does the question "if we deploy this, what is affected" take to answer today, and how much would you trust the answer?

Think about it for a moment.

Run chapter two's test on this situation and the result comes back almost instantly: relationships are the primary data, the depth is unknown^° in advance, the question is transitive. Graph problem, no debate. More specifically, this is a reachability^° question: from one node, following "depends on" edges, which set of nodes can be reached.

And that Thursday afternoon's question has twins all over an engineering organization. If this table is wrong, which dashboards are wrong because of it. If we remove this library, what breaks, the question the entire JavaScript ecosystem once answered with one morning of broken npm builds. Why can this user read that file. All of them are one skeleton wearing different clothes, exactly the kind of disguise Part I taught us to see through.

What deserves attention is the distance between two states. On one side, a question three engineers together could not answer in an hour. On the other, a question a system answers in under a second, fast enough to wire into CI, to run before every deploy, fast enough that nobody reschedules to tomorrow morning. That distance is not algorithmic. The algorithm for traversing a graph is in the sophomore textbook, and you have known it for years. The distance lies in whether the organization has the graph to ask.

Snapshots and Film

Before talking about how to build that graph, in fairness, the three mechanisms that meeting room used are not foolish. They are the three spontaneous solutions nearly every organization passes through, and each wins for a stage.

Grep wins while dependencies are direct code calls inside one repo. Function names, endpoint names, topic names sit somewhere in text, and at finding text, nothing beats grep. Asking on Slack wins while the organization is small, while collective memory still covers the whole system, while the author of service A still sits two rows from the author of service B. As for the "which service calls which" spreadsheet, whatever team created it deserves credit, because they were the first people in the organization to see the problem and attempt a systematic answer.

All three share one great advantage: their cost is nearly zero. No infrastructure, no project, nobody to convince. And below a certain threshold of system size, they genuinely are the correct solution, a point we will return to at the end of the chapter, seriously rather than as a courtesy.

But all three also share two breaking points, and the first is technical: they only see direct dependencies. Grep finds who calls A. It does not find who calls the callers of A. The tech lead's question is transitive: A changes, so B wobbles, B wobbles, so C shakes, and the incident, if it detonates, usually detonates at C, the place nobody in that meeting room thought of. Grep answers one hop. The real question has no hop count.

The second breaking point runs deeper, and it explains why that spreadsheet always dies. The dependency graph changes faster than any^° document about it. Every pull request that adds a call is a graph change, every new service is a fresh cluster of edges, and no manual process can keep that pace. The spreadsheet is not wrong because its maintainer was lazy. It is wrong because it is a snapshot, while the thing it tries to describe is a film. A static document about a living structure rots by design, not by anyone's fault.

Which means: the thing to build is not a better document about dependencies. The thing to build is a place where dependencies declare themselves, or reveal themselves, and a structure that can answer the transitive question on top. That is, a graph, living at the system's own pace, treated as infrastructure rather than as documentation.

“
An organization that cannot answer "what depends on what" will answer it with an incident.
”

Three Decisions That Are Not in the Book

Set five questions side by side: which services are affected when I deploy (impact analysis), how far can this incident spread (blast radius), which packages does installing this one drag along (dependency resolution), if this table is wrong which reports are wrong (data lineage), why does this user have read access to that file (permission inheritance). Five problems, five owning teams, platform, SRE, build, data, security, five private vocabularies. One skeleton: a set of nodes, one kind of directed edge, and the question "from this node, following this relationship, what does the closure contain."

Permission inheritance deserves one extra sentence, because it is the quietest member of this family. Authorization systems in the style of Zanzibar, which we touched in chapter two, answer a reachability question at production frequency: does a path exist from this user to that resource, through membership and inheritance relations, or not. Every time you successfully open a file on a document-sharing system, a reachability question has just been answered, too fast for anyone to call it by that name.

Now, the part this chapter will not teach: how to traverse. BFS, DFS, marking visited, you have studied them, implemented them, possibly interviewed other people with them. In real systems, the choice between BFS and DFS is almost never the decision that matters. The three decisions below are, and none of the three appears in the sophomore syllabus, because all three only show up once the graph stops being handed to you in the problem statement.

The first decision: the direction of the edge is the direction of the question. The same real relationship, "service A calls service B," can be drawn in two directions: A depends on B, or B is depended on by A. Neither direction is wrong about reality. This is chapter three's lesson returning in operational clothes: the model^° is cut along the question, not along reality. The mechanism is that the impact question, "if we change B, who breaks," traverses against the call direction. If your graph only stores the forward direction, every reverse question is a full scan to find who points at B, and "full scan on every question" is just grep with better structure. The check for this decision is very down-to-earth: write the most frequently asked question on paper, look at which node the traversal starts from and which way it walks. If both directions carry high-frequency questions, store both directions, and accept that every write now writes to two places.

Figure I — The relationship is stored in its natural direction, 'who calls whom.' The organization's most expensive question, 'if we change B, who breaks,' walks against the arrows: from B back to A and C, then from C back to D. The incident, if it detonates, usually detonates at D, two hops upstream, where grep cannot see.

a pause

The reachability question you meet most often at work: which node does it start from, and does it walk with or against the direction stored in your system?

Think about it for a moment.

The second decision: treat cycles as data, not as exceptions. Theory loves DAGs, directed graphs without cycles, because on a DAG everything is tidy: traversals are guaranteed to stop, a topological order exists, every proof is short. Real data does not read theory. Two services calling each other is a real thing, two tables feeding each other through two pipelines is a real thing, and a traversal that does not mark visited will run laps around them until memory runs out. But that is the shallow layer. The deeper one: in a dependency graph, a cycle is usually not a data error but an architectural signal. Two things calling each other are usually one thing that was split in the wrong place. So a good reachability system does not merely survive cycles; it surfaces them as a kind of finding, something worth a human's attention. The check: when your system meets a cycle, what does it do, die, ignore, or report? Only the third option turns the graph from a question-answering tool into a system-seeing tool.

The third decision: materialize, or compute on demand. The transitive answer, the set of all nodes reachable from X, can be precomputed and stored (graph people call it the transitive closure^°), or computed each time it is asked. This is a classic trade-off, freshness exchanged for speed. Precomputed answers come back in milliseconds, but every change to the graph means updating what was stored. Computing at question time is always fresh, but pays by depth, on every question. There is no universally correct answer, only an answer correct for the ratio of two frequencies: a graph that changes hourly but is asked every second should materialize; a graph that changes every second but is asked weekly should compute on demand. Topological sort, the textbook's topological order, seen from this angle turns out to be a special form of materialization: the precomputed build order is a packaged answer to a whole family of "what must finish before what" questions. The check: divide question frequency by change frequency; that number decides for you.

Three decisions, and not one of them is an algorithm. The two organizations below answered these three questions in nearly opposite ways, on two different kinds of graph, and that very opposition is the best evidence that we are looking at a problem class, not a trick belonging to any one domain^°.

Bazel, Paying at Write Time

Google's problem is that Thursday afternoon's problem, scaled up several orders of magnitude. A monorepo nearly the whole company lives in, thousands of engineers committing in parallel, and one question repeating at industrial frequency: given this change, what must be rebuilt, which tests must rerun. Answer too broadly and the build system dies under load, every commit rebuilding half the world. Answer too narrowly and it is worse: a change slips through without running the tests of the things that depend on it, and the bug walks straight into the product.

The people inside saw early that "what needs rebuilding" is exactly reachability on the dependency graph, with one very large if: if that graph exists, is complete, and can be trusted. And this is where the foundational decision of Blaze, later Bazel, was made. They did not infer dependencies from code, the most natural approach and forever an approximation, because code has dynamic imports, reflection, dependency paths static analysis cannot see. They required every build target to declare its dependencies explicitly in a BUILD file. Want to use a library, declare it. Undeclared means unseen, and the build fails at the author's own desk. The codebase, by law rather than by hope, becomes a declared graph.

Hold the three decisions of the previous section against this and they appear in their original form. Direction: engineers declare in the natural direction, I depend on X, but the system's most expensive question, given a change at X what must be rebuilt, walks the other way, so the system has to serve both directions as first-class infrastructure. Cycles: banned at the door. Bazel's build graph is a DAG by law, a dependency cycle is an error blocked at declaration time, and this is a privilege worth remembering, because as we are about to see, not every domain gets to ban. Materialize: with a question-to-change ratio that high, asked on every build by thousands of engineers, the system leans far toward precomputation; the action map and the caches built on the graph are materialization at industrial scale, paying the analysis cost up front so each build touches only what is affected.

The transferable lesson from this case needs careful phrasing, because it is not "use Bazel." The lesson is: the ability to answer a reachability question in a second is not free, and Google chose to pay for it with discipline at write time. Every engineer, every day, maintains BUILD files, in exchange^° for a system that instantly answers the question other organizations answer with a meeting. Cheap to read because it was expensive to write. There is no other direction.

And this solution has real limits, which do not diminish it. Declaration discipline is a daily maintenance cost on every engineer, wrong or superfluous declarations become silent errors only good tooling catches, and the whole approach is feasible partly thanks to a very particular context, one monorepo, and a culture of investing in internal tooling at a rarely seen intensity. Transplanting the solution wholesale into another organization usually fails. Transplanting the pay-at-write-time principle does not.

Lineage, Excavating a Graph

Now flip to a world where the conditions are nearly reversed.

Midnight, a source table in the data warehouse changes schema, a column changes type. The next morning, the question is not which table broke; that is known. The question is which dashboards, which reports, which models are now silently wrong. And a few weeks later the opposite direction arrives: a number on the executive dashboard looks suspicious, and someone has to answer where this column was computed from, through which transformations. The first question is downstream reachability, the second is upstream reachability, both on a graph data people call lineage, the genealogy of data. This is the data world's version of the afternoon nobody dared deploy, except the incident has already gone off.

What makes this world utterly different from Bazel's: there are no BUILD files, and there never will be. Data in a large company flows through thousands of pipelines, handwritten SQL, data scientists' notebooks, drag-and-drop tools, a cron job someone wrote three years ago. Requiring all the authors of all of that to explicitly declare "this job reads these tables, writes those" is a cultural war that cannot be won. So lineage platforms, Databook at Uber, the systems built around the open OpenLineage standard, take the remaining road: excavation. No declarations required; instead, parse query logs and job metadata to infer the "reads-from, writes-to" edges. The graph is not declared; it is reconstructed from traces, the way an ancient city is reconstructed from what remains.

The three decisions are still the three decisions, but the answers change with the conditions. Direction: here you no longer get to pick one direction as primary, because both directions carry high-frequency questions, impact flows downstream and provenance flows upstream, so a lineage graph almost has to serve both directions from the start. Cycles: cannot be banned. Bazel has the right to refuse a dependency cycle at the door, but a lineage platform has no door to guard; it records the world as the world is, and two pipelines feeding each other is something the world does. The only remaining option is the third one from earlier: survive, and report. Materialize: a graph built from logs is, in itself, a form of materialization with a lag; the graph you are querying is the system as of the last collection run, and that lag must be told to users honestly instead of hidden.

But this case adds something to the problem class that Bazel never faces: the trustworthiness of each edge. Edges inferred from parsing can be missing, SQL generated at runtime, jobs running outside the collection system, and they can be spurious. An excavated graph is always an approximation of the real system, and every reachability answer on it inherits that approximation in full. Mature lineage systems handle this with honesty: publishing coverage, marking each edge's provenance, instead of pretending completeness. An answer of "here are 9 affected dashboards, within the 85% of pipelines the system can see" deserves more trust than a perfectly round "here are 9 dashboards."

Set the two cases side by side and what emerges is not two techniques but two prices. Bazel buys an absolutely correct graph with the discipline of thousands of people at write time. Lineage platforms buy an approximately correct graph with excavation infrastructure and honesty about coverage. Two very different prices, paid for the same asset: the ability to answer "what touches what" in a second instead of a meeting. The class is one. The road to the graph is a decision, and that decision belongs to the conditions of the domain, not to the textbook.

Figure II — Two roads to the same asset. On one side, a graph declared through the daily discipline of thousands of engineers, absolutely correct, cycles blocked at the door. On the other, a graph excavated from traces, approximately correct, cycles reported, coverage stated honestly.

Three Ways It Breaks

I have watched reachability systems break in plenty of ways, but most of them reduce to three patterns, and all three are worth telling slowly enough to recognize from a distance.

The first way: edges pointing against the question. It looks like this. A team finishes a "which service calls which" graph, the demo runs beautifully, the question "what does A depend on" answers instantly, everyone applauds. Three months later, it turns out the question the organization actually needs is the other direction, "who is calling A," because people want to know the impact before changing A, not after. And every time that question is asked, the system scans the entire graph. It is still correct about the data, just useless in operation, and users quietly drift back to asking on Slack. Why it happens: at build time, people draw along the natural declaration direction or the direction data flows, and nobody writes the high-frequency questions down first; the first decision was not made wrong, it was skipped. How to catch it: before building, list the three questions that will be asked most, with their starting nodes and directions; after building, measure whether the slowest question is the same as the most frequent one. The saddest part of this failure is that the fix is usually cheap, one more indexed direction and done; the expensive part is the three months nobody noticed the price being paid.

The second way: forgetting cycles, out of faith in clean data. It looks like this: a traversal hangs on production, or returns an absurdly inflated result, while every test stays green. The tests are green because they run on sample data, and sample data is drawn by hand, and hand-drawn data is tame; nobody draws an infinite loop into their own fixtures. Why it happens: the belief that "dependencies surely have no cycles" is a reasonable belief about well-designed systems and a wrong one about real systems; add the habit formed at school, where problem statements politely hand us clean DAGs. How to catch it: make cycle detection a first-class step of the data ingestion pipeline, not a defensive if-clause inside the algorithm, and treat each new cycle as a finding to be logged and reviewed by a human. The cycle you do not go looking for will eventually come looking for you, usually at the worst possible time.

The third way: materializing, then failing to feed it. It looks like this: the transitive closure is precomputed, queries run dreamily fast for months, until the day a user discovers the answer is missing a service that went to production a quarter ago. Trust in this kind of system collapses once and collapses for good, because its only value is being trusted; people go back to asking Slack, and the system becomes the most expensive spreadsheet in the company. Why it happens: the materialization decision was made on half the facts; people looked at read speed and forgot to sign the obligations clause, that the real cost of materializing is not the first computation but every subsequent change to the graph, for as long as the system lives. How to catch it: before precomputing anything, be able to answer two questions, what triggers recomputation, and how much staleness is acceptable. If you cannot answer them, on-demand, slow but honest, beats fast but lying. And a decent system should expose the age of its answer, computed when, from which snapshot, as part of the answer itself.

Three ways of breaking, and if you look closely, they are the shadows of exactly three decisions. The system does not break where the algorithm is wrong. It breaks where one of the three decisions was skipped, made on half the facts, or handed to luck to make on someone's behalf.

When One Query Is Enough

After two cases the size of Google and Uber, this chapter owes you a rebalancing, because the image likely settling in your head is "doing reachability properly means building a platform." It does not. Bazel and the lineage platforms are answers to pain at industrial frequency, asked thousands of times an hour, by thousands of people and machines. Most teams do not live at that frequency, and the chapter's real test is not "is this reachability" but "is this pain worth infrastructure."

There are three conditions which, when all true, make the no-infrastructure solution healthy. A small graph, a few thousand nodes fitting comfortably in one query or one script that reads everything into memory. A rare question, once a quarter for an audit, where a few minutes of waiting is an entirely reasonable price. And relaxed freshness, where an answer correct as of yesterday is good enough for the purpose at hand. When all three hold, a recursive CTE^° over the relational tables you already have is the correct solution, not a temporary one, not technical debt, and nobody owes anyone an apology.

And here is the sentence worth pausing on longest in this section: writing that query is still doing graph work. Its author still has to choose the direction of the recursion, still has to guard against cycles so the query terminates, still implicitly chooses on-demand over precomputed. All three decisions still run in full; they just run inside one query instead of inside a platform. The pattern lives in the decisions, not in the infrastructure. That is why this chapter teaches three decisions and not a tool.

The only thing that humble choice needs alongside it is an open pair of eyes for escalation signals. The question going from quarterly to weekly. The asker going from a human to a service calling automatically. The freshness requirement going from "yesterday is fine" to "right now." Each signal is one of the three conditions leaving the room, and a team that recognized the problem early upgrades deliberately, along exactly the trajectory we spoke of at the end of chapter two, instead of patching in a panic in the middle of an incident.

Return to that Thursday afternoon one last time. Suppose the organization had its graph, whether a platform or just one query the whole team trusts. The tech lead asks, the screen returns eleven names in one second, no question marks, and the deploy rolls before five o'clock. A question that used to cost a meeting now costs a glance.

But look closely at that list. It says what can be reached. It says nothing about how, which route is short, which route is trustworthy, which route should be taken. For that afternoon's question, "reachable" was everything that needed knowing. There are other questions, and there are more of them than we think, that only begin exactly there.

■

One sentence to remember

An organization that cannot answer "what depends on what" will answer it with an incident.

from Traversal and Reachability: What Touches What

the author’s choice · not an algorithm

Tiếp tục trong Graph Theory Ngoài Đời Thật

← Chương trước

Chương III

Node và edge là quyết định thiết kế

Chương tiếp →

Chương V

Đường đi ngắn nhất, nhưng "ngắn" theo nghĩa nào

or xem toàn bộ tập

A quiet word

What question does this leave you with?

Not published. Never shown to other readers.

In this neighborhood

hand-picked by the author

Opens the door

Nodes and Edges Are Design DecisionsTech · 24 min

“The previous chapter taught that a model is cut along the questions, not along reality. This chapter is the first time that principle goes operational: the direction of the dependency edge is the first modeling decision of Part II, and it decides which question gets answered in milliseconds and which gets answered in a meeting.”

→

The same thread

The Telltale SignsTech · 24 min

“Chapter two handed over the recognition test and stopped there. This chapter runs that test on a concrete situation in exactly one sentence, then continues where chapter two left off: solving. Reachability is the first problem class the reader walks end to end, from requirement to running system.”

→

If you’d rather wander, the full archive is here.

← Node và edge là quyết định thiết kếKếtin →Traversal và reachability: câu hỏi "cái gì chạm tới cái gì" →