Vol. I — Graph Theory·Chương I·Graph Theory Ngoài Đời Thật

GT - Chapter I

You Have Been Solving Graph Problems All Week,
why good engineers fail to see the graph problem right in front of them

A 40-line SQL query running twelve minutes to answer a one-line business question. The problem is not the query. The problem is that nobody in the room realized they were standing inside a graph.

Tech · 10 tháng 6, 2026 · 23 phút đọc

On a Tuesday morning, the risk team sends a backend engineer a request exactly one sentence long: list every customer connected to fraudulent account X, within three hops.

The question is simple to the point of being almost insulting. Everyone understands what three hops means: customer A shares a card with B, B transfers money to C, so C is two hops from A. The data is already sitting in the database, an accounts table, an account_links table, every row a relationship the system has been recording all along, not a single piece missing. The engineer reads the request, sizes it up mentally, and replies in the channel that it will be ready by the afternoon.

The solution practically writes itself. One hop is one JOIN. Two hops, two JOINs. Three hops, three JOINs, plus a condition to exclude accounts already visited so nothing gets counted twice. Done before lunch, results matching a few cases the risk team had already investigated by hand, everyone in the channel reacts with a heart.

The first slip comes the following week. The risk team returns: great results, but these people like to hide behind four or five layers of intermediaries. Can it do five hops?

Each additional level of depth is another self-JOIN, and at this point stacking JOINs by hand starts to smell. Five hops is five nested layers, each repeating nearly the same block of conditions, one fix requiring five edits. The decent thing to do is rewrite the whole query as a recursive CTE, so depth becomes a parameter instead of the hard-coded shape of the statement. The engineer does the decent thing. The SQL swells from eight lines to forty: an anchor clause, a recursive clause, a cycle guard, because real-world link data always has cycles, two accounts pointing at each other is an everyday occurrence, and a final filtering stage. Still readable, if you squint. Still correct, thoroughly tested. Except that during review, a colleague stares at the query for a while and asks, half joking, why a one-line question needs forty lines to ask. Nobody answers, because the joke does not seem to require one.

On production, it runs for twelve minutes.

The indexes are all in place. The execution plan has been read and reread. The links table holds a few tens of millions of rows, large but not unusually large, and the hardware wants for nothing. Twelve minutes, for a question the risk team can state in one line of human language.

There is something here worth sitting with for longer than the ordinary irritation at a slow query. Every ingredient of the solution is present: the data is complete and clean, the engineer is skilled, the tool is a relational database, a thing refined over forty years that runs most of the world's data. Nothing is missing, nothing is shoddy. So what exactly in this problem is resisting all of those good pieces?

The engineer in the story does not ask that question. They open the execution plan again. And that is the most important detail in the whole story: the person who wrote that SQL did not think they were doing graph theory. They thought they were doing JOINs.

A Healthy Reflex

If you are thinking "so keep optimizing, obviously," you are thinking exactly like a professional.

That reflex is trained over years: slow query, read the plan; bad plan, check the indexes; indexes fine, consider denormalizing; not enough, materialized view; still not enough, cache; and if all of that still fails, go back and negotiate with the business. Would three hops do? Are five hops truly necessary? Every step in that chain is reasonable, every step has a textbook, every step has rescued thousands of systems out there. This chapter has no intention of mocking that chain. It is a healthy professional reflex, and in our story, it even wins a battle.

The engineer adds a composite index aligned with the traversal direction, rewrites the recursion to filter earlier, moves the cycle guard up front instead of leaving it at the end. Twelve minutes drops to ninety seconds. A nearly tenfold improvement, the kind worth praising in sprint review, and it does get praised. What deserves the praise is not the number but how it was earned: no new technology, no extra machines, no lowering of the business requirement, just a deeper understanding of one's own tool. That is exactly the kind of victory this profession teaches us to chase, and it reinforces a quiet belief: this problem belongs to the familiar world, dig a little deeper and it stays under control. The story should have ended here, the way thousands of query-tuning stories end, quietly, with nobody needing to think any^° further.

It does not end, because the risk team does not stand still. The fraudsters do not stand still, so the risk team cannot.

The week after, a new request: besides transaction links, count device links too. Two accounts logging in from the same phone are considered connected. One more edge type, another rewrite of the recursion, the query gets a bit slower, gets re-optimized a bit. Then shipping-address links. Then the question flips direction: not "who is connected to X" but "across the entire customer base, who sits within three hops of any account on the blacklist." Every new request sounds tiny in the mouth of the person asking, a sentence added at the end of a meeting, and every one of them forces a substantial rewrite of a query already optimized down to the last spare inch.

By the third month, the shape of the problem starts to show, not in the code but in the rhythm of the work. The cost of each change does not fall with experience, as it normally would. It rises. The more deeply the query is optimized for the current shape of the question, the more brittle it becomes against the next shape. The engineer realizes they have been optimizing something very specific: this query, for this question, at this depth. Meanwhile the risk team lives in a different world, where what they need is an entire species of question, and that species sprouts a new variant every week.

There is an asymmetry in this race, and it does not favor the engineer. Each new variant of the question costs the risk side one sentence. It costs the engineering side a rewrite, a re-optimization, a re-test. One side pays in words, the other side pays in weeks. When two parties are talking about the same thing and the cost of stating it differs by that much, the problem is usually not the competence of the side paying more. It is that this side is being forced to talk about the thing in a language that was never built for it.

Optimizing correctly, and still losing. If optimization skill is not what is losing, then what is?

The Problem Is Not the Query

Put two things side by side and look slowly.

The risk team's question is stated in relationships: who is connected to whom, through how many intermediaries, spreading how far. The entire content of the question lives in the connecting verbs: connected to, through, within. The relational database, despite its name, charges by a different unit: every time you want to take one more relational step, it performs one more JOIN, which means one more act of placing this entire dataset next to that entire dataset and filtering out the matching pairs.

These two cost structures do not inhabit the same world. In the tabular representation of relationships, "one more step" is a global operation, its cost tied to the size of the whole links table, even when all you care about is the neighborhood of a single account. In a representation where each node holds its own list of neighbors, "one more step" is a local operation, a look around from where you stand, its cost tied to how many neighbors are actually there.

Same question, but one side pays by the size of the world, the other by the size of the neighborhood.

Figure I — The risk team's question, drawn the way it was stated: a spread outward from X, one ring of neighbors at a time. Each kind of link (shared card, transfer, device, address) is just a different edge type on the same structure.

An everyday picture helps. Asking "who are the neighbors of house number 14" the relational way is picking up the phone book of the entire city, flipping page by page, underlining everyone whose registered address sits next to number 14. An index makes the flipping much faster, but the essence is still consulting a city directory to answer a question about one street. The graph way is walking to house number 14 and knocking on the two doors beside it. Ask once and the two approaches barely differ. But when the question becomes "now the neighbors of the neighbors, then one ring further, until we find someone we know," the first approach consults the city directory again at every ring, while the second simply keeps walking.

Figure II — Two cost structures for the same three-hop question. On the left, every additional hop puts the entire links table on the scale. On the right, every hop touches only the neighbors actually present where you stand.

While the depth stays fixed and shallow, the two prices have not yet had time to diverge. Three JOINs on a well-indexed table are nothing. But the risk team's question has a property nobody noticed at first: the depth is not a constant. Three hops, then five, then "follow it as far as it spreads." The depth is part of the data. It belongs to the fraudsters, not to the person writing the query. And the relational model^° has no natural way to state a question whose number of JOINs is only known at runtime. The recursive CTE^° is SQL's way of admitting this, an honest but strained admission, like a language borrowing words from another language to talk about something it has no word of its own for.

At this point the whole story can be renamed. The problem was never performance. Performance was merely the earliest symptom. The problem is a language mismatch: the risk team asks in the tongue of relationships and propagation paths, while the engineer is forced to translate into the tongue of sets and joins. Each translation deforms the question a little, and the worst deformation lands precisely on the most important part, that depth which cannot be known in advance. Ninety seconds or twelve minutes is just a measurement of the gap between the two tongues.

The engineer in this story is not weak. They are simply paying down a decision nobody in the room recognized as a decision: choosing a tabular representation of the data for a question that lives by paths.

But if this really is a structural mismatch, not the fault of one person or one team, then it must have left traces elsewhere. Different companies, different decades, domains with nothing in common should have stumbled into exactly this spot, in exactly this way. Have they?

The Same Skeleton

They have. And the three stumbles below were not chosen because they are famous. They were chosen because they resemble each other in no surface detail, only in exactly one place.

In the early 2000s, PayPal nearly died of fraud. Not the scattered, few-transactions kind, but organized crime treating the young payment system as a cash machine, the rate of losses at times directly threatening the company's survival. Every obvious layer of defense had been tried, and the fraudsters walked through all of them the same way: any rule that examined transactions one at a time, they learned and ducked under the threshold. That was precisely what kept Max Levchin's team up at night. Each fraudulent transaction, viewed alone, usually looked perfectly normal: moderate amount, account with history, nothing for a rule to grab onto. The anomaly appeared only when you placed the accounts next to each other: this cluster sharing a range of card numbers, that cluster pointing to the same address, another behaving so identically it could not be coincidence. The fraudsters were not operating accounts. They were operating networks of accounts, and the network was the one thing they could not make look innocent, because the network itself was their money machine. The PayPal team built an internal tool that let investigators trace and see those networks, and named it Igor, after a Russian fraudster who had once taunted them.

In the accounts of that period, almost nobody uses the words graph theory. They talk about tracing accounts that stick together. Hold onto this act of not-naming. We will need it shortly.

Jump to an entirely different domain^°. LinkedIn, in the years it was building People You May Know, faced a question that sounds even gentler than the risk team's: suggest, for each user, the people they might know. The simplest intuition, which turned out to also be the strongest, is friends of friends: people who share many connections with you. Two hops. Only two, the depth is not even unknown^°. But do the multiplication in your head and those two hops change character. A person has a few hundred connections, each of which has a few hundred connections, so the two-hop zone of an ordinary member is already tens of thousands of candidates to score and rank. Multiply by hundreds of millions of members, each needing their own list, lists that must stay fresh while the network changes by the hour, and the gentle friend-of-friend problem becomes one of the heaviest workloads in the entire company, something no query, however optimized, could carry as a query. LinkedIn ultimately had to build dedicated infrastructure to serve traversal over the connection graph, and PYMK is cited in their own technical writing as one of the reasons graph infrastructure became core infrastructure.

A friend-suggestion feature, the kind of thing that sounds like a weekend exercise, quietly shaped the architecture of an entire company.

The third case is the reversal. In March 2016, a developer unpublished his packages from npm after a naming dispute. Among them was left-pad, a package that did exactly one thing, padding characters onto the front of a string, eleven lines of code. Within hours, the builds of thousands of JavaScript projects around the world went red, including some of the ecosystem's giants.

Almost none of those victims had ever heard of left-pad. They did not depend on it, in any sense a human means by the word depend. They depended on A, which depended on B, which depended on left-pad, a chain of links recorded in full inside lockfiles, machine-readable line by line. Only no human had ever looked at it as a whole. The first two cases are about failing to see the graph and therefore failing to solve the problem. This case is the other face of the same coin: failing to see the graph and therefore not knowing you are standing on it, until the moment one link vanishes and the whole floor tilts.

A payments company fighting organized crime, a professional network suggesting connections, an open-source ecosystem whose builds collapsed over eleven lines of code. Three overlapping decades, three domains that do not share a single noun, three different endings: one place survived by seeing it in time, one built its infrastructure around the seeing, one only saw after the fall. But set the three stories side by side and strip away the surface skin, and one skeleton remains: the most important information lives not in the individual records but in how the records connect, the crucial question always takes the form of propagation across multiple hops, and the number of hops is either unknown in advance, or known but transformed into something else entirely when multiplied by scale. In all three places, that skeleton existed long before anyone named it. It was only seen when it hurt enough. And if this repetition is real, it drags along a rather uncomfortable question about the person reading this: where else is that skeleton lying right now, not yet painful enough to show itself?

You Have Been Inside the Graph All Along

Now we can return to the question left hanging at the start of the chapter. Every ingredient was good, skilled engineer, sufficient data, mature tool. So what was resisting?

The answer is that nothing was resisting. The question was aimed at the wrong place from the beginning, and aimed wrong in exactly the way we have been trained to aim wrong. For years, our question about graph theory has been the student's question before an exam: what is this subject for? That question assumes graph theory is a tool on a shelf, waiting for the day a problem labeled "graph" falls from the sky so it can be taken down and used. That day never comes, because problems in the wild do not wear labels.

The question was never what graph theory is for. The question is: of the problems you are already solving with other tools, how many were graphs all along?

“
A graph problem never introduces itself as a graph problem. It just shows up as a query that keeps getting slower.
”

Try an inventory, not of your toolbox, but of your own backlog.

The org chart in the HR system, and the question that arrives every vacation season, "who does this request escalate to when both the manager and the skip-level are out of office": that is a walk along a tree, with a data-dependent stopping condition. The permission system, where users belong to groups, groups belong to larger groups, groups are granted roles, roles open access to folders, folders inherit from parent folders, and the innocent-sounding question "who, in the end, can read this file" turns out to require tracing five or six layers of inheritance: that is reachability^°. The microservice diagram on Confluence that has been stale since last quarter, and the question that silences the room before every major deploy, "if we change this API, which services break": still reachability^°, just walking against the arrows. The "customers who bought this also bought" feature the growth team keeps asking for: a bipartite graph between people and products, projected onto one side. The data team's question, "if this source table is wrong, which dashboards are lying": propagation across the lineage graph. The fraud story that opened this chapter. The pathfinding algorithm inside the delivery feature. That entity_relationships table swelling month by month somewhere in your schema, paired with a while loop in the application layer that keeps fetching one more level until there is nothing left to fetch.

The list does not need to be complete. It only needs to be long enough that one thing becomes hard to deny: you do not occasionally encounter graph problems. You solve them weekly, with JOINs, with loops, with caches, with meetings. And most of those solutions still run, which is exactly what makes all of this hard to see. No alarm sounds when a graph problem is solved with some other tool. There is only a thin layer of friction coating everything: queries slightly slow, code slightly tangled, every new request slightly more expensive than it deserves to be, a few business questions quietly filed under "this one is hard, later." Friction kills no one, so friction gets investigated by no one.

And only now does the detail held back from the PayPal case release its full flavor: they did not call what they were doing graph theory, they called it tracing accounts that stick together. The engineer at the start of this chapter is the same. They did not think they were doing graph theory. They thought they were doing JOINs. Neither of them lacked knowledge. Ask that engineer to define BFS and they answer fluently, could probably write it on a whiteboard without a draft, having drilled it to pass interview rounds. What they lack is not in the algorithms book. What they lack is the bridge between that book chapter and this morning's JIRA ticket, the reflex of looking at a requirement stated in human language and recognizing the mathematical structure standing inside it in plain clothes. The theory you learned at school and the problem you meet at work are the same thing wearing two different outfits, and nobody ever taught us the course on recognizing the outfit.

One thing needs saying clearly, before the excitement runs ahead of itself. Seeing the graph does not mean everything in the inventory above should be uprooted from Postgres and replanted into some graph database this quarter. Seeing the structure is one thing. Deciding whether it is worth the price of solving along that structure is another thing entirely, a harder one, and most of this book exists because of that second thing. This chapter does exactly one job: it moves the place the question is asked from. You used to stand outside the toolshed looking in, asking what each tool is for. From now on you stand among your own problems, asking which of them was never the shape you assumed.

Nobody Teaches the Drawing Step

Every algorithms textbook, every lecture, every interview-prep book opens its graph section with the same ritual: given a graph G with vertex set V and edge set E.

The first two words are the most expensive words, and they are given away for free. Given a graph. Throughout our school years, the graph was always given: the problem statement drew the circles and arrows, or, more generously, handed over the adjacency list. Our work began after that point: traverse it, color it, find the shortest path on it. We were graded on the second part, so we became good at the second part.

Out there, nobody gives anyone a graph. Out there, there is only an account_links table with a few tens of millions of rows, a directory of logs, a lockfile, a sentence added at the end of a risk-team meeting. The graph is present in all of those things, complete and exact, but it exists the way a painting exists before being unrolled: every brushstroke is there, only nobody has seen the picture yet. The distance between an engineer and the solution turns out to rarely be an algorithmic distance. It is the distance between the rolled-up canvas and the painting, between data lying in rows and structure waiting to be recognized. The two steps that come before every algorithm, recognizing that there is a graph, and deciding how to draw it, are two steps that appear in no syllabus, no interview round, no exam we ever passed.

This misalignment is nobody's fault in particular. It is a byproduct of how knowledge gets packaged for transmission. Algorithms package well: defined input, gradable output, provable complexity, so they make it into curricula, into exams, into interview rounds. But the seeing, the moment a person hears "connected within three hops" and the vertices and edges rise up behind the words, does not package that way, so it gets left to experience, which is to say left to luck, to whether you happen to sit next to the right person, trip over the right incident, at a moment when you are still curious enough to ask why.

But if recognition is a skill, then it must be learnable. There must be signs, signs sitting right inside the way a requirement is phrased, checkable like a checklist, with no need to wait for intuition to flash after ten years of stumbling. The story that opened this chapter had at least two such signs lying in the open since that Tuesday morning, before the first line of SQL.

a pause

Reread the Tuesday-morning request: "list every customer connected to fraudulent account X, within three hops." Which two signs in that very sentence foretold everything that followed?

Think about it for a moment.

As for our engineer, they never got the chance to reread it like that. After the third round of optimization the query ran fast enough for the risk team to stop complaining, the ticket was closed, the sprint ended, everyone moved on. A happy ending by every standard of the profession. And that may be the most regrettable thing in the whole story: the pain stopped just in time, right before it could force someone to ask the right question.

■

One sentence to remember

A graph problem never introduces itself as a graph problem. It just shows up as a query that keeps getting slower.

from You Have Been Solving Graph Problems All Week

the author’s choice · not an algorithm

Tiếp tục trong Graph Theory Ngoài Đời Thật

← Chương trước

Đây là chương đầu tiên.

A quiet word

What question does this leave you with?

Not published. Never shown to other readers.

In this neighborhood

hand-picked by the author

Same lens, opposite end of the telescope

Seven Bridges, Six Degrees, and the Hidden Structure of EverythingScience · 77 min

“That essay travels from mathematics out into the world: Euler throws away the map and keeps the structure, and the structure resurfaces in power grids and pandemics. This one travels the opposite way, from a very ordinary JIRA ticket back to the mathematical structure standing inside it in plain clothes. Read together, they are the same object seen from both ends.”

→

Recognition before tools

You Are Not Missing a Pattern, You Are Missing a QuestionTech · 22 min

“That essay argues that what separates two engineers is not their catalog of patterns but the first question they ask of a requirement. This chapter is a concrete instance of exactly that claim: the engineer is not missing BFS, they are missing the question "is this a graph?" Same gap, two names for it.”

→

If you’d rather wander, the full archive is here.

Kếtin →Bạn đã giải graph problem cả tuần nay →