← quay lại đọcBản in thử6 × 9 in · khổ · trang đối diện22 tr · 4,506 chữ
ii
Maxubrq · GT - Chapter II
The Telltale Signs
“The signs of a graph problem are not in the data. They are in how the question is phrased.”
iii
Maxubrq · GT - Chapter II

The Telltale Signs

five signals that the requirement in front of you is a graph problem
Two requirements that sound nearly identical: one is a lookup, the other is a graph problem that will haunt your team every quarter. The boundary is not in the data or the technology. It is in how the question is phrased, and it can be checked with five signs.
bởi maxubrq
tháng 6 năm 2026
The Telltale Signs

Thursday afternoon backlog grooming, two tickets side by side on the shared screen. Ticket one: find all orders belonging to VIP customers. Ticket two: find all accounts that received money from account X through at most five intermediaries. The same verb, "find all," the same database, the same team in the room, and the product owner reads both in exactly the same tone, as if they were two variants of one task.

On the surface they are. Both filter data by a condition. Both can be estimated in the planning session: ticket one gets three points, ticket two gets five because it sounds "a bit more complex." Both, in the estimator's head, end in a query.

But these two tickets are not two variants of one task. Ticket one is a lookup: one relational step, fixed depth, written in the morning, done by the afternoon, never returns. Ticket two carries the genes of the twelve-minute query from the previous chapter, the one that grew another layer every time the risk team asked "and what about six intermediaries?" One ticket will close within the day. The other, if solved like the first, will come back to haunt this team every quarter.

Tương tác — thử trên web
Diagram
a question

Before reading on, try answering this yourself: if you had to tell these two tickets apart during the planning session, what criterion would you use? Not a feeling, but a criterion you could state in words.

answer.Not by the data, because the same schema serves both tickets. Not by the technology, because both run on Postgres. The criterion lives in the wording of the requirement itself: is the depth fixed or open, and does the question live on connecting verbs?
If your answer was a feeling ('this one sounds more dangerous'), that is precisely the gap this chapter wants to close: turning the feeling into criteria you can say out loud and check during planning.

The question worth asking is: tell them apart by what? Not by the data, because one schema serves both tickets; the accounts and transfers tables have no idea which question they are about to be asked. Not by the technology, because both run on Postgres, it is just that one runs comfortably and the other runs in agony. If the boundary is not in the data and not in the tool, it must sit at another layer, a layer we rarely look at deliberately: how the question is phrased.

The previous chapter ended on a gap, that recognition is a skill nobody teaches. This chapter is one way to fill that gap. Not with intuition that arrives after ten years, but with a five-question test that runs in a minute, on the very sentence the product owner just read aloud.

Postgres or Neo4j Is a Question That Arrives Too Early

Before the test, it is worth pausing on the classification method most of us already use, because it is common enough to be nearly invisible.

It happens in architecture meetings, in the form of questions that sound very professional: should this data live in Postgres or Neo4j, is this a use case for a graph database, should we move the social data into a dedicated graph store. Notice what they share: all of them revolve around where the data is kept. Classifying the problem by classifying the data's container.

This classification is attractive for three legitimate reasons. First, it is immediately actionable: pick the technology and there is concrete work to do, a proof of concept to build, a demo to schedule. Second, it matches how the whole industry organizes knowledge: documentation is written per technology, conference talks are named after technologies, vendors sell by technology, so thinking in technologies is thinking downstream. Third, it has been right elsewhere: time-series data, consider a TSDB; full-text, consider a search engine. The reflex of "this kind of data, that kind of store" has been rewarded often enough to become a habit.

Then it breaks, in two places.

The first break: the same data serves both graph questions and non-graph questions. Return to the account_links table of the previous chapter. The question "how many links does account X have" is a count, one step, a perfectly healthy GROUP BY, nothing graph about it. The question "which fraud cluster does X connect to, through how many hops" is a deep traversal over the same table, the same rows. One table, two natures, depending on the question. Classifying by data forces each table to have exactly one nature, when the nature never belonged to the table. It belongs to the question.

The second break is subtler: classifying by technology locks the decision to the status quo. A team running Postgres will tend to see every requirement as a relational problem, because the relational solution is the one that needs no budget request. A team that just stood up Neo4j will tend the other way: everything starts looking like a graph, because the new store needs use cases to justify its existence. In both cases the tool is defining the problem. It should be the other way around.

Neither failure is anyone's incompetence. They are what happens when a classification scheme is asked to do a job it was never designed for. The container of the data was never evidence about the structure of the question.

So if we classify neither by data nor by technology, what is left to hold onto? What is left is precisely the thing given the least respect in a planning session: the wording of the requirement, exactly as stated. It turns out the sentence the product owner reads aloud, before any° engineer has had time to translate it into tables and columns, already carries enough signal to classify it. What is needed is knowing how to listen.

The Five Signs

The five signs below require no intuition. Each is a question you put to the requirement, answerable with yes or no, and each comes with the reason it is a sign, because a checklist without its mechanism only works on cases identical to the examples.

Relationships Are the Primary Data, Not a Side Attribute

The first sign asks: does this question care about the connections themselves, or about the attributes of individual entities? There is a quick thought experiment: imagine deleting every attribute, names, balances, creation dates, keeping only the bare entities and the connections between them. If the question still means something, the relationships are the primary data.

Sentences you will hear in meetings when this sign is present: which customers were referred by customers who churned. Have these two employees ever shared a manager in the past. Both sentences live entirely on connections, "referred by," "shared a manager"; the attributes of the individuals barely participate.

The counterexample, which sounds similar but is not: which customers have total spending above a hundred million. The purchasing relationship is present in the data, every order is a connection between a customer and a product, but the question merely aggregates an attribute along those connections and then throws the connections away. The connection here is a pipe, not the content.

Why this is a sign: when relationships are the primary data, any representation that hides them inside foreign keys forces the question to take a detour, recreating the connection with a join every time it wants to touch one. The question asks directly about the thing the model° treats as a side detail. The mismatch starts there.

The Traversal Depth Is Unknown When the Query Is Written

The second sign asks: is the number of relational steps a constant at the time the code is written, or does it depend on the data at the time it runs?

This sign shows itself in the wording. Listen for these phrases in a requirement: until, the entire chain, directly and indirectly, all the way to the root, up to N steps where N is a user-adjustable parameter. Walk up the approval tree until someone with authority is found. All dependencies of this service, direct and indirect. Every such phrase is a confession that nobody knows the number in advance.

The counterexample: fetch the order along with its shipping information. It sounds like walking through two tables, and it is, but it is two fixed, known steps; two JOINs today, still two JOINs in ten years. Fixed depth is not a graph problem. It is a schema doing exactly what schemas are for.

Why this is a sign: this is the gene of the twelve-minute query. Fixed depth translates into JOINs, and the translation is done once. Open depth has no static translation at all; every value of N is a different query, and the recursive CTE° is just a way of wrapping that truth so it is easier not to look at. Of the five signs, this is the most reliable one, with the fewest false positives. When you hear "until," you are almost certainly standing in front of a graph problem.

a question

You now have two signs. Try running them on the two tickets from the start of the chapter before reading on: which ticket matches which sign?

answer.Ticket one matches neither: the relationship is just a pipe to an attribute, and one step is one step, forever. Ticket two matches sign two clearly; 'up to five intermediaries' is a number with a history of escalation. Sign one is fainter: the question still resolves to a list of specific accounts, but its content lives on the transfer connections.
If you hesitated over sign one for ticket two, that hesitation is healthy. The signs do not all fire equally strongly on every requirement, and the full test still has three questions to go.

Transitivity in the Phrasing

The third sign asks: does the requirement contain words that mean propagation? Through, indirectly, via, spreads to, drags along, cascading impact, inherits. These are the verbs and prepositions of composition; they describe relationships built out of relationships.

Typical meeting sentences: this user can view the document because the user belongs to a group, the group belongs to an organization, and the organization was granted read access to the whole folder. Or: if this table is wrong, which dashboards are wrong because of it. Both sentences carry a chain of "because... which... therefore" running through several layers of relationships, and the answer does not live at any single layer.

The counterexample: which users have the admin role. The permission is granted directly, one user_roles table, one filter, nothing transitive about it. The word "permission" appears in both sentences, but the structure of the two questions is completely different, and that is the point of this whole chapter: do not classify by the nouns in the sentence, classify by the structure of the sentence.

Why this is a sign: a transitive relationship means the answer lives on the path, not on the record. A system that only indexes records will have to recreate the path by hand, one step at a time, and with that we are back at the second sign, because how long the path is, nobody usually knows in advance. These two signs travel together: transitivity is a property of the question, open depth is its operational consequence.

The Connection Structure Itself Carries the Information

The fourth sign asks: in this question, does who-connects-to-whom matter more than who-is-who? That is, is the information being sought a property of position in the network rather than a property of the entity° itself?

Typical meeting sentences: which service, if it dies, severs the most business flows. Does this group of accounts transact in a closed circle among themselves. To answer the first, you have to know where the flows run, which means looking at the whole network; no column in the services table contains the answer. To answer the second, you have to compare the density of transactions inside the group against transactions leaving it, again a property of the shape, not of any row.

The counterexample, subtler than the earlier ones: which service has the most callers. It sounds very much like "which service matters most," but counting direct callers is a one-step aggregate°; a GROUP BY answers it. Only when the question needs relative position in the whole network, the choke point, the single bridge between two halves of the system, does the structure truly carry the information. The difference between "most connected" and "most dangerously placed" is a difference this book will return to later; for now it is enough to note that they are two different questions.

Why this is a sign: this kind of information does not exist in any row of any table. It is a global property, visible only when the entire dataset is viewed as a network. No local optimization can produce something that was never in the pieces.

Questions About Paths, Cycles, or Clusters

The fifth sign is the most direct one: the requirement asks explicitly about the geometric objects of a network. Which route, is there a cycle, which groups stick together.

Typical meeting sentences: through which routes does money travel from A to B. Does this approval chain contain a cycle, as in A approves B's expenses while B approves A's. What behavioral clusters do our customers naturally fall into. Path, cycle, cluster: these three words have no definition outside the language of networks.

The counterexample: how many transactions are there between A and B. That question asks about one direct edge, a count over one relationship; it asks nothing about paths.

Why this is a sign: when the business says these words, they are speaking graph without knowing it. They do not say "run a cycle detection algorithm," they say "these two approving each other's expenses, is that a problem," but the two sentences are one. The engineer's job at that moment is not to translate the question back into table language to fit the tools at hand, but to recognize the question's native language and keep it intact for as long as possible.

The five signs, looked at together, are not independent. They are five faces of the same thing: the question is phrased over relationships and over the structure of relationships, rather than over the attributes of individual entities. A requirement matching one sign deserves a long look. Matching two or more is close to certainty.

“

The signs of a graph problem are not in the data. They are in how the question is phrased.

”

Now return to the two tickets from the start of the chapter and run the test. Ticket one, find the VIP customers' orders: the relationship is just a pipe to an attribute, the depth is a fixed single step, there is no transitive word, nothing about position in a network, no path, no cycle, no cluster. Zero out of five. Ticket two, find accounts that received money from X through at most five intermediaries: the depth is a parameter, sign two matches. "Through intermediaries" is transitivity in its purest form, sign three matches. And the question is really about the paths money travels, sign five matches. Three out of five. The test runs through all five questions inside the planning session itself, before anyone has assigned a story point°.

Twitter and the Follow Question

The test is tidy on paper. It is worth watching how it behaves on a real problem, at real scale, where the answer is not always a round, satisfying "graph."

In the early 2010s, Twitter's follow relationship lived in MySQL. And the first thing to say, in fairness: it lived well. This is not a story about a naive mistake waiting to be corrected, because most of the questions the system had to answer at the time were one-step questions: fetch someone's follower list, check whether A follows B, count followings. Run the five-sign test on those questions: fixed depth, nothing transitive, nothing about shape. A relational table in MySQL was the correct solution, not a stopgap.

Then the scale changed, and more importantly, the questions changed. Mutual follows, the intersection of two connection sets, to display "people you both follow." Fanout for the timeline, from one posting user out to millions of followers. Intersections, unions, differences over enormous relationship sets, at tens of thousands of operations per second. Run the test again: sign one now matches outright, the follow relationship is the primary data, the connections themselves are the product, not metadata. Sign four matches partially; some questions are beginning to touch structure.

But what about sign two? This is where the case becomes worth studying. The questions Twitter had to answer millions of times a minute were still one-step questions, just one step over enormous data with complex set operations. Friend-of-friend, traversal of two or more hops, was not on the product's critical path.

And the Twitter team did something that, in hindsight, was exactly running this test with a clear head, even if they never called it that: they built FlockDB, a home-grown graph store, and deliberately did not support multi-hop traversal.

FlockDB was announced and open-sourced by Twitter in 2010, with an introduction on their engineering blog. The project's documentation is explicit that the system was designed for very large adjacency lists, one-step queries with set operations, and high throughput, and that it does not aim at multi-hop traversal or graph analytics. FlockDB's later fate, and the build-versus-buy lesson in it, belong to chapter 9.
Not because they could not, but because they had determined which part of the problem was a graph and which part was not. Twitter's real graph problem at the time was one-step-relationships-at-staggering-scale. Deep traversal was the thing a "graph store" seemingly ought to have, except nobody was paying for it.

Correct recognition, it turns out, is not shouting the word graph. Correct recognition is drawing the boundary: this part of the problem is a graph, that part is not, and the solution only needs to cover the first part.

The opposite case sits at Google, in a domain° almost nobody calls a graph: authorization. The question "can this user view this document" sounds like a filter, one row in a permissions table. Until you read carefully how permissions actually work: the user can view it because the user belongs to a group, the group belongs to an organization, the organization was granted access to the whole folder, and the folder inherits permissions down to every file inside it. All of that is sign three, transitivity in its purest form, plus sign two, because how deeply memberships nest is something nobody has decreed in advance. Google restated the authorization problem as relation tuples, and the permission-check question as "does a path exist from the user to the object through these tuples," then built a global system around that phrasing and named it Zanzibar.

The paper "Zanzibar: Google's Consistent, Global Authorization System," USENIX ATC 2019, describes the relation-tuple model and the check API serving hundreds of Google services. Only the recognition part is told here; the reachability° view of this same problem returns in chapter 4.

Tương tác — thử trên web
Diagram

Two cases, two opposite conclusions, from the same test. Twitter looked at a problem that seemed entirely graph and found the part of it that was not. Google looked at a problem that seemed not graph at all and found the graph buried inside. The test is valuable in both directions, and the second direction is the hard one, because there is no word like "network" in the requirement to prompt you. It is also where the test earns its keep: the first direction merely saves money, the second finds the problems that have been quietly mispriced for years.

Three Ways to Misdiagnose

Any usable skill is also misusable. The three mistakes below are common enough that nearly every team will meet at least one, and each has its own way of being caught.

The first is the false positive: seeing a many-to-many table and shouting graph. It looks like this: in an architecture review, someone points at user_groups, order_items, the familiar join tables, and says our data is really a graph, with a new-technology proposal close behind. What makes this mistake hard is that mathematically, it is correct. Every many-to-many relationship is a bipartite graph; there is no arguing with that. But that mathematical correctness is exactly the trap, because representable as a graph and worth asking as a graph are two different things. How to catch it: rerun sign two. The overwhelming majority of many-to-many tables in a system serve only one-step questions: which users are in this group, which items are in this order. Fixed depth, fixed forever. The graph exists; the questions about it do not.

The second is the false negative, and it usually wears the gentlest clothes: "it's just a tree." The org chart, the product category tree, folder-inherited permissions. Because a tree is a familiar structure, it does not trigger the graph reflex, and the usual solution is a loop in the application layer, one query per iteration, go up to the parent, up to the next parent, until the root. It works. It slows as the tree deepens. Then one day, real data does what real data always does: an employee with a dual reporting line answers to two managers in two branches, a category gets cross-linked into two trees at once, and the structure that promised to be a tree quietly grows one extra edge into a graph with a cycle. The walk-up-to-parent loop hangs, or worse, runs forever. How to catch it: do not classify by the data's current shape, classify by the question. "Walk up until" is sign two; "inherits through" is sign three. If the tree matches the test, the tree is a graph problem, however gentle it currently looks, and the solution must be ready for the day it stops being a tree.

The third happens after recognition has succeeded: labeling, then stopping. The whole meeting room agrees, this is a graph problem, everyone nods, the meeting moves on, and everyone leaves carrying a feeling of having understood. Three weeks later, when the solving starts, it turns out each person had been thinking of a different problem on the same graph. One was thinking of finding everything X touches, another of finding the best route between X and Y, another of finding the clusters. The label "graph problem" sounds like a diagnosis but is really the name of a family of diseases. How to catch it: before leaving the meeting, force a restatement of the requirement as an explicit question over relationships, in the form "from X, find every node reachable via edges of type E, within at most N steps, under constraint T." That restated sentence, not the label, is the real product of the recognition skill. The label produces a feeling of completion. The restatement produces the next piece of work.

❦

When a Graph Problem Does Not Need a Graph Solution

At this point one thing needs saying plainly, to head off a misreading that may have been accumulating since the start of the chapter: the five-sign test answers the question "is this a graph problem." It does not answer the question "should we build a graph solution." These two questions are separate, and they will stay separate for the rest of this book.

There are requirements that match the signs, that are genuinely graph problems, and that should still be solved with non-graph tools. Three conditions make that healthy.

The first condition is genuinely shallow, fixed traversal. If the question is two hops, two hops today, and everything known about the business says it will stay two hops, then two JOINs are the correct solution, not a temporary one. The word "temporary" only enters when the depth has a history of escalating.

The second condition is low frequency. A reachability question that runs once a quarter for an audit report can tolerate a recursive CTE that takes a few minutes, and it should. The pain has to be frequent enough to be worth paying for infrastructure; an ugly query that rarely runs is far cheaper than a beautiful system that must be operated all year round.

The third condition is small data. A graph of a few tens of thousands of nodes fits comfortably in the memory of a Python script on a laptop. Read the data out, build the graph in RAM, ask the question, throw it away. Do not architect what a script can solve.

But the condition that really matters is not any one of them alone; it is the product of all three with something that is not in the data at all: the trajectory of the question. The requirement is two hops today, but if it comes from a risk team whose history shows that "one more hop, please" always arrives, then that number two has a short life expectancy. Early recognition is not for building early. Early recognition is so that when the requirement escalates, the team is not surprised, and the decision to change tools is a planned move rather than an evacuation.

A reader who has come this far is holding the test, and it works immediately: one free afternoon running it over the last twenty tickets in the backlog will tell you how many graph problems your system has been hiding.

But between "recognizing this is a graph problem" and "having a graph to ask" there is one more step, a step that sounds so obvious nobody thinks it needs teaching: deciding what is a node and what is an edge. Two teams that correctly recognize the same problem can still build two different models from the same data, and one of the two models will answer the question. The other will be mute.

✱ hết bản in thử