← quay lại đọcBản in thử6 × 9 in · khổ · trang đối diện22 tr · 4,505 chữ
ii
Maxubrq · GT - Chapter III
Nodes and Edges Are Design Decisions
“A model does not describe reality. It describes the questions you intend to ask reality.”
iii
Maxubrq · GT - Chapter III

Nodes and Edges Are Design Decisions

why two teams looking at the same reality draw two graphs that are not equivalent
Two teams model the same flight data. One model answers the connection question; the other is mute. Modeling is not redrawing reality. It is choosing how to cut reality along the questions you need answered, and every cut opens some doors while closing others.
bởi maxubrq
tháng 6 năm 2026
Nodes and Edges Are Design Decisions

A travel company receives a single source of flight data, clean and complete: flight number, departure airport, arrival airport, departure time, arrival time. Two engineering teams are assigned to build an itinerary search feature. Both teams have already done the previous chapter's work; they ran the test, and the test answered clearly: relationships are the primary data, the question speaks of paths, the depth is unknown° in advance. This is a graph problem, no argument needed.

Team A draws the model° that nearly anyone who has taken an algorithms course would draw. Airports are nodes. A flight is an edge connecting two airports, carrying departure and arrival times as attributes. The drawing goes up on the whiteboard in ten minutes, and everyone who looks at it nods, because it is exactly the drawing in every textbook, every lecture slide, every illustration of a graph that has ever existed. Hanoi connects to Bangkok, Bangkok to Doha, Doha to Lisbon. A miniature route map. Nothing about it invites a second look.

Team B draws something that looks far stranger. In their model, each flight is a node. Flight VN615, Hanoi to Bangkok on Tuesday morning, is a node. Flight QR837, Bangkok to Doha on Tuesday afternoon, is another node. Edges do not connect airports to airports; edges connect flights to flights, and an edge exists only when two flights can actually be connected, meaning the first lands at the very airport the second departs from, and the time between them falls within an allowed window. The airport, the protagonist of every route map, all but vanishes from the drawing, surviving only as an annotation on a node. An outsider looking at this would say Team B drew it wrong, or at least drew it backwards.

Then the business question arrives, as it always does, in one human sentence: find itineraries from Hanoi to Lisbon, at most two connections, each layover no longer than four hours and no shorter than ninety minutes.

Team B answers with a straight traversal. Every constraint of the question has been baked into the structure; edges only exist between connectable flights, so walking Team B's graph is walking the valid itineraries. The machine merely steps.

Team A is stuck. Not stuck as in slow; stuck as in unable to express it. The timing information in their model lives on the edges, but the connection constraint is a constraint between two consecutive edges, the arrival time of this edge against the departure time of that one. A standard traversal steps from node to node and looks at one edge at a time; it has no concept of looking at two adjacent edges at once. Team A starts writing patch code, stashing the just-taken flight in the algorithm's memory, filtering manually at each step, and every line of patch code is a confession that the model cannot say what the question needs said.

Pause here a moment, because this paradox deserves to be looked at directly, before any° explanation gets the chance to dull it. The same data, down to the row. Two equally capable teams. Both correctly recognized this as a graph problem. And if modeling is the act of faithfully redrawing reality, then two faithful drawings of the same reality must be equivalent; they must answer the same questions. They are not equivalent. One speaks, one is mute.

Team A did not draw reality wrong. Airports are real, flights connecting two airports are real, everything on their drawing can be verified out in the world. They simply drew, correctly, a reality that does not contain the answer.

Nouns and Verbs

If you ask an engineer who has just moved to graphs how to decide what is a node and what is an edge, odds are you will hear a heuristic passed down through generations of documentation: read the requirement, underline the nouns, those are nodes; underline the verbs, those are edges.

This heuristic is not bad advice. Try it on any three requirements. A customer buys a product: two nodes and an edge, done. An employee reports to a manager: two nodes, one edge, an org chart appears. An account transfers money to an account: a transaction graph takes shape. Three tries, three immediately usable models, no meeting required, no debate. For a first model of a first graph problem, this is honestly most of what a team needs.

And its value lies somewhere else too, somewhere less often said aloud. A team standing in front of a blank whiteboard needs a way to begin, and nouns and verbs give them a shared language, a procedure mechanical enough that everyone draws nearly the same thing instead of sliding into a philosophical argument about the nature of the data. A heuristic that helps people begin is a valuable heuristic, regardless of how correct it is.

Now hold it against the opening story. Team A is precisely the team that followed the heuristic most faithfully. Airport is a noun, so obvious it requires no thought. Flying from here to there is a verb. The heuristic made no syntactic mistake anywhere; it was applied correctly step by step, and it produced a mute model. The problem is not that the heuristic was misused. The problem is that nowhere in its entire procedure is there a step that forces the drawer to stop and ask: what question do I need to answer?

There is a second limit, deeper. Natural language does not classify the world according to your query needs. The same thing can be a noun in this sentence and an entire relational clause in that one. Flight VN615 is a noun when you say "flight VN615 is delayed," and a verb when you say "Vietnam Airlines flies Hanoi to Bangkok every morning." Grammar bends to the speaker's point of view, one sentence at a time. A heuristic built on grammar inherits that arbitrariness whole.

Which means: the boundary between node and edge is not in the grammar, and it is not inherent in the things themselves either. If it were inherent in the things, a flight could not be Team A's edge and Team B's node at the same time, two drawings equally faithful to one sky.

So something can stand on both sides of the boundary. Something allows a relationship, a line between two dots, to step across and become a dot. The chapter's question is no longer why Team A failed. The question is: what force pushes a relationship into becoming an entity°?

When a Relationship Becomes an Entity

The phenomenon has a name, and the name is worth knowing because you will meet it again throughout the rest of this book: reification°, turning a relationship into an entity.

Before using the term, say it in human language. A relationship begins its life as a line between two dots. It stays a line, until one of three things happens. When it needs attributes of its own, and those attributes participate in the conditions of a question. When it needs to be referenced, pointed at, named directly. Or when it needs to participate, itself, in another relationship. The moment any of those three happens, the line can no longer carry the weight. It has to become a dot.

The term reification comes from knowledge representation, appearing explicitly in the W3C's RDF specification (RDF reification, turning a statement into an object so that statements can be made about it). In classical data modeling, its close relative is the associative entity in ER modeling. One intuition, two naming traditions.

Run the concept backwards into the opening case and everything clicks. A flight is a relationship between two airports, and it lives perfectly well as a relationship, until the business question needs a constraint between two consecutive flights. The connection constraint is a relationship whose two endpoints are themselves relationships. An edge cannot point at an edge; no sane formal system allows it. The moment the question demands it, the flight is forced to become a node, and Team B did not draw something strange. Team B drew the question.

“

A relationship is a line between two dots, until your question forces it to become a dot.

”

The three signals above are worth keeping as a small test, in the same spirit as the five-sign test you already hold. The relationship needs its own attributes, and those attributes are not decoration but sit inside the query's conditions: a flight's departure time is not there to look nice, it sits in the filter clause of every connection question. The relationship needs to be referenced: "move the passenger to a later flight" can only be said when each flight has an identity of its own to point at. The relationship participates in other relationships: connections, codeshares, one flight substituting for another. If your requirement matches any one of the three, that relationship is already a candidate for a node's seat.

And once you see this movement, you see it everywhere, not just in aviation. An employee works for a company, a gentle relationship, two dots and a line, until the day HR needs to ask about the employment period, the title at the time, the salary at the time, the reason it ended, and the contract splits off into an entity standing between person and company. An account transfers money to an account, gentle, until the day an investigator needs to gather twelve specific transactions into one case file and annotate each individually, and the transaction becomes a node, with an identity, with a record. A doctor prescribes medication to a patient, gentle, until dosage and date are needed, and then the question of interaction between two prescriptions, a relationship between two relationships, exactly the shape we just met in the connection story.

At this point everything seems clear. The noun-verb heuristic is the starting point, reification is the upgrade move when the question demands it, the case of Team A and Team B is closed. But look carefully and this solution has just opened a new chasm, far wider than the hole it filled.

If every relationship can become a node when needed, and, looking in the opposite direction, every attribute can be split out into a relationship pointing at a value node, then in principle you can reify forever. Any model can be upgraded into a more detailed, richer, "more correct" model. There is no natural stopping point anywhere in the theory. So the people who do this seriously, at a scale where each modeling decision is worth millions of machine hours, where do they stop, and what do they lean on to stop?

That question cannot be answered on a whiteboard. We have to go look.

❦

The Deliberately Poor Model

By the late 2000s, Facebook's social graph had outgrown every familiar measure. People, pages, posts, comments, photos, likes, check-ins, all connected to each other, and every time someone opened their home page, hundreds of small questions fired into that graph: who are this person's friends, which posts are new, who liked this, which comment is most recent. Multiply that by billions of users and the problem shows its true face: this is not a problem of representing things beautifully. It is a problem of representing things so that billions of small questions per second can be answered at all.

The Facebook team settled on a model that, set next to the academic knowledge graph systems of the same era, looks surprisingly poor. In the system called TAO, the world contains only two kinds of things: objects, nodes with a type and a bag of fields, and associations, typed, directed edges connecting two objects. No type hierarchy, no edges pointing at edges, no nested structure. Two concepts, and that is all.

Source: "TAO: Facebook's Distributed Data Store for the Social Graph," USENIX ATC 2013. The paper describes the objects-and-associations model, the inverse association mechanism, and the role of time in association queries. TAO's caching and consistency layers belong to chapter 11, where the physical limits of graphs at scale are faced directly.

Poor, but not naive. And one detail is worth knowing before going on, because it dissolves a common misunderstanding: this entire graph, at the storage layer, lives on MySQL, a purely relational system, plus an enormous cache layer above it. There is no graph database here at all. The objects-and-associations model is a decision about how to see the data, not a decision about the technology that holds it, and the two separate so completely that one of the largest graphs on the planet sits comfortably inside relational tables. Modeling happens at the level of thought. The technology comes later, and it is a different story.

Back to the model. Two decisions inside it deserve a long, careful look, because both are pure modeling decisions. Neither requires a graph database, a new query language, or any technology beyond what the team already operated. They are decisions about shape, made in full view of the questions.

The first: every association has an inverse maintained in parallel. If A likes B, then B liked-by A exists in the system at the same time, two records, updated together. From a thrift perspective this is extravagance; every edge costs double on write. But the TAO team knew the shape of their questions: the product asks in both directions, who liked this post and which posts has this person liked, at equally enormous frequency, and reads outnumber writes by orders of magnitude. They paid double on the write side to buy two-way questions at one-way prices on the read side. A price that only makes sense at exactly that read-write ratio, which is to say, for exactly that set of questions.

The second: time is a first-class attribute of the edge, and associations are kept in time order. Why time, and not any other of the hundreds of possible attributes? Because the product's real questions almost always have the word "recent" hidden inside them: the newest comments, the latest likes, the activity that just happened. The model is bent around the shape of the questions, decision by decision, and attributes that serve no question are not invited into the structure.

Now hold TAO up against the three reification signals from the previous section. In Facebook's domain°, plenty of relationships are strong candidates. A comment, after all, is a relationship between a person and a post. But a comment needs its own attributes inside queries, needs to be referenced, and needs to participate in other relationships: people like a comment, reply to a comment. All three signals match. And exactly as the framework predicts, a comment in TAO is an object, not an association. Reification happened, selectively, quietly, precisely where the questions required it and nowhere else.

And the price, because every model has one. A poor model means that structurally rich questions are not customers here. Deep multi-hop traversal, complex patterns, global analyses: TAO does not serve them and does not apologize for it; those needs travel a different road, through other systems built for other work. A good model, it turns out, is not the model that answers every question. It is the model that knows exactly which questions it serves, and declines the rest on purpose.

TAO is the answer to "where do you stop reifying" from a team that knew its questions extremely well: a few dozen query shapes, repeated billions of times per second, stable across years. They stopped very early, and stopping early was their strength. But that is one extreme. What happens at the other extreme, where determining the questions is itself the hardest part of the problem?

Is a City a Node or a Property

Airbnb stands at that extreme. Around the middle of the 2010s, they began building an internal knowledge graph connecting listings to places, experiences, events, amenities, with the ambition that search and recommendation could understand "a ski holiday near Tokyo" as an intent, rather than matching disconnected keywords. Unlike Facebook, this team did not have a few dozen fixed questions repeating billions of times. They had a question space that kept growing along with the product: location search today, travel-style suggestions next quarter, and next year something nobody in the room has thought of yet.

Source: the Airbnb engineering blog's series on their internal knowledge graph, including "Scaling Knowledge Access and Retrieval at Airbnb" and the post on contextualizing data with a knowledge graph. The "city as node or property" situation below is an illustrative example reconstructed from the kind of ontology decision those posts describe, not a verbatim account of one specific debate.

And the hardest debate for a team like this, retold by the people inside it, is not a technical debate. It is an ontology debate, meaning a debate about what kinds of things the world inside their system consists of. The question sounds small and splits the room in half: is a city a node, or a property of a listing?

a question

Before reading the analysis, pick a side and write down your reason: should the city be a node, or should it be a property?

answer.It cannot be decided with the information in the question. Property buys a cheap filter ('which listings are in Tokyo') and closes every question in which the city has relationships of its own. Node opens the relational questions and charges entity resolution and curation costs. One clause is missing: what kinds of questions search and recommendation will need to answer over the next two years.
If you found yourself wanting to ask 'for what, exactly?' before choosing a side, you have already arrived at the chapter's central claim, before the prose has had the chance to say it.

It is worth going to the very bottom of this example, because it is the "where do you stop reifying" question from the previous section, dressed in real life.

If the city is a property, a string sitting on each listing, then "which listings are in Tokyo" is answered with a filter: cheap, fast, no new infrastructure. But every question in which the city plays the role of something with relationships of its own goes mute. Which ski regions is Tokyo near. Which cultural region does this city belong to, when is its high season. People who love Kyoto tend to also love which cities. A string has no neighbors, no surroundings, belongs to nothing; it merely equals or does not equal another string.

If the city is a node, all of those questions open up. Tokyo can now connect to regions, to seasons, to events, to similar cities; the connection structure around it begins to carry information. But the price arrives immediately, and it arrives before the benefits: every listing now has to link to the correct city node. "Tokyo," "Tōkyō," "東京," and ten other spellings in data typed in by millions of hosts, all of it has to resolve to one identity. The entity resolution° problem inserts itself between the data and every query, curation costs rise, and a new layer of complexity stands in front of even the simplest questions that a filter used to answer.

The point of the story is this: there is no correct answer waiting inside the geography. Tokyo, by itself, is neither a node nor a property. Tokyo is just Tokyo. The decision belongs entirely to Airbnb, to which kinds of questions they want search and recommendation answering over the next two years, and how much curation cost they are willing to pay for it. The same city, two modeling fates, depending on the product strategy of whoever draws it.

Now set the two companies side by side and let the comparison speak. Facebook chose a deliberately poor model, two concepts, refusing every structural richness. Airbnb chose a deliberately rich ontology, accepting curation costs to buy a wide question space. Two choices opposed in nearly every respect, and both are right, each right for its own problem.

The difference between them, looked at closely, is not in the domain. It is in the maturity of the questions. Facebook knew its questions to the letter, a few dozen query shapes stable across years of product, so they dared to cast a rigid model around exactly that set and discard everything else. Airbnb did not yet know all of its questions; their question space was still growing with the product, so they bought breadth and paid in curation. The degree of certainty about the questions determines the degree of rigidity of the model. And the only thing consistent across both sides is not in either model. It is in the starting point: both teams began from the questions, not from the domain.

And that is the moment to realize the chapter's question was posed wrong from the start, wrong in exactly the way most of us were taught. The question was never "what does this domain look like," because whatever the domain looks like, it will still admit countless different faithful models. The question is always "what do I need to ask it."

“

A model does not describe reality. It describes the questions you intend to ask reality.

”

The first consequence of this view arrives immediately, and it is a little cold. A living product's questions change over time, while the model freezes at design time. Which means the best model for today's questions can be entirely mute before next year's, not because someone drew carelessly, but because the nature of drawing is to lock in a set of questions. Team A of the opening did not fail for lack of skill. They failed because they drew before knowing the question, and most of us were taught to draw in exactly that order: data first, questions later, the map first, the journey after.

❦

A Map of Questions

Restate what has just come to light, in its full form: modeling is not redrawing reality. Modeling is choosing how to cut reality along the questions that need answering. And the cutting has five knives: choosing what is a node is choosing what can be referenced, choosing what is an edge is choosing what can be traversed, choosing an edge's direction is choosing which questions are cheap and which are expensive, choosing weights is choosing what can be compared with what, choosing what is a property is choosing what gets hidden from the structure, present but taking no part in the shape. Every cut opens a few doors and closes a few others; no cut is free.

And the knives are not independent. A relationship demoted to a property can no longer be traversed; an attribute promoted to a node starts demanding resolution and curation. Every model is a budget allocated across the five, and what sets the budget is, again, the questions.

Pick up one of those knives to make it concrete, the knife of direction. A "service A calls service B" graph with edges pointing in the call direction answers "who does A call" very cheaply, and answers "who calls A" very expensively, even though the two questions describe the same reality seen from two sides. Whoever drew the edge direction decided, on behalf of their whole organization, which question would be answered in milliseconds and which would be answered in a meeting, and most decisions of that kind are made without anyone realizing a decision is being made.

This view also overturns a linguistic habit of the trade that we rarely notice. When a system hits a dead end, teams tend to say "we modeled it wrong," with the connotation of an error, a carelessness that should have been avoided. But seen from the side of the questions, most of those "wrong models" are not wrong in the sense of a mistake. They are models faithful to a set of questions that has expired. They once spoke very clearly, to the questions of three years ago. Seen this way, migrating a model stops being a confession. It is the sign of a product still alive, still generating new questions faster than the drawing can freeze.

There is one more layer, and this one sits where our education begins. Every algorithms textbook opens with the same incantation: given a graph G with vertex set V and edge set E. The previous chapter pointed out that the words "given a graph" hide the fact that out in the world nobody gives you a graph. But they hide a second thing, better concealed: V and E are not natural facts. They are the result of a chain of decisions someone made on your behalf, according to their questions, in their time, and very reasonably so, for them. Every time you reuse an existing graph, a dependency graph generated by a tool, a social graph defined by another system, a knowledge graph bought from a vendor, you inherit someone else's set of questions without reading the will. The graph answers fluently whatever its maker once needed to ask, and falls inexplicably silent before what you need to ask, and that silence looks exactly like your fault.

A reader who has come this far is standing in a specific place. The problem has been correctly recognized, thanks to the previous chapter's test. The model has been cut along the questions, thanks to this one. And precisely because it was cut along the questions, the questions are now beginning to show shapes of their own. There are questions shaped like "from here, what can be touched." Questions shaped like "between these two places, which way is best, and what does best mean." Questions shaped like "in this whole network, who stands in the important positions." Questions shaped like "of all these things, which belong together." Each shape is a family of problems, with its own way of being solved, its own price, and its own traps, the kind usually discovered only from the inside.

As for Team A, to close their story. They will redraw; teams like that always redraw. This time they will ask the question before putting pen to board, and their second drawing will look strange to outsiders, exactly the way Team B's drawing once looked strange to them. Perhaps that is the most reliable sign of a model cut along real questions. It rarely looks like the illustration in the textbook.

✱ hết bản in thử