GT - Chapter XIII

LLMs as Graph Builders, and the Quietly Wrong Graph,
when the cost to build drops two orders of magnitude but the cost to trust does not

LLMs have made knowledge graphs a hundred times cheaper to build. But a cheaply built graph carries a failure mode that an expensively built graph does not: it looks complete while being quietly wrong in places no one examines. This chapter is about designing a pipeline that treats every edge as a claim with provenance, not as a settled fact.

Tech · June 17, 2026 · 22 min read

From Luxury Good to Commodity

In 2010, if a pharmaceutical company wanted to build a knowledge graph from its entire internal research corpus, the realistic answer was: you need a team. A group of ontologists to design the schema, classify entity types, and define each relationship category. A group of data stewards to read every document, annotate mentions, and flag individual claims. A group of domain experts to review every significant edge before it entered the graph. A few million dollars. A few years. The result was a graph where every edge had been seen by a human being. Narrow in scope, but trustworthy within that scope.

In 2024, the same task takes an afternoon. A document set, an API call, a prompt asking an LLM to extract entities and relationships, and the output goes straight into the database. Roughly a hundred times cheaper, and hundreds of times faster.

The question this chapter asks is not whether that is good. The question is what it changes, in which direction, and where the cost that nobody accounts for actually sits.

To answer that, we need to look back at three generations of knowledge graphs, because each generation did not just change the price -- it changed the kind of error that users have to live with.

The first generation was fully manual curation. Google Knowledge Graph, launched in 2012, was built on Freebase with an editorial team curating from Wikipedia infoboxes and verified semantic web triples. High cost, slow process, scope limited by what a human team could handle. The characteristic failure mode of this generation was legible: an entity not in the database was not in the graph. Users knew exactly what was missing, because "not yet added" is an observable state, not a hidden one.

The second generation was the traditional extraction pipeline: Named Entity Recognition combined with Relation Extraction, using rule-based systems or models trained on annotated corpora. Cheaper than the first generation, but still expensive -- domain experts needed to write rules, annotated data was required per domain, and the system was brittle against new writing styles or domains absent from training. The failure mode here was different in a useful way: users could still debug by reading the source sentence. Low recall on rare entities, relation extraction errors on complex sentence structures, and when it was wrong, it was wrong in ways that could be traced back to a specific rule or pattern.

The third generation is LLM extraction, and this is where the economics reverse. A single prompt can extract entities and relationships from arbitrary text, with no per-domain training data, no rule writing, no domain expert in the loop. Cost drops two orders of magnitude from the first generation. And alongside that appears a failure mode that neither previous generation had: duplicate entities that never get merged, relationships invented from conditional sentences that read as entirely plausible, accuracy that varies unevenly between document regions well-represented in the model's training data and regions that are not. All of this, in a rendered graph, looks complete.

That is the central point about the third generation: the new failure mode is not "knows nothing about this domain" like the first, not "this sentence is too complex" like the second. The new failure mode does not announce itself. The graph is full of edges, the edges have reasonable-sounding names, the entities have properties. An end user has no way to tell trustworthy edges apart from edges generated from a model's misreading of a conditional clause. And downstream systems cite that graph as structured fact.

A graph that looks complete is more dangerous than a graph that is obviously unfinished, because nobody questions what looks finished.

What Everyone Usually Does

The demo happened on a Wednesday. The product team presented a knowledge graph built from thousands of internal legal documents: which companies were connected to which, who the founders were, where the contracts between parties sat. The visualization was clean, queries returned fast, and leadership nodded.

Three weeks later, the team started using it in earnest to answer analyst questions. Three weeks after that came the first real debug session.

"John Smith," "J. Smith," and "Smith, John" were three separate nodes in the graph. Every query about this person returned fragmented results, each query catching one slice and missing the other two. Nobody on the team knew which node was canonical, nobody knew where to start merging, and nothing in the pipeline recorded which documents those three strings had come from.

An "ACQUIRED" edge connected Company A to Company B, with implicitly high confidence because the pipeline had no separate confidence mechanism. The source of that edge was this sentence in a contract: "In the event that Company A were to acquire Company B, the following indemnification clauses shall apply." The LLM read "Company A acquire Company B" and extracted it as a completed event. The acquisition never happened.

An executive appeared simultaneously as CEO and CTO of the same company in the graph, not because he held both roles, but because two documents from different points in time described him differently and the pipeline had no mechanism for tracking temporal context.

This is a typical three weeks following a naive extraction pipeline. Not because the team did anything wrong, but because the default approach was missing a foundational concept: the pipeline had no notion of a "claim." Every edge was created with the same weight, the same standing. No edge knew where it came from. No edge knew whether it was produced from a factual statement or a conditional one. No entity knew that other mentions were pointing to the same real-world thing.

The most common naive fix is to improve the prompt. Add an instruction: "Only extract relationships that have actually occurred; do not extract conditional or hypothetical statements." Run it again on a small sample, measure precision, see improvement, report to leadership. This is a fix in the right direction, and it resolves one specific symptom. But it is not a fix at the right layer.

A better prompt does not solve entity resolution. "John Smith" and "J. Smith" are still two nodes because the pipeline has no step that compares mentions across documents. It does not create provenance: three months after the graph gives a wrong answer to some question, the team has no path to trace back which edge caused the error, which document was the source, which prompt version produced it. It does not calibrate confidence by relationship type: every edge remains equal, even though "founded_by" and "might_be_affiliated_with" require completely different levels of precision.

A small sample measures precision on what is visible. Low recall on rare entities is invisible from a sample. Entity splitting is undetectable without ground truth about what the canonical entity should be.

What is missing is not a better prompt. What is missing is a different concept of what a pipeline is: extraction is not a transform from text into graph, but the generation of a set of claims, each with provenance, each with confidence, each with an acceptance threshold calibrated by type. From that concept, the pipeline has a completely different shape.
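To make the idea of a claim concrete, here is a minimal sketch in Python. The `Claim` and `Modality` names, the field list, and the 0.9 threshold are assumptions for illustration, and the sketch assumes the extractor is asked to label each claim as asserted or conditional. It is not a prescribed schema.

```python
from dataclasses import dataclass
from enum import Enum


class Modality(Enum):
    ASSERTED = "asserted"        # the text states that it happened
    CONDITIONAL = "conditional"  # "in the event that...", "if...", "were to..."


@dataclass(frozen=True)
class Claim:
    subject: str
    relation: str
    obj: str
    modality: Modality      # assumed to be labeled by the extractor
    confidence: float       # extractor score, not a calibrated probability
    source_passage: str     # the sentence the claim was read from


def admissible(claim: Claim, threshold: float) -> bool:
    """A claim becomes an edge only if it is asserted and clears its threshold."""
    return claim.modality is Modality.ASSERTED and claim.confidence >= threshold


claim = Claim(
    subject="Company A",
    relation="ACQUIRED",
    obj="Company B",
    modality=Modality.CONDITIONAL,
    confidence=0.97,
    source_passage=(
        "In the event that Company A were to acquire Company B, "
        "the following indemnification clauses shall apply."
    ),
)
print(admissible(claim, threshold=0.9))  # False: conditional, however confident
```

The conditional claim is still recorded, so it can be audited later. It simply never becomes an edge.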

A Pipeline with a Contract

The four layers below are not a checklist to tick off and move on. They are four questions that an extraction pipeline must be able to answer before the first edge enters the graph, and each must be designed in from day one, not added after the pipeline is already running.

Schema-Guided Extraction: Ontology as Contract

Two ways of prompting an LLM look similar but are not the same. The first: "Extract entities and relationships from this passage." The second: "Given the following schema, which defines four entity types and six relationship types as described below, which of these appear in this passage?"

With the first framing, the LLM might return relationship types like "might_be_involved_in_the_same_sector" or "was_mentioned_alongside" or whatever the model finds suitable for the text. Nobody promised it would not. With the second framing, the model cannot invent an edge type outside the schema, because the schema is the only valid answer space.

This is what an ontology does in an extraction pipeline: not help the LLM understand the domain better, but constrain the output space to exactly the set that the pipeline's designers defined and are accountable for. In the previous chapter on modeling, an ontology served a human reader; here it serves the machine executing the extraction, and the contract with the machine demands more precision. Relationship names must be distinguishable, not just readable, and both positive and negative examples must be provided so the model can calibrate exactly where the boundary sits.

The trade-off of schema-guided extraction is recall. Relationships that fall outside the schema will not be extracted even if they exist in the document. This is not a bug but a feature: controllable scope matters more than uncontrolled high recall, because the team has no way to handle an edge whose relationship type nobody defined.
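One way to enforce the contract is to validate whatever the model returns against the schema before anything reaches the graph. The sketch below assumes a hypothetical schema of four entity types and six relationship types; the type names and the `Triple` shape are invented for illustration.

```python
from dataclasses import dataclass

# Hypothetical schema: four entity types, six relationship types.
ENTITY_TYPES = {"Person", "Company", "Contract", "Product"}
RELATION_TYPES = {
    # relation name -> (required subject type, required object type)
    "FOUNDED_BY": ("Company", "Person"),
    "ACQUIRED": ("Company", "Company"),
    "EMPLOYS": ("Company", "Person"),
    "PARTY_TO": ("Company", "Contract"),
    "SELLS": ("Company", "Product"),
    "INFLUENCED_BY": ("Product", "Product"),
}


@dataclass(frozen=True)
class Triple:
    subject: str
    subject_type: str
    relation: str
    obj: str
    obj_type: str


def conforms(t: Triple) -> bool:
    """True only if the relation exists and both entity types match its signature."""
    signature = RELATION_TYPES.get(t.relation)
    if signature is None:
        return False  # the model invented a relation type
    if t.subject_type not in ENTITY_TYPES or t.obj_type not in ENTITY_TYPES:
        return False  # the model invented an entity type
    return (t.subject_type, t.obj_type) == signature


def filter_to_schema(triples):
    """Split model output into triples inside and outside the schema."""
    accepted = [t for t in triples if conforms(t)]
    rejected = [t for t in triples if not conforms(t)]
    return accepted, rejected


raw = [
    Triple("Acme", "Company", "FOUNDED_BY", "John Smith", "Person"),
    Triple("Acme", "Company", "was_mentioned_alongside", "Initech", "Company"),
    Triple("John Smith", "Person", "ACQUIRED", "Initech", "Company"),
]
accepted, rejected = filter_to_schema(raw)
print(len(accepted), len(rejected))  # 1 2
```

Keeping the rejected triples is a deliberate choice: a relation type that keeps showing up in the rejects is evidence that the schema may need a new type.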

Entity Resolution: Its Own Problem, Not a Side Step

"John Smith," "J. Smith," and "Smith, John" are three mentions of the same real-world entity. Deciding whether they refer to the same entity is the problem of entity resolution, and it cannot be solved inside an extraction prompt because extraction looks at one document at a time, not at the entire corpus at once.

Entity resolution is a clustering problem: collect all mentions, compute similarity between each pair based on name, co-occurrence context, and entity type, then cluster mentions that are similar enough into a single canonical entity. The structure of the problem is clustering nodes on a similarity graph, where each mention pair has a similarity weight and a decision threshold for merging versus keeping separate.
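The clustering framing can be sketched with a name-only similarity and union-find. This is a deliberately small illustration: the similarity function, the 0.8 score for an initial, and the 0.85 merge threshold are all assumptions, and a real resolver would also use the co-occurrence context and entity type mentioned above.

```python
import re
from itertools import combinations


def normalize(name: str) -> list[str]:
    """Lowercase, turn 'Last, First' into 'First Last', strip punctuation."""
    if "," in name:
        last, first = [part.strip() for part in name.split(",", 1)]
        name = f"{first} {last}"
    return re.sub(r"[^\w\s]", "", name.lower()).split()


def token_match(x: str, y: str) -> float:
    if x == y:
        return 1.0
    if len(x) == 1 or len(y) == 1:       # an initial against a full token
        return 0.8 if x[0] == y[0] else 0.0
    return 0.0


def similarity(a: str, b: str) -> float:
    """Average token agreement; 0.0 when the token counts differ."""
    ta, tb = normalize(a), normalize(b)
    if not ta or len(ta) != len(tb):
        return 0.0
    return sum(token_match(x, y) for x, y in zip(ta, tb)) / len(ta)


def cluster(mentions, threshold=0.85):
    """Union-find over every mention pair whose similarity clears the threshold."""
    parent = {m: m for m in mentions}

    def find(m):
        while parent[m] != m:
            parent[m] = parent[parent[m]]  # path halving
            m = parent[m]
        return m

    for a, b in combinations(mentions, 2):
        if similarity(a, b) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for m in mentions:
        groups.setdefault(find(m), []).append(m)
    return list(groups.values())


clusters = cluster(["John Smith", "J. Smith", "Smith, John", "Mary Jones"])
print(clusters)
```

Under this name-only similarity, "J. Smith" would merge with "Jane Smith" just as readily as with "John Smith", which is exactly why context and entity type belong in the similarity score.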

Why it must be treated as its own problem: if entity resolution is skipped, every metric computed on graph entities is computed on fragmented data. A real person with 80 connections gets split into three nodes, each seeing 25 to 30 connections. His centrality does not appear correctly in the graph. Community detection finds the wrong communities because entity splitting creates artificial boundaries. And there is no way to detect this from outside the graph without ground truth about which entity is canonical.

Entity resolution must run after extraction and before any analytical operation. Not as an advanced step for pipeline version 2.0, but as a prerequisite for any metric on the graph to mean anything.

Provenance Is Non-Negotiable

Every edge must know where it came from: which document, which passage, which chunk ID, which extraction run, which model and prompt version. Not because an audit trail is good in principle, but because without provenance there is no path to debug.

When a downstream graph answers a multi-hop question incorrectly three months later, there are two ways to debug it. The first is to re-extract the entire graph and compare, which takes days and may not surface the cause. The second is to follow provenance from the wrong answer to the wrong node to the wrong edge to the source document and specific passage, which takes a few hours and produces a result. The second path only exists if provenance was attached to every edge from the beginning.

The storage cost of provenance is real at millions of edges. The design decision is whether to store metadata inline in the graph or in a separate store referenced by edge ID. Both have trade-offs in query speed and storage cost, and neither is universally correct. What has no acceptable alternative is not storing it at all.
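As an illustration of the separate-store option, the sketch below keeps provenance records outside the graph, keyed by edge ID. The `Provenance` fields mirror the list above; the class names, the ID formats, and the in-memory dictionary are assumptions standing in for whatever storage a real pipeline would use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provenance:
    document_id: str
    chunk_id: str
    passage: str
    extraction_run: str
    model: str
    prompt_version: str


class ProvenanceStore:
    """Provenance held outside the graph, referenced by edge ID."""

    def __init__(self):
        self._by_edge = {}

    def attach(self, edge_id: str, prov: Provenance) -> None:
        # An edge can be supported by several passages, so keep a list.
        self._by_edge.setdefault(edge_id, []).append(prov)

    def trace(self, edge_id: str) -> list:
        """Every recorded source for an edge; empty if none was stored."""
        return list(self._by_edge.get(edge_id, []))


store = ProvenanceStore()
store.attach(
    "edge-0001",  # hypothetical edge ID
    Provenance(
        document_id="contract-17",
        chunk_id="contract-17#c4",
        passage=(
            "In the event that Company A were to acquire Company B, "
            "the following indemnification clauses shall apply."
        ),
        extraction_run="run-2026-06-01",
        model="example-model",
        prompt_version="v3",
    ),
)
for prov in store.trace("edge-0001"):
    print(prov.document_id, prov.prompt_version, prov.passage)
```

With this in place, debugging the wrong "ACQUIRED" edge is a single lookup that lands on the conditional sentence, rather than a re-extraction of the corpus.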

Confidence and Thresholds per Relationship Type

Not one threshold for every edge type. "FOUNDED_BY" between a person and a company is a high-stakes claim: if it is wrong, every answer about that company's history is wrong downstream. The threshold must be high. "INFLUENCED_BY" between two products or two ideas is an inherently ambiguous claim: documents rarely state it directly, context drives the judgment, and the cost of being wrong is more tolerable. Lower threshold, or use a weighted edge rather than a binary one.

The question for calibrating thresholds is not "what accuracy percentage is good enough." The question is: if this edge is wrong, what does that do downstream? Edges used to answer factual questions need high precision. Edges used to cluster similar products can tolerate noise, because in a large enough cluster one wrong edge does not break the aggregate result. The impact of an error calibrates the threshold, not the accuracy measured on a small sample.
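In code, per-type thresholds can be as small as a lookup table. The numbers below are placeholders chosen for illustration, not recommended values; the default for unlisted types is set strict on the assumption that an unreviewed type should not get a lenient bar.

```python
# Hypothetical thresholds; the values come from downstream impact, not from this sketch.
THRESHOLDS = {
    "FOUNDED_BY": 0.95,     # high stakes: errors propagate into factual answers
    "INFLUENCED_BY": 0.60,  # ambiguous by nature, errors are more tolerable
}
DEFAULT_THRESHOLD = 0.90    # assumed strict default for types nobody calibrated


def accept(relation: str, confidence: float) -> bool:
    """Accept an edge only if it clears the threshold for its own relation type."""
    return confidence >= THRESHOLDS.get(relation, DEFAULT_THRESHOLD)


print(accept("FOUNDED_BY", 0.90))     # False
print(accept("INFLUENCED_BY", 0.70))  # True
```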

“A hand-built graph fails where you know work is missing. A machine-built graph fails where you believe work is done.”

a pause

In your current extraction pipeline, which of these four layers is missing? If entity resolution is absent, do you actually know what you are counting in your downstream metrics?

Think about it for a moment.

Amazon AutoKnow: Where to Spend the Human Budget

Amazon has a problem that scale makes impossible to solve by hand: hundreds of millions of seller listings, each a free-form description written by the seller, with no schema, no standardization, and no consistency across sellers. The same battery gets described differently by every seller, with a different product name and attributes recorded in a different format. Customer questions, such as "which battery is compatible with this camera," require structured relationships that do not exist anywhere in the raw text: product-to-compatible-accessory, product-to-attribute, product-to-category. Manual curation at that scale is not possible.

AutoKnow, published by Amazon Research at KDD 2020, is an automated extraction system drawing from three sources: seller product descriptions, customer Q&A, and customer reviews. These three sources complement each other in predictable ways. Seller descriptions can be wrong or incomplete because sellers have marketing incentives. Q&A reflects genuine customer questions, usually about exactly what buyers need to know. Reviews reflect real post-purchase experience. Amazon's product taxonomy serves as the schema that constrains extraction, making it schema-guided in exactly the sense of the first pipeline layer.

Entity resolution here is a product-matching problem: the same headphones, with seller A listing "Sony WH-1000XM5," seller B listing "Sony Noise Cancelling Headphones XM5 Series," and seller C listing "WH1000XM5 Headphone Sony." Three different strings, one product. Without resolution, every attribute and review attaches to three separate entities rather than one. Queries about that product return three disconnected slices of results, and an analyst cannot synthesize anything useful from them.

The most notable aspect of AutoKnow is not its specific extraction technique or entity resolution algorithm. It is how the system handles the human-in-the-loop question.

At hundreds of millions of products, reviewing all extraction by hand is impossible. But reviewing nothing is also not acceptable, because accuracy would fall short for claims that matter. AutoKnow resolves this by classifying extractions along two axes: model confidence and the impact if the extraction is wrong.

High confidence and low impact, such as common attribute-value pairs like color or weight, requires no human review. The model is very likely right here, and when it is wrong, the error affects something that does not change a purchase decision. Low confidence, regardless of claim type, requires human review before the edge enters the graph, because the model is uncertain. High confidence but high impact, such as compatibility claims where an error leads a customer to buy the wrong accessory, still requires regular human spot-checking.

This is an application of the fourth pipeline layer, confidence per relationship type, with an added dimension: not just model confidence, but confidence multiplied by impact according to claim type in this specific domain. The result is that the team reviews a small fraction of all extractions while reviewing exactly the fraction that matters most.
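The two-axis routing described above can be written as a small decision function. This is a sketch of the idea as this chapter describes it, not AutoKnow's actual implementation; the relation names, the high-impact set, and the 0.9 cutoff are assumptions.

```python
from enum import Enum


class Route(Enum):
    AUTO_ACCEPT = "auto_accept"    # enters the graph without review
    SPOT_CHECK = "spot_check"      # enters the graph, sampled regularly by humans
    HUMAN_REVIEW = "human_review"  # held until a human approves it


HIGH_IMPACT = {"compatible_with"}  # hypothetical: errors cause wrong purchases
CONFIDENCE_CUTOFF = 0.9            # hypothetical


def route(relation: str, confidence: float) -> Route:
    if confidence < CONFIDENCE_CUTOFF:
        return Route.HUMAN_REVIEW  # low confidence, regardless of claim type
    if relation in HIGH_IMPACT:
        return Route.SPOT_CHECK    # confident, but too costly to trust blindly
    return Route.AUTO_ACCEPT       # confident and low impact


print(route("has_color", 0.97))        # Route.AUTO_ACCEPT
print(route("compatible_with", 0.97))  # Route.SPOT_CHECK
print(route("has_color", 0.55))        # Route.HUMAN_REVIEW
```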

The lesson from AutoKnow is not the specific architecture of the system. It is the framing of the question. The question is not "automated or manual," because at sufficient scale no pipeline is trustworthy when fully automated, and no team is fast enough to review everything. The question is: given a fixed human review budget, where do you deploy it so that each hour of review removes the most high-impact errors? Confidence combined with impact is the framework for answering that question, not a binary decision between automation and human oversight in any absolute sense.

Measuring a Graph You Cannot Fully See

If a graph has a million edges and the team can annotate a thousand edges by hand in a week, precision and recall for the full graph cannot be measured directly. This is not a problem with any specific tool or methodology. It is a fundamental limit of the problem.

Most teams work around it by not measuring: using "looks right" when inspecting a small subset, or using "the demo went well" as a proxy for quality. This is not because the team is careless but because good tooling and agreed-upon benchmarks for this problem do not yet exist. The field is immature on evaluation, and that is a statement of fact, not an apology.

Three measurement approaches are possible now, without waiting for standard benchmarks.

The first is stratified sampling. Rather than sampling the graph at random, sample by relationship type, because accuracy is not uniform across types. "FOUNDED_BY" may achieve high precision because this claim is common and usually stated explicitly in documents. "INFLUENCED_BY" may be much lower because the claim is inherently subjective and rarely stated directly. Stratified sampling produces numbers that are actually actionable: you learn which types need a higher threshold or more few-shot examples in the prompt.
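A minimal sketch of stratified sampling follows, assuming each edge is a dictionary with a `relation` key and that a human annotator supplies the correctness judgment (represented here by a precomputed `ok` flag on toy data).

```python
import random
from collections import defaultdict


def stratified_sample(edges, per_type, seed=0):
    """Draw up to per_type edges from each relation type."""
    rng = random.Random(seed)  # fixed seed so the audit sample is reproducible
    by_type = defaultdict(list)
    for edge in edges:
        by_type[edge["relation"]].append(edge)
    return {
        rel: rng.sample(group, min(per_type, len(group)))
        for rel, group in by_type.items()
    }


def precision_by_type(sample, is_correct):
    """Precision per relation type, given a human judgment function."""
    return {
        rel: sum(1 for e in group if is_correct(e)) / len(group)
        for rel, group in sample.items()
        if group
    }


# Toy data: the "ok" flag stands in for an annotator's verdict.
edges = [{"relation": "FOUNDED_BY", "id": i, "ok": True} for i in range(10)]
edges += [{"relation": "INFLUENCED_BY", "id": i, "ok": i == 0} for i in range(3)]

sample = stratified_sample(edges, per_type=5)
print(precision_by_type(sample, lambda e: e["ok"]))
```

A purely random sample of the same size could easily contain no "INFLUENCED_BY" edges at all, hiding the weakest relation type behind the strongest one.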

The second is measuring entity resolution separately from relation extraction. These are two problems with entirely different failure modes and measurement approaches. Entity resolution has cleaner ground truth: whether two mentions refer to the same entity is a binary question with a determinate answer. You can measure precision and recall on an annotated set of mention pairs with a much smaller sample than the full graph. More importantly, if entity resolution quality is low, every metric on relation extraction becomes meaningless because the entity foundation is wrong. Measure ER first, fix ER first, then measure RE.
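Pairwise precision and recall for entity resolution reduce to set arithmetic over mention pairs. The sketch below assumes an annotated set of "same entity" pairs and the set of pairs the resolver chose to merge; the mention IDs are invented.

```python
def pairwise_er_metrics(gold_pairs, predicted_pairs):
    """Precision and recall over unordered mention pairs."""
    gold = {frozenset(p) for p in gold_pairs}
    pred = {frozenset(p) for p in predicted_pairs}
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Annotators say m1, m2, and m3 are the same person.
gold = [("m1", "m2"), ("m1", "m3"), ("m2", "m3")]
# The resolver merged m1 with m2 (correct) and m1 with m4 (wrong).
predicted = [("m2", "m1"), ("m1", "m4")]

precision, recall = pairwise_er_metrics(gold, predicted)
print(round(precision, 2), round(recall, 2))  # 0.5 0.33
```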

The third is downstream task proxy. If the retrieval system using this graph answers multi-hop questions better after entity resolution improves, that is a signal that graph quality is moving in the right direction. This proxy does not measure the graph directly, but it measures what the graph is supposed to serve. A graph that scores well on the proxy is a graph that does its job, even if the absolute quality of every individual edge cannot be fully assessed.

The design implication of this immature evaluation landscape is that provenance is not a retrospective audit trail -- it is evaluation infrastructure from the start. When there is no standard benchmark for measuring full-graph quality, the ability to trace every edge back to its source document is the only way to debug with any precision when something downstream goes wrong. Provenance cannot be attached after the graph is built and the team realizes it needs to debug; it must exist from the first day the pipeline runs.

Three Ways to Fail

Skipping Entity Resolution

Entities get split into multiple nodes, and each node sees only a fraction of the real connections. A real person with degree 80 gets split into three nodes: one with 32 connections, one with 28, one with 20. None of them ranks as a hub in the graph. Community detection finds the wrong communities because entity splitting creates artificial boundaries. Centrality is computed incorrectly because flow through the graph passes through three phantom nodes instead of one.
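The degree arithmetic is easy to reproduce on a toy edge list. The sketch assumes the three mentions connect to disjoint neighbors, which is the simplest case; the neighbor names are placeholders.

```python
from collections import defaultdict

# One real person split across three mentions, with 32 + 28 + 20 = 80 neighbors.
edges = (
    [("John Smith", f"n{i}") for i in range(0, 32)]
    + [("J. Smith", f"n{i}") for i in range(32, 60)]
    + [("Smith, John", f"n{i}") for i in range(60, 80)]
)


def degrees(edges, canonical=None):
    """Node degree, optionally after mapping mentions to a canonical entity."""
    canonical = canonical or {}
    neighbors = defaultdict(set)
    for a, b in edges:
        a, b = canonical.get(a, a), canonical.get(b, b)
        if a == b:
            continue  # a merge can turn an edge into a self-loop; skip it
        neighbors[a].add(b)
        neighbors[b].add(a)
    return {node: len(adjacent) for node, adjacent in neighbors.items()}


before = degrees(edges)
after = degrees(edges, {"J. Smith": "John Smith", "Smith, John": "John Smith"})
print(before["John Smith"], before["J. Smith"], before["Smith, John"])  # 32 28 20
print(after["John Smith"])  # 80
```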

The distinctive danger of this failure mode is that the results look plausible. The graph is full of edges, answers have structure, the visualization is clean. Nothing appears obviously wrong from the outside. Only someone who knows the ground truth of which entity is canonical can detect it, and that person is usually not the one querying the graph day to day.

Extract Once, Then Abandon

Documents are alive. New press releases, new filings, new product launches, personnel changes. A graph that is never re-extracted becomes, after six months, a map of the company from six months ago, while the system continues to cite it as structured fact.

This is a freshness and lifecycle problem, and the mechanism for solving it lives elsewhere, but the pipeline design must anticipate it from the start: provenance includes extraction timestamps so you know which edges came from documents belonging to which period, entity IDs that are stable across re-extraction runs so the next run does not create duplicates, and schema version control so you know which ontology version the next re-extraction will use. Failing to design for freshness from the beginning means that when an update is needed, rebuilding the entire graph from scratch is the only path.
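One way to get IDs that survive re-extraction is to derive them deterministically from the resolved entity instead of generating them per run. The sketch below uses `uuid.uuid5` from the standard library; the namespace URL, the key format, and the edge dictionary shape are assumptions, and the approach presumes entity resolution has already produced a canonical name.

```python
import uuid
from datetime import datetime, timezone

# Hypothetical namespace for this graph; any fixed UUID works.
KG_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/kg")


def entity_id(entity_type: str, canonical_name: str) -> str:
    """Same type and canonical name give the same ID on every run."""
    key = f"{entity_type}:{canonical_name.strip().lower()}"
    return str(uuid.uuid5(KG_NAMESPACE, key))


def stamp_edge(subject_id, relation, object_id, schema_version, extracted_at=None):
    """Attach the extraction time and ontology version to an edge record."""
    when = extracted_at or datetime.now(timezone.utc)
    return {
        "subject": subject_id,
        "relation": relation,
        "object": object_id,
        "schema_version": schema_version,
        "extracted_at": when.isoformat(),
    }


run_1 = entity_id("Person", "John Smith")
run_2 = entity_id("Person", "  john smith ")
print(run_1 == run_2)  # True: the second run does not create a duplicate

edge = stamp_edge(
    entity_id("Company", "Acme"), "FOUNDED_BY", run_1,
    schema_version="2026-06",
    extracted_at=datetime(2026, 6, 17, tzinfo=timezone.utc),
)
print(edge["extracted_at"])  # 2026-06-17T00:00:00+00:00
```

The weakness of name-derived IDs is that a renamed entity gets a new ID, so a real pipeline would likely pair this with an alias table.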

Mixing High- and Low-Confidence Edges Without Distinction

Downstream retrieval has no defense if the graph does not distinguish between an edge with 95% confidence and one with 55%. Consider a three-hop question: if each hop traverses an edge with an independent confidence of 80%, the combined confidence after three hops is roughly 51%. At 55% per hop, the figure after three hops is around 17%.
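The arithmetic is a plain product of per-hop values, under the assumption that the hops are independent:

```python
from math import prod


def path_confidence(edge_confidences):
    """Combined confidence of a path, assuming independent edges."""
    return prod(edge_confidences)


print(round(path_confidence([0.80, 0.80, 0.80]), 3))  # 0.512
print(round(path_confidence([0.55, 0.55, 0.55]), 3))  # 0.166
```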

The point is not these specific numbers, because confidence from LLM extraction cannot be interpreted as a pure statistical probability in any rigorous sense. The point is the principle: the system is answering multi-hop questions without knowing what it is trusting or to what degree, and neither does the user. The final answer may be right or wrong, and there is no way to distinguish from the outside without confidence attached to every step.

Confidence must be part of the schema from the beginning, not a metadata add-on when the team eventually realizes it needs it.

When You Don't Need All of It

The four-layer pipeline is not the right solution for every extraction problem.

A narrow domain with standardized, well-structured documents is often better served by rule-based extraction, which wins on trustworthiness and debug cost. Extracting names, addresses, and contract numbers from invoices with a fixed template: regex plus schema validation is sufficient, cheaper, and when it fails, the failure points back to a specific rule. LLM extraction earns its place at scale and in the presence of document variety, where many document types express the same information in many different ways. When documents are few and standardized, the cost-benefit tilts toward simplicity with nothing lost.
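For the fixed-template case, a rule-based extractor can be a handful of named patterns plus a check that every required field was found. The invoice layout, field labels, and contract-number format below are invented for illustration.

```python
import re

# Hypothetical template: one labeled field per line.
INVOICE_PATTERNS = {
    "contract_number": re.compile(r"Contract No\.\s*([A-Z]{2}-\d{6})"),
    "name": re.compile(r"Bill To:\s*(.+)"),
    "address": re.compile(r"Address:\s*(.+)"),
}


def extract_invoice(text: str):
    """Return (record, failed_fields); every failure names the rule that missed."""
    record, failed = {}, []
    for field, pattern in INVOICE_PATTERNS.items():
        match = pattern.search(text)
        if match:
            record[field] = match.group(1).strip()
        else:
            failed.append(field)
    return record, failed


invoice = """Bill To: Acme Corp
Address: 1 Main Street, Springfield
Contract No. AB-123456
"""
record, failed = extract_invoice(invoice)
print(record["contract_number"], failed)  # AB-123456 []
```

When a field is missing, `failed` points at the exact pattern to inspect, which is the debug property the paragraph above describes.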

Individual layers can also be dropped when the problem they solve does not exist. Entity resolution is unnecessary when entities are already unique in the source: product IDs, user IDs, and account numbers do not need to be resolved. Provenance metadata is unnecessary when the graph is small enough that the team can audit every edge by hand and knows exactly which document each one came from. Confidence thresholds are unnecessary when there is only one simple relationship type and the team has manually validated enough of the corpus to trust it entirely.

What these three conditions share is controllability. A simpler pipeline is sufficient when the team has complete visibility into the graph and can trace errors by hand -- not when the graph is small in the sense of edge count. A graph of ten thousand edges where each edge comes from an identified document, reviewed by the team, is simple enough to not need provenance automation. A graph of ten thousand edges from LLM extraction where nobody knows when each edge was produced or from which sentence is a graph that needs all four layers.

The question that distinguishes when the full pipeline is necessary: if this graph gives a wrong answer six months from now, where do you start debugging? If the answer is "I would know immediately because the graph is small and clear, and I know where every edge comes from," a simpler pipeline is enough. If the answer is "I would have to re-extract everything and compare," then provenance and confidence must be present from the start.

This pipeline, with all four layers, handles the problem of a single extraction pass over a stable corpus. But documents do not stand still. New models emerge, the ontology needs to evolve as domain understanding matures, confidence thresholds need recalibration as the document distribution shifts. The question this chapter has not answered: who responds when this graph gives wrong answers six months from now, and what mechanism keeps the graph in step with new documents without requiring a full rebuild every time something changes?

■

One sentence to remember

A hand-built graph fails where you know work is missing. A machine-built graph fails where you believe work is done.

from Graph Theory in the Wild, Chapter 13
