Vol. I — Graph Theory·Chương VII·Graph Theory Ngoài Đời Thật

GT - Chapter VII

Community Detection: When Clusters Emerge on Their Own,
the algorithm always finds clusters, even in noise

Marketing segments customers by age, region, and spend. Behavioral data draws clusters that match no column: anglers and campers are one community, gym-goers split into three. The clusters are real, but a cluster-finding algorithm never says "there is nothing here", and the real skill is the skepticism loop that runs after it answers.

Tech · 12 tháng 6, 2026 · 24 phút đọc

The Clusters Nobody Named

A mid-sized e-commerce marketplace, the start of the quarter, the marketing team meeting to prepare the biggest campaign of the year. The first task, as every year, is splitting the customer base into segments so the messaging can be tailored. The familiar, reasonable way: split by attributes. Age, region, average spend over three months. The segmentation table is born, each cell with a name, an illustrative persona, even a stock photo of someone smiling. Young urban high spenders. Suburban families. Sports customers. Nobody in the room sees anything to discuss, because this table looks the same every year.

Meanwhile, an analyst on the data team does something nobody asked for. She does not look at the attribute columns; she looks at behavior: who buys what, who views what in the same session as what. From that she builds a network, customers connected through the products they both touched, with not one demographic attribute taking part. Then she lets the structure speak for itself.

What appears matches no cell of the segmentation table. People who buy fishing gear and people who buy camping gear fuse into one block, even though their ages run from twenty to sixty and their regions run from the city to the coast. The gym buyers, that tidy "sports customers" cell, split into three crowds that barely touch, which only later turn out to be three entirely different training cultures: one living around barbells and protein, one around mats and scented candles, one around running shoes and GPS watches. The clusters are visible at a glance, clinging to each other in a way that appears in no classification table the company owns, and no column in the database can name them.

The meeting room splits into two reactions, and both are half right. The excited half says this is real segmentation at last, dividing people by what they do instead of what they declared. The skeptical half asks the killer question: how do we know these crowds are real? Every dataset has grain, and what algorithm running on what data ever fails to draw a few shapes that look meaningful?

The problem has two halves, and the second half is harder than the first. The first half: how do you let group structure reveal itself when nobody can define the groups in advance. The second half: how do you know that what just revealed itself is structure, and not a mirage. This kind of problem has a name, community detection^°, and this chapter teaches both halves. It will also say something few books say plainly: someone who learns only the first half is more dangerous than someone who learned nothing.

Similar and Belonging Together

The standard reflex in front of a grouping problem, and it deserves to be described with respect, is clustering on attributes. Roll each customer up into a vector, age, spend, purchase frequency, then let an algorithm gather the nearby points, k-means and its relatives. This is the tool taught in every machine learning course, available in every library, fast to run, and for plenty of problems it is exactly the right answer. People who choose it are not doing it wrong.

It is enough when the attributes genuinely carry the group information. Segmenting by spend for pricing, by region for logistics, by frequency to catch customers about to churn, these splits stand firm because their business questions really are questions about attributes. It is also enough when the groups have to be explainable through existing columns to an operations team, or when relationship data simply does not exist. This part deserves to be said plainly, because what follows is not a strawman beating.

The break lives in an example you can draw in your head. Two customers nearly identical on every column, same age, same city, same spend, but one lives in the world of fishing tackle and the other in the world of camera accessories, two product universes that never intersect. And in reverse, two people different on every demographic column, a student and a retired engineer, buying their way through the same universe of climbing gear. Similarity is a relation between the attributes of each point. Belonging together is a relation inside the connection structure between points. Two coordinate systems that differ in kind, not in resolution, and no amount of feature engineering converts one into the other.

	Splitting by similarity	Splitting by connection
Input	Attributes of each element	Relationships between elements
Where groups come from	Defined in advance, or distance in attribute space	Emerge from structure
Question answered	Who resembles whom	Who belongs with whom
Typical tools	k-means and relatives	Community detection

What matters more than the tooling is that the trap sits at the question-asking layer. K-means usually gets chosen not because someone laid the two questions side by side and weighed them, but because attribute data is already sitting in tables while relationship data has to be built. The question gets decided by the shape of the data at hand, instead of by what the business actually wants to know. I have watched plenty of segmentations defended passionately in meeting rooms whose deepest reason for existing was that they were computable from the columns already there.

a pause

For the data you manage, if you split it once by similarity and once by connection, how much would you guess the two maps overlap?

Think about it for a moment.

So the right move is not to throw k-means away, but to separate the two questions and know which one you are asking. When the right question is who belongs with whom in the network, you need a definition of "belonging" tight enough to compute. And this is where the problem gets interesting, because it turns out a tight definition still does not release you from the duty of doubt.

Dense Inside, Sparse Outside

Traditional classification runs from definition to members: define the groups first, assign elements after. Community detection runs in reverse, letting the connection structure reveal the groups, and then humans go looking for the meaning of what was revealed. Its power and its trap are the same thing. Because there is no prior definition, nothing constrains the algorithm to answer that there are no groups here. It always returns a partition. Always. So the question for anyone using it is not where the clusters are, but how much this partition deserves to be trusted.

“
A cluster-finding machine never answers that there is nothing here. That answer is your job.
”

To see why, look at how the phrase "belonging together" gets tightened. The most widely used definition is called modularity^°, and its intuition fits in one comparison: a good partition is one where the edges inside the clusters are markedly denser than what we would expect if the same number of edges were scattered at random among the same nodes. Both sides of the comparison matter. Dense inside, sparse between clusters, that is the memorable side. But the compared-to-random side is what keeps the definition from being meaningless, because without comparing to random, every crowd looks like a cluster, including crowds produced by rolling dice. Modularity is a number that scores one specific partition, and the problem becomes finding a partition with a high score.

Finding it how, when the number of ways to partition a million-node graph explodes past any^° hope of trying them all? The most pragmatic answer, and the most popular, is greed, one step at a time. The algorithm named Louvain works like this: each node in turn tries moving into a neighbor's cluster and stays wherever the shared modularity score rises most, this repeats until no node wants to move anymore, then each cluster is compressed into one super-node and the game restarts a level up. The astonishing part is the cost: next to the most expensive measure of the previous chapter, the one that had to compute paths between every pair of nodes, Louvain is almost unbelievably cheap, running on graphs of hundreds of millions of edges on a single machine. That cheapness is why it is everywhere, and its being everywhere is why its pitfalls became the whole industry's pitfalls.

There are two cracks you must know before using any of this. The first is named the resolution limit: plain modularity has a "preferred" cluster size, and that size depends on the size of the entire graph, meaning real small clusters can get swallowed into large ones simply because the graph is big. Modern implementations therefore give you a resolution parameter, a dial that adjusts cluster size. Pause on that sentence for a beat, because it says something more important than its technical surface: the number of clusters is not a truth of the data; it is partly a choice made by the asker. The second crack belongs to Louvain itself: the algorithm can return clusters that are not internally connected, a cluster made of two pieces with no edge between them, a real structural defect, documented. That is why Leiden exists, same intuition, plus a refinement step that guarantees every cluster is connected, and it runs faster. The practical advice fits in one sentence: use Leiden for new work, and understand Louvain so you can read older systems and know why Leiden exists.

That is the end of the mechanics, and if the chapter stopped here it would have taught you exactly the dangerous half. Because everything just described, modularity, Louvain, Leiden, changes nothing about the original fact: the algorithm always returns a partition. What turns mechanics into a skill is the skepticism loop that runs after every answer, three steps, none of them skippable.

Step one, treat resolution as an experiment rather than a single run. Sweep a few levels and watch how the partition changes as you turn the dial. There will be ranges where every notch produces a different world, and ranges where the structure holds still across several notches in a row. The still ranges deserve more trust than the dancing ones, because real structure is stubborn, while the grain of noise changes shape with every way of looking.

Step two, measure stability. Algorithms in this family have a random element, the order nodes get visited for one, so run ten times with different seeds and ask a simple question: how many pairs of nodes still end up under the same roof? The core that holds across ten runs is structure. The fringe that hops between clusters is a gray zone, and the gray zone must be declared as a gray zone. You met this reflex in the previous chapter, with ranking numbers computed by approximation; now it applies to partitions.

a pause

If you rerun ten times and the membership of the largest cluster changes by a quarter, what do you write on the report slide?

Think about it for a moment.

Step three, naming. Take the clusters that survived the first two steps, pull someone who knows the domain^° into the room, and ask: who is this group? A cluster that can be named, the fishing-and-camping crowd, the three gym cultures, is a finding. A cluster that nobody can name after an honest effort means either the resolution is wrong, or the edges the graph was built from are wrong, or it really is the grain of noise. Naming is not a decorative step after the analysis is done. It is the final test of the analysis.

Back to our analyst. She now has a process instead of a beautiful picture: a resolution sweep, ten runs, and an afternoon sitting with the marketing team to name each cluster. The fishing-and-camping cluster survives all three steps and receives a name. A few small clusters that shimmered in the first drawing dissolve into noise at step two. Community detection is not a truth-detecting machine; it is a hypothesis-proposing machine, and those three steps are how one treats a hypothesis. The two real systems below show what that looks like at scale: one where the emergent hypothesis became the backbone of a product, and one where the emergent hypothesis is handed to humans for judgment.

The Musical Galaxies

A music recommendation system needs to know which artists are close to which, and the oldest available division is genre. Genre is a label assigned by the industry, inherited from the era when record stores needed to know which shelf an album went on, rock on one shelf, jazz on another, world music on the shelf for everything left over. Seen through the eyes of the previous section, genre is precisely an attribute segmentation: someone defined the groups first, then assigned artists into them.

The streaming platforms, of which Spotify is the most discussed representative, changed the coordinate system. They build an artist graph from listening behavior: two artists are connected when they get listened to in the same sessions, when they sit together in playlists users built themselves. This edge does not ask what the music is. It only records what the music is listened to with. And choosing this edge rather than another is a modeling decision in exactly chapter three's sense, because the clusters will emerge from precisely the relationship that was chosen, and no other.

What reveals itself on that graph looks like an astronomical map, listening galaxies, dense here and sparse there. Many clusters match familiar genres, which is no surprise. The part that earns the money is the part that deviates. There are listening regions that cut across labels the industry considers far apart, where listeners keep mixing them into the same playlists as if the boundary had never existed. And there are official genres that, seen from behavior, split into several listening communities that barely touch, the three-gym-cultures motif from the start of this chapter, this time at the scale of hundreds of millions of listeners. The real structure of listening differs from the structure the music industry believes about itself, and that deviation is not a data error. It is the value.

But for that structure to become a product, it has to pass through exactly the third step of the skepticism loop. An emergent cluster of artists is just a heap of points until it gets interpreted, and that work is done by people: naming the listening regions, thousands of names, some matching old genres, some invented from scratch because that listening community had never been named by the industry. Features in the family of Discover Weekly stand on that community structure to do what the old genre table never could: recommend something inside your world that you have not met yet, because "your world" is now drawn from the behavior of people whose listening paths resemble yours, not from the shelf an album was filed on.

Told fairly, the story includes its limits. Listening structure drifts over time, this year's communities are not next year's, so the map has to be rebuilt continuously; a partition is a snapshot, not a permanent portrait. And the clusters reflect only the behavior of people listening on the platform, not some universal musical truth. An emergent cluster is always a cluster of exactly the graph that was built. This case wins not because the algorithm is more refined than anyone else's, but because both the skepticism loop and the naming step were done seriously, at industrial scale.

The Communities That Are Too Tidy

Now turn to the other face of the same skill. An anti-money-laundering investigation team at a bank looks out over millions of accounts, and each week they have the capacity to open a few hundred case files. The criminals, meanwhile, have finished learning the lesson of the previous generation of rules: keep every transaction looking clean, every account looking ordinary, no single number crossing any threshold. This story, if you have been following from the start of the book, has been escalating: from tracing the network around one account already known to be bad, to scoring each account's risk by its position, and now to an entirely new question, are there groups of accounts behaving like a single organism?

The central intuition of this problem deserves to be stated carefully. Real human communities are messy. A real person's relationships spray in every direction, your friends' friends do not all know each other, money and contact leak out of the group constantly, people buy things, pay bills, send money to people who have nothing to do with one another. A cluster of accounts that transact almost exclusively with each other, closed, tidy, account ages all roughly equal, with few loose threads into the outside world, is a shape that real life rarely produces on its own. On the community map of the whole network, that excessive tidiness stands out as a region of anomalous density against the background. What is suspicious is not that the cluster exists, every human network is full of clusters, but that the cluster is tidier than the way people live.

The intuition generalizes beyond one domain. In insurance, there are clusters of claims that share the same few doctors, the same garages, the same law offices; no single file is wrong on its own, every file reads as legitimate, but a hundred files winding around exactly that handful of entities form a community dense beyond what natural probability explains. The shared structure of the whole cluster is the anomaly, not any one member of it.

And here operational discipline becomes the heaviest part of the case. In systems of this kind, the graph analytics platforms that large banks deploy publicly, community detection is a candidate-generating filter. Its output is a list of clusters worth looking at, with reasons in human language, internal density, closedness against the background, the entities shared. That list is the input to an investigator, not the input to an account-closing button. The reason lives in the price of being wrong. An emergent cluster can perfectly well be a three-generation family sharing an address, a hometown association wiring money back and forth, a dormitory sharing one wifi line. Guilt-by-association was already a risk to manage when scoring single accounts; for an entire group it is heavier, because a cluster shut down wrongly is dozens of people losing access to their finances at once. The naming step of the skepticism loop carries legal weight in this domain: the final name of the cluster is the conclusion of an investigation, and only a human gets to write it.

Mature systems in this space measure success by candidate quality, what fraction of the files the machine proposes lead to an investigation worth having, not by the number of clusters found. Because the algorithm always finds clusters. The precious thing is the process that turns a partition into a file worth opening. And the investigators in this story know one thing this chapter has not touched: alongside the anomalous clusters that surface on their own, there are cases whose specific shape they know in advance, know well enough to describe in words. That is a different kind of problem.

Three Ways to Wrongly Trust a Cluster

Every failure mode of this chapter is one step of the skepticism loop skipped, and since the loop has three steps, there are three familiar ways to fall.

The first way: trusting a single run. It looks like this: the company dashboard says customers belong to 14 behavioral segments, the number 14 lives in slides, enters the OKRs, gets quoted in the quarterly report, and nobody knows that next week's run returns 11, the one after returns 16, with a few thousand customers quietly changing homes every time the seed changes. The reason lives in the algorithm's nature: the node visiting order has a random element, and the optimization surface has many peaks of nearly equal height, many partitions with nearly equal scores all legitimately "correct", so a single run is one sample drawn from a distribution of partitions, not the unique truth. What makes this failure more dangerous than its cousin in the previous chapter is that clusters have boundaries, and a drawn boundary looks like a national border, far more solid than its nature. The catch: a reporting convention, no partition leaves the team without its stability index attached, run n times, what fraction of node pairs stayed under the same roof, where the boundary zone is. The stable core may go on the dashboard. The fuzzy part must wear the fuzzy label.

The second way: turning the resolution dial wrong, or worse, not knowing the dial exists. It looks like this: either the partition comes back as three giant clusters, northern customers, southern customers, everyone else, correct in a completely useless way, or it comes back as five thousand fragments where the largest cluster has forty members and the marketing team cannot possibly tailor five thousand messages. Both extremes get presented as "the algorithm's result", when they are one notch of a dial whose library default someone already chose for you. The reason is the resolution limit from section three, not a theoretical footnote but real behavior: the number of clusters is a function of a parameter, not a constant of the data, it is just that the output never prints that line. The catch has two sides. The technical side: never run one level, sweep the resolution and look for the range where structure stands still. The business side, easier to forget: cluster size must match action size, if the team can tailor for fifteen segments then a partition into three or into five thousand is wrong for this question, no matter how pretty the modularity score.

The third way: skipping the naming step. It looks like this: a beautiful notebook, cleanly separated clusters, high scores, a shimmering visualization praised in the demo, and six months later not one decision in the company uses that partition, because "cluster 7" tells nobody what to do about cluster 7. The reason: a cluster is a hypothesis, a name is a conclusion, and skipping the naming step is stopping the analysis halfway while looking finished because there is a picture as evidence. And one layer deeper, a cluster that cannot be named is usually a symptom of the wrong edges, structure that emerges from a meaningless relationship is itself meaningless, the error sits in the modeling decision made before the algorithm ever ran. The catch: put the naming session into the definition of done, every core cluster must receive a human name and an answer to the question, what do we do differently for this group. A cluster that stays nameless after honest effort gets recorded as nameless and excluded from decisions. Namelessness is also a result, as long as it is declared.

An Exploration Tool, Not a Confirmation Tool

There is one question every chapter of this part has to answer: when not to use it. For community detection the answer has its own sharp taste, because the advice is to not use this chapter's own technique when the business already has a clear group definition and that definition serves the right question. VIP customers by revenue, business accounts by contract type, paying users by plan, these splits exist for good operational reasons. Letting an algorithm reinvent them, usually worse, an emergent partition that reproduces eight parts of the existing split plus two parts noise, is a step backward wearing the uniform of data science.

Community detection was built for exactly one situation: you suspect there is structure and you do not yet know what it is. It is an exploration tool, not a confirmation tool. Using it to "validate" a split you already believe in is wrong in both directions at once, because the algorithm will always find something, and a person who already believes will always read out what they believe. The mature usage I keep encountering is hybrid: the business split serves as the frame for daily action, while the emergent split runs periodically as a drift detector, and when behavioral clusters start cutting across the official segments, that is the signal that the official map is going stale, each side doing exactly its own job.

Return to the marketing room from the start of the chapter one last time. The new segmentation is live, the clusters have names, this quarter's campaign is tailored to worlds of behavior instead of to declarations, and the fishing-and-camping cluster receives the first message ever addressed to it. Throughout this chapter, structure was something waited for, left to surface on its own. We did not know in advance what we were looking for, and the entire method, from the comparison against randomness to the three-step skepticism loop, was built around that condition. But there are people who do not need to wait. The investigators of the previous section know exactly what they are looking for; they have seen it hundreds of times, can describe it in words, can draw it on paper, a concrete shape, known in advance, only its location unknown^° somewhere in half a billion transactions. The question of whether there is any structure here, and the question of where this shape is, are two different kinds of question. The second kind, no tool in this chapter can touch.

■

One sentence to remember

The algorithm always finds clusters. Your job is to know when a cluster is just an echo of the noise.

from Community Detection: When Clusters Emerge on Their Own

the author’s choice · not an algorithm

Tiếp tục trong Graph Theory Ngoài Đời Thật

← Chương trước

Chương VI

Centrality, ai quan trọng trong mạng lưới

Chương tiếp →

Chương VIII

Pattern matching, đi tìm hình thù của gian lận

or xem toàn bộ tập

A quiet word

What question does this leave you with?

Not published. Never shown to other readers.

In this neighborhood

hand-picked by the author

Opens the door

Centrality: Who Matters in the NetworkTech · 26 min

“Chapter six ranked individual nodes and closed on the image of groups of nodes clinging to each other in a way no ranking describes. This chapter opens exactly into that gap, and inherits the stability reflex chapter six had just planted.”

→

The same thread

Nodes and Edges Are Design DecisionsTech · 24 min

“Chapter three taught that a model is a decision. This chapter shows that clusters emerge from exactly the edges you chose, and a cluster nobody can name is usually a symptom of the wrong edges.”

→

If you’d rather wander, the full archive is here.

← Centrality, ai quan trọng trong mạng lướiKếtin →Community detection, khi cụm tự nổi lên từ dữ liệu →