Episode 5: Communities in Networks - Network Law No. 2
Updated: Mar 2
What's the rumpus :) I'm Asaf Shapira and this is NETfrix.
In the last episode we learned that since the 21st century, network science acknowledged that networks are not random, and they follow universal laws. And if the Power Law is the network's rule No. 1, then rule No. 2 is: Nodes in networks congregate to communities. What does it mean?
A closer look at the internal structure of a network shows that a network is made up of dense clusters that are highly inter-connected. These clusters are called communities, if you are a network scientist, or clusters, if you are more into computer science.
Understanding communities will allow us to analyze - easily and efficiently – networks of any size and even discover what we did not know we did not know (so-called, unknown unknown) and most importantly, to tell an interesting story about our data. In this episode we'll show how to do it, and what it has to do with cauliflower. So, let's begin.
So, why do clusters or communities form in networks?
Say we're going to a party with a friend. We walk in the door, look around and recognize no one. When suddenly a friend from the other side of the room gives us the hello sign. We join our gang that we have known for a long time from joined service in the army. Looking around the room we see more groups: guys that work together on some kind of a startup, guys who know each other from college, a group of girls who just arrived and a bunch of guys I do not know or understand what they are doing here. We naturally formed clusters or groups that each have a common trait: common workplace or college, common gender, or just common weirdness.
These clusters were formed according to the logic of homophilia (from Latin "preferring the same" or just plain "similarity"). Similar nodes better connect to each other and tend to have more interactions between them.
In an example from a previous episode, we analyzed the sociograms or drawings of Moreno’s social networks.
We'll recall that Moreno, as a social psychologist, researched the social relationships of pupils within a classroom. Moreno noticed that in third and fourth grade, the network had split into 2 distinct communities with a single connection between them. The common trait, or homophily that defined each community was that of gender. By the way, the study was done in the 1930s in Brooklyn. When I presented it to teachers in 2019 Israel, they noted that the clustering of pupils into gender-based communities happens in much earlier grades these days than those depicted by Moreno.
In human networks, there are many kinds of homophilic traits to which the network cluster by, such as: language, religion, ethnicity, common opinions and more. But the most common homophilic trait is probably geography, or more accurately, geographical proximity.
If we think about it for a moment, then the traits of language, religion and ethnic origin are often very related and even dependent on geographical proximity.
Even shared opinions are often a matter of geography. For example, it can be assumed that the hard core of Israel's supporters is, well... in Israel.
The same is true of the "Small World" example from a previous episode. Just as a reminder, "the Small World" experiment was conducted by Milgram, who was a well-known social psychologist in the 1960s. In this experiment. Milgram set out to find how many mutual friends would a chain letter have to go through from a random source to a random destination. The chains of letters whose source and destination shared geographical proximity, were shorter and faster, indicating that geographical proximity makes for better connectivity.
But what about today's on-line networks? Technological developments should have made us indifferent to distance. After all, theoretically, I could offer friendship on Facebook to someone who lives next door to me just as easily as to anyone on the other side of the globe. In practice, we'll find that the chances to become friends with someone whom you share the same space, are much higher. And check your friends list at Facebook just to make sure.
We can certainly be friends with people abroad that are not physically close to us. For example, through working relationships or rooting for the same sports team (Green Bay Packers, anyone?). Our homophilia can be based on working on the same project or shared interests, but these ties are usually fewer and weaker than our close proximity ones.
For example, I estimate that if we partition Facebook to communities (in the network sense), we will find communities with strong geographical homophilia: the Israeli community, the community of France, Spain, etc.
But what about border regions between countries?
If geographical proximity is so significant, will we get border communities? The answer is probably not.
Visible boundaries, and invisible ones, limit geographical homophilia. For example, in the case of countries, it's probably because of language barriers. Which means that proximity is a necessary but not a sufficient condition, so we see nationalism still carries weight.
We can visually see the effect of borders in an interactive map that was published in a New York Times' article in 2018, showing where American Facebook users' friends
are located, across a map of the states. In-light of everything we talked about so far, it will come as no surprise that the vast majority of the user's friends are in high proximity, but the number of ties swoops down very steeply once the state's borders are crossed.
Another example in the context of borders is a network study of gangs in Los Angeles published by Radil in 2010, in which he showed that the highway in Pasadena serves both as a physical and a virtual network border between the gangs.
Personally, I recall an example of virtual organizational borders from my service, when initial attempts were made to produce integrated teams from different units. The idea originated from the reasonable assumption about synergy, meaning that combining several disciplines together, makes for a better output.
Some of the early participants of those synergy-teams expressed frustration that despite the joint sit down, there were still barriers and lack of cooperation between the team partners. Often the participants did not see themselves as part of the team but as representatives of the organizational framework, or in free language, human sacrifices to the managers' attempt to show cooperation. Therefore, structural reform was also required alongside the physical and geographical regroup.
Network analysis of the organizational ties could have revealed it sooner, because it would have shown that the network clustered according to same-unit-homophily and not geographical homophily.
And yet another example for the homophilic traits of communities we can find in the academic field. For this, will use the database containing academic papers, and present is as a graph or network, where each paper (or node) is linked to the paper that cites it.
When using community detection on this network, we'll find that the communities are arranged according to the various fields of interest that concern the academy (a community of physicists, community of biologists, etc.).
That's because papers mostly cite papers that are related to their field, which is homophilia by practice. Papers in the field of chemistry will most often cite papers in chemistry. Sometimes they will also cite papers from the field of biology or computer science, but their strongest ties will be to their field of practice.
Apart of homophilic traits, communities also have another pretty amazing feature. They are fractal. What does it mean?
At first glance, the communities look like a collection of random clusters, but a deeper look into them will reveal that these clusters are composed of smaller clusters, which are also dense and interconnected and so on and so forth. This is the fractal structure of the network. And you can't say the word fractal and not mention cauliflower, because it's a great example for fractal structure.
I do not know how close your relationship to cauliflowers, but we are going to take it to the next level. From a distance, a cauliflower looks like a white lump but on closer inspection we can see its made up of several lumps and also these
lumps are made up of smaller lumps and so on but the cool thing is that each lump preserves the structure of the original cauliflower, at any resolution.
For the skeptics among us, I suggest going to the fridge and take a look in the vegetable drawer. Cauliflower - not the innocent looking vegetable we thought.
These lumps are called fractals and we will expand on them in the episode on dynamic networks. What is the connection between cauliflower and a dynamic network? Stay tuned to future episodes.
So, just as the cauliflower structure is preserved in every resolution, so too is the structure of the network, thanks to the communities, that is, the structure of the network meets Barabashi's definition of a structure that is Scale-free. A quick reminder: As we've learned in episode 3, Barabashi, a famous network scientist, discovered that
the distribution of the Power Law is maintained at any resolution in which we look at the network making it Scale-free. And the same goes for clusters or communities in a network. In each community the Power Law is kept: the community will have a few central nodes and most of the nodes will have few edges.
Even when we break down the communities to sub-communities and so on, the power law distribution within them will be maintained. On any network.
This feature of communities in a network will help us in our next use case which involves classification and labeling.
When we seek to classify or label big data, using communities' homophilic nature will allow us to label entire communities in the network.
And vice versa. In order to see what is the homophilic nature of the community, we don't need to identify or tag all the nodes in it. Why? In the Power Law we trust. Using the Degree centrality measure, which is simply the number of edges connected to a node, can point us to central nodes which in turn will best characterize what is the homophilic nature of the community.
For example, let's choose a community from the academic paper network we touched upon earlier. Now, let's look at the top 1% of the most cited papers in it. If those high-degree papers are from the field of physics, it will suffice to label the community as the "physics papers". Because of the Power Law, we don't have to classify each paper in the community.
The same if we come across a community on Facebook where the nodes with the highest centrality score in the community are from Boston, it means we probably reached the "Boston" community".
So, let's do a quick summing up:
A network is made of communities, which are dense clusters, meaning lots of edges between the nodes in the community. There's a reason for this density and that is the community's homophilic nature, meaning nodes that have similar traits tend to connect more. Usually, but not exclusively, this trait will be geographical proximity. The nodes within the communities conform to the Power Law, meaning there are a few central nodes and many nodes that have only a few connections. Through the central nodes, we can understand what's the community all about.
Another use case of community detection can be found in the field of text mining, specifically in an original study in 2012 funded by the US army. The army was in search of alternative ingredients to use in recipes, and so they called for SNA (Social Network Analysis). The study made a network out of about 46 thousand recipes, by linking together ingredients that appeared in the same recipe. The communities detected on this network were discovered to be based on the homophily of different flavors: salty, spicy and sweet (desserts). Using SNA algorithms, the researchers were also able to show which ingredient can be replaced with which ingredient because they shared the same role in the same community. So, this is how SNA allowed for more flexibility in the army's kitchen, if not elsewhere.
When surveying the different communities, the most interesting findings are of course the anomalies – the communities where the homophilic reasoning is different from the rest of the network. What does it mean?
Take Facebook as an example. As we mentioned before, if we use community detection on the entire Facebook network, we'll get geographical communities, meaning a community to each country. It will be very interesting to discover a central community where its homophily reasoning is different, say a community of cat lovers. This means that passion for cats is stronger in those people than their geographical identity (and I'm sorry I called cat-lovers "those people". Some of my best friends are people). In addition, the central position of the cat-lovers community in the network, will testify to the importance of that issue to enough people around the network.
Now let's imagine ourselves to be James Bond's arch-nemesis, fondling a cat while brewing over our dreams of world domination. In order to take over the world, we can go from one geographic community to another and take over its influencers or we can take a shortcut and enlist the help of our global feline-friendly community to spread our message to the world. Knowing cats, I'm sure they are in for it.
OK, so how can we detect communities in the network?
In order to detect them, we'll first need to define what we are looking for, only to find out that, in fact, there is no scientific definition of what a community is. Weird, ah?
We will differentiate for a second between the concept of "connected components", which we talked about in the episode about the small world, and between communities. Recall that connected components are the "islands" in the network and unlike communities, they have a clear definition or border: any node which is connected to a component is part of it. But if there's no edge between this node and other nodes in the component, than it's not part of it. The islands themselves can be made up of one community or several communities, but what defines the boundaries between the communities in the component? We will discuss this in more details later-on, but for the time being, we will content ourselves with the most common definition for a community in data science. The common definition is that the amount of edges within a community is greater than the amount of edges coming out of it. thus making the ties in the community denser on the inside than on the outside. In other words, there will be many strong ties within the community and mostly sparse and weak ties to nodes outside the community.
And you can't mention "weak ties" without mentioning Granovetter. Granovetter's paper from 1973: "The Power of Weak Ties", gained over 52,000 citations and is the most cited paper in the network field.
His paper is considered groundbreaking since before it, the field has dealt mainly with the strong ties of the individual, based on the intuitive assumption that strong ties are the important and influential ties.
In his paper, Granovetter warns us against this kind of intuitions and so he uses empirical data to examine the strength of the weak ties, by demonstrating how people find a new job.
Finding a job is a very significant event in the life of an individual and we might intuitively assume that the strong ties of the individual will play a significant role in this endeavor. An example for strong ties can be close friends, family and so on. What Granovetter has shown was that it is precisely the individual's weak ties that made it possible to find a new job. An example for such weak ties can be acquaintances from the past, such as from college or previous job. The reason he gave for this seeming paradox is the circulation of information.
We usually do not get new information from strong ties. There's a low chance that the people close to us will know of news we don't already possess. It is precisely the weak ties that allow new information to trickle in.
Granovetter's paper also has additional and lesser-known insights but ones that still hold water even half a century after its publication. Some of them will be dealt in future episodes. The one we'll mention that has bearing to our case is the paradox of strong ties. Granovetter diagnosed that strong ties produce a paradox at different network resolutions: in the micro-level, the strong ties create cohesion between the nodes but on the macro-level they actually contribute to the fragmentation of the network. He claimed that this paradox needs to be further proven and later in the episode we will see if he got it right.
As a side note: Unlike current papers in which authors explain how all the previous researchers were wrong and how their paper is significantly better, Granovetter in his paper often praises earlier researchers and other papers. Well, it was the seventies… Simpler times.. (:
At the same time when Granovetter's paper was published, another researcher called Wayne Zachary discovered the potential of community detection, in his famous study of a Karate Club.
Zachary mapped friendships at a karate club which included dozens of members. He did so for several years during the 1970s. And as it happens in many karate films in the 70's, this club also split due to a conflict between instructors or as we know it from the movies: the good dojo and the evil dojo.
What made Zachary's study stand out? Though Zachary is an anthropologist, he used mathematical tools to model his research. Zachary identified that there are intrigues in the club that are manifested in the flow of information within each group and at the same time withholding of information from the other group. Thus, to analyze the flow of information, Zachary used a method called max flow / min cut which is used to optimize the flow between 2 nodes in a network. To do so, one node is named "the source" and the other named "the sink" or "the target".
This method makes it possible to identify bottlenecks in the flow between two nodes. Zachary chose one instructor as the source and the other as the target because he realized that the conflict between the two instructors about the dojo's values encouraged dissemination of information within the support group of each instructor alongside denial of information from the other group. That is, Zachary assumed that the bottleneck in the network would be the boundary between the 2 groups, and so it was.
The algorithm partitioned Zachary's graph to two communities in an almost complete accordance to the actual split that happened in the club, i.e. gave an excellent result in relation to the ground-truth of the research.
The disadvantage of the min / max method for detecting communities in a network is that it places prior knowledge on the network. First, a source and a target for the flow must be defined in advance, and we will not always be able to point at 2 such nodes. In addition, the partition is limited to only two communities, which does not necessarily reflect the complex aspects of reality. Setting this aside, Zachary's breakthrough came from a data-driven approach, that is, letting data tell its story about the network, by using network algorithms.
This data-driven approach is the reason why this study is so iconic in Network science or SNA: The community detection algorithm made it possible to give a good prediction about the split, and which side would each member pick. Another achievement was the data-driven ability to explain the social reason for the split. By using the homophilic reasoning for the detected communities, and identifying the central nodes in them, we can establish that those nodes, i.e., the instructors, were the ones behind the split. Their high Degree score indicates that they rounded up followers in support of their view about the future of the club.
Because of the discovery's importance, Zachary's Karate Club is a very well-known concept in network science, but what is perhaps less known is that there is a club called the "Zachary's Karate Club Club" which awards a trophy to scientists in the network field. In order to win it, the scientists must be first to speak at a network conference and that the first slide in their presentation must be a graph depicting Zachary's Karate Club.
Though only a small country, Israel had it share in this prize thanks to Amir Rubin from Beer Sheva University.
Another winner of the trophy is Mark Newman, who together with his research partner, Girvan, gave wings to community detection research when they showed in early 21st century how to detect communities using a data-oriented method.
Until then, a common method was to partition the graph to communities by using hierarchical clustering or dendrograms.
The intuition behind that method was that networks have a hierarchical structure and thus so do communities.
Hierarchical structuring of communities starts with advancing the nodes along their strong ties (which is a bottom-up method) and these ties will define the community's boundary.
There were other methods that were used in the field of clustering and they too required early assumptions about the data, even if not as extreme as in the case of Zachary's min cut/max flow method. and here are some examples:
The first example is an algorithm called k-means, that requires prior knowledge of the number of communities we want to get from the network, which means that this method is more scenario-driven than data-driven, and perhaps a little supervised. The downside to using K-means is of course that the network can be partitioned to a different number of communities than we initially thought. Although K-means allows the partition to be changed based on observation of the result, this requires many manual iterations.
Another clustering algorithm is DBscan, a 1996 algorithm that its paper is one of the most cited papers regarding clustering algorithm, but DBscan also requires an early assumption about the data. In this case, the required assumption is: What's the minimal number of nodes that should be in a community.
There are also methods that assume a gaussian distribution of the data, but since we know our data is Power law, we will politely brush them aside and move on.
Another known method is a method called k-clique that comes from a family of algorithms that are based on rigid assumptions about what is the definition for a community.
As we've mentioned before, the definition of what a community is is a bit vague. But there are definite situations in which it can be said with certainty that a community is a community. For example, in the case of a clique.
A clique is a group of nodes in which each node is connected to all the other nodes.
For example, to be considered a clique, a community of 4 nodes must have 6 edges to link them, which is the maximum possible number to do so (see image for examples).
Because there's no room in a clique for additional edges, it is sometimes called a complete or completed graph.
It's not so unusual to find a three-nodes clique, but as the number of members increases, the chance of a community to become a clique decreases significantly on a non-linear scale. This means there might be many tiny cliques in a network but only a few large ones. Why? Power Law, bit...s!
The slim chance to find a big clique in a network was what arose the suspicion of the authors of an independent report, investigating twitter campaigns in 2019 elections in Israel. They suspected that one of the tweeting communities was not spontaneous, since it acted as a large clique and this, as we've mentioned, is a rare event. The investigation was called the "Big Bot Report" and in the public debate that followed, the authors were accused of claiming or implying that the active users in this community were bots where in fact behind the users were real activists.
We will not go into the debate here and links to the discussion are available here (Articles in Hebrew): - The authors' report
But what is undisputed by all is that a large clique indicates a significant coordination that can certainly be suspected of being activated by bots.
The fact that cliques are less common is also the reason why k-clique algorithm is not ideal. It will give us a very partial picture of the communities in the network since it will refer us only to the most extreme communities – the cliques (which sounds like a 1960's band). Another obstacle we'll face is the difficulty to understand the community's sub-structure because of the clique's density. Besides, k-clique takes a long time to compute and who has time to wait so long for a mediocre answer?
So, let's do a quick summary:
We've covered the evolution of clustering algorithms in the 20th century. Not to be an ageist, but those algorithms had their constraints, especially the need for early assumptions on the network, such as communities must be cliques of a certain size or assessing beforehand how many communities are supposed to be in the network. Another aspect of those algorithms is their emphasis on the community's intra-connections, or strong ties. And this is where Girvan and Newman come in and the algorithm that bears their name. The Girvan-Newman algorithm, from the early 21st century, makes no presuppositions about the community or network, neither in terms of the shape of the communities nor in terms of their number. It does this by flipping our perspective on what is a community:
If in all the examples we have given so far, the main preoccupation was to find the strong ties that build the community, the complimentary view would be to find the weak ties that form the boundary of the community.
Instead of tracking the community from the bottom up, through the strong connections, the Girvan-Newman algorithm first detects the boundaries of the community from the top down or from the outside inward.
This kind of logic resonance Granovetter's hypothesis from the beginning of the episode. The strong ties that contribute to the creation of a small community, cannot hold a large network and so the network disintegrates to communities whose weak ties represents the boundaries of its circulation. These are the boundaries that Girvan and Newman used.
To find them, they followed the same logic that led Zachary, who used bottlenecks to partition the network. Just instead of using flow analysis between two nodes, they identified the bottlenecks in the graph by calculating the betweenness of the edges in the graph. For those cheaters who skipped episode 4 about centrality measures, Betweenness is a metric that allows us to find nodes that bridge between different nodes in the network.
What Newman and Girvan did was to assign the score to the edges that connect the nodes. According to their method, removing the edges with the highest betweenness score, one-by-one, would gradually result in breaking down the network to connected components (or "islands") any of which would be considered as a community. On a side note, Girvan and Newman's paper is characterized not only by a straight-forward easy-to-follow style, but also by the modesty of its writers who were the first to point out the weaknesses of the method they proposed:
First all, the computation time for this method is long (because betweenness is a metric that takes a long time to compute).
Secondly, when examining the algorithm on dense networks, with many edges, the results obtained are problematic.
And here's another problem- when to stop removing edges? After all, theoretically it is possible to continue and remove the edges with the highest Betweenness in the network until we are left at the end with a network that consists only of nodes, each of which is a community on its own, with no connections at all.
The solution to this problem was brought by Newman himself, in 2004, and on the fly he also solved the other problems he presented in his previous paper.
To decide when to stop removing edges, Newman used the modularity metric.
Simply put, this measure is designed to examine the quality of community detection algorithms by giving a score to their partition of the network.
The score is calculated by comparing the density of the edges in the community obtained versus the density of the edges in a hypothetic random network, which has the same number of nodes.
The denser the community by comparison, the higher the score.
The great advantage of this metric was also that it computes very quickly.
If we now have a quick tool that gives scores for partition to communities, why not use it by its own as a partition algorithm instead of using it only as a measure for other algorithms?
And this is exactly what Newman did and this breakthrough led to a flood of algorithms in the field of community detection.
Perhaps the most famous algorithm that stemmed from the modularity family is the Louvain algorithm by Blondel, who published it in a paper from Louvain University in 2008.
So, how does Louvain Work:
Ironically, the Louvain algorithm uses the ol' bottom-up method of constructing communities, and it does so in several iterations:
First, it starts at a random node and maps its immediate surrounding nodes and computes the modularity score for these sub-graphs.
The scores can range on a scale from 1, which is a perfect community to -1, that is, a clear cut case that this is not a community: Nothing to see here – keep moving!
If the modularity score of the small community or sub-graph is good enough, the Louvain moves to another node and performs a similar process until convergence. The output of the first iteration will be numerous small communities. In the second iteration, Louvain aggregates the smaller communities to produce larger communities, in the same way as in the first iteration: if the edges between the communities preserves a good modularity score, it continues and so on in the next iteration, until convergence is achieved.
Louvain is also weighted, that is, the algorithm also considers the weight or strength of the edges. For example, if a node is connected to 2 communities, the Louvain will assign the node to the community to which its connections are stronger.
The final output is aggregated communities, meaning that they are a product of sub-communities that are connected to each other which in turn are themselves a product of sub-sub-communities and so on, up to the resolution of a single node or atom.
So, what's so great about the Louvain Algorithm:
First all, speed. It gets the job done very quickly, even on large networks, and is capable to be run on parallel computing platforms. The results obtained are considered relatively reliable and are widely used in network research. In addition, the modularity score that the algorithm gives to each community provides additional insights about the network and will cover this in another episode that will deal with best practices in network analysis.
Before I elaborate on the disadvantages of Louvain I am required to full discloser and that is I am sucker for Louvain.
The feature I like best about it is that it is unsupervised which means it has no presuppositions on the data other than the very definition of the community as a denser part of the network. In contrast to other algorithms that require the community to have rigid definitions, such as cliques, or that predetermine how many communities the network should be partitioned into, the Louvain's way sends a cooler and more liberal vibe.
Now, it seems that the main claim against Louvain is the so-called "resolution problem". Louvain's output usually looks like this: One huge community, and next to it a small number of large communities and beyond them a long tail of many tiny communities.
If this reminds anyone of a Power Law, it's not by chance. Communities are Power Law distributed.
So, where's the beef?
It depends on your approach. Some researchers would like to see the large communities disbanded in the first stage of the research into smaller communities, and we know that this is feasible because this is how Louvain works: it groups small communities into a large community.
Personally, I see it as a feature, not a bug. This way you can see which communities are more connected to other communities by being members of the same large community.
But when it comes to huge networks, which contain millions of nodes, there is a limit to the resolution of Louvain. After about 3 dives into the sub-communities, the Louvain "breaks down", that is, you can dive to the sub-communities of a large community, and to their sub-sub-communities but an attempt to see even smaller communities will probably result in seeing each node as a community in itself. This is because Louvain uses only 3 iterations or 3 levels of aggregation.
A possible solution to this is to run the Louvain at any level at which the network is viewed. But here we encounter another challenge and that is that every time we run Louvain we might get a different result for the communities detected. Usually a very negligible difference but a difference none the less.