Episode 4: "Game of Ties" – Centrality measures battling it out
Updated: Dec 10, 2021
What's the rumpus :)
I'm Asaf Shapira and this is NETfrix.
Who are the bottlenecks in my organization? Who should we target for an advertising campaign? How to tell a story about data? Are rhetorical questions an essential part of network analysis? Let's find out.
Like most of our world, the network also follows a Power Law distribution: there are a few central nodes, and most of the nodes are marginal. So, if there are only a few big hubs in the network, how shall we find these big needles in the haystack?
In order to find these nodes, we will need to use SNA (Social Network Analysis) algorithms known as "Centrality Measures". These metrics allow us to find the centers of gravity in networks that will help us "control" the network, understand what is going on, dismantle it if necessary, etc.
Let's start our quest with addressing a common misperception. Often in the search for key players, we intuitively look for the most active ones in the network. Although the activity of nodes is also distributed as a Power Law (there are few activists and the majority do little), the fact that something is active, does not necessarily make it central or influential.
Let’s say I call someone 100 times a day. Our relationship or edge weight is a 100. It definitely says that I am an active player and also a bit of a stalker, but it does not make me a hub. Conversely, if I call 20 people or have 20 people call me and even if I conduct only 2 calls with each of them, my relationship weight will only be 40, but I probably play a much more central role in the network than my creepy self.
It can be deduced from this example that the number of connections a node has is a significant aspect of centrality, and it does sound very intuitive. Therefore, it is not by chance that this is the most popular measure of centrality in the field of networks, and it is called: The Degree of a node. In all the examples given so far in previous episodes, this was the main metric we used.
Let's demonstrate this on a star-shaped network, that is, a network made up of a central node with all the other nodes connected only to it.
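The phone-call example above can be sketched in a few lines of Python. This is only an illustration: the node names ("stalker", "hub", "contact_0" and so on) and the call log itself are made up to mirror the numbers in the example.

```python
from collections import defaultdict

# Hypothetical call log: (caller, callee, number_of_calls).
# One node calls a single person 100 times; another has 20 contacts, 2 calls each.
calls = [("stalker", "victim", 100)]
calls += [("hub", f"contact_{i}", 2) for i in range(20)]

degree = defaultdict(int)    # number of distinct neighbors
strength = defaultdict(int)  # sum of edge weights (total activity)

# Each pair appears once in the log, so every record adds one neighbor to both ends.
for a, b, weight in calls:
    for node in (a, b):
        degree[node] += 1
        strength[node] += weight

for node in ("stalker", "hub"):
    print(node, "degree:", degree[node], "total weight:", strength[node])
# stalker degree: 1 total weight: 100
# hub degree: 20 total weight: 40
```

The more active node ends up with the higher total weight, but the node with many different contacts is the one with the higher Degree.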
In this network, the node in the middle has the highest Degree. All other nodes will receive a score of 1, the lowest, because they are connected to only one node, the central node.
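As a minimal sketch, here is the star-shaped network as a plain edge list in Python. The node names and the choice of five outer nodes are assumptions made just for the illustration.

```python
# A star network: one central node linked to five leaves.
leaves = [f"leaf_{i}" for i in range(5)]
edges = [("center", leaf) for leaf in leaves]

# Degree = number of edges touching each node.
degree = {}
for a, b in edges:
    degree[a] = degree.get(a, 0) + 1
    degree[b] = degree.get(b, 0) + 1

print(degree["center"])  # 5
print(degree["leaf_0"])  # 1
```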
In a directed network we can also measure the incoming links (In-Degree) and the outgoing links (Out-Degree).
If node X contacts 4 nodes and at the same time, 3 nodes contact it, then its out-Degree is 4 and its In-Degree is 3.
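The same count can be sketched for a directed network. The neighbor names "a" to "g" below are placeholders; only node X and its 4 outgoing and 3 incoming links come from the example above.

```python
# Directed edges as (source, target): X contacts 4 nodes, and 3 nodes contact X.
out_targets = ["a", "b", "c", "d"]
in_sources = ["e", "f", "g"]
edges = [("X", t) for t in out_targets] + [(s, "X") for s in in_sources]

out_degree, in_degree = {}, {}
for src, dst in edges:
    out_degree[src] = out_degree.get(src, 0) + 1  # links leaving src
    in_degree[dst] = in_degree.get(dst, 0) + 1    # links arriving at dst

print(out_degree["X"], in_degree["X"])  # 4 3
```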
The rationale behind this metric is that the node is central, if it's linked to many players.
In school it would be the popular kid, in a computer network it would be the main server, in an organization it is expected to be the head of the bureau, or the main office.
There are probably hundreds of algorithms for identifying "hubs" in the network, but they can be summarized into three main categories that stem from different perceptions of what it means to be central in a network.
The first category, which we have already touched on, is the number of connections of the node.
The second category is to what degree does the node constitute a bottleneck or bridge in the network.
To be considered a bottleneck in the network does not necessarily mean lots of links. It means that the node is located at the network in such a manner that we often have to go through it in order to travel from one part of the network to another.
The leading metric in this category is the Betweenness Centrality measure, meaning the node is located between two parts of the network and constitutes a bridge between them. The mathematical definition of this metric is the number of shortest routes in the network that pass through the node.
If we'll take our star-shaped network and add to it two more star-shaped networks, with only one node connecting the three, then this special node has a Degree of only 3, but it is the sole connector to the 3 networks so its Betweenness score would be high.
The Betweenness score will usually be normalized between 0 and 1. For example, a node's score of 0.5 means that about half of the shortest routes in the network must pass through it. Such a score means that this node literally divides the network in two and any movement from one side of the network to the other must pass through it. In large networks this is a rather rare phenomenon and we will probably encounter much lower scores.
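To make the three-stars example concrete, here is a sketch in plain Python. It assumes each star has one center and four leaves, and that the connecting node ("bridge", a made-up name) links to the three centers. The score is computed with Brandes' algorithm, a standard way to count shortest routes, and normalized by the number of node pairs that do not include the node itself; other normalizations exist.

```python
from collections import deque

def betweenness(adj):
    """Normalized betweenness for an unweighted, undirected graph (Brandes' algorithm)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # BFS from s, counting shortest paths (sigma) and recording predecessors.
        dist = {s: 0}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating each node's share of paths.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    n = len(adj)
    # Every unordered pair was counted twice (once from each end), so dividing by
    # (n-1)(n-2) is the same as halving and dividing by the (n-1)(n-2)/2 pairs.
    scale = (n - 1) * (n - 2)
    return {v: bc[v] / scale for v in adj}

adj = {}

def add_edge(a, b):
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

# Three stars (a center plus four leaves each), joined only through "bridge".
for i in range(3):
    center = f"center_{i}"
    add_edge("bridge", center)
    for j in range(4):
        add_edge(center, f"leaf_{i}_{j}")

scores = betweenness(adj)
print(len(adj["bridge"]), round(scores["bridge"], 3))
print(len(adj["center_0"]), round(scores["center_0"], 3))
```

In this toy network the bridge has a Degree of only 3, lower than each star's center, yet its Betweenness comes out at roughly 0.71 compared with roughly 0.48 for a center, because every route between two different stars has to pass through it.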
In organizational consulting, which uses ONA (Organization Network Analysis), these bridges (or bottlenecks) are marked as the filters through which new ideas flow to the organization or are sometimes blocked. This is the reason why in Organization Network Analysis, finding bottlenecks is important in order to find out where processes get stuck or may get stuck. Professor Rob Cross, who uses network analytics to understand organizations, used this metric to optimize creative thinking teams in the organization he was consulting to.
In order to create these teams, the organization he was working with chose key people in each department, i.e., the ones with the high Degree, and put them together to create synergy and produce new ideas. The disadvantage of this method was that these people were very busy with the affairs of their own department, hence their high Degree, and greatly defended or promoted the interests of their department. Members with high Betweenness, on the other hand, were exposed to more areas of the organization and were more open to promoting interdisciplinary ideas.
So, we are left with the third category with its own definition of centrality. If the first deals with the number of connections and the second deals with mediation between parts of the network, the third deals with location: to be located at the heart of the network is to be central.
The best-known metric in this category is the Closeness Centrality measure. That is, the node is central if it is closer to the other nodes.
Let's demonstrate it with a real-life example and examine a pupil that sits in the center of the class. Even if she or he does not have many friends and does not mediate between groups, the position in the center of the class allows the pupil to hear all the noise and gossip during the lesson. The pupil's location in the center allows information to trickle in his or her direction, thus making the pupil central in the classroom network.
What does it mean to be the closest to other nodes in a network? It does not refer to physical distance but to the number of nodes or steps that we must go through in order to reach our node.
Mathematically, the node with the highest Closeness is the one that has the lowest average distance from the other nodes.
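Here is a small sketch of that definition in Python, using a breadth-first search to count steps. The network is invented for the illustration: two busy hubs with four contacts each, and one quiet node ("middle") linked only to the two hubs. The sketch assumes a connected network.

```python
from collections import deque

def average_distance(adj, source):
    """Mean number of steps from `source` to every other node (BFS, connected graph)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return sum(dist.values()) / (len(dist) - 1)

adj = {}

def add_edge(a, b):
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

# Two hubs with four contacts each; "middle" is connected only to the two hubs.
for hub in ("hub_a", "hub_b"):
    add_edge("middle", hub)
    for i in range(4):
        add_edge(hub, f"{hub}_contact_{i}")

avg = {v: average_distance(adj, v) for v in adj}
closest = min(avg, key=avg.get)
print(closest, avg[closest], len(adj[closest]))  # middle 1.8 2
print("hub_a", avg["hub_a"], len(adj["hub_a"]))  # hub_a 1.9 5
```

The node in the middle has a Degree of only 2, yet its average distance to everyone else is the lowest, so it has the highest Closeness.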
A possible application for using Closeness Centrality in the field of intelligence, for example, is in source recruitment. The advantage of recruiting a source with a high Degree is clear: it has access to many places and people. The downside is that this target is probably highly visible and will be difficult to approach. Its position in the organization may also make it difficult for it to switch sides. On the other hand, someone who does not necessarily have a lot of connections but is located in the core of the organization may be able to get to places of interest with lower visibility. Instead of recruiting the CO or its head of chamber, we will try to recruit the assistant who sits in the office next door. This potential source might have a low Degree but probably high Closeness.
Sure, there are naïve or less experienced listeners thinking right now to themselves that the office cleaner should have been the obvious choice for recruitment, but as everyone knows, a good cleaner is hard to find.
Another example of the importance of Closeness can be found in the field of epidemiology, the study of diseases. In these Corona days, epidemiology takes center stage with an emphasis on studying the spread of the virus and mapping patients' contacts. Covid-19 sure made it easy for SNA to gain popularity once again.
When we map the contact networks through which the virus has spread, then the node with the highest Closeness will mark for us patient zero.
A node with a high Closeness score can also be the target of cyber-attacks that seek out a central node that is close to many parts of the network, which makes it easier to spread through it.
Ok, so it's time to summarize the three metrics, and we'll do so by applying them in a practical way, and in a cool way.
In the practical example, we will use Facebook's network and I'll just remind ourselves that it consists of approximately 2.5 billion active users:
Let’s say I have 500 friends on Facebook. This means that my Degree on Facebook is 500.
Suppose all my friends are in Israel. What does this say about my Betweenness score on the network?
Probably a low score, since I do not bridge between different areas in the network because all my links are concentrated in a specific place. But - if I had a friend in the US and a friend in Brazil and a friend in Japan and a friend in Africa, then my Betweenness score would probably jump higher, as I become a bridge between remote areas in the network.
So, what's my Closeness? I have no idea. To find out, I will have to calculate all the distances between nodes in Facebook's network. This algorithm takes a long time to compute on large networks, but I believe sometimes it's worth it because it's one of the more interesting ones.
I guess it's time for full disclosure: I'm a sucker for Closeness centrality. I don't need a reason for it but if I have to say then it is both because it is the least intuitive metric and also because it captures centrality on a network's scale, unlike the Degree centrality, for example, that is by nature, reflecting a more localized aspect of centrality.
For the cool example we will use the "Game of Thrones" series dataset. The links or edges in the network were created based on which character appears with which character in a scene. To avoid spoilers for the five people who have not yet watched it and plan to watch it sometime, we will settle for analyzing only the first season.
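A network like this can be built from a list of scenes, as in the rough sketch below. The scenes here are invented for the illustration and are not the real dataset, so the numbers say nothing about the actual series.

```python
from collections import Counter
from itertools import combinations

# Hypothetical scenes: each set holds the characters who appear together.
scenes = [
    {"Ned", "Robert", "Cersei"},
    {"Ned", "Catelyn"},
    {"Tyrion", "Jon"},
    {"Ned", "Robert"},
]

# Every pair of characters sharing a scene gets an edge; repeated pairs add weight.
edge_weight = Counter()
for scene in scenes:
    for a, b in combinations(sorted(scene), 2):
        edge_weight[(a, b)] += 1

# Degree counts distinct co-stars, no matter how many scenes they share.
degree = Counter()
for a, b in edge_weight:
    degree[a] += 1
    degree[b] += 1

print(edge_weight[("Ned", "Robert")])  # 2
print(degree["Ned"])                   # 3
```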
So, let's do a little quiz: Who in the first season of "Game of Thrones", leads the centralities' scoreboard?
The leading character in the Degree centrality in Game of Thrones' first season is...
Tyrion Lannister.
This endearing character appears in many scenes, but that is not enough. In order to get a high Degree score, he has to appear with a lot of different characters. In the first season, Tyrion travels throughout the kingdom and therefore makes a lot of connections.
And now, the leader in the Betweenness centrality is...
Varys the Counselor.
Recall that high Betweenness means that the character connects or bridges between different parts of the network and constitutes a bottleneck, in this case, of information.
Varys's nickname is "the spider" and the spy network he weaves extends to the furthest parts of the land, bridging kingdoms and continents.
And last, the leader in the Closeness centrality is... they... are......
Ned Stark. The ultimate Protagonist with a capital P.
And here lies the reason why I like Closeness centrality so much. If we were to ask the honest and naive Ned who is the most central figure in the kingdom, Honest Ned would probably answer "What the hell do you mean? The King, of course!". But what he does not understand is that he is at the heart of the plot, as Closeness will testify, at least until he encounters a serious Betweenness problem towards the end, which we'll not elaborate on.
Well, so how many did you get right?
As you can see, the centrality measures stimulate us to tell a story about the data by making us explain to ourselves why the central nodes are central. The different logic behind different centralities helps us to fit the story to the data.
There are many more metrics and new ones are added from time to time, so let's just address in short, two more common metrics:
As a use case, let’s just say that making our small website point to a major or central page won't make ours a central website.
PageRank is named after Larry Page (pun intended), one of the founders of Google, and it was used in the 1990s to rank pages.
In simple terms, the idea behind PageRank is that every page that points or directs to another page, lends it a score. At the beginning of the analysis, each page or node has the same score, 1 divided by N, N being the number of nodes in the network.
The algorithm performs several iterations and in each of them nodes give part of their score to the nodes they are pointing to.
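The iterations described above can be sketched roughly as follows. This is a simplified textbook-style version, not Google's actual implementation: the damping factor of 0.85, the number of iterations, and the tiny example web (including the page names) are all assumptions made for the illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank. `links` maps each page to the list of pages it points to."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}  # every page starts with a score of 1/N
    for _ in range(iterations):
        # Assumed damping step: a small base score is shared equally by all pages.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = links.get(v, [])
            if targets:
                # A page splits the rest of its score among the pages it points to.
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outgoing links spreads its score over all pages.
                share = damping * rank[v] / n
                for t in nodes:
                    new_rank[t] += share
        rank = new_rank
    return rank

# Hypothetical mini-web: four pages point to "major", which points back only to "a".
links = {
    "small_site": ["major"],
    "a": ["major"],
    "b": ["major"],
    "c": ["major"],
    "major": ["a"],
}
rank = pagerank(links)
for page in sorted(rank, key=rank.get, reverse=True):
    print(page, round(rank[page], 3))
```

In this toy web, "small_site" points to the major page but nobody points back to it, so it stays at the bottom of the ranking: linking to a central page does not make a page central.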
This algorithm has undergone changes and updates over the years to make it especially suitable for finding central pages on the Internet.
One significant change was increasing the score of a page when pointed to by a Seed page. Seeds are pages that serve as clear indicators of quality, such as sites with a GOV extension, or university sites that can help rank other nodes.
The PageRank algorithm stemmed from the first category of centralities we talked about, which defines centrality as the number of links that the node has, or in the case of PageRank, the number of links that the node's neighbors have.
The common feature that all the centrality measures share is that they are distributed as a Power Law. In each metric there will be a few who will get a high score, or rank, and the majority will get very low scores. How does this fact help us?
Usually when we think of a central figure in an organization, what pops into mind, obviously, is the head of the organization. The head is the one that makes the important decisions and leads the organization. But what about the deputy? Some might say that the deputy is the one who really does the heavy lifting, is in touch with the employees, etc. All the above insights are a result of a scenario-driven perspective, not data. These are insights that we derive from our past experiences, from what we have been taught and from our intuitions. We know that a manager is the one who makes decisions and the deputy manager is the one who is in charge of daily activities. Still, have we not encountered in the past a manager who can't manage? Is the deputy manager always so dominant? And when they are on sick leave or just on a holiday, what happens then? To see what's really going on, we'll need a data-driven approach, because I guess no one will say that the central figure in an organization is the safety officer. Right?
But this was exactly the case in a network analysis made by Barabási, the famous network researcher we mentioned in the previous episode. Barabási performed an analysis of a factory whose management wanted to understand why their messages did not trickle down to the workers on the assembly line. Instead of using the organigram of the organization, Barabási analyzed the network of contacts of all the employees to understand where they get their information from. What he discovered was that the most central figure did not sit in management; rather, it was the safety officer.
It turned out that this guy walked around a lot in the factory and was very sociable, so he made lots of connections and so became a useful tool in disseminating information.
Contrary to the initial intuition of organizations to fire someone who is more central than the manager, Barabási actually offered to invite this guy to the head office for a cup of coffee, tell him what the management was planning, and let him convey the message.
Centrality measures also help us tell a story about data. Data storytelling is a major challenge for the data analyst. Anyone can look at the data. But to draw from it conclusions and insights is the real challenge.
As a former Intelligence officer, I was required to tell a story about the enemy. But what to do when no one tells me what to tell?
In that context, a very senior officer once gave me the following advice, roughly translated to: "Don't panic. The truth is only second to confidence." (Sounds better in Hebrew), which is the equivalent in English to "When in doubt – shout".
A more data-oriented tip I got was to use centrality measures. Here's an example of such a story derived from centralities:
Suppose that in the network of an organization, which is made up of several divisions, the most central nodes are labeled as logistics. What story can we tell about the organization?
We'll need to ask ourselves, what makes someone central?
Using Degree Centrality, we can say, for example, that it's because many turn to it.
And what does it mean if a lot of people turn to logistics?
Here we can tell plausible stories that revolve around the idea that logistics is a center of gravity, as of the time of the analysis:
For example, that the organization is very dependent on logistics or that logistics is a bottleneck in the organization, or that they encountered a logistical problem or just maybe that there is a plan in the making for a surprise party for the VP of logistics.
All these scenarios are plausible, except one. No one ever organizes surprise parties for the VP of logistics.
But what is beautiful about the data-oriented method is that it is very easy to test our hypothesis. No need to go through all the data or test the whole network. It's enough to verify the hypothesis by qualitative research of just the few that lead the Power Law. Because of the nature of this distribution, they are the few that tell the network's story and through them we can test to see if we got our story right.
To sum this up, Yuval Noah Harari, in his book "Sapiens: A Brief History of Humankind", explains that in order to sustain human society, human beings were organized on the basis of imaginary ideas. An organigram, for example, is an imaginary idea. It is used to decide who pays who and how much but it does not necessarily describe how the organization really works and there is nothing in it that will tell us what is happening in the organization at the present time.
To this end, we have the centrality measures. Contrary to organigrams, they are not fixed and can vary according to the occurrences. For example, given that the logistical bottleneck we described earlier has been resolved, the organization's center of gravity may shift to the organization's management, when planning a new strategy or to another part of the organization that's just turned out as a new bottleneck.
Pheewww… Hope I didn't panic, shout or lack confidence. Now let's dive deeper into the subject at hand because it cannot be that simple:
To tell the network's tale, it is not enough to check which nodes have the highest score, since many times our data is noisy. For example, in many networks there are nodes that are not "players" in the network but exist there for technical or other reasons which might not be relevant for our analysis. These nodes have a tendency to create fictitious centers of gravity or relationships between actual players. In an email network, this could be, for example, a spam email or an error email sent from the server. Sometimes these phenomena might be of interest to our analysis but many times they aren't.
And sometimes, even if the major node in the network is a real player in the network, it will not necessarily be our focal point for the purpose of the study. It depends on the context, and as an example we'll use an American study of the "Arab Spring" revolution in 2011.
Quoting foreign sources, there is a widespread use of SNA by official American bodies, such as the military and the NSA, the National Security Agency.
During the campaigns in Iraq and Afghanistan, SNA even became part of the American military doctrine titled "Countering Threat Networks" and we will expand on this issue in the episode on intelligence and the network.
And so, in 2011, an American study was conducted on the Twitter network in Egypt to identify the leading factors in the "Arab Spring" revolution. Twitter and Facebook were the leading social networks in Egypt that allowed the masses to organize and coordinate demonstrations.
The Americans assumed that finding the centers of gravity on the network would make it possible to find out who was behind the events and leading them.
Much to their surprise, a close examination of the node with the highest Degree revealed that it was...
Justin Bieber.
No disrespect, the guy's doing his thing and that's ok, but why Justin, and what's his connection to the revolution?
This is because celebrities can have tens of millions of followers that connect to them and so they will almost always overshadow the rest of the network. It is enough for a celebrity to tweet using a current hashtag and they will easily take first place in the Degree score.
However, further research of the Egyptian network has shown that despite Justin Bieber's network centrality, the "echoes" to the content he tweeted were weaker than the "echoes" to the actual revolutionary leaders' messages. How can one see it?
As we have mentioned earlier, a high score in the Degree measure alone is not enough (especially in large networks, due to the Degree's local nature) and it is necessary to create context for it as well.
If the no. 1 law of the network is that the network is distributed as a Power Law, then in this case we will be required to use the no. 2 law of the network: networks congregate into communities. That is, the network consists of clusters, each of which has its own reason to congregate and its own centers of gravity. Understanding which community is relevant for our analysis will help us find the relevant center of gravity, but this will be covered in the next episode, dedicated to communities in the network, where we will also crack Justin Bieber's mystery affair with Tahrir Square, where the masses in Egypt gathered during the revolution.
So, in the meantime, a few might say: What's the problem? Let's just ignore a node that has a high out-Degree, meaning that all its edges are outgoing links. Intuition has it that such a node should be considered a "network spammer". Right?
First of all, it depends on the context of our study of the network. There will be use cases where such nodes will serve us well, for example when we want to spread in the network. Also, on a social level, maybe this node has an important role in disseminating information? For example, in the field of advertising.
In this field, celebs are widely used on social networks, trying to gain from their great popularity and exposure as a result of the multitude of edges connected to them (i.e. followers).
Large sums change hands for a network hub to post a product on social media.
To find such hubs, also known as influencers, companies can use centrality measures, and some do.
So, let's analyze such a case from 2019, in which an Israeli fashion company, Castro, launched a major campaign for designer glasses in the United States using the mega-influencer Kim Kardashian.