Episode 11: Anyone can analyze networks
What's the rumpus 😊 I'm Asaf Shapira and this is NETfrix - the Network Science Podcast.
In a recent post in a data analysis group on Facebook, someone posed the following problem: He wanted to know which product of his company is usually bought with which product. A brief inquiry revealed that he does not program, but far worse than that, he wasn't aware of episode 8 on NETfrix in which this exact issue was discussed. So, I was able to solve at least one of his problems, but what about programming? Can only programmers analyze networks? In this context, I am reminded of Pixar's movie "Ratatouille":
In this movie, Remy the Rat becomes a chef and so conveys the message that anyone can cook. However, this message downplays some of the plot like the fact that Remy was almost killed several times, forced to steal food, and had to deal with hostility from family and friends.
So, unlike being a chef, network analysis doesn't usually involve near-death experiences and for those who want to try it, this episode aims to make life much easier. So, what's the secret?
In this episode we will survey accessible network analysis software, meaning that you do not need to program, nor do you need a special database. An Excel sheet will do the job just fine. And last but not least, the software featured here is free. With its help, we can display our data as a network and analyze it. Although programmers can use network analysis packages such as NetworkX for Python or igraph for R, I guess at least some of them would prefer to work with dedicated software, either to save some effort or to better rely on best practice, or both.
Speaking of which, in this episode we will talk a bit about best practice in network analysis and get to know some advanced concepts in the network field, things like matrices and Higher Order Networks or Hypergraphs. This is why I'm giving the heads-up early on: For the purpose of this episode, I assume that the basic concepts of network analysis, such as Centrality Measures, Community Detection etc., are already familiar to the listeners. For first timers, I recommend starting with episode 2. For those who just need a light warm-up, I recommend Episodes 4 and 5, on the subject of Centrality Measures and Community Detection, respectively.
During my research for this survey, I found dozens of network analysis software, but only a dozen of them met all the criteria: The first – that they don't require programming and can run on a PC with Windows. The second – that they can upload Excel files. And third – that they are absolutely free. No "trial period" is permitted. Some may have a premium version, but we won't expand on its features. And lastly, the software should allow analysis of all kinds of networks, that is, we will not review software that can only analyze specific networks or that uses network visualization as a kind of side show. For example, we will not talk about software designed for analyzing only biological networks, organizational networks, texts or academic papers, or software dedicated to analyzing process graphs or decision-making graphs (which bear some similarity to Microsoft's Visio). They may be good in their field, but do not allow much flexibility.
While reviewing the different software I noticed an interesting phenomenon: Though about half of them had been poorly maintained for years, it seemed as though from the moment I started the survey, many of them released a new version or planned a significant upgrade this year. Maybe I just experienced the Baader-Meinhof effect, but perhaps it's something else? Could it be that all the lockdowns we endured these last 2 years resulted in a baby boom of network analysis software upgrades? Why? Was it just a way to pass the time or is there something else at play? The answer might be found in Barabási's reply to a question I presented to him at the Networks 2021 conference. Albert-László Barabási is considered to be one of the founding fathers of Network Science and I asked him how come the general public is familiar with concepts like Machine Learning and AI but the vast majority have never heard of Network Science? His answer was that Covid19 brought Network Science to center stage. Everyone started talking about epidemiological models, R0 and many other concepts that just a few years ago were only used in Network Science conferences. I guess it's too early to say, but perhaps Barabási was right and Covid19 did usher in a growing awareness of Network Science, which might in turn have resulted in the upgrading of network analysis software. Yet another sign of the growing interest in networks is that for the first time, we have Nobel laureates from the field of Complex Systems, and as we all know, networks are a big part of this field so… Respect! So on this optimistic note, let's get down to business. We'll review the basics of each software - what sets it apart and what are its pros / cons – keeping in mind especially those who are just starting out in the field. Along the way, we will learn a little about the analysis process and advanced concepts in Network Science.
Our review process will try to run parallel to the actual analysis process, so first of all, we will start by talking about installing the software and uploading the data. The installation is mostly straightforward or, in the case of browser-based software, not even required. Some of the network analysis software is Java-based and therefore may require a Java installation as well. It's not complicated, just go to Google, search for Download Java and download. But uploading our data can be a little trickier. The usual practice for uploading data to network analysis software is to upload two separate files, in our case, Excel or CSV files: An edges file (or table) and a nodes file. The edges table is mandatory and will usually contain two columns: "Source" and "Target". This should be enough for most software. Sometimes you might wish to add a weight column, that is, the count of edges or links between the nodes, but most of the time it is not required and the software will count the edges on its own. Most software will also allow you to add more columns that contain further information about the edges, such as the type of relationship, a time stamp etc. The latter is necessary when doing dynamic network analysis, which is sadly not a common feature in network analysis software.
As for the nodes table, it is optional and is mainly used when we wish to show the nodes' attributes in our analysis. This table will usually require 2 columns, with the headers "Id" and "Label". The "Id" column serves as the pointer for the software to identify the relevant node in the edges table. That is why the id of a node should exactly match the node as it appears in the edges table. The second column will include the description of the node. Again, in many cases it would be possible to add more columns with a variety of descriptions and information about the node.
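For readers who do like to peek under the hood, the two tables described above can be sketched in a few lines of Python. This is only an illustration using the standard library: the names, the sample rows and the helper functions are made up for the example and do not come from any of the surveyed software.

```python
import csv
import io
from collections import Counter

# Hypothetical edges table with the usual "Source" and "Target" headers.
EDGES_CSV = """Source,Target
Alice,Bob
Alice,Bob
Bob,Carol
Carol,Alice
"""

# Hypothetical nodes table: "Id" must match the names used in the edges table.
NODES_CSV = """Id,Label
Alice,Alice (Sales)
Bob,Bob (Support)
Carol,Carol (Research)
"""

def read_edges(text):
    """Return a list of (source, target) pairs from a Source/Target CSV."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["Source"], row["Target"]) for row in reader]

def read_nodes(text):
    """Return a dict mapping each node Id to its Label."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["Id"]: row["Label"] for row in reader}

def edge_weights(edges):
    """Count repeated rows; the count plays the role of the weight column."""
    return Counter(edges)

def unmatched_ids(edges, nodes):
    """List node names that appear in the edges table but not in the nodes table."""
    used = {name for edge in edges for name in edge}
    return sorted(used - set(nodes))

edges = read_edges(EDGES_CSV)
nodes = read_nodes(NODES_CSV)
weights = edge_weights(edges)
missing = unmatched_ids(edges, nodes)
```

Here `weights` holds the edge count that most software computes on its own, and a non-empty `missing` list would point to ids that do not match between the two tables.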
If you just wish to test drive the software, most of them come with sample networks that can be loaded directly from the software. These are usually classic datasets from the field of Social Network Analysis, some of which we have already touched on, for example, Zachary's karate club network. Another quick way to test the software is by generating a graph, which is a common feature in most software. By choosing some parameters, the software can create a graph for you. This feature is mainly used to test graph attributes in the field of Graph Theory but since this episode is dedicated to real-world networks, meaning SNA and Network Science rather than theoretical graphs, we will hardly address this feature. But by far, the coolest feature is that some software allows data to be streamed from an external source, or API, most commonly Twitter, so you can harvest it directly via the software. In some you'll need an API key but some software only requires a Twitter account to enable this feature. Now that we have the data, the question is how much data can the software handle?
I guess most of us would not launch our network analysis career by analyzing networks in the magnitude of millions of nodes, but a few thousand, why not? That's why it's important to note that not every software can handle such a quantity and keep pace. Most of the software was designed for small networks, with only a few that can handle networks larger than 10,000 nodes. But this benchmark can vary by our willingness to wait for the results. For the purpose of this survey, I tested each software on a standard laptop and ran on it a network with tens of nodes, a network with thousands of nodes and a network with tens of thousands of nodes. If the software took more than 2 minutes to respond, I gave it the mercy blow and pulled the plug on the poor thing. Now that everything's ready – it's analysis time. So how intuitive is the analysis on each software? Because intuitiveness is a bit subjective, I used the Israeli benchmark: never read the manual. If it works, it's intuitive and if not, we'll bend and crank it until it works or breaks. On some software I had to admit defeat early on because they required a unique format to upload the data that doesn't conform to the standard "source/target" format. In that case you sadly have to check the ReadMe file.
Once we know what's what, we can continue our survey by checking what algorithms and visualizations the software provides, and whether it allows for an iterative process between the two. We'll start with the algorithms, which split into "basic" and "advanced". When I say basic algorithms I mean the Centrality Measures: Degree, Betweenness and my close friend, Closeness, and also basic network metrics, for example, calculating the network's density or its diameter. For the purposes of this survey, Community Detection isn't considered a basic algorithm but I highly recommend applying it. There is a wide and eclectic variety of Community Detection algorithms, but as I see it, a software that does not have the Louvain algorithm, well… it says a lot about it. Full disclosure: First, while talking with French speakers they insisted on pronouncing it "Luva" and not "Louvain" for some obscure reason. Second, I'm a sucker for "Luva". It's a great algorithm. So now that we covered the basics, let's move on to talk about more advanced techniques and we'll start with a subject we haven't covered yet in this podcast and by that I point to the elephant in the room: The Matrix.
A psychopath once said that there is more than one way to skin a cat, and this is also true of networks: There is more than one way to store our data as a network.
So far, we have mentioned 2 techniques: the visual method, that is simply drawing the network as nodes and edges and the second is the "edges list" which is the source/target table we mentioned earlier. What we haven't mentioned is the "Adjacency matrix", which can be used to show which node is adjacent to which node.
This matrix is basically a table with as many rows and columns as the number of nodes in the network, that is, a network with 10 nodes will be displayed as a table with 10 rows and 10 columns and the nodes will be assigned as the headers of these rows and columns. In each intersection between two nodes, the cell will contain a numeric value: If there is no link or edge between the nodes, it will be zero, indicating "False". If there is a connection – it will be 1, indicating "True". In a weighted matrix, the value will be the count of edges between the two nodes, or in other words - the weight of the relationship. The diagonal of the table, meaning the cells where the column and row of the same node meet, will show the self-loops in the network. Usually, it will be 0 but there are some networks where self-loops can be found, for example, in an email network, I can send an email to myself. In practice, these cells are often ignored.
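For those who think better in code, the table described above can be sketched in plain Python. The function name and the tiny sample network are made up for the example, and the sketch assumes an undirected network where repeated edges add up to a weight.

```python
def adjacency_matrix(edges, directed=False):
    """Build a weighted adjacency matrix (a list of rows) from an edge list."""
    nodes = sorted({name for edge in edges for name in edge})
    index = {name: i for i, name in enumerate(nodes)}
    # Start with an all-zero table: one row and one column per node.
    matrix = [[0] * len(nodes) for _ in nodes]
    for source, target in edges:
        matrix[index[source]][index[target]] += 1
        # In an undirected network the table is symmetric;
        # a self-loop is counted once, on the diagonal.
        if not directed and source != target:
            matrix[index[target]][index[source]] += 1
    return nodes, matrix

# Hypothetical network: A-B appears twice, A-C once, and C has a self-loop.
sample_edges = [("A", "B"), ("A", "B"), ("A", "C"), ("C", "C")]
nodes, matrix = adjacency_matrix(sample_edges)

for name, row in zip(nodes, matrix):
    print(name, row)
```

The row for A reads 0, 2, 1: no self-loop, a weight of 2 with B and a weight of 1 with C. The self-loop of C sits on the diagonal.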
The whole idea might sound a bit complicated and the first thing that pops to mind is - why do it? Moreover, it also sounds very wasteful in terms of data storage: think for example of a network with a thousand nodes. It means we need to keep a table with a million cells, the vast majority of which will most likely be empty or contain zero values. Why's that? Because, as you might remember, the edges of a network follow a long-tail distribution, meaning most of the nodes will have only a few edges or links. So why should we use matrices for these sparse networks? One reason is that although an adjacency matrix is wasteful in space, it can be economical in running time. Thanks to the matrix, all of a node's neighbors appear in its row or column, so we can quickly check which nodes are connected to which nodes. In order to get the best of both methods and save both space and time, there is another method that is a hybrid of the two, and it is called an "Adjacency list". The list contains two columns: the first holds all the nodes and the second holds each node's neighbors. That is, if we have a network where node A is linked to nodes B and C, node A will be placed in the first cell of the first column, and on the same row in the second column there will be nodes B and C, meaning all the nodes A is linked to. On the second row, the first column will have node B, and the cell next to it will have node A (C is missing because C isn't linked to B, just to A).
This way, we do not waste space on links that do not exist and at the same time we can quickly find neighboring nodes.
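The A, B, C example above can be written as a short Python sketch. The function name is made up for the illustration and the network is assumed to be undirected.

```python
from collections import defaultdict

def adjacency_list(edges):
    """Map every node to the sorted list of its neighbors (undirected)."""
    neighbors = defaultdict(set)
    for source, target in edges:
        neighbors[source].add(target)
        neighbors[target].add(source)
    return {node: sorted(linked) for node, linked in sorted(neighbors.items())}

# The network from the text: A is linked to B and to C.
adj = adjacency_list([("A", "B"), ("A", "C")])
print(adj)  # {'A': ['B', 'C'], 'B': ['A'], 'C': ['A']}
```

Only existing links are stored, and looking up the neighbors of a node is a single dictionary access.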
For those in computer science, optimizing efficiency in data storage and retrieval is important. That sounds cool but again, as a user, who cares? A simple Source/Target table should be enough because it's the software that's doing all the heavy lifting. That's true, but there are some algorithms that rely on the matrix format as an input, because they involve matrix multiplication and other sorts of math Voodoo. We won't get into it now, and we'll save something for the next episode, but for the time being, let's just give a simple use case for it, such as finding similarity between nodes. When we compare two nodes, in terms of their role and location in the network, we are required to compare their neighbors as well. And the best way to find neighbors is by using a matrix. Finding a large percentage of common neighbors between these nodes will give these nodes a high similarity score. In a social network this could mean that these nodes share many mutual friends and if there's no link between them, the application might suggest one. Perhaps the most common application these days for such use cases is in the field of machine learning on graphs, called GNN or Graph Neural Network, which is a subject I might cover someday on this podcast. Matrices open a door to many applications in the network field, but contrary to what I usually say about networks, they are not always simple to grasp and even computers have a hard time dealing with them, resulting in longer running times. Fortunately, and I hope I'm not offending anyone, most of the time we won't need them. Moreover, the results we'll get from some of these algorithms won't necessarily be much better than those of simpler and often more up-to-date algorithms. Now that all the computer science fanatics have left the room in a tantrum, I can secretly tell you that some software can convert an edges list to a matrix by itself, with no action needed from the user.
But when we move to the next issue at hand – visualizations – we might need to rethink our view of matrices.
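To make the similarity use case a bit more concrete, here is a minimal Python sketch. It assumes one common way of scoring "a large percentage of common neighbors", the Jaccard index (shared neighbors divided by all neighbors of either node), which is my choice for the illustration and not necessarily what any given software uses. The node names are hypothetical.

```python
from collections import defaultdict

def neighbor_sets(edges):
    """Map every node to the set of its neighbors (undirected)."""
    neighbors = defaultdict(set)
    for source, target in edges:
        neighbors[source].add(target)
        neighbors[target].add(source)
    return neighbors

def jaccard_similarity(neighbors, a, b):
    """Shared neighbors of a and b divided by all neighbors of either node."""
    # Leave the two nodes themselves out of the comparison.
    near_a = neighbors[a] - {b}
    near_b = neighbors[b] - {a}
    union = near_a | near_b
    return len(near_a & near_b) / len(union) if union else 0.0

# Hypothetical friendship network: X and Y share the friends P and Q.
friendships = [("X", "P"), ("X", "Q"), ("X", "R"), ("Y", "P"), ("Y", "Q")]
neighbors = neighbor_sets(friendships)

score = jaccard_similarity(neighbors, "X", "Y")
already_linked = "Y" in neighbors["X"]
# A high score with no existing link is a candidate for a friend suggestion.
suggest_link = score > 0.5 and not already_linked
```

X and Y share 2 of the 3 neighbors they have between them, so the score is two thirds, and since they are not linked yet, the sketch flags them as a possible suggestion. The 0.5 threshold is arbitrary.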
Now that we have our data, and we know which algorithms we need to apply, let's talk about the visualization features. Almost all software will have some editing features which can be applied to the nodes and edges, like sizing them according to their Centrality, changing their colors etc.
But the more advanced feature which we'll address is the layout options for the network. The bigger the network, the greater the chances that the nodes and edges will overlap, and sure enough, they will get to a state which is known in the professional literature as "The giant hairball effect". This messy ball of yarn makes the network incomprehensible.
That's where the various network layout configurations kick in. They use algorithms that aim to maximize certain features of the network, according to the user's visualization need. For example, a layout algorithm can maximize on the proximity of nodes to each other, i.e., pin nodes connected to each other and keep other nodes away from them. Other algorithms can maximize on the desired layout style, for example, display all nodes in a circle or in a hierarchical configuration or even allow the nodes to be displayed according to their geographical location, for example, by embedding them on a map. On a personal note, my go-to layout is the ForceAtlas2 developed by Mathieu Jacomy which is great for large networks. It usually does a great job of highlighting the network's communities, making the network more readable in the visual sense.
Though I can deeply relate to those who have been captivated by a particular visualization or layout because of its aesthetic beauty, looks can be deceiving. That's why there's best practice literature on how to visualize networks by none other than Mathieu Jacomy and - spoiler alert - he will also be our guest toward the end of the episode. Speaking of visualizations, I would like to take this opportunity to recommend Christian Miles' newsletter called "source/target" which deals, among other things, with network visualizations. There I found a link to a paper that claimed that sometimes networks are better represented visually when they are in a Matrix form rather than in a Node/Edge form.
Their main claim was that no matter how dense the network is, matrices don't suffer from overlapping nodes and edges. And that's a good argument.
But since no software I've surveyed lets you play with the matrix visualization mode, we'll keep this one theoretical for now.
So, now that we can analyze our network by using algorithms and visual aids, we need a software that can iterate between these two. To gain insights on our network, we'll usually need to go back and forth between the algorithm results and the visualization. The process might also involve some filtering or even exporting the data to continue our research on another software. Iteration is an important feature, especially for beginners but also whenever we do some exploring on our network for the first time. Sadly, not all network analysis software enables this workflow. At last – we can begin our software survey, and when I say software, I actually mean Gephi. One reason for it is that there's lots of software to cover so we can't cover all of it in one episode. The rest will be covered in the next episode. Also, to try and keep it light, the full list is detailed at the bottom. But there's also a second reason for it - I'm a sucker for Gephi.
GEPHI
When it first came out in 2009, Gephi had about 10,000 downloads, but seven years later it crossed the 2 million mark, and by unofficial polls, it seems that even today it is probably the most popular network analysis software.
As I tell my kids, this is not a competition, but Gephi wins.
By the way, when people search Google for a substitute for Gephi, these are the results they get.
Do you know which one? If you do you might win a goat with a bell. *
The founding father of Gephi is Mathieu Jacomy, whom I mentioned earlier. In the interview he'll reveal to us some new features and if you are a Java developer, he'll make you an offer you can't refuse.
Gephi is a Java-based software that, at least for now, requires installation of both the software and Java's latest version. Gephi also comes with a lot of plugins, that is, other features that also need to be installed, and the good news is that lots of them are quite useful. We'll start with Gephi's most renowned features: Visualization and scale. As for the visualization, there's almost nothing you can't do with it. That's why a lot of network posters you'll find on the web were made on Gephi. It's also one of the few software that allow you to run Jacomy's layout algorithm, ForceAtlas2. As for scale, Gephi is one of the best and can handle networks with tens of thousands of nodes and edges pretty well. According to some brochures, Gephi can handle hundreds of thousands of nodes and edges. This might be true but it doesn't pass the 2-minute benchmark I mentioned early on and I bet Gephi would crash most of the time. It sounds bad but keep in mind there's only one software out there that can outperform Gephi in that regard and we'll cover it in the next episode.
As for the intuitiveness, I'm not so objective, because I've been working on Gephi for a long time and even wrote a basic guide to it in Hebrew for Israeli research teams during Covid19 (is Covid19 still an issue these days?).
I guess the obvious conclusion is that if it had been intuitive, there would have been no need for a manual. So, for those who wish to take their first steps in the field, Gephi will require some getting used to. Why? Because sometimes even basic functions are "hidden" under non-intuitive headers. For example, in Gephi, if you want to analyze key metrics like Closeness or Betweenness, you need to click on the function: "Calculate the network's diameter". Only then do you realize that it analyzes the other Centrality Measures as well. Although computationally this can make sense, it's a pity to make such a basic operation into a kind of an Easter egg for the user.
Speaking of Centrality Measures and basic algorithms, Gephi of course has you completely covered, Louvain included, and to the best of my knowledge, it is also the only network analysis software that has the Leiden algorithm, which is considered the improved Louvain.
But again, to apply it you need to install the relevant plugin.
Thanks to its 100 or so plugins, Gephi also enjoys a large variety of advanced algorithms and features.
One of them, which is not so common in network analysis software, is the ability to play a dynamic network. Given that the network edges or nodes have a Time Label, Gephi allows you to "play" the network and so watch it change.
Needless to say, it's a nice feature for small networks, but on large networks we'll probably lose our hands and feet if we try to figure out what's going on there by visuals alone.
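For the curious, "playing" a network boils down to filtering the edges by their time label. The following Python sketch is not Gephi's implementation, just a rough illustration of the idea. It assumes hypothetical edges with a dd-mm-yyyy date, the format mentioned in the summary at the bottom.

```python
from datetime import datetime

# Hypothetical edges table with a time label per edge (dd-mm-yyyy).
timed_edges = [
    ("A", "B", "01-03-2020"),
    ("B", "C", "15-03-2020"),
    ("A", "C", "02-04-2020"),
]

def parse_date(text):
    """Parse a dd-mm-yyyy time label."""
    return datetime.strptime(text, "%d-%m-%Y")

def snapshot(edges, until):
    """Return the edges that already existed on the given date."""
    cutoff = parse_date(until)
    return [(s, t) for s, t, when in edges if parse_date(when) <= cutoff]

end_of_march = snapshot(timed_edges, "31-03-2020")
end_of_april = snapshot(timed_edges, "30-04-2020")
```

Stepping the cutoff date forward and redrawing each snapshot gives the "play" effect: two edges exist by the end of March and the third one joins in April.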
Another nice feature in Gephi is the ability to perform projection, a feature we mentioned in a previous episode, and it does so quite comfortably and intuitively. Again, you'll need to install the relevant plugin.
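As a reminder of what projection does, here is a small Python sketch. It assumes that projection means the usual folding of a two-mode network (say, orders and products) into a one-mode network of products that were bought together, which is the problem from the opening of this episode. The orders and the function name are made up for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical two-mode edges table: which order contains which product.
purchases = [
    ("order1", "bread"), ("order1", "butter"),
    ("order2", "bread"), ("order2", "butter"), ("order2", "jam"),
    ("order3", "jam"),
]

def project_on_products(pairs):
    """Link two products once for every order that contains both of them."""
    baskets = {}
    for order, product in pairs:
        baskets.setdefault(order, set()).add(product)
    weights = Counter()
    for products in baskets.values():
        for first, second in combinations(sorted(products), 2):
            weights[(first, second)] += 1
    return weights

bought_together = project_on_products(purchases)
for (first, second), count in bought_together.most_common():
    print(first, second, count)
```

The resulting weight is the number of orders two products share, so bread and butter, which appear together in two orders, get the heaviest edge.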
Besides its features, Gephi is a great tool to explore networks because it allows the user to easily iterate between analysis states. Once you open Gephi, you can move between its three main tabs:
The "Overview" tab which is where the analysis process takes place, and it has a visualization window and the analysis menus.
The second tab is the "Data Laboratory" which presents the tables of the nodes and edges.
And finally, there's the "Preview" tab which is where you can polish the visualization to fit your PowerPoint needs.
The "Overview" tab keeps the relevant analysis menus close at hand, plus a user can toggle comfortably between the tabs, for example you can highlight a node on one tab and view it on another.
A major drawback though, and it may sound a bit silly, but I can see how it can be irritating to some, is that it takes some mouse skills to pan and zoom the network when you're using the "Overview" tab's visual interface. When this happens, I advocate patience and remember – it's all in the wrist.
So, in conclusion, what are the pros and cons of Gephi? A significant advantage for Gephi is that it's a one-stop shop. You can perform an end-to-end analysis even of large networks, and not feel anything is missing. The fact that it is expected to undergo a significant facelift and include new features makes it score even more points. I don't want to steal Mathieu Jacomy's thunder, but I've got to mention one of the new planned features and that is Hyperedges. This is exciting news because there is currently no free software that has it. So, we need to stop for a moment and explain what hyperedges are: In traditional network analysis, each edge connects two nodes. But there are real-world use cases where an edge can connect more than two nodes. Let's give an example: Suppose we have a network that consists of 3 nodes: A, B and C. Let's say that A corresponded with B, B corresponded with C and C corresponded with A. On this network we now have 3 edges, one for each pair of nodes, forming a triad or a triangular graph. A hyperedge is formed when A sends a message to node B and to node C simultaneously. In this case, the edge stands for a single action that simultaneously connected more than two nodes together. In order to differentiate this edge from a standard edge it is called "a hyperedge", which can be found in "Higher Order Networks". In recent years there has been a growing interest in the concept of "Higher Order Networks" as a model that can improve the resolution at which we analyze networks.
I hate to say it, but Gephi also has some drawbacks, and I'll begin with a story: When I was little, we would sometimes go to the Mall to a big restaurant which I recall had many food stands. Each food stand offered a different cuisine, which is a fancy word for fast food: There was a hamburger stand, pizza, Chinese, Indian, etc. And even though it was not the most delicious meal, the mere ability to choose from a wide variety really impressed me as a child. Needless to say, this restaurant has been closed for many years now. Probably because of health issues, but why am I telling this? It's not my intention to imply that Gephi gives a mediocre solution to a wide range of issues, it's just that when you disperse your efforts, it comes at a price. In the case of the aforementioned restaurant it meant that all the food there had a greenish hue and to this day I can't explain its source. In the case of Gephi, the toll is a non-intuitive UX and, at least for now, a stability issue. When there are so many features, sometimes some of them get crammed into all sorts of tabs, making them hard to find. And as for stability, sometimes, when running a particular feature, Gephi crashes, probably because not all of the plugins play nicely with the rest. Clicking the "save" button every time before you try a new function will not necessarily be so attractive to every user. Personally, I consider it a bonus because it both toughens you up and serves as a contingency in the case of a sudden power outage.
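The difference between the triangle and the hyperedge is easy to see in a few lines of Python. This is only a sketch of one possible way to store hyperedges (as sets of nodes) and has nothing to do with how Gephi plans to implement them. The variable and function names are made up.

```python
from itertools import combinations

# Three separate conversations: A-B, B-C and C-A (a triangle of ordinary edges).
pairwise_network = [frozenset("AB"), frozenset("BC"), frozenset("CA")]

# One message from A to B and C at once: a single hyperedge over three nodes.
hyper_network = [frozenset("ABC")]

def flatten_to_pairs(hyperedges):
    """Replace every hyperedge with ordinary edges between all of its members."""
    pairs = set()
    for hyperedge in hyperedges:
        for first, second in combinations(sorted(hyperedge), 2):
            pairs.add(frozenset((first, second)))
    return pairs

# Once flattened, the two networks look exactly the same...
same_triangle = flatten_to_pairs(hyper_network) == flatten_to_pairs(pairwise_network)

# ...even though one records three events and the other records a single one.
events_pairwise = len(pairwise_network)
events_hyper = len(hyper_network)
```

When the hyperedge is flattened into ordinary edges we get the same triangle, so the information that it was one simultaneous event rather than three separate ones is lost. Keeping that information is the "resolution" a higher order model adds.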
So, is it a bug? Is it a feature? See? This is true love –
when you love in spite of the flaws and perhaps because of them. That's why in the next episode we'll compare the rest of the network analysis software to Gephi. Its wide range makes it a good benchmark for the rest.
The Survey Checklist - In each software we will check for the following list:
1. Installation
2. Uploading data interface
3. External interfaces
4. Interface Intuitivity
5. Basic metrics
6. Community Detection
7. Iterative analysis
8. Good / comfortable visualization
9. Advanced algorithms and features
10. Robustness or the capability to handle large networks
11. Is the software updated
12. Significant advantage
13. Significant drawback
Installation: Gephi requires installation + installing the latest Java version. The installation is short and simple but after installing you might want to install many of Gephi's plugins which include features that do not come automatically with the software and we'll mention some of them.
Uploading Data: You can upload Excel files but the headers are not flexible, meaning the file has to include a "Source" column and a "Target" column. But Gephi's advantage over other software is that it allows importing and exporting from a wide variety of file types.
External Data Sources or APIs: If you have an API key for Twitter, Gephi allows you to connect directly and harvest data from there. In addition, if you have a website that you can connect to as a stream, Gephi allows it too.
Is it intuitive: The analysis menus are not intuitive. Some basic features are "buried" under non-intuitive headers. For example, Centrality measures such as Closeness and Betweenness, are hidden behind the "Network Diameter" function. Also, many functions are plugins that you need to find in the "tools" menu to install.
Basic Metrics: So, if you found them or installed the appropriate plugins, Gephi has it all and then some.
Community Detection: There are not many Community Detection algorithms in Gephi, and it doesn't include any algorithms for triad or clique detection. On the other hand, it has the Louvain algorithm, and to the best of my knowledge, it is the only software that has its improved version, the Leiden algorithm. It can also show the aggregated communities' graph and automatically label each community using the labels of its nodes. Again, to enjoy these features you need to install the relevant plugin.
Iterative Analysis: On this issue Gephi is among the best. It is very convenient for an iterative process and allows you to test metrics, go in and out of the visualizing tab, filter the results and more. But in this regard, there are two major drawbacks: The first is the panning and zooming in the visualization tab. It requires high mousing skills and can be a little frustrating. Another disadvantage is the lack of ctrl-Z or the Undo function. That is, if you delete something, it cannot be recovered. It will not affect your source files of course, but if you want to undo, you will have to upload them again.
Visualization: Gephi is one of the best visualization software and it allows a lot of flexibility. In fact, many posters of networks that you will find online were made with the help of Gephi. In my opinion, Gephi's most significant advantage is in the layout algorithm called ForceAtlas2 that was developed by Mathieu Jacomy. It's fast and gives a good layout even for huge networks. This is because the layout often coincides with the network communities, making the network more readable in the visual sense.
Advanced algorithms and features: Thanks to its large collection of plugins, Gephi has a wide range of algorithms and features at its disposal. One of them is the ability to fold a network or perform projection, which is a feature we mentioned in a previous episode, and it does so in a comfortable and intuitive fashion. Another rare feature is the ability to view the dynamics of a network. Given that the edges have a Time Label, Gephi allows you to "play" the network and thus watch it unfold. In addition, on the grid section, you can see a kind of small graph of when each edge was formed. But to do so, you need to make sure that you upload the time labels in the correct format (dd-mm-yyyy) and only use CSV files. It's a nice feature for small networks, but on large networks, we'll probably lose our hands and feet if we try to figure out what's going on there based on visuals alone. What Gephi is missing is the matrix-based algorithms family.
Dealing with large networks and performance: Gephi provides a good and fast analysis even for networks with tens of thousands of nodes. According to its brochure, Gephi can also handle even larger networks, at the range of hundreds of thousands of nodes. In practice, the response time for such networks seems to be either too long or end with a crash, even after increasing Gephi's memory to the max which can be done manually.
Is the software updated: The latest version of the software (0.9.2) was released in 2017. A hackathon is planned to take place at the end of 2021 to do some facelifting and adding new features.
Significant advantage: The significant advantage of Gephi is that it's a one-stop shop. You can perform end-to-end analysis even of large networks, and not feel anything is missing. The fact that it is expected to undergo a significant facelift and include new features such as hyperedges makes it score even more points.
Significant drawback: Non-intuitive UX and as for stability, sometimes, when running a particular feature, Gephi crashes, probably because not all of the plugins play nicely with the rest. Better click on the "save" button before each time you try a new function.
Wish to expand the Network Science and SNA community?
The more you rate this podcast, the more people will be exposed to it.
Creative reviews will be read in the following episodes.
See you in the next episode of NETfrix (:
* Participation is prohibited for NETfrix listeners/readers.