In the world of data, a picture is indeed worth a thousand numbers. As data enthusiasts, we often find it challenging to step out of our world filled with numbers, probability distributions, and Python code, and present our findings to those not in our field without overwhelming them. In this article, I’ll draw from my experiences and examples of failed real-world data visualizations to outline the pitfalls to avoid in data presentation.
In every data analyst or data scientist’s life, there comes a time when it’s not just about studying data, but also about explaining it to others. I remember once at a previous job, I was presenting the results of a machine learning model to a room full of managers. The model had been running successfully for months. The results were good, and I chose a graph that combined bars and lines to compare how four key variables changed with the model’s configuration. But about five minutes in, a manager stopped me, saying he couldn’t understand what he was looking at. That’s when I realized I might have gone overboard. Instead of making the important stuff visible, I had just confused everyone.
In this article, I want to focus on some glaring mistakes people make in data presentation and how I’d fix them. Many of these examples might seem obvious or trivial, but the same mistakes keep repeating in data visualizations we see all around us.
All the graphics in this article were created by me (the idea) and Dan Kovalik (graphical design) either as reimaginings of original real-world charts, or as brand new visualizations of the same data.
The internet is full of graphs like these, many captured from Fox News. Look at the bar graph showing a huge visual difference between columns. It might represent a tax increase that seems enormous until you notice the vertical axis doesn’t start at zero. In reality, the difference between the values is just a few percent.
Our eyes perceive the values in a bar graph as proportional to the height of the bars. That’s why the axes on bar graphs must always start at zero. Fixing this is simple.
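To put a number on the distortion, here is a minimal Python sketch that compares the real ratio of two values with the ratio of the bar heights a reader actually sees when the axis starts above zero. The function name `apparent_ratio` and the two rates are my own illustrative assumptions, not data from the chart discussed above.

```python
def apparent_ratio(smaller: float, larger: float, baseline: float = 0.0) -> float:
    """Ratio of the drawn bar heights when the value axis starts at `baseline`."""
    if baseline >= smaller:
        raise ValueError("baseline must lie below both values")
    return (larger - baseline) / (smaller - baseline)


# Illustrative rates in percent (assumed for this example).
old_rate, new_rate = 35.0, 39.6

true_ratio = apparent_ratio(old_rate, new_rate)                   # axis starts at zero
drawn_ratio = apparent_ratio(old_rate, new_rate, baseline=34.0)   # axis starts at 34

print(f"real ratio of the values:  {true_ratio:.2f}x")
print(f"ratio of the drawn bars:   {drawn_ratio:.2f}x")
```

With these assumed numbers, a difference of roughly 13 percent is drawn as a bar more than five times taller. Setting the baseline back to zero makes the two ratios identical.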
For a data analyst, the question arises: if I want to show a difference between numbers that are far from zero, how do I overcome this limitation? One answer is a line graph, where this rule doesn’t have to be applied strictly. With line graphs, our eyes mainly compare the relative height of the points or lines, not their absolute position relative to the baseline.
Speaking of graph axes, those of us who work with data daily sometimes like to use a logarithmic scale when showing a variable that changes by orders of magnitude. But for most people, a logarithmic axis is unusual. So, if you decide to use one, clearly indicate it on the graph and explain to your audience how a logarithmic axis works when describing the graph.
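One way to explain a logarithmic axis to a non-technical audience is to show where a few round numbers land on it. The sketch below is only an illustration: the helper `log_axis_position` is a hypothetical name, and it assumes a base-10 axis.

```python
import math


def log_axis_position(value: float, axis_min: float, axis_max: float) -> float:
    """Position of `value` on a log10 axis, from 0.0 (axis_min) to 1.0 (axis_max)."""
    if min(value, axis_min, axis_max) <= 0:
        raise ValueError("a logarithmic axis can only show positive values")
    if axis_max <= axis_min:
        raise ValueError("axis_max must be greater than axis_min")
    span = math.log10(axis_max) - math.log10(axis_min)
    return (math.log10(value) - math.log10(axis_min)) / span


# Each tenfold increase moves the point by the same distance along the axis.
for value in (1, 10, 100, 1000):
    print(value, round(log_axis_position(value, 1, 1000), 3))
```

The point worth saying out loud when presenting is that equal distances on such an axis mean equal multiplication factors, not equal differences.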
Pie charts are almost universally disliked by data professionals. Yet, when used correctly, they serve their purpose. Unfortunately, they are often misused. In one example (again from American TV), the usage is almost absurd. In a survey, viewers were asked about three different candidates in the Republican primaries, and whether they found each candidate likable or not. These are three separate responses to three different questions. A viewer could like all candidates, none, or any combination. The visualization shows the percentage of viewers who find each candidate likable, all squeezed into one pie chart.
Pie charts are suitable for displaying data divided into a few mutually exclusive categories that together add up to 100 percent. If your data doesn’t meet this criterion, it doesn’t make a pie we can slice. In this case, we “fixed” the graph by replacing it with a bar chart.
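The “adds up to 100 percent” part of this rule is easy to check before drawing anything. Below is a minimal sketch of such a check. The function name `fits_pie_chart`, the rounding tolerance, and both sets of numbers are assumptions made up for illustration. Note that the check is necessary but not sufficient: whether the categories are mutually exclusive depends on how the question was asked, which the numbers alone cannot tell us.

```python
from typing import Sequence


def fits_pie_chart(shares_percent: Sequence[float], tolerance: float = 0.5) -> bool:
    """Necessary (not sufficient) check: non-negative shares that add up to 100%."""
    if not shares_percent or any(share < 0 for share in shares_percent):
        return False
    return abs(sum(shares_percent) - 100.0) <= tolerance


# Hypothetical survey numbers, for illustration only.
likable_per_candidate = [70.0, 63.0, 60.0]   # three separate yes/no questions
single_choice_answers = [45.0, 30.0, 25.0]   # one question, one answer per person

print(fits_pie_chart(likable_per_candidate))  # False: the shares sum to 193
print(fits_pie_chart(single_choice_answers))  # True: the shares sum to 100
```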
Since I was able to find the original dataset with all responses for each candidate, we created an alternative visualization using horizontal stacked bar charts. This format shows the structure of sympathies towards each candidate. By comparing the left parts of the rows, we can see the number of people who find each candidate favorable. On the right-hand side, we can compare the number of people who find the candidates unfavorable. Between these two segments for each candidate, there are additional segments representing people who either don’t know the candidate or have a neutral opinion of them. As these categories are mutually exclusive and sum up to 100%, I could have used three pie charts instead. However, this visualization is better because it allows a direct comparison between candidates.
Edward Tufte, a data visualization expert, has written many books on this subject (highly recommended!). In his work “The Visual Display of Quantitative Information,” he called one graph “the worst graphic ever to find its way into print.”
Indeed, this graph displays merely five values on a huge area. It’s a time series of the percentage of people under 25 years old among college applicants, complemented by the percentage of those 25 and older. Illogically, one series is shown as a filled area from the bottom edge (which, unsurprisingly, doesn’t start at zero) and the other from the top edge. Even though both series always add up to 100%, their color fills make them look like two independent categories, suggesting there might be others in the white space between them. Not to mention the unnecessary 3D effect and the absurdly small font, given the size of the graph.
This graph can be made readable in several ways. I chose “stacked bars” to clearly show that the values are complementary.
The graph described above is Edward Tufte’s “favorite,” but mine is the next one. Sometimes we all feel the need to inject more creativity and art into our work, and in this case, the authors may even have attempted to invent a completely new type of graph.
The graph shows the distribution of federal subsidies across different segments of the energy industry. At first glance, it looks like a pie chart, but don’t be fooled. Try it yourself: look at the graph and see how long it takes to understand it.
The division into halves and quadrants doesn’t correspond to the size of the subsidies but only categorizes them. To decipher the size of the subsidies, you need to focus on the area of circular segments and the area of the annular sectors in each quadrant — an unnecessary mental burden. The graph also lacks a legend, so you only realize that the circular segments in each quadrant represent “direct spendings” and the annular sectors “tax breaks” when you notice the labels in the lower left quadrant.
Creating a good visualization for this data (though it’s only eight numbers) was quite a challenge for me. The values have a wide range and there’s a hierarchy among the categories I wanted to preserve. I settled on a Sankey chart, which breaks down the whole into individual categories according to their hierarchy.
The value of each category corresponds to the size of the columns representing them. In this type of graph, the columns don’t share a common baseline, so they are slightly harder to compare than in a classic bar chart (which is also a good alternative). However, for this purpose, I find this type of graph to be a suitable compromise between displaying the quantities and emphasizing the hierarchical breakdown.
I remember my first encounter with Microsoft Excel many years ago, possibly Excel 95. I was immediately fascinated by all the graph options and effects available. Like my later fascination with WordArt, this phase thankfully passed. However, some graphs from that era serve as a fascinating warning against prioritizing form over content.
One graph displayed four lines, each representing the distribution of a variable across ordered categories. But why not add a third dimension? So the lines became surfaces that overlap in a false perspective and lose touch with visual anchors like axes and grid lines, which makes comparison difficult. And if you look closely at the horizontal axis, the grid lines don’t even match the tick marks, so they would hardly help anyway. It’s tough to get much out of this graph.
Getting rid of the unnecessary 3D effect is easy. Even then, we could end up with four cluttered lines crossing over each other. So, splitting the graph into four panels (known as “small multiples”), each focusing primarily on one distribution, seemed apt. The remaining three are always drawn in the background for context.
Since these are graphs of probability distributions, not just any line graph, it might be even more appropriate to use a smoothed line (a kernel density plot) or a bar graph without gaps between the columns (i.e. de facto a histogram).
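For readers curious about what a kernel density plot actually computes, here is a minimal sketch of a Gaussian kernel density estimate in plain Python. It assumes we have the raw observations rather than pre-binned shares, and both the sample values and the bandwidth are made up for illustration. In practice a plotting library would pick the bandwidth and draw the curve for you.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density of `samples` at a point x."""
    if not samples or bandwidth <= 0:
        raise ValueError("need at least one sample and a positive bandwidth")
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump per observation, centered on that observation.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Hypothetical observations and bandwidth.
observations = [1.0, 2.0, 2.5, 4.0]
density = gaussian_kde(observations, bandwidth=0.5)

for x in (0.0, 1.0, 2.0, 3.0, 4.0):
    print(f"x = {x:.1f}  density = {density(x):.3f}")
```

A smaller bandwidth follows the data more closely and produces a bumpier line; a larger one smooths more aggressively.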
Dealing with 38 time series, most with 20 data points each, amounts to nearly 800 data points on a single graph. This can easily result in a “spaghetti” graph. The lines overlap, making it nearly impossible to follow any one from start to finish, and the color scheme doesn’t have 38 distinct colors — so each color is shared by 3 or 4 lines. I feel obliged to clarify that this chart was actually featured as a bad example of a standard Excel chart by its authors.
When we have this much data, displaying it all in one graph is extremely challenging. We should consider what information we want to convey to the audience and focus our visualization primarily on that.
If we really need to show all of the nearly 800 data points, a simple table might be more appropriate. Coloring the cells so that the color (or its intensity) is proportional to the value would make sense. This “preattentive attribute” significantly eases navigation, as it gives a visual cue about the values in the table without needing to read the actual numbers.
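The cell coloring can be as simple as a linear blend between two colors. The sketch below maps a value to a shade between white and a dark blue. The function name `cell_color` and the particular blue are my own assumptions; spreadsheet tools and data libraries offer ready-made color scales that do the same job.

```python
def cell_color(value: float, low: float, high: float) -> str:
    """Map a value to a white-to-blue hex color; larger values get a darker blue."""
    if high <= low:
        raise ValueError("high must be greater than low")
    # Relative position of the value in the range, clamped to [0, 1].
    t = min(max((value - low) / (high - low), 0.0), 1.0)
    # Blend each channel from white (255, 255, 255) to dark blue (8, 48, 107).
    red, green, blue = (round(255 + t * (channel - 255)) for channel in (8, 48, 107))
    return f"#{red:02x}{green:02x}{blue:02x}"


print(cell_color(0, 0, 100))    # "#ffffff", the lowest value stays white
print(cell_color(100, 0, 100))  # "#08306b", the highest value gets the full blue
```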
If we’re interested in comparing values at a specific point in time (like in 2021), a table sorted by these values, supplemented with Sparklines — small graphs without axes that put the numbers into the context of their historical development — might suffice. (Please excuse the depiction of 2020–2021, which was not in the original graph — our mistake in creating the visual materials.)
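To show how little a sparkline needs, here is a minimal text-only version built from Unicode block characters. It is a sketch rather than a replacement for a real charting tool, and the series names and values in the usage part are hypothetical.

```python
BLOCKS = "▁▂▃▄▅▆▇█"


def sparkline(values):
    """Render a sequence of numbers as a one-line text sparkline."""
    if not values:
        return ""
    low, high = min(values), max(values)
    if high == low:
        return BLOCKS[0] * len(values)  # flat series: draw the lowest block
    scale = (len(BLOCKS) - 1) / (high - low)
    return "".join(BLOCKS[round((v - low) * scale)] for v in values)


# Hypothetical series, listed from the highest latest value to the lowest.
series = {"A": [3, 4, 6, 5, 8], "B": [9, 7, 6, 4, 2], "C": [5, 5, 6, 5, 5]}
for name, vals in sorted(series.items(), key=lambda kv: kv[1][-1], reverse=True):
    print(f"{name}  {vals[-1]:>3}  {sparkline(vals)}")
```

Each series is scaled to its own minimum and maximum here, so the sparklines show the shape of the development, not a comparison of levels. The sorted numbers next to them carry the comparison.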
And if we want to focus on the development of a specific series, we can use the “small multiples” strategy described earlier.
To encapsulate what I’ve demonstrated through various examples, I would conclude this article with a few universal principles. Before you start designing a graph, first articulate the core message you wish to convey to your audience, and tailor the graph to make that message clear. Avoid unnecessary decorations; simplicity in graphs can be both beautiful and effective. And when it really matters, present your graph to a few individuals who are unaware of its context. If they can immediately interpret it correctly, you’re on the path to success.