The author of this post, Roberto Salazar, attended Jeff Heer's course Techniques and Frameworks for Data Exploration through Sphere.
Analysts often have to work with tall and wide datasets when developing visualization plots. Tall datasets contain thousands or hundreds of thousands of rows of data points or replicates. In contrast, wide datasets contain multiple columns of features, attributes, or labels of different types, such as quantitative, ordinal, or nominal.
Developing compelling visualization plots for both types of datasets has its unique set of challenges. When dealing with wide datasets, we want to consider which visual encoding channels to use to represent the data. However, what is the best option for visually encoding all features on a visualization without congesting it while keeping it simple to analyze? Is it position, length, area, volume, value, texture, color, orientation, shape, transparency, or blur? Are these visual encodings appropriate for all data types, or do some adapt better to a specific type? All of these are important questions analysts must answer when creating visualizations. Ultimately, a visualization must tell a story to provide insights, trends, or patterns.
Jacques Bertin's 1967 book Sémiologie Graphique represents the first and most comprehensive attempt to provide a theoretical foundation for information visualization. It includes a set of visual variables that can be used to construct map symbols, as shown in Image 1.
In his book, Bertin also provided his levels of organization according to what he believed were the most effective visual encoding channels for each data type, as shown in Image 2.
On the other hand, what are some visual encoding options for visualizing all data points in tall datasets, especially when obtaining a population sample is not an option because we wish to avoid losing information?
This article shows how to choose effective visual encodings to build visualization plots when working with tall and wide datasets. We will apply several principles to a synthetic tall dataset, a dataset containing NYC's Uber ride-sharing data, and the well-known Iris flower dataset used and made famous by the British statistician and biologist Ronald Fisher.
The iris flower dataset contains 150 rows and six columns. Sepal length, sepal width, petal length, and petal width columns represent the features, while the species and species id columns represent the target (i.e., data label), one with the actual flower species and the other with the species encoded into numerical values.
Let’s start by plotting one feature at a time and adding more features to the visualization plot as we move forward until we end up with an easy-to-interpret-and-analyze plot with all features included.
Image 5 shows the simplest possible visualization of this dataset: a scatter plot with the data labels on the x-axis and one numerical feature on the y-axis. Three visual encodings are used at this point: x-axis position, y-axis position, and point marks. Since the plot does not include all the dataset's features, only limited insights can be obtained from it (e.g., which species has the widest and narrowest sepals). Let us move on to adding another feature.
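First, though, here is a minimal sketch of this starting point in Python using Plotly Express, whose bundled copy of the iris dataset has exactly the six columns described above. This is an illustrative reconstruction, not necessarily the code behind Image 5.

```python
import plotly.express as px

# Load the six-column copy of the iris dataset bundled with Plotly Express.
df = px.data.iris()

# Simplest view: the categorical label on the x-axis,
# one numerical feature on the y-axis.
fig = px.scatter(df, x="species", y="sepal_width")
fig.show()
```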
Images 6 and 7 represent two-dimensional scatter plots with the sepal/petal width encoded on the x-axis, the sepal/petal length encoded on the y-axis, and the flower species encoded with discrete colors, a common technique for encoding categorical features. At this point, we can see that the setosa species' sepal and petal widths and lengths are more distinct than those of the versicolor and virginica species. How do we know that? The blue dots do not overlap with the red and green dots.
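In Plotly Express, mapping a categorical column to the color parameter produces exactly this discrete-color encoding; a sketch along the same lines:

```python
import plotly.express as px

df = px.data.iris()

# Discrete colors encode the categorical species label,
# freeing both axes for two numerical features.
fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species")
fig.show()
```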
Now, let's add a third feature to a single scatter plot.
In Image 8, the petal length feature was encoded with a color scale, which is commonly used for encoding continuous features. Data points with shorter petals have a dark blue color, while those with longer petals have a bright yellow color. Using a scale of colors, we can infer the petal lengths better or at least get a better approximation of their real values.
However, since data points cannot carry two color scales at once, the flower species' original encoding of discrete colors had to be changed to shapes. This still allows us to spot which data point belongs to each flower species, although the shapes are harder to tell apart at a glance than the discrete colors were. Now we just need to add the petal width feature.
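Before adding it, here is a hedged sketch of the encoding so far: a continuous column mapped to color gives the color scale, while the symbol parameter moves the categorical label to the shape channel.

```python
import plotly.express as px

df = px.data.iris()

fig = px.scatter(
    df,
    x="sepal_width",
    y="sepal_length",
    color="petal_length",  # continuous feature -> color scale
    symbol="species",      # categorical label -> marker shape
)
fig.show()
```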
In Image 9, the petal width feature was encoded as the size of the dots. Smaller data points represent narrower petals, while larger ones represent wider petals. However, since the petal width is a continuous feature, it is difficult to determine the real numerical value for each data point – just from sight – since there are multiple sizes.
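Adding the size parameter completes the picture: all four numerical features plus the label in one chart, again as an illustrative sketch rather than the article's exact code.

```python
import plotly.express as px

df = px.data.iris()

fig = px.scatter(
    df,
    x="sepal_width",
    y="sepal_length",
    color="petal_length",  # continuous feature -> color scale
    size="petal_width",    # continuous feature -> marker size
    symbol="species",      # categorical label -> marker shape
)
fig.show()
```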
If the petal width had been a categorical feature (e.g., limited to small, medium, and large), it would have been easier to spot which petal widths were smaller or larger than others. Let us try facet visualization to remove the shape encoding of the flower species and create a plot that's a bit easier to interpret quickly.
In Images 10 and 11, a grid facet was applied to remove the shape encoding from the flower species, resulting in three scatter plots sharing a y-axis. Analysts can simply look at one particular scatter plot, knowing that all the data points on it belong to a single flower species, rather than having all of them visually encoded in a single scatter plot.
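In Plotly Express, this amounts to swapping the symbol parameter for facet_col; a sketch under the same assumptions as above:

```python
import plotly.express as px

df = px.data.iris()

# Faceting by species replaces the shape channel: each subplot
# holds a single species, and all three share the y-axis.
fig = px.scatter(
    df,
    x="sepal_width",
    y="sepal_length",
    color="petal_length",
    size="petal_width",
    facet_col="species",
)
fig.show()
```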
At this point, all four numerical features and the categorical label have been plotted, resulting in two two-dimensional scatter plots from which multiple conclusions and findings can be obtained.
What if we had more features to encode, using channels such as texture, area, or orientation? The more features present in a dataset, the more challenging it becomes to encode them all. Analysts must evaluate whether all features are relevant for a given analysis and how many of them need to be included in a given visualization while keeping it easy to analyze. They must also check whether each additional feature is numerical or categorical, since that determines which of these extra channels is appropriate.
Rather than having a single visualization with as many features as possible, analysts can instead develop multiple visualization plots with highly correlated features or just with the most relevant features for the study.
Moving forward, let us explore the challenges of visual encoding for tall datasets, where plotting all data points in a single chart could result in complex and difficult-to-interpret visualizations.
Tall datasets contain thousands or hundreds of thousands of rows/replicates/data points. These, too, can be visualized. In some cases, analysts apply a sampling technique (e.g., simple random sampling, systematic sampling, cluster sampling, stratified random sampling) to reduce the number of data points to plot and still get significant insights.
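As an illustration, two of these sampling techniques are one-liners in pandas. The iris data stands in here for a much taller dataset; the fraction and seed are arbitrary.

```python
import plotly.express as px

df = px.data.iris()  # stand-in for a much taller dataset

# Simple random sampling: keep 10% of all rows.
simple_sample = df.sample(frac=0.10, random_state=42)

# Stratified random sampling: keep 10% of the rows per species,
# preserving the label proportions of the full dataset.
stratified_sample = df.groupby("species").sample(frac=0.10, random_state=42)
```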
However, sometimes reducing the dataset size is not an option. In this case, analysts must apply specific data visualization tools and encodings. Let us explore two types of visualizations for plotting large datasets: two-dimensional hexagonal binned plots and three-dimensional hexagonal binned plots.
The scatter plot in Image 12 below visualizes two numerical features from 100,000 data points of a synthetic dataset. Since most of the data points overlap each other, we know only that the two numerical features have a positive correlation. That is, as the value on the x-axis increases, the value on the y-axis increases, and vice versa.
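The article's synthetic dataset itself is not reproduced here, so the sketch below generates a hypothetical stand-in: 100,000 positively correlated points concentrated near the origin, drawn as a plain scatter plot that suffers from the same overplotting.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-in for the synthetic dataset:
# 100,000 positively correlated points, densest near the origin.
rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=100_000)
y = x + rng.normal(scale=0.5, size=100_000)

# Plain scatter plot: most points overlap, hiding the density.
plt.scatter(x, y, s=2)
plt.xlabel("feature 1")
plt.ylabel("feature 2")
plt.show()
```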
However, it is almost impossible to determine where most of the data points are concentrated. So, let's leverage hexagonal binned plots: two-dimensional histogram plots in which the bins are hexagons and each bin's color represents the number of data points it contains.
We’ll test multiple grid sizes to understand their effect on the plot granularity. Images 13 through 15 below show different configurations of hexagonal binned plots with three different grid sizes: 10, 50, and 100 bins. As the number of bins per grid increases, the plot becomes more granular (the shape of the data becomes sharper), and the bins' sizes decrease along with the number of data points contained in each.
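Matplotlib's hexbin function makes this comparison straightforward; a sketch reusing the stand-in data from above and varying only the gridsize parameter:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=100_000)
y = x + rng.normal(scale=0.5, size=100_000)

# One hexagonal binned plot per grid size: more bins yield a sharper
# shape, smaller hexagons, and fewer points per bin.
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, gridsize in zip(axes, [10, 50, 100]):
    hb = ax.hexbin(x, y, gridsize=gridsize)
    ax.set_title(f"gridsize = {gridsize}")
    fig.colorbar(hb, ax=ax, label="count")
plt.show()
```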
On the left-hand hexagonal binned plots, those using a continuous linear color scale, the bins furthest from the center are entirely black. This is not helpful, because we cannot tell whether they contain zero data points or at least one.
So, we’ll use a logarithmic scale on the color range to address this issue.
Images 16 through 18 above show the same hexagonal binned plots from Images 13 through 15 using a logarithmic scale. These are significantly clearer, since only bins containing at least one data point were plotted (the minimum value on the scale is 10 to the power of 0, which equals 1), and they reveal the positive correlation within the data points. Furthermore, the highest concentration of data points is near the origin (0,0), and it dissolves as we move further from the origin.
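In Matplotlib, this corresponds to hexbin's bins="log" option; adding mincnt=1 drops the empty bins explicitly. Again, a sketch on the stand-in data rather than the article's exact code:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=100_000)
y = x + rng.normal(scale=0.5, size=100_000)

# bins="log" colors each hexagon on a logarithmic count scale, and
# mincnt=1 keeps only bins containing at least one data point.
hb = plt.hexbin(x, y, gridsize=50, bins="log", mincnt=1)
plt.colorbar(hb, label="count (log scale)")
plt.show()
```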
It is up to the analyst to determine the optimum number of bins per grid, the ideal numerical scale for a given analysis, and the specified granularity criteria.
Like two-dimensional hexagonal binned plots, which use position (x), position (y), and color, three-dimensional hexagonal binned plots can leverage four or more attributes, making them even easier to interpret and analyze.
For the following example, let us explore the NYC Uber Ridesharing Data dashboard – available here – which examines how the frequency of Uber pickups varies over time in New York City and at its major regional airports: LaGuardia, JFK, and Newark.
Images 19 through 22 show three-dimensional hexagonal binned plots of NYC's average Uber pickups from 17:00 to 18:00. Compared with the two-dimensional hexagonal binned plots, a fourth attribute was included: size (the height of the hexagonal bins). Like bins shaded a dark red color, taller bins account for a greater concentration of data points (more frequent Uber pickups).
Bins shaded a light green color and smaller in size account for a lower concentration of data points (less frequent Uber pickups). The combination of the color and size channels results in a three-dimensional hexagonal binned plot that is easy to interpret and analyze.
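Plots like these are commonly built with pydeck, whose HexagonLayer bins points on a map and extrudes each bin so that both color and height encode the count. A minimal sketch, assuming a hypothetical CSV of pickups with lon and lat columns (the real dashboard's data loading and parameters differ):

```python
import pandas as pd
import pydeck as pdk

# Hypothetical input: one row per Uber pickup, with its coordinates.
pickups = pd.read_csv("uber_pickups.csv")  # assumed columns: lon, lat

# HexagonLayer aggregates points into hexagonal bins and extrudes them,
# so color and height both encode the pickup count per bin.
layer = pdk.Layer(
    "HexagonLayer",
    data=pickups,
    get_position="[lon, lat]",
    radius=100,                 # bin radius in meters
    elevation_scale=4,
    elevation_range=[0, 1000],
    extruded=True,
)

view = pdk.ViewState(latitude=40.73, longitude=-73.98, zoom=10, pitch=50)
pdk.Deck(layers=[layer], initial_view_state=view).to_html("uber_hexbins.html")
```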
Suppose we had more numerical or categorical features. In that case, we could add texture to the bins or change their shape, for example.
Building effective data visualization plots for tall and wide datasets is a challenge for data analysts. Visual encoding can be leveraged to include more features in a single visualization plot to provide analysts with more information while keeping it simple to analyze and interpret.
For datasets of limited width (fewer than 10 columns), visual encodings such as position (x-, y-, and z-axis), color, size, shape, texture, orientation, and area can be used, as long as the visualization does not get congested. For very wide datasets (i.e., datasets with tens or hundreds of columns), where it is impossible to encode all features in a single visualization, analysts have to resort to dimensionality reduction techniques (e.g., principal component analysis, t-SNE) to reduce the width of the dataset and plot the new data points in two- or three-dimensional plots.
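As a brief illustration of that route, the sketch below projects the four iris features onto two principal components with scikit-learn and plots the result; iris stands in for a genuinely wide dataset, and the column names are those of the Plotly Express copy used earlier.

```python
import plotly.express as px
from sklearn.decomposition import PCA

df = px.data.iris()  # stand-in for a much wider dataset
features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# Project the feature columns onto two principal components,
# then plot the reduced data as an ordinary 2D scatter plot.
components = PCA(n_components=2).fit_transform(df[features])
df["pc1"] = components[:, 0]
df["pc2"] = components[:, 1]

fig = px.scatter(df, x="pc1", y="pc2", color="species")
fig.show()
```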
For tall datasets, analysts first need to determine whether a subset of the original data would work for a given analysis, reducing its height by extracting a sample. If not, they have to turn to some of the visual encoding techniques covered in this article, such as binning and the use of size and color scales for data point counts, to build effective visualization plots.