A recent episode of the Array Cast podcast discussed high-rank arrays and the concept of named axes. (I vaguely remembered that Brother Steve had done a paper on this very topic some 20-odd years ago, and sure enough he did.) One application of high-rank arrays covered in the podcast was essentially multidimensional OLAP: the construction of high-rank arrays from some other source, usually a relational database, precomputing ranges, bins, categories, and various measures. For example, in some health care application, you might end up with a rank-6 array that contains record counts by age, sex, marital status, income range, smoker, and drinks-per-day. This application of high-rank arrays is a very bad idea that has been around a long time and, like many bad ideas, refuses to die.
Voluminous books have been written, endless jargon coined, complex software designed, and vast fortunes made consulting, all selling this bill of goods. All because SQL is viewed as virtually synonymous with the relational database model, and the vast majority of commercial RDBMS implementations use row-based storage. In other words, it is hard and time-consuming to write analytic queries that require full table scans in Oracle or SQL Server, and when you do write them, they run slowly. This is no reason to jettison the relational database model.
The name "multidimensional" in this context is completely misleading. It implies that the relational model is somehow not multidimensional. But of course it is. The very fact that an OLAP "data cube" can be constructed from a single relational database table proves this. It is true that a table or matrix has two dimensions, rows and columns. But it is also true that a vector of length n represents a point in n-space, and a matrix is just a collection of row vectors, or a collection of points in n-space. So what then is a matrix? Is it a two-dimensional thing or an n-dimensional thing? The answer is that it is both, often at the same time; it simply depends on what is in it and how we interpret it. But whatever is in it, and however we interpret it, it never makes sense to explode the thing into a high-rank array, taking a lot of time and trouble in the process, losing tons of information along the way, and blowing out your workspace to boot.
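To make the cost concrete, here is a minimal sketch in Python with numpy and pandas (made-up data; the examples later in this post use APL). Even with just three categorical axes, the dense array allocates a cell for every possible combination of categories, while the relational table stores only what was observed; with six axes, as in the health care example, the blowup is far worse:

```python
import numpy as np
import pandas as pd

# Made-up data: 1,000 "patient" records with three categorical
# columns. Relational storage is proportional to the number of
# actual records.
rng = np.random.default_rng(0)
n = 1_000
records = pd.DataFrame({
    'age_range':      rng.integers(0, 10, n),   # 10 bins
    'income_range':   rng.integers(0, 20, n),   # 20 bins
    'drinks_per_day': rng.integers(0, 15, n),   # 15 bins
})

# "Exploding" the table into a dense rank-3 array of counts
# allocates a cell for every possible combination of categories,
# observed or not.
cube = np.zeros((10, 20, 15))
np.add.at(cube, (records['age_range'].to_numpy(),
                 records['income_range'].to_numpy(),
                 records['drinks_per_day'].to_numpy()), 1)

print(cube.size)           # 3,000 cells to hold 1,000 records
print((cube == 0).sum())   # at least 2,000 of them are empty placeholders
```

Since only 1,000 records exist, at most 1,000 of the 3,000 cells can be non-empty; the rest are pure placeholder overhead, and each additional axis multiplies it.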
Also discussed in the podcast was another idea, at best equally bad and probably worse: high-depth arrays and dependent types as an alternative to the relational model. The example presented was storing data about planets and their moons, where a table of planets has a moons column that contains nested tables of moons for each planet. This leads to more complexity, more difficulty in querying, more problems in efficient storage, and all sorts of other problems. Imagine you then want to store the elements that have been found on each moon. You can see where that leads, and it's nowhere good. The relational model solved this problem simply and elegantly something like 60 years ago.
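As a sketch of the relational solution (in Python with pandas, with hypothetical sample data): instead of nesting a table of moons inside each planet row, moons become a second flat table with a planet key, and elements found on each moon become a third flat table with a moon key, with no nesting at any depth:

```python
import pandas as pd

# Hypothetical sample data. Each level of the hierarchy is just
# another flat table with a foreign key, not a deeper nesting.
planets = pd.DataFrame({
    'planet':  ['Earth', 'Mars', 'Jupiter'],
    'mass_kg': [5.97e24, 6.42e23, 1.90e27],
})
moons = pd.DataFrame({
    'planet': ['Earth', 'Mars', 'Mars', 'Jupiter', 'Jupiter'],
    'moon':   ['Moon', 'Phobos', 'Deimos', 'Io', 'Europa'],
})
elements = pd.DataFrame({
    'moon':    ['Moon', 'Io', 'Europa'],
    'element': ['Oxygen', 'Sulfur', 'Oxygen'],
})

# Queries are ordinary joins, to any depth of the hierarchy.
found = planets.merge(moons).merge(elements)
print(found[['planet', 'moon', 'element']])
```

Adding a fourth level (say, isotopes per element) is just one more table and one more join, while the nested-table design forces ever deeper and more awkward structures.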
With the right design (column oriented), and a query language based on APL, the relational model is the right way to organize your data in the examples above. Like the core concepts of APL, the relational model cannot be improved upon. High-rank arrays and nested tables are useful and appropriate in some cases, but not here.
Tidy data is mostly just an informal term for what database theory calls third normal form (or thereabouts). If data is not tidy, it is considered messy. A less judgmental pair of terms for the same distinction is long-form and wide-form. Often wide-form or messy data is simply report data or secondary data; that is, it is the result of some report generated from the primary data in a DBMS. "Messy" in this case is in the eye of the beholder. Long-form data can also be composed of secondary data; it can be the result of a grouped query, for example.
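As a small illustration (a pandas sketch with made-up numbers, not part of the APL toolchain discussed in this post), the same facts can be held in either form, and reshaping between them is mechanical:

```python
import pandas as pd

# Made-up sales data in wide form: one row per product,
# one column per year.
wide = pd.DataFrame({
    'Product': ['A', 'B'],
    '2023': [10, 20],
    '2024': [30, 40],
})

# melt reshapes to long form: one row per observation.
long = wide.melt(id_vars='Product', var_name='Year', value_name='Sales')
# -> four rows: (A, 2023, 10), (B, 2023, 20), (A, 2024, 30), (B, 2024, 40)
```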
Altair strongly prefers long-form data, while plotly is happy to handle either format.
To explore this issue with SharpPlot we will use the well-known 1930s barley experiment data set, beloved by statisticians. The data ranges over 6 farms, 10 varieties of barley, and 2 years, for a total of 120 observations on yields. Here are the first 10 rows:
Year  Farm             Variety    Yield
1931  University Farm  Manchuria  27.00
1931  Waseca           Manchuria  48.87
1931  Morris           Manchuria  27.43
1931  Crookston        Manchuria  39.93
1931  Grand Rapids     Manchuria  32.97
1931  Duluth           Manchuria  28.97
1931  University Farm  Glabron    43.07
1931  Waseca           Glabron    55.20
1931  Morris           Glabron    28.77
1931  Crookston        Glabron    38.13
Consider now aggregating the yield by farm and year, effectively eliminating the variety column. We can compute and display this as a multi-level grouping, grouping by the unique combinations of year and farm and summing the yields:
Year  Farm             Yield
1931  University Farm  358.28
1931  Waseca           543.47
1931  Morris           292.88
1931  Crookston        436.60
1931  Grand Rapids     290.54
1931  Duluth           302.94
1932  University Farm  295.07
1932  Waseca           418.70
1932  Morris           415.12
1932  Crookston        311.79
1932  Grand Rapids     208.09
1932  Duluth           257.01
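In pandas terms (a sketch for comparison; the pipeline in this post is APL and a DBMS), the multi-level grouping over the ten rows shown earlier is a groupby over Year and Farm with a sum:

```python
import pandas as pd

# The first ten barley rows shown above.
barley = pd.DataFrame({
    'Year':    [1931] * 10,
    'Farm':    ['University Farm', 'Waseca', 'Morris', 'Crookston',
                'Grand Rapids', 'Duluth', 'University Farm', 'Waseca',
                'Morris', 'Crookston'],
    'Variety': ['Manchuria'] * 6 + ['Glabron'] * 4,
    'Yield':   [27.00, 48.87, 27.43, 39.93, 32.97, 28.97,
                43.07, 55.20, 28.77, 38.13],
})

# Group by the unique (Year, Farm) combinations and sum the yields,
# eliminating the Variety column.
totals = barley.groupby(['Year', 'Farm'], as_index=False)['Yield'].sum()
print(totals)
```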
This is long-form data (despite being the result of a query). We can also compute and present the same data as a crosstab of yields by farm and year:
Farm             1931    1932
University Farm  358.28  295.07
Waseca           543.47  418.70
Morris           292.88  415.12
Crookston        436.60  311.79
Grand Rapids     290.54  208.09
Duluth           302.94  257.01
This is wide-form data. In this particular case, we have the same number of columns, but if we had more years of data, the crosstab would get wider, and of course the multi-level grouping in long-form would get longer.
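The relationship between the two forms can be sketched in pandas (an illustration only, using a few rows of the aggregate above, with the 1931 Grand Rapids row deliberately omitted to show what happens to missing observations):

```python
import pandas as pd

# A few rows of the long-form aggregate, with the (1931, Grand Rapids)
# observation deliberately omitted.
long_form = pd.DataFrame({
    'Year':  [1931, 1931, 1932, 1932, 1932],
    'Farm':  ['University Farm', 'Waseca', 'University Farm',
              'Waseca', 'Grand Rapids'],
    'Yield': [358.28, 543.47, 295.07, 418.70, 208.09],
})

# Pivoting to wide form allocates a cell for every Farm x Year
# combination; the missing observation surfaces as NaN.
wide = long_form.pivot(index='Farm', columns='Year', values='Yield')
print(wide)
```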
In the case of a bar chart, SharpPlot accepts the wide-form data with a minimum of fuss. The DrawBarChart method accepts a vector of vectors (in this case, one for the year 1931 and one for the year 1932). Playfair assumes the first column of data is the category and all remaining columns are y-axis or quantitative values, so the chart definition for the wide-form data above is simply:
p←##.Main.New''
p.ChartType←'BarChart'
p.Heading←'Barley Yields from Wide-Form Data'
p
and the result is:
There is nothing wrong with wide-form data as a report format, or as the result of applying a query to long-form data, particularly when we are immediately using it as the input to a charting function, and particularly for bar charts. The wide-form data, like the bar chart we are creating, makes it easy to compare values between years. The long-form data is not conducive to this. Furthermore, we may well want to display the graph and the data side by side, and there is no reason to compute the values twice, once in the DBMS and once in the chart library.
On the other hand, long-form data does not need placeholders for empty, missing, or non-existent data. If there were no observation for Grand Rapids in 1931, the long-form data would simply not have that row, but the wide-form data would still need a cell for it. This becomes important for scatter plots, where the x axis can have different values for different groups. We will explore this in a future post.

SharpPlot will also accept long-form data, but then we must specify the GroupBy and SplitBy methods. While these are methods that take actual column values in SharpPlot, in Playfair we have converted them to properties that take the name of a column of the input table. Using the long-form data above we can define the chart as:
p←##.Main.New''
p.ChartType←'BarChart'
p.Heading←'Barley Yields from Long-Form Data'
p.Select←'Farm,Yield'
p.GroupBy←'Farm'
p.SplitBy←'Year'
Note that in Playfair we must set a Select property to specify the columns for the axes if the table has more columns, or if the columns are not in the proper order for the default behavior.
And this produces the exact same chart as the wide-form example:
Now let's look at the same chart in Vega-Altair. The data used here is the original, ungrouped barley data. The Altair code is:
import altair as alt
from vega_datasets import data

source = data.barley()

alt.Chart(source).mark_bar().encode(
    x='year:O',
    y='sum(yield):Q',
    color='year:N',
    column='site:N'
)
I will spare you the Vega code that the Vega-Lite code compiles to.
Some observations:

- The data used here is the primary barley data, so the specification must include aggregating the yield column with a sum function.
- Both the Altair and Vega code show the abstract nature of the grammar of graphics.
- The x and y axes are explicitly specified, as opposed to SharpPlot's bar chart definition and Playfair, which take a table and assume the first column is the categorical axis and the remaining columns are quantitative. (We could specify axes explicitly in Playfair, but I'm not sure what benefit there would be.)
- The x axis is specified as year, not site (farm). This seems a little strange. It appears Vega treats the chart as a collection of mini bar charts, each with year as the x axis, rather than one chart with site as the x axis where each tick mark has multiple values.
- The color and column properties somehow specify the multi-level grouping by year and by site (farm). It's not clear to me at all why these terms are used or how exactly this works. I'm sure there is a good reason, but it is certainly... abstract.
Clearly Vega-Lite and Vega require a little more study.
Playfair is a new project for exploring the features, pros, and cons of SharpPlot, and how it compares to the big popular projects like Vega-Lite, ggplot2 and plotly.
Playfair is also for exploring what needs to be done to cover SharpPlot and make it as easy and consistent to use as possible.
It should be noted that the main concern is statistics and business graphics, so the focus is on bar charts, line charts, and scatter plots. 3D plots and the like are not particularly relevant, though SharpPlot clearly has some capability in that area.
Also it is assumed that a sophisticated database and query language is available to produce the data ready to be plotted. There is little need for the chart library to summarize data, compute totals and averages, and do general analytics, even though SharpPlot and the rest can in fact do this to some degree. However, it can be useful to delegate splitting and grouping data to the chart library in certain circumstances.
There are three completely superficial (but to the casual observer often fatal) cons of SharpPlot which need to be noted but can then be dispensed with: the SharpPlot website and help, the Wizard provided in the Dyalog Windows IDE, and the default color scheme. The website and the CHM help both display charts as fuzzy, out-of-focus, pixelated PNG files. (That's not good for a graphics package!) The wizard is unusable, and the default color palette is hideous.
A primary pro of SharpPlot is that it is delivered as a single DLL (or even an APL workspace) with no dependencies. It is hard to overstate the benefit of this. This feature is so important that it can make up for many, many cons the library may have. Vega-Lite is a JavaScript library that depends on Vega, which in turn depends on D3 and who knows how many other libraries. And of course it takes a web browser or some JavaScript interpreter to run it. Despite the existence of the HTMLRenderer, a JavaScript solution is not ideal.
The Grammar of Graphics
The primary difference between SharpPlot on the one hand, and Vega-Lite and ggplot2 on the other, is that the latter two are based on the Grammar of Graphics. This means things are ... abstract. To see how abstract, this quote from the home page of ggplot2 should suffice:
It’s hard to succinctly describe how ggplot2 works because it embodies a deep philosophy of visualisation. However, in most cases you start with ggplot(), supply a dataset and aesthetic mapping (with aes()). You then add on layers (like geom_point() or geom_histogram()), scales (like scale_colour_brewer()), faceting specifications (like facet_wrap()) and coordinate systems (like coord_flip()).
Now there is nothing wrong with deep philosophy, abstractions and generalizations, and for those who design and create charts for a living, or better yet, those who design systems that design charts, having the power this ultimately delivers is no doubt a good thing. Plotly and SharpPlot both fly a little closer to the ground, and provide specific functions that directly create bar charts and scatter plots and so on. Of course, given a grammar of graphics, the first thing we want to do is completely hide it from our end users, so it remains to be seen if it really is an advantage for basic business graphics.
One potential con of SharpPlot compared to the other packages (whether based on the Grammar of Graphics or not) is that the SharpPlot draw methods do not take a consistent specification for the data on the x and y axes. For example, the SharpPlot DrawBarChart method takes a vector of values for the quantitative axis, usually y. The categorical axis is not defined and must be manually labeled. To get a horizontal bar chart, the bar-chart-specific Horizontal property must be set, instead of simply switching the specification of the data feeding the x and y axes. A little more generalization would help here. Be that as it may, end users are probably more comfortable selecting chart type BarChart and then clicking a check box for a Horizontal option, without thinking much about axes.
In future posts we will be putting SharpPlot through its paces, comparing and contrasting with the other charting libraries.