Tool of Thought

APL for the Practical Man

Charting and Tidy Data

September 28, 2023

Tidy data is mostly just an informal term for what in database theory is called 3rd (or there abouts) normal form. If data is not tidy, it is considered messy. Another pair of terms for the same concept, but less judgmental, is long-form and wide-form. Often wide-form or messy data is simply report data or secondary data. That is, it is the result of some report generated from the primary data in a DBMS. "Messy" in this case is in the eye of the beholder. Long-form data can also be composed of secondary data; it can be the result of a grouped query for example.

Altair strongly prefers long-form data, while plotly is happy to handle either format.

To explore this issue with SharpPlot we will use the well known 1930's barley experiment data set, beloved by statisticians. The data ranges over 6 farms, 10 varieties of barley, and 2 years, for a total of 120 observations on yields. Here are the first 10 rows:

Year Farm Variety Yield
1931University FarmManchuria 27.00
1931Waseca Manchuria 48.87
1931Morris Manchuria 27.43
1931Crookston Manchuria 39.93
1931Grand Rapids Manchuria 32.97
1931Duluth Manchuria 28.97
1931University FarmGlabron 43.07
1931Waseca Glabron 55.20
1931Morris Glabron 28.77
1931Crookston Glabron 38.13

Consider now aggregating the yield for farm and year, effectively eliminating the variety column. We can compute and display this as a multi-level grouping, grouping by the unique combinations of year and farm, and summing the yields:

Year Farm Yield
1931University Farm358.28
1931Waseca 543.47
1931Morris 292.88
1931Crookston 436.60
1931Grand Rapids 290.54
1931Duluth 302.94
1932University Farm295.07
1932Waseca 418.70
1932Morris 415.12
1932Crookston 311.79
1932Grand Rapids 208.09
1932Duluth 257.01

This is long-form data (despite being the result of a query). We can also compute and present the same data as a crosstab of yields by farm and year:

Farm 1931 1932
University Farm358.28295.07
Waseca 543.47418.70
Morris 292.88415.12
Crookston 436.60311.79
Grand Rapids 290.54208.09
Duluth 302.94257.01

This is wide-form data. In this particular case, we have the same number of columns, but if we had more years of data, the crosstab would get wider, and of course the multi-level grouping in long-form would get longer.

In the case of a bar chart, SharpPlot accepts the wide form data with a minimum of fuss. The DrawBarChart method accepts a vector of vectors (in this case, one for year 1931 and one for the year 1932). Playfair assumes the first column of data is the category and all remaining columns are y axis or quantitative values so the chart definition for the wide-form data above is simply:

     p←##.Main.New''
     p.ChartType←'BarChart'
     p.Heading←'Barley Yields from Wide-Form Data'
     p

and the result is:

Barley Yields from Wide-Form Data Created by Causeway SVG engine - SharpPlot v3.71.0 Paint the paper ===== Border ===== Y-axis labels 0 50 100 150 200 250 300 350 400 450 500 550 Heading, subheading and footnotes ===== Heading Barley Yields from Wide-Form Data Region ===== for X-axis labels University Farm Waseca Morris Crookston Grand Rapids Duluth X-Axis Ticks ===== X-Axis tickmarks Y-Axis Ticks ===== Y grid Y-Axis tickmarks Start of Barchart =========== Axes ===== Key ===== Block key Block key 1931 1932 Reset to original origin

There is nothing wrong with wide-form data as a report format, or as the result of applying a query to long-form data, particularly when we are immediately using it as the input to a charting function, and particularly for bar charts. The wide-form data, like the bar chart we are creating, makes it easy to compare values between years. The long-form data is not conducive to this. Furthermore, we may well want to display the graph and the data side by side, and there is no reason to compute the values twice, once in the DBMS and once in the chart library.

On the other hand, long-form data does not need placeholders for empty or missing or non-existent data. If there was no observation for Grand Rapids in 1931, the long-form data simply would not have this row. But the wide-form data needs a cell for it. This becomes important for scatter plots, where the x axis can have different values for different groups. We will explore this in a future post. SharpPlot will also accept long-form data, but we must also specify the GroupBy and SplitBy methods. While these are methods that take actual column values in SharpPlot, in Playfair we have converted them to properties that take the name of the column of the input table. Using the long-form data above we can define the chart as:

     p←##.Main.New''
     p.ChartType←'BarChart'
     p.Heading←'Barley Yields from Long-Form Data'
     p.Select←'Farm,Yield'
     p.GroupBy←'Farm'
     p.SplitBy←'Year'

Note that in Playfair we must set a Select property to specify the columns for the axes, if the the table has more columns or the columns are not in the proper order for the default behavior.

And this produces the exact same chart as the wide-form example:

Barley Yields from Long-Form Data Created by Causeway SVG engine - SharpPlot v3.71.0 Paint the paper ===== Border ===== Y-axis labels 0 50 100 150 200 250 300 350 400 450 500 550 Heading, subheading and footnotes ===== Heading Barley Yields from Long-Form Data Region ===== for X-axis labels University Farm Waseca Morris Crookston Grand Rapids Duluth X-Axis Ticks ===== X-Axis tickmarks Y-Axis Ticks ===== Y grid Y-Axis tickmarks Start of Barchart =========== Axes ===== Key ===== Block key Block key 1931 1932 Reset to original origin

Now let's look at the same chart in Vega-Altair code The data used here is the original, ungrouped barley data. The Altair code is:

import altair as alt
from vega_datasets import data
source = data.barley()
alt.Chart(source).mark_bar().encode(
    x='year:O',
    y='sum(yield):Q',
    color='year:N',
    column='site:N'
)

Which compiles to the Vega-Lite code:

{
  "config": {"view": {"continuousWidth": 300, "continuousHeight": 300}},
  "data": {"name": "data-093ece8c35bb2d41094cfb6138ec810b"},
  "mark": {"type": "bar"},
  "encoding": {
    "color": {"field": "year", "type": "nominal"},
    "column": {"field": "site", "type": "nominal"},
    "x": {"field": "year", "type": "ordinal"},
    "y": {"aggregate": "sum", "field": "yield", "type": "quantitative"}
  },..
   ...all the data inlined here...
   ...

I will spare you the Vega code that the Vega-Lite code compiles to.

Some observations:

  1. The data used here is the primary barley data, so the specification must

    include aggregating the yield column with a sum function.

  2. Both the Altair and Vega code show the abstract

    nature of the grammar of graphics.

  3. The x and y axes are explicitly specified, as opposed to SharpPlot's bar chart

    definition, and Playfair, which takes a table and assumes the first column is the categorical axis, the remaining columns are quantitative. (We could specify axes specifically in Playfair, but I'm not sure what benefit there is.)

  4. The x axis is specified as year, not site (farm). This seems a little strange.

    It appears Vega treats the chart as a collection of mini barcharts each with year as the axis, rather than one chart with site as the x axis, where each tick mark has multiple values.

  5. The color and column properties somehow specify the multi-level grouping by year

    and by site (farm). It's not clear to me at all why these terms are used or how exactly this works. I'm sure there is a good reason, but it is certainly... abstract.

Clearly Vega-Lite and Vega require a little more study.