Tool of Thought

APL for the Practical Man

Charting and Tidy Data

September 28, 2023

Tidy data is mostly just an informal term for what in database theory is called 3rd (or there abouts) normal form. If data is not tidy, it is considered messy. Another pair of terms for the same concept, but less judgmental, is long-form and wide-form. Often wide-form or messy data is simply report data or secondary data. That is, it is the result of some report generated from the primary data in a DBMS. "Messy" in this case is in the eye of the beholder. Long-form data can also be composed of secondary data; it can be the result of a grouped query for example.

Altair strongly prefers long-form data, while plotly is happy to handle either format.

To explore this issue with SharpPlot we will use the well known 1930's barley experiment data set, beloved by statisticians. The data ranges over 6 farms, 10 varieties of barley, and 2 years, for a total of 120 observations on yields. Here are the first 10 rows:

Year Farm Variety Yield
1931University FarmManchuria 27.00
1931Waseca Manchuria 48.87
1931Morris Manchuria 27.43
1931Crookston Manchuria 39.93
1931Grand Rapids Manchuria 32.97
1931Duluth Manchuria 28.97
1931University FarmGlabron 43.07
1931Waseca Glabron 55.20
1931Morris Glabron 28.77
1931Crookston Glabron 38.13

Consider now aggregating the yield for farm and year, effectively eliminating the variety column. We can compute and display this as a multi-level grouping, grouping by the unique combinations of year and farm, and summing the yields:

Year Farm Yield
1931University Farm358.28
1931Waseca 543.47
1931Morris 292.88
1931Crookston 436.60
1931Grand Rapids 290.54
1931Duluth 302.94
1932University Farm295.07
1932Waseca 418.70
1932Morris 415.12
1932Crookston 311.79
1932Grand Rapids 208.09
1932Duluth 257.01

This is long-form data (despite being the result of a query). We can also compute and present the same data as a crosstab of yields by farm and year:

Farm 1931 1932
University Farm358.28295.07
Waseca 543.47418.70
Morris 292.88415.12
Crookston 436.60311.79
Grand Rapids 290.54208.09
Duluth 302.94257.01

This is wide-form data. In this particular case, we have the same number of columns, but if we had more years of data, the crosstab would get wider, and of course the multi-level grouping in long-form would get longer.

In the case of a bar chart, SharpPlot accepts the wide form data with a minimum of fuss. The DrawBarChart method accepts a vector of vectors (in this case, one for year 1931 and one for the year 1932). Playfair assumes the first column of data is the category and all remaining columns are y axis or quantitative values so the chart definition for the wide-form data above is simply:

     p←##.Main.New''
     p.ChartType←'BarChart'
     p.Heading←'Barley Yields from Wide-Form Data'
     p

and the result is:

Barley Yields from Wide-Form Data Created by Causeway SVG engine - SharpPlot v3.71.0 Paint the paper ===== Border ===== Y-axis labels 0 50 100 150 200 250 300 350 400 450 500 550 Heading, subheading and footnotes ===== Heading Barley Yields from Wide-Form Data Region ===== for X-axis labels University Farm Waseca Morris Crookston Grand Rapids Duluth X-Axis Ticks ===== X-Axis tickmarks Y-Axis Ticks ===== Y grid Y-Axis tickmarks Start of Barchart =========== Axes ===== Key ===== Block key Block key 1931 1932 Reset to original origin

There is nothing wrong with wide-form data as a report format, or as the result of applying a query to long-form data, particularly when we are immediately using it as the input to a charting function, and particularly for bar charts. The wide-form data, like the bar chart we are creating, makes it easy to compare values between years. The long-form data is not conducive to this. Furthermore, we may well want to display the graph and the data side by side, and there is no reason to compute the values twice, once in the DBMS and once in the chart library.

On the other hand, long-form data does not need placeholders for empty or missing or non-existent data. If there was no observation for Grand Rapids in 1931, the long-form data simply would not have this row. But the wide-form data needs a cell for it. This becomes important for scatter plots, where the x axis can have different values for different groups. We will explore this in a future post. SharpPlot will also accept long-form data, but we must also specify the GroupBy and SplitBy methods. While these are methods that take actual column values in SharpPlot, in Playfair we have converted them to properties that take the name of the column of the input table. Using the long-form data above we can define the chart as:

     p←##.Main.New''
     p.ChartType←'BarChart'
     p.Heading←'Barley Yields from Long-Form Data'
     p.Select←'Farm,Yield'
     p.GroupBy←'Farm'
     p.SplitBy←'Year'

Note that in Playfair we must set a Select property to specify the columns for the axes, if the the table has more columns or the columns are not in the proper order for the default behavior.

And this produces the exact same chart as the wide-form example:

Barley Yields from Long-Form Data Created by Causeway SVG engine - SharpPlot v3.71.0 Paint the paper ===== Border ===== Y-axis labels 0 50 100 150 200 250 300 350 400 450 500 550 Heading, subheading and footnotes ===== Heading Barley Yields from Long-Form Data Region ===== for X-axis labels University Farm Waseca Morris Crookston Grand Rapids Duluth X-Axis Ticks ===== X-Axis tickmarks Y-Axis Ticks ===== Y grid Y-Axis tickmarks Start of Barchart =========== Axes ===== Key ===== Block key Block key 1931 1932 Reset to original origin

Now let's look at the same chart in Vega-Altair code The data used here is the original, ungrouped barley data. The Altair code is:

import altair as alt
from vega_datasets import data
source = data.barley()
alt.Chart(source).mark_bar().encode(
    x='year:O',
    y='sum(yield):Q',
    color='year:N',
    column='site:N'
)

Which compiles to the Vega-Lite code:

{
  "config": {"view": {"continuousWidth": 300, "continuousHeight": 300}},
  "data": {"name": "data-093ece8c35bb2d41094cfb6138ec810b"},
  "mark": {"type": "bar"},
  "encoding": {
    "color": {"field": "year", "type": "nominal"},
    "column": {"field": "site", "type": "nominal"},
    "x": {"field": "year", "type": "ordinal"},
    "y": {"aggregate": "sum", "field": "yield", "type": "quantitative"}
  },..
   ...all the data inlined here...
   ...

I will spare you the Vega code that the Vega-Lite code compiles to.

Some observations:

  1. The data used here is the primary barley data, so the specification must

    include aggregating the yield column with a sum function.

  2. Both the Altair and Vega code show the abstract

    nature of the grammar of graphics.

  3. The x and y axes are explicitly specified, as opposed to SharpPlot's bar chart

    definition, and Playfair, which takes a table and assumes the first column is the categorical axis, the remaining columns are quantitative. (We could specify axes specifically in Playfair, but I'm not sure what benefit there is.)

  4. The x axis is specified as year, not site (farm). This seems a little strange.

    It appears Vega treats the chart as a collection of mini barcharts each with year as the axis, rather than one chart with site as the x axis, where each tick mark has multiple values.

  5. The color and column properties somehow specify the multi-level grouping by year

    and by site (farm). It's not clear to me at all why these terms are used or how exactly this works. I'm sure there is a good reason, but it is certainly... abstract.

Clearly Vega-Lite and Vega require a little more study.

Project Playfair

September 27, 2023

Playfair is a new project for exploring the features, pros, and cons of SharpPlot, and how it compares to the big popular projects like Vega-Lite, ggplot2 and plotly.

Playfair is also for exploring and what needs to be done to to cover SharpPlot to make it as easy and consistent to use as possible.

It should be noted the main concern is with statistics and business graphics, so the focus is on bar charts, line charts, and scatter plots. 3D plots and the like are not particularly relevant though SharpPlot clearly has some of that capability.

Also it is assumed that a sophisticated database and query language is available to produce the data ready to be plotted. There is little need for the chart library to summarize data, compute totals and averages, and do general analytics, even though SharpPlot and the rest can in fact do this to some degree. However, it can be useful to delegate splitting and grouping data to the chart library in certain circumstances.

There are three completely superficial (but to the casual observer often fatal) cons of SharpPlot which need to to be noted but dispensed with: the SharpPlot website/help, the Wizard provided in the Dyalog Windows IDE, and the default color scheme. The website and CHM help both display charts as fuzzy, out-of-focus, pixelated png files. (That's not good for a graphics package!) The wizard is unusable, the default color palette hideous.

A primary pro of SharpPlot is that it is delivered as a single dll (or even APL workspace) with no dependencies. It is hard to overestimate the benefit of this. This feature is so important that it can make up for many, many cons of the library that may exist. Vega-Lite is a JavaScript library that depends on Vega, which in turn depends on D3 and who knows how many other libraries. And of course it takes a web browser or some JavaScript interpreter to run it. Despite the existence of the HTMRenderer, a JavaScript solution is not ideal.

The Grammar of Graphics

The primary difference between SharpPlot and Vega-Lite and ggplot2 is that the latter two are based on The Grammar of Graphics. This means things are ... abstract. To see how abstract, this quote from the home page of ggplot2 should suffice:

It’s hard to succinctly describe how ggplot2 works because it embodies a deep philosophy of visualisation. However, in most cases you start with ggplot(), supply a dataset and aesthetic mapping (with aes()). You then add on layers (like geom_point() or geom_histogram()), scales (like scale_colour_brewer()), faceting specifications (like facet_wrap()) and coordinate systems (like coord_flip()).

Now there is nothing wrong with deep philosophy, abstractions and generalizations, and for those who design and create charts for a living, or better yet, those who design systems that design charts, having the power this ultimately delivers is no doubt a good thing. Plotly and SharpPlot both fly a little closer to the ground, and provide specific functions that directly create bar charts and scatter plots and so on. Of course, given a grammar of graphics, the first thing we want to do is completely hide it from our end users, so it remains to be seen if it really is an advantage for basic business graphics.

One potential con of SharpPlot compared to the other packages (those based on GoG or not), is that the SharpPlot draw methods do not take a consistent specification for the data on the x and y axes. For example, the SharpPlot DrawBarChart method takes a vector of values for the quantitativive axis, usually y. The categorical axis is not defined and must be manually labeled. To get a horizontal bar chart, the bar-chart-specific Horizontal property must be specified, instead of just switching the specification of the data feeding the x and y axes. A little more generalization would help here. Be that as it may, end users are probably more comfortable selecting chart type BarChart and then clicking a check box for a Horizontal option, without thinking much about axes.

In future posts we will be putting SharpPlot through its paces, comparing and contrasting with the other charting libraries.

Ordered Namespaces

September 26, 2023

It is useful to be able to retrieve the names and values in a namespace in the order in which they were assigned, especially when designing DSLs where users are writing scripts. For example, we may want to assign vectors into a namespace and then display the vectors as columns in an Excel spreadsheet in the same order.

One way to do this is with global triggers. Here is QNSO, for quad namespace, ordered:

QNSO←{
     s←⎕NS''
     s.∆nl←''
     f←⊂'∆T x' ':Implements Trigger *' '{}2007⌶1' '∆nl,←⊂x.Name' '{}2007⌶0'
     f,←⊂'z←∆NL' 'z←⌽∪⌽∆nl∩⎕NL ¯2'
     s⊣s.⎕FX¨f
 }

Now we can do:

      s←QNSO ''
      s.Zebras←5
      s.Apples←8
      s.Eggs←12
      s.∆NL
 Zebras  Apples  Eggs 
      s.Zebras←7
      s.⎕EX 'Apples'
      s.∆NL
 Eggs  Zebras 

The trigger function ∆T is called on every assignment to a global variable in the namespace. The assigned names are accumulated in the variable ∆nl. We have to be careful to use 2007⌶ to turn global triggers off and then back on, or else we will end up in an endless loop.

More posts...