Now that we know how to make attractive charts in SharpPlot, the next step is to add interactivity. SharpPlot has a brief tutorial on this topic, and provides various methods for making charts interactive. The AddHyperlinks method will add a hyperlink to any bar or point in a chart. The AddAttributes method allows an arbitrary attribute and value to be inserted into various elements.
However, many of the techniques involved are outdated given where CSS and SVG are now, and the existence of the HTMLRenderer. In addition, using SharpPlot itself to add interactivity might be useful if we needed to support multiple output formats, but our only concern is SVG. All we need is to be able to identify and address the elements of interest. One way to accomplish this is to use the AddAttributes method to add an id to the elements. Unfortunately, AddAttributes adds an additional <rect> element for every <text> element (and then adds its own id as well). For example, here is a snippet of SVG from a basic bar chart:
I'm sure there was a reason in the past for having the <rect> element, probably just to apply the pointer-visible attribute, but I don't think there is any need for it today. It gets in the way of, say, making the text of one x-axis value bold using CSS. We need to identify the <text> element, not some associated <rect> element.
Luckily, we can use Abacus to create an APL DOM from the SVG text emitted by SharpPlot. Then we can easily manipulate elements, add attributes, and so on. The problem is that the SVG is full of largely unidentifiable <text> and <rect> elements. But there are comments embedded using the <desc> element, as can be seen above. We can do some crude coding and more or less find out where things are. For example, here is a function that identifies the basic elements of a single-series bar chart, adding id and class attributes:
⍝ ⍵ ←→ DOM
⍝ Crude technique that relies on comments
⍝ Will not work if AddAttributes is used in certain circumstances
⍝ ... as additional elements are inserted.
⍝ Works only on basic bar chart with one series
n←'xlabel' 'ylabel' 'value' 'point'
v←'for X-axis labels' 'Y-axis labels' 'Data value labels ...'('Start of Barchart ',11⍴'=')
Now we can easily identify and manipulate all the relevant elements. (Of course SharpPlot knows exactly where and what everything is when it generates the SVG, and it would be much better if it added the id and class attributes itself.) Now we can construct a bar chart that operates like a pick list, allowing the user to scroll up and down, highlighting the current selection by placing a border around the bar and bolding and increasing the font size of the associated labels:
Note that if you inspect the source of this chart, it is not as it would appear in an application. Here, for convenience in a static web site, we simply do the highlighting by using the style attribute. In an application, classes are used with external style sheets. Scrolling up and down will change the class of the bar, for example, from unselected to selected.
Consider getting a useful first impression and understanding of a single column in a database table (or a vector of values, all of the same type). If there are only a few unique values in the column, say a dozen or fewer, then a frequency distribution is appropriate. We get an immediate, informative overview of the data, regardless of the type. This is easily displayed in a bar chart. Here we have the distribution of stints for major league baseball players in 2019. A stint is a period of time with a particular team. We can see that most players spent the entire season with one team, while 12 players played for 3 teams:
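In APL, the key operator makes a frequency distribution a one-liner. A minimal sketch, where the stints vector is made-up sample data rather than the actual 2019 figures:

```apl
      stints←1 1 2 1 3 1 2 1 1 3 1   ⍝ hypothetical sample data
      {⍺,≢⍵}⌸stints                  ⍝ each unique value with its count
1 7
2 2
3 2
```

The rows come out in order of first appearance; grade the first column if a sorted axis is wanted for the chart.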
However, as the number of unique values grows, a frequency distribution becomes less and less useful. When every value is unique, the distribution degenerates into the entire original column catenated with a vector of 1's. For quantitative or temporal data, this problem is easily solved by grouping into bins or buckets, reducing the number of categories. Here we have the number of games played per player for 2019:
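For the binning itself, dyadic ⍸ (interval index) does the work. A sketch with hypothetical games-played values and bucket boundaries:

```apl
      games←12 45 3 160 88 27 101 150 66 9   ⍝ hypothetical games-played values
      bins←0 20 40 60 80 100 120 140 160     ⍝ bucket boundaries
      bins⍸games                             ⍝ index of the bucket each value falls in
1 3 1 9 5 2 6 8 4 1
      {⍺,≢⍵}⌸bins⍸games                      ⍝ frequency distribution over the buckets
```

Substituting bucket indices for the raw values reduces the categories to a handful, and the first-order distribution is useful again.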
However, if the data is categorical, it is generally not possible to meaningfully group the data. One option is to produce a frequency distribution that shows only the top 10 (say) categories, grouping the remainder into an "other" category. This works well when there are many categories, and the categories are of varying sizes, keeping the "other" category relatively small:
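A sketch of such a function (topN is a hypothetical name, not part of SharpPlot or Abacus): the left argument is how many categories to keep, and everything else is lumped into "other":

```apl
topN←{
    cats counts←↓⍉{⍺(≢⍵)}⌸⍵            ⍝ unique values and their counts
    ix←⍒counts                          ⍝ order by descending frequency
    cats counts←(cats[ix])(counts[ix])
    ⍺≥≢cats:cats,⍪counts                ⍝ few enough categories: nothing to lump
    ((⍺↑cats),⊂'other'),⍪(⍺↑counts),+/⍺↓counts
}
```

Something like `10 topN teamIDs` would then yield the top ten teams by player count plus a catch-all "other" row, ready to plot.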
If there are many categories and they are similar-sized, this breaks down. Here we have a distribution of the PlayerID column, which is mostly unique, except for players who have had multiple stints in the season:
What, then, is to be done in the case of a categorical column with many evenly distributed unique values? If a frequency distribution is inadequate, how about a frequency distribution of the frequency distribution? That is, a table displaying the number of values that occur once, the number of values that occur twice, the number of values that occur three times, etc.:
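This is just the key operator applied twice: once to get the counts, and once more to get the distribution of those counts. A minimal sketch on a small character vector:

```apl
      sofd←{{⍺,≢⍵}⌸{≢⍵}⌸⍵}   ⍝ frequency distribution of the counts themselves
      sofd 'abacdbe'          ⍝ a and b occur twice; c, d, e occur once
2 2
1 3
```

The key-column check falls out of the same idea: the values are suitable for a key exactly when every count is 1, i.e. `(≢∪⍵)=≢⍵`.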
This table is a much more useful first look at high-variance categorical data. For example, it is immediately apparent if the values are unique and suitable for a key column. It is easy to identify outliers, that is, duplicate or triplicate values. Let's call this a second-order frequency distribution.
By inspection we can tell whether a first-order or second-order distribution will be more useful, and come up with some back-of-the-envelope algorithm to make the choice, which may well be sufficient. But is there a way to actually compute the variance of a categorical column and use that measure to determine what exactly is "high-variance" categorical data? That question and some APL code will be explored in a future post.
A recent episode of Array Cast discussed high-rank arrays and the concept of named axes. (I vaguely remembered that Brother Steve had written a paper on this very topic some 20-odd years ago, and sure enough he had.) One application of high-rank arrays covered in the podcast was essentially multidimensional OLAP. That is, the construction of high-rank arrays from some other source, usually a relational database, precomputing ranges, bins, categories, and various measures. For example, in some health care application, you might end up with a rank-6 array that contains record counts by age, sex, marital status, income range, smoker, and drinks-per-day. This application of high-rank arrays is a very bad idea that has been around a long time and, like many bad ideas, refuses to die.
Voluminous books have been written, endless jargon coined, complex software designed, and vast fortunes made consulting, all selling this bill of goods. All because SQL is viewed as virtually synonymous with the relational database model, and the vast majority of commercial RDBMS implementations use row-based storage. In other words, it's hard and time-consuming to write analytic queries that require full table scans in Oracle or SQL Server, and then they run slowly. This is no reason to jettison the relational database model.
The name "multidimensional" in this context is completely misleading. It implies that the relational model is somehow not multidimensional. But of course it is. The very fact that an OLAP "data cube" can be constructed from a single relational database table proves this. It is true that a table or matrix has two dimensions, rows and columns. But it is also true that a vector of length n represents a point in n-space. And a matrix is just a collection of row vectors, or a collection of points in n-space. So what then is a matrix? Is it a two-dimensional thing or an n-dimensional thing? The answer is that it is both, often at the same time; it simply depends on what's in it, and how we interpret it. But whatever is in it, and however we interpret it, it never ever makes sense to explode the thing into a high-rank array, taking a lot of time and trouble in the process, losing tons of information along the way, and blowing out your workspace to boot.
Also discussed in the podcast was another idea, at best equally bad, probably worse: high-depth arrays and dependent types as an alternative to the relational model. The example presented was storing data about planets and their moons, where a table of planets has a moons column that contains a nested table of moons for each planet. This leads to more complexity, more difficulty in querying, less efficient storage, and all sorts of other problems. Imagine you then want to store the elements that have been found on each moon. You can see where that leads, and it's nowhere good. The relational model solved this problem simply and elegantly something like 60 years ago.
With the right design (column-oriented) and a query language based on APL, the relational model is the right way to organize your data in the examples above. Like the core concepts of APL, the relational model cannot be improved upon. High-rank arrays and nested tables are useful and appropriate in some cases, but not here.