Elias MÃ¥rtenson posts about a query in R and compares it to his APL-inspired language Kap. Elias of course provides a succint, APL-style solution in Kap. Here we take a look at the R solution and compare it to FlipDB, a relational database management system written in APL. The problem starts with the following table:
| Country | Amount | Discount |
|---|
USA | 2,000 | 10 |
USA | 3,500 | 15 |
USA | 3,000 | 20 |
Canada | 120 | 12 |
Canada | 180 | 18 |
Canada | 3,100 | 21 |
UK | 130 | 13 |
UK | 160 | 16 |
| ... | | |
Given this, the task is to produce a table of total discounted sales by country, excluding outliers defined as entries in each country with an amount greater than 10 times the median for that particular country. The result we are looking for is:
| Country | Total |
|---|
Australia | 540 |
Brazil | 414 |
Canada | 270 |
France | 450 |
Germany | 513 |
India | 648 |
Italy | 567 |
Japan | 621 |
Spain | 594 |
UK | 432 |
USA | 8,455 |
We thus have a row-dependent where clause. That is, whether or not a row is included depends on other rows in the table. More precisely, whether or not a row is included in a group depends on the other rows in the group. This is the crux of the matter, and what makes the query problematic in some tools.
The R solution is given as:
purchases |>
group_by(country) |>
filter(amount <= median(amount) * 10) |>
summarize(total = sum(amount - discount))
As the original post suggests, this is indeed a nice solution. There is a filter, or where clause, that occurs after grouping. In APL terms, we can visualize the reference to amount being a nested array and median being an aggregate function applied with an each, and scalar extension applying to yield a nested Boolean selection vector. Who knows how it is executed under the covers, but that's how it looks.
How is this solved in FlipDB? We define the query as:
| GroupBy: | Name | Expression |
|---|
| | Country | Country |
| Measures: | Name | Expression |
|---|
| | Total | sum (Amount - Discount) where Amount <= 10 * median Amount |
| OrderBy: | Name | Direction |
|---|
| | Country | Up |
(Note that we have to explicitly specify the ordering which seems to be a default in R. We don't show the OrderBy clause in the queries below in the interest of space.)
This is remarkably similar to the R solution. The difference is that we are filtering in-line, just for one column in the result set, whereas the R solution is filtering the entire table. In-line filtering is useful because we can apply different filters, or no filter at all, for different columns in the result table. For example, it might be useful to display the totals without excluding the outliers side-by-side for comparison which can be done by just adding another measure:
| Measures: | Name | Expression |
|---|
| | Total | sum (Amount - Discount) where Amount <= 10 * median Amount |
| | TotalAll | sum Amount - Discount |
yielding:
| Country | Total | TotalAll | |
|---|
Australia | 540 | 540 | |
Brazil | 414 | 414 | |
Canada | 270 | 3,349 | |
France | 450 | 450 | |
Germany | 513 | 513 | |
India | 648 | 648 | |
Italy | 567 | 567 | |
Japan | 621 | 621 | |
Spain | 594 | 594 | |
UK | 432 | 432 | |
USA | 8,455 | 8,455 | |
Here we can see by inspection that only Canada has outliers.
We can, like the R solution, specify a where clause for the query as a whole:
| Where: | Expression |
|---|
| | Amount <= 10 * median by Amount (group Country) |
| GroupBy: | Name | Expression |
|---|
| | Country | Country |
| Measures: | Name | Expression |
|---|
| | Total | sum Amount - Discount |
However, unlike the R solution, this is applied before, and independently of, grouping. Thus we need to specify a grouping in the where clause itself, and then apply the median function to each group using the by operator. The by operator handles the details of applying an aggregate function to grouped data, and then replicating the results to line up with the ungrouped data. The advantage here over the R solution is that we may then group our query by some other column or value than Country. For example, we might group by region, but keep outliers defined within country:
| Where: | Expression |
|---|
| | Amount <= 10 * median by Amount (group Country) |
| GroupBy: | Name | Expression |
|---|
| | Americas | Country in 'USA,Canada,Brazil' |
| | Europe | Country in 'UK,France,Germany,Italy,Spain' |
| | Asia | Country in 'Australia,Japan,India' |
| Measures: | Name | Expression |
|---|
| | Total | sum Amount - Discount |
which yields:
| Region | Total |
|---|
Americas | 9,139 |
Europe | 2,556 |
Asia | 1,809 |
What if we want to group by region but in addition display totals including the outliers, as we did above? We can't exclude that data from the query, but now we can't assume that the grouping for the query is the same grouping for computing the outliers. Rather than applying a where clause, we can pre-compute a column that flags outliers according to their country and then group based on region:
| ComputedColumns: | Name | Expression |
|---|
| | Outlier | Amount > 10 * median by Amount (group Country) |
| GroupBy: | Name | Expression |
|---|
| | Americas | Country in 'USA,Canada,Brazil' |
| | Europe | Country in 'UK,France,Germany,Italy,Spain' |
| | Asia | Country in 'Australia,Japan,India' |
| Measures: | Name | Expression |
|---|
| | Total | sum (Amount - Discount) where not Outlier |
| | TotalAll | sum Amount - Discount |
which yields:
| Partition | Total | TotalAll | |
|---|
Americas | 9,139 | 12,218 | |
Europe | 2,556 | 2,556 | |
Asia | 1,809 | 1,809 | |
In FlipDB computed columns in a query are executed before the where clause. This is useful, as we may want to reference them in the where clause, and if they are row-dependent we may want them computed on the entire table, not just what the where clause includes. However, perhaps in this case, it would be useful if computed columns were to execute after the where clause. We could have two sets of computed columns, one executed before the where clause and one executed after, but that seems a bit overkill.