Tool of Thought

APL for the Practical Man

"We make software the old-fashioned way, we write it."

A Query in R and Kap

May 8, 2026

Elias MÃ¥rtenson posts about a query in R and compares it to his APL-inspired language Kap. Elias of course provides a succint, APL-style solution in Kap. Here we take a look at the R solution and compare it to FlipDB, a relational database management system written in APL. The problem starts with the following table:

Country Amount Discount
USA 2,00010
USA 3,50015
USA 3,00020
Canada 120 12
Canada 180 18
Canada 3,10021
UK 130 13
UK 160 16
...

Given this, the task is to produce a table of total discounted sales by country, excluding outliers defined as entries in each country with an amount greater than 10 times the median for that particular country. The result we are looking for is:

Country Total
Australia540
Brazil 414
Canada 270
France 450
Germany 513
India 648
Italy 567
Japan 621
Spain 594
UK 432
USA 8,455

We thus have a row-dependent where clause. That is, whether or not a row is included depends on other rows in the table. More precisely, whether or not a row is included in a group depends on the other rows in the group. This is the crux of the matter, and what makes the query problematic in some tools.

The R solution is given as:

  purchases |>
  group_by(country) |>   
  filter(amount <= median(amount) * 10) |>
  summarize(total = sum(amount - discount))

As the original post suggests, this is indeed a nice solution. There is a filter, or where clause, that occurs after grouping. In APL terms, we can visualize the reference to amount being a nested array and median being an aggregate function applied with an each, and scalar extension applying to yield a nested Boolean selection vector. Who knows how it is executed under the covers, but that's how it looks.

How is this solved in FlipDB? We define the query as:

GroupBy:NameExpression
CountryCountry
Measures:NameExpression
Totalsum (Amount - Discount) where Amount <= 10 * median Amount
OrderBy:NameDirection
CountryUp

(Note that we have to explicitly specify the ordering which seems to be a default in R. We don't show the OrderBy clause in the queries below in the interest of space.)

This is remarkably similar to the R solution. The difference is that we are filtering in-line, just for one column in the result set, whereas the R solution is filtering the entire table. In-line filtering is useful because we can apply different filters, or no filter at all, for different columns in the result table. For example, it might be useful to display the totals without excluding the outliers side-by-side for comparison which can be done by just adding another measure:

Measures:NameExpression
Totalsum (Amount - Discount) where Amount <= 10 * median Amount
TotalAllsum Amount - Discount

yielding:

Country Total TotalAll
Australia540 540
Brazil 414 414
Canada 270 3,349
France 450 450
Germany 513 513
India 648 648
Italy 567 567
Japan 621 621
Spain 594 594
UK 432 432
USA 8,4558,455

Here we can see by inspection that only Canada has outliers.

We can, like the R solution, specify a where clause for the query as a whole:

Where:Expression
Amount <= 10 * median by Amount (group Country)
GroupBy:NameExpression
CountryCountry
Measures:NameExpression
Totalsum Amount - Discount

However, unlike the R solution, this is applied before, and independently of, grouping. Thus we need to specify a grouping in the where clause itself, and then apply the median function to each group using the by operator. The by operator handles the details of applying an aggregate function to grouped data, and then replicating the results to line up with the ungrouped data. The advantage here over the R solution is that we may then group our query by some other column or value than Country. For example, we might group by region, but keep outliers defined within country:

Where:Expression
Amount <= 10 * median by Amount (group Country)
GroupBy:NameExpression
AmericasCountry in 'USA,Canada,Brazil'
EuropeCountry in 'UK,France,Germany,Italy,Spain'
AsiaCountry in 'Australia,Japan,India'
Measures:NameExpression
Totalsum Amount - Discount

which yields:

Region Total
Americas9,139
Europe 2,556
Asia 1,809

What if we want to group by region but in addition display totals including the outliers, as we did above? We can't exclude that data from the query, but now we can't assume that the grouping for the query is the same grouping for computing the outliers. Rather than applying a where clause, we can pre-compute a column that flags outliers according to their country and then group based on region:

ComputedColumns:NameExpression
OutlierAmount > 10 * median by Amount (group Country)
GroupBy:NameExpression
AmericasCountry in 'USA,Canada,Brazil'
EuropeCountry in 'UK,France,Germany,Italy,Spain'
AsiaCountry in 'Australia,Japan,India'
Measures:NameExpression
Totalsum (Amount - Discount) where not Outlier
TotalAllsum Amount - Discount

which yields:

Partition Total TotalAll
Americas9,13912,218
Europe 2,5562,556
Asia 1,8091,809

In FlipDB computed columns in a query are executed before the where clause. This is useful, as we may want to reference them in the where clause, and if they are row-dependent we may want them computed on the entire table, not just what the where clause includes. However, perhaps in this case, it would be useful if computed columns were to execute after the where clause. We could have two sets of computed columns, one executed before the where clause and one executed after, but that seems a bit overkill.