A Query in R and Kap
May 8, 2026
Elias MÃ¥rtenson posts about a query in R and compares it to his APL-inspired language Kap. Elias of course provides a succint, APL-style solution in Kap. Here we take a look at the R solution and compare it to FlipDB, a relational database management system written in APL. The problem starts with the following table:
| Country | Amount | Discount |
|---|---|---|
USA | 2,000 | 10 |
USA | 3,500 | 15 |
USA | 3,000 | 20 |
Canada | 120 | 12 |
Canada | 180 | 18 |
Canada | 3,100 | 21 |
UK | 130 | 13 |
UK | 160 | 16 |
| ... |
Given this, the task is to produce a table of total discounted sales by country, excluding outliers defined as entries in each country with an amount greater than 10 times the median for that particular country. The result we are looking for is:
| Country | Total |
|---|---|
Australia | 540 |
Brazil | 414 |
Canada | 270 |
France | 450 |
Germany | 513 |
India | 648 |
Italy | 567 |
Japan | 621 |
Spain | 594 |
UK | 432 |
USA | 8,455 |
We thus have a row-dependent where clause. That is, whether or not a row is included depends on other rows in the table. More precisely, whether or not a row is included in a group depends on the other rows in the group. This is the crux of the matter, and what makes the query problematic in some tools.
The R solution is given as:
purchases |>
group_by(country) |>
filter(amount <= median(amount) * 10) |>
summarize(total = sum(amount - discount))
As the original post suggests, this is indeed a nice solution.
A comment on reddit makes the observation:
It's true that R's DSL has some nice defaults here, like the filtering happening implicitly on the grouped columns. But a DSL means there's stuff happening without straightforward execution semantics, there's some magic. And that example is very short in R, because it relies on those defaults, but if it needed sorting by total (instead of by country), or computing a flat amount-discount before grouping, then it'd start looking a bit longer.
We don't see the magic here. The R solution starts with a table or dataset in a column-store format, which is then partioned by country. In APL terms we can think of this as just making a vector of unique values from country and corresponding nested vectors out of amount and discount. All of the basic scalar operations are happy to work on nested vectors, applying scalar extension where necessary. We can then imagine sum and median being aggregate functions that have a built-in each operator when running on nested data. Same for filtering, where APL's replicate function would have a built-in each for nested data, taking a nested boolean array and masking the nested column data. DSLs can just be a collection of well-crafted, higher-level functions. If R were APL, the underlying code just jumps right out of this solution. It is simple, direct, executable and traceable step-by-step, unlike, say, an SQL query. And in fact, this is almost exactly how FlipDB works, and the solution is remarkably similar:
| GroupBy: | Name | Expression |
|---|---|---|
Country | Country |
| Measures: | Name | Expression |
|---|---|---|
Total | sum (Amount - Discount) where Amount <= 10 * median Amount |
| OrderBy: | Name | Direction |
|---|---|---|
Country | Up |
(Note that we have to explicitly specify the ordering which seems to be a default in R. We don't show the OrderBy clause in the queries below in the interest of space.)
The difference is that we are filtering in-line, just for one column in the result set, whereas the R solution is filtering the entire table. In-line filtering is useful because we can apply different filters, or no filter at all, for different columns in the result table. (Useful for a poor man's cross-tab.) For example, it might be useful to display the totals without excluding the outliers side-by-side for comparison which can be done by just adding another measure:
| Measures: | Name | Expression |
|---|---|---|
Total | sum (Amount - Discount) where Amount <= 10 * median Amount | |
TotalAll | sum Amount - Discount |
yielding:
| Country | Total | TotalAll | |
|---|---|---|---|
Australia | 540 | 540 | |
Brazil | 414 | 414 | |
Canada | 270 | 3,349 | |
France | 450 | 450 | |
Germany | 513 | 513 | |
India | 648 | 648 | |
Italy | 567 | 567 | |
Japan | 621 | 621 | |
Spain | 594 | 594 | |
UK | 432 | 432 | |
USA | 8,455 | 8,455 |
Here we can see by inspection that only Canada has outliers.
Note that where is just a simple function that takes an array of one or more columns on the left and a corresponding boolean on the right and then returns the filtered columns. In this case, because we are just summing the result, we could replace where with multiplication, an age-old APL technique, zeroing out instead of filtering.
We can, like the R solution, specify a where clause for the query as a whole:
| Where: | Expression |
|---|---|
Amount <= 10 * median by Amount (group Country) |
| GroupBy: | Name | Expression |
|---|---|---|
Country | Country |
| Measures: | Name | Expression |
|---|---|---|
Total | sum Amount - Discount |
However, unlike the R solution, this is applied before, and independently of, grouping. Thus we need to specify a grouping in the where clause itself, and then apply the median function to each group using the by operator. The by operator handles the details of applying an aggregate function to grouped data, and then replicating the results to line up with the ungrouped data. The advantage here over the R solution is that we may then group our query by some other column or value than Country. For example, we might group by region, but keep outliers defined within country:
| Where: | Expression |
|---|---|
Amount <= 10 * median by Amount (group Country) |
| GroupBy: | Name | Expression |
|---|---|---|
Americas | Country in 'USA,Canada,Brazil' | |
Europe | Country in 'UK,France,Germany,Italy,Spain' | |
Asia | Country in 'Australia,Japan,India' |
| Measures: | Name | Expression |
|---|---|---|
Total | sum Amount - Discount |
which yields:
| Region | Total |
|---|---|
Americas | 9,139 |
Europe | 2,556 |
Asia | 1,809 |
What if we want to group by region but in addition display totals including the outliers, as we did above? We can't exclude that data from the query, but now we can't assume that the grouping for the query is the same grouping for computing the outliers. Rather than applying a where clause, we can pre-compute a column that flags outliers according to their country and then group based on region:
| ComputedColumns: | Name | Expression |
|---|---|---|
Outlier | Amount > 10 * median by Amount (group Country) |
| GroupBy: | Name | Expression |
|---|---|---|
Americas | Country in 'USA,Canada,Brazil' | |
Europe | Country in 'UK,France,Germany,Italy,Spain' | |
Asia | Country in 'Australia,Japan,India' |
| Measures: | Name | Expression |
|---|---|---|
Total | sum (Amount - Discount) where not Outlier | |
TotalAll | sum Amount - Discount |
which yields:
| Region | Total | TotalAll | |
|---|---|---|---|
Americas | 9,139 | 12,218 | |
Europe | 2,556 | 2,556 | |
Asia | 1,809 | 1,809 |
In FlipDB computed columns in a query are executed before the where clause. This is useful, as we may want to reference them in the where clause, and if they are row-dependent we may want them computed on the entire table, not just what the where clause includes. However, perhaps in this case, it would be useful if computed columns were to execute after the where clause. We could have two sets of computed columns, one executed before the where clause and one executed after, but that seems a bit overkill.