Tool of Thought

APL for the Practical Man

Parsing Markdown

October 24, 2022

Once we have an APL document object model , it is natural to want a Markdown to DOM function. While there are a zillion third party markdown libraries, none of them will create an APLDOM directly, and, more importantly, there are advantages to having an easily modified, pure APL, no-dependency solution.

This Parse function takes an array of Markdown lines and returns an array of HTML elements:

          Parse←{
     ⍝ ⍵ ←→ MarkDown
     ⍝   ←→ Array of HTML obs
     ⍝ f ←→ Paragraph|Table|Code|Header
     0=≢⍵:''
     ⎕THIS.H←##.Main
     s←InitState ⍵
     t b←↓⍉↑s ProcessLine¨⍳≢⍵
     ~1∊b:''
     p←b⊆⍵
     f←(b>¯1↓0,b)/t
     f{(⍎⍺)⍵}¨p
 }

        

It first initializes a namespace to hold the state as we step through each line:

          InitState←{
     s←⎕NS''
     s.Lines←⍵
     s.InParagraph←0
     s.InTable←0
     s.InCode←0
     s.InList←0
     s.InBlockQuote←0
     s.InPullQuote←0
     s
 }

        

We toss the entire set of lines in there as well, so we can inspect a subsequent line if necessary (for lists , if nothing else, it appears this is needed). Then we process each line, to determine the type and location of top-level objects or blocks:

          ProcessLine←{
     ⍝ ⍵ ←→ Line Number
     ⍝ ⍺ ←→ State
     ⍝   ←→ Partition Vector
     l←⍵⊃⍺.Lines
     ⍺.InParagraph:''(⍺.InParagraph←l≢'')
     ⍺.InCode:''(1⊣⍺.InCode←'~~~'≢3↑l)
     ⍺.InTable:''(⍺.InTable←'|'=⊃l)
     ⍺.InBlockQuote:''(⍺.InBlockQuote←'> '≡2↑l)
     ⍺.InPullQuote:''(⍺.InPullQuote←'^ '≡2↑l)
     ⍺.InList:''(⍺.InList←⍺ StillInList ⍵)
     ''≡l:'' 0
     '~~~'≡3↑l:'Code'(⍺.InCode←1)
     '|'=⊃l:'Table'(⍺.InTable←1)
     '> '≡2↑l:'BlockQuote'(⍺.InBlockQuote←1)
     '∧ '≡2↑l:'PullQuote'(⍺.InPullQuote←1)
     '#'=⊃l:'Header' 1
     '!'=⊃l:'Image' 1
     IsListItem l:'List'(⍺.InList←1)
     1:'Paragraph'(⍺.InParagraph←1)
 }

        

For each line, if we are not in an object already, we find out what object is starting, and return the type of the object and a 1 to indicate it is starting. If we are already in an object, then we determine whether the line being processed is still a member of that object. The object type is also the name of a function that will be called to convert the block of lines into an array of HTML element objects.

Consider the following Markdown:

          # MarkDown List                                         
                                                        
A list follows:                                         
                                                        
- Item 1                                                
    This is detail on item 1.                           
- Item 2                                                
    This is detail on item 2,                           
    and note there is no detail on the following item 3.
- Item 3                                                
                                                        
This is a trailing paragraph,                           
it should not be lost       

        

The result of ProcessLine tells us the type of object and flags the locations with 1-maps:

                ↑s ProcessLine¨⍳≢⍵
 Header     1
            0
 Paragraph  1
            0
 List       1
            1
            1
            1
            1
            1
            0
 Paragraph  1
            1

        

Note that this partition technique requires a blank line between objects. Now we can apply the appropriate function to each partition. For example, the Paragraph function creates a new paragraph element object, and sets its content after adding any in-line markup:

          Paragraph←{
     l←1↓⊃,/' ',¨⍵
     H.New'p'(InLine l)
 }

        

While the List function, and its subfunction ListItem , creates a list element and must recursively parse the content:

          List←{
     t←↑⍵
     p←(' '≠t[;0])⊂⍵
     l←('-'∊t[;0])⊃'ol' 'ul'
     H.New l(ListItem¨p)
 }
ListItem←{
     t←{⍵↓⍨1+⍵⍳' '}⊃⍵
     i←H.New'li't
     1=≢⍵:i
     i.Content,←Parse 4↓¨1↓⍵
     i
 }

        

The code above is by no means a comprehensive markdown processor. Many features can and will be easily added, but we are generally not concerned with handling every little issue or variation that may be present in Markdown files out in the wild.

Anatomy of a Query

October 23, 2022

A FlipDB query is essentially a highly structured dfn or collection of dfns; it is an ordered set of names and expressions. It is thus much less sophisticated than an SQL query. This simplicity, however, brings certain benefits. It means that we can step through the query line by line (one name and expression at a time), and evaluate, inspect, or even change the code. In other words, the query may be traced just like a function in the Dyalog session, and in fact the FlipDB UI has its own tracer. This is not possible with SQL - an SQL query is not made up of individual, executable, traceable steps.

A FlipDB query usually starts with a specific table, for a default context, and is then composed of the following ordered sections or clauses:

  1. Precomputed expressions
  2. Where statements
  3. Where-not statements
  4. Group-by expressions
  5. Select expressions
  6. Order-By clause
  7. Having statements

Each section (except #6) is an ordered set of names and expressions, and may be thought of as a dfn.

Precomputed Expressions

Precomputed expressions are evaluated before any where statements are applied, and thus operate on the entire table. These may be used to specify constants, computed columns and aggregate values. (These latter two are especially valuable for expressions that may be row dependent, and thus change after the where clause is applied.) For example:

Name Expression
YearEndTarget 2000000
GrandTotalBalance sum Balance
ComputedInterest Balance * Rate / 1200
TotalComputedInterest sum ComputedInterest

The FlipDB tracer steps through these lines just like the Dyalog APL tracer steps through a function. At each line variables may be inspected and played around with. Also like a dfn, a defined name may be referenced in the following lines of the section. The defined names may then be referenced in subsequent sections or clauses, which is of course the whole point of this clause.

Where and Where-Not Statements

Next comes where statements, a set of zero or more Boolean valued expressions:

Expression
Balance > 100000
State in 'NY,NJ,CT'
LTV > 80

These expression are not named by the user, but for tracing purposes the system assigns them names Where1 , Where2 , Where3 , etc. These statements could be packed into a one-liner, but there are significant advantages for the user in keeping the expressions as simple and short as possible, for tracing, debugging, and performance.

The tracer steps through these line by line, allowing the user to see exactly how many rows are restricted by each statement, and the net effect of each statement. At the end, the statements are and'ed together:

                Where←∧/Where1 Where2 Where3 ... 

        

Where not statements are similar to where statements, except they are or'ed together:

                WhereNot←∨/WhereNot1 WhereNot2 WhereNot3... 

        

The tracer steps through these as well. The final boolean restriction is then simply:

                Where∧~WhereNot

        

Group-By Expressions

If the query is to be grouped, then one or more group expressions may be applied. For example:

Name Expression
GeographicDistribtion State
Coupon .5 range Rate

These expressions are evaluated on the rows remaining after the Where and Where Not clauses are applied. These too are stepped through in the tracer, allowing detailed inspection.

Select Expressions

After the optional grouping clause, the select expressions are evaluated. For ungrouped queries these will often be a simple list of column names. For grouped queries generally all of the expressions are aggregate expressions of one sort or another:

Name Expression
NumberOfLoans count LoanID
TotalBalance sum Balance
Percent percentDown Balance
PercentOfTable TotalBalance / GrandTotalBalance

Note that the last expression is making use of a name defined two lines above it, as well as a name defined in the first section of computed values. Yet again, the tracer lets us step through these expression line by line.

Order-By

The order-by clause is the only section that is not a set of names and expressions. Rather it is a list of names defined in the Select clause and a sort directive. For example:

Name Direction
TotalBalance Down

The tracer does not step through the order-by clause line by line, as there is little value added.

Having Expressions

Finally, one or more having expressions may be defined, which are analogous to where statements, except they are applied to the sorted result. For example:

Expression
TotalBalance > 100000

These expressions may be traced just like where expressions.

Conclusion

A FlipDB query is an ordered set of names and expressions that both defines the query and also explicitly specifies how it is to be executed. Whereas an SQL RDBMS might have a query optimizer that selects a query execution path, FlipDB, for better or worse, just does what the user tells it, in the order the user tells it. While no doubt there are a number of costs to this simple model, there are also major benefits. The traceability of a FlipDB query is a direct consequence of the fact that the query is an ordered set of names and expressions. That is, we can read the query incrementally from top to bottom unlike an SQL query where we must mentally process and parse the query to the end before we can understand what is happening at the beginning. The computer and the user or author read and process the query in exactly the same way.

In a future post we will look at how FlipDB deals with joins, views, sub-queries, and all the complexities that arise when data is in a different table.

ComposeRules Revisited

October 7, 2022

In the previous post we looked at constructing CSS programatically . The ComposeRules function looked like:

          ComposeRules←{
     ⍝ ⍵ ←→ Vector of CSS Rules
     ⍺←0
     0=≢⍵:''
     nl←⎕UCS 13
     ⊃,/((⍺>0)/nl),⍺{
         0=≢⍵.Selector:⍺ ComposeRules ⍵.Rules
         b←(4×⍺)⍴' '
         s←b,⍵.Selector,' {',nl
         s,←⍺ ComposeDeclarations ⍵
         s,←¯1↓(⍺+1)ComposeRules ⍵.Rules
         s←s,b,'}',2⍴nl
         s
     }¨⍵
 }

        

As is often the case when reviewing a function after a little time has passed, we wonder why we wrote it the way we did. Often there is a good reason, but also often there is not a good reason.

There are at least two things that are bothersome about this function. First, there is a nested dfn. I don't like to do this without a good reason, and I had a nagging feeling there was in fact no good reason. I was probably just in a hurry to enhance the function to handle nested rules. Second, and directly related, the main function is called recursively by explicit name, rather than using self-reference ( ). A refactoring is in order:

          ComposeRules←{
     ⍝ ⍵ ←→ Vector of CSS Rules
     ⍺←0
     0=≢⍵:''
     nl←⎕UCS 13
     1=≡⍵:⊃,/((⍺>0)/nl),⍺ ∇¨⍵
     0=≢⍵.Selector:⍺ ∇ ⍵.Rules
     b←(4×⍺)⍴' '
     s←b,⍵.Selector,' {',nl
     s,←⍺ ComposeDeclarations ⍵
     s,←¯1↓(⍺+1)∇ ⍵.Rules
     s←s,b,'}',2⍴nl
     s
 }

        

Removing the nested dfn allows recursion via self-reference - the whole thing is much nicer.

More posts...