The DOM via JSON

April 11, 2023

In A Document Object Model in APL, the usefulness of an APL DOM was discussed, and a function, HTML2DOM, was demonstrated:

HTML2DOM←{
     ⍝ ⍵ ←→ HTML
     ⍝ ← ←→ DOM
     0=≢⍵:⍬
     {⍵⌷⍨⍳1=⍴⍵}0{
         m←⍵
         0=≢m:⍺
         b←m[;0]=0
         p←⍺{⍺ New 3↑1↓⍵}¨↓b⌿m
         m[;0]-←1
         p⊣p ∇¨1↓¨b⊂[0]m
     }⎕XML ⍵
 }

This function takes HTML and recursively works through the ⎕XML matrix using the New function, building a tree of element namespaces. The New function is the same function used to build up a document from scratch. All of this code is in the Abacus project.
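The recursion hinges on the depth column: rows at depth 0 start a new subtree, and partitioned enclose groups each such row with its descendants. Here is a minimal sketch of that one step, assuming hypothetical vectors d and t that stand in for the depth and tag columns of a ⎕XML matrix (HTML2DOM itself partitions the whole matrix with b⊂[0]m):

```apl
d←0 0 1 1 1                   ⍝ hypothetical depths: h1, then ul with three li children
t←'h1' 'ul' 'li' 'li' 'li'    ⍝ hypothetical tag names for those rows
b←d=0                         ⍝ 1 1 0 0 0: rows that start a subtree
≢¨b⊂t                         ⍝ 1 4: h1 on its own, ul together with its children
≢¨1↓¨b⊂t                      ⍝ 0 3: what remains for the recursive call
```

Dropping the first item of each partition removes the element that was just created, leaving only its descendants to be processed one level down.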

In a recent discussion on The APL Orchard, it was noted that ⎕JSON could be used to create a namespace tree from HTML. While it is no secret that ⎕JSON creates a namespace tree from JSON (that's what it does!), it never occurred to me to use it for creating an HTML DOM. It seems it should be faster than doing it by "hand", but there is some additional overhead in going from HTML to JSON and then to a namespace tree. Let's see how to do it.
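As a reminder of what ⎕JSON import produces, here is a small sketch. The JSON text is made up for illustration; it only assumes the Tag and Content member names that the DOM elements use.

```apl
⍝ Hypothetical JSON text using the DOM's Tag/Content member names
j←'{"Tag":"ul","Content":[{"Tag":"li","Content":["One"]},{"Tag":"li","Content":["Two"]}]}'
ns←⎕JSON j               ⍝ import: each JSON object becomes a namespace
ns.Tag                   ⍝ the tag of the outer element
≢ns.Content              ⍝ number of child namespaces
ns.Content.Tag           ⍝ the tags of the children
⊃(⊃ns.Content).Content   ⍝ text content of the first child
```

JSON arrays come back as vectors, so Content is a vector of namespace references (or of text), which is the shape the DOM already uses.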

The problem boils down to converting a ⎕XML-style matrix to a ⎕JSON-style matrix. We know what a ⎕XML matrix looks like, and we know how the DOM is structured (which is good, as half the battle is simply knowing the destination), but what on Earth does the corresponding ⎕JSON matrix look like? Fortunately, we can create a simple document fragment:

      A←#.Abacus.Main
SimpleDoc←{
     d←A.New'section'
     d.class←'post'
     d.id←'p20230411'
     h←d A.New'h1' 'Simple Doc'
     l←d A.New'ul'
     l.Content←{A.New'li'⍵}¨'One' 'Two' 'Three'
     _←(A.Elements d).⎕EX⊂'Parent'⍝ Not good for JSON
     d
 } 
      d←SimpleDoc 0
      h←A.DOM2HTML d
      h
<section class="post" id="p20230411">
  <h1>Simple Doc</h1>                
  <ul>                               
    <li>One</li>                     
    <li>Two</li>                     
    <li>Three</li>                   
  </ul>                              
</section>

And see what the familiar source ⎕XML matrix looks like:

      ⎕XML h
0  section               class  post        3
                         id     p20230411    
1  h1       Simple Doc                      5
1  ul                                       3
2  li       One                             5
2  li       Two                             5
2  li       Three                           5

And the corresponding unfamiliar target ⎕JSON matrix:

      ⎕JSON⍠'M'⊢⎕JSON d
0                       1
1  Content              2
2                       1
3  Content              2
4           Simple Doc  4
3  Tag      h1          4
2                       1
3  Content              2
4                       1
5  Content              2
6           One         4
5  Tag      li          4
4                       1
5  Content              2
6           Two         4
5  Tag      li          4
4                       1
5  Content              2
6           Three       4
5  Tag      li          4
3  Tag      ul          4
1  Tag      section     4
1  class    post        4
1  id       p20230411   4

They are indeed a little different. By inspection, and some trial and error, we can write a function to go directly from the ⎕XML matrix to the ⎕JSON matrix:

XM2JM←{
     ⍝ ⍵ ←→ ⎕XML matrix
     ⍝ ← ←→ ⎕JSON matrix
     ⊃⍪/{
         d e t a←4↑⍵
         k←2×d
         j←k+1
         e≡'':⍉⍪k''t 4         ⍝ Data        1 row OR...
         z←⊂k'' '' 1           ⍝ Object      (1 row)+
         z,←⊂j'Tag'e 4         ⍝ Element     (1 row)+
         z,←4,⍨¨j,¨↓a          ⍝ Attributes  (0 or more rows)+
         z,←⊂j'Content' '' 2   ⍝ Content     (1 row)+
         z,←(0<≢t)/⊂(j+1)''t 4 ⍝ Text        (0 or 1 row)
         ↑z}¨↓⍵
 }
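To see what a single element contributes, the same steps can be traced by hand in the session. This is a sketch only: the row is hypothetical (an h1 with one made-up class attribute), and it assumes that ⎕JSON with a left argument of 1 and the 'M' variant exports such a matrix to JSON text.

```apl
row←0 'h1' 'Simple Doc'(1 2⍴'class' 'title')   ⍝ hypothetical: depth, tag, text, attributes
d e t a←row
k←2×d ⋄ j←k+1
z←⊂k'' '' 1                    ⍝ object row
z,←⊂j'Tag'e 4                  ⍝ tag row
z,←4,⍨¨j,¨↓a                   ⍝ one row per attribute
z,←⊂j'Content' '' 2            ⍝ content array row
z,←(0<≢t)/⊂(j+1)''t 4          ⍝ text row, only when there is text
m←↑z
⍴m                             ⍝ 5 4
ns←⎕JSON 1 ⎕JSON⍠'M'⊢m         ⍝ matrix → JSON text → namespace (assumed export form)
ns.Tag ns.class
```

An element with no attributes and no text yields three rows, and each attribute or text item adds one more. XM2JM applies these steps to every row and stacks the results.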

This loops through each row of the ⎕XML matrix, which is less than desirable (though surprisingly not particularly slow). However, it gives us insight into how to take a more array-oriented approach:

XM2JM←{
     ⍝ ⍵ ←→ ⎕XML matrix
     ⍝ ← ←→ ⎕JSON matrix
     (o t)←↓⍉~⍵[;1 2]∊⊂''                     ⍝ Object, text
     l←≢¨⍵[;3]                                ⍝ Number of attributes
     n←o*⍨3+t+l                               ⍝ Target rows per source row
     s←0,+\¯1↓n                               ⍝ Starting index of source in target
     j←(o∧t)/3+s+l                            ⍝ Text index
     m←0 '' '' 1⍴⍨4,⍨+/n                      ⍝ Initialize result
     m[o/1+s;1 2 3]←(⊂'Tag'),4,⍨⍪o/⍵[;1]      ⍝ Tag row
     m[2+∊s+⍳¨l;1 2 3]←4,⍨⊃⍪/⍵[;3]            ⍝ Possible attribute rows
     m[o/2+s+l;1 2 3]←(+/o)⌿⍉⍪'Content' '' 2  ⍝ Content row
     m[j;2 3]←4,⍨⍪⍵[;2]/⍨o∧t                  ⍝ Possible Text row
     m[s/⍨~o;2 3]←4,⍨⍪⍵[;2]/⍨~o               ⍝ Text row
     m[;0]←(n/2×⍵[;0])+(~m[;1]∊⊂'')+2×j∊⍨⍳≢m  ⍝ Depth
     m
 }

With the exception of the depth column, the target matrix is filled in over about six steps. There may be a way to fill in each column in one go, as is done for the depth column on the penultimate line.
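The row bookkeeping is the part everything else depends on, so it is worth checking by hand. In the sketch below, o, t and l are typed in from the SimpleDoc ⎕XML matrix shown earlier (six element rows, two attributes on section, text on h1 and the three li rows) rather than computed from it:

```apl
⎕IO←0
o←1 1 1 1 1 1      ⍝ every row is an element (has a tag name)
t←0 1 0 1 1 1      ⍝ rows that carry text: h1 and the three li
l←2 0 0 0 0 0      ⍝ attribute counts: class and id on section
n←o*⍨3+t+l         ⍝ 5 4 3 4 4 4 target rows per source row
s←0,+\¯1↓n         ⍝ 0 5 9 12 16 20 starting offsets in the target
+/n                ⍝ 24 rows in total
```

The total agrees with the 24 rows of the ⎕JSON matrix above. A bare text row (o is 0) would contribute a single target row, since anything raised to the power 0 is 1.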

In the next post we will investigate the relative performance of these various techniques.

Tool of Thought

APL for the Practical Man

"We make software the old-fashioned way, we write it."
