Tool of Thought

APL for the Practical Man

Converting Text to Date

January 23, 2023

Problem #6 from Phase 2 of the Dyalog 2020 APL Problem Solving Comptetition is titled Its a Date! The task was to write a function DDN which basically does the inverse of the date format function 1200⌶.

Apparently the problem as specified was quite difficult, and no complete or even nearly complete solutions were submitted.

This being the APL for the Practical Man blog, we are interested in how the problem should be framed to make the solution useful in a commercial application. And of course the general problem is encountered all the time. Dates are common data elements. Data are in files. Files are often text. So here we propose a function Text2Date, to compare and contrast with DDN.

First and foremost, in the original specification, the function DDN is expected to operate on a single string or character vector, converting one string to one date, one at a time. But it will not do to use the each operator on millions of dates. Text2Date must operate on a character matrix where each row represents a date. This has the extra benefit of making the problem more array oriented - an issue that was noted in the presentation of the contest winners at the recent conference in Portugal.

Second, DDN is expected to handle strings that do not produce a unique Dyalog Date Number, like '07', 'Thursday', and 'Feb 29th', all potential outputs from 1200⌶. This adds complexilty but provides little to no value. We can only elide consecutive trailing elements in ⎕TS, which default to 1 or 0 as appropriate, and no one would reasonably expect otherwise:

       ¯1 1 ⎕DT⊂,2022 
44561
       ¯1 1 ⎕DT⊂2022 28 
DOMAIN ERROR: Invalid date-time
      ¯1 1 ⎕DT⊂2022 28
           ∧
      ¯1 1 ⎕DT⊂2022 ⍬ 28
DOMAIN ERROR: Invalid date-time
      ¯1 1 ⎕DT⊂2022 ⍬ 28
           ∧
      ¯1 1 ⎕DT⊂2022 0 28
DOMAIN ERROR: Invalid date-time
      ¯1 1 ⎕DT⊂2022 0 28
           ∧

Thus Text2Date should only be required to handle strings that produce a unique Dyalog Date Number, with the exception that 2 digit years are resolved using a fixed or sliding century window, which may be provided as an optional argument.

Third, the spec for DDN states that:

No variable length numeric fields will be placed immediately next to another numeric field.

For example, MDDYY, YYMD and MDY would be disallowed. This is almost, but not quite, saying that variable length formats must be delimited. It allows, for example, MMYY-D. This constraint is both too restrictive and not restrictive enough. Leading variable length elements should be allowed immediatly adjacent to a fixed element, but otherwise the elements in a variable length format must be strictly delimited. Thus Text2Date should properly handle the string 'DMMYY', but should not handle 'YYMM-D'. The reason for the former is that undelimited dates in text files are often missing a leading zero due to type conversions somewhere upstream. The reason for the latter is that allowing delimited variable length elements in the same format with undelimited fixed width elements adds much complexity for very little if any practical gain.

If we combine the restriction that any variable length format must be fully delimited, with the further DDN restriction that:

the only alphanumeric characters will be formatted elements of the date/time

then specifying a variable length format becomes merely a matter of specifying the order in which the elements occur. That is, there is no need to specify 'M-YY-D' when simply MYD will do. The string must contain three separated numbers (leaving aside months as text like Jan, Feb, etc.). The particular separator or length of separator is immaterial.

This brings up the question of whether the format specifier for 1200⌶ is really the best unformat specifier. Is a true inverse really what we need? Certainly they are related, but in a real application we would probably want '12/31/2022', '1-1-2023', and '7/4/21' to all convert properly with the same format string.

Thus there are at least two issues that make to-and-from text fundamentally different from all the other date formats. 1200⌶ will create "invalid" dates, that is, strings that cannot be converted back to a unique Dyalog Date Number. And conversely, there are multiple strings that could only be created using multiple formats, but that should all convert back to a Dyalog Date Number using a single format. This may be an argument to keep 1200⌶ out of ⎕DT.

A further complication is handling bad data. To be useful, Text2Date must not fail when it cannot yield a valid Dyalog Data Number for a string. There are then only two options: return a value that indicates failure, or take the route of ⎕VFI, and return a Boolean mask indicating valid results. A zero might be useful for a failure value. If not, a zero is perfect for a fill value, with the Boolean mask indicating valid results, just like ⎕VFI. If 1200⌶ is folded in to ⎕DT, perhaps a variant could be added that changes the result to a ⎕VFI style result. This might be useful for numeric-to-numeric conversions as well.

An Issue With ⎕XML Revisited

January 9, 2023

In a previous post I looked at a couple of issues with ⎕XML. One issue was formatting when converting to XML: I wanted indentation and removal of whitespace for most elements, but not all. I dont know how I missed it, but there is a way to do this built into the XML spec and ⎕XML supports it. Simply add a special attribute to the elements in question. It's right there in the docs:

xml:space="preserve"

Thus, if the overall variant is 'Whitespace' 'Strip', individual elements will preserve whitespace with this attribute specified.

This is more than cosmetic given the differences between XML and HTML. A paragraph with in-line markup, like bold, code, etc., must have whitespace preserved when generated using ⎕XML otherwise an additional space will be added due to the formatting. Of course pre elements should also have whitespace preserved. So far I apply xml:space="preserve" to the p, pre, and tr elements. The latter formats tables nicely, with each row on its own line.

Unfortunately the generated XML has xml:space="preserve" littered throughout the character vector which provides no use, and increases the size of the array. Oddly enough, I don't think I have ever written code to remove all the occurrences of a given substring from a string - or at least I can't remember doing it. It seems like it would be a very common task. Let's do it without using regular expressions.

First, let's find the substrings, easy with the find primitive. This marks the beginning of each substring in the target string :

     'ere'{⍺⍷⍵}'here there where'
0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0

Next lets extend each 1 by the length of the substring, using n-wise or-reduction, fully marking each found substring:

      'ere'{(≢⍺)∨/⍺⍷⍵}'here there where'
1 1 0 0 0 1 1 1 0 0 0 1 1 1

Two related problems arise. First, as always with n-wise reduction, the result is shorter than the right argument. Second, when a substring is found at or near the beginning of the string, we don't get enough 1s. We can fix this by appending the substring to the target string before searching:

      'ere'{(≢⍺)∨/⍺⍷⍺,⍵}'here there where'
1 0 1 1 1 0 0 0 1 1 1 0 0 0 1 1 1

Note that it is merely a convenience to append the substring itself, as it is, by definition, the proper length. We could also append an appropriate number of zeros to the result of find before applying the reduction.

Now we have only one problem: the Boolean mask is too long, by 1. So we drop off the first element:

      'ere'{1↓(≢⍺)∨/⍺⍷⍺,⍵}'here there where'
0 1 1 1 0 0 0 1 1 1 0 0 0 1 1 1

The Boolean mask flags what we want to remove, not what we want to keep, so we negate it, flipping 1s to 0s and vice versa:

      'ere'{~1↓(≢⍺)∨/⍺⍷⍺,⍵}'here there where'
1 0 0 0 1 1 1 0 0 0 1 1 1 0 0 0

Finally, we apply compress or replicate, with commute to avoid parentheses, to keep everything but the substrings:

      'ere'{⍵/⍨~1↓(≢⍺)∨/⍺⍷⍺,⍵}'here there where'
h th wh

Anatomy of a Query, Part 3

December 27, 2022

In the second part of this series, Anatomy of a Query, Part 2, we covered how columns from multiple tables are accessed in a single query. Unlike SQL, which requires joins at the table level, in FlipDB we can join data on a column-by-column basis, keeping our ability to trace FlipDB queries step-by-step, one name and one value at a time. In addition, this means pure APL solutions translate directly into FlipDB. The examples focused on column join syntax that generally requires a database with pre-specified foreign keys. In much ad hoc, real-world analysis, we have no such luxury; there is simply data, in a bunch of tables.

There is one join syntax that does allow for complete generality, with no pre-specified keys, the general join syntax:

   T[X;Y].Z

Where, where given the context of some starting table (let's call it S), then T is the name of some other table in the same database, X is one or more column names in S, defining the key in S, and Y is one or more column names in T, defining the the key in T. X and Y must conform in number and in type. The names may be different. Y may be elided if the names are the same. Z is a column name in T. This syntax materializes column Z from table T in the table S, by doing the lookup specified by the keys. The result will be simple or partitioned depending on whether the runtime relationship is many-to-one or not.

This is useful, but there are times when we may want to explicitly do look ups, to preserve the indices for further computation. FlipDB provides a getColumn function which will directly fetch a column from another table:

getColumn 'Table2.ColumnA'

from another database:

getColumn 'Database2.Table2.ColumnA' 

or from the same very same table, reaching out of the current context and getting the entire column:

getColumn 'ColumnA'

Thus, in any query, at any point (in the where clause, the select clause, the having clause, etc.) an expression can just grab a column from anywhere. A column accessed this way is generally non-conforming; it's not going to line up with our starting table. This does not hinder its usefulness. A typical use is simply to check if values exist in some list. For example, say in the Suppliers and Parts database we added a Sanction table that contains a list of cities that are sanctioned. In order to find existing suppliers in unsanctioned cities, starting in the Supplier table S, it's just simple APL:

not CITY in getColumn 'Sanction.CITY'

Because FlipDB queries are just sets of names and expressions, the expression can do virtually anything. We can get an entire table using the getTable function. So starting in the supplier table S, we can materialize the parts table:

      getTable 'P'
── SandP.P ────────────────────────────────────────────
 ┌PNO────┐  ┌PNAME──┐  ┌COLOR──┐  ┌WEIGHT┐  ┌CITY────┐ 
 ↓P1     │  ↓Nut    │  ↓Red    │  ↓12    │  ↓London  │ 
 │P2     │  │Bolt   │  │Green  │  │17    │  │Paris   │ 
 │P3     │  │Screw  │  │Blue   │  │17    │  │Oslo    │ 
 │P4     │  │Screw  │  │Red    │  │14    │  │London  │ 
 │P5     │  │Cam    │  │Blue   │  │12    │  │Paris   │ 
 │P6     │  │Cog    │  │Red    │  │19    │  │London  │ 
 └Char(2)┘  └Char(5)┘  └Char(5)┘  └Int8──┘  └Char(10)┘ 
── 6 rows by 5 columns ────────────────────────────────

The transpose function converts the table into a FlipDB property space (analogous to a k dictionary):

      P=transpose D.GetTable 'P'
      P
┌PropertySpace───────────────┐
│ Name             Type      │
│ -------------    --------- │
│ TWID             Integer   │
│ APPENDTYPE       Integer   │
│ PNO              Char      │
│ AUTOKEY          Integer   │
│ PNAME            Char      │
│ COLOR            Char      │
│ WEIGHT           Integer   │
│ CITY             Char      │
│ TransDateTime    DateTime  │
└────────────────────────────┘

which then provides convenient dot access to the columns:

     P.COLOR 
┌COLOR──┐
↓Red    │
│Green  │
│Blue   │
│Red    │
│Blue   │
│Red    │
└Char(5)┘

We can use the indexOf and index functions, both analagous to their APL counterparts, to conform columns from one table to another. Entire tables can be made to line up by using the conform function. The full power of APL techniques and solutions is directly available to us in the expressions.

Because FlipDB queries are sets of names and expressions, each set applied in a certain way and in certain context, and expressions, by definition, can do anything the FlipDB language can do, it follows that a query can do pretty much anything. For example a query can do operating system commands like creating a folder, read a file, or write to a log. These are things that would normally be done in scripts or stored procedures. A query becomes a highly structured script, executed in the context of some table in some database.

In our next and final part in this series, we will look at how FlipDB allows multiple queries to run as a set and how each query can refer to values in previously executed queries.

More posts...