What is the purpose of the Char type in daml?

In Haskell the string type is defined with type String = [Char] and strings can be manipulated using list functions.

In daml it seems there is no Char type and functions such as splitAt are implemented using explode and implode (from the Text library) around the prelude list equivalent - incidentally those Text functions are undocumented.

So why in daml can I write 'a' and have the type inferred as Char, yet not appear to be able to do anything with it?


There isn’t really a Char type in DAML. There are some artifacts of a Char type that are imposed upon us by virtue of being based on GHC and not having patched it too deeply in this regard. Once you try to actually compile a program with a Char literal, you will get a compilation error in the conversion to DAML-LF either complaining about the literal or the type depending on what the compiler hits first. So unless you are working on the DAML compiler and you need to understand the inner workings of GHC and why we have artifacts of Char, you are best off pretending that Char literals result in a parse error and there is no Char type.

The more interesting question is why does DAML not have a Char type?

This is primarily because Unicode is more complex than one might realize at first. The JVM (the DAML interpreter is implemented in Scala) stores Strings in UTF-16 encoding. The char type in Java represents a UTF-16 code unit, so 2 bytes per char. However, not every Unicode character (code point, if you want to be precise) can be encoded in 2 bytes in UTF-16. For example, the well-loved 😂 code point is encoded as a so-called surrogate pair consisting of two 16-bit code units. Java will gladly give you 2 chars here. However, for most text-processing use cases you really want to treat this as a single unit (no pun intended: this is 1 code point but 2 code units). There are two ways out of this mess:

  1. Introduce a 32-bit Char type to represent all possible Unicode code points. This is the solution taken by Haskell. The downside here is that it confuses people who are used to the Java model. It also requires a new DAML-LF primitive type, which always comes at a cost (not saying that it cannot be justified here).
  2. Don’t expose a type for individual characters and instead decompose Text into smaller Texts in things like explode. This clearly doesn’t require a new DAML-LF type and while it might seem like an odd choice it mostly works just fine for the relatively limited text processing that is usually done in DAML.
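To make the code unit versus code point distinction concrete, here is a minimal Java sketch. The class and method names are made up for illustration, and the `explode` helper only mimics the idea behind option 2 (it is not DAML's actual implementation). It counts both kinds of unit for a string containing 😂 and then splits the string into one-code-point strings.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: the class and method names are hypothetical.
class SurrogatePairDemo {
    // Number of UTF-16 code units; this is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points, which is usually what "characters" means.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Option 2 in miniature: split a string into strings of one code point each,
    // so no separate character type is needed.
    static List<String> explode(String s) {
        List<String> parts = new ArrayList<>();
        s.codePoints().forEach(cp -> parts.add(new String(Character.toChars(cp))));
        return parts;
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE02"; // "a" followed by U+1F602 (the 😂 emoji)
        System.out.println(codeUnits(s));  // 3
        System.out.println(codePoints(s)); // 2
        System.out.println(explode(s));    // [a, 😂]
    }
}
```

Iterating over `char` values instead would yield the two surrogate halves separately, which is exactly the splitting problem described above.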

I would love it if DAML had distinct types for Char and Text. This would allow more informative types for some existing primitive functions:

explode : Text -> [Char]
implode : [Char] -> Text

Other primitives might be changed to work on Char instead of on Text:

isNewLine : Char -> Bool
isUpper : Char -> Bool

And perhaps we might add an indexing function as primitive (alongside the existing substring):

index : Int -> Text -> Char
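
For comparison, here is a rough Java sketch of what indexing by code point (rather than by UTF-16 code unit) could look like. The `index` helper is hypothetical and mirrors the proposed signature only loosely: it returns a one-code-point string because Java's `char` cannot hold every code point.

```java
// Illustrative sketch only: CodePointIndex and index are hypothetical names.
class CodePointIndex {
    // Returns the i-th code point (0-based) of text as a one-code-point string.
    // Throws IndexOutOfBoundsException if i is out of range.
    static String index(int i, String text) {
        int start = text.offsetByCodePoints(0, i); // skip i whole code points
        int cp = text.codePointAt(start);          // read a full code point, pairing surrogates
        return new String(Character.toChars(cp));
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE02b"; // "a", U+1F602 (😂), "b"
        System.out.println(index(1, s)); // 😂
        System.out.println(index(2, s)); // b
    }
}
```

Note that `s.charAt(2)` would return the low surrogate of the emoji here, not "b".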

Would it be possible to use an existing 32 bit primitive type for this, such as Int, to substitute for Char?

That exists: toCodePoints : Text -> [Int], fromCodePoints: [Int] -> Text
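
As a rough illustration of the same idea on the JVM, the sketch below round-trips a string through a list of integer code points. The Java method names simply echo the DAML ones. They are not the DAML implementation, only an analogue built on `String.codePoints()`.

```java
import java.util.Arrays;

// Illustrative sketch only: these helpers merely echo the DAML function names.
class CodePointRoundTrip {
    // One int per Unicode code point; surrogate pairs are combined.
    static int[] toCodePoints(String s) {
        return s.codePoints().toArray();
    }

    // Rebuild the string from its code points.
    static String fromCodePoints(int[] cps) {
        return new String(cps, 0, cps.length);
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE02"; // "a" followed by U+1F602 (😂)
        int[] cps = toCodePoints(s);
        System.out.println(Arrays.toString(cps));          // [97, 128514]
        System.out.println(fromCodePoints(cps).equals(s)); // true
    }
}
```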


For what it’s worth, Rust also tends to use strings where you might expect a character. For example, when calling char::to_uppercase, you’ll get an iterator over char, not a single one. This is because, for example, “ß” is “SS” in uppercase.
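
The same effect shows up on the JVM. In this small Java sketch (the class name is made up), the string-level uppercase of "ß" grows to two characters, while the single-`char` mapping has no room to do that and leaves it unchanged.

```java
import java.util.Locale;

// Illustrative sketch only: UppercaseDemo is a hypothetical name.
class UppercaseDemo {
    // String-level mapping can change the length (1:M case mappings).
    static String upper(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    // char-level mapping must return exactly one char.
    static char upperChar(char c) {
        return Character.toUpperCase(c);
    }

    public static void main(String[] args) {
        System.out.println(upper("\u00DF"));                // SS
        System.out.println(upperChar('\u00DF') == '\u00DF'); // true: unchanged
    }
}
```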

So working with characters isn’t as simple as you might like. For most of DAML’s use cases, Text works well and, as @cocreature also points out, means you don’t accidentally split an emoji in half.

(Much of this content shamelessly stolen from Working with strings in Rust, by fasterthanli.me.)
