In Haskell the string type is defined as type String = [Char], and strings can be manipulated using list functions.
In DAML there seems to be no Char type, and functions such as splitAt are implemented by wrapping explode and implode (from the Text library) around the Prelude list equivalent. Incidentally, those Text functions are undocumented.
So why in DAML can I write 'a' and have its type inferred as Char, yet apparently not be able to do anything with it?
There isn't really a Char type in DAML. There are some artifacts of a Char type that are imposed upon us by virtue of being based on GHC and not having patched it too deeply in this regard. Once you try to actually compile a program with a Char literal, you will get a compilation error in the conversion to DAML-LF, complaining about either the literal or the type, depending on what the compiler hits first. So unless you are working on the DAML compiler and need to understand the inner workings of GHC and why we have artifacts of Char, you are best off pretending that Char literals result in a parse error and that there is no Char type.
The more interesting question is: why does DAML not have a Char type?
This is primarily related to Unicode being more complex than one might realize at first. The JVM (the DAML interpreter is implemented in Scala) stores Strings in UTF-16 encoding. The char type in Java represents a UTF-16 code unit, so 2 bytes per char. However, not every Unicode character (code point, if you want to be precise) can be encoded in 2 bytes in UTF-16. For example, a well-loved emoji code point such as 😀 (U+1F600) is encoded as a so-called surrogate pair consisting of two 16-bit code units. Java will gladly give you 2 chars here. However, for most text-processing use cases you really want to treat this as a single unit (no pun intended: this is 1 code point but 2 code units). There are two ways out of this mess:
- Introduce a 32-bit Char type to represent all possible Unicode code points. This is the solution taken by Haskell. The downside is that it confuses people who are used to the Java model. It also requires a new DAML-LF primitive type, which always comes at a cost (not saying that it cannot be justified here).
- Don't expose a type for individual characters and instead decompose Text into smaller Texts in things like explode. This clearly doesn't require a new DAML-LF type, and while it might seem like an odd choice, it mostly works just fine for the relatively limited text processing that is usually done in DAML.
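To make the code unit vs. code point distinction concrete, here is a small Java sketch (Java because that is the model being discussed, not because this is how the DAML interpreter does it). The class name and the explodeByCodePoint helper are made up for illustration; the helper only mimics the "split Text into smaller Texts" idea and is not the actual implementation of explode.

```java
import java.util.List;
import java.util.stream.Collectors;

final class CodeUnitsDemo {
    // U+1F600 written as its surrogate pair, so the source file encoding does not matter.
    static final String EMOJI = "\uD83D\uDE00";

    // Number of UTF-16 code units (Java chars) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Hypothetical helper: split a string into one small string per code point,
    // so a surrogate pair is never cut in half.
    static List<String> explodeByCodePoint(String s) {
        return s.codePoints()
                .mapToObj(cp -> new String(Character.toChars(cp)))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(codeUnits(EMOJI));   // 2
        System.out.println(codePoints(EMOJI));  // 1
        System.out.println(Character.isHighSurrogate(EMOJI.charAt(0))); // true
        System.out.println(explodeByCodePoint("ab" + EMOJI).size());    // 3
    }
}
```

Iterating with charAt would hand you the two halves of the emoji separately, which is exactly the trap the second option avoids.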
I would love it if DAML had distinct types for Char and Text. This would allow more informative types for some existing primitive functions:
explode : Text -> [Char]
implode : [Char] -> Text
Other primitives might be changed to work on Char instead of on Text:
isNewLine : Char -> Bool
isUpper : Char -> Bool
And perhaps we might add an indexing function as a primitive (alongside the existing substring):
index : Int -> Text -> Char
Would it be possible to use an existing 32-bit primitive type for this, such as Int, to substitute for Char?
That exists: toCodePoints : Text -> [Int], fromCodePoints : [Int] -> Text
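For anyone coming from the JVM side, a rough Java analogue of that pair might look like the sketch below. The method names simply mirror the DAML ones for readability; this is an illustration of the "integers as code points" idea, not a claim about how the DAML functions are implemented.

```java
import java.util.Arrays;

final class CodePointRoundTrip {
    // Analogue of toCodePoints: one int per Unicode code point.
    static int[] toCodePoints(String s) {
        return s.codePoints().toArray();
    }

    // Analogue of fromCodePoints: rebuild the string from code points.
    static String fromCodePoints(int[] cps) {
        return new String(cps, 0, cps.length);
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00"; // "a" followed by U+1F600
        int[] cps = toCodePoints(s);
        System.out.println(Arrays.toString(cps));          // [97, 128512]
        System.out.println(fromCodePoints(cps).equals(s)); // true
    }
}
```

Note that the emoji comes back as a single integer even though it occupies two chars in the string.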
For what it's worth, Rust also tends to use strings where you might expect a character. For example, when calling char::to_uppercase, you'll get an iterator over char, not a single one. This is because, for example, "ß" is "SS" in uppercase.
So working with characters isn't as simple as you might like. For much of DAML's use cases, Text works well and, as @cocreature also points out, means you don't accidentally split an emoji in half.
(Much of this content shamelessly stolen from Working with strings in Rust, by fasterthanli.me.)
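The same uppercase quirk can be seen on the JVM. In the sketch below (class and method names are just for illustration), the string-level conversion grows the text by one character, while the char-level conversion cannot express the result and leaves the character unchanged.

```java
import java.util.Locale;

final class UppercaseDemo {
    // String-level uppercasing: allowed to change the length of the text.
    static String upper(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    // Char-level uppercasing: one char in, one char out.
    static char upperChar(char c) {
        return Character.toUpperCase(c);
    }

    public static void main(String[] args) {
        String eszett = "\u00DF"; // the German sharp s
        System.out.println(upper(eszett));                  // SS
        System.out.println(upper("stra\u00DFe"));           // STRASSE
        System.out.println(upperChar('\u00DF') == '\u00DF'); // true: no single-char uppercase form is returned
    }
}
```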