On Attachments: What you should and shouldn't store on a ledger

I am often asked whether this or that piece of data belongs on the ledger, so here are some thoughts on the matter. Please discuss and add your own.

Some reasons not to store blobs on ledger

I would not recommend storing large blobs of data on the ledger because:

  • There is no binary data type (to some degree, this is meant to discourage the practice in the first place).
  • Every time the contract storing the data is used (e.g. fetched), the data is loaded and sent around, even if the actual attachment is not needed. This generates a lot of network load.
  • The attachment is emitted through the APIs wherever the containing contract is emitted. This generates a lot of load on the API server and its consumers.

State vs Assets

In general terms, I’d split application state from assets. State is any information that changes the behaviour of the application. I would consider a tag on a Tweet part of the state, because it determines where in the application the tweet appears. Similarly, I would consider the search index for Google’s image search part of the application state: it determines which images appear in which searches.
On the flip side, I would not consider the text of a Tweet part of the state. The text could be different or completely missing and the only thing that would change is that text on the end user’s screen. Similarly, the exact image is not part of the state of an image search. If an image of a cat got switched with one of a dog without being reanalysed or reindexed, the only thing that changes is the image being shown on the final consumer’s screen.

With DLT applications, I would keep the on-/off-ledger split close to the state/asset split. However, for convenience and lower complexity, it does make sense to store small assets like labels or short paragraphs of text on ledger. Similarly, for a simple chat app, it makes sense to store the text content of the messages on ledger.

How about cases where there are large assets?

A simple design is to store them in an object store and reference them by hash.

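-- A pointer to a file kept in an off-ledger object store.
template Attachment
  with
    owner : Party        -- party that uploaded the file
    receivers : [Party]  -- parties allowed to see the link
    url : Text           -- where the file can be downloaded
    hash : Text          -- digest of the file, used to verify the download
  where
    signatory owner
    observer receivers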

The sender of a file simply sends a link and a hash that can be used to verify that the correct file was downloaded. Security could be improved by encrypting the file and adding the decryption key to the template, so that only parties that can see the Attachment contract can read the file.

What if I need strong availability or integrity guarantees?

There are some weaknesses to this, however. The owner can take the file down, or even create Attachment contracts for which there is no file at all. DAML’s value proposition is a common view with strong integrity and consistency guarantees: observers are guaranteed to see the same data as the signatories. When a blob is sent on ledger, the receiver cannot claim not to have received the data, nor does the sender have a way to “take it away” again. A URL/hash pair alone offers neither of these two guarantees.

Availability can be solved with aggressive mirroring. That is, every receiver of a file immediately downloads it, verifies it by calculating the hash, and then mirrors it for all other receivers.

Non-repudiation can be achieved by separating the download and decryption steps. That is, the receiver of the file only requests the decryption key once they have downloaded the file and verified the hash. By requesting the key, they confirm receipt. In case of conflict, the sender can prove that file and key are correct by decrypting a file that has the given hash using the given key. Faking this would require a hash collision.

8 Likes

As a small addendum on the deniability issue: that can be worked around by having the “heavy payload” contract follow the usual propose/accept pattern. Instead of proposing an IOU, you propose a url/hash pair, and the recipient only accepts it after having checked they can download the file and verify the hash. It’s obviously weaker as the recipient could fake it, but then they’d be the one not able to retrieve the file anymore.
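To make the idea concrete, here is a rough sketch of what such a proposal could look like. The module, template and choice names are made up for illustration, and the download and hash check happen off ledger before the receiver exercises the choice.

```daml
module ProposeAttachment where

-- Offer of a URL/hash pair; only the owner has signed at this point.
-- All names here are hypothetical.
template AttachmentProposal
  with
    owner : Party
    receiver : Party
    url : Text
    hash : Text
  where
    signatory owner
    observer receiver

    -- The receiver should exercise this only after downloading the file
    -- and checking its hash off ledger; the ledger cannot enforce that.
    choice AcceptAttachment : ContractId ReceivedAttachment
      controller receiver
      do
        create ReceivedAttachment with
          owner = owner
          receiver = receiver
          url = url
          hash = hash

-- Signed by both parties: the receiver's signature is their claim of receipt.
template ReceivedAttachment
  with
    owner : Party
    receiver : Party
    url : Text
    hash : Text
  where
    signatory owner, receiver
```

The resulting contract carries both signatures, so the receiver cannot later deny having claimed receipt, although nothing stops them from accepting without really checking.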

1 Like

@Gary_Verhaegen, your proposal doesn’t give the sender any protection. A data market is a good use case to have in mind. Without actually putting the data on the ledger, there is no way for the smart contracts to determine whether the data has changed hands.
The next best thing to aim for is that, if there is disagreement on whether the data has changed hands, both parties have irrefutable evidence.

  1. Alice needs to send Bob a file. She sends a URL and a file hash for the encrypted file via the ledger. Both have evidence that Alice sent this. Having this on ledger puts an obligation on Alice to actually make the file available there.
  2. Bob confirms receipt after checking the hash. Alice now has evidence that Bob received the file. Her obligation to provide the encrypted file is now gone. She has a new obligation to provide the decryption key.
  3. Alice sends the key on ledger. Her obligation is complete. She has proof that Bob obtained the file and the key.
  4. The only way Bob can complain is if the decryption key and file don’t match or the file doesn’t contain the agreed content. In this case an off-ledger resolution is needed, but unless Alice managed to generate a hash collision, there can be no disagreement on what data actually changed hands.
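
The four steps above could be modelled roughly as follows. This is only a sketch: all template, choice and field names are invented for illustration, and encryption, download and hash checking all happen off ledger.

```daml
module EncryptedFileTransfer where

-- Step 1: Alice offers a link to the encrypted file plus its hash.
template EncryptedFileOffer
  with
    sender : Party
    receiver : Party
    url : Text
    encryptedFileHash : Text
  where
    signatory sender
    observer receiver

    -- Step 2: Bob confirms receipt after checking the hash off ledger.
    choice ConfirmReceipt : ContractId KeyRequest
      controller receiver
      do
        create KeyRequest with
          sender = sender
          receiver = receiver
          encryptedFileHash = encryptedFileHash

-- Evidence that Bob holds the encrypted file; Alice now owes him the key.
template KeyRequest
  with
    sender : Party
    receiver : Party
    encryptedFileHash : Text
  where
    signatory sender, receiver

    -- Step 3: Alice publishes the key on ledger.
    choice RevealKey : ContractId KeyDelivery
      with
        decryptionKey : Text
      controller sender
      do
        create KeyDelivery with
          sender = sender
          receiver = receiver
          encryptedFileHash = encryptedFileHash
          decryptionKey = decryptionKey

-- Step 4: the record both parties can point to in an off-ledger dispute.
template KeyDelivery
  with
    sender : Party
    receiver : Party
    encryptedFileHash : Text
    decryptionKey : Text
  where
    signatory sender, receiver
```

Each contract in the chain is visible to both parties, so at every step both hold the same evidence of who has done what so far.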
1 Like

I like the state vs asset distinction, will use it.

Actually, there is a DAML/IPFS integration on the DAML marketplace, which I guess does something similar to your Attachment template: IPFS-DAML Integration - DAML Marketplace

2 Likes

I don’t understand what encryption adds here. If the workflow is:

  1. Alice sends a proposal with hash and URL to Bob.
  2. Bob accepts the proposal.

How does Alice not have just as much proof as in your 4-step process that Bob did claim he had downloaded the file and checked the hash?

1 Like

What if Bob downloads the file, but doesn’t accept the proposal?

1 Like

I see your point. Worst case, if Bob does not accept the key proposal and it somehow ends up needing litigation, Alice can produce the file and an external party can check that the hash matches and that the key does decrypt it. Thanks for the explanation!

2 Likes

@bernhard, a practical use case comes to mind where this mechanism could come in handy: public procurement, where parties may need to prove that they have or haven’t sent or received files with a certain content on time. What do you think?

3 Likes

This could be a good example use case. Great point, @gyorgybalazsi.

1 Like

Great suggestions and certainly good to focus on this topic as it will have practical impact both for the underlying system and the involved parties. Thanks Bernhard!

2 Likes

Couldn’t these two points be mitigated by simply storing binary data in a separate contract, which you then reference by ID?

How far would that go to offsetting the availability guarantees you mentioned as a disadvantage? Also, as I mentioned in the encryption post, the advantage of using DAML’s built-in privacy model is that you don’t have to worry about encryption keys etc.

And in terms of performance, could you explain at a high level what the API server has to do that is so taxing? Thanks!

1 Like

Yes, this could be somewhat mitigated by storing the data on a separate contract and only ever referring to it by ContractId. That way you can reduce the number of Events that reference the contract arguments containing the data to one.
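
As a rough illustration of that split (the names are hypothetical, and the payload is assumed to be text-encoded because there is no binary type):

```daml
module BlobReference where

-- Holds the payload; created once and afterwards only referred to by ID.
template Blob
  with
    owner : Party
    readers : [Party]
    content : Text  -- encoded payload, e.g. Base64
  where
    signatory owner
    observer readers

-- Application state that merely points at the payload.
template Document
  with
    owner : Party
    readers : [Party]
    title : Text
    blobId : ContractId Blob
  where
    signatory owner
    observer readers

    -- Changing the title recreates only this small contract, not the Blob.
    choice Rename : ContractId Document
      with
        newTitle : Text
      controller owner
      do
        create this with title = newTitle
```

Anything that fetches the Blob inside a transaction still pulls the payload along, so the mitigation only holds as long as workflows pass the ContractId around without fetching it.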

On the API side, it’s just a matter of the amount of data being serialized, deserialized, and transmitted. gRPC has message size limits for a reason. It’s not designed to stream large binary data. If you Google the topic, you’ll find lots of blog posts about chunking the data into multiple messages. The DAML Ledger API has no provisions for that, and neither do the internals of most ledgers.
You could solve that, e.g. by chunking the data on ledger (which would mean uploading a file in multiple chunks in multiple transactions), but I think you’d be likely to choke the entire system with your data and transaction throughput. It may be workable for small numbers of small files. I don’t think you could build a market for stock photography that way.
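
For what it’s worth, on-ledger chunking could look something like the sketch below. The template and field names are made up, the chunk size and encoding are left to the client, and this is meant to show the shape of the idea rather than to recommend it.

```daml
module ChunkedUpload where

-- One piece of a file; each chunk would be created in its own transaction.
template FileChunk
  with
    owner : Party
    receivers : [Party]
    fileHash : Text   -- hash of the whole file, ties the chunks together
    index : Int       -- zero-based position of this chunk
    total : Int       -- number of chunks in the file
    content : Text    -- encoded chunk payload
  where
    signatory owner
    observer receivers
    -- Reject chunks whose position is outside the declared range.
    ensure total > 0 && index >= 0 && index < total
```

A receiver would collect all chunks with the same fileHash, reassemble them in index order off ledger, and check the result against the hash.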

1 Like

Thanks for your response; I had a look at the posts on chunking.

I would think this is all the more reason to have a binary data type that you can mmap directly rather than having to (de)serialize.

1 Like

I can definitely see the appeal of using a DAML ledger for data transfer. You don’t have to mess with key management, compatible encryption libraries, or finding external services to store the data and channels to transfer it. You get data transfer, privacy and non-repudiation practically for free (and with something like Canton you even get end-to-end encryption). And Luciano’s approach should conceptually have a reasonable cost.

But I agree with Bernhard that right now you should (or even have to) avoid using the ledger for large data, given our stack. gRPC is one reason, but in Canton we also currently store all contract data twice, once in the Ledger API server, and once in a Canton-internal DB. We might move away from this design at some point, but not in the foreseeable future. Furthermore, since we also ship contract data around, you have to be careful with delegated choices and divulgence, as your transactions might blow up in size enormously.

3 Likes