Performance testing

Hi DAML Jedis,
I am currently working on the implementation of a DAML ledger with various microservices interacting with it. In practice, this translates to executor services, which call the Ledger gRPC API on a particular participant node to exercise choices and create contracts, and extractor services, which subscribe to the gRPC stream of transactions.


(Very simplistic diagram of the executor and extractor services described above.)

Before rolling this project out of the POC phase, I need to put performance testing in place to identify bottlenecks and to establish a set of metrics to compare releases against.
As such, I need to evaluate throughput and latency for:

  1. submitting commands to the ledger via the Ledger API - assuming a perfectly capable and optimised microservice (or replicas of it) can create tens of thousands of transactions per second, how fast can they be sent to the Ledger API? What is the best strategy for managing and scaling gRPC channels?
  2. gRPC requests being handled by DAML - assuming a perfect setup that allows us to send tens of thousands of transactions per second, what is the lag between a request being sent and it being processed and effective on the ledger?
  3. ledger transactions being streamed out - what is the lag between a transaction being processed by the ledger and it being streamed out via gRPC?

My questions are:

  • are there any public benchmarks, metrics, or reference figures for performance I could use? This would be essential to determine how much performance we are leaving on the table and how far we are from the theoretical performance ceiling
  • what would be the best strategy to load/performance-test these various interactions (for example, using the simplest possible DAML contract to ensure contract size is not a bottleneck, specific ways of managing channels from microservices, sharding participant nodes, …)?

I understand this is a non-trivial ask, but I would love to get some insight, especially as I imagine other parties have needed these metrics and tests before.

Hi @francescobenintende,

Welcome to the forum!

Let’s get straight to your questions:

Performance numbers depend on many factors (the commands you send, the DAML model, hardware, network, …), and therefore we do not share numbers publicly at the moment. But let me point you to a few utilities that are helpful when doing performance tests:

  1. The myParticipant.ledger_api.transactions.start_measuring command will subscribe to the transaction stream and register a metric that tracks the rate at which a participant outputs new transactions (a console sketch follows this list):
    Canton Console — Canton 1.0.0-SNAPSHOT documentation
    The command also takes a callback that gets invoked on every transaction; that allows you to implement your own measurements if you want to bypass metrics.

  2. This section of our documentation explains how to configure Canton to report metrics to a file, via JMX, to Graphite, or to Prometheus (a config sketch follows this list):
    Monitoring — Canton 1.0.0-SNAPSHOT documentation
    To get started with metrics, I would probably go for JMX. That allows you to connect to your node using VisualVM and browse all the available metrics. Once you have a rough idea of which metrics are relevant to you, you can still set up Graphite or Prometheus.

  3. If you want to see how much time a command spends in each processing stage, you should look into tracing: Monitoring — Canton 1.0.0-SNAPSHOT documentation

  4. Now to your very important question of how to actually load the system. As a starting point, I would use a very simple DAML model. If you don’t have one, you can use the built-in ping workflow: Canton Console — Canton 1.0.0-SNAPSHOT documentation
    The bong workflow has been designed specifically to create high load and contention (a console sketch follows this list):
    Canton Console — Canton 1.0.0-SNAPSHOT documentation
    From there you can move in various directions:

    1. Create your own - more realistic - DAML model and workflow.
    2. Tweak the configuration.
    3. Group multiple DAML commands into a single submission (batching) to get higher throughput (a batching sketch follows this list).
    4. Increase the number of participants (and, perhaps, domains) to get higher throughput.
  5. Last but not least, if you observe that the system gets overloaded (timeouts), look at this post: Canton - Sequencer time [..] Exceeding max-sequencing-time / Response message for request X timed out - #2 by MatthiasSchmalz
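
To make the points above concrete, here are a few sketches. For point 1, a minimal Canton console snippet; the parameter names (parties, metricName, onTransaction) and the returned close-handle are assumptions on my side, so check the linked console documentation for the exact signature:

    // Sketch only: register a rate metric over the transaction stream.
    // Parameter names and the returned handle are assumptions.
    val measurement = myParticipant.ledger_api.transactions.start_measuring(
      parties = Set(alice),            // parties whose transactions to observe
      metricName = "txs-per-second",   // name under which the rate metric is registered
      onTransaction = tx => (),        // optional callback invoked on every transaction
    )
    // ... generate load against the participant ...
    measurement.close()                // stop measuring when done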
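
For point 2, a configuration sketch in HOCON; the reporter keys shown are assumptions based on the monitoring documentation and may differ between Canton versions:

    # canton.conf sketch: report metrics via JMX and, optionally, to Graphite.
    # Key names may differ across versions; the linked monitoring page is authoritative.
    canton.monitoring.metrics.reporters = [
      { type = jmx }                                           # browse with VisualVM
      { type = graphite, address = "localhost", port = 2003 }  # push to Graphite
    ]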
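
For point 4, a console sketch of the ping and bong workflows; the bong parameters are assumptions, so consult the linked pages for the actual options:

    // A single ping: one round trip between two participants.
    participant1.health.ping(participant2)

    // A bong fans out into many pings to create high load and contention.
    // The parameter names here are assumptions; see the console documentation.
    participant1.testing.bong(
      targets = Set(participant1.id, participant2.id),
      levels = 5,  // higher levels multiply the number of messages
    )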
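
And on batching (point 3 of the sub-list): the Ledger API allows a single submission to carry several commands, which are committed atomically as one transaction. A hedged sketch, assuming a console command that takes the acting parties and a list of commands (the name and parameters are assumptions):

    // Sketch: submit several commands in one atomic submission.
    // The console command name and its parameters are assumptions.
    val batch = Seq(createCmd1, createCmd2, exerciseCmd)  // hypothetical commands built elsewhere
    myParticipant.ledger_api.commands.submit(
      actAs = Seq(alice),
      commands = batch,
    )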

That’s a lot of information and I hope it is helpful to you. Please don’t hesitate to follow up on any of these points if you need more help!
Regards,
Matthias

Here is the documentation for DAML metrics:

https://docs.daml.com/tools/sandbox.html#metrics

Canton also exposes these metrics, but they are not (yet) part of the Canton documentation.

If you submit commands using the command service (as opposed to the command submission service), you can use the following metric to measure latency:
daml.lapi.command_service.submit_and_wait.mean
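
If you expose metrics over JMX, you can also read that timer programmatically with the standard javax.management API. In this sketch the service URL and the "metrics" JMX domain are assumptions that depend on your reporter setup:

    // Sketch: read the mean of the submit_and_wait timer via remote JMX.
    import javax.management.ObjectName
    import javax.management.remote.{JMXConnectorFactory, JMXServiceURL}

    // URL and object-name domain are assumptions; adjust to your node's JMX configuration.
    val url       = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi")
    val connector = JMXConnectorFactory.connect(url)
    val server    = connector.getMBeanServerConnection
    val timer     = new ObjectName("metrics:name=daml.lapi.command_service.submit_and_wait")
    println("mean latency: " + server.getAttribute(timer, "Mean"))
    connector.close()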

Thank you so much @MatthiasSchmalz! I’ll go over all these!

Regarding performance, I found some posts (here and here) and a whitepaper mentioning 27,000 TPS. For some reason the case study is not accessible: the downloaded paper is the wrong one.

Do you have any information/record of this case study internally?

Thanks again!

The whitepaper and case study refer to a ledger implementation that is no longer maintained; they do not refer to Canton.

In the meantime, I have found another page that could be helpful to you:
https://www.canton.io/docs/dev/user-manual/architecture/overview.html#scaling-and-performance

And here is a recent article on optimizing performance:

https://www.canton.io/docs/dev/user-manual/usermanual/FAQ.html#how-to-setup-canton-to-get-best-performance