Just a few questions around RESOURCE_EXHAUSTED errors. We are using the Daml on VMware ledger. A key feature of this ledger is its topology: Participant Nodes, some of which can be identical replicas (for fault-tolerance purposes), backed by Writer/Committer Nodes using BFT consensus.
While overloading a Participant Node with requests, we received the following message:
Our question: does this message occur because that particular Participant Node is overloaded, or because the underlying Writer Nodes are overloaded? If it is just the Participant Node that is overloaded, we could switch over to a secondary Participant Node. However, if the underlying Writer Nodes are overloaded, switching over to a secondary would bring no benefit.
Next question: also while overloading a Participant Node, we encountered a RESOURCE_EXHAUSTED scenario that did not provide any description in the error message:
We were wondering under what circumstances this occurs, and whether anyone can shed more light on whether it is due to an overloaded Participant Node or to overloaded underlying Writer Nodes.
The participant maintains a queue of incoming commands to process and returns RESOURCE_EXHAUSTED if this queue is full and cannot accept any more commands (you can see this in the code). The default configuration is fairly conservative to prevent unwanted out-of-memory errors, but you may want to check the documentation for your Daml driver with regard to how to configure this setting (note that the implementation may also have a different default). A larger queue allows you to be more elastic in handling peaks, but the client must be aware that RESOURCE_EXHAUSTED means the participant is applying backpressure, and should address this by retrying later. (There is also the case in which a request is too large, which likewise causes a RESOURCE_EXHAUSTED error to be returned, but that usually happens when uploading a large package; transactions usually do not require configuring the participant to accept larger messages.)
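For illustration, here is a minimal sketch of the kind of client-side retry this implies, using grpc-java. The `call` callable, the attempt count, and the backoff parameters are all assumptions made for the example, not anything prescribed by the Daml libraries:

```java
import io.grpc.Status;
import io.grpc.StatusRuntimeException;

import java.util.concurrent.Callable;

public final class Backoff {

    // Retries a gRPC call when the server answers RESOURCE_EXHAUSTED,
    // sleeping with exponential backoff between attempts. `call` stands in
    // for whatever ledger API submission your client makes.
    public static <T> T retryOnResourceExhausted(Callable<T> call, int maxAttempts)
            throws Exception {
        long delayMillis = 100; // initial backoff: an assumption, tune for your load
        for (int attempt = 1; ; attempt++) {
            try {
                return call.call();
            } catch (StatusRuntimeException e) {
                boolean backpressure =
                        e.getStatus().getCode() == Status.Code.RESOURCE_EXHAUSTED;
                if (!backpressure || attempt >= maxAttempts) {
                    throw e; // not backpressure, or out of retries: give up
                }
                Thread.sleep(delayMillis);
                delayMillis = Math.min(delayMillis * 2, 5_000); // cap the backoff
            }
        }
    }
}
```

Note that the oversized-request case mentioned above will not succeed on retry no matter how long you wait, so in practice you may also want to inspect the status description and fail fast in that case.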
I’m not 100% sure about the second RESOURCE_EXHAUSTED. Can you describe the difference between the scenarios in which the two occurred?
I suspect that the RESOURCE_EXHAUSTED message may have originated from a back-pressuring committer node (code). If that’s the case, I believe the approach should be the same: change the configuration if you need to, and make sure to handle those errors with an appropriate retry strategy.
Stefano is right; the second set of errors is almost certainly backpressure from the ledger itself. (I’m making a note to improve the error message.) The Daml Driver for VMware Blockchain (VMBC) can be configured to increase the queue size, but as it’s proprietary, you’ll need to consult VMware’s documentation or talk to them to figure out which configuration changes are appropriate for your use case.
Just increasing the queue size probably won’t solve your problem. I expect you’ll need to increase the resources available to the VMBC committer nodes.
Fortunately, both Digital Asset and VMware have been hard at work improving the performance of the committer and ledger, so a future release will be able to handle much more load with the same resources.
(Full disclosure: I am one of the people working on these performance improvements.)
Do both cases log the “Back-pressure” message on the Participant Nodes? We can see these messages appearing in the Participant Node logs, but we are not sure whether both cases are covered.
(Full disclosure: I am one of the people working on exploring/testing the VMware ledger.)
Only the latter. Note that we use a logging wrapper that allows us to inject data points relevant at the log call site without explicitly threading through and logging the whole context. You should therefore see that message plus some contextual information, such as the submission identifier, the submitting party, and so on (either as a string appended to the log message or as structured fields if you use the logstash-logback-encoder).
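If you want that context as queryable JSON fields rather than an appended string, a standard logback setup along these lines should work. This is a generic sketch of a logstash-logback-encoder configuration, not the participant’s shipped logback.xml:

```xml
<configuration>
  <appender name="JSON" class="ch.qos.logback.core.ConsoleAppender">
    <!-- Emits each log event as a JSON object, so contextual data points
         (submission identifier, submitting party, ...) show up as separate
         JSON properties instead of a string appended to the message. -->
    <encoder class="net.logstash.logback.encoder.LogstashEncoder"/>
  </appender>
  <root level="INFO">
    <appender-ref ref="JSON"/>
  </root>
</configuration>
```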