RESOURCE_EXHAUSTED with no description and overloaded nodes

Hey all!

Just a few questions around RESOURCE_EXHAUSTED. We are using the Daml on VMware ledger. A key feature of this ledger is a topology with Participant Nodes, some of which may be identical replicas (for fault-tolerance purposes). These Participant Nodes are backed by Writer/Committer Nodes that use BFT consensus.

While overloading a Participant Node with requests, we received the following message:

Our question: does this message occur because that particular Participant Node is overloaded, or because the underlying Writer Nodes are overloaded? If it is just the Participant Node that is overloaded, we could switch over to a secondary Participant Node; however, if the underlying Writer Nodes are overloaded, switching over to a secondary would bring no benefit.

Next question: also while overloading a Participant Node, we encountered a RESOURCE_EXHAUSTED scenario that did not provide any description in the error message:

We were wondering under what circumstances this occurs, and whether anyone can shed more light on whether it is caused by an overloaded Participant Node or by overloaded underlying Writer Nodes.

Thank you!!

The participant maintains a queue of incoming commands to process and returns RESOURCE_EXHAUSTED if this queue is full and cannot accept any more commands (you can see this in the code). The default configuration is fairly conservative to prevent unwanted out-of-memory errors, but you may want to check the documentation for your Daml driver on how to configure this setting (note that the implementation may also use a different default). This allows you to be more elastic in handling peaks, but of course the client must be aware that RESOURCE_EXHAUSTED means the participant is applying backpressure, and should address this by retrying later. (There is also the case in which a request is too large, which causes a RESOURCE_EXHAUSTED error to be returned as well, but that usually happens when uploading a large package; transactions usually do not require configuring the participant to accept larger messages.)
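To make the queueing mechanism concrete, here is a minimal sketch of the bounded-queue idea (illustrative only: `CommandQueue`, the capacity of 256, and the string payloads are made up, not the actual participant implementation):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

import io.grpc.Status;
import io.grpc.StatusRuntimeException;

// Illustrative sketch of the bounded command queue, not the real participant code.
final class CommandQueue {
  // The capacity plays the role of the configurable "maximum commands in flight".
  private final BlockingQueue<String> pending = new ArrayBlockingQueue<>(256);

  void submit(String command) {
    // offer() returns false instead of blocking when the queue is full,
    // which is surfaced to the client as a RESOURCE_EXHAUSTED status.
    if (!pending.offer(command)) {
      throw new StatusRuntimeException(
          Status.RESOURCE_EXHAUSTED.withDescription("Command queue is full"));
    }
  }
}
```

Using a non-blocking `offer()` is what makes the rejection immediate: the participant fails fast under load rather than tying up request threads.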

I’m not 100% sure about the second RESOURCE_EXHAUSTED. Can you describe the difference between the scenarios in which the two occurred?

I suspect that the RESOURCE_EXHAUSTED message may have originated from a back-pressuring committer node (code). If that’s the case, I believe the approach should be the same: change the configuration if needed and make sure to handle those errors with an appropriate retry strategy.
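Whichever node the backpressure comes from, the client-side handling could look something like this minimal sketch (`BackpressureRetry` and its parameters are made up for illustration; it assumes the submission surfaces gRPC status codes via `StatusRuntimeException`):

```java
import java.util.concurrent.Callable;

import io.grpc.Status;
import io.grpc.StatusRuntimeException;

final class BackpressureRetry {
  // Retries `submit` with exponential backoff while the ledger keeps
  // answering RESOURCE_EXHAUSTED; any other failure is rethrown as-is.
  static <T> T withBackoff(Callable<T> submit, int maxAttempts, long initialDelayMs)
      throws Exception {
    long delayMs = initialDelayMs;
    for (int attempt = 1; ; attempt++) {
      try {
        return submit.call();
      } catch (StatusRuntimeException e) {
        boolean backpressured =
            e.getStatus().getCode() == Status.Code.RESOURCE_EXHAUSTED;
        if (!backpressured || attempt >= maxAttempts) {
          throw e; // not backpressure, or out of retries: give up
        }
        Thread.sleep(delayMs);
        delayMs *= 2; // back off exponentially before the next attempt
      }
    }
  }
}
```

For example, `BackpressureRetry.withBackoff(() -> client.submit(cmd), 5, 100)` would try up to five times, waiting 100 ms, 200 ms, 400 ms, … between attempts.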

Stefano is right; the second set of errors is almost certainly backpressure from the ledger itself. (I’m making a note to improve the error message.) The Daml Driver for VMware Blockchain (VMBC) can be configured to increase the queue size, but as it’s proprietary, you’ll need to consult VMware’s documentation or talk to them to figure out which configuration changes are appropriate for your use case.

Just increasing the queue size probably won’t solve your problem, though. I expect you’ll need to increase the resources available to the VMBC committer nodes.

Fortunately, both Digital Asset and VMware have been hard at work improving the performance of the committer and ledger, so a future release will be able to handle much more load with the same resources.

(Full disclosure: I am one of the people working on these performance improvements.)

Do both cases log a “Back-pressure” message on the Participant Nodes? We can see such messages appearing in the Participant Node logs, but we’re not sure whether both cases are covered.

(Full disclosure: I am one of the people working on exploring/testing the VMware ledger.)

Only the latter. Note that we use a logging wrapper that allows us to inject data points relevant at the log call site without explicitly threading through and logging the whole context. You should therefore see that message plus some contextual information, such as the submission identifier and the submitting party, either in the form of a string appended to the log message or as an object if you use the logstash-logback-encoder.
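As a sketch of what that looks like with slf4j and the logstash-logback-encoder (the field names here are made up; the exact keys the participant emits may differ):

```java
import net.logstash.logback.argument.StructuredArguments;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

final class BackpressureLogging {
  private static final Logger log = LoggerFactory.getLogger(BackpressureLogging.class);

  void onBackpressure(String submissionId, String party) {
    // With a plain pattern encoder the key-value pairs are appended to the
    // message string; with the LogstashEncoder they become JSON fields.
    log.info("Back-pressure",
        StructuredArguments.kv("submissionId", submissionId),
        StructuredArguments.kv("submittingParty", party));
  }
}
```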
