Sandbox gRPC exception: "Failed to parse submission"

Hello team,

I’m encountering a gRPC exception when submitting a single create command to a locally running sandbox:

ERROR: Internal error: Failed to parse submission, correlationId=a147bf5f-ea20-4cb2-bce3-63e71bccd4e5 (context: {readAs=, submittedAt=2021-05-12T16:16:51.066Z, applicationId=daml-script, deduplicateUntil=2021-05-13T16:16:51.066Z, actAs=[Buyer], commandId=a147bf5f-ea20-4cb2-bce3-63e71bccd4e5})
ERROR: Unhandled internal error (context: {readAs=, submittedAt=2021-05-12T16:16:51.066Z, applicationId=daml-script, deduplicateUntil=2021-05-13T16:16:51.066Z, actAs=[Buyer], commandId=a147bf5f-ea20-4cb2-bce3-63e71bccd4e5})
io.grpc.StatusRuntimeException: INTERNAL: Failed to parse submission, correlationId=a147bf5f-ea20-4cb2-bce3-63e71bccd4e5
at io.grpc.Status.asRuntimeException(Status.java:525)
at com.daml.platform.apiserver.services.ApiSubmissionService.handleSubmissionResult(ApiSubmissionService.scala:167)
at com.daml.platform.apiserver.services.ApiSubmissionService.$anonfun$deduplicateAndRecordOnLedger$2(ApiSubmissionService.scala:138)
at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
at java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1402)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
WARN: Service responded with error for submitting command with context (Future(),a147bf5f-ea20-4cb2-bce3-63e71bccd4e5). Status of command is unknown. watching for completion…
io.grpc.StatusRuntimeException: INTERNAL: Failed to parse submission, correlationId=a147bf5f-ea20-4cb2-bce3-63e71bccd4e5
at io.grpc.Status.asRuntimeException(Status.java:525)
at com.daml.platform.apiserver.services.ApiSubmissionService.handleSubmissionResult(ApiSubmissionService.scala:167)
at com.daml.platform.apiserver.services.ApiSubmissionService.$anonfun$deduplicateAndRecordOnLedger$2(ApiSubmissionService.scala:138)
at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
at java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1402)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)

To recreate this issue, update the `unrollDates` call in the contingent-claims example to “… (unrollDates 2021 2030…”, then build and start the sandbox (`daml start --sandbox-option "-s"`).

I have tried playing with some of the sandbox configurations, such as “--max-inbound-message-size”, but I cannot get past this exception.

Thank you in advance for all of your help!

Kind Regards,
Brian

Hi @Brian_Weir,
This example hits the protobuf recursion limit. With some manually added log statements (which should clearly not be required, so we’ll look into that) you get a failure of the form:

Protocol message had too many levels of nesting.  May be malicious.  Use CodedInputStream.setRecursionLimit() to increase the depth limit.

Having taken a quick look at the code it seems like the list produced by unrollDates gets turned into something like

a `And` b `And` c `And` …

with the level of nesting of that tree structure corresponding to the length of the list. Protobuf limits the recursion depth (I think 100 is the default), so this blows up.
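
As a rough sketch of what that shape looks like (hypothetical names, not the actual contingent-claims definitions):

```daml
module NestingDepth where

-- Hypothetical binary claim type, purely to illustrate the shape.
data Claim
  = Leaf Text
  | And with lhs : Claim; rhs : Claim

-- Folding the list into binary Ands produces a chain a `And` (b `And` (c `And` d)):
-- the depth of the tree equals the length of the list, and every level becomes
-- one more level of message nesting on the wire.
combineAll : [Claim] -> Claim
combineAll [c] = c
combineAll (c :: cs) = And with lhs = c; rhs = combineAll cs
combineAll [] = Leaf "" -- placeholder; a real model would have a proper neutral element
```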

Afaik you cannot bump this limit yourself at the moment, and bumping it would only shift the limit rather than solve the problem anyway.

To solve this reliably you have to avoid deeply nested structures. There are a few options here depending on your app (see the sketch after this list):

  1. Lists are represented as lists in protobuf, not as recursive structures, so use them instead of your own recursive structures where it makes sense. E.g., in your example an And structure that holds a list of children, instead of composing it out of binary operators, might work.
  2. Balancing trees can reduce the recursion depth, e.g., instead of And a (And b (And c d)) use And (And a b) (And c d).
  3. If all else fails, you can manually add an indirection: instead of referencing your children directly, keep a Map NodeId Node and have each node reference its children by NodeId instead of by Node.
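
As a rough Daml sketch of options 1 and 3, assuming an SDK recent enough to provide DA.Map (all names here are made up, not the actual contingent-claims API):

```daml
module NestingWorkarounds where

import DA.Map (Map)
import qualified DA.Map as Map

-- Option 1: an n-ary And holding a list of children. Lists are encoded
-- flatly in protobuf, so the nesting depth no longer grows with the
-- number of children.
data ClaimN
  = LeafN Text
  | AndN [ClaimN]

-- Option 3: an explicit indirection. Nodes refer to their children by id,
-- so the encoded value is a flat map plus a root id, however deep the
-- logical tree is. (Option 2, balancing, keeps the binary shape but pairs
-- children up so the depth grows logarithmically instead of linearly.)
data Node
  = NodeLeaf Text
  | NodeAnd [Int] -- ids of the child nodes

data Tree = Tree with
    nodes : Map Int Node
    root : Int

example : Tree
example = Tree with
  nodes = Map.fromList
    [ (0, NodeAnd [1, 2])
    , (1, NodeLeaf "a")
    , (2, NodeLeaf "b")
    ]
  root = 0
```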

As mentioned above we will definitely look into improving the error here.

Thanks for flagging this, @Brian_Weir. And thanks for looking into it, @cocreature. The error should definitely be caught and reported not as an internal error but as something more sensible.

Thanks @cocreature for your quick reply and feedback on how to avoid this issue, very much appreciated!!

Hi @cocreature, I’ve discussed this with the team and they’ve raised a valid point that we (and our clients) should be able to configure the recursion limit. For example, there are other initiatives outside of Contingent Claims, such as ISDA, which contain recursive data structures that could also run into this limit.

The protobuf documentation states that the main purpose of this recursion limit is (see here):

To prevent corrupt or malicious messages from causing stack overflows, we must keep track of the depth of recursion when parsing embedded messages and groups.

If a client has a valid use case for having a recursive structure, I don’t believe we should force them to change their implementation due to the default setup of the underlying serialisation library, especially since the purpose of the limit is to prevent corrupt or malicious messages.

Do you agree that we should open a GitHub issue to make the recursion limit configurable by the client?

On a side note, when I create a data structure with more levels of recursion, I get a different error message:

Caused by: io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Command interpretation error in LF-DAMLe: Provided value exceeds maximum nesting level of 100

I don’t think adding a method to configure this solves the issue for a few reasons:

  1. You’ve now made your code non-portable. In general, users don’t have the option to configure the limit. An obvious example of that is Daml Hub, but it’s far from the only one. In a distributed setting you probably don’t control the settings on other participants, and even in a centralized setting the person who controls things like CLI flags can be different from the developers who produce and upload their Daml code.
  2. Deeply nested values often have relatively poor performance properties.
  3. While you can bump the protobuf limit, there are other places where we do not support arbitrarily deep nesting, and you will run into stack overflows or similar.

Am I right in guessing that this means the limit is a constant somewhere in protobuf? And that changing it would therefore require re-compiling the protobuf dependency?

And furthermore, if that is the case, would changing it break compatibility with other protobuf-compliant programs (e.g. if somebody decided to write their own gRPC client instead of using the official one)?

No, it doesn’t require patching the protobuf library, at least not for Java. You can configure it, but not without changing the code of your ledger (e.g., sandbox) and recompiling it. For grpc-java, you configure it via CodedInputStream::setRecursionLimit. It’s similar on the client side.

I don’t think protobuf specifies a general limit; the Haskell implementation, for example, has no recursion limit at all afaik. That said, Java has the limit, and most other bindings (e.g. Python) go via the C/C++ library, which also has a limit (also configurable, I believe), so you do have to make sure the limits stay in sync, but you can change it.

With the development of Daml-LF, we realized the various problems with setting very large recursion limits, such as (2) and (3) mentioned by @cocreature, and decided to

  1. set a “reasonable” limit,
  2. set it as the limit consistently, and
  3. declare values that exceed the limit as not Daml.

In light of this choice, I would think of having options to set the recursion limit as akin to -fdefer-type-errors in GHC. Sure, it disables an error, but that does not mean the thing you have successfully run is valid code/payload, and you shouldn’t be surprised at strange errors occurring at strange times.
