Canton: canton-env-execution-context is stuck or slow

Hi team,

I’m running some large-scale tests on my Canton setup.
However, on my participant node, I’m seeing the following:

Execution context canton-env-execution-context is stuck or slow. My scheduled future has not been processed for at least 3 seconds (queue-size=176).
ForkJoinIdlenessExecutorService-canton-env-execution-context: java.util.concurrent.ForkJoinPool@2ffc2eb4[Running, parallelism = 32, size = 36, active = 36, running = 31, steals = 151776, tasks = 3195, submissions = 176]
WARN  c.d.c.c.ExecutionContextMonitor - Execution context canton-env-execution-context is just slow. Future got executed in the meantime.
WARN  c.d.c.r.DbStorageSingle:participant=participantA tid:7d7c1af4eb53c5cf129f95b0f2e8bbe8 - The operation 'com.digitalasset.canton.participant.store.db.DbContractStore.storeElements' has failed with an exception. Retrying after 3059 milliseconds.
java.util.concurrent.RejectedExecutionException: Task slick.basic.BasicBackend$DatabaseDef$$anon$3@2b972e50 rejected from slick.util.AsyncExecutorWithMetrics$$anon$1@210c4810[Running, pool size = 25, active threads = 25, queued tasks = 1000, completed tasks = 181375]
	at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
	at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
	at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
	at slick.util.AsyncExecutorWithMetrics$$anon$1.execute(AsyncExecutorWithMetrics.scala:222)
	at slick.util.AsyncExecutorWithMetrics$$anon$3.execute(AsyncExecutorWithMetrics.scala:273)
	at slick.basic.BasicBackend$DatabaseDef.runSynchronousDatabaseAction(BasicBackend.scala:265)
	at slick.basic.BasicBackend$DatabaseDef.runSynchronousDatabaseAction$(BasicBackend.scala:263)
	at slick.jdbc.JdbcBackend$DatabaseDef.runSynchronousDatabaseAction(JdbcBackend.scala:37)
	at slick.basic.BasicBackend$DatabaseDef.slick$basic$BasicBackend$DatabaseDef$$runInContextInline(BasicBackend.scala:242)
	at slick.basic.BasicBackend$DatabaseDef.runInContextSafe(BasicBackend.scala:148)
	at slick.basic.BasicBackend$DatabaseDef.runInContext(BasicBackend.scala:142)
	at slick.basic.BasicBackend$DatabaseDef.runInContext$(BasicBackend.scala:141)
	at slick.jdbc.JdbcBackend$DatabaseDef.runInContext(JdbcBackend.scala:37)
	at slick.basic.BasicBackend$DatabaseDef.runInternal(BasicBackend.scala:77)
	at slick.basic.BasicBackend$DatabaseDef.runInternal$(BasicBackend.scala:76)
	at slick.jdbc.JdbcBackend$DatabaseDef.runInternal(JdbcBackend.scala:37)
	at slick.basic.BasicBackend$DatabaseDef.run(BasicBackend.scala:74)
	at slick.basic.BasicBackend$DatabaseDef.run$(BasicBackend.scala:74)
	at slick.jdbc.JdbcBackend$DatabaseDef.run(JdbcBackend.scala:37)
	at com.digitalasset.canton.resource.DbStorageSingle.$anonfun$runWrite$1(DbStorageSingle.scala:38)
	at com.digitalasset.canton.util.retry.Policy.run$1(Policy.scala:72)
	at com.digitalasset.canton.util.retry.Policy.$anonfun$retryWithDelay$8(Policy.scala:89)
	at com.digitalasset.canton.lifecycle.FlagCloseable.performUnlessClosing(FlagCloseable.scala:68)
	at com.digitalasset.canton.lifecycle.FlagCloseable.performUnlessClosing$(FlagCloseable.scala:62)
	at com.digitalasset.canton.resource.DbStorageSingle.performUnlessClosing(DbStorageSingle.scala:21)
	at com.digitalasset.canton.util.retry.Policy.$anonfun$retryWithDelay$7(Policy.scala:90)
	at scala.concurrent.impl.Promise$Transformation.run(Promise.scala:434)
	at java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1426)
	at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
	at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
	at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
	at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
	at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
WARN  c.d.c.r.DbStorageSingle:participant=participantA tid:7d7c1af4eb53c5cf129f95b0f2e8bbe8 - Now retrying operation 'com.digitalasset.canton.participant.store.db.DbContractStore.storeElements'.

Could you please advise?

Cheers,
Jean-Paul

Hi @jaypeeda

You’ve hit two different warnings. The first one:

WARN  c.d.c.c.ExecutionContextMonitor - Execution context canton-env-execution-context is just slow. Future got executed in the meantime.

comes from our internal deadlock monitoring tool. It alerts whenever the execution context appears stuck or overloaded. In your case, it was merely overloaded, as implied by the second warning.

The second warning:

WARN  c.d.c.r.DbStorageSingle:participant=participantA tid:7d7c1af4eb53c5cf129f95b0f2e8bbe8 - The operation 'com.digitalasset.canton.participant.store.db.DbContractStore.storeElements' has failed with an exception. Retrying after 3059 milliseconds.
java.util.concurrent.RejectedExecutionException: Task slick.basic.BasicBackend$DatabaseDef$$anon$3@2b972e50 rejected from slick.util.AsyncExecutorWithMetrics$$anon$1@210c4810[Running, pool size = 25, active threads = 25, queued tasks = 1000, completed tasks = 181375]

is emitted when Canton tries to schedule more database calls than the database executor can queue. You can see this from the “queued tasks = 1000” entry: 1000 is the default queue size, so the queue is full. The WARN itself is harmless, as the overflowing work is retried and effectively spills over into the queue of the execution context.

In other words, the system is telling you that you don’t have enough parallel database connections. You can increase them via

canton.participants.XYZ.storage.max-connections = MORE_THAN_CURRENTLY
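
For context, the same setting in an expanded config file sits under the participant’s storage section. A minimal sketch, with the participant name and the value as placeholders to be tuned for your hardware and database:

canton {
  participants {
    participantA {
      storage {
        // Upper bound on the JDBC connection pool used by this node.
        // Raise it only as far as the database itself has headroom.
        max-connections = 32
      }
    }
  }
}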

I hope this helps. By the way, the reason you see the overflow at all is that there are still a few database queries that we haven’t batched yet.

Best,
Ratko

What do the steals, tasks, and submissions counters indicate?

May I refer you to the documentation of the ForkJoinPool?

A ForkJoinPool differs from other kinds of ExecutorService mainly by virtue of employing work-stealing: all threads in the pool attempt to find and execute tasks submitted to the pool and/or created by other active tasks (eventually blocking waiting for work if none exist). This enables efficient processing when most tasks spawn other subtasks (as do most ForkJoinTasks), as well as when many small tasks are submitted to the pool from external clients.

So steals counts how often a task queued by one thread was taken over and executed by another thread in the pool.
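
The other two fields come from the same pool: tasks is an estimate of the work items queued by the worker threads themselves that haven’t run yet, and submissions counts tasks handed to the pool from outside that haven’t begun executing. If it helps to see the mapping concretely, the numbers in that log line are backed by ForkJoinPool’s getters, so you can reproduce them on any pool. A small self-contained Scala sketch (the pool size and the dummy workload are made up for illustration):

import java.util.concurrent.ForkJoinPool

object PoolStats {
  def main(args: Array[String]): Unit = {
    // A throwaway pool with the same parallelism as in the log line above.
    val pool = new ForkJoinPool(32)

    // Submit some dummy work so the counters have something to show.
    (1 to 1000).foreach { _ =>
      pool.execute(new Runnable { def run(): Unit = Thread.sleep(1) })
    }

    // These getters back the fields printed by the pool's toString:
    println(s"parallelism = ${pool.getParallelism}")           // configured target parallelism
    println(s"size        = ${pool.getPoolSize}")              // worker threads currently alive
    println(s"active      = ${pool.getActiveThreadCount}")     // threads executing or stealing tasks
    println(s"running     = ${pool.getRunningThreadCount}")    // threads not blocked waiting on a join
    println(s"steals      = ${pool.getStealCount}")            // tasks run by a thread other than the one that queued them
    println(s"tasks       = ${pool.getQueuedTaskCount}")       // tasks queued by workers, not yet executing
    println(s"submissions = ${pool.getQueuedSubmissionCount}") // external submissions not yet executing

    pool.shutdown()
  }
}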

Thanks Ratko, that’s very helpful!

How would the symptoms above result in the participant node no longer reporting back that it’s healthy on its health endpoint?

An overloaded execution context on its own wouldn’t be reported on the health endpoint. If the database is so overloaded that it cannot even respond to a trivial connection check, the node might, in theory, conclude that it has lost its connection to the database, which would lead to the node going passive. However, that should be prevented by setting appropriate resource limits on the participant, so that the node never gets into an overload situation but starts to backpressure before that point.

Therefore, if you see the execution context monitor complain, reduce the participant’s rate limits so that the node does not get overloaded. If you see the Slick RejectedExecutionException, increase the number of database connections available to the node.
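
To make the first remedy concrete, the participant’s resource limits can be tightened from the Canton console so that it backpressures (rejects new submissions) before it overloads. A sketch of such a session, assuming the resources.set_resource_limits command and the ResourceLimits options as documented for Canton 2.x; the exact parameter names and sensible values depend on your Canton version and hardware, so check the docs for your release:

// Canton console connected to participantA; the numbers are placeholders.
participantA.resources.set_resource_limits(
  ResourceLimits(
    maxDirtyRequests = Some(100), // cap on requests processed concurrently
    maxRate = Some(200)           // cap on accepted command submissions per second
  )
)

// Read back the limits that are currently in effect.
participantA.resources.resource_limits()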
