Canton: canton-env-execution-context is stuck or slow

Hi team,

I’m running some large-scale tests on my Canton setup.
However, on my participant node, I’m seeing the following:

Execution context canton-env-execution-context is stuck or slow. My scheduled future has not been processed for at least 3 seconds (queue-size=176).
ForkJoinIdlenessExecutorService-canton-env-execution-context: java.util.concurrent.ForkJoinPool@2ffc2eb4[Running, parallelism = 32, size = 36, active = 36, running = 31, steals = 151776, tasks = 3195, submissions = 176]
WARN  c.d.c.c.ExecutionContextMonitor - Execution context canton-env-execution-context is just slow. Future got executed in the meantime.
WARN  c.d.c.r.DbStorageSingle:participant=participantA tid:7d7c1af4eb53c5cf129f95b0f2e8bbe8 - The operation 'com.digitalasset.canton.participant.store.db.DbContractStore.storeElements' has failed with an exception. Retrying after 3059 milliseconds.
java.util.concurrent.RejectedExecutionException: Task slick.basic.BasicBackend$DatabaseDef$$anon$3@2b972e50 rejected from slick.util.AsyncExecutorWithMetrics$$anon$1@210c4810[Running, pool size = 25, active threads = 25, queued tasks = 1000, completed tasks = 181375]
	at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
	at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
	at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
	at slick.util.AsyncExecutorWithMetrics$$anon$1.execute(AsyncExecutorWithMetrics.scala:222)
	at slick.util.AsyncExecutorWithMetrics$$anon$3.execute(AsyncExecutorWithMetrics.scala:273)
	at slick.basic.BasicBackend$DatabaseDef.runSynchronousDatabaseAction(BasicBackend.scala:265)
	at slick.basic.BasicBackend$DatabaseDef.runSynchronousDatabaseAction$(BasicBackend.scala:263)
	at slick.jdbc.JdbcBackend$DatabaseDef.runSynchronousDatabaseAction(JdbcBackend.scala:37)
	at slick.basic.BasicBackend$DatabaseDef.slick$basic$BasicBackend$DatabaseDef$$runInContextInline(BasicBackend.scala:242)
	at slick.basic.BasicBackend$DatabaseDef.runInContextSafe(BasicBackend.scala:148)
	at slick.basic.BasicBackend$DatabaseDef.runInContext(BasicBackend.scala:142)
	at slick.basic.BasicBackend$DatabaseDef.runInContext$(BasicBackend.scala:141)
	at slick.jdbc.JdbcBackend$DatabaseDef.runInContext(JdbcBackend.scala:37)
	at slick.basic.BasicBackend$DatabaseDef.runInternal(BasicBackend.scala:77)
	at slick.basic.BasicBackend$DatabaseDef.runInternal$(BasicBackend.scala:76)
	at slick.jdbc.JdbcBackend$DatabaseDef.runInternal(JdbcBackend.scala:37)
	at slick.basic.BasicBackend$DatabaseDef.run(BasicBackend.scala:74)
	at slick.basic.BasicBackend$DatabaseDef.run$(BasicBackend.scala:74)
	at slick.jdbc.JdbcBackend$DatabaseDef.run(JdbcBackend.scala:37)
	at com.digitalasset.canton.resource.DbStorageSingle.$anonfun$runWrite$1(DbStorageSingle.scala:38)
	at com.digitalasset.canton.util.retry.Policy.run$1(Policy.scala:72)
	at com.digitalasset.canton.util.retry.Policy.$anonfun$retryWithDelay$8(Policy.scala:89)
	at com.digitalasset.canton.lifecycle.FlagCloseable.performUnlessClosing(FlagCloseable.scala:68)
	at com.digitalasset.canton.lifecycle.FlagCloseable.performUnlessClosing$(FlagCloseable.scala:62)
	at com.digitalasset.canton.resource.DbStorageSingle.performUnlessClosing(DbStorageSingle.scala:21)
	at com.digitalasset.canton.util.retry.Policy.$anonfun$retryWithDelay$7(Policy.scala:90)
	at scala.concurrent.impl.Promise$Transformation.run(Promise.scala:434)
	at java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1426)
	at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
	at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
	at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
	at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
	at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
WARN  c.d.c.r.DbStorageSingle:participant=participantA tid:7d7c1af4eb53c5cf129f95b0f2e8bbe8 - Now retrying operation 'com.digitalasset.canton.participant.store.db.DbContractStore.storeElements'.

Could you please advise?

Cheers,
Jean-Paul

Hi @jaypeeda

You’ve hit two different warnings. The first one:

WARN  c.d.c.c.ExecutionContextMonitor - Execution context canton-env-execution-context is just slow. Future got executed in the meantime.

comes from our internal deadlock monitoring tool. It alerts whenever the execution context appears stuck or overloaded. In your case, it was merely overloaded, as implied by the second warning.

The second warning:

WARN  c.d.c.r.DbStorageSingle:participant=participantA tid:7d7c1af4eb53c5cf129f95b0f2e8bbe8 - The operation 'com.digitalasset.canton.participant.store.db.DbContractStore.storeElements' has failed with an exception. Retrying after 3059 milliseconds.
java.util.concurrent.RejectedExecutionException: Task slick.basic.BasicBackend$DatabaseDef$$anon$3@2b972e50 rejected from slick.util.AsyncExecutorWithMetrics$$anon$1@210c4810[Running, pool size = 25, active threads = 25, queued tasks = 1000, completed tasks = 181375]

is emitted when Canton tries to schedule more database calls than the database executor can queue. You can see this from the “queued tasks = 1000” entry: 1000 is the default queue size, so the queue is full. The WARN itself is harmless, as the overflowing work is retried and effectively spills over into the queue of the execution context.

In other words, the system is telling you that you don’t have enough parallel database connections. You can increase them via

canton.participants.XYZ.storage.max-connections = MORE_THAN_CURRENTLY
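
For context, the same setting in an expanded config file sits under the participant’s storage section. A minimal sketch, with the participant name and the value as placeholders to be tuned for your hardware and database:

canton {
  participants {
    participantA {
      storage {
        // Upper bound on the JDBC connection pool used by this node.
        // Raise it only as far as the database itself has headroom.
        max-connections = 32
      }
    }
  }
}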

I hope this helps. By the way, the reason you see the overflow at all is that there are still a few database queries that we haven’t batched yet.

Best,
Ratko

What do the steals, tasks, and submissions counters indicate?

May I refer you to the documentation of the ForkJoinPool?

A ForkJoinPool differs from other kinds of ExecutorService mainly by virtue of employing work-stealing: all threads in the pool attempt to find and execute tasks submitted to the pool and/or created by other active tasks (eventually blocking waiting for work if none exist). This enables efficient processing when most tasks spawn other subtasks (as do most ForkJoinTasks), as well as when many small tasks are submitted to the pool from external clients.

So steals counts how often a task queued by one thread was taken over and executed by another thread in the pool.
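
The other two fields come from the same pool: tasks is an estimate of the work items queued by the worker threads themselves that haven’t run yet, and submissions counts tasks handed to the pool from outside that haven’t begun executing. If it helps to see the mapping concretely, the numbers in that log line are backed by ForkJoinPool’s getters, so you can reproduce them on any pool. A small self-contained Scala sketch (the pool size and the dummy workload are made up for illustration):

import java.util.concurrent.ForkJoinPool

object PoolStats {
  def main(args: Array[String]): Unit = {
    // A throwaway pool with the same parallelism as in the log line above.
    val pool = new ForkJoinPool(32)

    // Submit some dummy work so the counters have something to show.
    (1 to 1000).foreach { _ =>
      pool.execute(new Runnable { def run(): Unit = Thread.sleep(1) })
    }

    // These getters back the fields printed by the pool's toString:
    println(s"parallelism = ${pool.getParallelism}")           // configured target parallelism
    println(s"size        = ${pool.getPoolSize}")              // worker threads currently alive
    println(s"active      = ${pool.getActiveThreadCount}")     // threads executing or stealing tasks
    println(s"running     = ${pool.getRunningThreadCount}")    // threads not blocked waiting on a join
    println(s"steals      = ${pool.getStealCount}")            // tasks run by a thread other than the one that queued them
    println(s"tasks       = ${pool.getQueuedTaskCount}")       // tasks queued by workers, not yet executing
    println(s"submissions = ${pool.getQueuedSubmissionCount}") // external submissions not yet executing

    pool.shutdown()
  }
}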

Thanks Ratko, that’s very helpful!

How would the symptoms above result in the participant node no longer reporting back that it’s healthy on its health endpoint?

An overloaded execution context on its own wouldn’t be reported on the health endpoint. If the database is so overloaded that it cannot even respond to a trivial connection check, the node might, in theory, conclude that it has lost its connection to the database, which would lead to the node going passive. However, that should be prevented by setting appropriate resource limits on the participant, so that the node never gets into an overload situation but starts to backpressure before that point.

Therefore, if you see the execution context monitor complain, reduce the participant’s rate limits so that the node does not get overloaded. If you see the Slick RejectedExecutionException, increase the number of database connections available to the node.
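
To make the first remedy concrete, the participant’s resource limits can be tightened from the Canton console so that it backpressures (rejects new submissions) before it overloads. A sketch of such a session, assuming the resources.set_resource_limits command and the ResourceLimits options as documented for Canton 2.x; the exact parameter names and sensible values depend on your Canton version and hardware, so check the docs for your release:

// Canton console connected to participantA; the numbers are placeholders.
participantA.resources.set_resource_limits(
  ResourceLimits(
    maxDirtyRequests = Some(100), // cap on requests processed concurrently
    maxRate = Some(200)           // cap on accepted command submissions per second
  )
)

// Read back the limits that are currently in effect.
participantA.resources.resource_limits()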
