Canton participant node error "RejectedExecutionException"

Hi,

I’m running a performance test in Canton.

My participant node has started to generate errors; you can find the log below:

WARN  c.d.c.r.DbStorageSingle:participant=PARTICIPANT_NODE_A tid:8c4718b4e90dd1e2a4fddcab13bcce27 - DB_STORAGE_DEGRADATION(13,8c4718b4): A database task was rejected from the database task queue.
The full error message from the task queue is:
java.util.concurrent.RejectedExecutionException: Task slick.basic.BasicBackend$DatabaseDef$$anon$3@362f47e6 rejected from slick.util.AsyncExecutorWithMetrics$$anon$1@6f064403[Running, pool size = 50, active threads = 50, queued tasks = 1000, completed tasks = 128424] err-context:{messageFromSlick=java.util.concurrent.RejectedExecutionException: Task slick.basic.BasicBackend$DatabaseDef$$anon$3@362f47e6 rejected from slick.util.AsyncExecutorWithMetrics$$anon$1@6f064403[Running, pool size = 50, active threads = 50, queued tasks = 1000, completed tasks = 128424], location=RetryUtil.scala:99}
WARN  c.d.c.r.DbStorageSingle:participant=PARTICIPANT_NODE_A tid:62c7b7f842b2d7576df0e0e0a13b7893 - Now retrying operation 'com.digitalasset.canton.participant.store.db.DbContractStore.storeElements'.
WARN  c.d.c.r.DbStorageSingle:participant=PARTICIPANT_NODE_A tid:62c7b7f842b2d7576df0e0e0a13b7893 - DB_STORAGE_DEGRADATION(13,62c7b7f8): A database task was rejected from the database task queue.
The full error message from the task queue is:
java.util.concurrent.RejectedExecutionException: Task slick.basic.BasicBackend$DatabaseDef$$anon$3@5ec4a351 rejected from slick.util.AsyncExecutorWithMetrics$$anon$1@6f064403[Running, pool size = 50, active threads = 50, queued tasks = 1000, completed tasks = 128424] err-context:{messageFromSlick=java.util.concurrent.RejectedExecutionException: Task slick.basic.BasicBackend$DatabaseDef$$anon$3@5ec4a351 rejected from slick.util.AsyncExecutorWithMetrics$$anon$1@6f064403[Running, pool size = 50, active threads = 50, queued tasks = 1000, completed tasks = 128424], location=RetryUtil.scala:99}
WARN  c.d.c.r.DbStorageSingle:participant=PARTICIPANT_NODE_A tid:62c7b7f842b2d7576df0e0e0a13b7893 - The operation 'com.digitalasset.canton.participant.store.db.DbContractStore.storeElements' has failed with an exception. Retrying after 2813 milliseconds.
java.util.concurrent.RejectedExecutionException: Task slick.basic.BasicBackend$DatabaseDef$$anon$3@5ec4a351 rejected from slick.util.AsyncExecutorWithMetrics$$anon$1@6f064403[Running, pool size = 50, active threads = 50, queued tasks = 1000, completed tasks = 128424]
	at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
	at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
	at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
	at slick.util.AsyncExecutorWithMetrics$$anon$1.execute(AsyncExecutorWithMetrics.scala:246)
	at slick.util.AsyncExecutorWithMetrics$$anon$3.execute(AsyncExecutorWithMetrics.scala:299)
	at slick.basic.BasicBackend$DatabaseDef.runSynchronousDatabaseAction(BasicBackend.scala:265)
	at slick.basic.BasicBackend$DatabaseDef.runSynchronousDatabaseAction$(BasicBackend.scala:263)
	at slick.jdbc.JdbcBackend$DatabaseDef.runSynchronousDatabaseAction(JdbcBackend.scala:37)
	at slick.basic.BasicBackend$DatabaseDef.slick$basic$BasicBackend$DatabaseDef$$runInContextInline(BasicBackend.scala:242)
	at slick.basic.BasicBackend$DatabaseDef.runInContextSafe(BasicBackend.scala:148)
	at slick.basic.BasicBackend$DatabaseDef.runInContext(BasicBackend.scala:142)
	at slick.basic.BasicBackend$DatabaseDef.runInContext$(BasicBackend.scala:141)
	at slick.jdbc.JdbcBackend$DatabaseDef.runInContext(JdbcBackend.scala:37)
	at slick.basic.BasicBackend$DatabaseDef.runInternal(BasicBackend.scala:77)
	at slick.basic.BasicBackend$DatabaseDef.runInternal$(BasicBackend.scala:76)
	at slick.jdbc.JdbcBackend$DatabaseDef.runInternal(JdbcBackend.scala:37)
	at slick.basic.BasicBackend$DatabaseDef.run(BasicBackend.scala:74)
	at slick.basic.BasicBackend$DatabaseDef.run$(BasicBackend.scala:74)
	at slick.jdbc.JdbcBackend$DatabaseDef.run(JdbcBackend.scala:37)
	at com.digitalasset.canton.resource.DbStorageSingle.$anonfun$runWrite$1(DbStorageSingle.scala:41)
	at scala.util.Try$.apply(Try.scala:210)
	at com.digitalasset.canton.util.retry.RetryWithDelay.$anonfun$retryWithDelay$1(Policy.scala:155)
	at com.digitalasset.canton.util.LoggerUtil$.logOnThrow(LoggerUtil.scala:80)
	at com.digitalasset.canton.util.retry.RetryWithDelay.run$1(Policy.scala:150)
	at com.digitalasset.canton.util.retry.RetryWithDelay.$anonfun$retryWithDelay$9(Policy.scala:233)
	at com.digitalasset.canton.util.LoggerUtil$.logOnThrow(LoggerUtil.scala:80)
	at com.digitalasset.canton.util.retry.RetryWithDelay.$anonfun$retryWithDelay$8(Policy.scala:220)
	at scala.util.Try$.apply(Try.scala:210)
	at com.digitalasset.canton.lifecycle.FlagCloseable.internalPerformUnlessClosingF(FlagCloseable.scala:123)
	at com.digitalasset.canton.lifecycle.FlagCloseable.internalPerformUnlessClosingF$(FlagCloseable.scala:116)
	at com.digitalasset.canton.resource.DbStorageSingle.internalPerformUnlessClosingF(DbStorageSingle.scala:18)
	at com.digitalasset.canton.lifecycle.FlagCloseable.performUnlessClosingF(FlagCloseable.scala:114)
	at com.digitalasset.canton.lifecycle.FlagCloseable.performUnlessClosingF$(FlagCloseable.scala:111)
	at com.digitalasset.canton.resource.DbStorageSingle.performUnlessClosingF(DbStorageSingle.scala:18)
	at com.digitalasset.canton.util.retry.RetryWithDelay.$anonfun$retryWithDelay$7(Policy.scala:241)
	at scala.concurrent.impl.Promise$Transformation.run(Promise.scala:484)
	at java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1426)
	at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
	at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
	at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
	at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
	at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
[...]
java.util.concurrent.RejectedExecutionException: Task slick.basic.BasicBackend$DatabaseDef$$anon$3@27dc6ca6 rejected from slick.util.AsyncExecutorWithMetrics$$anon$1@6f064403[Running, pool size = 50, active threads = 50, queued tasks = 1000, completed tasks = 128424] 
err-context:{messageFromSlick=java.util.concurrent.RejectedExecutionException: Task slick.basic.BasicBackend$DatabaseDef$$anon$3@27dc6ca6 rejected from slick.util.AsyncExecutorWithMetrics$$anon$1@6f064403[Running, pool size = 50, active threads = 50, queued tasks = 1000, completed tasks = 128424], location=RetryUtil.scala:99}
[...]
WARN  c.d.c.r.DbStorageSingle:participant=PARTICIPANT_NODE_A tid:990c167a76d0bcc45c141f0cbb568f2a - Now retrying operation 'com.digitalasset.canton.participant.store.db.DbContractStore.storeElements'.
WARN  c.d.c.r.DbStorageSingle:participant=PARTICIPANT_NODE_A tid:ec1500d754bc81dccc14c2feb07793f7 - Now retrying operation 'com.digitalasset.canton.participant.store.db.DbContractStore.storeElements'.

What could be the cause of this error?

Cheers,
Jean-Paul

Hi Jean-Paul,

This looks like the same DB_STORAGE_DEGRADATION / RejectedExecutionException errors as discussed in the thread “Canton participant node warn message: DB_STORAGE_DEGRADATION”.

The extra log line

WARN  c.d.c.r.DbStorageSingle:participant=PARTICIPANT_NODE_A tid:990c167a76d0bcc45c141f0cbb568f2a - Now retrying operation 'com.digitalasset.canton.participant.store.db.DbContractStore.storeElements'.

indicates that the database operation “DbContractStore.storeElements” is being retried because the database task queue is currently full.
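For reference, that queue belongs to the participant’s database executor, whose size is driven by the storage connection pool, so one common mitigation is to give the pool more connections (and the database enough resources to actually serve them). Below is a minimal, hedged sketch of what that might look like, assuming Postgres persistence; the node name, credentials, and the value of max-connections are placeholders, and the exact key can differ between Canton versions (in some releases it is nested under storage.parameters), so please check the documentation for your release.

canton {
  participants {
    PARTICIPANT_NODE_A {
      storage {
        type = postgres
        config {
          dataSourceClass = "org.postgresql.ds.PGSimpleDataSource"
          properties = {
            serverName = "localhost"        # placeholder database host
            portNumber = "5432"
            databaseName = "participant_a"  # placeholder database name
            user = "canton"                 # placeholder credentials
            password = "supersafe"
          }
        }
        # Placeholder value: a larger pool lets more DB tasks run in parallel,
        # but only helps if the database itself can keep up with the load.
        max-connections = 64
      }
    }
  }
}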

Let me know if the fixes from the other thread don’t help you.

The fixes from the other thread don’t help at this stage. I guess I’ve hit a bottleneck and will have to try mediator sharding.

Thanks!
Jean-Paul

The error message

WARN  c.d.c.r.DbStorageSingle:participant=PARTICIPANT_NODE_A tid:990c167a76d0bcc45c141f0cbb568f2a - Now retrying operation 'com.digitalasset.canton.participant.store.db.DbContractStore.storeElements'.

means that the participant’s database task queue is filling up. I am not sure that sharding the mediator would help with that queue, though it may improve performance overall (especially as other changes move the bottleneck).

In general, I would recommend these changes to improve performance:

  1. Sharding the workflow across multiple participants (a config sketch follows at the end of this reply): Overview and Assumptions — Canton 1.0.0-SNAPSHOT documentation
  2. Increasing the resources allocated to the database
  3. Batching smaller requests into a single transaction: Overview and Assumptions — Canton 1.0.0-SNAPSHOT documentation

However, it does depend on your deployment. For example, if you’re running everything on one machine then (2) may not be feasible and the effectiveness of (1) will depend on how many free resources are available for another participant.
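To make option (1) a bit more concrete, here is a minimal, hypothetical sketch of a config with two participants, each backed by its own database, so that parties and their workload can be split across them. All node names, hosts, and credentials are placeholders, and the exact keys may vary between Canton versions, so treat this as an illustration rather than a drop-in config.

canton {
  participants {
    participant1 {
      storage {
        type = postgres
        config {
          dataSourceClass = "org.postgresql.ds.PGSimpleDataSource"
          properties = {
            serverName = "db-host-1"      # placeholder: ideally its own DB server
            portNumber = "5432"
            databaseName = "participant1"
            user = "canton"
            password = "supersafe"
          }
        }
      }
    }
    participant2 {
      storage {
        type = postgres
        config {
          dataSourceClass = "org.postgresql.ds.PGSimpleDataSource"
          properties = {
            serverName = "db-host-2"      # placeholder: second database with its own resources
            portNumber = "5432"
            databaseName = "participant2"
            user = "canton"
            password = "supersafe"
          }
        }
      }
    }
  }
}

You would then allocate different parties (and with them different parts of the workflow) to participant1 and participant2, so that each participant’s contract store only carries a share of the total write load.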