Recommended settings for DD4O load balancing with NGINX

Let’s say I’m using the open-source NGINX proxy for participant-node health monitoring and load balancing between the two physical p-nodes (1a and 1b) of a DD4O replicated participant node running in an HA setup. Assume 1a is active and 1b is standby.

I need to set values for 2 parameters:

  1. timeout - how long the NGINX proxy waits for 1a to respond to a Daml command submission or a health check before retrying it
  2. numberOfRetryAttempts - how many times the NGINX proxy retries the Daml command submission or health check before marking node 1a as “unhealthy” and switching over to the standby node 1b

What are the recommended/default values that I should use for timeout and numberOfRetryAttempts?
I understand the values may need to be finalised as part of NFR testing, since they depend on network latency etc. and are ultimately driven by SLAs, but are there any recommendations/default values to start with?
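For context, here’s a rough sketch of the kind of open-source NGINX stream config I have in mind (hostnames, ports and values are placeholders, not a working setup; open-source NGINX only has passive health checks, so max_fails/fail_timeout stand in for numberOfRetryAttempts/timeout):

```nginx
# Hypothetical TCP (gRPC) load balancing across the two p-nodes.
stream {
    upstream participant {
        server pnode-1a:5011 max_fails=3 fail_timeout=15s;  # active
        server pnode-1b:5011 backup;                        # standby
    }
    server {
        listen 5011;
        proxy_pass participant;
        proxy_connect_timeout 5s;   # per-attempt connect timeout
        proxy_timeout 30s;          # idle timeout on the proxied connection
    }
}
```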


The participant runs in active-passive mode, and the switchover from one active node to the other is done via a database lock acquisition, not by the proxy. So you want to avoid switching over your proxy until you think the switchover has actually happened.
The passive node tries to acquire the lock with a configurable interval. You can set that interval depending on the level of service you need to provide.
So my recommendation would be:

  1. Do not let your proxy retry command submissions, and don’t use command submissions to steer failover. Handle all submission failures application-side, and set a generous timeout such as 30 seconds.
  2. For the health check, retry a few times with smaller timeouts over a period that exceeds the lock-acquisition interval. E.g. if you set a 15 s lock-acquisition period, try 6 times with a 5-second timeout each (30 s total). If you get no response or a negative response, fail over.
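To make the arithmetic in point 2 concrete, here is a minimal sketch of that decision logic in Python (the health-check callback is hypothetical; a real one would hit the participant’s gRPC health endpoint):

```python
def should_fail_over(check_health, attempts=6, timeout=5.0):
    """Decide whether to mark the active node unhealthy.

    check_health(timeout) should return True (healthy), False
    (negative response), or raise if there is no response within
    `timeout` seconds. With attempts=6 and timeout=5.0 the total
    window is 30 s, deliberately longer than a 15 s lock-acquisition
    period, so the standby has had time to take the lock before the
    proxy fails over.
    """
    for _ in range(attempts):
        try:
            if check_health(timeout):
                return False  # healthy: keep routing to the active node
        except Exception:
            pass  # no response within timeout: counts as a failed attempt
    return True  # all attempts failed or negative: fail over to standby
```

The point of sizing attempts × timeout above the lock-acquisition interval is that failing over the proxy any earlier would just route traffic to a standby that has not yet become active.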

Good question.

I wanted to ask the Canton team, in reference to the load-balancing requirements described in External Load Balancing: is there a definitive, tested list, or is it NGINX et al., as per normal HTTP web-server load-balancing requirements?

Hi @bernhard May I know where to set this “configurable interval” for lock acquisition period? And what is the parameter?

Also from the canton doc it is mentioned: “There is a grace period for the active replica to rebuild the connection and reclaim the lock to avoid unnecessary fail-overs on short connection interruptions. The passive replicas continuously try to acquire the lock with a configurable interval. Once the lock is acquired, the participant replication manager sets the state of the replica to active and completes the fail-over.”

Is this grace period also configurable? If not, what is its value? If yes, where is it configured?

Thanks.

kc

Hi @bernhard May I know where to set this “configurable interval” for lock acquisition period? And what is the parameter?

https://www.canton.io/docs/dev/scaladoc/com/digitalasset/canton/config/ReplicationConfig.html#passiveCheckPeriod:com.digitalasset.canton.time.NonNegativeFiniteDuration

As far as I understand, the grace period is the check interval.

On a general note, if you are looking for a config parameter, the config scaladocs linked above are the place to go looking.

Hi Bernhard,

Thanks a lot. I will definitely take some time to learn to read this scaladoc. Meanwhile, can you point me to where it is configured? Is it set per participant node in the *.conf file, or defined globally for all participant nodes? I have searched for it in the console but had no luck.

Thanks again for this info.

kc

Thanks for pointing us to the relevant part of the code @bernhard! How is this configuration exposed externally so that we can set the lock-acquisition period?

As part of the Canton Enterprise distribution, there is an HA example in examples/e04-high-availability.
If you look at the participant configs, you’ll see

      # allows many participant node instances to share the same database
      replication.enabled = true

Adding there

replication.passive-check-period = 30

should set the check period to 30 seconds.


Thanks @bernhard, but I can’t seem to find a passiveCheckPeriod at https://www.canton.io/docs/dev/scaladoc/com/digitalasset/canton/config/ReplicationConfig.html#passiveCheckPeriod:com.digitalasset.canton.time.NonNegativeFiniteDuration

I do see a checkPeriod at https://www.canton.io/docs/dev/scaladoc/com/digitalasset/canton/resource/DbLockedConnectionPoolConfig.html#checkPeriod:com.digitalasset.canton.time.PositiveFiniteDuration

  1. Should this be set in the participant config as replication.passiveCheckPeriod = 30 or replication.checkPeriod = 30 ?
  2. Could you kindly confirm this has a default value of 5 seconds if it is not set in the participant config?

  1. I believe it should be replication.connection-pool.check-period = 30
  2. Yes, the default is 5 seconds.
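
For anyone landing here later, a sketch of how this could look in a participant’s *.conf, based on the answers above (the participant name is illustrative, and the duration syntax may vary by Canton version, so check the ReplicationConfig scaladoc for your release):

```hocon
canton {
  participants {
    participant1 {
      # allows many participant node instances to share the same database
      replication.enabled = true
      # interval at which passive replicas poll the DB lock (default: 5 seconds)
      replication.connection-pool.check-period = 30s
      # storage, ledger-api and admin-api settings omitted
    }
  }
}
```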