Recommended settings for DD4O load balancing with NGINX

Let’s say I’m using the open-source NGINX proxy for participant-node health monitoring and load balancing between the two physical p-nodes (1a and 1b) of a DD4O replicated participant node running in an HA setup. Assume 1a is active and 1b is standby.

I need to set values for 2 parameters:

  1. timeout - how long the NGINX proxy waits for 1a to respond to a Daml command submission or a health check before retrying it
  2. numberOfRetryAttempts - how many times the NGINX proxy retries the Daml command submission or health check before marking node 1a as “unhealthy” and switching over to the standby node 1b

What are the recommended/default values that I should use for timeout and numberOfRetryAttempts?
I understand the values may need to be finalised as part of NFR testing, since they depend on network latency etc. and are ultimately driven by SLAs, but are there any recommendations/default values to start with?
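For context, here’s a rough sketch of the kind of open-source NGINX stream config I have in mind (hostnames, ports and values are placeholders, not a working setup; open-source NGINX only has passive health checks, so max_fails/fail_timeout stand in for numberOfRetryAttempts/timeout):

```nginx
# Hypothetical TCP (gRPC) load balancing across the two p-nodes.
stream {
    upstream participant {
        server pnode-1a:5011 max_fails=3 fail_timeout=15s;  # active
        server pnode-1b:5011 backup;                        # standby
    }
    server {
        listen 5011;
        proxy_pass participant;
        proxy_connect_timeout 5s;   # per-attempt connect timeout
        proxy_timeout 30s;          # idle timeout on the proxied connection
    }
}
```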


The participant runs in active-passive mode, and the switchover from one active node to the other is done via a database lock acquisition, not by the proxy. So you want to avoid switching over your proxy until you think the switchover has actually happened.
The passive node tries to acquire the lock with a configurable interval. You can set that interval depending on the level of service you need to provide.
So my recommendation would be:

  1. Do not let your proxy retry command submissions, and don’t use command submissions to steer failover. Handle all submission failures application-side, and set a generous timeout such as 30 seconds.
  2. For the health check, retry a few times with smaller timeouts over a period that exceeds the lock-acquisition interval. E.g. if you set a 15 s lock-acquisition period, try 6 times with a 5-second timeout each (30 s total). If you get no response or a negative response, fail over.
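To make the arithmetic in point 2 concrete, here is a minimal sketch of that decision logic in Python (the health-check callback is hypothetical; a real one would hit the participant’s gRPC health endpoint):

```python
def should_fail_over(check_health, attempts=6, timeout=5.0):
    """Decide whether to mark the active node unhealthy.

    check_health(timeout) should return True (healthy), False
    (negative response), or raise if there is no response within
    `timeout` seconds. With attempts=6 and timeout=5.0 the total
    window is 30 s, deliberately longer than a 15 s lock-acquisition
    period, so the standby has had time to take the lock before the
    proxy fails over.
    """
    for _ in range(attempts):
        try:
            if check_health(timeout):
                return False  # healthy: keep routing to the active node
        except Exception:
            pass  # no response within timeout: counts as a failed attempt
    return True  # all attempts failed or negative: fail over to standby
```

The point of sizing attempts × timeout above the lock-acquisition interval is that failing over the proxy any earlier would just route traffic to a standby that has not yet become active.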

Good question.

I wanted to ask the Canton team, in reference to the load-balancing requirements described in External Load Balancing: is there a definitive, tested list, or is it NGINX et al., as per normal HTTP web-server load-balancing requirements?

Hi @bernhard May I know where to set this “configurable interval” for lock acquisition period? And what is the parameter?

Also from the canton doc it is mentioned: “There is a grace period for the active replica to rebuild the connection and reclaim the lock to avoid unnecessary fail-overs on short connection interruptions. The passive replicas continuously try to acquire the lock with a configurable interval. Once the lock is acquired, the participant replication manager sets the state of the replica to active and completes the fail-over.”

Is this grace period also configurable? If not, what is its value? If yes, where is it configured?

Thanks.

kc

Hi @bernhard May I know where to set this “configurable interval” for lock acquisition period? And what is the parameter?

https://www.canton.io/docs/dev/scaladoc/com/digitalasset/canton/config/ReplicationConfig.html#passiveCheckPeriod:com.digitalasset.canton.time.NonNegativeFiniteDuration

As far as I understand, the grace period is the check interval.

On a general note, if you are looking for a config parameter, the config scaladocs linked above are the place to go looking.

Hi Bernhard,

Thanks a lot. I will definitely take some time to learn to read this scaladoc. Meanwhile, can you point me to where it is configured? Is it set per participant node in the *.conf file, or defined globally for all participant nodes? I have searched for it in the console but had no luck.

Thanks again for this info.

kc

Thanks for pointing us to the relevant part of the code @bernhard! How is this configuration exposed externally so that we can set the lock-acquisition period?

As part of the Canton Enterprise distribution, there is an HA example in examples/e04-high-availability.
If you look at the participant configs, you’ll see

      # allows many participant node instances to share the same database
      replication.enabled = true

Adding there

replication.passive-check-period = 30

should set the check period to 30 seconds.


Thanks @bernhard, but I can’t seem to find a passiveCheckPeriod at https://www.canton.io/docs/dev/scaladoc/com/digitalasset/canton/config/ReplicationConfig.html#passiveCheckPeriod:com.digitalasset.canton.time.NonNegativeFiniteDuration

I do see a checkPeriod at https://www.canton.io/docs/dev/scaladoc/com/digitalasset/canton/resource/DbLockedConnectionPoolConfig.html#checkPeriod:com.digitalasset.canton.time.PositiveFiniteDuration

  1. Should this be set in the participant config as replication.passiveCheckPeriod = 30 or replication.checkPeriod = 30 ?
  2. Could you kindly confirm this has a default value of 5 seconds if it is not set in the participant config?

  1. I believe it should be replication.connection-pool.check-period = 30
  2. Yes, the default is 5 seconds.
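
For anyone landing here later, a sketch of how this could look in a participant’s *.conf, based on the answers above (the participant name is illustrative, and the duration syntax may vary by Canton version, so check the ReplicationConfig scaladoc for your release):

```hocon
canton {
  participants {
    participant1 {
      # allows many participant node instances to share the same database
      replication.enabled = true
      # interval at which passive replicas poll the DB lock (default: 5 seconds)
      replication.connection-pool.check-period = 30s
      # storage, ledger-api and admin-api settings omitted
    }
  }
}
```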