Let’s say I’m using an open source NGINX proxy for participant node health monitoring and load balancing between the 2 physical p-nodes (1a and 1b) of a DD4O replicated participant node running in an HA setup. Assume 1a is active and 1b is standby.
I need to set values for 2 parameters:
timeout - this is how long the NGINX proxy will wait for 1a to respond to a Daml command submission or to a health check before retrying the Daml command submission or the health check
numberOfRetryAttempts - this is how many times the NGINX proxy will retry the Daml command submission or health check before marking node 1a as “unhealthy” and switching over to the standby node 1b
What are the recommended/default values that I should use for timeout and numberOfRetryAttempts?
I understand the values may need to be finalised as part of NFR testing, since they depend on network latency etc. and are determined by SLAs, but I’d like to know whether there are any recommended/default values to start with.
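To make that concrete, this is roughly the kind of open source NGINX configuration I have in mind. A sketch only: hostnames and ports are placeholders, and it assumes the Ledger API is proxied as gRPC; note that active health checks (the health_check directive) are an NGINX Plus feature, so open source NGINX only offers passive checks via max_fails/fail_timeout.

upstream participant_ledger_api {
    # "numberOfRetryAttempts" loosely corresponds to max_fails here: after that
    # many failed attempts within fail_timeout, 1a is considered unavailable
    server p-node-1a:10011 max_fails=3 fail_timeout=15s;
    server p-node-1b:10011 backup;   # only used while 1a is considered down
}

server {
    listen 10011 http2;
    location / {
        grpc_pass grpc://participant_ledger_api;
        # "timeout" corresponds to the per-attempt connect/read timeouts
        grpc_connect_timeout 5s;
        grpc_read_timeout    30s;
        # upper bound on how many upstream servers a failed request is tried against
        grpc_next_upstream_tries 2;
    }
}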
The participant runs in active-passive mode, and the switchover from the active node to the standby is done via database lock acquisition, not by the proxy. So you want to avoid switching over your proxy until you think the switchover has actually happened.
The passive node tries to acquire the lock with a configurable interval. You can set that interval depending on the level of service you need to provide.
So my recommendation would be:
Do not let your proxy retry command submissions, and don’t use command submissions to steer failover. Deal with all submission failures on the application side. Set a generous timeout, like 30 seconds.
For the health check, retry a few times with smaller timeouts, for a period that exceeds the lock acquisition interval. E.g. if you set a 15s lock acquisition period, try 6 times with a 5-second timeout. If you get no response or a negative response, fail over.
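In terms of the hypothetical open source NGINX sketch from the question, that recommendation might translate roughly as below. Again a sketch only: hostnames and ports are placeholders, and the active 6 x 5s probing would need NGINX Plus’s health_check or an external checker, since open source NGINX only detects failures passively.

upstream participant_ledger_api {
    # passive approximation of "6 tries of 5s each": six failed attempts within
    # a 30s window, which comfortably exceeds a 15s lock acquisition period
    server p-node-1a:10011 max_fails=6 fail_timeout=30s;
    server p-node-1b:10011 backup;
}

server {
    listen 10011 http2;
    location / {
        grpc_pass grpc://participant_ledger_api;
        grpc_connect_timeout 5s;    # small per-attempt timeout for failure detection
        grpc_read_timeout   30s;    # generous timeout for command submissions
        grpc_next_upstream  off;    # do not retry submissions at the proxy
    }
}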
I wanted to ask the Canton team about the load balancing requirements described under External Load Balancing. Is there a definitive and tested list, or is it NGINX et al. as per normal HTTP web server load balancing requirements?
Hi @bernhard, may I know where to set this “configurable interval” for the lock acquisition period? And what is the parameter?
Also from the canton doc it is mentioned: “There is a grace period for the active replica to rebuild the connection and reclaim the lock to avoid unnecessary fail-overs on short connection interruptions. The passive replicas continuously try to acquire the lock with a configurable interval. Once the lock is acquired, the participant replication manager sets the state of the replica to active and completes the fail-over.”
Is this grace period also configurable? If not, what’s the value? If yes, where is it configured?
Thanks a lot. I will definitely take some time to read through this Scala doc. Meanwhile, can you point me to where it is configured? Is it set inside a participant node’s section in the *.conf file, or is it something defined globally that applies to all participant nodes? I have tried to search for it inside the console but had no luck.
Thanks for pointing us to the relevant part of the code, @bernhard! How is this configuration exposed externally so that we can set values for the lock acquisition period, please?
As part of the Canton Enterprise distribution, there is an HA example in examples/e04-high-availability.
If you look at the participant configs, you’ll see:
# allows many participant node instances to share the same database
replication.enabled = true
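For orientation, that flag sits under the participant’s own section of the node’s .conf file rather than in a global block. A minimal sketch, with the participant name and storage details as placeholders (both physical nodes must point at the same shared database):

canton {
  participants {
    participant1 {
      storage {
        type = postgres
        # ... connection settings for the shared database used by 1a and 1b ...
      }
      # allows many participant node instances to share the same database
      replication.enabled = true
    }
  }
}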