Ensuring a 1:1 relationship between a running Daml for PostgreSQL server and a running PostgreSQL server

I am reading Core availability considerations for the PostgreSQL driver:

Having two Daml Driver for PostgreSQL servers running on top of a single PostgreSQL server can lead to undefined (and likely broken) behavior. For this reason, you must maintain a strict 1:1 relationship between a running Daml for PostgreSQL server and a running PostgreSQL server.

How can the operator guarantee that this 1:1 relationship is kept?

For example, take this pathological case: the job running Daml for PostgreSQL becomes unreachable, and the operator cannot verify it has been stopped (it might, for instance, be temporarily frozen), so the operator cannot know whether it is safe to start another Daml for PostgreSQL job to replace it.

Is there perhaps some “locking” mechanism that Daml for PostgreSQL checks for before making writes to the database?

If you think the current process has died (“unreachable” and “dead” are indistinguishable in a distributed setting, which this necessarily is), kill the container/machine before starting another one. Your ops infrastructure should be able to take care of that.

As the docs snippet above says, it is never safe to start a second Daml for PostgreSQL job to “replace” a currently-running one. Kill the existing one first.

I agree 100% with what you’re saying about “unreachable” and “dead” being indistinguishable.

kill the container/machine before starting another one. Your ops infrastructure should be able to take care of that.

As the docs snippet above says, it is never safe to start a second Daml for PostgreSQL job to “replace” a currently-running one. Kill the existing one first.

My mind goes to these pathological cases that I do not know how the operator would solve, specifically because as you note, we are in a distributed system:

  1. The Daml for PostgreSQL container is unresponsive. The operator attempts to kill the container, but the physical machine it is running on loses network connectivity (it is itself unreachable), so the operator cannot verify the container has stopped. The operator may need to wait indefinitely: if they start another job, the old one might come back online later, breaking the 1:1 invariant. Effectively, the operator must choose between indefinite downtime and risking data corruption.

  2. (a) The operator successfully kills the unresponsive Daml for PostgreSQL container. (b) Before being killed, the container sends out a packet to finalize a transaction commit, but that packet doesn’t reach the Postgres DB yet. (c) The operator verifies the container is killed. (d) The operator starts a new container. (e) The packet finally hits the Postgres database. Again, this breaks the 1:1 invariant.

Really appreciate any insight here as I am excited to use a PostgreSQL deployment in the right way!


Hi @ryan,

At some point you’ll have to decide what kind of risk is acceptable to you. The situations you describe are bad, but they are also quite unlikely in practice. Addressing rare occurrences is not always worth the cost.

I do have some suggestions for mitigations though. For your first scenario, you can kill the machine itself, which is often doable without network connectivity to the machine (using cloud APIs if it’s a cloud machine, or just pushing the power button if it’s a physical machine you have access to). If for some reason you cannot kill the machine itself, you may still be able to remove it from the network: either by changing your security groups/firewall rules/etc. in a cloud environment, or by unplugging a network cable in the physical case.
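To make that concrete, here is a rough sketch in Python using boto3, assuming an AWS EC2 deployment; the instance ID, region and quarantine security group are hypothetical placeholders, and other clouds have equivalent APIs:

```python
# Sketch only: assumes AWS EC2 and boto3; all IDs below are hypothetical placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

INSTANCE_ID = "i-0123456789abcdef0"     # machine running the frozen Daml for PostgreSQL job
QUARANTINE_SG = "sg-0123456789abcdef0"  # a security group with no inbound or outbound rules

def fence_off_instance():
    """Force-stop the machine; if that fails, cut it off from the network instead."""
    try:
        # Force=True tells the hypervisor to stop the instance even if the OS is frozen,
        # so this works without network connectivity to the machine itself.
        ec2.stop_instances(InstanceIds=[INSTANCE_ID], Force=True)
    except Exception as err:
        print(f"stop_instances failed ({err}); falling back to network isolation")
        # Swap the instance's security groups for the empty "quarantine" group so that
        # any in-flight writes from the old driver can no longer reach PostgreSQL.
        ec2.modify_instance_attribute(InstanceId=INSTANCE_ID, Groups=[QUARANTINE_SG])

fence_off_instance()
```

Only once one of those steps has succeeded is it safe to start the replacement Daml for PostgreSQL job.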

For the second scenario, you could change the database credentials: when you spin up a new Daml for PostgreSQL instance, create new DB credentials for the new instance and delete the credentials for the existing instance. As long as the new and old credentials give access to the exact same dbs/tables/etc., that change should be transparent to the ledger functionality, and it will prevent the old instance from ever reconnecting should it wake up again.
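A minimal sketch of that rotation, assuming Python with psycopg2 and hypothetical role names `old_driver`/`new_driver` (adapt the SQL to however your roles and grants are actually set up):

```python
# Sketch only: role names are hypothetical; run this with an admin/superuser connection.
import os
import psycopg2

admin = psycopg2.connect(os.environ["PG_ADMIN_DSN"])  # admin DSN taken from the environment
admin.autocommit = True

with admin.cursor() as cur:
    # 1. Credentials for the replacement Daml for PostgreSQL instance, with the same
    #    privileges as the old role (membership in old_driver inherits its grants).
    cur.execute("CREATE ROLE new_driver LOGIN PASSWORD 'replace-with-a-strong-secret'")
    cur.execute("GRANT old_driver TO new_driver")

    # 2. Lock the old instance out for good: even if it wakes up, it can no longer log in.
    #    (NOLOGIN is used here instead of DROP ROLE, since dropping a role that owns the
    #    ledger tables would first require reassigning ownership.)
    cur.execute("ALTER ROLE old_driver NOLOGIN")

    # 3. Terminate any sessions the old instance still holds open.
    cur.execute(
        "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE usename = 'old_driver'"
    )

admin.close()
```

The new instance then connects as `new_driver`, which sees exactly the same databases and tables as before.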

At some point you’ll have to decide how far you’re willing to go, depending on your use-case. Addressing these edge-cases definitely adds complexity; for some use-cases it will be good enough to keep daily backups, not try to mitigate any of these scenarios at all, and simply restore the most recent “good” backup if something goes wrong (a rough sketch of that is below). But if you do need very high availability with no data loss, I hope the above suggestions help a bit.
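For the backup route, a daily job around pg_dump is typically all it takes; a rough sketch (database name and paths are hypothetical, connection details come from the usual PGHOST/PGUSER/PGPASSWORD environment variables):

```python
# Sketch only: daily backup and restore of the driver's PostgreSQL database.
# Database name and paths are hypothetical; pg_dump/pg_restore read connection
# settings from the standard PGHOST/PGUSER/PGPASSWORD environment variables.
import datetime
import subprocess

BACKUP_DIR = "/var/backups/daml"
DB = "daml_ledger"

def nightly_backup() -> str:
    stamp = datetime.date.today().isoformat()
    out = f"{BACKUP_DIR}/{DB}-{stamp}.dump"
    # -Fc produces PostgreSQL's custom archive format, which pg_restore understands.
    subprocess.run(["pg_dump", "-Fc", "-d", DB, "-f", out], check=True)
    return out

def restore(dump_file: str) -> None:
    # Only run this once the old Daml for PostgreSQL instance is definitely stopped.
    # --clean drops existing objects before recreating them from the dump.
    subprocess.run(["pg_restore", "--clean", "-d", DB, dump_file], check=True)
```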