Canton node error handling / disconnection

Hi team,

Question on error handling in Canton.

My understanding of the transaction lifecycle in Canton is as follows:

  1. Alice encrypts the payload of the transaction and passes it to the domain.
  2. The domain passes the payload to the nodes that are involved in the transaction.
  3. Each participant node validates the transaction using the Daml execution engine running in that participant node.
  4. The domain’s mediator aggregates the confirmations and broadcasts the transaction confirmation.
  5. All the participant nodes record the transaction on their own ledger.

Now my question: what happens if one of the nodes in the network doesn’t receive the transaction confirmation in step 4 because of a crash or a disconnection?

Does the domain know which node didn’t receive the transaction confirmation?
What should the node that went down do in that case? Reconnect to the domain to get the latest transaction confirmation?

Thanks and regards,

JP

Minor amendment: Alice's participant node creates the transaction and then splits it into individual sub-views, one for each participant of the other stakeholders. These are sent to the domain (sequencer and mediator), which forwards each sub-view to the relevant participant. The mediator knows which participants are required to respond to the transaction, so it knows which answers to expect. Each participant validates its own sub-view (this is how we enforce sub-transaction privacy) and responds to the mediator (via the sequencer). Once all have responded, the overall result is sent back to the set of participants: the transaction is committed, or it fails if one or more participants reject.
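
To make the aggregation step concrete, here is a minimal Python sketch of how a mediator could track the responses it expects. It is a toy model only: `ToyMediator`, `record`, `verdict` and the node names are all made up for illustration and are not Canton APIs, and the real mediator does considerably more than this.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMediator:
    """Toy stand-in for a mediator (hypothetical, not Canton's implementation)."""

    expected: set  # participants that are required to respond
    responses: dict = field(default_factory=dict)  # participant -> approve?

    def record(self, participant: str, approve: bool) -> None:
        # Only participants that were asked to confirm may respond.
        if participant not in self.expected:
            raise ValueError(f"unexpected response from {participant}")
        self.responses[participant] = approve

    def verdict(self) -> str:
        # A single rejection fails the whole transaction.
        if any(not approved for approved in self.responses.values()):
            return "rejected"
        # Commit only once every expected participant has approved.
        if set(self.responses) == self.expected:
            return "committed"
        return "pending"


mediator = ToyMediator(expected={"alice_node", "bob_node"})
mediator.record("alice_node", True)
print(mediator.verdict())  # pending
mediator.record("bob_node", True)
print(mediator.verdict())  # committed
```

Because the set of expected responders is known up front, the sketch can tell a transaction that is still waiting on someone from one that is complete.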

I believe that retry logic is in place in the event of a failure to send to one participant, and that the transaction will eventually time out and be rejected if the participant never responds. If the node reconnects within the timeout period, it should get a replay of the request; if it reconnects after the timeout, the whole transaction will need to be resubmitted by the original application.
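
As a rough illustration of the two cases, the sketch below treats the timeout as a simple deadline. The function name, the return strings and the 30-second figure are invented for the example, and counting a reconnect exactly at the deadline as "in time" is my assumption; the actual timeout behaviour depends on how the domain is configured.

```python
def toy_outcome_after_reconnect(request_time: float,
                                timeout: float,
                                reconnect_time: float) -> str:
    """Hypothetical helper: what happens to a request the node missed.

    All times are seconds on the same clock. This is a toy model, not Canton code.
    """
    deadline = request_time + timeout
    if reconnect_time <= deadline:
        # Back in time: the node gets the request replayed and can still respond.
        return "replayed"
    # Too late: the request has timed out and been rejected,
    # so the submitting application has to send the transaction again.
    return "resubmit"


# Request sent at t=0 with an assumed 30 s timeout.
print(toy_outcome_after_reconnect(0.0, 30.0, reconnect_time=10.0))  # replayed
print(toy_outcome_after_reconnect(0.0, 30.0, reconnect_time=45.0))  # resubmit
```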

Others may have more detail on the above.

Thanks for the clarification, Edward.

May I confirm that this means that, in step 5, the sequencer:

  • Records the latest result of the transaction on the underlying ledger
  • Passes this result to the involved participants

Regardless of whether the participants receive the transaction or get disconnected, the transaction has been recorded on the ledger, making it the immutable truth for all stakeholders' nodes.

Hence, if a participant node gets disconnected during step 5, it will have to reconnect to the sequencer in order to retrieve the latest transaction record?

Yes, that’s correct. The sequencer provides “total order guaranteed multi-cast”. A participant that crashes will just resume from the last known offset and catch up from there.

Generally, the participant node will reconnect to the domain on restart and start to load all the data that it has missed. Therefore, nothing needs to be done manually.
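
The catch-up behaviour can be pictured with a small Python sketch of an ordered log and a subscriber that remembers its position. `ToySequencer`, `ToyParticipant` and their methods are hypothetical names for illustration, not Canton APIs; in particular, a real participant stores its position durably so that it survives a crash, whereas here it is just kept in memory.

```python
class ToySequencer:
    """Toy totally ordered log (hypothetical, not Canton's sequencer)."""

    def __init__(self):
        self._log = []

    def send(self, event: str) -> int:
        # Every event gets the next position in a single global order.
        self._log.append(event)
        return len(self._log) - 1

    def read_from(self, offset: int):
        # Return (offset, event) pairs starting at the given offset.
        return list(enumerate(self._log))[offset:]


class ToyParticipant:
    """Toy participant that resumes from the last offset it processed."""

    def __init__(self):
        self.next_offset = 0  # a real node would persist this
        self.ledger = []

    def catch_up(self, sequencer: ToySequencer) -> None:
        for offset, event in sequencer.read_from(self.next_offset):
            self.ledger.append(event)
            self.next_offset = offset + 1


sequencer = ToySequencer()
node = ToyParticipant()

sequencer.send("tx1")
node.catch_up(sequencer)   # node has seen tx1

# The node is offline while two more results are sequenced.
sequencer.send("tx2")
sequencer.send("tx3")

node.catch_up(sequencer)   # on reconnect, only tx2 and tx3 are replayed
print(node.ledger)         # ['tx1', 'tx2', 'tx3']
```

Since every participant reads the same ordered log, a node that was offline ends up with the same sequence of events as one that stayed connected.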
