Health Monitoring and Recovery of DAML Components

Hi, I have 3 questions -
1) As I understand it, communication between the ledger integration adapter and the ledger, as well as between all components of the ledger, is ledger-specific. But how do you secure the components in the DAML layer and the communication between them - i.e. the ledger API server, the DAML execution engine, the ledger integration adapter, etc.? - Moved to separate thread
2) How can I as a developer “monitor the health” of these components?
3) In case of failure of one of these components, do they auto-recover/self-heal? Or is handling that a responsibility of the ledger provider?


Thanks for asking, @meet. :grinning:

I can answer questions 2 and 3, but not question 1, which really feels like it belongs in its own thread. Would you mind splitting question 1 out?

Meanwhile, I’ll see what I can do about answering your questions on component health.

Sure @SamirTalwar, that makes sense. I’ll create a separate question for #1. Thanks for looking into #2 & #3!

Preamble

Let’s talk about these components:

  • the participant, made up of:
    • the ledger API server
    • the indexer
    • the index database (typically PostgreSQL, but can also be in-memory)
  • and the ledger, made up of:
    • the DAML execution engine
    • the DAML driver for the ledger
    • the ledger/blockchain itself (e.g. PostgreSQL, VMware Blockchain, Ethereum, etc.)

The connections look something like this:

  user <----> ledger API server ------------------> DAML ------> execution engine
                     ^                             Driver <---\        |
                     |                               |         \---- ledger
               index database <------ indexer <------/

Typically, the participant is run as two processes:

  1. the ledger API server and the indexer
  2. the PostgreSQL database containing the index

(If the index is in-memory, it will all be in the same process.)

However, it’s entirely possible to run it as three or more processes:

  1. the indexer
  2. the index database
  3. one or more ledger API server instances

We shall refer to each numbered point above as a “service”.

Participant health checks

In the Kubernetes world, there are three distinct types of health probe: liveness, readiness, and startup probes. This terminology is useful for differentiating the kinds of checks. Broadly speaking, a liveness probe tells you whether a service or component is up, and a readiness probe tells you whether it’s healthy. (A startup probe disables the other two until the component has finished starting up. Let’s ignore this one for now.)

To check for liveness, we suggest testing the service’s exposed ports. For the ledger API server and indexer combo, we can use the gRPC port. If it accepts a TCP connection, the service is live. PostgreSQL can similarly be tested on its communication port (typically port 5432).
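
As a concrete example, here’s a minimal liveness check as a Python sketch. It assumes the ledger API server listens on localhost:6865 (the default port) and PostgreSQL on localhost:5432; adjust the hosts and ports to your deployment.

  import socket
  import sys

  def is_live(host: str, port: int, timeout: float = 2.0) -> bool:
      """Return True if the port accepts a TCP connection within the timeout."""
      try:
          with socket.create_connection((host, port), timeout=timeout):
              return True
      except OSError:
          return False

  if __name__ == "__main__":
      # Hosts and ports are assumptions; point these at your own services.
      checks = {"ledger-api": ("localhost", 6865), "index-db": ("localhost", 5432)}
      failed = [name for name, (host, port) in checks.items() if not is_live(host, port)]
      if failed:
          print("not live: " + ", ".join(failed))
          sys.exit(1)
      print("live")

Exiting non-zero on failure makes this directly usable as a probe command for a scheduler.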

Testing readiness is more involved. To facilitate this, we expose an endpoint in the ledger API server over the standard gRPC Health Checking Protocol that allows you to ask for the health of the service, and potentially components within the service. The simplest way to verify this is to use the grpc-health-probe command-line utility, but you can also make a direct call to the health check endpoint. The service will return SERVING if all components are working as expected, or NOT_SERVING otherwise.
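
For example, with the grpc-health-probe utility installed, something like grpc-health-probe -addr=localhost:6865 should work against a locally running participant. A direct call looks roughly like the following Python sketch, assuming the grpcio and grpcio-health-checking packages are installed and the server is reachable on the default localhost:6865.

  import grpc
  from grpc_health.v1 import health_pb2, health_pb2_grpc

  def is_ready(address: str = "localhost:6865", service: str = "") -> bool:
      # An empty service name asks about the health of the server as a whole.
      with grpc.insecure_channel(address) as channel:
          stub = health_pb2_grpc.HealthStub(channel)
          try:
              response = stub.Check(
                  health_pb2.HealthCheckRequest(service=service), timeout=2)
          except grpc.RpcError:
              return False
          return response.status == health_pb2.HealthCheckResponse.SERVING

  if __name__ == "__main__":
      print("ready" if is_ready() else "not ready")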

As part of the health checks, the ledger API server will routinely poll the index database to ensure the connection is working. If the ledger API server or indexer becomes disconnected from the database, its health status will change to NOT_SERVING until the connection is restored. To investigate the issue, you will need to examine the server logs as well as the networking architecture of your deployed participant environment, as this varies between deployments.

If the indexer cannot read from the ledger, it will currently attempt to reconnect and recover. Likewise, if the ledger API server cannot read from the index or write to the ledger, the given request will probably fail, but if the failure is transient, the next request may succeed.

The ledger API server will not currently report as NOT_SERVING if the connection to the ledger-side execution engine fails, though it will log errors.

Currently, the health checks are split into three components (called “services” in the gRPC health checking protocol). You can ask for the health of the entire system or for the health of one of the following services (see the sketch after this list):

  • read, which checks the health of the indexer
  • write, which checks the health of the ledger API server’s write endpoints
  • index, which checks the health of the ledger API server’s read endpoints
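
As a sketch, you can ask about each of these named services individually over the same health check endpoint; the address below is an assumption, and the empty service name covers the whole system.

  import grpc
  from grpc_health.v1 import health_pb2, health_pb2_grpc

  ADDRESS = "localhost:6865"  # assumed default ledger API port

  with grpc.insecure_channel(ADDRESS) as channel:
      stub = health_pb2_grpc.HealthStub(channel)
      for service in ("", "read", "write", "index"):  # "" = the entire system
          response = stub.Check(
              health_pb2.HealthCheckRequest(service=service), timeout=2)
          status = health_pb2.HealthCheckResponse.ServingStatus.Name(response.status)
          print((service or "<all>") + ": " + status)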

Ledger health checks

The implementation of the ledger and its associated DAML driver is typically subject to constraints imposed by the underlying ledger technology. Because those constraints vary wildly, it’s very difficult to come up with a standard way of measuring or reporting the health of the ledger and its components.

We recommend implementing either the gRPC health checking protocol or a similar system for the ledger’s various services (including the DAML driver).
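
As a sketch of what that could look like, the grpcio-health-checking package ships a reference HealthServicer that you can mount on any gRPC server. The port and the “daml-driver” service name below are purely illustrative; how you decide when to flip the status to NOT_SERVING is up to the driver implementation.

  from concurrent import futures

  import grpc
  from grpc_health.v1 import health, health_pb2, health_pb2_grpc

  server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
  health_servicer = health.HealthServicer()
  health_pb2_grpc.add_HealthServicer_to_server(health_servicer, server)

  # Report overall health, plus per-component health under a name of your choosing.
  health_servicer.set("", health_pb2.HealthCheckResponse.SERVING)
  health_servicer.set("daml-driver", health_pb2.HealthCheckResponse.SERVING)
  # ...and flip to NOT_SERVING when the component detects a problem:
  # health_servicer.set("daml-driver", health_pb2.HealthCheckResponse.NOT_SERVING)

  server.add_insecure_port("0.0.0.0:8080")  # assumed port; secure it appropriately
  server.start()
  server.wait_for_termination()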

Detecting and handling health check failures

Unfortunately, it’s typically very difficult for an application to determine its own health. For this reason, I recommend a layered approach to monitoring:

  1. verify that the reported health status is OK (SERVING, in the case of gRPC); see the sketch after this list
  2. aggregate logs across services and flag if there are too many errors in a short amount of time
  3. instrument network connections and flag if there are too many failed requests, including but not limited to:
    • gRPC responses with errors
    • abruptly terminated TCP connections
    • zero-sized requests
    • timeouts or very long response times
  4. monitor machine metrics and alert when a machine gets too hot (e.g. CPU pegged at 100% for a long time)
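
As a minimal sketch of layer 1, you could poll the reported health status and flag it after several consecutive failures. The address, interval, and threshold below are assumptions, and in practice you’d feed the result into your alerting system rather than print it.

  import time

  import grpc
  from grpc_health.v1 import health_pb2, health_pb2_grpc

  ADDRESS = "localhost:6865"   # assumed participant address
  INTERVAL_SECONDS = 10        # assumed polling interval
  FAILURE_THRESHOLD = 3        # assumed consecutive failures before alerting

  def check_once() -> bool:
      try:
          with grpc.insecure_channel(ADDRESS) as channel:
              stub = health_pb2_grpc.HealthStub(channel)
              response = stub.Check(
                  health_pb2.HealthCheckRequest(service=""), timeout=2)
              return response.status == health_pb2.HealthCheckResponse.SERVING
      except grpc.RpcError:
          return False

  consecutive_failures = 0
  while True:
      if check_once():
          consecutive_failures = 0
      else:
          consecutive_failures += 1
          if consecutive_failures >= FAILURE_THRESHOLD:
              print("ALERT: participant unhealthy for "
                    + str(consecutive_failures) + " consecutive checks")
      time.sleep(INTERVAL_SECONDS)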

There’s always more, but this is probably a decent start.

Some failures will be handled automatically through various retry mechanisms in the ledger API server. However, handling more persistent failures is the responsibility of the ledger operator. Often, this can be automated, with a scheduler such as Kubernetes watching for health check failures and automatically restarting the process if there are too many. That said, restarting may not solve connectivity problems, for example, so you’ll need to decide how to handle larger, cross-service failures depending on your infrastructure.


Thanks @SamirTalwar for the detailed explanation!
