Getting NOT_CONNECTED_TO_ANY_DOMAIN in Kubernetes after the infrastructure has been running for a couple of hours

Hey, I am deploying a Canton domain + participant node in one k8s namespace and a single participant node for each participant in its own separate k8s namespace. When I spin up all the infrastructure, the participants are able to connect to the domain, but after several hours I get the following error:

NOT_CONNECTED_TO_ANY_DOMAIN(9,480a8f14): This participant is not connected to any domain.

To remedy this, I have to restart the domain + participants in each of the namespaces to get things working again. Could I potentially be missing a keep-alive call somewhere in the participant.conf file?


Hi @Ahmed_Aly

You should get at least a warning message associated with the disconnection. One possibility I can think of is a firewall rule/load balancer/proxy that blocks HTTP/2 keep-alive pings. While you are actively using the connections, the timeout cannot fire because you are communicating; but when the connections sit idle for a substantial amount of time, the violated keep-alive timeout forces one side to sever the connection. There is no keep-alive call in the configuration, but you can adjust the gRPC keep-alive settings for your nodes. I do not recommend changing the default values: letting the proper traffic through is the simpler, more elegant solution, and coming up with values that are robustly compatible on both sides is not a trivial calculation.
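For reference, this is roughly what those settings look like in a Canton config file. This is only a sketch: the exact keys and sensible values depend on your Canton version (check the keep-alive section of the docs before touching them), participant1 is a placeholder for your node name, and the numbers are purely illustrative, not a recommendation.

canton.participants.participant1 {
  // client-side pings on the participant's connection to the sequencer
  sequencer-client.keep-alive-client {
    time = 60s     // send a ping after this much idle time
    timeout = 30s  // drop the connection if the ping is not acknowledged in time
  }
  // server-side policy on the participant's own APIs
  admin-api.keep-alive-server {
    time = 40s
    timeout = 20s
    permit-keep-alive-time = 20s  // most frequent ping rate the server tolerates from clients
  }
}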

Another way this can happen is if you have intermittent but somewhat long-lasting connection issues between your DB and one of the domain nodes (most likely the sequencer). In that case, if the reconnection is not prompt enough, the participant drops the connection and gives up reconnecting after a number of retries.
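As a side note, when a participant has given up on the domain connection like that, you normally do not need to restart the pods; reconnecting from the Canton console is usually enough. A minimal sketch (participant1 stands for your participant reference in the console):

// check which domains the participant currently considers itself connected to
participant1.domains.list_connected()

// ask the participant to re-establish all registered domain connections
// without restarting the node
participant1.domains.reconnect_all()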

Could you please look into the log files of your sequencer for warnings/errors shortly before the participant node reports such an error? Could you please also make sure that HTTP/2 keep-alive pings are allowed between the gRPC endpoints of your Canton nodes and their clients (be those other Canton nodes or Ledger API clients)?

Kind Regards,
Mate


@Mate_Varga One of my colleagues, Pablo, was able to track down some logs regarding the sequencer:

INFO c.d.p.i.MeteringAggregator - Aggregating transaction metering for LedgerMeteringEnd(Offset(Bytes(000000000000000013)),2022-12-08T16:00:00Z), context: {participant: "*********"}
INFO c.d.p.i.MeteringAggregator - Aggregating transaction metering completed up to LedgerMeteringEnd(Offset(Bytes(000000000000000013)),2022-12-08T16:00:00Z), context: {participant: "*********"}
INFO c.d.p.s.b.c.InitHookDataSourceProxy - Init hook execution finished successfully, connection ready, context: {participant: "*********"}
INFO c.d.p.s.b.c.InitHookDataSourceProxy - Init hook execution finished successfully, connection ready, context: {participant: "*********"}
INFO c.d.c.s.c.t.GrpcSequencerSubscription:participant=/domain= - The sequencer subscription has been terminated by the server.
INFO c.d.c.s.c.t.GrpcSubscriptionErrorRetryPolicy:participant=/domain= - Trying to reconnect to give the sequencer the opportunity to become available again (after Connection terminated by the server.)
INFO c.d.p.s.b.c.InitHookDataSourceProxy - Init hook execution finished successfully, connection ready, context: {participant: "*********"}
INFO c.d.p.s.b.c.InitHookDataSourceProxy - Init hook execution finished successfully, connection ready, context: {participant: "*********"}
INFO c.d.p.s.b.c.InitHookDataSourceProxy - Init hook execution finished successfully, connection ready, context: {participant: "*********"}
INFO c.d.p.s.b.c.InitHookDataSourceProxy - Init hook execution finished successfully, connection ready, context: {participant: "*********"}
INFO c.d.p.s.b.c.InitHookDataSourceProxy - Init hook execution finished successfully, connection ready, context: {participant: "*********"}

We do not have a load balancer between the participant and the domain; we do have a Kubernetes Service in between.

This is a log stream right after the connection is dropped by the server:

INFO c.d.c.s.c.t.GrpcSequencerSubscription:participant=/domain= - The sequencer subscription has been terminated by the server.
DEBUG c.d.c.s.c.t.GrpcSequencerSubscription:participant=/domain= - Completed subscription with Success(GrpcSubscriptionError(Request failed for sequencer. Is the server running? Did you configure the server address as 0.0.0.0? Are you using the right TLS settings?
GrpcServiceUnavailable: UNAVAILABLE/Connection terminated by the server.
Request: subscription))
INFO c.d.c.s.c.t.GrpcSubscriptionErrorRetryPolicy:participant=/domain=* - Trying to reconnect to give the sequencer the opportunity to become available again (after Connection terminated by the server.)
DEBUG c.d.c.s.c.t.GrpcSequencerSubscription:participant=/domain= - Attempting to close 'SyncCloseable(name=grpc-context)'
DEBUG c.d.c.s.c.t.GrpcSequencerSubscription:participant=/domain= - Successfully closed 'SyncCloseable(name=grpc-context)'.
DEBUG c.d.c.s.c.ResilientSequencerSubscription:participant=/domain= - The sequencer subscription encountered an error and will be restarted: GrpcSubscriptionError(Request failed for sequencer. Is the server running? Did you configure the server address as 0.0.0.0? Are you using the right TLS settings?
GrpcServiceUnavailable: UNAVAILABLE/Connection terminated by the server.
Request: subscription)
DEBUG c.d.c.s.c.t.GrpcSequencerSubscription:participant=/domain= - Attempting to close 'AsyncCloseable(name=grpc-sequencer-subscription)'
DEBUG c.d.c.s.c.ResilientSequencerSubscription:participant=/domain= - Waiting 10 milliseconds before reconnecting
DEBUG c.d.c.s.c.t.GrpcSequencerSubscription:participant=/domain= - Already completed. Discarding Closed
DEBUG c.d.c.s.c.t.GrpcSequencerSubscription:participant=/domain= - Successfully closed 'AsyncCloseable(name=grpc-sequencer-subscription)'.
DEBUG c.d.c.s.c.ResilientSequencerSubscription:participant=/domain= - Starting new sequencer subscription from 56
DEBUG c.d.c.s.c.ResilientSequencerSubscription:participant=/domain= - The sequencer subscription has been successfully started
DEBUG i.g.n.NettyClientHandler - [id: 0x4111bfb2, L:/10.0.41.97:43218 - R:-..svc.cluster.local/172.20.128.97:4000] OUTBOUND HEADERS: streamId=431 headers=GrpcHttp2OutboundHeaders[:authority: -platform-..svc.cluster.local:4000, :path: /com.digitalasset.canton.domain.api.v0.SequencerService/Subscribe, :method: POST, :scheme: http, content-type: application/grpc, te: trailers, user-agent: grpc-java-netty/1.44.0, grpc-accept-encoding: gzip, memberid: PAR::::12209f85732e6823839b866ecd2dcfb25ea48e29fd37dec3eb27fb56c6e3866f1c0b, authtoken-bin: 1XzxYbCscH4scxqK2RtnOOlMVP8, domainid: ::12201455caa15a48da3a0297fab3417e2a0980b983b96cea1d42ca875ca504eb48af] streamDependency=0 weight=16 exclusive=false padding=0 endStream=false
DEBUG i.g.n.NettyClientHandler - [id: 0x4111bfb2, L:/10.0.41.97:43218 - R:-platform-.krypton-.svc.cluster.local/172.20.128.97:4000] OUTBOUND DATA: streamId=431 padding=0 endStream=true length=103 bytes=00000000620a5e5041523a3a73746167696e672d6b727970746f6e2d776f7a3a3a31323230396638353733326536383233383339623836366563643264636662
DEBUG i.g.n.NettyClientHandler - [id: 0x4111bfb2, L:/10.0.41.97:43218 - R:-..svc.cluster.local/172.20.128.97:4000] INBOUND PING: ack=false bytes=1234
DEBUG i.g.n.NettyClientHandler - [id: 0x4111bfb2, L:/10.0.41.97:43218 - R:-..cluster.local/172.20.128.97:4000] OUTBOUND PING: ack=true bytes=1234
DEBUG i.g.n.NettyClientHandler - [id: 0x4111bfb2, L:/10.0.41.97:43218 - R:-..svc.cluster.local/172.20.128.97:4000] INBOUND HEADERS: streamId=431 headers=GrpcHttp2ResponseHeaders[:status: 200, content-type: application/grpc, grpc-encoding: identity, grpc-accept-encoding: gzip] padding=0 endStream=false
DEBUG i.g.n.NettyClientHandler - [id: 0x4111bfb2, L:/10.0.41.97:43218 - R:-platform-.****.svc.cluster.local/172.20.128.97:4000] INBOUND DATA: streamId=431 padding=0 endStream=false length=396 bytes=00000001870ac7020ab7010ab4010ab1010838120c08acdfc49c0610c8c1cd8e021a5874687265652d686f6d65732d646f6d61696e3a3a313232303134353563


Hi @Ahmed_Aly,

Unfortunately, the log lines you have shared do not give a definitive clue about what the problem may be. Could you please share your Canton configuration files and a more substantial amount of logs (if possible, at --debug level) from around the time the disconnection was logged? If you are not comfortable with uploading the files in the post, please feel free to DM me.

Kind Regards,
Mate

Here is the participant.conf file:

_shared {
  // TLS Configuration is documented at https://docs.daml.com/canton/usermanual/static_conf.html#tls-configuration
  tls {
    // the certificate to be used by the server
    cert-chain-file = "./tls/participant.crt"
    // private key of the server
    private-key-file = "./tls/participant.pem"
    // trust collection, which means that all client certificates will be verified using the trusted
    // certificates in this store. if omitted, the JVM default trust store is used.
    trust-collection-file = "./tls/root-ca.crt"
    // define whether clients need to authenticate as well (default not)
    client-auth = {
      // none, optional and require are supported
      type = require
      // If clients are required to authenticate as well, we need to provide a client
      // certificate and the key, as Canton has internal processes that need to connect to these
      // APIs. If the server certificate is trusted by the trust-collection, then you can
      // just use the server certificates. Otherwise, you need to create separate ones.
      admin-client {
        cert-chain-file = "./tls/admin-client.crt"
        private-key-file = "./tls/admin-client.pem"
      }
    }
  }
}

canton {
  monitoring.health {
    server {
      address = 0.0.0.0
      port = ${HEALTH_PORT}
    }
    check {
      type = is-active
    }
  }

  participants {
    @@PARTICIPANT_ID {
      parameters.unique-contract-keys = yes

      storage {
        type = postgres
        config {
          dataSourceClass = "org.postgresql.ds.PGSimpleDataSource"
          properties = {
            serverName = "________"
            portNumber = 5432
            databaseName = ________
            user = ${POSTGRES_USER}
            password = ${POSTGRES_PASSWORD}
          }
        }
        max-connections = 1
      }

      ledger-api {
        port = ${LEDGER_API_PORT}
        address = 0.0.0.0
        auth-services = [{
          type = jwt-rs-256-jwks
          url = "_______"
        }]
      }

      admin-api {
        port = ${ADMIN_API_PORT}
        address = 0.0.0.0
      }
    }
  }
}

Could you please share the logs as well? The configuration looks correct, as expected; I asked for it just to know exactly how your nodes are set up.

Even though keep-alive messages are sent between the nodes, as is evident from the logs you shared, it can still be that a load balancer/proxy/firewall does not count those pings as traffic and kills the connection. Is the issue occurring only when you leave your nodes without any load?
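One way to test that is to leave the system idle for a few hours and then, before sending any load, check connectivity from the Canton console. A sketch, assuming two participants reachable in the console as participant1 and participant2 (names are placeholders):

// does the participant still believe it is connected to the domain?
participant1.domains.list_connected()

// round-trip a ping through the domain; this fails if the sequencer
// connection was silently dropped while idling
participant1.health.ping(participant2)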