Debugging (potential) issue with http-json-api

Hi there,

I’m working on a project where we have an architecture something like this:
client -> custom-http-json-api-proxy -> http-json-api -> ledger. Under load, in circumstances we’re having trouble reproducing, calls to /exercise from http-json-api-proxy succeed on the ledger but never return a response to the proxy. The sequence of events is something like:

  1. http-json-api-proxy forwards a POST request
  2. stuff gets written to the ledger
  3. http-json-api waits 2 seconds, logs that it has returned a 200 response
  4. http-json-api-proxy waits ~30 seconds, then reports a 503 error.

This smells like some kind of resource exhaustion (thread pools maybe, b/c akka), but because we can’t yet reproduce it in a non-prod environment (and for the moment the application is turned off in prod), it’s a little difficult to diagnose. Notably, we’re also proxying websocket requests, which are consistently getting borked (probably by our proxy), closed, and retried, so there’s a potential mechanism for thread pool exhaustion, either on our proxy or on http-json-api. We haven’t repro’d that either, and we probably need tcpdump + better visibility into what’s happening on http-json-api to diagnose it.

So, anyways, that’s background. Got a couple of questions.

  1. What facility is there to see the runtime workings of http-json-api?
  2. How much effort would it be to build my own http-json-api maybe with akka-http-metrics baked in?
  3. Any suggestions on how to diagnose this?
  4. Any idle thoughts on how http-json-api might log a success without actually returning it?

I’m not 100% sure of what exactly you would like to see here, but we do have a set of metrics and logging available, which is described in more depth in the documentation.

I cannot quantify the effort, but I wouldn’t say much more than half a day. Perhaps @Stephen or @victor.pr.mueller can give you a more informed answer.

As for the two remaining questions, I would again ask either Stephen or Victor to suggest their ideas. Looking at the description of the problem, it appears to be something happening at the proxy level (I had a quick look at this issue a few days ago and found a StackOverflow answer that seemed to point to the proxy being unable to support request timeouts for streams, but I’m not sure whether you’re using Envoy as a proxy and/or if this applies in your case).

I’m not 100% sure of what exactly you would like to see here

At the moment what I’d like to see in particular is akka thread pool information sans having to do a thread dump and make sense of it.

  1. are there any facilities or planned facilities for being able to increase log level without having to bounce the JVM process?
  2. ditto for metrics
  3. is there any plan for being able to supply a trace-id/correlation-id for tracing requests end-to-end through the system, a la canton?
  4. is there any supplied way to get visibility into the threadpool?

I also suspect the 503 issue is being caused by the proxy, but it’s unclear whether it’s manifesting in our proxy or in http-json-api. Our architecture is really more something like cloudarmor -> istio -> json-api-proxy -> http-json-api -> ledger, where json-api-proxy is a proxy we’ve written with akka. I think envoy is in the mix, but it’s not sitting between the two failing components.
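One way to narrow down which hop is hanging, in a non-prod environment, is to send the same request to each hop directly and compare status and latency. The following is only a rough sketch in Python, not anything shipped with http-json-api: the hop names and URLs are placeholders, `probe` and `compare_hops` are made-up helper names, and since /exercise writes to the ledger it should only be pointed at a test ledger.

```python
import time
import urllib.error
import urllib.request


def probe(url, body, timeout_s=35.0, headers=None, opener=urllib.request.urlopen):
    """POST `body` to `url`; return (outcome, elapsed_seconds).

    `outcome` is the HTTP status code, or the exception class name
    (for example "TimeoutError") if no response arrived at all.
    `opener` is injectable so the function can be exercised without a network.
    """
    request = urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", **(headers or {})},
    )
    started = time.monotonic()
    try:
        with opener(request, timeout=timeout_s) as response:
            response.read()  # drain the body so a stalled body also shows up
            outcome = response.status
    except urllib.error.HTTPError as err:
        outcome = err.code  # e.g. 503 reported by a proxy
    except OSError as err:  # URLError and socket timeouts are OSError subclasses
        outcome = type(err).__name__
    return outcome, time.monotonic() - started


def compare_hops(hops, body, **kwargs):
    """Probe each hop in turn; `hops` maps a label to a URL (placeholders)."""
    return {name: probe(url, body, **kwargs) for name, url in hops.items()}
```

Calling `compare_hops({"via-proxy": "<proxy url>", "direct": "<http-json-api url>"}, payload, headers={"Authorization": "Bearer <token>"})` returns one `(outcome, seconds)` pair per hop. If the direct call comes back quickly while the proxied one times out or reports 503, that points at the proxy side; if both hang, http-json-api itself is the better place to look.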

Not without bouncing the process, no.

All metrics are always recorded.

There is a request_id that is part of the context of each log entry. There is also an instance_id in case you are aggregating logs for multiple HTTP JSON API instances in a single collector.
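As an illustration of what the request_id makes possible, here is a minimal Python sketch that groups log entries per request and flags requests that never logged a response or took longer than a threshold. It assumes the logs are available as JSON lines; the field names `request_id`, `timestamp` and `message` are guesses that should be adjusted to whatever your log encoder actually emits, and `request_timelines` and `slow_or_unanswered` are hypothetical helper names.

```python
import json
from datetime import datetime


def parse_ts(value):
    """Parse an ISO-8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def request_timelines(lines, id_field="request_id", ts_field="timestamp",
                      msg_field="message"):
    """Group log entries by request id; field names are assumptions."""
    timelines = {}
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON (stack traces, banners)
        if not isinstance(entry, dict):
            continue
        rid = entry.get(id_field)
        if rid is None or ts_field not in entry:
            continue
        event = (parse_ts(entry[ts_field]), entry.get(msg_field, ""))
        timelines.setdefault(rid, []).append(event)
    for events in timelines.values():
        events.sort(key=lambda event: event[0])
    return timelines


def slow_or_unanswered(timelines, marker="Responding to client with HTTP",
                       max_seconds=1.0):
    """Report requests with no response log line or a long first-to-last gap."""
    report = {}
    for rid, events in timelines.items():
        elapsed = (events[-1][0] - events[0][0]).total_seconds()
        answered = any(marker in message for _, message in events)
        if not answered or elapsed > max_seconds:
            report[rid] = {"elapsed_s": elapsed, "answered": answered}
    return report
```

Running `slow_or_unanswered(request_timelines(open("json-api.log")))` over a window that covers a failing call gives a short list of request ids to compare against the proxy’s own log timestamps.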

Not currently, no.

Are there plans in the future to expose this?

There are no plans currently to expose the thread pool metrics.
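In the meantime, a thread dump can be made less painful to read by summarizing it per pool. The sketch below is not part of http-json-api; it is a small Python helper (the names `pool_name` and `summarize` are made up) that assumes the usual HotSpot text format produced by `jstack <pid>` or `jcmd <pid> Thread.print`, where each thread starts with a quoted name and is followed by a `java.lang.Thread.State:` line. It groups threads by name with the trailing number stripped, which is only a heuristic for "pool".

```python
import re
from collections import Counter, defaultdict

# First line of a thread entry, e.g.
# "some-dispatcher-5" #23 prio=5 os_prio=0 ... waiting on condition
THREAD_HEADER = re.compile(r'^"(?P<name>[^"]+)"')
# State line that follows the header for Java threads.
THREAD_STATE = re.compile(r'^\s+java\.lang\.Thread\.State: (?P<state>[A-Z_]+)')


def pool_name(thread_name):
    """Strip a trailing numeric suffix so members of a pool group together."""
    return re.sub(r'[-#]?\d+$', '', thread_name)


def summarize(dump_text):
    """Return {pool: Counter({state: count})} for a textual thread dump."""
    pools = defaultdict(Counter)
    current = None
    for line in dump_text.splitlines():
        header = THREAD_HEADER.match(line)
        if header:
            current = pool_name(header.group("name"))
            continue
        state = THREAD_STATE.match(line)
        if state and current is not None:
            pools[current][state.group("state")] += 1
            current = None  # count each thread once
    return pools
```

Feeding it a few dumps taken several seconds apart while the problem is occurring, and printing each pool’s counter, shows whether a dispatcher’s threads are all sitting in BLOCKED or WAITING, which is the kind of pattern thread pool exhaustion would tend to produce.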

It’s pretty easy to get building on the daml repo. Have git installed, follow the instructions, then bazel build //ledger-service/http-json:http-json-binary to make yourself a standalone jar and runscript if you need one.

Some starting points are Endpoints.scala and HttpService.scala in that tree, and bazel-java-deps.bzl for manipulating the available libraries from Maven. I couldn’t say what the difficulty of working in akka-http-metrics would be, though, nor its efficacy.

At that point control should have passed out of Endpoints and back to akka-http, unless the log you’re talking about is specifically the one that says "Responding to client with HTTP...". There would be more interesting things to look at were this one of the streaming responses, but exercise is a strict response, so everything interesting has already happened by that log.