Why might we see anomalous response times during performance testing?

Hi all,

We’re currently going through the performance testing stage of our application, using the HTTP JSON API and daml-on-sql, both at version 1.11.1. For the most part we’ve been able to either explain where the time is being taken or remove costly operations to get performance closer to acceptable ranges, but now we have a couple of questions regarding time taken by Daml and whether there’s anything we can do about it.

We’re currently seeing a large response time at the start of our test (400+ms), which quickly drops to around 40ms. Throughout the test, however, we get spikes where one or two requests take 200+ms before response times quickly return to 40ms. These spikes are not frequent, but the large change in response time makes them very noticeable in our performance statistics.

The actual numbers aren’t so important; we still need to look at increasing CPU/RAM and other influencing factors, so that’s fine for right now. What matters is that we can’t explain why we are getting these spikes, so we assume they’ll still be present after any changes.

We would appreciate any advice on what might be causing spikes, or where we should look to find what’s happening.


Hey @lashenhurst, thanks for the interesting question.

Most Daml components (excluding the compiler) are written in Scala and run inside the JVM. I will try and explain a little bit about how the JVM works at runtime. Forgive me if you know all this already.

The Scala code is compiled to Java bytecode, which is then interpreted by the JVM runtime when you launch the component. This bytecode is not directly compiled to machine code; instead, the runtime runs it using an interpreter for a while. This is slower, but allows the JVM to collect information about how the code behaves. After a method has been called enough times (around 10,000 invocations by default for HotSpot’s server compiler, though the exact behaviour is tiered, time-dependent, and configurable), the JVM compiles the bytecode to native machine code to improve performance, using the data it gathered earlier to decide how best to optimize it. All this probably explains why you see a large response time at first, which decreases once the JVM has optimized the code.
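If you want to see this warm-up happening for yourself, HotSpot can print each method as it compiles it, and the compilation threshold is tunable. A sketch, assuming you can edit the `java` launch command (the jar name here is a placeholder for whichever component you’re running):

```shell
# Print each method as HotSpot's JIT compiles it (very verbose).
java -XX:+PrintCompilation -jar your-component.jar

# Lower the invocation threshold so hot methods compile sooner.
# CompileThreshold only applies directly with tiered compilation disabled;
# the default is 10000.
java -XX:-TieredCompilation -XX:CompileThreshold=5000 -jar your-component.jar
```

Note that lowering the threshold trades warm-up time for optimization quality, since the JIT has less profiling data to work with.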

Now let’s talk about memory. The JVM uses a form of garbage collection: effectively, once memory is no longer referenced, it doesn’t deallocate it straight away. Instead it waits until it gets fairly close to the limit (configured at startup using the -Xmx flag) and then collects all the unused memory at once, typically also compacting the “live” data so that large areas can be allocated more effectively. This involves pausing the program for brief periods of time so that there are no race conditions while the garbage collector moves objects around in memory. I expect these pauses are what you’re seeing during your spikes.

To verify whether it’s the garbage collector that causes the spikes, I suggest enabling garbage collection logging. You can do this by adding the -verbose:gc flag, which works on older versions too; on Java 9 and newer it acts as an alias for the unified logging option -Xlog:gc.
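Concretely, the flags differ slightly between JVM generations. A sketch (jar names are placeholders):

```shell
# Java 8 and earlier: classic GC logging flags.
java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -jar your-component.jar

# Java 9 and newer: unified logging; writes timestamped GC events to a file.
java -Xlog:gc*:file=gc.log:time,uptime -jar your-component.jar
```

If the GC pauses line up with your latency spikes in these logs, you’ve found your culprit.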

Once you’ve verified that the garbage collector is causing the spikes, I suggest reading the Garbage Collector Tuning Guide published by Oracle to dig into how you can reduce these pauses.

I’m not sure how you’re running these components, so if you need help getting these parameters to the right process, just yell. You might have some luck setting the JAVA_OPTS environment variable if there is no way to modify the arguments passed to the java process on the command line directly.
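For example, if the component ships with a launcher script that honours JAVA_OPTS (many do; whether yours does depends on how it was packaged, so treat this as an assumption to verify):

```shell
# Hypothetical example: pass JVM flags through the environment rather than
# editing the launcher script itself.
export JAVA_OPTS="-Xmx4g -verbose:gc"
./your-launcher-script
```

If the launcher ignores JAVA_OPTS, you’ll need to find where it constructs the `java` command line and add the flags there.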

If any of this doesn’t explain the spikes you’re seeing, please let us know so we can try and get to the bottom of things together.


Hey @SamirTalwar, thank you for an excellent read; I was unaware of the JVM runtime optimisation cycle 👍

Has anyone tried other JVMs, such as the commercial ones? Would they make it even faster?

I think trying Azul’s JDK is probably worth a shot; I hear they have a pauseless garbage collector. Whether it will work for Daml and your use case, though, is something we can only really determine through experimentation. I encourage you to try it out.

If you’d like to stick to OpenJDK (which is very good), then I suggest:

  1. Upgrading to OpenJDK 16, which contains major performance improvements over earlier Long-Term Support versions such as OpenJDK 11 and 8. (OpenJDK 17 will be another LTS release, so remember to upgrade again in September, at least.)
  2. Increasing the max heap size with the -Xmx flag.
  3. Monitoring your garbage collection with -verbose:gc.
  4. Trying one of the new garbage collectors such as Shenandoah or ZGC, which both reduce pause times.
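Putting suggestions 2–4 together on the command line might look like this (heap size and jar name are illustrative only; Shenandoah is not included in every OpenJDK build):

```shell
# Bigger heap, GC logging, and ZGC (production-ready since JDK 15):
java -Xmx4g -verbose:gc -XX:+UseZGC -jar your-component.jar

# Or the same with Shenandoah, where your build includes it:
java -Xmx4g -verbose:gc -XX:+UseShenandoahGC -jar your-component.jar
```

Change one variable at a time and compare the GC logs, so you can attribute any improvement to a specific flag.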

GC optimization is a bit of a dark art, and most of the folklore around it is “not always true”. I randomly stumbled onto this talk a couple of months ago, which I think gives a good overview of the current state of JVM GC tuning (though it’s also a bit of a sales pitch).

I’ll throw in two comments from my own experience (though, remember, test for your own case):

  • Setting -Xms to the same value as -Xmx can help a little bit with “ramp up”: by telling the JVM to immediately reserve all the memory it’s going to end up using, you avoid having to pause processing while requesting more memory. That’s the theory, anyway; in practice it’s likely modern kernels will not really allocate memory until it’s actually used, so this may not work as well as it used to. (The “filling the memory” period will also overlap significantly with the “as-yet-unoptimized code” period, so the effect may be negligible.)
  • Increasing the available memory (-Xmx) can actually degrade overall performance, or at least consistency, for some GC configurations: GC pauses will be rarer but longer, as there is more memory to manage.
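On the first point, there is a HotSpot flag that addresses the “kernel doesn’t really allocate until first use” caveat directly. A sketch (heap size and jar name are examples only):

```shell
# Pre-size the heap (-Xms equal to -Xmx) and ask HotSpot to touch every page
# at startup, forcing the kernel to commit the memory up front rather than
# lazily on first use.
java -Xms4g -Xmx4g -XX:+AlwaysPreTouch -jar your-component.jar
```

The trade-off is a slower startup in exchange for more consistent latency once the process is serving traffic, so as ever, measure it for your own workload.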