Best ways to achieve performant queries with ledger

Hello there,
I am wondering how to properly do fast queries in the scenario below.
Say we have a template whose number of contracts grows by a huge amount over time. However, at each point in time, we only care about the contracts created after the previous query.

For example, at the beginning of the day (t means a specific point in time and q means a query comes in):
t1: 1M contracts
q1: query1 comes in and only cares about the 1M contracts; it queries and processes them
t2: +1M contracts => 2M contracts
q2: query2 comes in and only cares about the new 1M contracts that haven’t been processed
t3: +1.5M contracts => 3.5M contracts
q3: query3 comes in and only cares about the new 1.5M contracts that haven’t been processed

Assuming no caching at all in the Java ledger client application, all the contracts need to be queried each time a query comes in. What is a better way to minimise the time spent on querying?
Should we leverage the JSON API server (and use some fields to filter, e.g. an unprocessed status), or would going directly via gRPC and filtering on the client side actually be faster? Or is there any other way?

Thanks a lot in advance!

First, the JSON API is an ordinary ledger client. There is nothing it can do that you cannot do yourself in a gRPC client application.

It really depends on what you mean by “new” contracts that have not been processed. As in, the ones that were created after a given offset? That is the most natural approach to implement “only process new contracts”, and there is almost certainly no reason to use the JSON API in this case, but that is incompatible with the “no caching at all” assumption, because you have to store an offset to implement this.
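As a minimal sketch of the offset approach, here is a hypothetical in-memory stand-in for the ledger. `FakeLedger`, `createdAfter`, and `OffsetReader` are illustrative names, not part of the Daml Java bindings. The point is only that the client must keep the last offset it saw between queries.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for a ledger: an append-only list of
// contract ids, where the offset of an entry is its position in the list.
class FakeLedger {
    private final List<String> created = new ArrayList<>();

    void create(String contractId) {
        created.add(contractId);
    }

    // Offset of the current ledger end.
    int end() {
        return created.size();
    }

    // Contracts created after the given offset, in creation order.
    List<String> createdAfter(int offset) {
        return new ArrayList<>(created.subList(offset, created.size()));
    }
}

// Client that only fetches contracts it has not seen yet. The stored
// offset is exactly the client-local state this approach requires.
class OffsetReader {
    private int lastOffset = 0;

    List<String> fetchNew(FakeLedger ledger) {
        List<String> fresh = ledger.createdAfter(lastOffset);
        lastOffset = ledger.end(); // remember where we stopped
        return fresh;
    }
}
```

If `lastOffset` is lost (for example on restart with no persistence), the reader has no way to tell old contracts from new ones, which is the incompatibility described above.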

Filtering on a status field like that strikes me as crying out for rearchitecture. If you have a template T with a status : Bool field, and a contract with False is semantically unlike a contract with True, then in reality you probably want two templates Tfalse and Ttrue with sets of choices relevant to their “status”, and this advice broadly applies to any “status” field.

Then, your application’s rule is “process all contracts of template Tunprocessed”, and this has some interesting properties that make it interact cleanly with the ledger, regardless of choosing gRPC or JSON API:

  1. It’s a “fix the ledger” application so if a run fails after only processing some contracts, you can just run it again. There’s no possibility of a “missed” contract, because all existing contracts are candidates for processing no matter when they were created.
  2. The “ACS + transaction stream after offset” is an efficient way to retrieve exactly the set of contracts to consider while skipping already-handled contracts. There is no need for further filtering.
  3. There is no need to remember anything on the client.
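The re-run property in point 1 can be sketched with a hypothetical in-memory model of the active contract set. `TwoTemplateLedger`, `markProcessed`, and `Processor` are made-up names for illustration, and the crash is simulated; a real application would exercise a choice through a ledger client instead.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical in-memory model of the active contract set (ACS),
// split by template as suggested above.
class TwoTemplateLedger {
    final Set<String> unprocessed = new LinkedHashSet<>(); // active Tunprocessed
    final Set<String> processed = new LinkedHashSet<>();   // active Tprocessed

    // Models a choice that archives one Tunprocessed contract and creates
    // the matching Tprocessed contract as one atomic step.
    void markProcessed(String id) {
        if (!unprocessed.remove(id)) {
            throw new IllegalStateException("not active: " + id);
        }
        processed.add(id);
    }
}

class Processor {
    // Processes every active Tunprocessed contract. failAfter simulates a
    // crash after that many contracts; pass a negative value for no crash.
    static int run(TwoTemplateLedger ledger, int failAfter) {
        int handled = 0;
        // Snapshot the ACS so the ledger can change while we iterate.
        for (String id : new ArrayList<>(ledger.unprocessed)) {
            if (handled == failAfter) {
                throw new RuntimeException("simulated crash");
            }
            ledger.markProcessed(id);
            handled++;
        }
        return handled;
    }
}
```

Calling `Processor.run` again after a simulated crash only sees the contracts that are still Tunprocessed, so nothing is handled twice and nothing is missed, without the client remembering anything.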

However, if you did not mean that the application should make any ledger changes while handling “new” contracts, none of this applies, and I do not see how that can possibly be done without client-local state. Especially if, by “query” you mean “some predicate aside from before or after a given offset”.

@Stephen thanks so much for the suggestion of creating Tunprocessed and Tprocessed. I think this will address a huge part of our concern.
The real case is slightly more complicated: in Tunprocessed we have a field, say Type, which can be A, B, C or D, and each time we process, we only want to process one Type, in order.

  1. Process A → all A become Tprocessed (no contract of type A will be published after A is processed, yet contracts of types B, C, D will continue to be created)
  2. Process B → all B become Tprocessed (no contract of type B will be published after B is processed, yet contracts of types C, D will continue to be created)
    … so on and so forth.

In this case, will the JSON API be able to help? E.g. if, during step 1, the client application crashes while processing Tunprocessed, then since the JSON API caches the contracts with Type A, might it be faster than querying via the gRPC API? Or would using an offset, then querying and filtering with the gRPC API, be the better way to go?

Call it anything but that 😅

JSON API’s query store helps the most in the cases where

  1. you have a large number of contracts of a given template, but a very small number of results you expect from your query filter, or
  2. the participant returns ACS/transactions much slower than JSON API would.

You only have a representative example here, but 25% is quite a lot, and in my estimation not small enough to benefit meaningfully from the JSON API query store. For result sets of 1M contracts meant to be returned all at once, of such a sizable “frequency”, I suspect that JSON API will be somewhat slower, actually.

From your description going via gRPC sounds right.

When you crash, you always start with the ACS anyway, because that will have already removed any Tunprocessed that were successfully handled by the crashed run. Then you do not need to store an offset at all.
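To make the restart behaviour concrete, here is a small sketch that filters an ACS snapshot by Type on the client and removes each contract as it is handled. `UnprocessedContract`, `TypedRun`, and the `limit` parameter (which simulates a crash partway through) are assumptions for illustration, and the list stands in for the ledger's ACS.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical shape of a Tunprocessed contract payload.
class UnprocessedContract {
    final String id;
    final String type;

    UnprocessedContract(String id, String type) {
        this.id = id;
        this.type = type;
    }
}

class TypedRun {
    // Client-side filter over an ACS snapshot: keep only the given type.
    static List<UnprocessedContract> ofType(List<UnprocessedContract> acs, String type) {
        List<UnprocessedContract> out = new ArrayList<>();
        for (UnprocessedContract c : acs) {
            if (c.type.equals(type)) {
                out.add(c);
            }
        }
        return out;
    }

    // Handles up to `limit` contracts of one type, removing each from the
    // ACS as it is handled (standing in for archiving it on the ledger).
    // A limit below the number of matches simulates a crash partway through.
    static int processType(List<UnprocessedContract> acs, String type, int limit) {
        int handled = 0;
        for (UnprocessedContract c : ofType(acs, type)) {
            if (handled == limit) {
                break;
            }
            acs.remove(c);
            handled++;
        }
        return handled;
    }
}
```

After an interrupted `processType(acs, "A", 2)`, a second call with a large limit picks up only the type A contracts still present, which mirrors restarting from the ACS with no stored offset.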
