RFC: approximate_quantile API #113

JLockerman · 2021-04-20T19:30:40Z

Deadline May 7th 2021

With our intent to stabilize our tdigest and uddsketch implementations in May 2021 (tracking issue #107), we would like to come to a consensus on what their APIs should look like, as well as determine some principles for similar APIs.

Summary

Both tdigest and uddsketch are approximate quantile algorithms. They have similar APIs but slightly different inputs and outputs, along with very different guarantees, making each suitable for different use cases. We would like our APIs to be suitable both for users who only know that they want an approximate quantile and don't care what kind, and for users who want the specific guarantees provided by a particular algorithm.

Background

We broadly expect to see two kinds of applications and two kinds of users, each of which wants a different API. For applications, there are:

1. Threshold-based alerting, which naturally fits into a single approximate_percentile(data, quantiles) function.
2. Retrospective analysis, which really wants a two-part API:
```
CONTINUOUS AGGREGATE percentile_sketch(data)
```
and approximate_percentile(sketch, quantiles)

And for users:

1. Some just want some kind of estimate, and don’t particularly care what it is.
2. Others want the specific guarantees given by one of the algorithms.

There’s a tension between these.

API

At a high level, these functions offer two key capabilities (syntax is for demonstration only):

1. They provide an aggregation function percentile_approx(data, quantile) which returns the approximate value of a dataset at a quantile.
2. They provide a pair of functions: an aggregate digest(data) that returns a digested form of the data, which can then be passed to a function percentile_approx(digest, quantile) that returns the approximate value of the digested dataset at a quantile. This pair of functions enables retrospective analysis using continuous aggregates, and is expected to be the primary API.
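
To make the two capabilities concrete, here is a minimal Python sketch. It is not the tdigest or uddsketch implementation: it uses a simplified logarithmic-bucket digest (no bucket limit, positive values only) as a stand-in, and the names make_sketch, merge_sketches, and percentile_approx_data are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def make_sketch(data, error=0.01):
    """Aggregate step: digest positive values into logarithmic buckets.

    Assumption: a simplified stand-in for a real digest, with no bucket
    limit and no support for zero or negative values.
    """
    gamma = (1 + error) / (1 - error)
    log_gamma = math.log(gamma)
    # Bucket i holds the values in the interval (gamma**(i-1), gamma**i].
    buckets = Counter(math.ceil(math.log(x) / log_gamma) for x in data)
    return {"gamma": gamma, "buckets": buckets, "count": sum(buckets.values())}


def merge_sketches(a, b):
    """Re-aggregation step: sketches built with the same error combine."""
    if a["gamma"] != b["gamma"]:
        raise ValueError("cannot merge sketches with different error targets")
    return {
        "gamma": a["gamma"],
        "buckets": a["buckets"] + b["buckets"],
        "count": a["count"] + b["count"],
    }


def percentile_approx(sketch, quantile):
    """Accessor step: estimate the value at `quantile` from a sketch."""
    if sketch["count"] == 0:
        raise ValueError("empty sketch")
    rank = quantile * (sketch["count"] - 1)
    gamma = sketch["gamma"]
    seen = 0
    for index in sorted(sketch["buckets"]):
        seen += sketch["buckets"][index]
        if seen > rank:
            # A representative value whose relative distance to every
            # value in the bucket is at most the configured error.
            return 2 * gamma**index / (gamma + 1)
    raise ValueError("quantile must be between 0 and 1")


def percentile_approx_data(data, quantile, error=0.01):
    """One-shot form: digest the data, then read a single quantile."""
    return percentile_approx(make_sketch(data, error), quantile)
```

Because the digest can be merged, the same percentile_approx(sketch, quantile) call works on one stored bucket or on several re-aggregated buckets, which is the property the continuous-aggregate use case relies on.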

There are a number of open questions about making an API out of these capabilities:

1. Algorithm-Specific API: how should we name the algorithm-specific interfaces to these APIs?
2. Algorithm-Agnostic API: how, if at all, should we provide an approximate_quantile() function for users who don't care which particular algorithm they use?
   1. Which algorithm should be the default?
3. Algorithm-Generic API: how, if at all, should we provide an API that's generic over the provided algorithm?
4. Accessor Functions: conventions for accessing non-quantile data from sketches.
5. General Naming Principles: for which kinds of functions are the above styles of APIs appropriate?

Frontrunner

The current frontrunner is a fully-descriptive API for general usage, along with a percentile_approx(...) function that forwards to uddsketch for users who are unsure of what they want. That is, the API would consist of

tdigest_percentile_approx(data, quantile, buckets);
tdigest(data, buckets);
percentile_approx(tdigest_sketch, quantile);

uddsketch_percentile_approx(data, quantile, buckets, error);
uddsketch(data, buckets, error);
percentile_approx(uddsketch, quantile);

percentile_approx(data, quantile, buckets => 100, error => 0.001);
percentile_sketch(data, buckets => 100, error => 0.001);

used like

-- immediate
SELECT percentile_approx(data, 0.50, 100);

-- retrospective
CREATE MATERIALIZED VIEW digested
WITH (timescaledb.continuous) AS
SELECT percentile_sketch(data, 100) AS sketch ...;

SELECT
    percentile_approx(sketch, 0.25),
    percentile_approx(sketch, 0.50),
    percentile_approx(sketch, 0.75)
FROM digested;

JLockerman · 2021-04-20T19:30:49Z

Changelog

2021/05/04: add Background section. update Frontrunner tdigest_percentile_sketch(...) to tdigest(...), uddsketch_percentile_sketch(...) to uddsketch(...), percentile(sketches) to percentile_approx(sketches).
2021/04/27: changed preferred API to percentile_approx(...) to match postgres's percentile_cont(...) and percentile_disc(...)
2021/04/20: added initial contents

JLockerman · 2021-04-20T19:31:00Z

Algorithm-Specific API

The algorithm-specific API allows users to explicitly specify which approximate-quantile algorithm they wish to use. Since this best specifies authorial intent in the code, we believe it will be the most commonly used interface. The usability of this interface can be graded on three features:

1. The ease of determining the algorithm used.
2. The ease of determining the purpose of using the algorithm (i.e. that it's for measuring quantiles).
3. How onerous it is to write code that uses the API.

In general, we expect two main kinds of usage:

Direct calculation of specific approximate quantiles from available data for exploration and threshold-based alerting, such as
```
SELECT tdigest_percentile_approx(data, quantile => 0.50, buckets => 100);
```

Continuous aggregates storing quantile-sketches, that can later be used to estimate any quantile over a given time-range for retrospective analysis, e.g.

CREATE MATERIALIZED VIEW digested
WITH (timescaledb.continuous) AS
SELECT tdigest_percentile_sketch(data, buckets => 100) AS sketch ...

for later use like

SELECT
    percentile_approx(sketch, quantile => 0.25),
    percentile_approx(sketch, quantile => 0.50),
    percentile_approx(sketch, quantile => 0.75)
FROM digested;

Options for the concrete names for these functions include:

Full description

tdigest_percentile_approx(data, quantile, buckets);
tdigest_percentile_sketch(data, buckets);
percentile(tdigest_sketch, quantile);

uddsketch_percentile_approx(data, quantile, buckets, error);
uddsketch_percentile_sketch(data, buckets, error);
percentile(uddsketch, quantile);

Algorithm/Usage

tdigest_estimate(data, quantile, buckets);
tdigest_sketch(data, buckets);
quantile(tdigest_sketch, quantile);

uddsketch_estimate(data, quantile, buckets, error);
uddsketch_sketch(data, error, buckets);
percentile(uddsketch, quantile);

Just Algorithm

tdigest(data, quantile, buckets);
tdigest(data, buckets);
percentile(tdigest_sketch, quantile);

uddsketch(data, quantile, buckets, error);
uddsketch(data, buckets, error);
percentile(uddsketch, quantile);

JLockerman · 2021-04-20T19:31:21Z

Algorithm-Generic API

This API is intended primarily for new users that want an approximate quantile, but don't necessarily care which algorithm they use. A straw version of this could be

percentile_approx(data, quantile, buckets, error)

There are a few open questions:

Does it pull its weight

It is not obvious whether these functions add enough to be worth having, especially if the algorithm-specific functions are of the form

<algorithm>_percentile_approx(...)

Should it be allowed in Continuous Aggregates

One of the most appealing use cases for approximate quantiles is retrospective analysis in continuous aggregates, where you create an aggregate storing a percentile_sketch(...), then investigate arbitrary quantiles at query time. If we allow this for the algorithm-agnostic API, this forces the API to choose a single approximation algorithm to use, while if we only allow percentile_approx(...), we may be able to change the recommended algorithm backwards compatibly (though depending on the arguments the various algorithms expect, this may be difficult).

Should the tuning parameters be defaulted

Both t-digest and uddsketch have a tunable number of buckets in which values can be stored, and uddsketch has a target-error parameter as well. Should these parameters be defaulted in percentile_approx(...) to ease getting started?

Should the algorithm used be alterable

It may be possible to implement a generic API that accepts the algorithm used as a parameter such as

percentile_approx(data, 'tdigest(100)');
percentile_approx(data, 'uddsketch(100, 0.01)');

Would such a function be easier to use than algorithm-specific functions? Note that such a function has awkward implications for continuous aggregates: while multiple digests of the same type can be combined, there is no general way to combine a t-digest and a uddsketch and get sensible output, so adding such an API may preclude allowing the function in continuous aggregates, limiting its overall usability.

JLockerman · 2021-04-20T19:31:37Z

Accessor Functions

Both t-digest and uddsketch accumulate additional data in the process of generating sketches; both calculate the sum and count of accumulated elements, while t-digest also calculates the min, max, and mean. It may be useful to expose functions to retrieve these values from the sketches, for retrospective analysis in continuous aggregates, e.g.

SELECT get_min(sketch), get_count(sketch) FROM digested;

where digested is a

CREATE MATERIALIZED VIEW digested
WITH (timescaledb.continuous) AS
SELECT tdigest_percentile_sketch(...) ...

get_<property> is not the best name for these. However, we don't want to use the natural names, count(...), sum(...), min(...), and max(...), because they would conflict with the builtin postgres functions of the same names.
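
As a rough illustration of why such accessors are cheap to offer, the Python sketch below tracks the side data mentioned above and keeps it mergeable. SketchStats, stats_from, and combine are hypothetical names; only the get_<property> naming follows the proposal, and nothing here reflects how the real sketches store these values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchStats:
    """Side data a sketch could accumulate alongside its buckets."""
    count: int
    total: float
    minimum: float
    maximum: float


def stats_from(data):
    """Build the side data for one non-empty batch of values."""
    values = list(data)
    return SketchStats(len(values), sum(values), min(values), max(values))


def combine(a, b):
    """Merge the side data of two sketches, as re-aggregation would."""
    return SketchStats(
        a.count + b.count,
        a.total + b.total,
        min(a.minimum, b.minimum),
        max(a.maximum, b.maximum),
    )


# Accessors in the get_<property> style; the mean is derived, not stored.
def get_count(stats):
    return stats.count


def get_sum(stats):
    return stats.total


def get_min(stats):
    return stats.minimum


def get_max(stats):
    return stats.maximum


def get_mean(stats):
    return stats.total / stats.count
```

Each accessor reads from the already-combined state, so asking for several properties does not require rescanning the underlying data.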

JLockerman · 2021-04-20T19:31:45Z

reserved for General Naming Principles

JLockerman · 2021-04-20T19:31:54Z

reserved just in case

JLockerman · 2021-04-21T17:02:09Z

What should be the default

uddsketch has more precise guarantees, but also has extra knobs to adjust. t-digest has a simpler interface, but it's more difficult to interpret what the results mean. It might be possible to pick a sensible default for the uddsketch error, but we don't know whether one exists.

WireBaron · 2021-04-28T16:34:10Z

I'd actually go with something of a mix of the approaches. I like just the algorithm name when building the aggregate object, and am really not a fan of tdigest_sketch (uddsketch_sketch is almost as bad). However, I feel that we should have a more explicit name for the direct calculation.

I also think naming the functions quantile is more correct than percentile, but I do see the advantage in following postgres's percentile_disc and percentile_cont semantics. I'd also like to be explicit that our function is returning an estimate or approximation. To this end, I'd like to see an overloaded percentile_approx(foo, quantile) function which would work on a tdigest, uddsketch, or directly on data using some sort of default implementation.

So, to match the formatting above, what I'm proposing is:

tdigest_estimate_quantile(data, quantile, buckets);
tdigest(data, buckets);
percentile_approx(tdigest_sketch, quantile);

uddsketch_estimate_quantile(data, quantile, buckets, error);
uddsketch(data, buckets, error);
percentile_approx(uddsketch, quantile);

And a generic:

percentile_approx(data, quantile);

2 replies

JLockerman May 4, 2021
Author

@WireBaron tdigest(...)/uddsketch(...) is better than what we have now, I'll update the frontrunner

JLockerman May 4, 2021
Author

I think we should stick to percentile_approx... for all the quantile-outputting functions

davidkohn88 · 2021-04-28T16:40:07Z

I think there's a question of how many ways we have of calling these algorithms. I think we should have a single, consistent way of calling them, i.e. percentile_approx(uddsketch(data, ...)) without the quantiles you're trying to get (or with the quantiles you're trying to get), because otherwise users will be confused by there being multiple different ways to get the same result. While there is a bit more overhead to learning what percentile_approx(uddsketch(data, ...)) means, having multiple different ways to call the same function is much more confusing and would scare me off when reading the documentation. If we are going to do that, I would definitely only do it in the generic case, not for the algorithm-specific bits, but even that I would avoid, because then we'd be advising users to do very different things in continuous aggregates than elsewhere, and that seems like it would again add confusion and difficult discussions to parts of the docs that should have a single, simple message.

davidkohn88 · 2021-04-28T16:41:20Z

Another question: Should we support array as input for the quantile estimator so we can do multiple at the same time?

i.e. percentile_approx(sketch, [0.1, 0.5, 0.9]) or the like

1 reply

JLockerman Apr 28, 2021
Author

if we have a percentile_approx() I'd say yes, but I'm fine leaving that for next release; it's big enough that it's a new feature.
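
One possible motivation for the array form discussed above is that several quantiles can be answered in a single pass over a sketch's buckets. The Python sketch below shows this under stated assumptions: the sketch is represented as a plain list of (value, count) pairs sorted by value, and percentiles_approx is a hypothetical name, not a proposed function.

```python
def percentiles_approx(bucket_counts, quantiles):
    """Estimate several quantiles in one pass over sorted buckets.

    bucket_counts: list of (representative_value, count) pairs, sorted by
    value (an assumed, simplified view of a sketch's contents).
    quantiles: values between 0 and 1, in any order.
    Returns the estimates in the same order as `quantiles`.
    """
    if not bucket_counts:
        raise ValueError("empty sketch")
    total = sum(count for _, count in bucket_counts)
    # Visit the requested quantiles in ascending order so the bucket
    # cursor only ever moves forward.
    order = sorted(range(len(quantiles)), key=lambda i: quantiles[i])
    results = [None] * len(quantiles)
    buckets = iter(bucket_counts)
    value, seen = next(buckets)
    for position in order:
        rank = quantiles[position] * (total - 1)
        while seen <= rank:
            value, count = next(buckets)
            seen += count
        results[position] = value
    return results
```

For example, with buckets [(1.0, 2), (5.0, 5), (9.0, 3)], asking for [0.9, 0.1, 0.5] walks the three buckets once and returns the estimates in the order requested.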

tylerfontaine · 2021-04-29T21:26:35Z

So ELI 5, cause I don't understand.

Why would this aggregate (vs, say, avg) be in two parts? It's different from other aggregates used in caggs, since I wouldn't normally pass another function to access the value, rather than just get the value that's been pre-calculated.

2 replies

davidkohn88 Apr 30, 2021

We should probably have a section in our docs where we explain this design choice more explicitly, as it's not all that common. It's something we're trying to do with all of our aggregates (so far, at least): we're trying to make them re-combinable. This means there's an extra layer when calling them in order to extract the final result, but it's the standard way of doing so, rather than having to derive that for every aggregate and think about whether each one is re-combinable.

In the case of this percentile stuff it would mean something like quantile(0.5, percentile_approx(value)) rather than percentile_approx(0.5, value) is the calling convention (naming TBD, but the basic gist is there).

Trying to explain the reasoning behind that: the main bit is that it gives you a lot more flexibility, especially in the continuous aggregate context, where
a) I can extract other percentiles than I planned for without rebuilding the continuous agg from scratch, so I can put off making that decision, and
b) I can re-aggregate in the continuous agg without issue, so percentile_approx(percentile_approx) works (unlike, say, avg(avg), though we might introduce an aggregate to fix that in the future).
c) You can use multiple accessors to get different values from these functions and only calculate the aggregate once. PG would try to optimize that, but not in a continuous agg, and it's unclear whether it does so if you have different inputs to the function; because a lot of these have some constant inputs and some variable ones, it's likely it wouldn't figure out that it can optimize that. So it can also be helpful in the non-continuous agg context, where it makes it more obvious to the planner what the common, optimizable aggregate step is, and then the extraction step is the same.

Now PG does have a way of optimizing that, i.e. the aggregates will be checked to see if they can use the same combine function in the aggregate node, and for things like sum(val) & avg(val) it works. But in this case, the planner can't tell that percentile_approx(0.99, value) and percentile_approx(0.5, value) should use the same thing, because the first arg is different between them, so it won't optimize that, and these are relatively expensive function calls compared to common aggregates.

So while this concept is slightly weird at first, I do think that most inexperienced users will just take the calling convention and copy paste it, and while it's slightly more writing, I don't think it's crazy more than otherwise. I could be wrong on that, and I also think this is a good opportunity for some docs on it (or a blog post or whatever) as education about how it works.
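
A small Python illustration of point (b), using two made-up time buckets: averaging per-bucket averages gives the wrong overall mean, whereas carrying a combinable partial state (sum and count here, a sketch in the percentile case) and extracting the final value in a separate step gives the right one. All function names are hypothetical.

```python
def avg(values):
    """Plain average: the finished result, with no state left to combine."""
    return sum(values) / len(values)


def partial_state(values):
    """Aggregate step: keep a (sum, count) pair instead of the answer."""
    return (sum(values), len(values))


def combine_states(states):
    """Re-aggregation step: partial states add up component-wise."""
    total = sum(s for s, _ in states)
    count = sum(c for _, c in states)
    return (total, count)


def finalize(state):
    """Accessor step: turn the combined state into the final value."""
    total, count = state
    return total / count


# Two buckets of unequal size, as a continuous aggregate might store them.
buckets = [[10.0, 20.0, 30.0], [100.0]]

avg_of_avgs = avg([avg(b) for b in buckets])
recombined = finalize(combine_states([partial_state(b) for b in buckets]))
true_avg = avg([v for b in buckets for v in b])
```

Here avg_of_avgs weights the single-value bucket as heavily as the three-value one, while recombined matches true_avg. The two-part sketch API applies the same idea to percentiles.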

JLockerman May 4, 2021
Author

@tylerfontaine one of the use cases we want to support with these percentile-approximation functions, and one that we don't think can be supported without them, is retrospective analysis of percentiles, for instance, if you roll out a new fast-path for your application, and want to see how it changes the distribution of latencies. Storing a continuous aggregate of specific percentiles won't necessarily give you useful information because the changepoints might be different, but storing and querying all the data is probably too expensive to do without good reason.

davidkohn88 · 2021-04-30T15:03:59Z

We have another proposal to put forward as well, which is to use an operator for percentile extraction, namely something like

SELECT percentile_approx(value) @ 0.99 FROM foo;
--it'd be re-combinable like other aggregates
SELECT percentile_approx(approx) @ 0.99 FROM continuous_agg_with_percentile;

1 reply

davidkohn88 Apr 30, 2021

Talking to others in the company on Slack, people seem to think this is worse than nested function calls, so this will probably not be the approach unless people speak up for it.
