RFC: approximate_quantile API #113
Replies: 12 comments 6 replies
-
Changelog
|
Beta Was this translation helpful? Give feedback.
-
Algorithm-Specific APIThe algorithm-specific API allows users to explicitly specify which approximate-quantile algorithm they wish to use. Since this best specifies authorial intent in the code, we believe it will be the most commonly used interface. The usability of this interface can be graded on three features:
in general we expect two main kinds of usages
Options for the concrete names for these functions include: Full descriptiontdigest_percentile_approx(data, quantile, buckets);
tdigest_percentile_sketch(data, buckets);
percentile(tdigest_sketch, quantile); uddsketch_percentile_approx(data, quantile, buckets, error);
uddsketch_percentile_sketch(data, buckets, error);
percentile(uddsketch, quantile); Algorithm/Usagetdigest_estimate(data, quantile, buckets);
tdigest_sketch(data, buckets);
quantile(tdigest_sketch, quantile); uddsketch_estimate(data, quantile, buckets, error);
uddsketch_sketch(data, error, buckets);
percentile(uddsketch, quantile); Just Algorithmtdigest(data, quantile, buckets);
tdigest(data, buckets);
percentile(tdigest_sketch, quantile); uddsketch(data, quantile, buckets, error);
uddsketch(data, buckets, error);
percentile(uddsketch, quantile); |
Beta Was this translation helpful? Give feedback.
-
Algorithm-Generic APIThis API is intended primarily for new users that want an approximate quantile, but don't necessarily care which algorithm they use. A straw version of this percentile_approx(data, quantile, buckets, error) there are a few open questions: Does it pull its weightIt is not obvious whether these functions add enough to be worth having, especially if the algorithm-specific functions are of the form <algorithm>_percentile_approx(...) Should it be allowed in Continuous AggregatesOne of the most appealing use cases for approximate quantiles is retrospective analysis in continuous aggregates, where you create an aggregate storing a Should the tuning parameters be defaultedBoth t-digest and uddsketch have tunable number of buckets in which values can be stored, and uddsketch has a target-error parameter as well. Should these parameters be defaulted in Should the algorithm used be alterableIt may be possible to implement a generic API that accepts the algorithm used as a parameter such as percentile_approx(data, 'tdigest(100)');
percentile_approx(data, 'uddsketch(100, 0.01)'); Would such a function be easier to use than algorithm-specific functions. Note that such a function has awkward implications for continuous aggregates; while multiple digests of the same type can be combined, there is no general way to combine a t-digest and a uddsketch and get sensible output, so adding such an API may preclude allowing the function in continuous aggregates, limiting its overall usability. |
Beta Was this translation helpful? Give feedback.
-
Accessor FunctionsBoth t-digest and uddsketch accumulate additional data in the process of generating sketches; both calculate the SELECT get_min(sketch), get_count(sketch) FROM digested; where CREATE MATERIALIZED VIEW digested
WITH (timescaledb.continuous) AS
SELECT tdigest_percentile_sketch(...) ...
|
Beta Was this translation helpful? Give feedback.
-
reserved for General Naming Principles |
Beta Was this translation helpful? Give feedback.
-
reserved just in case |
Beta Was this translation helpful? Give feedback.
-
What should be the defaultuddsketch has more precise guarantees, but also has extra knobs to adjust. t-digest has a simpler interface, but it's more difficult to interpret what the results mean. It might be possible to pick a sensible default for uddsketch error, but we don't know if it is. |
Beta Was this translation helpful? Give feedback.
-
I'd actually go with something of a mix of the approaches. I like just the algorithm name when building the aggregate object, and am really not a fan of tdigest_sketch (uddsketch_sketch is almost as bad). However, I feel that we should have a more explicit name for the direct calculation. I also think naming the functions quantile is more correct than percentile, but I do see the advantage in following postgres's percentile_disc and percentile_cont semantics. I'd also like to be explicit that our function is returning an estimate or approximation. To this end, I'd like to see an overloaded percentile_approx(foo, quantile) function which would work on a tdigest, uddsketch, or directly on data using some sort of default implementation. So, to match the formatting above, what I'm proposing is:
And a generic:
|
Beta Was this translation helpful? Give feedback.
-
I think there's a question of how many ways we have of calling these algorithms, I think we should have a single, consistent way of calling them, ie |
Beta Was this translation helpful? Give feedback.
-
Another question: Should we support array as input for the quantile estimator so we can do multiple at the same time? ie |
Beta Was this translation helpful? Give feedback.
-
So ELI 5, cause I don't understand. Why would this aggregate (vs, say, avg) be in two parts? It's different from other aggregates used in caggs, since I wouldn't normally pass another function to access the value, rather than just get the value that's been pre-calculated. |
Beta Was this translation helpful? Give feedback.
-
We have another proposal to put forward as well, which is to use an operator for percentile extraction, namely something like SELECT percentile_approx(value) @ 0.99 FROM foo;
--it'd be re-combinable like other aggregates
SELECT percentile_approx(approx) @ 0.99 FROM continuous_agg_with_percentile;
|
Beta Was this translation helpful? Give feedback.
-
Deadline May 7th 2020
With our intent to stabilize our tdigest and uddsketch implementations in May 2021 (tracking issue #107), we would like to come to a consensus on what their APIs should look like, as well as determine some principles for similar APIs.
Summary
Both tdigest and uddsketch are approximate quantile algorithms. They have similar APIs, but slightly different inputs, and outputs, along with very different guarantees, making each suitable for different use cases. We would like our APIs to be suitable both for users who only know that they want an approximate quantile but don't care what kind, and for users who want the specific guarantees provided by a specific algorithm.
Background
We broadly expect to see approximately 2 kinds of applications, and 2 kinds of users, each of which desire a different API. For applications there is:
approximate_percentile(data, quantiles)
function.approximate_percentile(sketch, qunatiles)
and for users
There’s a tension between these.
API
At a high level, these functions offer two key capabilities (syntax is for demonstration only)
percentile_approx(data, quantile)
which returns the approximate value of a dataset at a quantile.digest(data)
that returns a digest-ed form of that data, which can then be used in a functionpercentile_approx(digest, quantile)
which returns the approximate value of the digested dataset at a quantile. This pair of functions enables retrospective analysis using continuous aggregates, and is expected to be the primary API.There are a number of open questions about making an API out of these capabilities:
approximate_quantile()
functions for users that don't care which particular algorithm they use.Frontrunner
The current frontrunner is a fully-descriptive API for general usage, along with a
percentile_approx(...)
function that forwards to uddsketch for users who are unsure of what they want. That is, the API would consist ofused like
Beta Was this translation helpful? Give feedback.
All reactions