Approximate Percentile #35

WireBaron · 2021-01-15T18:45:18Z

Original issue

**What's the functionality you would like to add** An approximate percentile function such as [t-digest](https://github.com/tdunning/t-digest). This would have three main advantages over `percentile_cont`:

- It can use less space when calculating over large datasets.
- The digesting part can be separated from the calculation of the percentile, allowing post-hoc analysis and more powerful VIEWs.
- It can be constructed in a way that can be used with continuous aggregates.
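
To illustrate why a mergeable digest matters for continuous aggregates, here is a minimal Python sketch. It is not part of the proposal: the bucket contents and names are made up, and an exact value-count table stands in for a real digest such as t-digest (which would also bound the storage).

```python
from collections import Counter
from statistics import median

# Two hypothetical time buckets of raw values.
bucket_a = [1, 2, 3]
bucket_b = [4, 5, 6, 7, 8, 9, 100]

# Per-bucket medians cannot be combined into the overall median.
median_of_medians = median([median(bucket_a), median(bucket_b)])

# A digest (here simply exact value counts) can be merged without loss.
digest_a = Counter(bucket_a)
digest_b = Counter(bucket_b)
merged = digest_a + digest_b


def digest_median(digest):
    """Median from a value -> count table, expanding the counts in sorted order."""
    return median(sorted(digest.elements()))


true_median = median(bucket_a + bucket_b)
merged_median = digest_median(merged)

print(median_of_medians)  # 4.5, not the real median
print(true_median)        # 5.5
print(merged_median)      # 5.5, recovered from the merged digest
```

Storing the digest per bucket and merging at query time is what lets a rollup over many buckets still answer percentile queries.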

**How would the function be used**

Basic percentile calculation works just like exact percentile calculation, except that it takes an accuracy measure (for t-digest, the number of centroids).

-- example dataset
SELECT * FROM data;
          time          | value 
------------------------+-------
 2020-01-03 01:00:00-05 |   1.0
 2020-01-03 01:10:00-05 |   2.0
 2020-01-03 01:20:00-05 |   3.0
 2020-01-03 01:30:00-05 |   4.0
 2020-01-03 01:40:00-05 |   5.0
           ...
 2020-01-03 16:50:00-05 |  96.0
 2020-01-03 17:00:00-05 |  97.0
 2020-01-03 17:10:00-05 |  98.0
 2020-01-03 17:20:00-05 |  99.0
 2020-01-03 17:30:00-05 | 100.0
(100 rows)

-- regular approximate percentile call
SELECT approx_percentile(value, 0.50, accuracy => 10) FROM data;
 approx_percentile
-------------------
              51.5
(1 row)

-- increasing the number of centroids increases the accuracy at the cost of more storage
SELECT approx_percentile(value, 0.50, accuracy => 100) FROM data;
 approx_percentile
-------------------
              50.5
(1 row)
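
For comparison, the exact answer that the approximate calls are judged against can be worked out by hand. The following Python sketch assumes the table holds the 100 values 1.0 through 100.0 implied by the listing, and uses the linear-interpolation rule that PostgreSQL's `percentile_cont` applies. The Python helper name is only illustrative.

```python
import math


def percentile_cont(values, fraction):
    """Exact continuous percentile: interpolate linearly between the two
    sorted values that surround the fractional row position."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile of empty input")
    rank = fraction * (len(ordered) - 1)  # zero-based fractional position
    lower = math.floor(rank)
    upper = math.ceil(rank)
    weight = rank - lower
    return ordered[lower] + weight * (ordered[upper] - ordered[lower])


# Assumed dataset: 1.0, 2.0, ..., 100.0
data = [float(v) for v in range(1, 101)]
exact_median = percentile_cont(data, 0.50)
print(exact_median)  # 50.5
```

Under that assumption the exact median is 50.5, which is what the `accuracy => 100` call above returns, while the `accuracy => 10` call is off by 1.0.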

When storing the data, for instance in continuous aggregates, the digest itself can be stored instead of the percentile, allowing future analysis to choose which data it wants to get out.

-- example dataset
SELECT * FROM data;
          time          | value 
------------------------+-------
 2020-01-03 01:00:00-05 |   1.0
 2020-01-03 01:10:00-05 |   2.0
 2020-01-03 01:20:00-05 |   3.0
 2020-01-03 01:30:00-05 |   4.0
 2020-01-03 01:40:00-05 |   5.0
           ...
 2020-01-03 16:50:00-05 |  96.0
 2020-01-03 17:00:00-05 |  97.0
 2020-01-03 17:10:00-05 |  98.0
 2020-01-03 17:20:00-05 |  99.0
 2020-01-03 17:30:00-05 | 100.0
(100 rows)

-- we output the /digest/ not the calculated percentile, with a fixed accuracy
-- in real usage this would likely be a continuous aggregate instead of a regular view
CREATE VIEW aggregated AS SELECT percentile_digest(value, accuracy => 100) AS digest FROM data;
CREATE VIEW

-- we can output various percentiles from the same digest
SELECT
    approx_percentile(digest, 0.10) tenth,
    approx_percentile(digest, 0.50) median,
    approx_percentile(digest, 0.90) ninetieth
FROM aggregated;
 tenth | median | ninetieth
-------+--------+-----------
  10.9 |   50.5 |  90.10001
(1 row)

-- and also certain other outputs, depending on the digest
SELECT avg(digest), min_val(digest), max_val(digest), num_elements(digest) FROM aggregated;
 avg  | min_val | max_val | num_elements
------+---------+---------+--------------
 50.5 |     1.0 |   100.0 |          100
(1 row)

Open Questions
TBD

WireBaron · 2021-01-15T18:49:42Z

Author

The proposed solution of using a T-Digest should work great for any arbitrary precision numeric types (decimal, double precision, etc). However, it's not clear that this is a great choice for integer types. Should we consider using a different underlying abstraction for these types? Does anyone have any suggestions off the top of their head for a better solution?

0 replies

Sytten · 2021-01-21T18:34:24Z

I would like to strongly push for the use of UDDSketch, which builds upon the DDSketch from Datadog. That is what we are using for graphmetrics.io, and we currently do most of the merge and approximation work using SQL functions. DDSketch is great for unbounded series, and UDDSketch allows for bounded series while keeping a fixed relative error (increasing with each downsampling, obviously).
DDSketch:

paper: https://arxiv.org/abs/1908.10693
implementation: https://github.com/DataDog/sketches-go (reworked for int here: https://github.com/GraphMetrics/ddsketch-go)

UDDSketch:

paper: https://arxiv.org/abs/2004.08604 and https://arxiv.org/abs/2101.06758
implementation: https://github.com/cafaro/UDDSketch and https://github.com/cafaro/PUDDSKETCH

I wrote a more detailed explanation here: #41
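
As a rough illustration of the relative-error idea behind DDSketch, here is a simplified Python sketch. It is not the Datadog or GraphMetrics implementation: the class name `LogBucketSketch` is hypothetical, it handles positive values only, and it never limits the number of buckets. Values are placed in logarithmically sized buckets so that any quantile estimate is within a factor of `alpha` of the true value, and two sketches merge by adding bucket counts.

```python
import math
from collections import Counter


class LogBucketSketch:
    """Simplified DDSketch-style sketch for positive values.

    Bucket i covers (gamma**(i-1), gamma**i] with gamma = (1+alpha)/(1-alpha).
    """

    def __init__(self, alpha=0.01):
        self.alpha = alpha
        self.gamma = (1 + alpha) / (1 - alpha)
        self.buckets = Counter()  # bucket index -> number of values
        self.count = 0

    def add(self, value):
        if value <= 0:
            raise ValueError("this simplified sketch only handles positive values")
        index = math.ceil(math.log(value) / math.log(self.gamma))
        self.buckets[index] += 1
        self.count += 1

    def merge(self, other):
        if other.alpha != self.alpha:
            raise ValueError("sketches must share the same alpha")
        self.buckets.update(other.buckets)  # Counter.update adds the counts
        self.count += other.count

    def quantile(self, q):
        if self.count == 0:
            raise ValueError("empty sketch")
        rank = q * (self.count - 1)
        seen = 0
        for index in sorted(self.buckets):
            seen += self.buckets[index]
            if seen > rank:
                # Representative value chosen so the relative error is at most alpha.
                return 2 * self.gamma ** index / (self.gamma + 1)
        raise AssertionError("unreachable: cumulative count always exceeds rank")


# Two partial sketches (e.g. two time buckets), merged before querying.
first, second = LogBucketSketch(), LogBucketSketch()
for v in range(1, 51):
    first.add(float(v))
for v in range(51, 101):
    second.add(float(v))
first.merge(second)
estimate = first.quantile(0.5)  # within 1% of 50.0, the value at that rank
```

Because the merge only adds counts, the order in which partial sketches are combined does not matter.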

3 replies

davidkohn88 Jan 21, 2021

@Sytten Awesome! What are the main differences that would make you choose this over the t-digest sketch? (Will also look at the papers etc in more detail, but might be nice to spell out for others and will probably help us with documentation later if we offer both!)

WireBaron Jan 21, 2021
Author

This looks very interesting. I'll take a look at the paper and try to throw together a quick skeleton implementation.

Sytten Jan 21, 2021

So I didn't dig super deep into t-digest, but my understanding is that it still uses rank-based error instead of relative error. They did do a lot of optimization, so the result is pretty good, but the relative error becomes quite high with distributions that have a long tail. t-digest is also one-way mergeable, versus fully mergeable for the DDSketch family.
DDSketch is a derivative of the popular HDR histogram, as is the circllhist implementation.
My preference for UDDSketch is that they proved that a fixed-size histogram can keep a constant relative accuracy. This is really novel (the second paper was released this week!), and for my use case of https://graphmetrics.io it was ideal.
Circllhist, like DDSketch, is unbounded; that is OK for some applications but not for others. When you don't know what your distribution looks like, it can be problematic (the DDSketch paper discusses this when they try the Pareto distribution). Circllhist claims that most of their histograms in production have fewer than 300 buckets (usually <5% error), while DDSketch (Datadog) says that it usually uses at most 1024 buckets for a constant 1% relative accuracy.
I put some more information in this google doc
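
To make the bounded-size idea concrete, here is a small Python sketch of the uniform collapse step as I understand it from the UDDSketch paper. It is not the reference implementation, and the function names are hypothetical. It assumes bucket `i` covers `(gamma**(i-1), gamma**i]` with `gamma = (1+alpha)/(1-alpha)`. Merging each pair of adjacent buckets squares `gamma`, so the relative error grows from `alpha` to `2*alpha / (1 + alpha**2)`.

```python
import math
from collections import Counter


def uniform_collapse(buckets, alpha):
    """Merge bucket pairs (2j-1, 2j) into bucket j and return the new error bound."""
    collapsed = Counter()
    for index, count in buckets.items():
        collapsed[math.ceil(index / 2)] += count
    new_alpha = 2 * alpha / (1 + alpha ** 2)
    return collapsed, new_alpha


def collapse_to_limit(buckets, alpha, max_buckets):
    """Collapse repeatedly until at most max_buckets buckets remain."""
    if max_buckets < 1:
        raise ValueError("max_buckets must be at least 1")
    while len(buckets) > max_buckets:
        buckets, alpha = uniform_collapse(buckets, alpha)
    return buckets, alpha


# Hypothetical bucket counts for a sketch built with alpha = 0.01.
original = Counter({1: 3, 2: 2, 3: 1, 4: 4, 5: 1})
collapsed, new_alpha = uniform_collapse(original, 0.01)
# Buckets 1 and 2 merge into 1, buckets 3 and 4 into 2, bucket 5 into 3.
```

Each collapse roughly halves the number of buckets and roughly doubles a small `alpha`, which is the trade-off behind the error increasing with each downsampling.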
