Top N Aggregates #40

davidkohn88 · 2021-01-19T18:34:04Z

Original issue in #38:
In situations where items follow a Zipfian or other long-tailed distribution, one often wants an aggregate that computes the top N most frequent values and lops off the rest for storage and lookup efficiency. Take the case where you're monitoring traffic by URL and want to show the top 100 URLs by traffic in a day: millions of URLs might be hit, but most of them get only a few hits, and what you care about are the top 100. An aggregate could store the top 100 URLs and their counts on a daily basis, and you could then re-aggregate across multiple days if needed (the result would be an approximation of the true top 100, but it would be relatively accurate). I think this could be significantly more useful than the count-min sketch in #6.
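To make the re-aggregation caveat concrete, here is a minimal Python sketch, not a proposed implementation. The function names `top_n` and `reaggregate` and the sample URLs are hypothetical, and the per-day counts are computed exactly before truncation, which a real aggregate might not do.

```python
from collections import Counter

def top_n(values, n):
    """Return the n most frequent values as (value, count) pairs, highest count first."""
    return Counter(values).most_common(n)

def reaggregate(daily_tops, n):
    """Combine truncated daily top-N lists into one approximate top-N."""
    total = Counter()
    for day in daily_tops:
        for value, count in day:
            total[value] += count
    return total.most_common(n)

# Two days of (made-up) traffic, each stored as a top-2 aggregate.
day1 = top_n(["/a"] * 5 + ["/b"] * 3 + ["/c"] * 2, n=2)  # "/c" is dropped
day2 = top_n(["/a"] * 6 + ["/c"] * 4 + ["/b"] * 1, n=2)  # "/b" is dropped
combined = reaggregate([day1, day2], n=2)
print(combined)
```

Here `/c` really had 6 hits over the two days, but the combined aggregate only sees the 4 from the second day, because the first day's 2 hits were cut off. The ranking is still right in this example, while the count is an underestimate.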

davidkohn88 · 2021-01-19T18:34:31Z

What would the best representation of this be? Would we do compression on the data in the aggregate?

2 replies

JLockerman Jan 19, 2021

One tempting option is to store fixed-size arrays of (count, value) ordered by decreasing count. This would allow finding the top-n elements very quickly, but merging is not trivial; the obvious way is to temporarily build a value -> count map, update all the counts, re-sort the tuples, and throw out the smallest values. We'll have to look at the related work to see if there are better things to do.
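A minimal Python sketch of that "obvious" merge, under a few assumptions: `merge_top_n` is a hypothetical name, values are hashable and comparable, and ties are broken by value purely to keep the output deterministic.

```python
def merge_top_n(left, right, n):
    """Merge two lists of (count, value) tuples, each sorted by decreasing count.

    Builds a temporary value -> count map, sums the counts, re-sorts,
    and keeps only the n largest entries.
    """
    counts = {}
    for count, value in left + right:
        counts[value] = counts.get(value, 0) + count
    # Sort by decreasing count; break ties by value (an assumption of this sketch).
    merged = sorted(
        ((count, value) for value, count in counts.items()),
        key=lambda cv: (-cv[0], cv[1]),
    )
    return merged[:n]

left = [(5, "a"), (3, "b")]
right = [(4, "b"), (2, "c")]
print(merge_top_n(left, right, n=2))
```

The cost is a full sort of up to `2n` entries per merge plus the temporary map, which is the overhead the related work might help avoid.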

JLockerman Jan 19, 2021

http://archive.dimacs.rutgers.edu/Workshops/WGUnifyingTheory/Slides/cormode.pdf looks like a decent overview of various algorithms

davidkohn88 · 2021-01-19T18:40:15Z

What would the API look like? I assume we would unnest the data? Would we only support text values (and just have people cast ints or whatever else to text), and would we unnest the results? Could you downsample and just get the top 10 from a top 100 aggregate?
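One possible shape for these questions, sketched in Python purely for illustration. The `TopN` class and its `downsample` and `unnest` methods are hypothetical and not an existing API. The sketch assumes entries are kept sorted by decreasing count, so downsampling is just truncation, and it is generic over the value type rather than text-only.

```python
from dataclasses import dataclass
from typing import Generic, Hashable, Iterator, Tuple, TypeVar

T = TypeVar("T", bound=Hashable)

@dataclass(frozen=True)
class TopN(Generic[T]):
    n: int
    entries: Tuple[Tuple[T, int], ...]  # (value, count), sorted by decreasing count

    def downsample(self, k: int) -> "TopN[T]":
        """Return a smaller top-k aggregate; growing is not possible after truncation."""
        if k > self.n:
            raise ValueError("cannot grow a top-N aggregate")
        return TopN(n=k, entries=self.entries[:k])

    def unnest(self) -> Iterator[Tuple[T, int]]:
        """Yield one (value, count) row per stored entry."""
        yield from self.entries

agg = TopN(n=3, entries=(("/a", 9), ("/b", 4), ("/c", 1)))
for value, count in agg.downsample(2).unnest():
    print(value, count)
```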

JLockerman · 2021-01-19T18:42:34Z

Some links that might be relevant:
