Approximate membership sketches #34
I'm not sure I see the point of doing an approximate query in a non-cached context for us; it would just end up being less efficient if this isn't pre-calculated (i.e., if you're going to scan the whole table, you may as well just look for the value). We might want to update the example to reflect that. As such, I think it's likely that there are 3 ways we'd want to introduce this type of functionality:
I'm intrigued by two other use cases in the continuous aggregate context that are sort of different types of "optimizations" on top of this:
Discussion for Approximate Member Queries
Original issue
What's the functionality you would like to add?
We'd like to provide an approximate member query, such as a quotient filter, that can quickly answer whether an element likely belongs to a set. This could then be used as a filter to quickly rule out the existence of a value in a database (or some range of a database) without performing a more expensive query.
How would the function be used?
The implementation would provide a Postgres aggregate which would compute the filter over some column of a table.
Consider a table listing real estate transactions; we might create a filter over the house location as follows:
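A minimal sketch of what this could look like; the table name `real_estate_transactions` and the functions `amq_filter` and `approx_contains` are hypothetical placeholders, not a settled API:

```sql
-- Hypothetical API sketch: build a quotient filter over the location column.
-- `amq_filter` (the aggregate) and `approx_contains` (the membership test)
-- are placeholder names.
CREATE MATERIALIZED VIEW location_filter AS
    SELECT amq_filter(location) AS filter
    FROM real_estate_transactions;

-- Cheaply rule out locations that are definitely absent:
SELECT approx_contains(filter, '742 Evergreen Terrace')
FROM location_filter;
-- false -> the location is definitely not in the table
-- true  -> the location is probably present; run the real query to confirm
```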
These filters should also be combinable, allowing efficient queries over distinct subsets (or partial aggregations in the case of continuous aggregation).
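For example (again with hypothetical names, and assuming some `rollup`-style combine function exists), partial filters computed per time bucket in a continuous aggregate could be merged at query time to answer membership over an arbitrary range:

```sql
-- Hypothetical sketch: per-day partial filters in a continuous aggregate,
-- merged at query time. `amq_filter`, `rollup`, and `approx_contains` are
-- placeholders; `sale_time` is an assumed timestamp column.
CREATE MATERIALIZED VIEW daily_location_filters
WITH (timescaledb.continuous) AS
    SELECT time_bucket('1 day', sale_time) AS day,
           amq_filter(location) AS filter
    FROM real_estate_transactions
    GROUP BY day;

-- Combine the daily partials to test membership over a date range:
SELECT approx_contains(rollup(filter), '742 Evergreen Terrace')
FROM daily_location_filters
WHERE day >= '2021-01-01';
```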
Why should this feature be added?
Postgres does not provide an efficient mechanism to test for the existence of a particular value short of actually looking up the row.
What scale is this useful at?
This will be more useful for large datasets. In order for this to be useful, the cost to look up a particular row by the filtered value will have to be significant. This will also be most useful with a workload that expects a high rate of negative checks (i.e. where most values checked are not present in the table).
Drawbacks
Quotient filters can be fairly large. In order to maintain efficient lookups, it may be necessary to store anywhere from 1.3 to 2.5 hashes per value (perhaps more depending on the implementation).
Open Questions
How should we determine how large to make the filter?
The false positive rate is directly related to both the size of the filter and the size of the table. We should be able to allow the user to specify a maximum tolerable false positive rate and size/resize the filter based upon that, but this could result in an unacceptably large filter if set too aggressively on too large a table. Alternatively, we could have the user specify an allowed size and leave them to deal with the resulting false positive rate. It should be fairly straightforward to provide functionality to resize the filter (by factors of 2), so it's not critical that this is set correctly initially. If the resizing is cheap enough, we'll also likely want to store smaller filters for partial aggregates.
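As a rough illustration of the trade-off (using the standard quotient filter approximation, not a commitment to any particular implementation): with 2^q slots, r remainder bits per slot, and n stored values, the load factor and false positive rate are roughly

```math
\alpha = \frac{n}{2^{q}}, \qquad \varepsilon \approx \alpha \cdot 2^{-r}
```

so each extra remainder bit roughly halves the false positive rate, as does doubling the number of slots at a fixed remainder width; either way, a lower false positive rate costs space, which is why resizing by factors of 2 is a natural knob.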
Alternatives
Bloom filters also provide this functionality, but a quotient-filter-based approach provides the following benefits over building an implementation on top of a Bloom filter:
1. Quotient filters can be efficiently merged without affecting their false positive rate.
2. Quotient filters handle hash collisions more efficiently than Bloom filters.
3. At false positive target rates less than 1/64, quotient filters require less space than Bloom filters.
Another approximate member query is the Cuckoo filter, which looks like it may be even more space-efficient than the quotient filter. However, it looks like this filter drops off dramatically in terms of lookup performance and space efficiency at lower false positive rates.