Approximate membership sketches #34
I'm not sure I see the point of doing an approximate query in a non-cached context for us; it would just end up being less efficient if this isn't pre-calculated (i.e., if you're going to scan the whole table, you may as well just look for the value). We might want to update the example to reflect that. As such, I think it's likely that there are 3 ways we'd want to introduce this type of functionality:
I'm intrigued by two other use cases in the continuous aggregate context that are sort of different types of "optimizations" on top of this:
Discussion for Approximate Member Queries
Original issue
What's the functionality you would like to add?
We'd like to provide an approximate member query, such as a quotient filter, that can quickly answer whether an element likely belongs to a set. This could then be used as a filter to quickly rule out the existence of a value in a database (or some range of a database) without performing a more expensive query.
How would the function be used?
The implementation would provide a Postgres aggregate which would compute the filter over some column of a table.
Consider a table listing real estate transactions; we might create a filter over the house location as follows:
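A minimal sketch of what this could look like; the table name `real_estate_transactions` and the functions `amq_filter` and `approx_contains` are hypothetical placeholders, not a settled API:

```sql
-- Hypothetical API sketch: build a quotient filter over the location column.
-- `amq_filter` (the aggregate) and `approx_contains` (the membership test)
-- are placeholder names.
CREATE MATERIALIZED VIEW location_filter AS
    SELECT amq_filter(location) AS filter
    FROM real_estate_transactions;

-- Cheaply rule out locations that are definitely absent:
SELECT approx_contains(filter, '742 Evergreen Terrace')
FROM location_filter;
-- false -> the location is definitely not in the table
-- true  -> the location is probably present; run the real query to confirm
```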
These filters should also be combinable, allowing efficient queries over distinct subsets (or partial aggregations in the case of continuous aggregation).
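For example (again with hypothetical names, and assuming some `rollup`-style combine function exists), partial filters computed per time bucket in a continuous aggregate could be merged at query time to answer membership over an arbitrary range:

```sql
-- Hypothetical sketch: per-day partial filters in a continuous aggregate,
-- merged at query time. `amq_filter`, `rollup`, and `approx_contains` are
-- placeholders; `sale_time` is an assumed timestamp column.
CREATE MATERIALIZED VIEW daily_location_filters
WITH (timescaledb.continuous) AS
    SELECT time_bucket('1 day', sale_time) AS day,
           amq_filter(location) AS filter
    FROM real_estate_transactions
    GROUP BY day;

-- Combine the daily partials to test membership over a date range:
SELECT approx_contains(rollup(filter), '742 Evergreen Terrace')
FROM daily_location_filters
WHERE day >= '2021-01-01';
```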
Why should this feature be added?
Postgres does not provide an efficient mechanism to test for the existence of a particular value short of actually looking up the row.
What scale is this useful at?
This will be more useful for large datasets. In order for this to be useful, the cost to look up a particular row by the filtered value will have to be significant. This will also be most useful with a workload that expects a high rate of negative checks (i.e. where most values checked are not present in the table).
Drawbacks
Quotient filters can be fairly large. In order to maintain efficient lookups, it may be necessary to store anywhere from 1.3 to 2.5 hashes per value (perhaps more depending on the implementation).
Open Questions
How should we determine how large to make the filter?
The false positive rate is directly related to both the size of the filter and the size of the table. We should be able to allow the user to specify a maximum tolerable false positive rate and size/resize the filter based upon that, but this could result in an unacceptably large filter if set too aggressively on too large a table. Alternatively, we could have the user specify an allowed size and leave them to deal with the resulting false positive rate. It should be fairly straightforward to provide functionality to resize the filter (by factors of 2), so it's not critical that this is set correctly initially. If the resizing is cheap enough, we'll also likely want to store smaller filters for partial aggregates.
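As a rough illustration of the trade-off (using the standard quotient filter approximation, not a commitment to any particular implementation): with 2^q slots, r remainder bits per slot, and n stored values, the load factor and false positive rate are roughly

```math
\alpha = \frac{n}{2^{q}}, \qquad \varepsilon \approx \alpha \cdot 2^{-r}
```

so each extra remainder bit roughly halves the false positive rate, as does doubling the number of slots at a fixed remainder width; either way, a lower false positive rate costs space, which is why resizing by factors of 2 is a natural knob.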
Alternatives
Bloom filters also provide this functionality, but a quotient-filter-based approach provides the following benefits over building an implementation on top of a Bloom filter:
1. Quotient filters can be efficiently merged without affecting their false positive rate.
2. Quotient filters handle hash collisions more efficiently than Bloom filters.
3. At false positive target rates less than 1/64, quotient filters require less space than Bloom filters.
Another approximate member query is the Cuckoo filter, which looks like it may be even more space-efficient than the quotient filter. However, it looks like this filter drops off dramatically in terms of lookup performance and space efficiency at lower false positive rates.