Add cardinality support to BloomFilter. #133

b4hand · 2017-05-16T16:16:36Z

This is based on the formula provided by the Wikipedia article:

https://en.wikipedia.org/wiki/Bloom_filter#Approximating_the_number_of_items_in_a_Bloom_filter
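
For reference, the estimate in that section is n* = -(m / k) * ln(1 - X / m), where m is the number of bits in the filter, k is the number of hash functions, and X is the number of bits set to one. A minimal standalone sketch of that formula follows. The class and method names are hypothetical and are not the code in this PR.

```java
final class BloomCardinalitySketch {
    /**
     * Estimates how many distinct items were inserted into a Bloom filter.
     *
     * @param bitsSet number of bits currently set to one (X)
     * @param m       total number of bits in the filter
     * @param k       number of hash functions
     */
    static double estimate(long bitsSet, long m, int k) {
        double fractionOfBits = (double) bitsSet / m;
        // log1p(-x) computes ln(1 - x) and stays accurate when x is close to zero.
        return -((double) m / k) * Math.log1p(-fractionOfBits);
    }

    public static void main(String[] args) {
        // Empty filter: no bits set, so the estimate is zero.
        System.out.println(estimate(0, 1024, 4));
        // Half of the bits set: -(1024 / 4) * ln(0.5), roughly 177.4.
        System.out.println(estimate(512, 1024, 4));
    }
}
```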

b4hand · 2017-05-16T17:21:35Z

The tests pass for me locally on Java 7, and the build failure in Travis looks to be completely unrelated to this code. I'm not sure what to do about it.

b4hand · 2017-05-16T17:45:59Z

Ultimately, I think it would be nice to make BloomFilter implement ICardinality as well, but that will require a larger refactor since currently BloomFilter doesn't accept hashes of items directly. I think it is possible to support that functionality though with some bigger changes.

b4hand · 2017-05-16T19:55:44Z

Looking at the build history, #108 has the same build failure, so it seems highly unlikely that the failure is caused by this code.

b4hand · 2017-05-30T19:38:28Z

I rebased to the latest master which drops the openjdk7 build, so the build now passes.

mythguided · 2017-05-31T23:59:57Z

Based on my reading of the paper cited there, I'd want something that helps check whether A (the number of set bits) is sufficiently far from N (the total number of bits) for the estimate to be valid.

(Sorry for the unintentional close there)

b4hand · 2017-06-05T23:51:26Z

I believe that when A approaches N, fractionOfBits approaches 1 (i.e., the filter is saturated). According to the documentation for Math.log1p, as -fractionOfBits approaches -1, Math.log1p(-fractionOfBits) approaches negative infinity, which means the expression -m * Math.log1p(-fractionOfBits) / hashCount becomes positive infinity. The Math.round documentation says it returns Long.MAX_VALUE in this case, so that seems like reasonable behavior to me. I can play around with a few values and see whether it produces reasonable numbers at these outliers, but I specifically chose Math.log1p instead of Math.log for this behavior.

The alternative given in the paper is to use equation 5 for the estimate under these circumstances. What threshold do you think is appropriate for "sufficiently far"?
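
The saturation behavior described above can be checked in isolation. This sketch reuses the expression from the comment with hypothetical values m = 1024 and hashCount = 4. The class and method names are illustrative and are not part of the PR.

```java
final class SaturationSketch {
    /** Rounded cardinality estimate, using the expression quoted in the comment. */
    static long roundedEstimate(long bitsSet, long m, int hashCount) {
        double fractionOfBits = (double) bitsSet / m;
        // At fractionOfBits == 1.0, log1p(-1.0) is negative infinity, so the
        // quotient is positive infinity and Math.round clamps to Long.MAX_VALUE.
        return Math.round(-m * Math.log1p(-fractionOfBits) / hashCount);
    }

    public static void main(String[] args) {
        // One bit short of saturation: still a finite estimate.
        System.out.println(roundedEstimate(1023, 1024, 4)); // prints 1774
        // Fully saturated filter: the estimate jumps to Long.MAX_VALUE.
        System.out.println(roundedEstimate(1024, 1024, 4)); // prints 9223372036854775807
    }
}
```

The jump from 1774 to Long.MAX_VALUE between these two inputs shows how quickly the estimate degrades near saturation, which is where a validity threshold would apply.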

abramsm · 2017-07-07T20:38:01Z

Can you add more tests on larger, random data sets to show that this works for non-trivial use cases?
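
One possible shape for such a check is sketched below. It does not use the library's BloomFilter class. Instead it assumes a simple BitSet-backed filter with double hashing so that the example is self-contained, and the sizes, seed, and names are all hypothetical.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

final class RandomizedCardinalityCheck {
    /** 64-bit mixing function (the SplitMix64 finalizer), used as a stand-in hash. */
    static long mix(long z) {
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    /** Inserts random longs into a toy filter and returns the estimate's relative error. */
    static double relativeError(int items, int m, int k, long seed) {
        BitSet bits = new BitSet(m);
        Set<Long> distinct = new HashSet<>(); // exact count for comparison
        Random random = new Random(seed);
        for (int n = 0; n < items; n++) {
            long item = random.nextLong();
            distinct.add(item);
            long h1 = mix(item);
            long h2 = mix(item + 0x9E3779B97F4A7C15L) | 1L; // odd step for double hashing
            for (int i = 0; i < k; i++) {
                bits.set((int) Math.floorMod(h1 + i * h2, (long) m));
            }
        }
        double fractionOfBits = (double) bits.cardinality() / m;
        double estimate = -((double) m / k) * Math.log1p(-fractionOfBits);
        return Math.abs(estimate - distinct.size()) / distinct.size();
    }

    public static void main(String[] args) {
        // 50,000 random items in a 2^20-bit filter with 5 hash functions.
        System.out.println(relativeError(50_000, 1 << 20, 5, 42L));
    }
}
```

A real test in this PR would insert into the library's BloomFilter and compare its cardinality estimate against the exact count with a tolerance chosen for the filter's load.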

b4hand mentioned this pull request May 16, 2017

Add support for intersection to BloomFilter. #134

Open

Add cardinality support to BloomFilter.

5c8d34a

b4hand force-pushed the add-cardinality branch from 0e2adba to 5c8d34a Compare May 30, 2017 19:25

mythguided closed this May 31, 2017

mythguided reopened this Jun 1, 2017
