Skip to content
This repository has been archived by the owner on Jul 7, 2020. It is now read-only.

Add cardinality support to BloomFilter. #133

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

b4hand
Copy link
Contributor

@b4hand b4hand commented May 16, 2017

This is based on the formula provided by the Wikipedia article:

https://en.wikipedia.org/wiki/Bloom_filter#Approximating_the_number_of_items_in_a_Bloom_filter

@b4hand
Copy link
Contributor Author

b4hand commented May 16, 2017

The tests pass for me locally on Java 7, and the build failure in Travis looks to be completely unrelated to this code. I'm not sure what to do about it.

@b4hand
Copy link
Contributor Author

b4hand commented May 16, 2017

Ultimately, I think it would be nice to make BloomFilter implement ICardinality as well, but that will require a larger refactor since currently BloomFilter doesn't accept hashes of items directly. I think it is possible to support that functionality though with some bigger changes.

@b4hand
Copy link
Contributor Author

b4hand commented May 16, 2017

Looking at the build history #108 has the same build failure as well, so it seems highly unlikely that the build failure is due to this code.

@b4hand
Copy link
Contributor Author

b4hand commented May 30, 2017

I rebased to the latest master which drops the openjdk7 build, so the build now passes.

@mythguided
Copy link
Contributor

mythguided commented May 31, 2017

Based on my read of the paper cited there, I'd want something that helps check whether A is sufficiently far from N for the estimate to be valid.

(Sorry for the unintentional close there)

@mythguided mythguided closed this May 31, 2017
@mythguided mythguided reopened this Jun 1, 2017
@b4hand
Copy link
Contributor Author

b4hand commented Jun 5, 2017

I believe when A approaches N, that means fractionOfBits approaches 1 (ie. the filter is saturated). According to the documentation for Math.log1p, as -fractionOfBits approaches -1, Math.log1p(-fractionOfBits) approaches negative infinity which means the expression -m * Math.log1p(-fractionOfBits) / hashCount will become positive infinity. Math.round says that it returns Long.MAX_VALUE in this case, so that seems reasonable behavior to me. I can play around with a few values and see if it produces reasonable numbers at these outliers, but I specifically chose Math.log1p instead of Math.log for this behavior.

The alternative given in the paper is to use equation 5 under these circumstances for an alternative estimate. What threshold do you think is appropriate for "sufficiently far"?

@abramsm
Copy link
Contributor

abramsm commented Jul 7, 2017

can you add more tests of larger, random data sets to show this works for non-trivial use cases?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants