public final class BloomFilter extends Object
A Bloom filter is a data structure that can be used for probabilistic set membership.
When querying a Bloom filter, there are no false positives. Specifically: When querying an item that has already been inserted to the filter, the filter will always indicate that the item is present. There is a chance of false positives, where querying an item that has never been presented to the filter will indicate that the item has already been seen. Consequently, any query should be interpreted as "might have seen."
A standard Bloom filter is unlike typical sketches in that it is not sub-linear in size and does not resize itself. A Bloom filter will work up to a target number of distinct items, beyond which it will saturate and the false positive rate will start to increase. The size of a Bloom filter will be linear in the expected number of distinct items.
See the BloomFilterBuilder class for methods to create a filter, especially one sized correctly for a target number of distinct elements and a target false positive probability.
This implementation uses xxHash64 and follows the approach in Kirsch and Mitzenmacher, "Less Hashing, Same Performance: Building a Better Bloom Filter," Wiley Interscience, 2008, pp. 187-218.
Modifier and Type | Field and Description |
---|---|
static long |
MAX_SIZE_BITS
The maximum size of a bloom filter in bits.
|
Modifier and Type | Method and Description |
---|---|
long |
getBitsUsed()
Returns the number of bits in the BloomFilter that are set to 1.
|
long |
getCapacity()
Returns the total number of bits in the BloomFilter.
|
double |
getFillPercentage()
Returns the percentage of all bits in the BloomFilter set to 1.
|
short |
getNumHashes()
Returns the configured number of hash functions for this BloomFilter
|
long |
getSeed()
Returns the hash seed for this BloomFilter.
|
static long |
getSerializedSize(long numBits)
Returns the serialized length of a non-empty BloomFilter of the given size, in bytes
|
long |
getSerializedSizeBytes()
Returns the length of this BloomFilter when serialized, in bytes
|
boolean |
hasMemory()
Returns whether the filter has a backing Memory object
|
static BloomFilter |
heapify(org.apache.datasketches.memory.Memory mem)
Reads a serialized image of a BloomFilter from the provided Memory
|
void |
intersect(BloomFilter other)
Intersects two BloomFilters by applying a logical AND.
|
void |
invert()
Inverts all the bits of the BloomFilter.
|
boolean |
isCompatible(BloomFilter other)
Helps identify if two BloomFilters may be unioned or intersected.
|
boolean |
isDirect()
Returns whether the filter is a direct (off-heap) or on-heap object.
|
boolean |
isEmpty()
Checks if the BloomFilter has processed any items
|
boolean |
isReadOnly()
Returns whether the filter is in read-only mode.
|
boolean |
query(byte[] data)
Queries the filter with the provided byte[] and returns whether the
array might have been seen previously.
|
boolean |
query(char[] data)
Queries the filter with the provided char[] and returns whether the
array might have been seen previously.
|
boolean |
query(double item)
Queries the filter with the provided double and returns whether the
value might have been seen previously.
|
boolean |
query(int[] data)
Queries the filter with the provided int[] and returns whether the
array might have been seen previously.
|
boolean |
query(long item)
Queries the filter with the provided long and returns whether the
value might have been seen previously.
|
boolean |
query(long[] data)
Queries the filter with the provided long[] and returns whether the
array might have been seen previously.
|
boolean |
query(org.apache.datasketches.memory.Memory mem)
Queries the filter with the provided Memory and returns whether the
data might have been seen previously.
|
boolean |
query(short[] data)
Queries the filter with the provided short[] and returns whether the
array might have been seen previously.
|
boolean |
query(String item)
Queries the filter with the provided double and returns whether the
value might have been seen previously.
|
boolean |
queryAndUpdate(byte[] data)
Updates the filter with the provided byte[] and
returns the result from quering that array prior to the update.
|
boolean |
queryAndUpdate(char[] data)
Updates the filter with the provided char[] and
returns the result from quering that array prior to the update.
|
boolean |
queryAndUpdate(double item)
Updates the filter with the provided double and
returns the result from quering that value prior to the update.
|
boolean |
queryAndUpdate(int[] data)
Updates the filter with the provided int[] and
returns the result from quering that array prior to the update.
|
boolean |
queryAndUpdate(long item)
Updates the filter with the provided long and
returns the result from quering that value prior to the update.
|
boolean |
queryAndUpdate(long[] data)
Updates the filter with the provided long[] and
returns the result from quering that array prior to the update.
|
boolean |
queryAndUpdate(org.apache.datasketches.memory.Memory mem)
Updates the filter with the provided Memory and
returns the result from quering that Memory prior to the update.
|
boolean |
queryAndUpdate(short[] data)
Updates the filter with the provided short[] and
returns the result from quering that array prior to the update.
|
boolean |
queryAndUpdate(String item)
Updates the filter with the provided String and
returns the result from quering that value prior to the update.
|
void |
reset()
Resets the BloomFilter to an empty state
|
byte[] |
toByteArray()
Serializes the current BloomFilter to an array of bytes.
|
long[] |
toLongArray()
Serializes the current BloomFilter to an array of longs.
|
String |
toString() |
void |
union(BloomFilter other)
Unions two BloomFilters by applying a logical OR.
|
void |
update(byte[] data)
Updates the filter with the provided byte[].
|
void |
update(char[] data)
Updates the filter with the provided char[].
|
void |
update(double item)
Updates the filter with the provided double value.
|
void |
update(int[] data)
Updates the filter with the provided int[].
|
void |
update(long item)
Updates the filter with the provided long value.
|
void |
update(long[] data)
Updates the filter with the provided long[].
|
void |
update(org.apache.datasketches.memory.Memory mem)
Updates the filter with the data in the provided Memory.
|
void |
update(short[] data)
Updates the filter with the provided short[].
|
void |
update(String item)
Updates the filter with the provided String.
|
static BloomFilter |
wrap(org.apache.datasketches.memory.Memory mem)
Wraps the given Memory into this filter class.
|
static BloomFilter |
writableWrap(org.apache.datasketches.memory.WritableMemory wmem)
Wraps the given WritableMemory into this filter class.
|
public static final long MAX_SIZE_BITS
public static BloomFilter heapify(org.apache.datasketches.memory.Memory mem)
mem
- Memory containing a previously serialized BloomFilterpublic static BloomFilter wrap(org.apache.datasketches.memory.Memory mem)
mem
- the given Memory objectpublic static BloomFilter writableWrap(org.apache.datasketches.memory.WritableMemory wmem)
wmem
- the given WritableMemory objectpublic void reset()
public boolean isEmpty()
public long getBitsUsed()
public long getCapacity()
public short getNumHashes()
public long getSeed()
public boolean hasMemory()
public boolean isReadOnly()
public boolean isDirect()
public double getFillPercentage()
public void update(long item)
item
- an item with which to update the filterpublic void update(double item)
item
- an item with which to update the filterpublic void update(String item)
Note: this will not produce the same output hash values as the update(char[])
method and will generally be a little slower depending on the complexity of the UTF8 encoding.
item
- an item with which to update the filterpublic void update(byte[] data)
data
- an array with which to update the filterpublic void update(char[] data)
data
- an array with which to update the filterpublic void update(short[] data)
data
- an array with which to update the filterpublic void update(int[] data)
data
- an array with which to update the filterpublic void update(long[] data)
data
- an array with which to update the filterpublic void update(org.apache.datasketches.memory.Memory mem)
mem
- a Memory object with which to update the filterpublic boolean queryAndUpdate(long item)
item
- an item with which to update the filterpublic boolean queryAndUpdate(double item)
item
- an item with which to update the filterpublic boolean queryAndUpdate(String item)
Note: this will not produce the same output hash values as the queryAndUpdate(char[])
method and will generally be a little slower depending on the complexity of the UTF8 encoding.
item
- an item with which to update the filterpublic boolean queryAndUpdate(byte[] data)
data
- an array with which to update the filterpublic boolean queryAndUpdate(char[] data)
data
- an array with which to update the filterpublic boolean queryAndUpdate(short[] data)
data
- an array with which to update the filterpublic boolean queryAndUpdate(int[] data)
data
- an array with which to update the filterpublic boolean queryAndUpdate(long[] data)
data
- an array with which to update the filterpublic boolean queryAndUpdate(org.apache.datasketches.memory.Memory mem)
mem
- an array with which to update the filterpublic boolean query(long item)
item
- an item with which to query the filterpublic boolean query(double item)
item
- an item with which to query the filterpublic boolean query(String item)
Note: this will not produce the same output hash values as the update(char[])
method and will generally be a little slower depending on the complexity of the UTF8 encoding.
item
- an item with which to query the filterpublic boolean query(byte[] data)
data
- an array with which to query the filterpublic boolean query(char[] data)
data
- an array with which to query the filterpublic boolean query(short[] data)
data
- an array with which to query the filterpublic boolean query(int[] data)
data
- an array with which to query the filterpublic boolean query(long[] data)
data
- an array with which to query the filterpublic boolean query(org.apache.datasketches.memory.Memory mem)
mem
- a Memory array with which to query the filterpublic void union(BloomFilter other)
other
- A BloomFilter to union with this onepublic void intersect(BloomFilter other)
other
- A BloomFilter to union with this onepublic void invert()
public boolean isCompatible(BloomFilter other)
other
- A BloomFilter to check for compatibility with this onepublic long getSerializedSizeBytes()
public static long getSerializedSize(long numBits)
numBits
- The number of bits of to use for size computationpublic byte[] toByteArray()
Note: Method throws if the serialized size exceeds Integer.MAX_VALUE
.
public long[] toLongArray()
toByteArray()
,
this method can handle any size filter.Copyright © 2015–2024 The Apache Software Foundation. All rights reserved.