public class HllSketch extends Object
This implementation offers three different types of HLL sketch, each with different
trade-offs with accuracy, space and performance. These types are specified with the
TgtHllType
parameter.
In terms of accuracy, all three types, for the same lgConfigK, have the same error distribution as a function of n, the number of unique values fed to the sketch. The configuration parameter lgConfigK is the log-base-2 of K, where K is the number of buckets or slots for the sketch.
During warmup, when the sketch has only received a small number of unique items (up to about 10% of K), this implementation leverages a new class of estimator algorithms with significantly better accuracy.
This sketch also offers the capability of operating off-heap. Given a WritableMemory object created by the user, the sketch will perform all of its updates and internal phase transitions in that object, which can actually reside either on-heap or off-heap based on how it is configured. In large systems that must update and merge many millions of sketches, having the sketch operate off-heap avoids the serialization and deserialization costs of moving sketches to and from off-heap memory-mapped files, for example, and eliminates big garbage collection delays.
Modifier and Type | Field and Description |
---|---|
static TgtHllType |
DEFAULT_HLL_TYPE
The default HLL-TYPE is HLL_4
|
static int |
DEFAULT_LG_K
The default Log_base2 of K
|
Constructor and Description |
---|
HllSketch()
Constructs a new on-heap sketch with the default lgConfigK and tgtHllType.
|
HllSketch(int lgConfigK)
Constructs a new on-heap sketch with the default tgtHllType.
|
HllSketch(int lgConfigK,
TgtHllType tgtHllType)
Constructs a new on-heap sketch with the type of HLL sketch to configure.
|
HllSketch(int lgConfigK,
TgtHllType tgtHllType,
org.apache.datasketches.memory.WritableMemory dstMem)
Constructs a new sketch with the type of HLL sketch to configure and the given
WritableMemory as the destination for the sketch.
|
Modifier and Type | Method and Description |
---|---|
HllSketch |
copy()
Return a copy of this sketch onto the Java heap.
|
HllSketch |
copyAs(TgtHllType tgtHllType)
Return a deep copy of this sketch onto the Java heap with the specified TgtHllType.
|
int |
getCompactSerializationBytes()
Gets the size in bytes of the current sketch when serialized using
toCompactByteArray().
|
double |
getCompositeEstimate()
This is less accurate than the
getEstimate() method and is automatically used
when the sketch has gone through union operations where the more accurate HIP estimator
cannot be used. |
double |
getEstimate()
Return the cardinality estimate
|
int |
getLgConfigK()
Gets the lgConfigK.
|
double |
getLowerBound(int numStdDev)
Gets the approximate lower error bound given the specified number of Standard Deviations.
|
static int |
getMaxUpdatableSerializationBytes(int lgConfigK,
TgtHllType tgtHllType)
Returns the maximum size in bytes that this sketch can grow to given lgConfigK.
|
double |
getRelErr(boolean upperBound,
boolean unioned,
int lgConfigK,
int numStdDev)
Gets the current (approximate) Relative Error (RE) asymptotic values given several
parameters.
|
static int |
getSerializationVersion()
Returns the current serialization version.
|
static int |
getSerializationVersion(org.apache.datasketches.memory.Memory mem)
Returns the current serialization version of the given Memory.
|
TgtHllType |
getTgtHllType()
Gets the
TgtHllType |
int |
getUpdatableSerializationBytes()
Gets the size in bytes of the current sketch when serialized using
toUpdatableByteArray().
|
double |
getUpperBound(int numStdDev)
Gets the approximate upper error bound given the specified number of Standard Deviations.
|
static HllSketch |
heapify(byte[] byteArray)
Heapify the given byte array, which must be a valid HllSketch image and may have data.
|
static HllSketch |
heapify(org.apache.datasketches.memory.Memory srcMem)
Heapify the given Memory, which must be a valid HllSketch image and may have data.
|
boolean |
isCompact()
Returns true if the backing memory of this sketch is in compact form.
|
boolean |
isEmpty()
Returns true if empty
|
boolean |
isEstimationMode()
This HLL family of sketches and operators is always estimating, even for very small values.
|
boolean |
isMemory()
Returns true if this sketch was created using Memory.
|
boolean |
isOffHeap()
Returns true if the backing memory for this sketch is off-heap.
|
boolean |
isSameResource(org.apache.datasketches.memory.Memory mem)
Returns true if the given Memory refers to the same underlying resource as this sketch.
|
void |
reset()
Resets to empty, but does not change the configured values of lgConfigK and tgtHllType.
|
byte[] |
toCompactByteArray()
Serializes this sketch as a byte array in compact form.
|
String |
toString()
Human readable summary as a string.
|
String |
toString(boolean summary,
boolean detail,
boolean auxDetail)
Human readable summary with optional detail.
|
String |
toString(boolean summary,
boolean detail,
boolean auxDetail,
boolean all)
Human readable summary with optional detail
|
static String |
toString(byte[] byteArr)
Returns a human readable string of the preamble of a byte array image of an HllSketch.
|
static String |
toString(org.apache.datasketches.memory.Memory mem)
Returns a human readable string of the preamble of a Memory image of an HllSketch.
|
byte[] |
toUpdatableByteArray()
Serializes this sketch as a byte array in an updatable form.
|
void |
update(byte[] data)
Present the given byte array as a potential unique item.
|
void |
update(char[] data)
Present the given char array as a potential unique item.
|
void |
update(double datum)
Present the given double (or float) datum as a potential unique item.
|
void |
update(int[] data)
Present the given integer array as a potential unique item.
|
void |
update(long datum)
Present the given long as a potential unique item.
|
void |
update(long[] data)
Present the given long array as a potential unique item.
|
void |
update(String datum)
Present the given String as a potential unique item.
|
static HllSketch |
wrap(org.apache.datasketches.memory.Memory srcMem)
Wraps the given read-only Memory that must be a image of a valid sketch,
which may be in compact or updatable form, and should have data.
|
static HllSketch |
writableWrap(org.apache.datasketches.memory.WritableMemory srcWmem)
Wraps the given WritableMemory, which must be a image of a valid updatable sketch,
and may have data.
|
public static final int DEFAULT_LG_K
public static final TgtHllType DEFAULT_HLL_TYPE
public HllSketch()
public HllSketch(int lgConfigK)
lgConfigK
- The Log2 of K for the target HLL sketch. This value must be
between 4 and 21 inclusively.public HllSketch(int lgConfigK, TgtHllType tgtHllType)
lgConfigK
- The Log2 of K for the target HLL sketch. This value must be
between 4 and 21 inclusively.tgtHllType
- the desired Hll type.public HllSketch(int lgConfigK, TgtHllType tgtHllType, org.apache.datasketches.memory.WritableMemory dstMem)
The given dstMem is checked for the required capacity as determined by
getMaxUpdatableSerializationBytes(int, TgtHllType)
.
lgConfigK
- The Log2 of K for the target HLL sketch. This value must be
between 4 and 21 inclusively.tgtHllType
- the desired Hll type.dstMem
- the destination memory for the sketch.public static final HllSketch heapify(byte[] byteArray)
byteArray
- the given byte array. This byteArray is not modified and is not retained
by the on-heap sketch.public static final HllSketch heapify(org.apache.datasketches.memory.Memory srcMem)
srcMem
- the given Memory, which is read-only.public static final HllSketch writableWrap(org.apache.datasketches.memory.WritableMemory srcWmem)
The given dstMem is checked for the required capacity as determined by
getMaxUpdatableSerializationBytes(int, TgtHllType)
.
srcWmem
- an writable image of a valid source sketch with data.public static final HllSketch wrap(org.apache.datasketches.memory.Memory srcMem)
srcMem
- a read-only image of a valid source sketch.public HllSketch copy()
public HllSketch copyAs(TgtHllType tgtHllType)
tgtHllType
- the TgtHllType enumpublic double getCompositeEstimate()
getEstimate()
method and is automatically used
when the sketch has gone through union operations where the more accurate HIP estimator
cannot be used.
This is made public only for error characterization software that exists in separate
packages and is not intended for normal use.public double getEstimate()
public int getLgConfigK()
public int getCompactSerializationBytes()
public double getLowerBound(int numStdDev)
numStdDev
- This must be an integer between 1 and 3, inclusive.
See Number of Standard Deviationspublic static final int getMaxUpdatableSerializationBytes(int lgConfigK, TgtHllType tgtHllType)
lgConfigK
- The Log2 of K for the target HLL sketch. This value must be
between 4 and 21 inclusively.tgtHllType
- the desired Hll typepublic TgtHllType getTgtHllType()
TgtHllType
public int getUpdatableSerializationBytes()
public double getUpperBound(int numStdDev)
numStdDev
- This must be an integer between 1 and 3, inclusive.
Number of Standard Deviationspublic boolean isCompact()
public boolean isEmpty()
public boolean isMemory()
public boolean isOffHeap()
public boolean isSameResource(org.apache.datasketches.memory.Memory mem)
This is only relevant for HLL_4 sketches that have been configured for off-heap using WritableMemory or Memory. For on-heap sketches or unions this will return false.
It is rare, but possible, the the off-heap memory that has been allocated to an HLL_4 sketch may not be large enough. If this should happen, the sketch makes a request for more memory from the owner of the resource and then moves itself to this new location. This all happens transparently to the user. This method provides a means for the user to inquire of the sketch if it has, in fact, moved itself.
mem
- the given Memorypublic void reset()
public byte[] toCompactByteArray()
Union union; HllSketch sk, sk2;
int lgK = 12;
sk = new HllSketch(lgK, TgtHllType.HLL_4); //can be 4, 6, or 8
for (int i = 0; i < (2 << lgK); i++) { sk.update(i); }
byte[] arr = HllSketch.toCompactByteArray();
//...
union = Union.heapify(arr); //initializes the union using data from the array.
//OR, if used in an off-heap environment:
union = Union.heapify(Memory.wrap(arr)); //same as above, except from Memory object.
//To recover an updatable heap sketch:
sk2 = HllSketch.heapify(arr);
//OR, if used in an off-heap environment:
sk2 = HllSketch.heapify(Memory.wrap(arr));
The sketch "wrapping" operation skips actual deserialization thus is quite fast. However, any attempt to update the derived HllSketch will result in a Read-only exception.
Note that in some cases, based on the state of the sketch, the compact form is indistiguishable from the updatable form. In these cases the updatable form is returned and the compact flag bit will not be set.
public byte[] toUpdatableByteArray()
Union union; HllSketch sk;
int lgK = 12;
sk = new HllSketch(lgK, TgtHllType.HLL_8) //must be 8
for (int i = 0; i < (2 << lgK); i++) { sk.update(i); }
byte[] arr = sk.toUpdatableByteArray();
WritableMemory wmem = WritableMemory.wrap(arr);
//...
union = Union.writableWrap(wmem); //no deserialization!
public String toString(boolean summary, boolean detail, boolean auxDetail, boolean all)
summary
- if true, output the sketch summarydetail
- if true, output the internal data arrayauxDetail
- if true, output the internal Aux array, if it exists.all
- if true, outputs all entries including empty onespublic static String toString(byte[] byteArr)
byteArr
- the given byte arraypublic static String toString(org.apache.datasketches.memory.Memory mem)
mem
- the given Memory objectpublic static final int getSerializationVersion()
public static final int getSerializationVersion(org.apache.datasketches.memory.Memory mem)
mem
- the given Memory containing a serialized HllSketch image.public double getRelErr(boolean upperBound, boolean unioned, int lgConfigK, int numStdDev)
upperBound
- return the RE for the Upper Bound, otherwise for the Lower Bound.unioned
- set true if the sketch is the result of a union operation.lgConfigK
- the configured value for the sketch.numStdDev
- the given number of Standard Deviations. This must be an integer between
1 and 3, inclusive.
Number of Standard Deviationspublic boolean isEstimationMode()
public String toString()
public String toString(boolean summary, boolean detail, boolean auxDetail)
summary
- if true, output the sketch summarydetail
- if true, output the internal data arrayauxDetail
- if true, output the internal Aux array, if it exists.public void update(long datum)
datum
- The given long datum.public void update(double datum)
datum
- The given double datum.public void update(String datum)
Note: About 2X faster performance can be obtained by first converting the String to a char[] and updating the sketch with that. This bypasses the complexity of the Java UTF_8 encoding. This, of course, will not produce the same internal hash values as updating directly with a String. So be consistent! Unioning two sketches, one fed with strings and the other fed with char[] will be meaningless.
datum
- The given String.public void update(byte[] data)
data
- The given byte array.public void update(char[] data)
Note: this will not produce the same output hash values as the update(String)
method but will be a little faster as it avoids the complexity of the UTF8 encoding.
data
- The given char array.public void update(int[] data)
data
- The given int array.public void update(long[] data)
data
- The given long array.Copyright © 2015–2021 The Apache Software Foundation. All rights reserved.