public final class CpcSketch extends Object
This sketch is extremely space-efficient when serialized. In an apples-to-apples empirical comparison against compressed HyperLogLog sketches, this new algorithm simultaneously wins on the two dimensions of the space/accuracy tradeoff and produces sketches that are smaller than the entropy of HLL, so no possible implementation of compressed HLL can match its space efficiency for a given accuracy. As described in the paper, this sketch implements a newly developed ICON estimator algorithm that survives unioning operations; another well-known estimator, the Historical Inverse Probability (HIP) estimator, does not. The update speed of this sketch is comparable to that of HLL. The unioning (merging) capability of this sketch also allows sketches with different configurations of K to be merged.
For additional security this sketch can be configured with a user-specified hash seed.
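As an illustration of merging sketches with different K, the following minimal sketch builds two sketches with different lgK values over overlapping ranges and merges them. It assumes the DataSketches library is on the classpath and uses `CpcUnion` from the same package, which is not documented on this page; the class name `CpcUnionExample`, the lgK values, and the stream sizes are arbitrary choices for the example.

```java
import org.apache.datasketches.cpc.CpcSketch;
import org.apache.datasketches.cpc.CpcUnion;

public class CpcUnionExample {

  /** Merges two sketches built with different lgK and returns the merged estimate. */
  static double estimateUnion() {
    CpcSketch a = new CpcSketch(12); // K = 4096
    CpcSketch b = new CpcSketch(10); // K = 1024

    // The two streams overlap on 50,000 items, so the true distinct count is 150,000.
    for (long i = 0; i < 100_000; i++) { a.update(i); }
    for (long i = 50_000; i < 150_000; i++) { b.update(i); }

    // CpcUnion is assumed here; it accepts sketches whose lgK differs from its own.
    CpcUnion union = new CpcUnion(11);
    union.update(a);
    union.update(b);

    CpcSketch merged = union.getResult();
    return merged.getEstimate();
  }

  public static void main(String[] args) {
    System.out.println("Estimated distinct count: " + estimateUnion());
  }
}
```

The merged estimate should land close to 150,000, with the accuracy governed by the smallest K involved. Sketches built with a non-default seed can only be merged with sketches built with the same seed.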
Modifier and Type | Field and Description |
---|---|
static int | DEFAULT_LG_K: The default Log_base2 of K |
Constructor and Description |
---|
CpcSketch(): Constructor with default log_base2 of k. |
CpcSketch(int lgK): Constructor with log_base2 of k. |
CpcSketch(int lgK, long seed): Constructor with log_base2 of k and seed. |
Modifier and Type | Method and Description |
---|---|
double | getEstimate(): Returns the best estimate of the cardinality of the sketch. |
static Family | getFamily(): Return the DataSketches identifier for this CPC family of sketches. |
int | getLgK(): Return the parameter LgK. |
double | getLowerBound(int kappa): Returns the best estimate of the lower bound of the confidence interval given kappa, the number of standard deviations from the mean. |
static int | getMaxSerializedBytes(int lgK): The actual size of a compressed CPC sketch has a small random variance, but the following empirically measured size should be large enough for at least 99.9 percent of sketches. |
double | getUpperBound(int kappa): Returns the best estimate of the upper bound of the confidence interval given kappa, the number of standard deviations from the mean. |
static CpcSketch | heapify(byte[] byteArray): Return the given byte array as a CpcSketch on the Java heap using the DEFAULT_UPDATE_SEED. |
static CpcSketch | heapify(byte[] byteArray, long seed): Return the given byte array as a CpcSketch on the Java heap. |
static CpcSketch | heapify(org.apache.datasketches.memory.Memory mem): Return the given Memory as a CpcSketch on the Java heap using the DEFAULT_UPDATE_SEED. |
static CpcSketch | heapify(org.apache.datasketches.memory.Memory mem, long seed): Return the given Memory as a CpcSketch on the Java heap. |
boolean | isEmpty(): Return true if this sketch is empty. |
void | reset(): Resets this sketch to empty but retains the original LgK and Seed. |
byte[] | toByteArray(): Return this sketch as a compressed byte array. |
String | toString(): Return a human-readable string summary of this sketch. |
String | toString(boolean detail): Return a human-readable string summary of this sketch. |
static String | toString(byte[] byteArr, boolean detail): Returns a human-readable string of the preamble of a byte array image of a CpcSketch. |
static String | toString(org.apache.datasketches.memory.Memory mem, boolean detail): Returns a human-readable string of the preamble of a Memory image of a CpcSketch. |
void | update(byte[] data): Present the given byte array as a potential unique item. |
void | update(ByteBuffer data): Present the given ByteBuffer as a potential unique item. If the ByteBuffer is null or empty, no update attempt is made and the method returns. |
void | update(char[] data): Present the given char array as a potential unique item. |
void | update(double datum): Present the given double (or float) datum as a potential unique item. |
void | update(int[] data): Present the given integer array as a potential unique item. |
void | update(long datum): Present the given long as a potential unique item. |
void | update(long[] data): Present the given long array as a potential unique item. |
void | update(String datum): Present the given String as a potential unique item. |
boolean | validate(): Convenience function that checks that this sketch is valid. |
public static final int DEFAULT_LG_K
public CpcSketch()
public CpcSketch(int lgK)
lgK - the given log_base2 of k
public CpcSketch(int lgK, long seed)
lgK - the given log_base2 of k
seed - the given seed
public double getEstimate()
public static Family getFamily()
public int getLgK()
public double getLowerBound(int kappa)
kappa - the given number of standard deviations from the mean: 1, 2 or 3.
public static int getMaxSerializedBytes(int lgK)
For small values of n the size can be much smaller.
lgK - the given value of lgK.
public double getUpperBound(int kappa)
kappa - the given number of standard deviations from the mean: 1, 2 or 3.
public static CpcSketch heapify(org.apache.datasketches.memory.Memory mem)
mem - the given Memory
public static CpcSketch heapify(byte[] byteArray)
byteArray - the given byte array
public static CpcSketch heapify(org.apache.datasketches.memory.Memory mem, long seed)
mem - the given Memory
seed - the seed used to create the original sketch from which the Memory was derived.
public static CpcSketch heapify(byte[] byteArray, long seed)
byteArray - the given byte array
seed - the seed used to create the original sketch from which the byte array was derived.
public boolean isEmpty()
public final void reset()
public byte[] toByteArray()
public void update(long datum)
datum - The given long datum.
public void update(double datum)
datum - The given double datum.
public void update(String datum)
Note: About 2X faster performance can be obtained by first converting the String to a char[] and updating the sketch with that. This bypasses the complexity of the Java UTF_8 encoding. This, of course, will not produce the same internal hash values as updating directly with a String. So be consistent! Unioning two sketches, one fed with strings and the other fed with char[] will be meaningless.
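To see why the two update paths cannot be mixed, it helps to compare the input each one would present to the hash function. The following sketch uses only the JDK and does not call the sketch itself; it assumes, as the note implies, that the String path hashes the UTF-8 encoding while the char[] path hashes the raw 16-bit chars. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class StringVsCharArrayInput {

  /** The bytes a UTF-8 based String update would hash (assumption, per the note above). */
  static byte[] utf8Input(String s) {
    return s.getBytes(StandardCharsets.UTF_8);
  }

  /** The chars a char[] update would hash directly, with no encoding step. */
  static char[] charInput(String s) {
    return s.toCharArray();
  }

  public static void main(String[] args) {
    String datum = "h\u00e9llo"; // contains one non-ASCII character

    byte[] utf8 = utf8Input(datum);
    char[] chars = charInput(datum);

    // Different lengths and different contents mean different hash inputs
    // for the same logical item.
    System.out.println("UTF-8 bytes: " + utf8.length);
    System.out.println("chars: " + chars.length
        + " (" + chars.length * Character.BYTES + " bytes)");
  }
}
```

Because the same logical item yields two different hash inputs, a sketch fed with strings and a sketch fed with char arrays would treat it as two distinct items when merged.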
datum - The given String.
public void update(byte[] data)
data - The given byte array.
public void update(ByteBuffer data)
data - The given ByteBuffer
public void update(char[] data)
Note: this will not produce the same output hash values as the update(String) method but will be a little faster as it avoids the complexity of the UTF8 encoding.
data - The given char array.
public void update(int[] data)
data - The given int array.
public void update(long[] data)
data - The given long array.
public boolean validate()
If you are starting with a serialized image as a byte array, first heapify the byte array to a sketch, which performs a number of checks. Then use this function as one additional check on the sketch.
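The heapify-then-validate sequence described above can be sketched as a serialization round trip. This is a minimal illustration that assumes the DataSketches library is on the classpath and that the sketch was built with the default seed; the class name `CpcRoundTrip`, the helper `restore`, and the chosen lgK and stream size are hypothetical.

```java
import org.apache.datasketches.cpc.CpcSketch;

public class CpcRoundTrip {

  /** Heapifies a serialized image and applies validate() as an additional check. */
  static CpcSketch restore(byte[] image) {
    // heapify performs its own checks and assumes the default seed here.
    CpcSketch restored = CpcSketch.heapify(image);
    if (!restored.validate()) {
      throw new IllegalStateException("Restored sketch failed validation");
    }
    return restored;
  }

  /** Builds a small sketch over n distinct longs. */
  static CpcSketch build(int lgK, long n) {
    CpcSketch sketch = new CpcSketch(lgK);
    for (long i = 0; i < n; i++) { sketch.update(i); }
    return sketch;
  }

  public static void main(String[] args) {
    int lgK = 11;
    CpcSketch original = build(lgK, 10_000);

    byte[] image = original.toByteArray();
    CpcSketch restored = restore(image);

    System.out.println("Serialized size: " + image.length + " bytes");
    System.out.println("Size bound for lgK: " + CpcSketch.getMaxSerializedBytes(lgK));
    System.out.println("Estimate after round trip: " + restored.getEstimate());
    System.out.println("Bounds at kappa = 2: [" + restored.getLowerBound(2)
        + ", " + restored.getUpperBound(2) + "]");
  }
}
```

A sketch created with a custom seed would need the two-argument `heapify(byte[] byteArray, long seed)` with that same seed.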
public String toString()
public String toString(boolean detail)
detail - include data detail
public static String toString(byte[] byteArr, boolean detail)
byteArr - the given byte array
detail - if true, a dump of the compressed window and surprising value streams will be included.
public static String toString(org.apache.datasketches.memory.Memory mem, boolean detail)
mem - the given Memory
detail - if true, a dump of the compressed window and surprising value streams will be included.
Copyright © 2015–2024 The Apache Software Foundation. All rights reserved.