public class KllFloatsSketch extends Object
This is a stochastic streaming sketch that enables near-real time analysis of the approximate distribution of values from a very large stream in a single pass, requiring only that the values are comparable. The analysis is obtained using getQuantile() or getQuantiles() functions or the inverse functions getRank(), getPMF() (Probability Mass Function), and getCDF() (Cumulative Distribution Function).
Given an input stream of N numeric values, the absolute rank of any specific value is defined as its index (0 to N-1) in the hypothetical sorted stream of all N input values.
The normalized rank (rank) of any specific value is defined as its absolute rank divided by N. Thus, the normalized rank is a value between zero and one. In the documentation and Javadocs for this sketch absolute rank is never used so any reference to just rank should be interpreted to mean normalized rank.
This sketch is configured with a parameter k, which affects the size of the sketch and its estimation error.
The estimation error is commonly called epsilon (or eps) and is a fraction between zero and one. Larger values of k result in smaller values of epsilon. Epsilon is always with respect to the rank and cannot be applied to the corresponding values.
The relationship between the normalized rank and the corresponding values can be viewed as a two dimensional monotonic plot with the normalized rank on one axis and the corresponding values on the other axis. If the y-axis is specified as the value-axis and the x-axis as the normalized rank, then y = getQuantile(x) is a monotonically increasing function.
The functions getQuantile(rank) and getQuantiles(...) translate ranks into corresponding values. The functions getRank(value), getCDF(...) (Cumulative Distribution Function), and getPMF(...) (Probability Mass Function) perform the opposite operation and translate values into ranks.
The getPMF(...) function has about 13 to 47% worse rank error (depending on k) than the other queries because the mass of each "bin" of the PMF has "double-sided" error from the upper and lower edges of the bin as a result of a subtraction, as the errors from the two edges can sometimes add.
The default k of 200 yields a "single-sided" epsilon of about 1.33% and a "double-sided" (PMF) epsilon of about 1.65%.
A getQuantile(rank) query has the following guarantees:
A getRank(value) query has the following guarantees:
A getPMF(...) query has the following guarantees:
A getCDF(...) query has the following guarantees;
From the above, it might seem like we could make some estimates to bound the value returned from a call to getQuantile(). The sketch, however, does not let us derive error bounds or confidences around values. Because errors are independent, we can approximately bracket a value as shown below, but there are no error estimates available. Additionally, the interval may be quite large for certain distributions.
Modifier and Type | Field and Description |
---|---|
static int |
DEFAULT_K
The default value of K.
|
Constructor and Description |
---|
KllFloatsSketch()
Heap constructor with the default k = 200, which has a rank error of about 1.65%.
|
KllFloatsSketch(int k)
Heap constructor with a given parameter k.
|
Modifier and Type | Method and Description |
---|---|
double[] |
getCDF(float[] splitPoints)
Returns an approximation to the Cumulative Distribution Function (CDF), which is the
cumulative analog of the PMF, of the input stream given a set of splitPoint (values).
|
int |
getK()
Returns the parameter k
|
static int |
getKFromEpsilon(double epsilon,
boolean pmf)
Gets the approximate value of k to use given epsilon, the normalized rank error.
|
static int |
getMaxSerializedSizeBytes(int k,
long n)
Returns upper bound on the serialized size of a sketch given a parameter k and stream
length.
|
float |
getMaxValue()
Returns the max value of the stream.
|
float |
getMinValue()
Returns the min value of the stream.
|
long |
getN()
Returns the length of the input stream.
|
double |
getNormalizedRankError()
Deprecated.
v2.0.0. Replaced by
getNormalizedRankError(boolean) |
double |
getNormalizedRankError(boolean pmf)
Gets the approximate rank error of this sketch normalized as a fraction between zero and one.
|
static double |
getNormalizedRankError(int k)
Deprecated.
v2.0.0. Replaced by
getNormalizedRankError(int, boolean) |
static double |
getNormalizedRankError(int k,
boolean pmf)
Gets the normalized rank error given k and pmf.
|
int |
getNumRetained()
Returns the number of retained items (samples) in the sketch.
|
double[] |
getPMF(float[] splitPoints)
Returns an approximation to the Probability Mass Function (PMF) of the input stream
given a set of splitPoints (values).
|
float |
getQuantile(double fraction)
Returns an approximation to the value of the data item
that would be preceded by the given fraction of a hypothetical sorted
version of the input stream so far.
|
float |
getQuantileLowerBound(double fraction)
Gets the lower bound of the value interval in which the true quantile of the given rank
exists with a confidence of at least 99%.
|
float[] |
getQuantiles(double[] fractions)
This is a more efficient multiple-query version of getQuantile().
|
float[] |
getQuantiles(int numEvenlySpaced)
This is also a more efficient multiple-query version of getQuantile() and allows the caller to
specify the number of evenly spaced fractional ranks.
|
float |
getQuantileUpperBound(double fraction)
Gets the upper bound of the value interval in which the true quantile of the given rank
exists with a confidence of at least 99%.
|
double |
getRank(float value)
Returns an approximation to the normalized (fractional) rank of the given value from 0 to 1,
inclusive.
|
int |
getSerializedSizeBytes()
Returns the number of bytes this sketch would require to store.
|
static KllFloatsSketch |
heapify(org.apache.datasketches.memory.Memory mem)
Factory heapify takes the sketch image in Memory and instantiates an on-heap sketch.
|
boolean |
isEmpty()
Returns true if this sketch is empty.
|
boolean |
isEstimationMode()
Returns true if this sketch is in estimation mode.
|
KllFloatsSketchIterator |
iterator() |
void |
merge(KllFloatsSketch other)
Merges another sketch into this one.
|
byte[] |
toByteArray()
Returns serialized sketch in a byte array form.
|
String |
toString() |
String |
toString(boolean withLevels,
boolean withData)
Returns a summary of the sketch as a string.
|
void |
update(float value)
Updates this sketch with the given data item.
|
public static final int DEFAULT_K
public KllFloatsSketch()
public KllFloatsSketch(int k)
k
- parameter that controls size of the sketch and accuracy of estimatespublic static KllFloatsSketch heapify(org.apache.datasketches.memory.Memory mem)
mem
- a Memory image of a sketch serialized by this sketch.
See Memorypublic double[] getCDF(float[] splitPoints)
The resulting approximations have a probabilistic guarantee that can be obtained from the getNormalizedRankError(false) function.
If the sketch is empty this returns null.
splitPoints
- an array of m unique, monotonically increasing float values
that divide the real number line into m+1 consecutive disjoint intervals.
The definition of an "interval" is inclusive of the left splitPoint (or minimum value) and
exclusive of the right splitPoint, with the exception that the last interval will include
the maximum value.
It is not necessary to include either the min or max values in these split points.public int getK()
public static int getKFromEpsilon(double epsilon, boolean pmf)
epsilon
- the normalized rank error between zero and one.pmf
- if true, this function returns the value of k assuming the input epsilon
is the desired "double-sided" epsilon for the getPMF() function. Otherwise, this function
returns the value of k assuming the input epsilon is the desired "single-sided"
epsilon for all the other queries.KllFloatsSketch
public float getMaxValue()
public float getMinValue()
public long getN()
@Deprecated public double getNormalizedRankError()
getNormalizedRankError(boolean)
KllFloatsSketch
public double getNormalizedRankError(boolean pmf)
pmf
- if true, returns the "double-sided" normalized rank error for the getPMF() function.
Otherwise, it is the "single-sided" normalized rank error for all the other queries.KllFloatsSketch
@Deprecated public static double getNormalizedRankError(int k)
getNormalizedRankError(int, boolean)
getNormalizedRankError()
that
specifies k.k
- the configuration parameterKllFloatsSketch
public static double getNormalizedRankError(int k, boolean pmf)
getNormalizedRankError(boolean)
.k
- the configuation parameterpmf
- if true, returns the "double-sided" normalized rank error for the getPMF() function.
Otherwise, it is the "single-sided" normalized rank error for all the other queries.KllFloatsSketch
public int getNumRetained()
public static int getMaxSerializedSizeBytes(int k, long n)
k
- parameter that controls size of the sketch and accuracy of estimatesn
- stream lengthpublic double[] getPMF(float[] splitPoints)
The resulting approximations have a probabilistic guarantee that can be obtained from the getNormalizedRankError(true) function.
If the sketch is empty this returns null.
splitPoints
- an array of m unique, monotonically increasing float values
that divide the real number line into m+1 consecutive disjoint intervals.
The definition of an "interval" is inclusive of the left splitPoint (or minimum value) and
exclusive of the right splitPoint, with the exception that the last interval will include
the maximum value.
It is not necessary to include either the min or max values in these split points.public float getQuantile(double fraction)
We note that this method has a fairly large overhead (microseconds instead of nanoseconds) so it should not be called multiple times to get different quantiles from the same sketch. Instead use getQuantiles(), which pays the overhead only once.
If the sketch is empty this returns NaN.
fraction
- the specified fractional position in the hypothetical sorted stream.
These are also called normalized ranks or fractional ranks.
If fraction = 0.0, the true minimum value of the stream is returned.
If fraction = 1.0, the true maximum value of the stream is returned.public float getQuantileUpperBound(double fraction)
fraction
- the given normalized rank as a fractionpublic float getQuantileLowerBound(double fraction)
fraction
- the given normalized rank as a fractionpublic float[] getQuantiles(double[] fractions)
This returns an array that could have been generated by using getQuantile() with many different fractional ranks, but would be very inefficient. This method incurs the internal set-up overhead once and obtains multiple quantile values in a single query. It is strongly recommend that this method be used instead of multiple calls to getQuantile().
If the sketch is empty this returns null.
fractions
- given array of fractional positions in the hypothetical sorted stream.
These are also called normalized ranks or fractional ranks.
These fractions must be in the interval [0.0, 1.0], inclusive.public float[] getQuantiles(int numEvenlySpaced)
If the sketch is empty this returns null.
numEvenlySpaced
- an integer that specifies the number of evenly spaced fractional ranks.
This must be a positive integer greater than 0. A value of 1 will return the min value.
A value of 2 will return the min and the max value. A value of 3 will return the min,
the median and the max value, etc.public double getRank(float value)
The resulting approximation has a probabilistic guarantee that can be obtained from the getNormalizedRankError(false) function.
If the sketch is empty this returns NaN.
value
- to be rankedpublic int getSerializedSizeBytes()
public boolean isEmpty()
public boolean isEstimationMode()
public KllFloatsSketchIterator iterator()
public void merge(KllFloatsSketch other)
other
- sketch to merge into this onepublic byte[] toByteArray()
public String toString(boolean withLevels, boolean withData)
withLevels
- if true include information about levelswithData
- if true include sketch datapublic void update(float value)
value
- an item from a stream of items. NaNs are ignored.Copyright © 2015–2021 The Apache Software Foundation. All rights reserved.