public abstract class DoublesSketch extends Object
Consider a large stream of one million values such as packet sizes coming into a network node. The absolute rank of any specific size value is simply its index in the hypothetical sorted array of values. The normalized rank (or fractional rank) is the absolute rank divided by the stream size, in this case one million. The value corresponding to the normalized rank of 0.5 represents the 50th percentile or median value of the distribution, or getQuantile(0.5). Similarly, the 95th percentile is obtained from getQuantile(0.95). Using the getQuantiles(0.0, 1.0) will return the min and max values seen by the sketch.
From the min and max values, for example, 1 and 1000 bytes, you can obtain the PMF from getPMF(100, 500, 900) that will result in an array of 4 fractional values such as {.4, .3, .2, .1}, which means that
The accuracy of this sketch is a function of the configured value k, which also affects the overall size of the sketch. Accuracy of this quantile sketch is always with respect to the normalized rank. A k of 128 produces a normalized, rank error of about 1.7%. For example, the median value returned from getQuantile(0.5) will be between the actual values from the hypothetically sorted array of input values at normalized ranks of 0.483 and 0.517, with a confidence of about 99%.
Table Guide for DoublesSketch Size in Bytes and Approximate Error: K => | 16 32 64 128 256 512 1,024 ~ Error => | 12.145% 6.359% 3.317% 1.725% 0.894% 0.463% 0.239% N | Size in Bytes -> ------------------------------------------------------------------------ 0 | 8 8 8 8 8 8 8 1 | 72 72 72 72 72 72 72 3 | 72 72 72 72 72 72 72 7 | 104 104 104 104 104 104 104 15 | 168 168 168 168 168 168 168 31 | 296 296 296 296 296 296 296 63 | 424 552 552 552 552 552 552 127 | 552 808 1,064 1,064 1,064 1,064 1,064 255 | 680 1,064 1,576 2,088 2,088 2,088 2,088 511 | 808 1,320 2,088 3,112 4,136 4,136 4,136 1,023 | 936 1,576 2,600 4,136 6,184 8,232 8,232 2,047 | 1,064 1,832 3,112 5,160 8,232 12,328 16,424 4,095 | 1,192 2,088 3,624 6,184 10,280 16,424 24,616 8,191 | 1,320 2,344 4,136 7,208 12,328 20,520 32,808 16,383 | 1,448 2,600 4,648 8,232 14,376 24,616 41,000 32,767 | 1,576 2,856 5,160 9,256 16,424 28,712 49,192 65,535 | 1,704 3,112 5,672 10,280 18,472 32,808 57,384 131,071 | 1,832 3,368 6,184 11,304 20,520 36,904 65,576 262,143 | 1,960 3,624 6,696 12,328 22,568 41,000 73,768 524,287 | 2,088 3,880 7,208 13,352 24,616 45,096 81,960 1,048,575 | 2,216 4,136 7,720 14,376 26,664 49,192 90,152 2,097,151 | 2,344 4,392 8,232 15,400 28,712 53,288 98,344 4,194,303 | 2,472 4,648 8,744 16,424 30,760 57,384 106,536 8,388,607 | 2,600 4,904 9,256 17,448 32,808 61,480 114,728 16,777,215 | 2,728 5,160 9,768 18,472 34,856 65,576 122,920 33,554,431 | 2,856 5,416 10,280 19,496 36,904 69,672 131,112 67,108,863 | 2,984 5,672 10,792 20,520 38,952 73,768 139,304 134,217,727 | 3,112 5,928 11,304 21,544 41,000 77,864 147,496 268,435,455 | 3,240 6,184 11,816 22,568 43,048 81,960 155,688 536,870,911 | 3,368 6,440 12,328 23,592 45,096 86,056 163,880 1,073,741,823 | 3,496 6,696 12,840 24,616 47,144 90,152 172,072 2,147,483,647 | 3,624 6,952 13,352 25,640 49,192 94,248 180,264 4,294,967,295 | 3,752 7,208 13,864 26,664 51,240 98,344 188,456
There is more documentation available on datasketches.apache.org.
This is an implementation of the Low Discrepancy Mergeable Quantiles Sketch, using double values, described in section 3.2 of the journal version of the paper "Mergeable Summaries" by Agarwal, Cormode, Huang, Phillips, Wei, and Yi.
This algorithm is independent of the distribution of values, which can be anywhere in the range of the IEEE-754 64-bit doubles.
This algorithm intentionally inserts randomness into the sampling process for values that ultimately get retained in the sketch. The results produced by this algorithm are not deterministic. For example, if the same stream is inserted into two different instances of this sketch, the answers obtained from the two sketches may not be be identical.
Similarly, there may be directional inconsistencies. For example, the resulting array of values obtained from getQuantiles(fractions[]) input into the reverse directional query getPMF(splitPoints[]) may not result in the original fractional values.
Modifier and Type | Method and Description |
---|---|
static DoublesSketchBuilder |
builder()
Returns a new builder
|
DoublesSketch |
downSample(DoublesSketch srcSketch,
int smallerK,
org.apache.datasketches.memory.WritableMemory dstMem)
From an source sketch, create a new sketch that must have a smaller value of K.
|
double[] |
getCDF(double[] splitPoints)
Returns an approximation to the Cumulative Distribution Function (CDF), which is the
cumulative analog of the PMF, of the input stream given a set of splitPoint (values).
|
int |
getCompactStorageBytes()
Returns the number of bytes this sketch would require to store in compact form, which is not
updatable.
|
static int |
getCompactStorageBytes(int k,
long n)
Returns the number of bytes a DoublesSketch would require to store in compact form
given the values of k and n.
|
int |
getK()
Returns the configured value of K
|
static int |
getKFromEpsilon(double epsilon,
boolean pmf)
Gets the approximate value of k to use given epsilon, the normalized rank error.
|
abstract double |
getMaxValue()
Returns the max value of the stream.
|
abstract double |
getMinValue()
Returns the min value of the stream.
|
abstract long |
getN()
Returns the length of the input stream so far.
|
double |
getNormalizedRankError()
Deprecated.
v2.0.0. Replaced by
getNormalizedRankError(boolean) |
double |
getNormalizedRankError(boolean pmf)
Gets the approximate rank error of this sketch normalized as a fraction between zero and one.
|
static double |
getNormalizedRankError(int k)
Deprecated.
v2.0.0. Replaced by
getNormalizedRankError(int, boolean) |
static double |
getNormalizedRankError(int k,
boolean pmf)
Gets the normalized rank error given k and pmf.
|
double[] |
getPMF(double[] splitPoints)
Returns an approximation to the Probability Mass Function (PMF) of the input stream
given a set of splitPoints (values).
|
double |
getQuantile(double fraction)
This returns an approximation to the value of the data item
that would be preceded by the given fraction of a hypothetical sorted
version of the input stream so far.
|
double |
getQuantileLowerBound(double fraction)
Gets the lower bound of the value interval in which the true quantile of the given rank
exists with a confidence of at least 99%.
|
double[] |
getQuantiles(double[] fRanks)
This is a more efficient multiple-query version of getQuantile().
|
double[] |
getQuantiles(int evenlySpaced)
This is also a more efficient multiple-query version of getQuantile() and allows the caller to
specify the number of evenly spaced fractional ranks.
|
double |
getQuantileUpperBound(double fraction)
Gets the upper bound of the value interval in which the true quantile of the given rank
exists with a confidence of at least 99%.
|
double |
getRank(double value)
Returns an approximation to the normalized (fractional) rank of the given value from 0 to 1
inclusive.
|
int |
getRetainedItems()
Computes the number of retained items (samples) in the sketch
|
int |
getStorageBytes()
Returns the number of bytes this sketch would require to store in native form: compact for
a CompactDoublesSketch, non-compact for an UpdateDoublesSketch.
|
int |
getUpdatableStorageBytes()
Returns the number of bytes this sketch would require to store in updatable form.
|
static int |
getUpdatableStorageBytes(int k,
long n)
Returns the number of bytes a sketch would require to store in updatable form.
|
static DoublesSketch |
heapify(org.apache.datasketches.memory.Memory srcMem)
Heapify takes the sketch image in Memory and instantiates an on-heap Sketch.
|
abstract boolean |
isDirect()
Returns true if this sketch is direct
|
boolean |
isEmpty()
Returns true if this sketch is empty
|
boolean |
isEstimationMode()
Returns true if this sketch is in estimation mode.
|
boolean |
isSameResource(org.apache.datasketches.memory.Memory that)
Returns true if the backing resource of this is identical with the backing resource
of that.
|
DoublesSketchIterator |
iterator() |
void |
putMemory(org.apache.datasketches.memory.WritableMemory dstMem)
Puts the current sketch into the given Memory in compact form if there is sufficient space,
otherwise, it throws an error.
|
void |
putMemory(org.apache.datasketches.memory.WritableMemory dstMem,
boolean compact)
Puts the current sketch into the given Memory if there is sufficient space, otherwise,
throws an error.
|
byte[] |
toByteArray()
Serialize this sketch to a byte array.
|
byte[] |
toByteArray(boolean compact)
Serialize this sketch in a byte array form.
|
String |
toString()
Returns summary information about this sketch.
|
String |
toString(boolean sketchSummary,
boolean dataDetail)
Returns summary information about this sketch.
|
static String |
toString(byte[] byteArr)
Returns a human readable string of the preamble of a byte array image of a DoublesSketch.
|
static String |
toString(org.apache.datasketches.memory.Memory mem)
Returns a human readable string of the preamble of a Memory image of a DoublesSketch.
|
static DoublesSketch |
wrap(org.apache.datasketches.memory.Memory srcMem)
Wrap this sketch around the given Memory image of a DoublesSketch, compact or non-compact.
|
public static final DoublesSketchBuilder builder()
public static DoublesSketch heapify(org.apache.datasketches.memory.Memory srcMem)
srcMem
- a Memory image of a Sketch.
See Memorypublic static DoublesSketch wrap(org.apache.datasketches.memory.Memory srcMem)
srcMem
- the given Memory image of a DoublesSketch that may have data,public double getQuantile(double fraction)
We note that this method has a fairly large overhead (microseconds instead of nanoseconds) so it should not be called multiple times to get different quantiles from the same sketch. Instead use getQuantiles(), which pays the overhead only once.
If the sketch is empty this returns Double.NaN.
fraction
- the specified fractional position in the hypothetical sorted stream.
These are also called normalized ranks or fractional ranks.
If fraction = 0.0, the true minimum value of the stream is returned.
If fraction = 1.0, the true maximum value of the stream is returned.public double getQuantileUpperBound(double fraction)
fraction
- the given normalized rank as a fractionpublic double getQuantileLowerBound(double fraction)
fraction
- the given normalized rank as a fractionpublic double[] getQuantiles(double[] fRanks)
This returns an array that could have been generated by using getQuantile() with many different fractional ranks, but would be very inefficient. This method incurs the internal set-up overhead once and obtains multiple quantile values in a single query. It is strongly recommend that this method be used instead of multiple calls to getQuantile().
If the sketch is empty this returns null.
fRanks
- the given array of fractional (or normalized) ranks in the hypothetical
sorted stream of all the input values seen so far.
These fRanks must all be in the interval [0.0, 1.0] inclusively.public double[] getQuantiles(int evenlySpaced)
If the sketch is empty this returns null.
evenlySpaced
- an integer that specifies the number of evenly spaced fractional ranks.
This must be a positive integer greater than 1.
A value of 2 will return the min and the max value. A value of 3 will return the min,
the median and the max value, etc.public double getRank(double value)
The resulting approximation has a probabilistic guarantee that be obtained from the getNormalizedRankError(false) function.
If the sketch is empty this returns NaN.
value
- to be rankedpublic double[] getPMF(double[] splitPoints)
The resulting approximations have a probabilistic guarantee that be obtained from the getNormalizedRankError(true) function.
If the sketch is empty this returns null.
splitPoints
- an array of m unique, monotonically increasing double values
that divide the real number line into m+1 consecutive disjoint intervals.
The definition of an "interval" is inclusive of the left splitPoint (or minimum value) and
exclusive of the right splitPoint, with the exception that the last interval will include
the maximum value.
It is not necessary to include either the min or max values in these splitpoints.public double[] getCDF(double[] splitPoints)
The resulting approximations have a probabilistic guarantee that be obtained from the getNormalizedRankError(false) function.
If the sketch is empty this returns null.
splitPoints
- an array of m unique, monotonically increasing double values
that divide the real number line into m+1 consecutive disjoint intervals.
The definition of an "interval" is inclusive of the left splitPoint (or minimum value) and
exclusive of the right splitPoint, with the exception that the last interval will include
the maximum value.
It is not necessary to include either the min or max values in these splitpoints.public int getK()
public abstract double getMinValue()
public abstract double getMaxValue()
public abstract long getN()
@Deprecated public double getNormalizedRankError()
getNormalizedRankError(boolean)
Suppose the sketch is presented with N values. The raw rank (0 to N-1) of an item would be its index position in the sorted version of the input stream. If we divide the raw rank by N, it becomes the normalized rank, which is between 0 and 1.0.
For example, choosing a K of 256 yields a normalized rank error of less than 1%. The upper bound on the median value obtained by getQuantile(0.5) would be the value in the hypothetical ordered stream of values at the normalized rank of 0.51. The lower bound would be the value in the hypothetical ordered stream of values at the normalized rank of 0.49.
The error of this sketch cannot be translated into an error (relative or absolute) of the returned quantile values.
public double getNormalizedRankError(boolean pmf)
pmf
- if true, returns the "double-sided" normalized rank error for the getPMF() function.
Otherwise, it is the "single-sided" normalized rank error for all the other queries.@Deprecated public static double getNormalizedRankError(int k)
getNormalizedRankError(int, boolean)
getNormalizedRankError()
k
- the configuration parameter of a DoublesSketchpublic static double getNormalizedRankError(int k, boolean pmf)
getNormalizedRankError(boolean)
.k
- the configuation parameterpmf
- if true, returns the "double-sided" normalized rank error for the getPMF() function.
Otherwise, it is the "single-sided" normalized rank error for all the other queries.KllFloatsSketch
public static int getKFromEpsilon(double epsilon, boolean pmf)
epsilon
- the normalized rank error between zero and one.pmf
- if true, this function returns the value of k assuming the input epsilon
is the desired "double-sided" epsilon for the getPMF() function. Otherwise, this function
returns the value of k assuming the input epsilon is the desired "single-sided"
epsilon for all the other queries.KllFloatsSketch
public boolean isEmpty()
public abstract boolean isDirect()
public boolean isEstimationMode()
public boolean isSameResource(org.apache.datasketches.memory.Memory that)
that
- A different non-null objectpublic byte[] toByteArray()
public byte[] toByteArray(boolean compact)
compact
- if true the sketch will be serialized in compact form.
DirectCompactDoublesSketch can wrap() only a compact byte array;
DirectUpdateDoublesSketch can wrap() only a non-compact byte array.public String toString()
public String toString(boolean sketchSummary, boolean dataDetail)
sketchSummary
- if true includes sketch summarydataDetail
- if true includes data detailpublic static String toString(byte[] byteArr)
byteArr
- the given byte arraypublic static String toString(org.apache.datasketches.memory.Memory mem)
mem
- the given Memorypublic DoublesSketch downSample(DoublesSketch srcSketch, int smallerK, org.apache.datasketches.memory.WritableMemory dstMem)
srcSketch
- the sourcing sketchsmallerK
- the new sketch's value of K that must be smaller than this value of K.
It is required that this.getK() = smallerK * 2^(nonnegative integer).dstMem
- the destination Memory. It must not overlap the Memory of this sketch.
If null, a heap sketch will be returned, otherwise it will be off-heap.public int getRetainedItems()
public int getCompactStorageBytes()
public static int getCompactStorageBytes(int k, long n)
k
- the size configuration parameter for the sketchn
- the number of items input into the sketchpublic int getStorageBytes()
public int getUpdatableStorageBytes()
public static int getUpdatableStorageBytes(int k, long n)
k
- the size configuration parameter for the sketchn
- the number of items input into the sketchpublic void putMemory(org.apache.datasketches.memory.WritableMemory dstMem)
dstMem
- the given memory.public void putMemory(org.apache.datasketches.memory.WritableMemory dstMem, boolean compact)
dstMem
- the given memory.compact
- if true, compacts and sorts the base buffer, which optimizes merge
performance at the cost of slightly increased serialization time.public DoublesSketchIterator iterator()
Copyright © 2015–2021 The Apache Software Foundation. All rights reserved.