T
- type of itempublic final class ItemsSketch<T> extends Object
The documentation for DoublesSketch
applies here except that the size of an ItemsSketch
is very dependent on the Items input into the sketch, so there is no comparable size table as
for the DoublesSketch.
There is more documentation available on datasketches.apache.org.
Modifier and Type | Field and Description |
---|---|
static Random |
rand
Setting the seed makes the results of the sketch deterministic if the input values are
received in exactly the same order.
|
Modifier and Type | Method and Description |
---|---|
ItemsSketch<T> |
downSample(int newK)
From an existing sketch, this creates a new sketch that can have a smaller value of K.
|
double[] |
getCDF(T[] splitPoints)
Returns an approximation to the Cumulative Distribution Function (CDF), which is the
cumulative analog of the PMF, of the input stream given a set of splitPoints (values).
|
static <T> ItemsSketch<T> |
getInstance(Comparator<? super T> comparator)
Obtains a new instance of an ItemsSketch using the DEFAULT_K.
|
static <T> ItemsSketch<T> |
getInstance(int k,
Comparator<? super T> comparator)
Obtains a new instance of an ItemsSketch.
|
static <T> ItemsSketch<T> |
getInstance(org.apache.datasketches.memory.Memory srcMem,
Comparator<? super T> comparator,
ArrayOfItemsSerDe<T> serDe)
Heapifies the given srcMem, which must be a Memory image of a ItemsSketch
|
int |
getK()
Returns the configured value of K
|
static int |
getKFromEpsilon(double epsilon,
boolean pmf)
Gets the approximate value of k to use given epsilon, the normalized rank error.
|
T |
getMaxValue()
Returns the max value of the stream
|
T |
getMinValue()
Returns the min value of the stream
|
long |
getN()
Returns the length of the input stream so far.
|
double |
getNormalizedRankError()
Deprecated.
v2.0.0. Replaced by
getNormalizedRankError(boolean) |
double |
getNormalizedRankError(boolean pmf)
Gets the approximate rank error of this sketch normalized as a fraction between zero and one.
|
static double |
getNormalizedRankError(int k)
Deprecated.
v2.0.0. Replaced by
getNormalizedRankError(int, boolean) |
static double |
getNormalizedRankError(int k,
boolean pmf)
Gets the normalized rank error given k and pmf.
|
double[] |
getPMF(T[] splitPoints)
Returns an approximation to the Probability Mass Function (PMF) of the input stream
given a set of splitPoints (values).
|
T |
getQuantile(double fraction)
This returns an approximation to the value of the data item
that would be preceded by the given fraction of a hypothetical sorted
version of the input stream so far.
|
T |
getQuantileLowerBound(double fraction)
Gets the lower bound of the value interval in which the true quantile of the given rank
exists with a confidence of at least 99%.
|
T[] |
getQuantiles(double[] fRanks)
This is a more efficient multiple-query version of getQuantile().
|
T[] |
getQuantiles(int evenlySpaced)
This is also a more efficient multiple-query version of getQuantile() and allows the caller to
specify the number of evenly spaced fractional ranks.
|
T |
getQuantileUpperBound(double fraction)
Gets the upper bound of the value interval in which the true quantile of the given rank
exists with a confidence of at least 99%.
|
double |
getRank(T value)
Returns an approximation to the normalized (fractional) rank of the given value from 0 to 1
inclusive.
|
int |
getRetainedItems()
Computes the number of retained entries (samples) in the sketch
|
boolean |
isDirect() |
boolean |
isEmpty()
Returns true if this sketch is empty
|
boolean |
isEstimationMode() |
ItemsSketchIterator<T> |
iterator() |
void |
putMemory(org.apache.datasketches.memory.WritableMemory dstMem,
ArrayOfItemsSerDe<T> serDe)
Puts the current sketch into the given Memory if there is sufficient space.
|
void |
reset()
Resets this sketch to a virgin state, but retains the original value of k.
|
byte[] |
toByteArray(ArrayOfItemsSerDe<T> serDe)
Serialize this sketch to a byte array form.
|
byte[] |
toByteArray(boolean ordered,
ArrayOfItemsSerDe<T> serDe)
Serialize this sketch to a byte array form.
|
String |
toString()
Returns summary information about this sketch.
|
String |
toString(boolean sketchSummary,
boolean dataDetail)
Returns summary information about this sketch.
|
static String |
toString(byte[] byteArr)
Returns a human readable string of the preamble of a byte array image of an ItemsSketch.
|
static String |
toString(org.apache.datasketches.memory.Memory mem)
Returns a human readable string of the preamble of a Memory image of an ItemsSketch.
|
void |
update(T dataItem)
Updates this sketch with the given double data item
|
public static final Random rand
public static <T> ItemsSketch<T> getInstance(Comparator<? super T> comparator)
T
- type of itemcomparator
- to compare itemspublic static <T> ItemsSketch<T> getInstance(int k, Comparator<? super T> comparator)
T
- type of itemk
- Parameter that controls space usage of sketch and accuracy of estimates.
Must be greater than 2 and less than 65536 and a power of 2.comparator
- to compare itemspublic static <T> ItemsSketch<T> getInstance(org.apache.datasketches.memory.Memory srcMem, Comparator<? super T> comparator, ArrayOfItemsSerDe<T> serDe)
T
- type of itemsrcMem
- a Memory image of a sketch.
See Memorycomparator
- to compare itemsserDe
- an instance of ArrayOfItemsSerDepublic void update(T dataItem)
dataItem
- an item from a stream of items. NaNs are ignored.public T getQuantile(double fraction)
We note that this method has a fairly large overhead (microseconds instead of nanoseconds) so it should not be called multiple times to get different quantiles from the same sketch. Instead use getQuantiles(). which pays the overhead only once.
fraction
- the specified fractional position in the hypothetical sorted stream.
These are also called normalized ranks or fractional ranks.
If fraction = 0.0, the true minimum value of the stream is returned.
If fraction = 1.0, the true maximum value of the stream is returned.public T getQuantileUpperBound(double fraction)
fraction
- the given normalized rank as a fractionpublic T getQuantileLowerBound(double fraction)
fraction
- the given normalized rank as a fractionpublic T[] getQuantiles(double[] fRanks)
This returns an array that could have been generated by using getQuantile() with many different fractional ranks, but would be very inefficient. This method incurs the internal set-up overhead once and obtains multiple quantile values in a single query. It is strongly recommend that this method be used instead of multiple calls to getQuantile().
If the sketch is empty this returns null.
fRanks
- the given array of fractional (or normalized) ranks in the hypothetical
sorted stream of all the input values seen so far.
These fRanks must all be in the interval [0.0, 1.0] inclusively.public T[] getQuantiles(int evenlySpaced)
evenlySpaced
- an integer that specifies the number of evenly spaced fractional ranks.
This must be a positive integer greater than 1.
A value of 2 will return the min and the max value. A value of 3 will return the min,
the median and the max value, etc.public double getRank(T value)
The resulting approximation has a probabilistic guarantee that be obtained from the getNormalizedRankError(false) function.
If the sketch is empty this returns NaN.
value
- to be rankedpublic double[] getPMF(T[] splitPoints)
The resulting approximations have a probabilistic guarantee that be obtained from the getNormalizedRankError(true) function.
If the sketch is empty this returns null.
splitPoints
- an array of m unique, monotonically increasing item values
that divide the ordered space into m+1 consecutive disjoint intervals.
The definition of an "interval" is inclusive of the left splitPoint (or minimum value) and
exclusive of the right splitPoint, with the exception that the last interval will include
the maximum value.
It is not necessary to include either the min or max values in these splitpoints.public double[] getCDF(T[] splitPoints)
The resulting approximations have a probabilistic guarantee that be obtained from the getNormalizedRankError(false) function.
If the sketch is empty this returns null.
splitPoints
- an array of m unique, monotonically increasing item values
that divide the ordered space into m+1 consecutive disjoint intervals.
The definition of an "interval" is inclusive of the left splitPoint (or minimum value) and
exclusive of the right splitPoint, with the exception that the last interval will include
the maximum value.
It is not necessary to include either the min or max values in these splitpoints.public int getK()
public T getMinValue()
public T getMaxValue()
public long getN()
@Deprecated public double getNormalizedRankError()
getNormalizedRankError(boolean)
Suppose the sketch is presented with N values. The raw rank (0 to N-1) of an item would be its index position in the sorted version of the input stream. If we divide the raw rank by N, it becomes the normalized rank, which is between 0 and 1.0.
For example, choosing a K of 227 yields a normalized rank error of about 1%. The upper bound on the median value obtained by getQuantile(0.5) would be the value in the hypothetical ordered stream of values at the normalized rank of 0.51. The lower bound would be the value in the hypothetical ordered stream of values at the normalized rank of 0.49.
The error of this sketch cannot be translated into an error (relative or absolute) of the returned quantile values.
public double getNormalizedRankError(boolean pmf)
pmf
- if true, returns the "double-sided" normalized rank error for the getPMF() function.
Otherwise, it is the "single-sided" normalized rank error for all the other queries.@Deprecated public static double getNormalizedRankError(int k)
getNormalizedRankError(int, boolean)
getNormalizedRankError()
k
- the configuration parameter of a ItemsSketchpublic static double getNormalizedRankError(int k, boolean pmf)
getNormalizedRankError(boolean)
.k
- the configuation parameterpmf
- if true, returns the "double-sided" normalized rank error for the getPMF() function.
Otherwise, it is the "single-sided" normalized rank error for all the other queries.public static int getKFromEpsilon(double epsilon, boolean pmf)
epsilon
- the normalized rank error between zero and one.pmf
- if true, this function returns the value of k assuming the input epsilon
is the desired "double-sided" epsilon for the getPMF() function. Otherwise, this function
returns the value of k assuming the input epsilon is the desired "single-sided"
epsilon for all the other queries.public boolean isEmpty()
public boolean isDirect()
public boolean isEstimationMode()
public void reset()
public byte[] toByteArray(ArrayOfItemsSerDe<T> serDe)
serDe
- an instance of ArrayOfItemsSerDepublic byte[] toByteArray(boolean ordered, ArrayOfItemsSerDe<T> serDe)
ordered
- if true the base buffer will be ordered (default == false).serDe
- an instance of ArrayOfItemsSerDepublic String toString()
public String toString(boolean sketchSummary, boolean dataDetail)
sketchSummary
- if true includes sketch summarydataDetail
- if true includes data detailpublic static String toString(byte[] byteArr)
byteArr
- the given byte arraypublic static String toString(org.apache.datasketches.memory.Memory mem)
mem
- the given Memorypublic ItemsSketch<T> downSample(int newK)
newK
- the new value of K that must be smaller than current value of K.
It is required that this.getK() = newK * 2^(nonnegative integer).public int getRetainedItems()
public void putMemory(org.apache.datasketches.memory.WritableMemory dstMem, ArrayOfItemsSerDe<T> serDe)
dstMem
- the given memory.serDe
- an instance of ArrayOfItemsSerDepublic ItemsSketchIterator<T> iterator()
Copyright © 2015–2021 The Apache Software Foundation. All rights reserved.