T
- The sketch data typepublic final class ItemsSketch<T> extends Object implements QuantilesGenericAPI<T>
A k of 128 produces a normalized, rank error of about 1.7%. For example, the median returned from getQuantile(0.5) will be between the actual quantiles from the hypothetically sorted array of input quantiles at normalized ranks of 0.483 and 0.517, with a confidence of about 99%.
The size of an ItemsSketch is very dependent on the size of the generic Items input into the sketch, so there is no comparable size table as there is for the DoublesSketch.
QuantilesAPI
Modifier and Type | Field and Description |
---|---|
static Random |
rand
Setting the seed makes the results of the sketch deterministic if the input items are
received in exactly the same order.
|
EMPTY_MSG, MEM_REQ_SVR_NULL_MSG, NOT_SINGLE_ITEM_MSG, SELF_MERGE_MSG, TGT_IS_READ_ONLY_MSG, UNSUPPORTED_MSG
Modifier and Type | Method and Description |
---|---|
ItemsSketch<T> |
downSample(int newK)
From an existing sketch, this creates a new sketch that can have a smaller K.
|
double[] |
getCDF(T[] splitPoints,
QuantileSearchCriteria searchCrit)
Returns an approximation to the Cumulative Distribution Function (CDF) of the input stream
as a monotonically increasing array of double ranks (or cumulative probabilities) on the interval [0.0, 1.0],
given a set of splitPoints.
|
Class<T> |
getClassOfT() |
Comparator<? super T> |
getComparator()
Returns the Comparator of T
|
static <T> ItemsSketch<T> |
getInstance(Class<T> clazz,
Comparator<? super T> comparator)
Obtains a new instance of an ItemsSketch using the DEFAULT_K.
|
static <T> ItemsSketch<T> |
getInstance(Class<T> clazz,
int k,
Comparator<? super T> comparator)
Obtains a new instance of an ItemsSketch.
|
static <T> ItemsSketch<T> |
getInstance(Class<T> clazz,
org.apache.datasketches.memory.Memory srcMem,
Comparator<? super T> comparator,
ArrayOfItemsSerDe<T> serDe)
Heapifies the given srcMem, which must be a Memory image of a ItemsSketch
|
int |
getK()
Gets the user configured parameter k, which controls the accuracy of the sketch
and its memory space usage.
|
static int |
getKFromEpsilon(double epsilon,
boolean pmf)
Gets the approximate k to use given epsilon, the normalized rank error.
|
T |
getMaxItem()
Returns the maximum item of the stream.
|
T |
getMinItem()
Returns the minimum item of the stream.
|
long |
getN()
Gets the length of the input stream offered to the sketch..
|
double |
getNormalizedRankError(boolean pmf)
Gets the approximate rank error of this sketch normalized as a fraction between zero and one.
|
static double |
getNormalizedRankError(int k,
boolean pmf)
Gets the normalized rank error given k and pmf.
|
int |
getNumRetained()
Gets the number of quantiles retained by the sketch.
|
GenericPartitionBoundaries<T> |
getPartitionBoundariesFromNumParts(int numEquallySizedParts,
QuantileSearchCriteria searchCrit)
This method returns an instance of
GenericPartitionBoundaries which provides
sufficient information for the user to create the given number of equally sized partitions, where "equally sized"
refers to an approximately equal number of items per partition. |
GenericPartitionBoundaries<T> |
getPartitionBoundariesFromPartSize(long nominalPartSizeItems,
QuantileSearchCriteria searchCrit)
This method returns an instance of
GenericPartitionBoundaries which provides
sufficient information for the user to create the given number of equally sized partitions, where "equally sized"
refers to an approximately equal number of items per partition. |
double[] |
getPMF(T[] splitPoints,
QuantileSearchCriteria searchCrit)
Returns an approximation to the Probability Mass Function (PMF) of the input stream
as an array of probability masses as doubles on the interval [0.0, 1.0],
given a set of splitPoints.
|
T |
getQuantile(double rank,
QuantileSearchCriteria searchCrit)
Gets the approximate quantile of the given normalized rank and the given search criterion.
|
T |
getQuantileLowerBound(double rank)
Gets the lower bound of the quantile confidence interval in which the quantile of the
given rank exists.
|
T[] |
getQuantiles(double[] ranks,
QuantileSearchCriteria searchCrit)
Gets an array of quantiles from the given array of normalized ranks.
|
T |
getQuantileUpperBound(double rank)
Gets the upper bound of the quantile confidence interval in which the true quantile of the
given rank exists.
|
double |
getRank(T quantile,
QuantileSearchCriteria searchCrit)
Gets the normalized rank corresponding to the given a quantile.
|
double |
getRankLowerBound(double rank)
Gets the lower bound of the rank confidence interval in which the true rank of the
given rank exists.
|
double[] |
getRanks(T[] quantiles,
QuantileSearchCriteria searchCrit)
Gets an array of normalized ranks corresponding to the given array of quantiles and the given
search criterion.
|
double |
getRankUpperBound(double rank)
Gets the upper bound of the rank confidence interval in which the true rank of the
given rank exists.
|
ItemsSketchSortedView<T> |
getSortedView()
Gets the sorted view of this sketch
|
boolean |
hasMemory()
Returns true if this sketch's data structure is backed by Memory or WritableMemory.
|
boolean |
isDirect()
Returns true if this sketch's data structure is off-heap (a.k.a., Direct or Native memory).
|
boolean |
isEmpty()
Returns true if this sketch is empty.
|
boolean |
isEstimationMode()
Returns true if this sketch is in estimation mode.
|
boolean |
isReadOnly()
Returns true if this sketch is read only.
|
QuantilesGenericSketchIterator<T> |
iterator()
Gets the iterator for this sketch, which is not sorted.
|
void |
putMemory(org.apache.datasketches.memory.WritableMemory dstMem,
ArrayOfItemsSerDe<T> serDe)
Puts the current sketch into the given Memory if there is sufficient space.
|
void |
reset()
Resets this sketch to the empty state.
|
byte[] |
toByteArray(ArrayOfItemsSerDe<T> serDe)
Serialize this sketch to a byte array form.
|
byte[] |
toByteArray(boolean ordered,
ArrayOfItemsSerDe<T> serDe)
Serialize this sketch to a byte array form.
|
String |
toString()
Returns human readable summary information about this sketch.
|
String |
toString(boolean withLevels,
boolean withLevelsAndItems)
Returns human readable summary information about this sketch.
|
static String |
toString(byte[] byteArr)
Returns a human readable string of the preamble of a byte array image of an ItemsSketch.
|
static String |
toString(org.apache.datasketches.memory.Memory mem)
Returns a human readable string of the preamble of a Memory image of an ItemsSketch.
|
void |
update(T item)
Updates this sketch with the given item.
|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
getCDF, getMaxPartitions, getPMF, getQuantile, getQuantiles, getRank, getRanks
getPartitionBoundariesFromNumParts, getPartitionBoundariesFromPartSize
getMinPartitionSizeItems
public static final Random rand
public static <T> ItemsSketch<T> getInstance(Class<T> clazz, Comparator<? super T> comparator)
T
- The sketch data typeclazz
- the given class of Tcomparator
- to compare itemspublic static <T> ItemsSketch<T> getInstance(Class<T> clazz, int k, Comparator<? super T> comparator)
T
- The sketch data typeclazz
- the given class of Tk
- Parameter that controls space usage of sketch and accuracy of estimates.
Must be greater than 2 and less than 65536 and a power of 2.comparator
- to compare itemspublic static <T> ItemsSketch<T> getInstance(Class<T> clazz, org.apache.datasketches.memory.Memory srcMem, Comparator<? super T> comparator, ArrayOfItemsSerDe<T> serDe)
T
- The sketch data typeclazz
- the given class of TsrcMem
- a Memory image of a sketch.
See Memorycomparator
- to compare itemsserDe
- an instance of ArrayOfItemsSerDepublic double[] getCDF(T[] splitPoints, QuantileSearchCriteria searchCrit)
QuantilesGenericAPI
The resulting approximations have a probabilistic guarantee that can be obtained from the getNormalizedRankError(false) function.
getCDF
in interface QuantilesGenericAPI<T>
splitPoints
- an array of m unique, monotonically increasing items
(of the same type as the input items)
that divide the item input domain into m+1 overlapping intervals.
The start of each interval is below the lowest item retained by the sketch corresponding to a zero rank or zero probability, and the end of the interval is the rank or cumulative probability corresponding to the split point.
The (m+1)th interval represents 100% of the distribution represented by the sketch and consistent with the definition of a cumulative probability distribution, thus the (m+1)th rank or probability in the returned array is always 1.0.
If a split point exactly equals a retained item of the sketch and the search criterion is:
It is not recommended to include either the minimum or maximum items of the input stream.
searchCrit
- the desired search criteria.public Class<T> getClassOfT()
getClassOfT
in interface QuantilesGenericAPI<T>
public Comparator<? super T> getComparator()
QuantilesGenericAPI
getComparator
in interface QuantilesGenericAPI<T>
public T getMaxItem()
QuantilesGenericAPI
getMaxItem
in interface QuantilesGenericAPI<T>
public T getMinItem()
QuantilesGenericAPI
getMinItem
in interface QuantilesGenericAPI<T>
public GenericPartitionBoundaries<T> getPartitionBoundariesFromNumParts(int numEquallySizedParts, QuantileSearchCriteria searchCrit)
PartitioningFeature
GenericPartitionBoundaries
which provides
sufficient information for the user to create the given number of equally sized partitions, where "equally sized"
refers to an approximately equal number of items per partition.
The sketch must not be empty.
getPartitionBoundariesFromNumParts
in interface PartitioningFeature<T>
numEquallySizedParts
- an integer that specifies the number of equally sized partitions between
getMinItem()
and
getMaxItem()
.
This must be a positive integer less than
getMaxPartitions()
searchCrit
- If INCLUSIVE, all the returned quantiles are the upper boundaries of the equally sized partitions
with the exception of the lowest returned quantile, which is the lowest boundary of the lowest ranked partition.
If EXCLUSIVE, all the returned quantiles are the lower boundaries of the equally sized partitions
with the exception of the highest returned quantile, which is the upper boundary of the highest ranked partition.GenericPartitionBoundaries
.public GenericPartitionBoundaries<T> getPartitionBoundariesFromPartSize(long nominalPartSizeItems, QuantileSearchCriteria searchCrit)
PartitioningFeature
GenericPartitionBoundaries
which provides
sufficient information for the user to create the given number of equally sized partitions, where "equally sized"
refers to an approximately equal number of items per partition.
The sketch must not be empty.
getPartitionBoundariesFromPartSize
in interface PartitioningFeature<T>
nominalPartSizeItems
- an integer that specifies the nominal size, in items, of each target partition.
This must be a positive integer greater than
getMinPartitionSizeItems()
.searchCrit
- If INCLUSIVE, all the returned quantiles are the upper boundaries of the equally sized partitions
with the exception of the lowest returned quantile, which is the lowest boundary of the lowest ranked partition.
If EXCLUSIVE, all the returned quantiles are the lower boundaries of the equally sized partitions
with the exception of the highest returned quantile, which is the upper boundary of the highest ranked partition.GenericPartitionBoundaries
.public double[] getPMF(T[] splitPoints, QuantileSearchCriteria searchCrit)
QuantilesGenericAPI
The resulting approximations have a probabilistic guarantee that can be obtained from the getNormalizedRankError(true) function.
getPMF
in interface QuantilesGenericAPI<T>
splitPoints
- an array of m unique, monotonically increasing items
(of the same type as the input items)
that divide the item input domain into m+1 consecutive, non-overlapping intervals.
Each interval except for the end intervals starts with a split point and ends with the next split point in sequence.
The first interval starts below the lowest item retained by the sketch corresponding to a zero rank or zero probability, and ends with the first split point
The last (m+1)th interval starts with the last split point and ends after the last item retained by the sketch corresponding to a rank or probability of 1.0.
The sum of the probability masses of all (m+1) intervals is 1.0.
If the search criterion is:
It is not recommended to include either the minimum or maximum items of the input stream.
searchCrit
- the desired search criteria.public T getQuantile(double rank, QuantileSearchCriteria searchCrit)
QuantilesGenericAPI
getQuantile
in interface QuantilesGenericAPI<T>
rank
- the given normalized rank, a double in the range [0.0, 1.0].searchCrit
- If INCLUSIVE, the given rank includes all quantiles ≤
the quantile directly corresponding to the given rank.
If EXCLUSIVE, he given rank includes all quantiles <
the quantile directly corresponding to the given rank.QuantileSearchCriteria
public T getQuantileLowerBound(double rank)
QuantilesGenericAPI
Although it is possible to estimate the probability that the true quantile exists within the quantile confidence interval specified by the upper and lower quantile bounds, it is not possible to guarantee the width of the quantile confidence interval as an additive or multiplicative percent of the true quantile.
getQuantileLowerBound
in interface QuantilesGenericAPI<T>
rank
- the given normalized rankpublic T getQuantileUpperBound(double rank)
QuantilesGenericAPI
Although it is possible to estimate the probability that the true quantile exists within the quantile confidence interval specified by the upper and lower quantile bounds, it is not possible to guarantee the width of the quantile interval as an additive or multiplicative percent of the true quantile.
getQuantileUpperBound
in interface QuantilesGenericAPI<T>
rank
- the given normalized rankpublic T[] getQuantiles(double[] ranks, QuantileSearchCriteria searchCrit)
QuantilesGenericAPI
getQuantiles
in interface QuantilesGenericAPI<T>
ranks
- the given array of normalized ranks, each of which must be
in the interval [0.0,1.0].searchCrit
- if INCLUSIVE, the given ranks include all quantiles ≤
the quantile directly corresponding to each rank.QuantileSearchCriteria
public double getRank(T quantile, QuantileSearchCriteria searchCrit)
QuantilesGenericAPI
getRank
in interface QuantilesGenericAPI<T>
quantile
- the given quantilesearchCrit
- if INCLUSIVE the given quantile is included into the rank.QuantileSearchCriteria
public double getRankLowerBound(double rank)
QuantilesAPI
getRankLowerBound
in interface QuantilesAPI
rank
- the given normalized rank.public double getRankUpperBound(double rank)
QuantilesAPI
getRankUpperBound
in interface QuantilesAPI
rank
- the given normalized rank.public double[] getRanks(T[] quantiles, QuantileSearchCriteria searchCrit)
QuantilesGenericAPI
getRanks
in interface QuantilesGenericAPI<T>
quantiles
- the given array of quantilessearchCrit
- if INCLUSIVE, the given quantiles include the rank directly corresponding to each quantile.QuantileSearchCriteria
public QuantilesGenericSketchIterator<T> iterator()
QuantilesGenericAPI
iterator
in interface QuantilesGenericAPI<T>
public int getK()
QuantilesAPI
getK
in interface QuantilesAPI
public long getN()
QuantilesAPI
getN
in interface QuantilesAPI
getN
in interface SketchPartitionLimits
public double getNormalizedRankError(boolean pmf)
QuantilesAPI
getNormalizedRankError
in interface QuantilesAPI
pmf
- if true, returns the "double-sided" normalized rank error for the getPMF() function.
Otherwise, it is the "single-sided" normalized rank error for all the other queries.public static double getNormalizedRankError(int k, boolean pmf)
getNormalizedRankError(boolean)
.k
- the configuration parameterpmf
- if true, returns the "double-sided" normalized rank error for the getPMF() function.
Otherwise, it is the "single-sided" normalized rank error for all the other queries.public static int getKFromEpsilon(double epsilon, boolean pmf)
epsilon
- the normalized rank error between zero and one.pmf
- if true, this function returns k assuming the input epsilon
is the desired "double-sided" epsilon for the getPMF() function. Otherwise, this function
returns k assuming the input epsilon is the desired "single-sided"
epsilon for all the other queries.public boolean hasMemory()
QuantilesAPI
hasMemory
in interface QuantilesAPI
public boolean isEmpty()
QuantilesAPI
isEmpty
in interface QuantilesAPI
public boolean isDirect()
QuantilesAPI
isDirect
in interface QuantilesAPI
public boolean isEstimationMode()
QuantilesAPI
isEstimationMode
in interface QuantilesAPI
public boolean isReadOnly()
QuantilesAPI
isReadOnly
in interface QuantilesAPI
public void reset()
QuantilesAPI
The parameter k will not change.
reset
in interface QuantilesAPI
public byte[] toByteArray(ArrayOfItemsSerDe<T> serDe)
serDe
- an instance of ArrayOfItemsSerDepublic byte[] toByteArray(boolean ordered, ArrayOfItemsSerDe<T> serDe)
ordered
- if true the base buffer will be ordered (default == false).serDe
- an instance of ArrayOfItemsSerDepublic String toString()
toString
in interface QuantilesAPI
toString
in class Object
public String toString(boolean withLevels, boolean withLevelsAndItems)
withLevels
- if true includes sketch levels array summary informationwithLevelsAndItems
- if true include detail of levels array and items array togetherpublic static String toString(byte[] byteArr)
byteArr
- the given byte arraypublic static String toString(org.apache.datasketches.memory.Memory mem)
mem
- the given Memorypublic ItemsSketch<T> downSample(int newK)
newK
- the new K that must be smaller than current K.
It is required that this.getK() = newK * 2^(nonnegative integer).public int getNumRetained()
QuantilesAPI
getNumRetained
in interface QuantilesAPI
public void putMemory(org.apache.datasketches.memory.WritableMemory dstMem, ArrayOfItemsSerDe<T> serDe)
dstMem
- the given memory.serDe
- an instance of ArrayOfItemsSerDepublic void update(T item)
QuantilesGenericAPI
update
in interface QuantilesGenericAPI<T>
item
- from a stream of items. Nulls are ignored.public ItemsSketchSortedView<T> getSortedView()
QuantilesGenericAPI
getSortedView
in interface QuantilesGenericAPI<T>
Copyright © 2015–2024 The Apache Software Foundation. All rights reserved.