T
- The type of item to be tracked by this sketchpublic class ItemsSketch<T> extends Object
This sketch is useful for tracking approximate frequencies of items of type <T> with optional associated counts (<T> item, long count) that are members of a multiset of such items. The true frequency of an item is defined to be the sum of associated counts.
This implementation provides the following capabilities:
Space Usage
The sketch is initialized with a maxMapSize that specifies the maximum physical length of the internal hash map of the form (<T> item, long count). The maxMapSize must be a power of 2.
The hash map starts at a very small size (8 entries), and grows as needed up to the specified maxMapSize.
Excluding external space required for the item objects, the internal memory space usage of this sketch is 18 * mapSize bytes (assuming 8 bytes for each Java reference), plus a small constant number of additional bytes. The internal memory space usage of this sketch will never exceed 18 * maxMapSize bytes, plus a small constant number of additional bytes.
Maximum Capacity of the Sketch
The LOAD_FACTOR for the hash map is internally set at 75%, which means at any time the map capacity of (item, count) pairs is mapCap = 0.75 * mapSize. The maximum capacity of (item, count) pairs of the sketch is maxMapCap = 0.75 * maxMapSize.
Updating the sketch with (item, count) pairs
If the item is found in the hash map, the mapped count field (the "counter") is incremented by the incoming count, otherwise, a new counter "(item, count) pair" is created. If the number of tracked counters reaches the maximum capacity of the hash map the sketch decrements all of the counters (by an approximately computed median), and removes any non-positive counters.
Accuracy
If fewer than 0.75 * maxMapSize different items are inserted into the sketch the estimated frequencies returned by the sketch will be exact.
The logic of the frequent items sketch is such that the stored counts and true counts are never too different. More specifically, for any item, the sketch can return an estimate of the true frequency of item, along with upper and lower bounds on the frequency (that hold deterministically).
For this implementation and for a specific active item, it is guaranteed that the true frequency will be between the Upper Bound (UB) and the Lower Bound (LB) computed for that item. Specifically, (UB- LB) ≤ W * epsilon, where W denotes the sum of all item counts, and epsilon = 3.5/M, where M is the maxMapSize.
This is a worst case guarantee that applies to arbitrary inputs.1 For inputs typically seen in practice (UB-LB) is usually much smaller.
Background
This code implements a variant of what is commonly known as the "Misra-Gries algorithm". Variants of it were discovered and rediscovered and redesigned several times over the years:
Modifier and Type | Class and Description |
---|---|
static class |
ItemsSketch.Row<T>
Row class that defines the return values from a getFrequentItems query.
|
Constructor and Description |
---|
ItemsSketch(int maxMapSize)
Construct this sketch with the parameter maxMapSize and the default initialMapSize (8).
|
Modifier and Type | Method and Description |
---|---|
static double |
getAprioriError(int maxMapSize,
long estimatedTotalStreamWeight)
Returns the estimated a priori error given the maxMapSize for the sketch and the
estimatedTotalStreamWeight.
|
int |
getCurrentMapCapacity()
Returns the current number of counters the sketch is configured to support.
|
static double |
getEpsilon(int maxMapSize)
Returns epsilon used to compute a priori error.
|
long |
getEstimate(T item)
Gets the estimate of the frequency of the given item.
|
ItemsSketch.Row<T>[] |
getFrequentItems(ErrorType errorType)
Returns an array of Rows that include frequent items, estimates, upper and lower bounds
given an ErrorCondition and the default threshold.
|
ItemsSketch.Row<T>[] |
getFrequentItems(long threshold,
ErrorType errorType)
Returns an array of Rows that include frequent items, estimates, upper and lower bounds
given a threshold and an ErrorCondition.
|
static <T> ItemsSketch<T> |
getInstance(org.apache.datasketches.memory.Memory srcMem,
ArrayOfItemsSerDe<T> serDe)
Returns a sketch instance of this class from the given srcMem,
which must be a Memory representation of this sketch class.
|
long |
getLowerBound(T item)
Gets the guaranteed lower bound frequency of the given item, which can never be
negative.
|
long |
getMaximumError() |
int |
getMaximumMapCapacity()
Returns the maximum number of counters the sketch is configured to support.
|
int |
getNumActiveItems() |
long |
getStreamLength()
Returns the sum of the frequencies in the stream seen so far by the sketch
|
long |
getUpperBound(T item)
Gets the guaranteed upper bound frequency of the given item.
|
boolean |
isEmpty()
Returns true if this sketch is empty
|
ItemsSketch<T> |
merge(ItemsSketch<T> other)
This function merges the other sketch into this one.
|
void |
reset()
Resets this sketch to a virgin state.
|
byte[] |
toByteArray(ArrayOfItemsSerDe<T> serDe)
Returns a byte array representation of this sketch
|
String |
toString()
Returns a human readable summary of this sketch.
|
static String |
toString(byte[] byteArr)
Returns a human readable string of the preamble of a byte array image of a ItemsSketch.
|
static String |
toString(org.apache.datasketches.memory.Memory mem)
Returns a human readable string of the preamble of a Memory image of a ItemsSketch.
|
void |
update(T item)
Update this sketch with an item and a frequency count of one.
|
void |
update(T item,
long count)
Update this sketch with an item and a positive frequency count.
|
public ItemsSketch(int maxMapSize)
maxMapSize
- Determines the physical size of the internal hash map managed by this
sketch and must be a power of 2. The maximum capacity of this internal hash map is
0.75 times * maxMapSize. Both the ultimate accuracy and size of this sketch are
functions of maxMapSize.public static <T> ItemsSketch<T> getInstance(org.apache.datasketches.memory.Memory srcMem, ArrayOfItemsSerDe<T> serDe)
T
- The type of item that this sketch will tracksrcMem
- a Memory representation of a sketch of this class.
See MemoryserDe
- an instance of ArrayOfItemsSerDepublic static double getAprioriError(int maxMapSize, long estimatedTotalStreamWeight)
maxMapSize
- the planned map size to be used when constructing this sketch.estimatedTotalStreamWeight
- the estimated total stream weight.public int getCurrentMapCapacity()
public static double getEpsilon(int maxMapSize)
maxMapSize
- the planned map size to be used when constructing this sketch.public long getEstimate(T item)
item
- the given itempublic long getLowerBound(T item)
item
- the given item.public ItemsSketch.Row<T>[] getFrequentItems(long threshold, ErrorType errorType)
The method first examines all active items in the sketch (items that have a counter).
If ErrorType = NO_FALSE_NEGATIVES, this will include an item in the result list if getUpperBound(item) > threshold. There will be no false negatives, i.e., no Type II error. There may be items in the set with true frequencies less than the threshold (false positives).
If ErrorType = NO_FALSE_POSITIVES, this will include an item in the result list if getLowerBound(item) > threshold. There will be no false positives, i.e., no Type I error. There may be items omitted from the set with true frequencies greater than the threshold (false negatives).
threshold
- to include items in the result listerrorType
- determines whether no false positives or no false negatives are
desired.public ItemsSketch.Row<T>[] getFrequentItems(ErrorType errorType)
errorType
- determines whether no false positives or no false negatives are
desired.public long getMaximumError()
public int getMaximumMapCapacity()
public int getNumActiveItems()
public long getStreamLength()
public long getUpperBound(T item)
item
- the given itempublic boolean isEmpty()
public ItemsSketch<T> merge(ItemsSketch<T> other)
other
- sketch of this classpublic void reset()
public byte[] toByteArray(ArrayOfItemsSerDe<T> serDe)
serDe
- an instance of ArrayOfItemsSerDepublic String toString()
public static String toString(byte[] byteArr)
byteArr
- the given byte arraypublic static String toString(org.apache.datasketches.memory.Memory mem)
mem
- the given Memory objectpublic void update(T item)
item
- for which the frequency should be increased.public void update(T item, long count)
item
- for which the frequency should be increased. The sketch uses
hashCode() and equals() methods of the type T.count
- the amount by which the frequency of the item should be increased.
A count of zero is a no-op, and a negative count will throw an exception.Copyright © 2015–2024 The Apache Software Foundation. All rights reserved.