public final class UniqueCountMap extends Object
The space consumed by this map is quite sensitive to the actual distribution of identifiers per key, so you should characterize and or experiment with your typical input streams. Nonetheless, our experiments on live streams of over 100M keys required about 1.4GB of space. This is about 14 bytes per key for key storage and unique count storage.
Given such highly-skewed distributions, using this map is far more efficient space-wise than
the alternative of dedicating an HLL sketch per key. Based on our use cases, after
subtracting the space required for key storage, the average bytes per key required for unique
count estimation (getAverageSketchMemoryPerKey()
) is about 10.
Internally, this map is implemented as a hierarchy of internal hash maps with progressively increasing storage allocated for unique count estimation. As a key acquires more identifiers it is "promoted" up to a higher internal map. The final map of keys is a map of compact HLL sketches.
The unique values in all the internal maps, except the final HLL map, are stored in a special form called a coupon. A coupon is a 16-bit value that fully describes a k=1024 HLL bin. It contains 10 bits of address and a 6-bit HLL value.
All internal maps use a prime number size and Knuth's Open Addressing Double Hash (OADH) search algorithm.
The internal base map holds all the keys and each key is associated with one 16-bit value. Initially, the value is a single coupon. Once the key is promoted, this 16-bit field contains a reference to the internal map where the key is still active.
The intermediate maps between the base map and the final HLL map are of two types. The first few of these are called traverse maps where the coupons are stored as unsorted arrays. After the traverse maps are the coupon hash maps, where the coupons are stored in small OASH hash tables.
All the intermediate maps support deletes and can dynamically grow and shrink as required by the input stream.
The sketch estimator algorithms are unbiased with a Relative Standard Error (RSE) of about 2.6% with 68% confidence, or equivalently, about 5.2% with a 95% confidence.
In a parallel package in the sketches-misc repository, there are 2 classes that can be used from the command line to feed this mapping sketch piped from standard-in for experimental evaluation. The first is ProcessIpStream, which processes simple IP/ID pairs and the second, ProcessDistributionStream, which processes pairs that describe a distribution. In this same package is the VariousMapRSETest class that was used to generate the error plots for the web site. Please refer to the javadocs for those classes for more information.
Constructor and Description |
---|
UniqueCountMap(int keySizeBytes)
Constructs a UniqueCountMap with an initial capacity of one million entries.
|
UniqueCountMap(int initialNumEntries,
int keySizeBytes)
Constructs a UniqueCountMap with a given initial number of entries.
|
Modifier and Type | Method and Description |
---|---|
int |
getActiveEntries()
Returns the number of active, unique keys across all internal maps
|
double |
getAverageSketchMemoryPerKey()
Returns the average memory storage per key that is dedicated to sketching the unique counts.
|
double |
getEstimate(byte[] key)
Retrieves the current estimate of unique count for a given key.
|
long |
getKeyMemoryUsageBytes()
Returns total bytes used for key storage
|
double |
getLowerBound(byte[] key)
Returns the lower bound cardinality with respect to
getEstimate(byte[]) associated
with the given key. |
long |
getMemoryUsageBytes()
Returns total bytes used by all internal maps
|
double |
getUpperBound(byte[] key)
Returns the upper bound cardinality with respect to
getEstimate(byte[]) associated
with the given key. |
String |
toString()
Returns a string with a human-readable summary of the UniqueCountMap and all the internal maps
|
double |
update(byte[] key,
byte[] identifier)
Updates the map with a given key and identifier and returns the estimate of the number of
unique identifiers encountered so far for the given key.
|
public UniqueCountMap(int keySizeBytes)
keySizeBytes
- must be at least 4 bytes to have sufficient entropy.public UniqueCountMap(int initialNumEntries, int keySizeBytes)
initialNumEntries
- The initial number of entries provides a tradeoff between
wasted space, if too high, and wasted time resizing the table, if too low.keySizeBytes
- must be at least 4 bytes to have sufficient entropypublic double update(byte[] key, byte[] identifier)
key
- the given keyidentifier
- the given identifier for unique counting associated with the keypublic double getEstimate(byte[] key)
key
- given keypublic double getUpperBound(byte[] key)
getEstimate(byte[])
associated
with the given key.key
- the given keygetEstimate(byte[])
associated
with the given key.public double getLowerBound(byte[] key)
getEstimate(byte[])
associated
with the given key.key
- the given keygetEstimate(byte[])
associated
with the given key.public int getActiveEntries()
public long getMemoryUsageBytes()
public long getKeyMemoryUsageBytes()
public double getAverageSketchMemoryPerKey()
Copyright © 2015–2024 The Apache Software Foundation. All rights reserved.