public final class FdtSketch extends ArrayOfStringsSketch
Suppose our data is a stream of pairs {IP address, User ID} and we want to identify the IP addresses that have the most distinct User IDs. Or conversely, we would like to identify the User IDs that have the most distinct IP addresses. This is a common challenge in the analysis of big data and the FDT sketch helps solve this problem using probabilistic techniques.
More generally, given a multiset of tuples with N dimensions {d1, d2, d3, ..., dN} and a primary subset of M < N of those dimensions, our task is to identify the combinations of the M primary dimensions that have the largest number of distinct combinations of the remaining N-M non-primary dimensions.
Please refer to the web page https://datasketches.apache.org/docs/Frequency/FrequentDistinctTuplesSketch.html for a more complete discussion about this sketch.
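For intuition, the exact (non-sketch) version of the problem can be written with ordinary collections. The example below is a brute-force baseline, not the FdtSketch algorithm; the class name ExactFrequentDistinct, its helper methods, and the sample data are hypothetical. It groups {IP address, User ID} pairs by IP address and ranks the addresses by their number of distinct User IDs, which is the quantity the FDT sketch estimates in bounded memory.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ExactFrequentDistinct {

  /** Counts, for each primary key (pair[0]), the distinct values of pair[1]. */
  public static Map<String, Integer> distinctCounts(List<String[]> pairs) {
    Map<String, Set<String>> seen = new HashMap<>();
    for (String[] pair : pairs) {
      seen.computeIfAbsent(pair[0], k -> new HashSet<>()).add(pair[1]);
    }
    Map<String, Integer> counts = new HashMap<>();
    for (Map.Entry<String, Set<String>> e : seen.entrySet()) {
      counts.put(e.getKey(), e.getValue().size());
    }
    return counts;
  }

  /** Returns the primary keys ordered by descending distinct count, ties broken by key. */
  public static List<String> ranked(Map<String, Integer> counts) {
    List<String> keys = new ArrayList<>(counts.keySet());
    keys.sort((a, b) -> {
      int byCount = Integer.compare(counts.get(b), counts.get(a));
      return byCount != 0 ? byCount : a.compareTo(b);
    });
    return keys;
  }

  public static void main(String[] args) {
    // Hypothetical sample stream of {IP address, User ID} pairs.
    List<String[]> stream = new ArrayList<>();
    stream.add(new String[] {"10.0.0.1", "alice"});
    stream.add(new String[] {"10.0.0.1", "bob"});
    stream.add(new String[] {"10.0.0.1", "alice"}); // duplicate pair, counted once
    stream.add(new String[] {"10.0.0.2", "carol"});

    Map<String, Integer> counts = distinctCounts(stream);
    for (String ip : ranked(counts)) {
      System.out.println(ip + " -> " + counts.get(ip));
    }
  }
}
```

This baseline keeps every distinct pair in memory, which is what becomes impractical on large streams and motivates a probabilistic sketch.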
PREAMBLE_LONGS, summaryFactory_
| Constructor and Description |
|---|
| FdtSketch(double threshold, double rse): Create a new instance of Frequent Distinct Tuples sketch with a size determined by the given threshold and rse. |
| FdtSketch(FdtSketch sketch): Copy Constructor |
| FdtSketch(int lgK): Create new instance of Frequent Distinct Tuples sketch with the given Log-base2 of required nominal entries. |
| Modifier and Type | Method and Description |
|---|---|
| CompactSketch<S> | compact(): Converts the current state of the sketch into a compact sketch. |
| FdtSketch | copy() |
| int | getCountLessThanThetaLong(long thetaLong): Gets the number of hash values less than the given theta expressed as a long. |
| int | getCurrentCapacity(): Get current capacity |
| int | getLgK(): Get log_base2 of Nominal Entries |
| int | getNominalEntries(): Get configured nominal number of entries |
| PostProcessor | getPostProcessor(): Returns the PostProcessor that enables multiple queries against the sketch results. |
| PostProcessor | getPostProcessor(Group group, char sep): Returns the PostProcessor that enables multiple queries against the sketch results. |
| ResizeFactor | getResizeFactor(): Get configured resize factor |
| List<Group> | getResult(int[] priKeyIndices, int limit, int numStdDev, char sep): Returns an ordered List of Groups of the most frequent distinct population of subset tuples represented by the count of entries of each group. |
| int | getRetainedEntries() |
| float | getSamplingProbability(): Get configured sampling probability |
| protected void | insertSummary(int index, S summary) |
| TupleSketchIterator<S> | iterator(): Returns a SketchIterator |
| void | reset(): Resets this sketch to an empty state. |
| byte[] | toByteArray(): Deprecated. As of 3.0.0, serializing an UpdatableSketch is deprecated. This capability will be removed in a future release. Serializing a CompactSketch is not deprecated. |
| void | trim(): Rebuilds reducing the actual number of entries to the nominal number of entries if needed |
| void | update(String[] tuple): Update the sketch with the given string array tuple. |
update
update, update, update, update, update, update, update
getEstimate, getEstimate, getLowerBound, getLowerBound, getSummaryFactory, getTheta, getThetaLong, getUpperBound, getUpperBound, isEmpty, isEstimationMode, toString
public FdtSketch(int lgK)
lgK
- Log-base2 of required nominal entries.
public FdtSketch(double threshold, double rse)
threshold
- the fraction, between zero and 1.0, of the total distinct stream length
that defines a "Frequent" (or heavy) item.
rse
- the maximum Relative Standard Error for the estimate of the distinct population of a
reported tuple (selected with a primary key) at the threshold.
public FdtSketch(FdtSketch sketch)
sketch
- the sketch to copy
public FdtSketch copy()
copy
in class ArrayOfStringsSketch
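To see how the threshold and rse arguments of the FdtSketch(double threshold, double rse) constructor could translate into a sketch size, the example below assumes that the nominal entry count k must satisfy k ≥ 1 / (threshold × rse²), rounded up to the next power of two. That sizing rule is an assumption made here for illustration, as are the class name FdtSizing and the helper lgKFor; consult the library source for the rule it actually applies.

```java
public class FdtSizing {

  /**
   * Hypothetical helper: smallest lgK such that 2^lgK >= 1 / (threshold * rse^2).
   * The sizing rule is an assumption for illustration, not a documented contract.
   */
  public static int lgKFor(double threshold, double rse) {
    if (threshold <= 0.0 || threshold > 1.0 || rse <= 0.0) {
      throw new IllegalArgumentException(
          "threshold must be in (0, 1] and rse must be > 0");
    }
    // Minimum number of nominal entries under the assumed rule.
    double minEntries = Math.ceil(1.0 / (threshold * rse * rse));
    // Round up to a whole power of two, expressed as its base-2 logarithm.
    return (int) Math.ceil(Math.log(minEntries) / Math.log(2.0));
  }

  public static void main(String[] args) {
    // Items holding at least 1% of the distinct stream, estimated to about 5% RSE.
    int lgK = lgKFor(0.01, 0.05);
    System.out.println("lgK = " + lgK + ", nominal entries = " + (1 << lgK));
  }
}
```

Under this assumption, halving rse roughly quadruples the required number of entries, while halving threshold doubles it.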
public void update(String[] tuple)
tuple
- the given string array tuple.
public List<Group> getResult(int[] priKeyIndices, int limit, int numStdDev, char sep)
priKeyIndices
- these indices define the dimensions used for the Primary Keys.
limit
- the maximum number of groups to return. If this value is ≤ 0, all
groups will be returned.
numStdDev
- the number of standard deviations for the upper and lower error bounds,
this value is an integer and must be one of 1, 2, or 3.
See Number of Standard Deviations
sep
- the separator character
public PostProcessor getPostProcessor()
public PostProcessor getPostProcessor(Group group, char sep)
group
- the Group class to use during post processing.
sep
- the separator character.
public int getRetainedEntries()
getRetainedEntries
in class Sketch<S extends Summary>
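The priKeyIndices, limit, and sep parameters of getResult can be pictured with a small exact stand-in. The example below is not the library's implementation: it uses plain collections, and the class name PrimaryKeyGrouping and the helpers primaryKey and topGroups are hypothetical. It joins the selected tuple dimensions with the separator to form a group key, counts the distinct full tuples under each key, and returns every group when limit ≤ 0. Error bounds (numStdDev) are omitted because the counts here are exact.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class PrimaryKeyGrouping {

  /** Joins the dimensions selected by priKeyIndices using the separator. */
  public static String primaryKey(String[] tuple, int[] priKeyIndices, char sep) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < priKeyIndices.length; i++) {
      if (i > 0) {
        sb.append(sep);
      }
      sb.append(tuple[priKeyIndices[i]]);
    }
    return sb.toString();
  }

  /** Exact distinct-tuple count per primary key, largest first; limit <= 0 returns all. */
  public static List<Map.Entry<String, Integer>> topGroups(
      List<String[]> tuples, int[] priKeyIndices, int limit, char sep) {
    // Each group holds the set of distinct full tuples that share a primary key.
    Map<String, Set<List<String>>> groups = new HashMap<>();
    for (String[] t : tuples) {
      groups.computeIfAbsent(primaryKey(t, priKeyIndices, sep), k -> new HashSet<>())
          .add(Arrays.asList(t));
    }
    List<Map.Entry<String, Integer>> out = new ArrayList<>();
    for (Map.Entry<String, Set<List<String>>> e : groups.entrySet()) {
      out.add(new AbstractMap.SimpleEntry<String, Integer>(e.getKey(), e.getValue().size()));
    }
    out.sort((a, b) -> {
      int byCount = Integer.compare(b.getValue(), a.getValue());
      return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
    });
    if (limit <= 0 || limit >= out.size()) {
      return out;
    }
    return new ArrayList<Map.Entry<String, Integer>>(out.subList(0, limit));
  }

  public static void main(String[] args) {
    // Hypothetical tuples of {IP address, User ID, device}.
    List<String[]> tuples = new ArrayList<>();
    tuples.add(new String[] {"10.0.0.1", "alice", "phone"});
    tuples.add(new String[] {"10.0.0.1", "bob", "phone"});
    tuples.add(new String[] {"10.0.0.2", "carol", "laptop"});

    // Dimensions 0 and 2 form the primary key; limit 0 asks for all groups.
    for (Map.Entry<String, Integer> g : topGroups(tuples, new int[] {0, 2}, 0, ',')) {
      System.out.println(g.getKey() + " -> " + g.getValue());
    }
  }
}
```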
public int getCountLessThanThetaLong(long thetaLong)
Sketch
getCountLessThanThetaLong
in class Sketch<S extends Summary>
thetaLong
- the given theta as a long between zero and Long.MAX_VALUE.
public int getNominalEntries()
public int getLgK()
public float getSamplingProbability()
public int getCurrentCapacity()
public ResizeFactor getResizeFactor()
public void trim()
public void reset()
public CompactSketch<S> compact()
@Deprecated public byte[] toByteArray()
toByteArray
in class Sketch<S extends Summary>
protected void insertSummary(int index, S summary)
public TupleSketchIterator<S> iterator()
Sketch
Copyright © 2015–2024 The Apache Software Foundation. All rights reserved.