Apache Commons Statistics User Guide

Overview
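Apache Commons Statistics provides utilities for statistical applications. The code
originated in the Commons Math project.

Commons Statistics is divided into a number of submodules, each described in a section of this guide: descriptive statistics, probability distributions, inference, and ranking.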
Example Modules

In addition to the modules above, the Commons Statistics source distribution contains example code demonstrating library functionality and/or providing useful development utilities. These modules are not part of the public API of the library and no guarantees are made concerning backwards compatibility. The example module parent page contains a listing of available modules.

Descriptive Statistics
Overview

The module provides classes to compute univariate statistics on
numeric data such as double and int values, supplied either as an array or as a Java stream.

Computation of an individual statistic involves creating an instance of
the statistic class, for example IntVariance or IntMean.

Computation of multiple statistics uses a Statistic
enumeration to define the statistics to evaluate. A container class is created to
compute the desired statistics together and allows multiple statistics to be computed
concurrently using the Java stream API. Each statistic result is obtained using the
corresponding Statistic value.
Note that if the data is an integer type then it is
preferred to use the integer specializations of the statistics.
Many implementations use exact integer math for the computation. This is faster than
using a double data type.

Some statistics cannot be computed using a stream since they require all values for computation, for example the median. These are evaluated on an array using an instance of a computing class. The instance allows computation options to be changed. Instances are immutable and the computation is thread-safe.

Examples

Computation of a single statistic from an array of values, or a stream of data:

    int[] values = {1, 1, 2, 3, 5, 8, 13, 21};

    double v = IntVariance.of(values).getAsDouble();

    double m = Stream.of("one", "two", "three", "four")
        .mapToInt(String::length)
        .collect(IntMean::create, IntMean::accept, IntMean::combine)
        .getAsDouble();
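The preference for exact integer math can be illustrated with a small sketch that uses only the JDK. It is not the library's implementation: the class and method names are hypothetical, and BigInteger is used purely to show the precision difference (it says nothing about speed). Summing long values in a double silently drops low-order bits once the total exceeds 2^53, whereas integer arithmetic keeps every bit.

```java
import java.math.BigInteger;

// Hypothetical helper, not part of Commons Statistics.
final class ExactIntegerSum {
    /** Sums the values using double accumulation; low-order bits may be lost. */
    static double doubleSum(long[] values) {
        double sum = 0;
        for (long v : values) {
            sum += v;
        }
        return sum;
    }

    /** Sums the values exactly using arbitrary-precision integer arithmetic. */
    static BigInteger exactSum(long[] values) {
        BigInteger sum = BigInteger.ZERO;
        for (long v : values) {
            sum = sum.add(BigInteger.valueOf(v));
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] values = {1L << 53, 1, 1};
        System.out.println(doubleSum(values)); // 9.007199254740992E15 (the two 1s are lost)
        System.out.println(exactSum(values));  // 9007199254740994
    }
}
```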
Computation of multiple statistics uses the Statistic enumeration and a container class, here DoubleStatistics:

    double[] data = {1, 2, 3, 4, 5, 6, 7, 8};

    DoubleStatistics stats = DoubleStatistics.of(
        EnumSet.of(Statistic.MIN, Statistic.MAX, Statistic.VARIANCE),
        data);

    stats.getAsDouble(Statistic.MIN);        // 1.0
    stats.getAsDouble(Statistic.MAX);        // 8.0
    stats.getAsDouble(Statistic.VARIANCE);   // 6.0

    // Get other statistics supported by the underlying computations
    stats.isSupported(Statistic.STANDARD_DEVIATION);    // true
    stats.getAsDouble(Statistic.STANDARD_DEVIATION);    // 2.449...
Computation of multiple statistics on individual values can accumulate the results
using the accept method of the container:

    IntStatistics stats = IntStatistics.of(
        Statistic.MIN, Statistic.MAX, Statistic.MEAN);
    Stream.of("one", "two", "three", "four")
        .mapToInt(String::length)
        .forEach(stats::accept);

    stats.getAsInt(Statistic.MIN);       // 3
    stats.getAsInt(Statistic.MAX);       // 5
    stats.getAsDouble(Statistic.MEAN);   // 15.0 / 4
Computation of multiple statistics on a stream of values in parallel.
This requires use of a Builder that can supply container instances to the collect method:

    IntStatistics.Builder builder = IntStatistics.builder(
        Statistic.MIN, Statistic.MAX, Statistic.MEAN);
    IntStatistics stats =
        Stream.of("one", "two", "three", "four")
            .parallel()
            .mapToInt(String::length)
            .collect(builder::build, IntConsumer::accept, IntStatistics::combine);

    stats.getAsInt(Statistic.MIN);       // 3
    stats.getAsInt(Statistic.MAX);       // 5
    stats.getAsDouble(Statistic.MEAN);   // 15.0 / 4
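For comparison, the JDK's own IntSummaryStatistics follows the same three-argument collect pattern: a supplier creates an empty container per stream partition, an accumulator adds each value, and a combiner merges partial containers. The sketch below uses only JDK classes; the wrapper class and method name are hypothetical and serve only to show the shape of the pattern.

```java
import java.util.IntSummaryStatistics;
import java.util.stream.Stream;

// Hypothetical wrapper around a JDK-only parallel collect.
final class SummaryCollectExample {
    /** Collects the lengths of the words into summary statistics in parallel. */
    static IntSummaryStatistics lengths(String... words) {
        return Stream.of(words)
            .parallel()
            .mapToInt(String::length)
            .collect(IntSummaryStatistics::new,      // supplier: new empty container
                     IntSummaryStatistics::accept,   // accumulator: add one value
                     IntSummaryStatistics::combine); // combiner: merge two containers
    }

    public static void main(String[] args) {
        IntSummaryStatistics stats = lengths("one", "two", "three", "four");
        System.out.println(stats.getMin());     // 3
        System.out.println(stats.getMax());     // 5
        System.out.println(stats.getAverage()); // 3.75
    }
}
```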
Computation on multiple arrays. This requires use of a Builder:

    double[][] data = {
        {1, 2, 3, 4},
        {5, 6, 7, 8},
    };

    DoubleStatistics.Builder builder = DoubleStatistics.builder(
        Statistic.MIN, Statistic.MAX, Statistic.VARIANCE);
    DoubleStatistics stats = Arrays.stream(data)
        .map(builder::build)
        .reduce(DoubleStatistics::combine)
        .get();

    stats.getAsDouble(Statistic.MIN);        // 1.0
    stats.getAsDouble(Statistic.MAX);        // 8.0
    stats.getAsDouble(Statistic.VARIANCE);   // 6.0

    // Get other statistics supported by the underlying computations
    stats.isSupported(Statistic.MEAN);       // true
    stats.getAsDouble(Statistic.MEAN);       // 4.5
If computation on multiple arrays is to be repeated then this can be done with a
reusable Collector:

    double[][] data = {
        {1, 2, 3, 4},
        {5, 6, 7, 8},
    };

    DoubleStatistics.Builder builder = DoubleStatistics.builder(
        Statistic.MIN, Statistic.MAX, Statistic.VARIANCE);
    Collector<double[], DoubleStatistics, DoubleStatistics> collector =
        Collector.of(builder::build,
                     (s, d) -> s.combine(builder.build(d)),
                     DoubleStatistics::combine);
    DoubleStatistics stats = Arrays.stream(data).collect(collector);

    stats.getAsDouble(Statistic.MIN);        // 1.0
    stats.getAsDouble(Statistic.MAX);        // 8.0
    stats.getAsDouble(Statistic.VARIANCE);   // 6.0

Combination of multiple statistics requires them to be compatible, i.e. all supported statistics in one container are also supported in the other. Note that combining another container ignores any unsupported statistics and the compatibility may be asymmetric.

    double[] data1 = {1, 2, 3, 4};
    double[] data2 = {5, 6, 7, 8};
    DoubleStatistics varStats = DoubleStatistics.builder(Statistic.VARIANCE).build(data1);
    DoubleStatistics meanStats = DoubleStatistics.builder(Statistic.MEAN).build(data2);

    // throws IllegalArgumentException
    varStats.combine(meanStats);

    // OK - mean is updated to 4.5
    meanStats.combine(varStats);
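One way to see why a variance computation can also supply a mean, and why two partial results can be merged, is a small accumulator that tracks the count, the mean and the sum of squared deviations. The sketch below is an assumption about how such a combination could work (Welford's update with a pairwise merge); it is not the library's implementation, and the class and method names are hypothetical.

```java
// Hypothetical accumulator, not part of Commons Statistics.
final class RunningVariance {
    private long n;
    private double mean;
    private double m2; // sum of squared deviations from the current mean

    /** Creates an accumulator from the given values. */
    static RunningVariance of(double... values) {
        RunningVariance rv = new RunningVariance();
        for (double x : values) {
            rv.accept(x);
        }
        return rv;
    }

    /** Adds a single value using Welford's update. */
    void accept(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }

    /** Merges another accumulator into this one and returns this instance. */
    RunningVariance combine(RunningVariance other) {
        if (other.n == 0) {
            return this;
        }
        if (n == 0) {
            n = other.n;
            mean = other.mean;
            m2 = other.m2;
            return this;
        }
        long total = n + other.n;
        double delta = other.mean - mean;
        m2 += other.m2 + delta * delta * n * other.n / total;
        mean += delta * other.n / total;
        n = total;
        return this;
    }

    double getMean() {
        return mean;
    }

    /** Returns the sample variance (requires at least two values). */
    double getVariance() {
        return m2 / (n - 1);
    }

    public static void main(String[] args) {
        RunningVariance stats = RunningVariance.of(1, 2, 3, 4)
            .combine(RunningVariance.of(5, 6, 7, 8));
        System.out.println(stats.getMean());     // 4.5
        System.out.println(stats.getVariance()); // 6.0
    }
}
```

Because the mean is part of the variance state, a mean-only accumulator can be updated from it, but not the other way round. This mirrors the asymmetry described above.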
Computation of a statistic that requires all data (i.e. does not support the
stream API):

    double[] data = {8, 7, 6, 5, 4, 3, 2, 1};

    // Configure the statistic
    double m = Median.withDefaults()
        .withCopy(true)          // do not modify the input array
        .with(NaNPolicy.ERROR)   // raise an exception for NaN
        .evaluate(data);
    // m = 4.5

Computation of multiple values of a statistic that requires all data:

    int size = 10000;
    double origin = 0;
    double bound = 100;
    double[] data =
        new SplittableRandom(123)
            .doubles(size, origin, bound)
            .toArray();

    // Evaluate multiple statistics on the same data
    double[] q = Quantile.withDefaults()
        .evaluate(data, 0.25, 0.5, 0.75);   // probabilities
    // q ~ [25.0, 50.0, 75.0]

Probability Distributions

Overview
API

The distribution framework provides the means to compute probability density, probability mass and cumulative probability functions for several well-known discrete (integer-valued) and continuous probability distributions. The API also allows for the computation of inverse cumulative probabilities and sampling from distributions.

For an instance of a distribution and a value x, the cumulativeProbability(x) method computes P(X <= x) and the survivalProbability(x) method computes P(X > x), where X is a random variable that follows the distribution:

    TDistribution t = TDistribution.of(29);
    double lowerTail = t.cumulativeProbability(-2.656);   // P(T(29) <= -2.656)
    double upperTail = t.survivalProbability(2.75);       // P(T(29) > 2.75)
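For discrete distributions, probability(x) computes the probability mass P(X = x), and probability(x0, x1) computes P(x0 < X <= x1):

    PoissonDistribution pd = PoissonDistribution.of(1.23);
    double p1 = pd.probability(5);
    double p2 = pd.probability(5, 5);
    double p3 = pd.probability(4, 5);
    // p2 == 0
    // p1 == p3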
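The relationship between the results in the example (p2 == 0 and p1 == p3) follows from the range being open at the lower bound and closed at the upper bound. A JDK-only sketch for a Poisson distribution makes this explicit. It is a naive evaluation of the probability mass function under that assumption, not the library's algorithm, and the class name is hypothetical.

```java
// Hypothetical sketch, not part of Commons Statistics.
final class PoissonSketch {
    /** Computes P(X = k) for a Poisson distribution with the given mean. */
    static double probability(double mean, int k) {
        if (k < 0) {
            return 0;
        }
        double p = Math.exp(-mean);
        for (int i = 1; i <= k; i++) {
            p *= mean / i;
        }
        return p;
    }

    /** Computes P(x0 < X <= x1) by summing the mass over x0 + 1, ..., x1. */
    static double probability(double mean, int x0, int x1) {
        double sum = 0;
        for (int k = x0 + 1; k <= x1; k++) {
            sum += probability(mean, k);
        }
        return sum;
    }

    public static void main(String[] args) {
        double mean = 1.23;
        System.out.println(probability(mean, 5, 5));                         // 0.0 (empty range)
        System.out.println(probability(mean, 4, 5) == probability(mean, 5)); // true
    }
}
```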
Inverse distribution functions can be computed using the
inverseCumulativeProbability and inverseSurvivalProbability methods. For a probability p, the inverse cumulative probability is:

\[ x = \begin{cases} \inf \{ x \in \mathbb R : P(X \le x) \ge p\} & \text{for } 0 \lt p \le 1 \\ \inf \{ x \in \mathbb R : P(X \le x) \gt 0 \} & \text{for } p = 0 \end{cases} \]

where X is a random variable that follows the distribution. Likewise the inverse survival probability is:

\[ x = \begin{cases} \inf \{ x \in \mathbb R : P(X \gt x) \le p\} & \text{for } 0 \le p \lt 1 \\ \inf \{ x \in \mathbb R : P(X \gt x) \lt 1 \} & \text{for } p = 1 \end{cases} \]

    NormalDistribution n = NormalDistribution.of(0, 1);
    double x1 = n.inverseCumulativeProbability(1e-300);
    double x2 = n.inverseSurvivalProbability(1e-300);
    // x1 == -x2 ~ -37.0471
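The infimum definition of the inverse cumulative probability can be made concrete for a small discrete distribution, where the search runs over the integers. The sketch below assumes a hypothetical distribution on 0, 1, ..., n - 1 described by an array of probabilities; it is a direct reading of the definition, not the library's algorithm.

```java
// Hypothetical sketch, not part of Commons Statistics.
final class DiscreteInverseCdf {
    /**
     * Returns the smallest x such that P(X <= x) >= p for 0 < p <= 1,
     * or the smallest x such that P(X <= x) > 0 for p = 0.
     * The array pmf holds P(X = x) for x = 0, 1, ..., pmf.length - 1.
     */
    static int inverseCumulativeProbability(double[] pmf, double p) {
        if (!(p >= 0 && p <= 1)) {
            throw new IllegalArgumentException("Invalid probability: " + p);
        }
        double cdf = 0;
        for (int x = 0; x < pmf.length; x++) {
            cdf += pmf[x];
            if (p == 0 ? cdf > 0 : cdf >= p) {
                return x;
            }
        }
        // Guard against rounding in the cumulative sum
        return pmf.length - 1;
    }

    public static void main(String[] args) {
        double[] pmf = {0.25, 0.5, 0.25};
        System.out.println(inverseCumulativeProbability(pmf, 0.0));  // 0
        System.out.println(inverseCumulativeProbability(pmf, 0.25)); // 0
        System.out.println(inverseCumulativeProbability(pmf, 0.26)); // 1
        System.out.println(inverseCumulativeProbability(pmf, 1.0));  // 2
    }
}
```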
For discrete distributions, the definition is the same with the integers \(\mathbb Z\) in place of the real numbers \(\mathbb R\).

All distributions provide accessors for the parameters used to create the distribution, and a mean and variance. The return value when the mean or variance is undefined is noted in the class javadoc.

    ChiSquaredDistribution chi2 = ChiSquaredDistribution.of(42);
    double df = chi2.getDegreesOfFreedom();    // 42
    double mean = chi2.getMean();              // 42
    double variance = chi2.getVariance();      // 84

    CauchyDistribution cauchy = CauchyDistribution.of(1.23, 4.56);
    double location = cauchy.getLocation();    // 1.23
    double scale = cauchy.getScale();          // 4.56
    double undefined1 = cauchy.getMean();      // NaN
    double undefined2 = cauchy.getVariance();  // NaN
The supported domain of the distribution is provided by the
getSupportLowerBound and getSupportUpperBound methods.

    BinomialDistribution b = BinomialDistribution.of(13, 0.15);
    int lower = b.getSupportLowerBound();  // 0
    int upper = b.getSupportUpperBound();  // 13
All distributions implement a createSampler method to support random sampling:

    // From Commons RNG Simple
    UniformRandomProvider rng = RandomSource.KISS.create(123L);

    NormalDistribution n = NormalDistribution.of(0, 1);
    double x = n.createSampler(rng).sample();

    // Generate a number of samples
    GeometricDistribution g = GeometricDistribution.of(0.75);
    int[] k = g.createSampler(rng).samples(100).toArray();
    // k.length == 100
Note that even when distributions are immutable, the sampler is not immutable as it
depends on the instance of the mutable UniformRandomProvider.

Implementation Details

Instances are constructed using factory methods, typically a static method in the
distribution class named of. Exceptions will be raised by the factory method when constructing the distribution using invalid parameters. See the class javadoc for exception conditions.

Unless otherwise noted, distribution instances are immutable. This allows sharing an instance between threads for computations.
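Exceptions will not be raised by distributions for an invalid x argument to probability functions.

An exception will be raised by distributions for an invalid p argument to inverse probability functions, that is, a value outside the range [0, 1].

Complementary Probabilities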
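The distributions provide the cumulative probability p and its complement, the survival probability q = 1 - p. When q is small, computing it as 1 - p can lose all accuracy. The difference is illustrated with the result of computing the upper tail of a probability distribution.

    ChiSquaredDistribution chi2 = ChiSquaredDistribution.of(42);
    double q1 = 1 - chi2.cumulativeProbability(168);
    double q2 = chi2.survivalProbability(168);
    // q1 == 0
    // q2 != 0

In this case the value of the cumulative probability rounds to exactly 1.0 in double precision, so 1 - p evaluates to zero. The survival probability is computed directly and retains a small non-zero value.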
Probability computations should use the appropriate cumulative or survival function
to calculate the lower or upper tail respectively. The same care should be applied
when inverting probability distributions. It is preferred to compute either
p or q directly, whichever corresponds to the smaller tail, and then invert the cumulative probability using p or the survival probability using q respectively.

    ChiSquaredDistribution chi2 = ChiSquaredDistribution.of(42);
    double q = 5.43e-17;
    // Incorrect: p = 1 - q == 1.0 !!!
    double x1 = chi2.inverseCumulativeProbability(1 - q);
    // Correct: invert q
    double x2 = chi2.inverseSurvivalProbability(q);
    // x1 == +infinity
    // x2 ~ 168.0
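The loss of accuracy is a property of double arithmetic rather than of any particular distribution. The JDK-only sketch below reproduces it with a standard exponential distribution (rate 1), chosen here only because its functions have simple closed forms; the class and method names are hypothetical and the code is unrelated to the library's implementation.

```java
// Hypothetical sketch using a standard exponential distribution.
final class TailSketch {
    /** Cumulative probability P(X <= x) = 1 - exp(-x). */
    static double cdf(double x) {
        return 1.0 - Math.exp(-x);
    }

    /** Survival probability P(X > x) = exp(-x), computed directly. */
    static double sf(double x) {
        return Math.exp(-x);
    }

    /** Inverse of the cumulative probability. */
    static double inverseCdf(double p) {
        return -Math.log(1.0 - p);
    }

    /** Inverse of the survival probability. */
    static double inverseSf(double q) {
        return -Math.log(q);
    }

    public static void main(String[] args) {
        // The upper tail via the complement cancels to zero
        System.out.println(1 - cdf(50));  // 0.0
        System.out.println(sf(50) > 0);   // true

        // Inverting a small upper tail probability
        double q = 5.43e-17;
        System.out.println(inverseCdf(1 - q)); // Infinity, since 1 - q == 1.0
        System.out.println(inverseSf(q));      // finite
    }
}
```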
Note: The survival probability functions were not present in the
earlier Commons Math distribution package.

Inference

Overview
The module provides test classes that implement a single statistical test, or a family of
tests. Each test class provides methods to compute a test statistic and a p-value for the
significance of the statistic. These can be computed together using a
test method that returns a SignificanceResult.
Alternatively a statistic method computes only the test statistic.

A test is obtained using the withDefaults() method. Options are then configured using the with methods shown in the examples.

Examples

A chi-square test that the observed counts conform to the expected frequencies.

    double[] expected = {0.25, 0.5, 0.25};
    long[] observed = {57, 123, 38};

    SignificanceResult result = ChiSquareTest.withDefaults()
        .test(expected, observed);
    result.getPValue();    // 0.0316148
    result.reject(0.05);   // true
    result.reject(0.01);   // false

A paired t-test that the marks in the math exam were greater than the science exam. This fails to reject the null hypothesis (that there was no difference) with 95% confidence.

    double[] math    = {53, 69, 65, 65, 67, 79, 86, 65, 62, 69};   // mean = 68.0
    double[] science = {75, 65, 68, 63, 55, 65, 73, 45, 51, 52};   // mean = 61.2

    SignificanceResult result = TTest.withDefaults()
        .with(AlternativeHypothesis.GREATER_THAN)
        .pairedTest(math, science);
    result.getPValue();    // 0.05764
    result.reject(0.05);   // false

A G-test that the allele frequencies conform to the expected Hardy-Weinberg proportions. This is an example of an intrinsic hypothesis where the expected frequencies are computed using the observations and the degrees of freedom must be adjusted. The data is from McDonald (1989) Selection component analysis of the Mpi locus in the amphipod Platorchestia platensis. Heredity 62: 243-249.

    // Allele frequencies: Mpi 90/90, Mpi 90/100, Mpi 100/100
    long[] observed = {1203, 2919, 1678};
    // Mpi 90 proportion
    double p = (2.0 * observed[0] + observed[1]) /
               (2 * Arrays.stream(observed).sum());   // 5325 / 11600 = 0.459

    // Hardy-Weinberg proportions
    double[] expected = {p * p, 2 * p * (1 - p), (1 - p) * (1 - p)};
    // 0.211, 0.497, 0.293

    SignificanceResult result = GTest.withDefaults()
        .withDegreesOfFreedomAdjustment(1)
        .test(expected, observed);
    result.getStatistic();   // 1.03
    result.getPValue();      // 0.309
    result.reject(0.05);     // false

A one-way analysis of variance test. This is an example where the result has more information than the test statistic and the p-value.
The data is from McDonald et al (1991) Allozymes and morphometric characters of three species of Mytilus in the Northern and Southern Hemispheres. Marine Biology 111: 323-333.

    double[] tillamook = {0.0571, 0.0813, 0.0831, 0.0976, 0.0817, 0.0859, 0.0735, 0.0659, 0.0923, 0.0836};
    double[] newport = {0.0873, 0.0662, 0.0672, 0.0819, 0.0749, 0.0649, 0.0835, 0.0725};
    double[] petersburg = {0.0974, 0.1352, 0.0817, 0.1016, 0.0968, 0.1064, 0.105};
    double[] magadan = {0.1033, 0.0915, 0.0781, 0.0685, 0.0677, 0.0697, 0.0764, 0.0689};
    double[] tvarminne = {0.0703, 0.1026, 0.0956, 0.0973, 0.1039, 0.1045};

    Collection<double[]> data = Arrays.asList(tillamook, newport, petersburg, magadan, tvarminne);
    OneWayAnova.Result result = OneWayAnova.withDefaults()
        .test(data);
    result.getStatistic();   // 7.12
    result.getPValue();      // 2.8e-4
    result.reject(0.001);    // true

The result also provides the between and within group degrees of freedom and the mean squares, allowing the results to be reported in a table.
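The quantities in such a table can be sketched with the JDK alone using the textbook one-way ANOVA formulas. This is an illustration under that assumption, not the library's implementation: the class and method names are hypothetical, and the p-value is omitted because it requires the F distribution.

```java
import java.util.Arrays;
import java.util.Collection;

// Hypothetical sketch, not part of Commons Statistics.
final class AnovaTableSketch {
    /** Returns the example data used in the guide. */
    static Collection<double[]> exampleData() {
        return Arrays.asList(
            new double[] {0.0571, 0.0813, 0.0831, 0.0976, 0.0817, 0.0859, 0.0735, 0.0659, 0.0923, 0.0836},
            new double[] {0.0873, 0.0662, 0.0672, 0.0819, 0.0749, 0.0649, 0.0835, 0.0725},
            new double[] {0.0974, 0.1352, 0.0817, 0.1016, 0.0968, 0.1064, 0.105},
            new double[] {0.1033, 0.0915, 0.0781, 0.0685, 0.0677, 0.0697, 0.0764, 0.0689},
            new double[] {0.0703, 0.1026, 0.0956, 0.0973, 0.1039, 0.1045});
    }

    /** Returns {dfBetween, dfWithin, msBetween, msWithin, F}. */
    static double[] table(Collection<double[]> groups) {
        int n = 0;
        double total = 0;
        for (double[] g : groups) {
            n += g.length;
            total += Arrays.stream(g).sum();
        }
        double grandMean = total / n;

        double ssBetween = 0;
        double ssWithin = 0;
        for (double[] g : groups) {
            double mean = Arrays.stream(g).average().orElse(Double.NaN);
            ssBetween += g.length * (mean - grandMean) * (mean - grandMean);
            for (double v : g) {
                ssWithin += (v - mean) * (v - mean);
            }
        }

        int dfBetween = groups.size() - 1;
        int dfWithin = n - groups.size();
        double msBetween = ssBetween / dfBetween;
        double msWithin = ssWithin / dfWithin;
        return new double[] {dfBetween, dfWithin, msBetween, msWithin, msBetween / msWithin};
    }

    public static void main(String[] args) {
        double[] t = table(exampleData());
        System.out.printf("%-8s %4s %12s%n", "Source", "df", "Mean square");
        System.out.printf("%-8s %4.0f %12.6f%n", "Between", t[0], t[2]);
        System.out.printf("%-8s %4.0f %12.6f%n", "Within", t[1], t[3]);
        System.out.printf("F = %.2f%n", t[4]);
    }
}
```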
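Ranking

The NaturalRanking class assigns ranks based on the natural ordering of the values, starting from 1 for the smallest value:

    NaturalRanking ranking = new NaturalRanking();
    ranking.apply(new double[] {5, 6, 7, 8});   // 1, 2, 3, 4
    ranking.apply(new double[] {8, 5, 7, 6});   // 4, 1, 3, 2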
The special case of Double.NaN is handled using the configured NaNStrategy. The default is to raise an exception.

    double[] data = new double[] {6, 5, Double.NaN, 7};
    new NaturalRanking().apply(data);                      // IllegalArgumentException
    new NaturalRanking(NaNStrategy.MINIMAL).apply(data);   // (3, 2, 1, 4)
    new NaturalRanking(NaNStrategy.MAXIMAL).apply(data);   // (2, 1, 4, 3)
    new NaturalRanking(NaNStrategy.REMOVED).apply(data);   // (2, 1, 3)
    new NaturalRanking(NaNStrategy.FIXED).apply(data);     // (2, 1, NaN, 3)
    new NaturalRanking(NaNStrategy.FAILED).apply(data);    // IllegalArgumentException
Ties are handled using the configured TiesStrategy. The default is to use the average of the tied ranks.

    double[] data = new double[] {7, 5, 7, 6};
    new NaturalRanking().apply(data);                          // (3.5, 1, 3.5, 2)
    new NaturalRanking(TiesStrategy.SEQUENTIAL).apply(data);   // (3, 1, 4, 2)
    new NaturalRanking(TiesStrategy.MINIMUM).apply(data);      // (3, 1, 3, 2)
    new NaturalRanking(TiesStrategy.MAXIMUM).apply(data);      // (4, 1, 4, 2)
    new NaturalRanking(TiesStrategy.AVERAGE).apply(data);      // (3.5, 1, 3.5, 2)
    new NaturalRanking(TiesStrategy.RANDOM).apply(data);       // (3, 1, 4, 2) or (4, 1, 3, 2)
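The average strategy can be sketched with the JDK alone: sort the positions by value, then give every run of equal values the mean of the ranks that the run occupies. The code below is a minimal illustration under the assumption that the input contains no NaN values; it is not the library's implementation and the class name is hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch, not part of Commons Statistics.
final class AverageRankSketch {
    /** Ranks the data from 1, assigning tied values the average of their ranks. */
    static double[] rank(double[] data) {
        // Sort the positions by their value
        Integer[] order = new Integer[data.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> Double.compare(data[a], data[b]));

        double[] ranks = new double[data.length];
        int start = 0;
        while (start < order.length) {
            // Find the end of the run of values equal to the value at start
            int end = start;
            while (end + 1 < order.length && data[order[end + 1]] == data[order[start]]) {
                end++;
            }
            // The run occupies ranks start + 1 to end + 1; assign their average
            double average = (start + 1 + end + 1) / 2.0;
            for (int k = start; k <= end; k++) {
                ranks[order[k]] = average;
            }
            start = end + 1;
        }
        return ranks;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(rank(new double[] {7, 5, 7, 6})));
        // [3.5, 1.0, 3.5, 2.0]
    }
}
```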
The source of randomness defaults to a system supplied generator. The randomness can be
provided as a function that returns random integers, for example a method reference to nextInt:

    double[] data = new double[] {7, 5, 7, 6};
    new NaturalRanking(TiesStrategy.RANDOM).apply(data);
    new NaturalRanking(new SplittableRandom()::nextInt).apply(data);
    // From Commons RNG
    UniformRandomProvider rng = RandomSource.KISS.create();
    new NaturalRanking(rng::nextInt).apply(data);
    // ranks: (3, 1, 4, 2) or (4, 1, 3, 2)