pyspark.RDD.repartitionAndSortWithinPartitions

RDD.repartitionAndSortWithinPartitions(numPartitions=None, partitionFunc=<function portable_hash>, ascending=True, keyfunc=<function RDD.<lambda>>)

Repartition the RDD according to the given partitioner and, within each resulting partition, sort records by their keys.

New in version 1.2.0.

Parameters

numPartitions (int, optional): the number of partitions in the new RDD
partitionFunc (function, optional, default portable_hash): a function to compute the partition index from a record's key
ascending (bool, optional, default True): whether to sort the keys in ascending (True) or descending (False) order
keyfunc (function, optional, default identity mapping): a function applied to each record's key to compute the value used for sorting

Returns

RDD: a new RDD

See also

RDD.repartition()
RDD.partitionBy()
RDD.sortBy()
RDD.sortByKey()

Examples

>>> rdd = sc.parallelize([(0, 5), (3, 8), (2, 6), (0, 8), (3, 8), (1, 3)])
>>> rdd2 = rdd.repartitionAndSortWithinPartitions(2, lambda x: x % 2, True)
>>> rdd2.glom().collect()
[[(0, 5), (0, 8), (2, 6)], [(1, 3), (3, 8), (3, 8)]]
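
The behaviour shown above can be pictured with a small pure-Python model that needs no Spark installation. This is only an illustrative sketch, not Spark's implementation: the helper name `repartition_and_sort_model` is hypothetical, Python's built-in `hash` stands in for `portable_hash`, and the sketch assumes the partition index is the result of `partitionFunc` taken modulo the number of partitions. It assigns each (key, value) record to a partition, then sorts each partition by `keyfunc(key)`.

```python
def repartition_and_sort_model(records, num_partitions, partition_func=hash,
                               ascending=True, keyfunc=lambda k: k):
    """Hypothetical in-memory model of repartition-then-sort-within-partitions."""
    # One bucket per target partition.
    partitions = [[] for _ in range(num_partitions)]
    # Step 1: route each record to a partition based on its key.
    for key, value in records:
        index = partition_func(key) % num_partitions
        partitions[index].append((key, value))
    # Step 2: sort each partition independently by keyfunc(key).
    for part in partitions:
        part.sort(key=lambda kv: keyfunc(kv[0]), reverse=not ascending)
    return partitions


records = [(0, 5), (3, 8), (2, 6), (0, 8), (3, 8), (1, 3)]

# Same arguments as the example above: 2 partitions, even/odd split, ascending.
result = repartition_and_sort_model(records, 2, lambda x: x % 2, True)
print(result)  # [[(0, 5), (0, 8), (2, 6)], [(1, 3), (3, 8), (3, 8)]]

# Descending order within each partition.
descending = repartition_and_sort_model(records, 2, lambda x: x % 2, False)
print([[k for k, _ in part] for part in descending])  # [[2, 0, 0], [3, 3, 1]]
```

Note that the records are ordered only within each partition; there is no ordering guarantee across partitions, which is what distinguishes this operation from a global sort such as RDD.sortByKey().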