pyspark.pandas.DataFrame.fillna¶

DataFrame.fillna(value: Union[Any, Dict[Union[Any, Tuple[Any, …]], Any], None] = None, method: Optional[str] = None, axis: Union[int, str, None] = None, inplace: bool = False, limit: Optional[int] = None) → Optional[pyspark.pandas.frame.DataFrame][source]¶

Fill NA/NaN values.

Note

the current implementation of ‘method’ parameter in fillna uses Spark’s Window without specifying partition specification. This leads to move all data into single partition in single machine and could cause serious performance degradation. Avoid this method against very large dataset.

Parameters

valuescalar, dict, Series: Value to use to fill holes. alternately a dict/Series of values specifying which value to use for each column. DataFrame is not supported.
method{‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None: Method to use for filling holes in reindexed Series pad / ffill: propagate last valid observation forward to next valid backfill / bfill: use NEXT valid observation to fill gap
axis{0 or index}: 1 and columns are not supported.
inplaceboolean, default False: Fill in place (do not create a new object)
limitint, default None: If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None

Returns

DataFrame: DataFrame with NA entries filled.

Examples

>>> df = ps.DataFrame({
...     'A': [None, 3, None, None],
...     'B': [2, 4, None, 3],
...     'C': [None, None, None, 1],
...     'D': [0, 1, 5, 4]
...     },
...     columns=['A', 'B', 'C', 'D'])
>>> df
     A    B    C  D
0  NaN  2.0  NaN  0
1  3.0  4.0  NaN  1
2  NaN  NaN  NaN  5
3  NaN  3.0  1.0  4

Replace all NaN elements with 0s.

>>> df.fillna(0)
     A    B    C  D
0  0.0  2.0  0.0  0
1  3.0  4.0  0.0  1
2  0.0  0.0  0.0  5
3  0.0  3.0  1.0  4

We can also propagate non-null values forward or backward.

>>> df.fillna(method='ffill')
     A    B    C  D
0  NaN  2.0  NaN  0
1  3.0  4.0  NaN  1
2  3.0  4.0  NaN  5
3  3.0  3.0  1.0  4

Replace all NaN elements in column ‘A’, ‘B’, ‘C’, and ‘D’, with 0, 1, 2, and 3 respectively.

>>> values = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
>>> df.fillna(value=values)
     A    B    C  D
0  0.0  2.0  2.0  0
1  3.0  4.0  2.0  1
2  0.0  1.0  2.0  5
3  0.0  3.0  1.0  4

pyspark.pandas.DataFrame.dropna pyspark.pandas.DataFrame.replace