pyspark.pandas.DataFrame.pipe

DataFrame.pipe(func: Callable[[…], Any], *args: Any, **kwargs: Any) → Any

Apply func(self, *args, **kwargs).

Parameters
func: function

Function to apply to the DataFrame. args and kwargs are passed into func. Alternatively, a (callable, data_keyword) tuple, where data_keyword is a string naming the keyword argument of callable that expects the DataFrame.

args: iterable, optional

positional arguments passed into func.

kwargs: mapping, optional

a dictionary of keyword arguments passed into func.

Returns
object: the return type of func.

Notes

Use .pipe when chaining together functions that expect Series, DataFrames or GroupBy objects. For example, given

>>> import pyspark.pandas as ps
>>> df = ps.DataFrame({'category': ['A', 'A', 'B'],
...                    'col1': [1, 2, 3],
...                    'col2': [4, 5, 6]},
...                   columns=['category', 'col1', 'col2'])
>>> def keep_category_a(df):
...     return df[df['category'] == 'A']
>>> def add_one(df, column):
...     return df.assign(col3=df[column] + 1)
>>> def multiply(df, column1, column2):
...     return df.assign(col4=df[column1] * df[column2])

instead of writing

>>> multiply(add_one(keep_category_a(df), column="col1"), column1="col2", column2="col3")
  category  col1  col2  col3  col4
0        A     1     4     2     8
1        A     2     5     3    15

you can write

>>> (df.pipe(keep_category_a)
...    .pipe(add_one, column="col1")
...    .pipe(multiply, column1="col2", column2="col3")
... )
  category  col1  col2  col3  col4
0        A     1     4     2     8
1        A     2     5     3    15

If you have a function that takes the data as the second argument, pass a (callable, data_keyword) tuple indicating which keyword expects the data. For example, suppose multiply_2 takes its data as df:

>>> def multiply_2(column1, df, column2):
...     return df.assign(col4=df[column1] * df[column2])

Then you can write

>>> (df.pipe(keep_category_a)
...    .pipe(add_one, column="col1")
...    .pipe((multiply_2, 'df'), column1="col2", column2="col3")
... )
  category  col1  col2  col3  col4
0        A     1     4     2     8
1        A     2     5     3    15
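The two calling conventions described above (plain callable versus a (callable, data_keyword) tuple) can be illustrated with a minimal pure-Python sketch. It is not the library's source: pipe_sketch, scale, and scale_data_second are hypothetical names introduced only for illustration, plain lists stand in for a DataFrame, and the ValueError on a keyword clash is an assumption about how such a conflict would be handled.

```python
from typing import Any, Callable, Tuple, Union

PipeTarget = Union[Callable[..., Any], Tuple[Callable[..., Any], str]]


def pipe_sketch(obj: Any, func: PipeTarget, *args: Any, **kwargs: Any) -> Any:
    """Hypothetical stand-in showing how a pipe call could dispatch."""
    if isinstance(func, tuple):
        # Tuple form: (callable, data_keyword). The data is passed by keyword.
        func, data_keyword = func
        if data_keyword in kwargs:
            # Assumption: supplying the data keyword twice is treated as an error.
            raise ValueError(
                f"{data_keyword} is both the pipe target and a keyword argument"
            )
        kwargs[data_keyword] = obj
        return func(*args, **kwargs)
    # Plain form: the data is passed as the first positional argument.
    return func(obj, *args, **kwargs)


def scale(data, factor):
    # Takes the data first, so it can be piped directly.
    return [x * factor for x in data]


def scale_data_second(factor, data):
    # Takes the data second, so it needs the tuple form.
    return [x * factor for x in data]


first = pipe_sketch([1, 2, 3], scale, factor=10)
second = pipe_sketch([1, 2, 3], (scale_data_second, "data"), factor=10)
```

Both calls produce the same result. The tuple form only changes how the data reaches the function, not what the function computes.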

You can use a lambda as well:

>>> ps.Series([1, 2, 3]).pipe(lambda x: (x + 1).rename("value"))
0    2
1    3
2    4
Name: value, dtype: int64