pyspark.pandas.Series.diff

Series.diff(periods=1)

First discrete difference of element.

Calculates the difference of a Series element compared with another element in the Series (default is the element in the previous row).

Note

The current implementation of diff uses Spark's Window without specifying a partition specification. This moves all data into a single partition on a single machine and could cause serious performance degradation. Avoid this method with very large datasets.

Parameters
periods : int, default 1

Periods to shift for calculating difference, accepts negative values.

Returns
diffed : Series

Examples

>>> import pyspark.pandas as ps
>>> df = ps.DataFrame({'a': [1, 2, 3, 4, 5, 6],
...                    'b': [1, 1, 2, 3, 5, 8],
...                    'c': [1, 4, 9, 16, 25, 36]}, columns=['a', 'b', 'c'])
>>> df
   a  b   c
0  1  1   1
1  2  1   4
2  3  2   9
3  4  3  16
4  5  5  25
5  6  8  36
>>> df.b.diff()
0    NaN
1    0.0
2    1.0
3    1.0
4    2.0
5    3.0
Name: b, dtype: float64

Difference with the value three rows earlier

>>> df.c.diff(periods=3)
0     NaN
1     NaN
2     NaN
3    15.0
4    21.0
5    27.0
Name: c, dtype: float64

Difference with following value

>>> df.c.diff(periods=-1)
0    -3.0
1    -5.0
2    -7.0
3    -9.0
4   -11.0
5     NaN
Name: c, dtype: float64
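
The semantics shown in the examples above can be sketched in plain Python. This is only an illustration of how `periods` selects the element being subtracted, not how pandas-on-Spark implements `diff`; the helper name `diff_sketch` is hypothetical and operates on an ordinary list rather than a Series.

```python
import math


def diff_sketch(values, periods=1):
    """Hypothetical helper: values[i] - values[i - periods], NaN if out of range."""
    n = len(values)
    result = []
    for i in range(n):
        j = i - periods  # index of the element subtracted from values[i]
        if 0 <= j < n:
            result.append(float(values[i] - values[j]))
        else:
            result.append(math.nan)  # no element exists at that offset
    return result


c = [1, 4, 9, 16, 25, 36]
forward = diff_sketch(c, periods=3)    # compares with the value three rows earlier
backward = diff_sketch(c, periods=-1)  # compares with the following value
```

Under these assumptions, `forward` and `backward` hold the same values as the `df.c.diff(periods=3)` and `df.c.diff(periods=-1)` outputs above, with NaN wherever the offset falls outside the data.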