12/11/2015

Multicolumn aggregation in pandas using function over multiple columns (weighted average example)

Here is how to perform multicolumn aggregation over a dataframe with user defined functions that depends on data in other dataframe columns. Code below is an example of weighted average implementation.
Let say we have the following dataframe

In [149]: df
Out[149]: 
                       revenue  profit
Level1 Level2 Company                 
g1     sg1    1st           10       5
              2nd           11       3
       sg2    3rd           14       7
g2     sg3    3rd           14       7
              1st           10       5
g3     NaN    2nd           11       3
              3rd           14       7
In case we want to calculate weighted average it is quite straiforward with apply function:

In [153]: def w_avg(group):
   .....:     d = group['profit']
   .....:     w = group['revenue']
   .....:     return (d * w).sum() / w.sum()
   .....: 

In [154]: df.groupby(level='Level2').apply(w_avg)
Out[154]: 
Level2
sg1    3
sg2    7
sg3    6
dtype: int64
ok simple, but what if we want to make multi aggregation - aggregate each column with each own funtion. The way to do this in pandas is using agg function like this:

In [155]: df.groupby(level='Level2').agg({'profit': np.mean, 'revenue': np.sum})
Out[155]: 
        profit  revenue
Level2                 
sg1          4       21
sg2          7       14
sg3          6       24
What if we want to pass our weighted average as aggregation function to one of the columns? Here is a way to do this:

In [156]: import functools

In [157]: def wavg(g, df, weight_series):
   .....:     w = df.ix[g.index][weight_series]
   .....:     return (g * w).sum() / w.sum()
   .....: 

In [160]: fnc = functools.partial(wavg, df=df, weight_series='revenue')

In [161]: df.groupby(level='Level2').agg({'revenue': np.sum, 'profit': fnc})
Out[161]: 
        profit  revenue
Level2                 
sg1          3       21
sg2          7       14
sg3          6       24

3 comments:

  1. Hi.. can you explain a bit more your last screen? Why you can not use directly 'profit': wavg? What functools is used for?

    ReplyDelete
    Replies
    1. This is because function you pass to agg operates over a single series. We obtain such a function using partial application over wavg. So after [160] we have fnc as a function of a single argument and can pass it to agg. You can avoid using functools if you hardcode your dataframe and a column to weight on into your version of wavg function. The code above is a flexible version :)

      Delete
  2. I tried this on a a dataframe with ~ 120000 rows and ~ 10000 groups. And I found that this timed to about 2 mins per loop as compared with a plain `np.mean` restricting to a single `series` object. I commented on [this](https://github.com/pandas-dev/pandas/issues/10030) referencing your post in search of a faster answer. Thanks.

    ReplyDelete