Group and Aggregation

  • Aggretation and summarization require creating DataFrameGroupBy object from existing DataFrame
  • The GroupBy object is a very flexible abstraction. In many ways, you can simply treat it as if it's a collection of DataFrames, and it does the difficult things under the hood

In [243]:

company = pd.read_csv('data/company.csv')
company.head()

Out[243]:

  Company Department      Name  Age  Salary  Birthdate
0      C1         D1      Yong   45   15000   1/1/1970
1      C1         D1      Chew   35   12000   2/1/1980
2      C1         D2       Lim   34    8000  2/19/1977
3      C1         D3     Jessy   23    2500  3/15/1990
4      C1         D3  Hoi Ming   55   25000  4/15/1987

Creating Groups

In [244]:

com_grp = company.groupby(['Company','Department'])

In [245]:

com_grp

Out[245]:

<pandas.core.groupby.DataFrameGroupBy object at 0x000002C3255294A8>

Properties

Number of Groups Created

In [246]:

com_grp.ngroups

Out[246]:

9

Row Numbers Associated For Each Group

In [247]:

com_grp.groups  # return Dictionary

Out[247]:

{('C1', 'D1'): Int64Index([0, 1], dtype='int64'),
 ('C1', 'D2'): Int64Index([2], dtype='int64'),
 ('C1', 'D3'): Int64Index([3, 4, 5], dtype='int64'),
 ('C2', 'D1'): Int64Index([6], dtype='int64'),
 ('C2', 'D2'): Int64Index([7, 8, 9], dtype='int64'),
 ('C2', 'D3'): Int64Index([10, 11, 12], dtype='int64'),
 ('C3', 'D1'): Int64Index([14], dtype='int64'),
 ('C3', 'D2'): Int64Index([15], dtype='int64'),
 ('C3', 'D3'): Int64Index([13, 16, 17], dtype='int64')}

Methods

Number of Rows In Each Group

In [248]:

com_grp.size()  # return panda Series object

Out[248]:

Company  Department
C1       D1            2
         D2            1
         D3            3
C2       D1            1
         D2            3
         D3            3
C3       D1            1
         D2            1
         D3            3
dtype: int64

Valid (not Null) Data Count For Each Fields In The Group

In [249]:

com_grp.count()  # return panda DataFrame object

Out[249]:

                    Name  Age  Salary  Birthdate
Company Department                              
C1      D1             2    2       2          2
        D2             1    1       1          1
        D3             3    3       3          3
C2      D1             1    1       1          1
        D2             3    3       3          3
        D3             3    3       3          3
C3      D1             1    1       1          1
        D2             1    1       1          1
        D3             3    3       3          3

Retrieve Rows

All row retrieval operations return a dataframe

Retrieve N Rows For Each Groups

Example below retrieve 2 rows from each group

In [250]:

com_grp.head(2)

Out[250]:

   Company Department      Name  Age  Salary   Birthdate
0       C1         D1      Yong   45   15000    1/1/1970
1       C1         D1      Chew   35   12000    2/1/1980
2       C1         D2       Lim   34    8000   2/19/1977
3       C1         D3     Jessy   23    2500   3/15/1990
4       C1         D3  Hoi Ming   55   25000   4/15/1987
..     ...        ...       ...  ...     ...         ...
11      C2         D3   Jeannie   30   12500  12/31/1980
13      C3         D3     Chang   32    7900   7/26/1973
14      C3         D1       Ong   44   17500   8/21/1980
15      C3         D2      Lily   41   15300   7/17/1990
16      C3         D3     Sally   54   21000   7/19/1968

[14 rows x 6 columns]

Retrieve Rows In One Specific Group

In [251]:

com_grp.get_group(('C1','D3'))

Out[251]:

  Company Department      Name  Age  Salary  Birthdate
3      C1         D3     Jessy   23    2500  3/15/1990
4      C1         D3  Hoi Ming   55   25000  4/15/1987
5      C1         D3   Sui Wei   56    3000  6/15/1990

Retrieve n-th Row From Each Group

Row number is 0-based

In [252]:

com_grp.nth(-1)    # retireve last row from each group

Out[252]:

                    Age   Birthdate     Name  Salary
Company Department                                  
C1      D1           35    2/1/1980     Chew   12000
        D2           34   2/19/1977      Lim    8000
        D3           56   6/15/1990  Sui Wei    3000
C2      D1           18   7/15/1997     Anne     400
        D2           46  10/31/1988    Jimmy   14000
        D3           29   12/1/1963  Bernard    9800
C3      D1           44   8/21/1980      Ong   17500
        D2           41   7/17/1990     Lily   15300
        D3           37   3/16/1969   Esther   13500

Iteration

DataFrameGroupBy object can be thought as a collection of named groups

In [253]:

def print_groups (g):
    for name,group in g:
        print (name)
        print (group[:2])

print_groups (com_grp)
('C1', 'D1')
  Company Department  Name  Age  Salary Birthdate
0      C1         D1  Yong   45   15000  1/1/1970
1      C1         D1  Chew   35   12000  2/1/1980
('C1', 'D2')
  Company Department Name  Age  Salary  Birthdate
2      C1         D2  Lim   34    8000  2/19/1977
('C1', 'D3')
  Company Department      Name  Age  Salary  Birthdate
3      C1         D3     Jessy   23    2500  3/15/1990
4      C1         D3  Hoi Ming   55   25000  4/15/1987
('C2', 'D1')
  Company Department  Name  Age  Salary  Birthdate
6      C2         D1  Anne   18     400  7/15/1997
('C2', 'D2')
  Company Department     Name  Age  Salary  Birthdate
7      C2         D2  Deborah   30    8600  8/15/1984
8      C2         D2  Nikalus   51   12000  9/18/2000
('C2', 'D3')
   Company Department     Name  Age  Salary   Birthdate
10      C2         D3  Michael   38   17000  11/30/1997
11      C2         D3  Jeannie   30   12500  12/31/1980
('C3', 'D1')
   Company Department Name  Age  Salary  Birthdate
14      C3         D1  Ong   44   17500  8/21/1980
('C3', 'D2')
   Company Department  Name  Age  Salary  Birthdate
15      C3         D2  Lily   41   15300  7/17/1990
('C3', 'D3')
   Company Department   Name  Age  Salary  Birthdate
13      C3         D3  Chang   32    7900  7/26/1973
16      C3         D3  Sally   54   21000  7/19/1968

In [254]:

com_grp

Out[254]:

<pandas.core.groupby.DataFrameGroupBy object at 0x000002C3255294A8>

Apply Aggregate Functions to Groups

Aggregate apply functions to columns in every groups, and return a summary data for each group

Apply One Function to One or More Columns

In [255]:

com_grp['Age'].sum()

Out[255]:

Company  Department
C1       D1             80
         D2             34
         D3            134
C2       D1             18
         D2            127
         D3             97
C3       D1             44
         D2             41
         D3            123
Name: Age, dtype: int64

In [256]:

com_grp[['Age','Salary']].sum()

Out[256]:

                    Age  Salary
Company Department             
C1      D1           80   27000
        D2           34    8000
        D3          134   30500
C2      D1           18     400
        D2          127   34600
        D3           97   39300
C3      D1           44   17500
        D2           41   15300
        D3          123   42400

Apply One or More Functions To All Columns

In [257]:

com_grp.agg(np.mean)

Out[257]:

                          Age        Salary
Company Department                         
C1      D1          40.000000  13500.000000
        D2          34.000000   8000.000000
        D3          44.666667  10166.666667
C2      D1          18.000000    400.000000
        D2          42.333333  11533.333333
        D3          32.333333  13100.000000
C3      D1          44.000000  17500.000000
        D2          41.000000  15300.000000
        D3          41.000000  14133.333333

In [258]:

com_grp.agg([np.mean,np.sum])

Out[258]:

                          Age             Salary       
                         mean  sum          mean    sum
Company Department                                     
C1      D1          40.000000   80  13500.000000  27000
        D2          34.000000   34   8000.000000   8000
        D3          44.666667  134  10166.666667  30500
C2      D1          18.000000   18    400.000000    400
        D2          42.333333  127  11533.333333  34600
        D3          32.333333   97  13100.000000  39300
C3      D1          44.000000   44  17500.000000  17500
        D2          41.000000   41  15300.000000  15300
        D3          41.000000  123  14133.333333  42400

Apply Different Functions To Different Columns

In [259]:

com_grp.agg({'Age':np.mean, 'Salary': [np.min,np.max]})

Out[259]:

                          Age Salary       
                         mean   amin   amax
Company Department                         
C1      D1          40.000000  12000  15000
        D2          34.000000   8000   8000
        D3          44.666667   2500  25000
C2      D1          18.000000    400    400
        D2          42.333333   8600  14000
        D3          32.333333   9800  17000
C3      D1          44.000000  17500  17500
        D2          41.000000  15300  15300
        D3          41.000000   7900  21000

Transform

  • Transform is an operation used combined with DataFrameGroupBy object
  • transform() return a new DataFrame object

In [260]:

grp = company.groupby('Company')
grp.size()

Out[260]:

Company
C1    6
C2    7
C3    5
dtype: int64

transform() perform a function to a group, and expands and replicate it to multiple rows according to original DataFrame

In [261]:

grp[['Age','Salary']].transform('sum')

Out[261]:

    Age  Salary
0   248   65500
1   248   65500
2   248   65500
3   248   65500
4   248   65500
..  ...     ...
13  208   75200
14  208   75200
15  208   75200
16  208   75200
17  208   75200

[18 rows x 2 columns]

In [262]:

grp.transform( lambda x:x+10 )

Out[262]:

    Age  Salary
0    55   15010
1    45   12010
2    44    8010
3    33    2510
4    65   25010
..  ...     ...
13   42    7910
14   54   17510
15   51   15310
16   64   21010
17   47   13510

[18 rows x 2 columns]

results matching ""

    No results matching ""