Group and Aggregation¶
- Aggretation and summarization require creating DataFrameGroupBy object from existing DataFrame
- The GroupBy object is a very flexible abstraction. In many ways, you can simply treat it as if it's a collection of DataFrames, and it does the difficult things under the hood
In [243]:
company = pd.read_csv('data/company.csv')
company.head()
Out[243]:
Company Department Name Age Salary Birthdate 0 C1 D1 Yong 45 15000 1/1/1970 1 C1 D1 Chew 35 12000 2/1/1980 2 C1 D2 Lim 34 8000 2/19/1977 3 C1 D3 Jessy 23 2500 3/15/1990 4 C1 D3 Hoi Ming 55 25000 4/15/1987
Creating Groups¶
In [244]:
com_grp = company.groupby(['Company','Department'])
In [245]:
com_grp
Out[245]:
<pandas.core.groupby.DataFrameGroupBy object at 0x000002C3255294A8>
Properties¶
Number of Groups Created¶
In [246]:
com_grp.ngroups
Out[246]:
9
Row Numbers Associated For Each Group¶
In [247]:
com_grp.groups # return Dictionary
Out[247]:
{('C1', 'D1'): Int64Index([0, 1], dtype='int64'),
('C1', 'D2'): Int64Index([2], dtype='int64'),
('C1', 'D3'): Int64Index([3, 4, 5], dtype='int64'),
('C2', 'D1'): Int64Index([6], dtype='int64'),
('C2', 'D2'): Int64Index([7, 8, 9], dtype='int64'),
('C2', 'D3'): Int64Index([10, 11, 12], dtype='int64'),
('C3', 'D1'): Int64Index([14], dtype='int64'),
('C3', 'D2'): Int64Index([15], dtype='int64'),
('C3', 'D3'): Int64Index([13, 16, 17], dtype='int64')}
Methods¶
Number of Rows In Each Group¶
In [248]:
com_grp.size() # return panda Series object
Out[248]:
Company Department
C1 D1 2
D2 1
D3 3
C2 D1 1
D2 3
D3 3
C3 D1 1
D2 1
D3 3
dtype: int64
Valid (not Null) Data Count For Each Fields In The Group¶
In [249]:
com_grp.count() # return panda DataFrame object
Out[249]:
Name Age Salary Birthdate
Company Department
C1 D1 2 2 2 2
D2 1 1 1 1
D3 3 3 3 3
C2 D1 1 1 1 1
D2 3 3 3 3
D3 3 3 3 3
C3 D1 1 1 1 1
D2 1 1 1 1
D3 3 3 3 3
Retrieve Rows¶
All row retrieval operations return a dataframe
Retrieve N Rows For Each Groups¶
Example below retrieve 2 rows from each group
In [250]:
com_grp.head(2)
Out[250]:
Company Department Name Age Salary Birthdate 0 C1 D1 Yong 45 15000 1/1/1970 1 C1 D1 Chew 35 12000 2/1/1980 2 C1 D2 Lim 34 8000 2/19/1977 3 C1 D3 Jessy 23 2500 3/15/1990 4 C1 D3 Hoi Ming 55 25000 4/15/1987 .. ... ... ... ... ... ... 11 C2 D3 Jeannie 30 12500 12/31/1980 13 C3 D3 Chang 32 7900 7/26/1973 14 C3 D1 Ong 44 17500 8/21/1980 15 C3 D2 Lily 41 15300 7/17/1990 16 C3 D3 Sally 54 21000 7/19/1968 [14 rows x 6 columns]
Retrieve Rows In One Specific Group¶
In [251]:
com_grp.get_group(('C1','D3'))
Out[251]:
Company Department Name Age Salary Birthdate 3 C1 D3 Jessy 23 2500 3/15/1990 4 C1 D3 Hoi Ming 55 25000 4/15/1987 5 C1 D3 Sui Wei 56 3000 6/15/1990
Retrieve n-th Row From Each Group¶
Row number is 0-based
In [252]:
com_grp.nth(-1) # retireve last row from each group
Out[252]:
Age Birthdate Name Salary
Company Department
C1 D1 35 2/1/1980 Chew 12000
D2 34 2/19/1977 Lim 8000
D3 56 6/15/1990 Sui Wei 3000
C2 D1 18 7/15/1997 Anne 400
D2 46 10/31/1988 Jimmy 14000
D3 29 12/1/1963 Bernard 9800
C3 D1 44 8/21/1980 Ong 17500
D2 41 7/17/1990 Lily 15300
D3 37 3/16/1969 Esther 13500
Iteration¶
DataFrameGroupBy object can be thought as a collection of named groups
In [253]:
def print_groups (g):
for name,group in g:
print (name)
print (group[:2])
print_groups (com_grp)
('C1', 'D1')
Company Department Name Age Salary Birthdate
0 C1 D1 Yong 45 15000 1/1/1970
1 C1 D1 Chew 35 12000 2/1/1980
('C1', 'D2')
Company Department Name Age Salary Birthdate
2 C1 D2 Lim 34 8000 2/19/1977
('C1', 'D3')
Company Department Name Age Salary Birthdate
3 C1 D3 Jessy 23 2500 3/15/1990
4 C1 D3 Hoi Ming 55 25000 4/15/1987
('C2', 'D1')
Company Department Name Age Salary Birthdate
6 C2 D1 Anne 18 400 7/15/1997
('C2', 'D2')
Company Department Name Age Salary Birthdate
7 C2 D2 Deborah 30 8600 8/15/1984
8 C2 D2 Nikalus 51 12000 9/18/2000
('C2', 'D3')
Company Department Name Age Salary Birthdate
10 C2 D3 Michael 38 17000 11/30/1997
11 C2 D3 Jeannie 30 12500 12/31/1980
('C3', 'D1')
Company Department Name Age Salary Birthdate
14 C3 D1 Ong 44 17500 8/21/1980
('C3', 'D2')
Company Department Name Age Salary Birthdate
15 C3 D2 Lily 41 15300 7/17/1990
('C3', 'D3')
Company Department Name Age Salary Birthdate
13 C3 D3 Chang 32 7900 7/26/1973
16 C3 D3 Sally 54 21000 7/19/1968
In [254]:
com_grp
Out[254]:
<pandas.core.groupby.DataFrameGroupBy object at 0x000002C3255294A8>
Apply Aggregate Functions to Groups¶
Aggregate apply functions to columns in every groups, and return a summary data for each group
Apply One Function to One or More Columns¶
In [255]:
com_grp['Age'].sum()
Out[255]:
Company Department
C1 D1 80
D2 34
D3 134
C2 D1 18
D2 127
D3 97
C3 D1 44
D2 41
D3 123
Name: Age, dtype: int64
In [256]:
com_grp[['Age','Salary']].sum()
Out[256]:
Age Salary
Company Department
C1 D1 80 27000
D2 34 8000
D3 134 30500
C2 D1 18 400
D2 127 34600
D3 97 39300
C3 D1 44 17500
D2 41 15300
D3 123 42400
Apply One or More Functions To All Columns¶
In [257]:
com_grp.agg(np.mean)
Out[257]:
Age Salary
Company Department
C1 D1 40.000000 13500.000000
D2 34.000000 8000.000000
D3 44.666667 10166.666667
C2 D1 18.000000 400.000000
D2 42.333333 11533.333333
D3 32.333333 13100.000000
C3 D1 44.000000 17500.000000
D2 41.000000 15300.000000
D3 41.000000 14133.333333
In [258]:
com_grp.agg([np.mean,np.sum])
Out[258]:
Age Salary
mean sum mean sum
Company Department
C1 D1 40.000000 80 13500.000000 27000
D2 34.000000 34 8000.000000 8000
D3 44.666667 134 10166.666667 30500
C2 D1 18.000000 18 400.000000 400
D2 42.333333 127 11533.333333 34600
D3 32.333333 97 13100.000000 39300
C3 D1 44.000000 44 17500.000000 17500
D2 41.000000 41 15300.000000 15300
D3 41.000000 123 14133.333333 42400
Apply Different Functions To Different Columns¶
In [259]:
com_grp.agg({'Age':np.mean, 'Salary': [np.min,np.max]})
Out[259]:
Age Salary
mean amin amax
Company Department
C1 D1 40.000000 12000 15000
D2 34.000000 8000 8000
D3 44.666667 2500 25000
C2 D1 18.000000 400 400
D2 42.333333 8600 14000
D3 32.333333 9800 17000
C3 D1 44.000000 17500 17500
D2 41.000000 15300 15300
D3 41.000000 7900 21000
Transform¶
- Transform is an operation used combined with DataFrameGroupBy object
- transform() return a new DataFrame object
In [260]:
grp = company.groupby('Company')
grp.size()
Out[260]:
Company C1 6 C2 7 C3 5 dtype: int64
transform() perform a function to a group, and expands and replicate it to multiple rows according to original DataFrame
In [261]:
grp[['Age','Salary']].transform('sum')
Out[261]:
Age Salary 0 248 65500 1 248 65500 2 248 65500 3 248 65500 4 248 65500 .. ... ... 13 208 75200 14 208 75200 15 208 75200 16 208 75200 17 208 75200 [18 rows x 2 columns]
In [262]:
grp.transform( lambda x:x+10 )
Out[262]:
Age Salary 0 55 15010 1 45 12010 2 44 8010 3 33 2510 4 65 25010 .. ... ... 13 42 7910 14 54 17510 15 51 15310 16 64 21010 17 47 13510 [18 rows x 2 columns]