Grouping and Aggregating

Which manufacturer has the highest mean sugar content?

cereal['mfr'].value_counts()
mfr
K    23
G    22
P     9
     ..
Q     8
N     6
A     1
Name: count, Length: 7, dtype: int64

Let’s start with “K”:

cereal[cereal['mfr'] == 'K'].mean(numeric_only=True)[['sugars']]
sugars    7.565217
dtype: float64


Next “G”:

cereal[cereal['mfr'] == 'G'].mean(numeric_only=True)[['sugars']]
sugars    7.954545
dtype: float64

Using groupby

mfr_group = cereal.groupby(by='mfr')
mfr_group
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fb084daecc0>
404 image
mfr_group.groups
{'A': [43], 'G': [5, 7, 11, 12, 13, 14, 18, 22, 31, 36, 40, 42, 47, 51, 59, 69, 70, 71, 72, 73, 75, 76], 'K': [2, 3, 6, 16, 17, 19, 21, 24, 25, 26, 28, 38, 39, 46, 48, 49, 50, 53, 58, 60, 62, 66, 67], 'N': [0, 20, 63, 64, 65, 68], 'P': [9, 27, 29, 30, 32, 33, 34, 37, 52], 'Q': [1, 10, 35, 41, 54, 55, 56, 57], 'R': [4, 8, 15, 23, 44, 45, 61, 74]}
mfr_group.groups['K']
Index([2, 3, 6, 16, 17, 19, 21, 24, 25, 26, 28, 38, 39, 46, 48, 49, 50, 53, 58, 60, 62, 66, 67], dtype='int64')
mfr_group.get_group('K')
name mfr type calories ... shelf weight cups rating
2 All-Bran K Cold 70 ... 3 1.0 0.33 59.425505
3 All-Bran with Extra Fiber K Cold 50 ... 3 1.0 0.50 93.704912
6 Apple Jacks K Cold 110 ... 2 1.0 1.00 33.174094
... ... ... ... ... ... ... ... ... ...
62 Rice Krispies K Cold 110 ... 1 1.0 1.00 40.560159
66 Smacks K Cold 110 ... 2 1.0 0.75 31.230054
67 Special K K Cold 110 ... 1 1.0 1.00 53.131324

23 rows × 16 columns

cereal['mfr'].value_counts()
mfr
K    23
G    22
P     9
     ..
Q     8
N     6
A     1
Name: count, Length: 7, dtype: int64


mfr_group.size()
mfr
A     1
G    22
K    23
     ..
P     9
Q     8
R     8
Length: 7, dtype: int64

Summary Statistics with Groups

mfr_group = cereal.drop(columns=["name", "type"]).groupby(by='mfr')
mfr_group.mean()
calories protein fat sodium ... shelf weight cups rating
mfr
A 100.000000 4.000000 1.000000 0.000000 ... 2.000000 1.000000 1.00000 54.850917
G 111.363636 2.318182 1.363636 200.454545 ... 2.136364 1.049091 0.87500 34.485852
... ... ... ... ... ... ... ... ... ...
Q 95.000000 2.625000 1.750000 92.500000 ... 2.375000 0.875000 0.82375 42.915990
R 115.000000 2.500000 1.250000 198.125000 ... 2.000000 1.000000 0.87125 41.542997

7 rows × 13 columns

mfr_group.max()
calories protein fat sodium ... shelf weight cups rating
mfr
A 100 4 1 0 ... 2 1.0 1.00 54.850917
G 140 6 3 290 ... 3 1.5 1.50 51.592193
... ... ... ... ... ... ... ... ... ...
Q 120 5 5 220 ... 3 1.0 1.00 63.005645
R 150 4 3 280 ... 3 1.0 1.13 49.787445

7 rows × 13 columns

Aggregating dataframes

cereal.select_dtypes(include=np.number).agg('mean')
calories    106.883117
protein       2.545455
fat           1.012987
               ...    
weight        1.029610
cups          0.821039
rating       42.665705
Length: 13, dtype: float64


cereal.mean(numeric_only=True)
calories    106.883117
protein       2.545455
fat           1.012987
               ...    
weight        1.029610
cups          0.821039
rating       42.665705
Length: 13, dtype: float64
cereal.select_dtypes(include=np.number).agg(['max', 'min', 'median'])
calories protein fat sodium ... shelf weight cups rating
max 160.0 6.0 5.0 320.0 ... 3.0 1.5 1.50 93.704912
min 50.0 1.0 0.0 0.0 ... 1.0 0.5 0.25 18.042851
median 110.0 3.0 1.0 180.0 ... 2.0 1.0 0.75 40.400208

3 rows × 13 columns

Aggregating groupby objects

mfr_group.agg(['max', 'min', 'median'])
calories protein ... cups rating
max min median max ... median max min median
mfr
A 100 100 100.0 4 ... 1.000 54.850917 54.850917 54.850917
G 140 100 110.0 6 ... 0.875 51.592193 19.823573 36.181877
K 160 50 110.0 6 ... 0.750 93.704912 29.924285 40.560159
... ... ... ... ... ... ... ... ... ...
P 120 90 110.0 3 ... 0.670 53.371007 28.025765 40.917047
Q 120 50 100.0 5 ... 0.875 63.005645 18.042851 47.419974
R 150 90 110.0 4 ... 0.875 49.787445 34.139765 41.721976

7 rows × 39 columns

Extra Fancy Aggregation

mfr_group.agg({"calories":['max', 'min'],
               "rating":['sum'],  
               "sugars":['mean', 'median']})
calories rating sugars
max min sum mean median
mfr
A 100 100 54.850917 3.000000 3.0
G 140 100 758.688737 7.954545 8.5
K 160 50 1012.884634 7.565217 7.0
... ... ... ... ... ...
P 120 90 375.351697 8.777778 10.0
Q 120 50 343.327919 5.500000 6.0
R 150 90 332.343977 6.125000 5.5

7 rows × 5 columns

Let’s apply what we learned!