Summary Statistics

import pandas as pd

cereal = pd.read_csv('data/cereal.csv')
cereal.head(15)
                         name  mfr  type  calories  ...  shelf  weight  cups     rating
0                   100% Bran    N  Cold        70  ...      3    1.00  0.33  68.402973
1           100% Natural Bran    Q  Cold       120  ...      3    1.00  1.00  33.983679
2                    All-Bran    K  Cold        70  ...      3    1.00  0.33  59.425505
3   All-Bran with Extra Fiber    K  Cold        50  ...      3    1.00  0.50  93.704912
4              Almond Delight    R  Cold       110  ...      3    1.00  0.75  34.384843
5     Apple Cinnamon Cheerios    G  Cold       110  ...      1    1.00  0.75  29.509541
6                 Apple Jacks    K  Cold       110  ...      2    1.00  1.00  33.174094
7                     Basic 4    G  Cold       130  ...      3    1.33  0.75  37.038562
8                   Bran Chex    R  Cold        90  ...      1    1.00  0.67  49.120253
9                 Bran Flakes    P  Cold        90  ...      3    1.00  0.67  53.313813
10               Cap'n'Crunch    Q  Cold       120  ...      2    1.00  0.75  18.042851
11                   Cheerios    G  Cold       110  ...      1    1.00  1.25  50.764999
12      Cinnamon Toast Crunch    G  Cold       120  ...      2    1.00  0.75  19.823573
13                   Clusters    G  Cold       110  ...      3    1.00  0.50  40.400208
14                Cocoa Puffs    G  Cold       110  ...      2    1.00  1.00  22.736446

15 rows × 16 columns

Numerical and Categorical Columns

Categorical data

Categorical data consist of qualitative observations, such as characteristics, that are generally recorded as words or labels.

Examples

  • Colours
  • Names


Numerical data

Numerical data are quantitative observations, usually expressed as numbers.

Examples

  • Measurements
  • Quantities
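
The distinction above can be checked in pandas by looking at column types. The following is a minimal sketch, assuming a small hypothetical DataFrame named `toy` that copies a few values from the preview above (it is not the full cereal file). It uses `select_dtypes` to separate the numeric columns from the rest.

```python
import pandas as pd

# Hypothetical miniature of the cereal data; values copied from the preview above.
toy = pd.DataFrame({
    "name": ["100% Bran", "All-Bran", "Cheerios"],
    "mfr": ["N", "K", "G"],
    "calories": [70, 70, 110],
    "rating": [68.402973, 59.425505, 50.764999],
})

# Columns stored as numbers.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything else (here, the text columns).
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```

Note that a column stored as numbers is not always numerical in meaning, so the dtype is only a first hint.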

Pandas describe()

cereal.describe()
         calories     protein         fat      sodium  ...       shelf      weight        cups      rating
count   77.000000   77.000000   77.000000   77.000000  ...   77.000000   77.000000   77.000000   77.000000
mean   106.883117    2.545455    1.012987  159.675325  ...    2.207792    1.029610    0.821039   42.665705
std     19.484119    1.094790    1.006473   83.832295  ...    0.832524    0.150477    0.232716   14.047289
min     50.000000    1.000000    0.000000    0.000000  ...    1.000000    0.500000    0.250000   18.042851
25%    100.000000    2.000000    0.000000  130.000000  ...    1.000000    1.000000    0.670000   33.174094
50%    110.000000    3.000000    1.000000  180.000000  ...    2.000000    1.000000    0.750000   40.400208
75%    110.000000    3.000000    2.000000  210.000000  ...    3.000000    1.000000    1.000000   50.828392
max    160.000000    6.000000    5.000000  320.000000  ...    3.000000    1.500000    1.500000   93.704912

8 rows × 13 columns

  • count: the number of non-NA/null observations in a column
  • mean: the mean of a column
  • std: the standard deviation of a column
  • min: the minimum value in a column
  • max: the maximum value in a column
  • 25%, 50%, 75%: the 25th, 50th and 75th percentiles of a column, reported by default (the 50th percentile is the median)
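
To make the rows of `describe()` concrete, each statistic can also be computed one at a time. This is a minimal sketch, assuming a Series named `calories` that holds only the 15 calorie values visible in the preview above, so its numbers differ from the full 77-row output.

```python
import pandas as pd

# Calories of the 15 cereals shown in the preview above (not the full data).
calories = pd.Series(
    [70, 120, 70, 50, 110, 110, 110, 130, 90, 90, 120, 110, 120, 110, 110],
    name="calories",
)

summary = calories.describe()

# The same figures computed individually.
by_hand = pd.Series({
    "count": calories.count(),          # non-null observations
    "mean": calories.mean(),
    "std": calories.std(),              # sample standard deviation (ddof=1)
    "min": calories.min(),
    "25%": calories.quantile(0.25),
    "50%": calories.quantile(0.50),     # the median
    "75%": calories.quantile(0.75),
    "max": calories.max(),
})

print(summary)
print(by_hand)
```

Both printouts should list the same eight values, which shows that `describe()` is a shortcut for these individual calls.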

cereal.describe(include='all')
             name  mfr  type    calories  ...       shelf      weight        cups      rating
count          77   77    77   77.000000  ...   77.000000   77.000000   77.000000   77.000000
unique         77    7     2         NaN  ...         NaN         NaN         NaN         NaN
top     100% Bran    K  Cold         NaN  ...         NaN         NaN         NaN         NaN
freq            1   23    74         NaN  ...         NaN         NaN         NaN         NaN
mean          NaN  NaN   NaN  106.883117  ...    2.207792    1.029610    0.821039   42.665705
std           NaN  NaN   NaN   19.484119  ...    0.832524    0.150477    0.232716   14.047289
min           NaN  NaN   NaN   50.000000  ...    1.000000    0.500000    0.250000   18.042851
25%           NaN  NaN   NaN  100.000000  ...    1.000000    1.000000    0.670000   33.174094
50%           NaN  NaN   NaN  110.000000  ...    2.000000    1.000000    0.750000   40.400208
75%           NaN  NaN   NaN  110.000000  ...    3.000000    1.000000    1.000000   50.828392
max           NaN  NaN   NaN  160.000000  ...    3.000000    1.500000    1.500000   93.704912

11 rows × 16 columns

  • unique: the number of distinct values in a column
  • top: the most frequently occurring value
  • freq: the number of times the top value occurs
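
The three categorical statistics can be reproduced with `nunique()` and `value_counts()`. This is a minimal sketch, assuming a Series named `mfr` built from the manufacturer codes of the 15 previewed cereals only.

```python
import pandas as pd

# Manufacturer codes of the 15 cereals in the preview above (not the full 77 rows).
mfr = pd.Series(list("NQKKRGKGRPQGGGG"), name="mfr")

counts = mfr.value_counts()   # occurrences per value, most frequent first

n_unique = mfr.nunique()      # unique: number of distinct values
top = counts.index[0]         # top: most frequent value
freq = int(counts.iloc[0])    # freq: how often the top value occurs

print(n_unique, top, freq)
print(mfr.describe())
```

On this 15-row subset the most frequent manufacturer is G, whereas the full table above reports K with a frequency of 23.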

ratings = cereal[['rating']]
ratings.mean()
rating    42.665705
dtype: float64


ratings.sum()
rating    3285.259284
dtype: float64


ratings.median()
rating    40.400208
dtype: float64

cereal.mean(numeric_only=True)
calories    106.883117
protein       2.545455
fat           1.012987
sodium      159.675325
fiber         2.151948
carbo        14.623377
sugars        6.948052
potass       96.129870
vitamins     28.246753
shelf         2.207792
weight        1.029610
cups          0.821039
rating       42.665705
dtype: float64
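
The outputs above are labelled Series rather than single numbers because `cereal[['rating']]` (double brackets) selects a one-column DataFrame. The following minimal sketch, assuming a small hypothetical DataFrame named `toy` with values copied from the preview, contrasts that with single-bracket selection and shows what `numeric_only=True` does.

```python
import pandas as pd

# Hypothetical miniature of the cereal data; values copied from the preview above.
toy = pd.DataFrame({
    "name": ["100% Bran", "All-Bran", "Cheerios"],
    "calories": [70, 70, 110],
    "rating": [68.402973, 59.425505, 50.764999],
})

# Double brackets select a DataFrame, so mean() returns a Series labelled 'rating'.
mean_as_series = toy[["rating"]].mean()

# Single brackets select a Series, so mean() returns one number.
mean_as_number = toy["rating"].mean()

# numeric_only=True skips the text column 'name'.
column_means = toy.mean(numeric_only=True)

print(mean_as_series)
print(mean_as_number)
print(column_means)
```

The same pattern applies to `sum()` and `median()`.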

Let’s apply what we learned!