Programming in Python for Data Science

What is the concept of tidy data?

Tidy data satisfies the following three criteria:

Each row is a single observation,
Each variable is a single column, and
Each value is a single cell (i.e., its row-and-column position in the dataframe is not shared with another value).
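As a rough illustration of the three criteria, here is a minimal sketch of a tidy dataframe. The cereal names and numbers are made up for this example and are not taken from the course dataset; only the column names mirror the ones used in these notes.

```python
import pandas as pd

# Hypothetical toy data: names and values are invented for illustration only.
toy = pd.DataFrame(
    {
        "name": ["Cereal A", "Cereal B", "Cereal C"],
        "mfr": ["K", "P", "G"],
        "protein": [2, 3, 6],
        "calories": [110, 90, 100],
    }
)

# One row per cereal (observation), one column per variable,
# and exactly one value in each cell.
print(toy.shape)  # (3, 4)

# Because each variable has its own column, a summary is a single column operation.
print(toy["calories"].mean())  # 100.0
```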

Image Source: R for Data Science by Garrett Grolemund & Hadley Wickham

What counts as a variable and what counts as an observation may depend on your immediate goal.

Are protein and calorie content associated with the cereal manufacturer?

cereal2

	name	mfr	nutrition	value
0	Apple Jacks	K	protein	2
1	Bran Flakes	P	protein	3
2	Cheerios	G	protein	6
...	...	...	...	...
11	Raisin Bran	K	calories	120
12	Special K	K	calories	110
13	Wheaties	G	calories	100

14 rows × 4 columns

cereal2.loc[cereal2['nutrition'] == 'calories', 'value'].mean()

np.float64(107.14285714285714)

If we had tidy data, with one column per nutrient, we could simply have done:

cereal['calories'].mean()

np.float64(107.14285714285714)
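One possible way to get from the long layout of cereal2 to a tidy layout is `pivot`. The sketch below assumes a small made-up frame (`long_df`, with invented cereal names and values) that has the same four columns as cereal2; it is not the course's own data or necessarily the method the course uses.

```python
import pandas as pd

# Hypothetical long-format data with the same columns as cereal2.
# Names and values are invented for illustration only.
long_df = pd.DataFrame(
    {
        "name": ["Cereal A", "Cereal B", "Cereal A", "Cereal B"],
        "mfr": ["K", "G", "K", "G"],
        "nutrition": ["protein", "protein", "calories", "calories"],
        "value": [2, 6, 120, 100],
    }
)

# Spread the 'nutrition' labels into their own columns so that
# each row describes one cereal and each nutrient is one variable.
wide_df = (
    long_df.pivot(index=["name", "mfr"], columns="nutrition", values="value")
    .reset_index()
    .rename_axis(columns=None)  # drop the leftover 'nutrition' axis label
)

# The summary is now a plain column operation.
print(wide_df["calories"].mean())  # 110.0

# The manufacturer question becomes a simple group-by on the tidy frame.
by_mfr = wide_df.groupby("mfr")[["protein", "calories"]].mean()
print(by_mfr)
```

With one column per nutrient, both the overall mean and the per-manufacturer comparison need no filtering on a label column.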