Tidy Data

What is tidy data?

Tidy data satisfies the following three criteria:

  • Each row is a single observation,
  • Each variable is a single column, and
  • Each value is a single cell (i.e., its row and column position in the dataframe is not shared with another value).
[Figure illustrating the three criteria; image unavailable]

Image Source: R for Data Science by Garrett Grolemund & Hadley Wickham
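
The three criteria can be seen in a small dataframe. The sketch below is only an illustration: the cereal names and numbers are made up, and the variable names (tidy, observation, variable, value) are hypothetical.

```python
import pandas as pd

# Hypothetical values, for illustration only (not the real cereal data)
tidy = pd.DataFrame({
    "name": ["Cereal A", "Cereal B", "Cereal C"],
    "mfr": ["K", "G", "P"],
    "protein": [2, 6, 3],
    "calories": [110, 100, 90],
})

# Criterion 1: one row holds everything measured for a single cereal
observation = tidy.loc[0]

# Criterion 2: one column holds a single variable across all cereals
variable = tidy["calories"]

# Criterion 3: one row/column position holds exactly one value
value = tidy.loc[0, "calories"]

print(observation)
print(variable)
print(value)
```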


What counts as a variable and what counts as an observation may depend on your immediate goal.

Are protein and calorie content associated with the cereal manufacturer?
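
For a question like this, a convenient layout has one row per cereal and one column each for the manufacturer, protein, and calories. A minimal sketch of how the question could then be approached, assuming a dataframe named cereal with made-up values:

```python
import pandas as pd

# Hypothetical tidy table: one row per cereal, one column per variable.
# The values are invented for illustration, not taken from the real dataset.
cereal = pd.DataFrame({
    "name": ["Cereal A", "Cereal B", "Cereal C", "Cereal D"],
    "mfr": ["K", "K", "G", "G"],
    "protein": [2, 4, 6, 2],
    "calories": [120, 100, 110, 90],
})

# Average protein and calories for each manufacturer
by_mfr = cereal.groupby("mfr")[["protein", "calories"]].mean()
print(by_mfr)
```

Because each variable is its own column, the comparison across manufacturers is a single group-by.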


Criterion #1: Each row is a single observation


Criterion #2: Each variable is a single column


Criterion #3: Each value is a single cell


Criterion #1: Each row is a single observation


Criterion #2: Each variable is a single column

cereal2
           name mfr nutrition  value
0   Apple Jacks   K   protein      2
1   Bran Flakes   P   protein      3
2      Cheerios   G   protein      6
..          ...  ..       ...    ...
11  Raisin Bran   K  calories    120
12    Special K   K  calories    110
13     Wheaties   G  calories    100

14 rows × 4 columns


cereal2[cereal2['nutrition'] == 'calories']['value'].mean()
np.float64(107.14285714285714)


If the data were tidy, with calories in its own column, we could simply have done:

cereal['calories'].mean()
np.float64(107.14285714285714)
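
One possible way to get from a table shaped like cereal2 to a tidy one is pandas' pivot. This is a sketch under assumptions: the values below are made up, and the original material does not say how its tidy cereal dataframe was produced. Passing a list as index to pivot requires pandas 1.1 or later.

```python
import pandas as pd

# Hypothetical long-format table shaped like cereal2 (made-up values)
cereal2 = pd.DataFrame({
    "name": ["Cereal A", "Cereal B", "Cereal C"] * 2,
    "mfr": ["K", "G", "P"] * 2,
    "nutrition": ["protein"] * 3 + ["calories"] * 3,
    "value": [2, 6, 3, 110, 100, 90],
})

# Spread the labels in 'nutrition' into their own columns,
# so that each variable becomes a single column
cereal = (
    cereal2.pivot(index=["name", "mfr"], columns="nutrition", values="value")
    .reset_index()
    .rename_axis(columns=None)
)

# Both layouts give the same mean; the tidy one needs no filtering
long_mean = cereal2.loc[cereal2["nutrition"] == "calories", "value"].mean()
tidy_mean = cereal["calories"].mean()
print(cereal)
print(long_mean, tidy_mean)
```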

Let’s practice what we know about tidy data first!