Working with Null Values

Null: The human-readable term for a value that is missing from the dataframe.

np.nan
nan

Missing values are sometimes referred to as NA values.

In this course, we refer to them interchangeably as null or NaN values.
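As a minimal sketch of why missing values need their own tools: NaN is a floating-point value that never compares equal to anything, itself included, so `==` cannot be used to find it. The short Series below is illustrative only, with values borrowed from the cycling distances.

```python
import numpy as np
import pandas as pd

# NaN is an ordinary Python float, but it never equals itself.
nan_is_float = isinstance(np.nan, float)
nan_equals_itself = np.nan == np.nan

# Use isna() (alias: isnull()) rather than == to detect missing values.
values = pd.Series([12.62, np.nan, 13.03])
missing_mask = values.isna()

print(nan_is_float, nan_equals_itself)  # True False
print(missing_mask.tolist())            # [False, True, False]
```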

Info on missing values

cereal.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77 entries, 0 to 76
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   name      77 non-null     object 
 1   mfr       77 non-null     object 
 2   calories  77 non-null     int64  
 3   fat       77 non-null     int64  
 4   fiber     77 non-null     float64
 5   rating    77 non-null     float64
dtypes: float64(2), int64(2), object(2)
memory usage: 3.7+ KB
cycling
    Date                 Name            Type  Time  Distance  Comments
0   2019-09-09 07:13:00  Morning Ride    Ride  2084  12.62     Rain
1   2019-09-09 20:52:00  Afternoon Ride  Ride  2531  13.03     rain
2   2019-09-10 07:23:00  Morning Ride    Ride  1863  NaN       Wet road but nice weather
... ...                  ...             ...   ...   ...       ...
30  2019-10-09 07:10:00  Morning Ride    Ride  1841  12.59     Feeling good after a holiday break!
31  2019-10-09 20:47:00  Afternoon Ride  Ride  2463  12.79     Stopped for photo of sunrise
32  2019-10-10 07:16:00  Morning Ride    Ride  1843  11.79     Bike feeling tight, needs an oil and pump

33 rows × 6 columns

cycling.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33 entries, 0 to 32
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   Date      33 non-null     datetime64[ns]
 1   Name      33 non-null     object        
 2   Type      33 non-null     object        
 3   Time      33 non-null     int64         
 4   Distance  30 non-null     float64       
 5   Comments  33 non-null     object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(3)
memory usage: 1.7+ KB
cycling['Distance'].isnull()
0     False
1     False
2      True
      ...  
30    False
31    False
32    False
Name: Distance, Length: 33, dtype: bool
cycling[cycling['Distance'].isnull()]
    Date                 Name            Type  Time  Distance  Comments
2   2019-09-10 07:23:00  Morning Ride    Ride  1863  NaN       Wet road but nice weather
22  2019-09-30 07:15:00  Morning Ride    Ride  1732  NaN       Legs feeling strong!
24  2019-10-01 07:13:00  Morning Ride    Ride  1756  NaN       A little tired today but good weather


cycling[cycling.isnull().any(axis=1)]
    Date                 Name            Type  Time  Distance  Comments
2   2019-09-10 07:23:00  Morning Ride    Ride  1863  NaN       Wet road but nice weather
22  2019-09-30 07:15:00  Morning Ride    Ride  1732  NaN       Legs feeling strong!
24  2019-10-01 07:13:00  Morning Ride    Ride  1756  NaN       A little tired today but good weather
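The same checks can be tried on a self-contained frame. The sketch below assumes a four-row miniature named `toy` (a hypothetical name) modeled on the first rows of the cycling data. It counts missing values per column and then selects the rows that contain at least one.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row miniature of the cycling data.
toy = pd.DataFrame({
    "Name": ["Morning Ride", "Afternoon Ride", "Morning Ride", "Afternoon Ride"],
    "Time": [2084, 2531, 1863, 2192],
    "Distance": [12.62, 13.03, np.nan, 12.84],
})

# isnull() gives a True/False frame; summing counts the True values per column.
missing_per_column = toy.isnull().sum()

# any(axis=1) is True for a row when any of its columns is missing.
rows_with_missing = toy[toy.isnull().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```

Summing the boolean frame is a quick complement to `info()`, which reports non-null counts rather than null counts.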

We will discuss the following two simple ways of working with missing values:

Dropping Null Values

trips_removed = cycling.dropna()
trips_removed
    Date                 Name            Type  Time  Distance  Comments
0   2019-09-09 07:13:00  Morning Ride    Ride  2084  12.62     Rain
1   2019-09-09 20:52:00  Afternoon Ride  Ride  2531  13.03     rain
3   2019-09-10 21:06:00  Afternoon Ride  Ride  2192  12.84     Stopped for photo of sunrise
... ...                  ...             ...   ...   ...       ...
30  2019-10-09 07:10:00  Morning Ride    Ride  1841  12.59     Feeling good after a holiday break!
31  2019-10-09 20:47:00  Afternoon Ride  Ride  2463  12.79     Stopped for photo of sunrise
32  2019-10-10 07:16:00  Morning Ride    Ride  1843  11.79     Bike feeling tight, needs an oil and pump

30 rows × 6 columns

cycling.dropna(subset=['Type'])
    Date                 Name            Type  Time  Distance  Comments
0   2019-09-09 07:13:00  Morning Ride    Ride  2084  12.62     Rain
1   2019-09-09 20:52:00  Afternoon Ride  Ride  2531  13.03     rain
2   2019-09-10 07:23:00  Morning Ride    Ride  1863  NaN       Wet road but nice weather
... ...                  ...             ...   ...   ...       ...
30  2019-10-09 07:10:00  Morning Ride    Ride  1841  12.59     Feeling good after a holiday break!
31  2019-10-09 20:47:00  Afternoon Ride  Ride  2463  12.79     Stopped for photo of sunrise
32  2019-10-10 07:16:00  Morning Ride    Ride  1843  11.79     Bike feeling tight, needs an oil and pump

33 rows × 6 columns

cycling.dropna(subset=['Distance'])
    Date                 Name            Type  Time  Distance  Comments
0   2019-09-09 07:13:00  Morning Ride    Ride  2084  12.62     Rain
1   2019-09-09 20:52:00  Afternoon Ride  Ride  2531  13.03     rain
3   2019-09-10 21:06:00  Afternoon Ride  Ride  2192  12.84     Stopped for photo of sunrise
... ...                  ...             ...   ...   ...       ...
30  2019-10-09 07:10:00  Morning Ride    Ride  1841  12.59     Feeling good after a holiday break!
31  2019-10-09 20:47:00  Afternoon Ride  Ride  2463  12.79     Stopped for photo of sunrise
32  2019-10-10 07:16:00  Morning Ride    Ride  1843  11.79     Bike feeling tight, needs an oil and pump

30 rows × 6 columns
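In the cycling data only `Distance` has missing values, so `dropna()` and `dropna(subset=['Distance'])` return the same rows. The sketch below assumes a hypothetical three-row frame, `toy`, with gaps in two different columns (the missing comment is invented for illustration), to show where the two calls differ.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a gap in Distance (row 1) and in Comments (row 2).
toy = pd.DataFrame({
    "Name": ["Morning Ride", "Morning Ride", "Afternoon Ride"],
    "Distance": [12.62, np.nan, 12.84],
    "Comments": ["Rain", "Wet road but nice weather", None],
})

# With no arguments, any row containing a missing value is dropped.
dropped_any = toy.dropna()

# With subset, only missing values in the listed columns cause a drop.
dropped_distance = toy.dropna(subset=["Distance"])

print(dropped_any.index.tolist())       # [0]
print(dropped_distance.index.tolist())  # [0, 2]
```

`dropna()` returns a new dataframe and leaves the original untouched, which is why the result is assigned to a new name.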

Replacing Null Values

cycling_zero_fill = cycling.fillna(value=0)
cycling_zero_fill
    Date                 Name            Type  Time  Distance  Comments
0   2019-09-09 07:13:00  Morning Ride    Ride  2084  12.62     Rain
1   2019-09-09 20:52:00  Afternoon Ride  Ride  2531  13.03     rain
2   2019-09-10 07:23:00  Morning Ride    Ride  1863  0.00      Wet road but nice weather
... ...                  ...             ...   ...   ...       ...
30  2019-10-09 07:10:00  Morning Ride    Ride  1841  12.59     Feeling good after a holiday break!
31  2019-10-09 20:47:00  Afternoon Ride  Ride  2463  12.79     Stopped for photo of sunrise
32  2019-10-10 07:16:00  Morning Ride    Ride  1843  11.79     Bike feeling tight, needs an oil and pump

33 rows × 6 columns

cycling['Distance'].mean().round(2)
np.float64(12.67)


cycling_mean_fill = cycling.fillna(value=cycling['Distance'].mean().round(2))
cycling_mean_fill
    Date                 Name            Type  Time  Distance  Comments
0   2019-09-09 07:13:00  Morning Ride    Ride  2084  12.62     Rain
1   2019-09-09 20:52:00  Afternoon Ride  Ride  2531  13.03     rain
2   2019-09-10 07:23:00  Morning Ride    Ride  1863  12.67     Wet road but nice weather
... ...                  ...             ...   ...   ...       ...
30  2019-10-09 07:10:00  Morning Ride    Ride  1841  12.59     Feeling good after a holiday break!
31  2019-10-09 20:47:00  Afternoon Ride  Ride  2463  12.79     Stopped for photo of sunrise
32  2019-10-10 07:16:00  Morning Ride    Ride  1843  11.79     Bike feeling tight, needs an oil and pump

33 rows × 6 columns
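Passing a single value to `fillna()` fills every missing cell in the dataframe, whatever its column. That is harmless for the cycling data, where only `Distance` has gaps, but it may not be what you want in general. The sketch below assumes a hypothetical frame, `toy`, with an invented gap in `Time`, and compares a scalar fill with a per-column fill given as a dictionary.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one gap in Time (row 1) and one in Distance (row 2).
toy = pd.DataFrame({
    "Time": [2084.0, np.nan, 1863.0, 2192.0],
    "Distance": [12.62, 13.03, np.nan, 12.84],
})

# mean() skips missing values by default.
distance_mean = round(toy["Distance"].mean(), 2)

# A scalar fill also puts the Distance mean into the Time column.
filled_everywhere = toy.fillna(distance_mean)

# A dict maps column names to fill values, leaving other columns alone.
filled_distance_only = toy.fillna({"Distance": distance_mean})

print(filled_everywhere)
print(filled_distance_only)
```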

# Backward fill: each NaN takes the next valid value below it (fillna(method=...) is deprecated)
cycling.bfill()
    Date                 Name            Type  Time  Distance  Comments
0   2019-09-09 07:13:00  Morning Ride    Ride  2084  12.62     Rain
1   2019-09-09 20:52:00  Afternoon Ride  Ride  2531  13.03     rain
2   2019-09-10 07:23:00  Morning Ride    Ride  1863  12.84     Wet road but nice weather
... ...                  ...             ...   ...   ...       ...
30  2019-10-09 07:10:00  Morning Ride    Ride  1841  12.59     Feeling good after a holiday break!
31  2019-10-09 20:47:00  Afternoon Ride  Ride  2463  12.79     Stopped for photo of sunrise
32  2019-10-10 07:16:00  Morning Ride    Ride  1843  11.79     Bike feeling tight, needs an oil and pump

33 rows × 6 columns

# Forward fill: each NaN takes the last valid value above it (fillna(method=...) is deprecated)
cycling.ffill()
    Date                 Name            Type  Time  Distance  Comments
0   2019-09-09 07:13:00  Morning Ride    Ride  2084  12.62     Rain
1   2019-09-09 20:52:00  Afternoon Ride  Ride  2531  13.03     rain
2   2019-09-10 07:23:00  Morning Ride    Ride  1863  13.03     Wet road but nice weather
... ...                  ...             ...   ...   ...       ...
30  2019-10-09 07:10:00  Morning Ride    Ride  1841  12.59     Feeling good after a holiday break!
31  2019-10-09 20:47:00  Afternoon Ride  Ride  2463  12.79     Stopped for photo of sunrise
32  2019-10-10 07:16:00  Morning Ride    Ride  1843  11.79     Bike feeling tight, needs an oil and pump

33 rows × 6 columns
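Backward and forward filling can be compared side by side on a small Series. The sketch below uses the first four distances from the cycling data and the dedicated `bfill()` and `ffill()` methods, which do the same job as the `method=` argument of `fillna()` (deprecated in recent pandas releases).

```python
import numpy as np
import pandas as pd

# First four distances from the cycling data; the third is missing.
distance = pd.Series([12.62, 13.03, np.nan, 12.84])

# Backward fill copies the next valid value up into the gap.
backward_filled = distance.bfill()

# Forward fill copies the previous valid value down into the gap.
forward_filled = distance.ffill()

print(backward_filled.tolist())  # [12.62, 13.03, 12.84, 12.84]
print(forward_filled.tolist())   # [12.62, 13.03, 13.03, 12.84]
```

A gap at the very end of the data has nothing below it to copy, so `bfill()` leaves it as NaN. The same is true for `ffill()` and a gap in the first row.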

Let’s apply what we learned!