Exploratory data analysis (EDA)

4. Exploratory data analysis (EDA)#

Lecture learning goals

By the end of the lecture you will be able to:

Visualize missing values.
Define what EDA is on a conceptual levels both for numerical and categorical variables.
Use repeated plot grids to investigate multiple data frame columns in the same plot.
Visualize correlations.
Visualize counts of categorical variables.

Required activities

Before class:

This 10 min video on EDA concepts.

After class:

Review the lecture notes.
Interactive flow chart for which visualization to choose. Not really anything to read, but go there to look around and bookmark it to use as a reference when needed throughout this course and others.

import altair as alt
import pandas as pd

# Simplify working with large datasets in Altair
alt.data_transformers.enable('vegafusion')

DataTransformerRegistry.enable('vegafusion')

4.1. Missing values#

4.1.1. Py#

First we are going to load in some weather data and simulate a faulty sensor by assigning some NaN values manually.

from vega_datasets import data
import numpy as np


df = data.seattle_weather()
# Artifically introduce some NaNs to simulate a sensor that can't measure below freezing
df.loc[df['temp_min'] <= 0.6, 'temp_min'] = np.nan
df.loc[df['temp_max'] <= 0.6, 'temp_max'] = np.nan
df = df[df['weather'].isin(['rain', 'snow', 'sun'])]
df

	date	precipitation	temp_max	temp_min	wind	weather
1	2012-01-02	10.9	10.6	2.8	4.5	rain
2	2012-01-03	0.8	11.7	7.2	2.3	rain
3	2012-01-04	20.3	12.2	5.6	4.7	rain
4	2012-01-05	1.3	8.9	2.8	6.1	rain
5	2012-01-06	2.5	4.4	2.2	2.2	rain
...	...	...	...	...	...	...
1442	2015-12-13	1.3	7.8	6.1	6.1	sun
1443	2015-12-14	0.0	7.8	1.7	1.7	sun
1455	2015-12-26	0.0	4.4	NaN	2.5	sun
1459	2015-12-30	0.0	5.6	NaN	3.4	sun
1460	2015-12-31	0.0	5.6	NaN	3.5	sun

996 rows × 6 columns

If we were to visualize this data in order to answer the the question “what what the minimum temperature each year for the difference types of weather?” it might look something like this:

alt.Chart(df).mark_point(size=120, opacity=0.7).encode(
    y='year(date):N',
    x='min(temp_min)',
    color='weather',
    shape='weather'
)

One immediate drawback with this visualization is that there could be overlapping dots in the exact same spot which make it difficult to see the exact values from each weather type. This chart still reveals something peculiar about our data: it looks like the dots are neatly arranged in a grid around the lowest temperature for each year. Almost too neat… this might be due to some binning on the data or maybe that the resolution of the sensor is low. There is also probably something going where the lowest values are clamped to just over one degree celsius, that seems quite odd as we would expect some variation in the yearly minimum temperature as well. This could be due to some error in how the data was pre-processed or maybe that the sensor mechanics don’t tolerate values under freezing and it stops measuring during that time.

In general when we read in data, it is a good idea to perform some EDA on the missing values before starting to do it on the rest of the data, regardless of whether we have already observed a weird pattern or not. df.info() can show us how many missing values there are, but we don’t get any info on their pattern.

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 996 entries, 1 to 1460
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   date           996 non-null    datetime64[ns]
 1   precipitation  996 non-null    float64       
 2   temp_max       991 non-null    float64       
 3   temp_min       910 non-null    float64       
 4   wind           996 non-null    float64       
 5   weather        996 non-null    object        
dtypes: datetime64[ns](1), float64(4), object(1)
memory usage: 54.5+ KB

To see if there is a pattern in the missing values, the simplest we could do would be to use pandas built in styling functions. This produces quite long output, so I have truncated it here, but if you were to scroll through it you would see that there seems to be a periodicity to the data and since this data set is ordered by the date the measurement was collected it seems like this periodic variation is through time.

df[:50].style.highlight_null()

	date	precipitation	temp_max	temp_min	wind	weather
1	2012-01-02 00:00:00	10.900000	10.600000	2.800000	4.500000	rain
2	2012-01-03 00:00:00	0.800000	11.700000	7.200000	2.300000	rain
3	2012-01-04 00:00:00	20.300000	12.200000	5.600000	4.700000	rain
4	2012-01-05 00:00:00	1.300000	8.900000	2.800000	6.100000	rain
5	2012-01-06 00:00:00	2.500000	4.400000	2.200000	2.200000	rain
6	2012-01-07 00:00:00	0.000000	7.200000	2.800000	2.300000	rain
7	2012-01-08 00:00:00	0.000000	10.000000	2.800000	2.000000	sun
8	2012-01-09 00:00:00	4.300000	9.400000	5.000000	3.400000	rain
9	2012-01-10 00:00:00	1.000000	6.100000	nan	3.400000	rain
10	2012-01-11 00:00:00	0.000000	6.100000	nan	5.100000	sun
11	2012-01-12 00:00:00	0.000000	6.100000	nan	1.900000	sun
12	2012-01-13 00:00:00	0.000000	5.000000	nan	1.300000	sun
13	2012-01-14 00:00:00	4.100000	4.400000	nan	5.300000	snow
14	2012-01-15 00:00:00	5.300000	1.100000	nan	3.200000	snow
15	2012-01-16 00:00:00	2.500000	1.700000	nan	5.000000	snow
16	2012-01-17 00:00:00	8.100000	3.300000	nan	5.600000	snow
17	2012-01-18 00:00:00	19.800000	nan	nan	5.000000	snow
18	2012-01-19 00:00:00	15.200000	nan	nan	1.600000	snow
19	2012-01-20 00:00:00	13.500000	7.200000	nan	2.300000	snow
20	2012-01-21 00:00:00	3.000000	8.300000	3.300000	8.200000	rain
21	2012-01-22 00:00:00	6.100000	6.700000	2.200000	4.800000	rain
22	2012-01-23 00:00:00	0.000000	8.300000	1.100000	3.600000	rain
23	2012-01-24 00:00:00	8.600000	10.000000	2.200000	5.100000	rain
24	2012-01-25 00:00:00	8.100000	8.900000	4.400000	5.400000	rain
25	2012-01-26 00:00:00	4.800000	8.900000	1.100000	4.800000	rain
27	2012-01-28 00:00:00	0.000000	6.700000	nan	2.200000	rain
28	2012-01-29 00:00:00	27.700000	9.400000	3.900000	4.500000	rain
29	2012-01-30 00:00:00	3.600000	8.300000	6.100000	5.100000	rain
30	2012-01-31 00:00:00	1.800000	9.400000	6.100000	3.900000	rain
31	2012-02-01 00:00:00	13.500000	8.900000	3.300000	2.700000	rain
32	2012-02-02 00:00:00	0.000000	8.300000	1.700000	2.600000	sun
33	2012-02-03 00:00:00	0.000000	14.400000	2.200000	5.300000	sun
34	2012-02-04 00:00:00	0.000000	15.600000	5.000000	4.300000	sun
35	2012-02-05 00:00:00	0.000000	13.900000	1.700000	2.900000	sun
36	2012-02-06 00:00:00	0.000000	16.100000	1.700000	5.000000	sun
37	2012-02-07 00:00:00	0.300000	15.600000	7.800000	5.300000	rain
38	2012-02-08 00:00:00	2.800000	10.000000	5.000000	2.700000	rain
39	2012-02-09 00:00:00	2.500000	11.100000	7.800000	2.400000	rain
40	2012-02-10 00:00:00	2.500000	12.800000	6.700000	3.000000	rain
41	2012-02-11 00:00:00	0.800000	8.900000	5.600000	3.400000	rain
42	2012-02-12 00:00:00	1.000000	8.300000	5.000000	1.300000	rain
43	2012-02-13 00:00:00	11.400000	7.200000	4.400000	1.400000	rain
44	2012-02-14 00:00:00	2.500000	6.700000	1.100000	3.100000	rain
46	2012-02-16 00:00:00	1.800000	7.200000	3.300000	2.100000	rain
47	2012-02-17 00:00:00	17.300000	10.000000	4.400000	3.400000	rain
48	2012-02-18 00:00:00	6.400000	6.700000	3.900000	8.100000	rain
49	2012-02-19 00:00:00	0.000000	6.700000	2.200000	4.700000	sun
50	2012-02-20 00:00:00	3.000000	7.800000	1.700000	2.900000	rain
51	2012-02-21 00:00:00	0.800000	10.000000	7.800000	7.500000	rain
52	2012-02-22 00:00:00	8.600000	10.000000	2.800000	5.900000	rain

We could create a more effective visualization of this using Altair, but it would be a bit involved as you can see below. The reason we are melting the data is so that we can perform this viz for all the columns since we don’t know where the EDA values are when we do EDA. If you are only interested in one column, you could skip the melting part.

alt.Chart(
    df.isna().reset_index().melt(
        id_vars='index'
    )
).mark_rect().encode(
    alt.X('index:O').axis(None),
    alt.Y('variable').title(None),
    alt.Color('value').title('NaN'),
    alt.Stroke('value')  # We set the stroke which is the outline of each rectangle in the heatmap
).properties(
    width=df.shape[0]
)

Aha! Here we can clearly see the periodic pattern in the data. There is a predictable reoccurrence of the missing values. Again, remember that this data is ordered by time even thought we are using the index when plotting it to make sure this technique works with dataframe that do not have a date column.

We could explore what this patterns is due to by sorting the data by another variable. Let’s test our hypothesis that the sensor can’t measure low temperatures. If this is true, we should see that all the temp_min values bunch up to the left since this is also where the lowest temp_max temparatures are (a day with a low max temperature is more likely to have a low min temperature).

alt.Chart(
    df.sort_values(
        'temp_max',
        ignore_index=True
    ).isna().reset_index().melt(
        id_vars='index'
    )
).mark_rect().encode(
    alt.X('index:O').axis(None),
    alt.Y('variable').title(None),
    alt.Color('value').title('NaN'),
    alt.Stroke('value')
).properties(
    width=df.shape[0]
)

There does indeed seem like the missing values are related to low daily temperatures! (to the far right you can see the values that are NaN for both the max and min temperature, NaNs are sorted after all numerical values in pandas).

We could have used the sorting technique with the pandas styler as well:

# Just showing the first 50 to save space
df.sort_values('temp_max')[:50].style.highlight_null()

	date	precipitation	temp_max	temp_min	wind	weather
14	2012-01-15 00:00:00	5.300000	1.100000	nan	3.200000	snow
705	2013-12-06 00:00:00	0.000000	1.100000	nan	4.700000	sun
704	2013-12-05 00:00:00	0.000000	1.100000	nan	2.600000	sun
708	2013-12-09 00:00:00	0.000000	1.100000	nan	1.300000	sun
15	2012-01-16 00:00:00	2.500000	1.700000	nan	5.000000	snow
707	2013-12-08 00:00:00	0.000000	2.200000	nan	2.200000	sun
378	2013-01-13 00:00:00	0.000000	2.200000	nan	1.500000	sun
765	2014-02-04 00:00:00	0.000000	2.800000	nan	4.700000	sun
1064	2014-11-30 00:00:00	0.000000	2.800000	nan	4.400000	sun
377	2013-01-12 00:00:00	0.000000	2.800000	nan	2.000000	sun
16	2012-01-17 00:00:00	8.100000	3.300000	nan	5.600000	snow
375	2013-01-10 00:00:00	0.300000	3.300000	nan	2.100000	snow
768	2014-02-07 00:00:00	0.000000	3.300000	nan	4.200000	sun
379	2013-01-14 00:00:00	0.000000	3.300000	nan	1.300000	sun
1095	2014-12-31 00:00:00	0.000000	3.300000	nan	3.000000	sun
1094	2014-12-30 00:00:00	0.000000	3.300000	nan	3.600000	sun
352	2012-12-18 00:00:00	3.300000	3.900000	nan	5.300000	snow
349	2012-12-15 00:00:00	5.300000	4.400000	nan	5.100000	snow
703	2013-12-04 00:00:00	0.000000	4.400000	nan	1.600000	sun
13	2012-01-14 00:00:00	4.100000	4.400000	nan	5.300000	snow
5	2012-01-06 00:00:00	2.500000	4.400000	2.200000	2.200000	rain
1455	2015-12-26 00:00:00	0.000000	4.400000	nan	2.500000	sun
1065	2014-12-01 00:00:00	0.000000	4.400000	nan	2.200000	sun
12	2012-01-13 00:00:00	0.000000	5.000000	nan	1.300000	sun
56	2012-02-26 00:00:00	1.300000	5.000000	nan	3.400000	snow
702	2013-12-03 00:00:00	0.000000	5.000000	nan	5.600000	sun
710	2013-12-11 00:00:00	0.000000	5.000000	nan	0.800000	sun
59	2012-02-29 00:00:00	0.800000	5.000000	1.100000	7.000000	snow
718	2013-12-19 00:00:00	0.000000	5.000000	nan	2.100000	sun
77	2012-03-18 00:00:00	3.600000	5.000000	nan	2.700000	rain
366	2013-01-01 00:00:00	0.000000	5.000000	nan	2.700000	sun
363	2012-12-29 00:00:00	1.500000	5.000000	3.300000	1.700000	rain
764	2014-02-03 00:00:00	0.000000	5.000000	nan	4.300000	sun
747	2014-01-17 00:00:00	0.000000	5.600000	2.800000	2.300000	sun
72	2012-03-13 00:00:00	9.400000	5.600000	nan	5.300000	snow
359	2012-12-25 00:00:00	13.500000	5.600000	2.800000	4.200000	snow
358	2012-12-24 00:00:00	0.300000	5.600000	2.800000	2.800000	rain
1459	2015-12-30 00:00:00	0.000000	5.600000	nan	3.400000	sun
1096	2015-01-01 00:00:00	0.000000	5.600000	nan	1.200000	sun
711	2013-12-12 00:00:00	6.900000	5.600000	nan	2.300000	sun
709	2013-12-10 00:00:00	0.000000	5.600000	nan	1.500000	sun
392	2013-01-27 00:00:00	1.800000	5.600000	3.900000	4.500000	rain
1460	2015-12-31 00:00:00	0.000000	5.600000	nan	3.500000	sun
393	2013-01-28 00:00:00	7.900000	6.100000	3.300000	3.200000	rain
60	2012-03-01 00:00:00	0.000000	6.100000	1.100000	3.100000	sun
749	2014-01-19 00:00:00	0.000000	6.100000	3.300000	2.500000	sun
348	2012-12-14 00:00:00	7.900000	6.100000	1.100000	1.700000	rain
9	2012-01-10 00:00:00	1.000000	6.100000	nan	3.400000	rain
10	2012-01-11 00:00:00	0.000000	6.100000	nan	5.100000	sun
11	2012-01-12 00:00:00	0.000000	6.100000	nan	1.900000	sun

The missing values in our case were directly related to the column itself, low temp_min values meant that they were missing. Values could also be missing seemingly randomly, or depending on another column in the data frame. These three types of missing values are sometimes called:

Missing not at random - Depends on the value in the column itself
Missing at random - Depends on the value in another column
Missing completely at random - Doesn’t seem to depend on any other measured variable

The underlying reasons for missing values can be quite domain specific, e.g. a medical doctor might be missing blood pressure measurements for younger adults than for older adults since that information might be more valuable for older people. In this case we would say that the values are missing at random since there is no relation between the actual blood pressure reading and whether a value goes missing but there is a relation to the patients age which would be another variable in our dataset (it is more important you remember these principles than memorizing the exact names). There are more discussion of these concepts in this article.

4.1.2. R#

There is no styler function in dplyer like in pandas, so we will go straight to showing a summary and creating the heatmap:

%load_ext rpy2.ipython

Error importing in API mode: ImportError("dlopen(/Users/andytai/miniforge3/envs/531/lib/python3.11/site-packages/_rinterface_cffi_api.abi3.so, 0x0002): Library not loaded: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.dylib\n  Referenced from: <20FB70DB-7E84-3375-A520-E0350E06C060> /Users/andytai/miniforge3/envs/531/lib/python3.11/site-packages/_rinterface_cffi_api.abi3.so\n  Reason: tried: '/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.dylib' (no such file), '/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.dylib' (no such file)")

Trying to import in ABI mode.

%%R -i df
library(tidyverse)

df |> glimpse()

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Rows: 996

Columns: 6

$ date          <dttm> 2012-01-02, 2012-01-03, 2012-01-04, 2012-01-05, 2012-01…
$ precipitation <dbl> 10.9, 0.8, 20.3, 1.3, 2.5, 0.0, 0.0, 4.3, 1.0, 0.0, 0.0,…
$ temp_max      <dbl> 10.6, 11.7, 12.2, 8.9, 4.4, 7.2, 10.0, 9.4, 6.1, 6.1, 6.…
$ temp_min      <dbl> 2.8, 7.2, 5.6, 2.8, 2.2, 2.8, 2.8, 5.0, NaN, NaN, NaN, N…
$ wind          <dbl> 4.5, 2.3, 4.7, 6.1, 2.2, 2.3, 2.0, 3.4, 3.4, 5.1, 1.9, 1…
$ weather       <chr> "rain", "rain", "rain", "rain", "rain", "rain", "sun", "…

There is no overview of NAs in glimpse, instead we can see the NAs in summary:

%%R
df |> summary()

      date

 precipitation

    temp_max

 Min.   :2012-01-02 00:00:00.00

 Min.   : 0.000

 Min.   : 1.10

 1st Qu.:2012-10-03 18:00:00.00

 1st Qu.: 0.000

 1st Qu.:10.60

 Median :2013-08-05 12:00:00.00

 Median : 0.000

 Median :17.20

 Mean   :2013-09-27 05:08:33.25

 Mean   : 1.776

 Mean   :17.37

 3rd Qu.:2014-08-19 06:00:00.00

 3rd Qu.: 0.800

 3rd Qu.:23.60

 Max.   :2015-12-31 00:00:00.00

 Max.   :54.100

 Max.   :35.60

 NA's   :5

    temp_min

      wind

   weather

 Min.   : 1.100

 Min.   :0.4

 Length:996

 1st Qu.: 5.600

 1st Qu.:2.3

 Class :character

 Median : 9.400

 Median :3.0

 Mode  :character

 Mean   : 9.294

 Mean   :3.2

 3rd Qu.:13.300

 3rd Qu.:3.9

 Max.   :18.300

 Max.   :9.5

 NA's   :86

We use the same strategy to visualize NaNs in ggplot as in Altair where we first pivot_longer the dataframe:

%%R -w 1200
theme_set(theme_gray(base_size = 20))

df |>
    map_df(is.na) |>
    mutate(index = row_number()) |>
    pivot_longer(-index) |>
    ggplot(aes(x = index, y = name, fill = value)) +
        geom_raster() +
        labs(y = '', fill = 'NaN')

../_images/17f32ba71a007e4722a3851ba48af0fe1b956bc8e8a0db7e7282701ccbd12afc.png

Sorted by max temp:

%%R -w 1200
theme_set(theme_gray(base_size = 20))

df |>
    arrange(temp_max) |>
    map_df(is.na) |>
    mutate(index = row_number()) |>
    pivot_longer(-index) |>
    ggplot(aes(x = index, y = name, fill = value)) +
        geom_raster() +
        labs(y = '', fill = 'NaN')

../_images/be23a6e6e2cac26faafa3b8bdf994a49e6f8b9b15cc79c9edafa49e316284e15.png

4.2. Repeating the same plots for multiple dataframe columns#

4.2.1. Py#

Previously we have made subplots via faceting, which creates one subplot per unique value in a categorical column and displays the same numerical columns in all the subplots/facets. This view of multiple subsets in the data is often called a trellis plot or plot of small multiples. Here, we will see how we can create subplots that each display all the data points, but is repeated for different columns in the data.

We specify which columns we want to use via the .repeat method, and where we want to use them via alt.repeat. For this plot, we keep the y-axis constant, and repeat the plot for different x-axis columns.

url = 'https://raw.githubusercontent.com/joelostblom/teaching-datasets/main/movies.json'
movies = pd.read_json(url)[['runtime', 'budget', 'revenue', 'genres', 'countries']]
movies

	runtime	budget	revenue	genres	countries
0	100	94000000	940335536	[Animation]	[United States of America]
1	143	140000000	655011224	[Fantasy]	[United States of America]
2	87	75000000	527068851	[Animation]	[United States of America]
3	151	200000000	1065659812	[Fantasy]	[United States of America]
4	106	70000000	101371017	[Fantasy]	[United States of America]
...	...	...	...	...	...
675	107	100000000	519876949	[History]	[United Kingdom, United States of America]
676	86	50000000	66913939	[Animation]	[United States of America]
677	100	22000000	194647323	[Fantasy]	[United States of America]
678	72	3500000	3775000	[Animation]	[United States of America]
679	109	10000000	4073489	[History]	[United Kingdom, United States of America]

680 rows × 5 columns

# You could extract the columns like this, but we will write them out for clarity here
# numeric_cols = movies.select_dtypes('number').columns.tolist()
alt.Chart(movies).mark_bar().encode(
     alt.X(alt.repeat()).type('quantitative').bin(maxbins=40),
     y='count()',
).properties(
    width=300,
    height=200
).repeat(
    ['runtime', 'revenue', 'budget']
)

4.2.2. R#

There is not “repeat” function in ggplot, so we would first need to pivot all data frame variables into the same column and then use faceting on this new column name.

%%R

library(rjson)
library(tidyverse)
theme_set(theme_gray(base_size = 16))


url <- 'https://raw.githubusercontent.com/joelostblom/teaching-datasets/main/movies.json'
movies <- fromJSON(file = url) %>%
    as_tibble() %>%
    select(runtime, revenue, budget, genres, countries) %>%
    unnest(-c(countries, genres))
glimpse(movies)

Rows: 680

Columns: 5

$ runtime   <dbl> 100, 143, 87, 151, 106, 119, 115, 178, 179, 201, 132, 101, 1…
$ revenue   <dbl> 940335536, 655011224, 527068851, 1065659812, 101371017, 7627…
$ budget    <dbl> 9.40e+07, 1.40e+08, 7.50e+07, 2.00e+08, 7.00e+07, 2.50e+07, …
$ genres    <named list> "Animation", "Fantasy", "Animation", "Fantasy", "Fant…
$ countries <named list> "United States of America", "United States of America…

%%R

movies_long <- movies |>
    select_if(is.numeric) |>
    pivot_longer(everything())  # To get only some columns we can write c(revenue, budget)

movies_long

# A tibble: 2,040 × 2

   name        value

   <chr>       <dbl>

 1 runtime       100

 2 revenue 940335536

 3 budget   94000000

 4 runtime       143

 5 revenue 655011224

 6 budget  140000000

 7 runtime        87

 8 revenue 527068851

 9 budget   75000000

10 runtime       151

# ℹ 2,030 more rows

# ℹ Use `print(n = ...)` to see more rows

%%R -w 800 -h 250

movies_long |>
    ggplot(aes(x = value)) + 
        geom_density(fill = 'grey') + 
        facet_wrap(~name, scales = 'free')

../_images/f5f4bd76d85d27d6801d515f56e44c33068dedab1591ce5f34bb2949c1bdcaa9.png

4.2.3. Repeating on two axes#

Py#

To efficiently repeat over both the axis, we can use the column and row parameter to repeat. These work similarly as they do for facet, distributing the list of data frame variables over only one axis, which we specify with alt.repeat to the x and y axis.

alt.Chart(movies).mark_point(opacity=0.3, size=10).encode(
     alt.X(alt.repeat()).type('quantitative'),
     y='revenue',
).properties(
    width=200,
    height=200
).repeat(
    ['runtime', 'revenue', 'budget']
)

When we don’t give any named parameters to repeat and alt.repeat, they use the default, which is repeat and then you can wrap with columns=2 if you want more than one row. We could use this technique to create a what is usually called a scatter plot matrix (or pairplot), where we repeat numerical columns on both axes to investigate their pairwise relationships.

alt.Chart(movies).mark_point(opacity=0.3, size=10).encode(
     alt.X(alt.repeat('row')).type('quantitative'),
     alt.Y(alt.repeat('column')).type('quantitative')
).properties(
    width=200,
    height=200
).repeat(
    column=['runtime', 'revenue', 'budget'],
    row=['runtime', 'revenue', 'budget']
)

alt.Chart(movies).mark_point(opacity=0.3, size=10).encode(
     alt.X(alt.repeat('column')).type('quantitative'),
     alt.Y(alt.repeat('row')).type('quantitative')
).properties(
    width=200,
    height=200
).repeat(
    row=['runtime', 'revenue', 'budget'],
    column=['runtime', 'revenue', 'budget']
)

R#

This would not work if both axes are numerical, but luckiy there is an extension packagge for ggplot called GGally, which can be used here. It also plots the density curves on the diagonal and the correlations on top.

%%R -w 800 -h 550
library(GGally)
GGally::ggpairs(movies %>% select_if(is.numeric), progress = FALSE)

../_images/9cedc6f076267801787dc94816d5b688133f101fa44d589cc6118241500270b1.png

This function can also show info on categorical variables.

%%R -w 800 -h 550

free_countries <- movies %>% unnest(countries) %>% select(!genres)
GGally::ggpairs(free_countries, aes(color = countries), progress=FALSE)

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

../_images/665493197d3355865bb913d2ff4a33c6ae34a985fc0fc8a1f6cb9177eec01bf5.png

Plotting multiple categories is also possible and will include counts of the combinations of categories as in the next plot. Customizing these plots is a bit cumbersome, this blog post has some good examples, and you can see one below too.

%%R -w 800 -h 750

free_both <- movies %>% unnest(genres) %>% unnest(countries)
GGally::ggpairs(
    free_both,
    aes(color = countries),
    progress = FALSE,
    lower = list(continuous = GGally::wrap('points', alpha = 0.3, size=0.1, color = 'steelblue')),
    diag = list(continuous = GGally::wrap('densityDiag', fill = 'coral', alpha = 0.4))) 

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

../_images/cb9f2ae26fe68017f5e9f30f7fc95c1027b4b0544289d226bb627381a82adbce.png

4.3. Correlation plots#

4.3.1. Py#

A good complement to a scatter plot matrix is a correlation plot which helps formalize the correlation between numerical variables. Using “spearman” correlation instead of “pearson” allows us to detect non-linear relationships better.

This can be done in pandas via the styler (requires matplotlib).

movies.corr(numeric_only=True).style.background_gradient()

	runtime	budget	revenue
runtime	1.000000	0.165387	0.174192
budget	0.165387	1.000000	0.694802
revenue	0.174192	0.694802	1.000000

We could do the same in Altair but it would require some more work

corr_df = (
    movies
    .corr('spearman', numeric_only=True)
    .abs()                      # Use abs for negative correlation to stand out
    .stack()                    # Get df into long format for altair
    .reset_index(name='corr'))  # Name the index that is reset to avoid name collision
corr_df

	level_0	level_1	corr
0	runtime	runtime	1.000000
1	runtime	budget	0.147367
2	runtime	revenue	0.106208
3	budget	runtime	0.147367
4	budget	budget	1.000000
5	budget	revenue	0.758968
6	revenue	runtime	0.106208
7	revenue	budget	0.758968
8	revenue	revenue	1.000000

alt.Chart(corr_df).mark_circle().encode(
    x='level_0',
    y='level_1',
    size='corr',
    color='corr'
)

With some effort, we can make it look nicer and remove the correlation on the diagonal since that is just with the same column.

corr_df = movies.select_dtypes('number').corr('spearman', numeric_only=True).stack().reset_index(name='corr')
corr_df.loc[corr_df['corr'] == 1, 'corr'] = 0  # Remove diagonal
corr_df['abs'] = corr_df['corr'].abs()
corr_df

	level_0	level_1	corr	abs
0	runtime	runtime	0.000000	0.000000
1	runtime	budget	0.147367	0.147367
2	runtime	revenue	0.106208	0.106208
3	budget	runtime	0.147367	0.147367
4	budget	budget	0.000000	0.000000
5	budget	revenue	0.758968	0.758968
6	revenue	runtime	0.106208	0.106208
7	revenue	budget	0.758968	0.758968
8	revenue	revenue	0.000000	0.000000

alt.Chart(corr_df).mark_circle().encode(
    x='level_0',
    y='level_1',
    size=alt.Size('abs').scale(domain=(0, 1)),
    color=alt.Color('corr').scale(scheme='blueorange', domain=(-1, 1))
)

The domain argument sets the extent of the colorscale (from a perfect negative to a perfect positive correlation), we will talk more about colorscale later in the course.

4.3.2. R#

We could do the separate calculation in ggplot as well, but there is also a special function in GGally, which does the computation and visualizes it as a heatmap.

%%R -w 600 -h 350
GGally::ggcorr(movies)

In addition: Warning message:
data in columns genres and countries are not numeric and were ignored 

../_images/59f68b18a8557dea0aee1f3da76d540332f4e584e0997c78f3b4655245712d6b.png

4.4. Counting combinations of categorical groups#

4.4.1. Py#

Counting categoricals is helpful to get and overview of where most observations lie. Often the size and color can be used for this, similar to the correlation plot.

boom_both = movies.explode('countries').explode('genres')
alt.Chart(boom_both).mark_square().encode(
    x='genres',
    y='countries',
    color='count()',
    size='count()'
)

As with bar charts of counts, it is often helpful to sort these categorical plot according to which combinations have the most counts. Here we can sort based on the color or size encoding, just as we have sorted based on x and y before (the effect of this sorting is more notable in larger charts).

boom_both = movies.explode('countries').explode('genres')
alt.Chart(boom_both).mark_square().encode(
    x=alt.X('genres').sort('-color'),
    y=alt.Y('countries').sort('color'),
    color='count()',
    size='count()'
)

A heatmap could be created as well, but it can be a little harder to see the difference here sine we are not using size for the count, just color variations.

alt.Chart(boom_both).mark_rect().encode(
    x='genres',
    y='countries',
    color='count()'
)

For more precise counts, we could use a faceted bar chart.

alt.Chart(boom_both).mark_bar().encode(
    x='count()',
    y='genres',
    color='genres'
).facet(
    'countries',
    columns=1
)

4.4.2. R#

GGally has a similar plot for counting categories.

%%R -w 600 -h 350

free_both <- movies %>% unnest(genres) %>% unnest(countries)
GGally::ggally_cross(free_both, aes(x = genres, y = countries))

../_images/7faf1a3dd327249112562b39dd2065409f6ac22dd7c7dd5a73821cea604ec615.png

We can create an heatmap using ggplot’s geom_tile together with add_count (remember that this function adds the count in a column names n).

%%R -w 600 -h 350

free_both %>%
    add_count(genres, countries) %>%
    ggplot() +
        aes(x = genres,
            y = countries,
            fill = n) +
        geom_tile()

../_images/0e1009a4133eaf0524828ecf60494a5dd597b229c503a26345fd3708217afebb.png

A barplot of counts could be made in ggplot without faceting by using 'dodge' for he position. The color parameter is used to create visual separation between the bars with a white outline.

%%R -w 600 -h 350

ggplot(free_both) +
    aes(y = countries,
        fill = genres) +
    geom_bar(stat = 'count', position = 'dodge', color = 'white')

../_images/e13a0e678edfd432e70e9e645bff694f6142c1327cb6f219d3f6955f38d97ebd.png

4.5. Optional material#

In addition to the ggally package, there are a few other packages in both R and Python that facilitate quick EDA and missing value visualization. As long as you understand the underlying principles you are free to use any of these:

import altair_ally as aly

aly.nan(df.sort_values('temp_max', ignore_index=True))

aly.pair(movies)

Python

R

Some of them go to great length to facilitate the creation of quick EDA visualizations: