Data Visualization – How Can We Visualize Data?

There are two types of visualization approaches

When learning about data visualization, it is helpful to distinguish between the following two approaches to visualization:

Imperative
Declarative

Imperative (low level) plotting focuses on plot mechanics

Focus on plot construction details.
- Often includes loops, low-level drawing commands, etc.
Specify how something should be done
- “Draw a red point for every observation that has value X in column A, a blue point for every observation that has value Y in column A, etc.”
Minute control over plotting details, but laborious for complex visualization.

The data we will be plotting

Country	Area	Population
Russia	17098246	144386830
Canada	9984670	38008005
China	9596961	1400050000

Example of imperative plotting

# Pseudocode
colors = ['blue', 'red', 'yellow']
plot = create_plot()
for row_number, row_data in enumerate(dataframe):
    plot.add_point(x=row_data['Area'], y=row_data['Population'], color=colors[row_number])

For this example, we will use Python-inspired pseudocode, which is code that is made up and designed so that it is less complex and easier to read than real programming languages.

This helps us focus on understanding the concepts of plotting instead of getting hung up on the code syntax details of a particular package.

You can see that an imperative approach to plotting this data would be to first create the plot and then loop through the dataframe to add a point for each country one by one.

To colour the points, we need to manually create a sequence of colours that we can access inside the loop.

The visualization on this page is an example of what a plot could look like when run with real code similar to our pseudocode.

You can see that one of the countries is bigger than the others, and one of the countries has a much larger population, but without seeing the code, it is not possible to know which colour represents which country.

We could add a legend by creating it explicitly and adding one coloured dot per iteration in the loop.

Declarative (high level) plotting focuses on the data

Focus on data and relationships.
- Often includes linking columns to visual channels.
Specify what should be done
- “Assign colors based on the values in column A”
Smart defaults give us what we want without complete control over minor plotting details.

Example of declarative plotting

# Pseudocode
point_plot(data=dataframe, x='Area', y='Population', color='Country')

A high-level grammar of graphics helps us compose plots effectively

Simple grammatical components combine to create visualizations.
Visualization grammars often consist of three main components:
1. Create a chart linked to a dataframe.
2. Add graphical elements (such as points, lines, etc).
3. Encode dataframe columns as visual channels (such as x, etc).

# Pseudocode
chart(dataframe).add_points().encode_columns(x='Area', y='Population', color='Country')

The declarative plotting concept can be implemented in different ways.

In the previous slide, we had a dedicated function for creating the pointplot, and there would be a separate function for creating a lineplot, barplot, etc.

With this approach, it is often not easy to combine plots together, unless there is a specific function for that purpose and the three bullets points on this slide are all executed by this single function.

Another way to use declarative plotting is via a visualization grammar.

Generally, a grammar governs how individual parts come together to compose more complex constructs.

For example, a linguistic grammar decides how words and phrases can be combined into coherent sentences. A data visualization grammar determines how to combine individual parts of the plotting syntax to create complete visualization.

In the example on this slide, you can see that the three bullet points are now broken down into one main function to create the chart linked to the data, and then we build upon this by adding the graphical elements (add_points()) and the encoding of the columns to properties of this chart (encode_columns()).

By combining these three grammatical components in different ways, we can build a wide range of visualizations, without memorizing a unique function for each plot type.

Thanks to this grammatical visualization approach, we also only require minimal changes to our code to change the type of plot.

The Python plotting landscape

Now that we know the basic concepts of how data can be visualized, let’s select a Python package and get coding!

In this image, you can see the most commonly used Python plotting packages.

There are many more, but these are the ones you are the most likely to hear about, so it is good to know that they exist.

The text to the left in the image is a legend to explain the colours used for the different Python packages (blue for high level, declarative packages and orange for low-level, imperative packages).

As you can see there are several high and low-level language, so how do we chose?

In this course we will use Altair, because it is a powerful declarative visualization tool with a clear and consistent grammar that also allows us to add interactive components to our plots, such as tooltips and selections.

We have also included some of the most common visualization packages for the web which are built-in Javascript and coloured in white.

The reason we mention these is that the Altair library is a little bit of Python code connected to an already existing powerful JavaScript package called VegaLite, which in turns builds on D3, the most dominant visualization package on the web today.

By leveraging these well-established JavaScript visualization packages Altair can create plots that work natively on the web and includes interactive features without reinventing the wheel.

Since Altair and VegaLite are relatively new visualization libraries, they don’t yet support every single plot type out there, but they more than make up for it with their ease of use and support for powerful interactive visualizations, as we will see later.

Sample data can be found in Altair’s companion package vega_datasets

from vega_datasets import data

cars = data.cars()
cars

	Name	Miles_per_Gallon	Cylinders	Displacement	...	Weight_in_lbs	Acceleration	Year	Origin
0	chevrolet chevelle malibu	18.0	8	307.0	...	3504	12.0	1970-01-01	USA
1	buick skylark 320	15.0	8	350.0	...	3693	11.5	1970-01-01	USA
2	plymouth satellite	18.0	8	318.0	...	3436	11.0	1970-01-01	USA
...	...	...	...	...	...	...	...	...	...
403	dodge rampage	32.0	4	135.0	...	2295	11.6	1982-01-01	USA
404	ford ranger	28.0	4	120.0	...	2625	18.6	1982-01-01	USA
405	chevy s-10	31.0	4	119.0	...	2720	19.4	1982-01-01	USA

406 rows × 9 columns

Before we start visualizing data, we need to select a dataset and often also a question we want to answer.

Altair works with dataframes in the “tidy” format (which we talked about in the Programming in Python for Data Science course), which means that they should consist of rows with one observation each and a set of named data columns with one feature each (you might also have heard these called fields or variables, but we will stick to columns for clarity).

In this course, we will often use data from the vega-datasets package, which has many plot-friendly practice datasets available as Pandas dataframes and can be loaded as demonstrated in this slide. We can use these datasets by importing the data module from the vega_datasets packages as in this slide. Here, our data contains the name of different cars and some attributes relating to each car. There are many interesting questions we could ask from this data set! For our first plot, let’s explore the relationship between how heavy a car is (the Weight_in_lbs column) and how good gas mileage it has (the Miles_per_gallon column).

Before starting to code the visualization, take a few seconds and think about what you would expect the relationship between these two columns to look like when you plot it.

Adding graphical elements via marks

import altair as alt

alt.Chart(cars).mark_point()

Here we assigned a shorter name (alt) to the Altair library when importing it to save us some typing later. The Altair syntax is similar to the grammar of graphics pseudocode we saw a few slides ago. The fundamental object in Altair is the Chart, which takes a data frame as a single argument, e.g. alt.Chart(cars).

After the chart object has been created, we can specify how the graphical element should look that we use to visualize the data. This is called a graphical mark in Altair, and in this slide, we have used mark_point() to show the data as points.

Since we have not specified which columns should be used for the x and y axes, we appear to only see one point in this plot since all the data is plotted on top of each other in the same location.

To the right of the chart, there is a button with three dots on it. don’t worry about it right now, we will explain what this is for at the end of the chapter.

Encoding columns as visual channels

Mapping a dataframe column to the x-scale

alt.Chart(cars).mark_point().encode(
    x='Weight_in_lbs')

To visually separate the points, we can encode columns in the dataframe as visual channels, such as the axes or colours of the plot.

Here, we encode the column Miles_per_Gallon as the x-axis. For Pandas data frames, Altair automatically determines an appropriate data type for the mapped column, which in this case is quantitative (or numerical) and shows the numbers under the axis.

You can see that there are several short black lines spread out evenly on the x-axis. These are called axis ticks and help us see where the values of this dataframe column lie along the axis.

The faint gray lines are called grid lines and extend the locations of the axis ticks so that it is easy to compare their position to the points.

This is particularly useful when the points might be further away from the axis ticks, such as in the next slide.

Mapping a dataframe column to the y-scale

alt.Chart(cars).mark_point().encode(
    x='Weight_in_lbs',
    y='Miles_per_Gallon')

By spreading out the data along both the x and y-axis, we can answer our initial question about the relationship between car weight and gas mileage. as it appears that the heavier cars are the ones that have the poorest mileage.

Although we might have expected this to be the case, visualizing all the data points also provides information on the nature of the relationship between weight and mileage.

It appears that the x-y points don’t simply follow a straight line, but rather a curved line that where the mileage drop quickly when moving away from the lightest cars, but then decreases more slowly throughout the remainder of the data.

This rich, easily interpretable display of information is one of the main advantages of visualizing data and later in the course, we will talk more about the different type of relationships, such as linear, exponential, etc.

Mapping a numerical dataframe column to the colour scale

alt.Chart(cars).mark_point().encode(
    x='Weight_in_lbs',
    y='Miles_per_Gallon',
    color='Horsepower')

Is there a relationship between horsepower and car weight, or fuel-efficiency?

To enrich this display of information further, we can colour the points according to a column in the dataframe. When we encode a column as the colour channel Altair will automatically figure out an appropriate colour scale to use, depending on whether the data is numerical, categorical, etc. Here we have indicated that we want to colour the points according to the car’s horsepower, which indicated how powerful its engine is.

We can see that the heavier cars have more powerful engines, than the lighter ones, but when using colour for a numerical comparison like this, makes it is harder to tell whether the relationship follows a straight line or is of another nature, so this encoding is mostly useful as an approximate indication of the horsepower.

We can also observe a relationship between the horsepower of a vehicle and the fuel efficiency. It appears that cars with greater horsepower (the points with a darker shade of blue) are less efficient with their fuel since miles per Gallon is much lower.

In the next module, we will learn more in detail about which encodings are most suitable for different comparisons.

Mapping a categorical dataframe column to the colour scale

alt.Chart(cars).mark_point().encode(
    x='Weight_in_lbs',
    y='Miles_per_Gallon',
    color='Origin')

Mapping a dataframe column to the shape scale

alt.Chart(cars).mark_point().encode(
    x='Weight_in_lbs',
    y='Miles_per_Gallon',
    color='Origin',
    shape='Origin')

Mapping a dataframe column to the size scale

alt.Chart(cars).mark_point().encode(
    x='Weight_in_lbs',
    y='Miles_per_Gallon',
    color='Origin',
    shape='Origin',
    size='Horsepower')

The action button can be used to save the plot

alt.Chart(cars).mark_point().encode(
    x='Weight_in_lbs',
    y='Miles_per_Gallon',
    color='Origin',
    shape='Origin',
    size='Horsepower')