Introduction to Machine Learning – Tabular Data and Terminology

Terminology

Here is some basic terminology used in ML:

examples = rows
features = inputs
targets = outputs
training = learning = fitting

[Figure: Supervised machine learning terminology]

In the supervised machine learning paradigm, we have input data and an output. We feed our input to a machine learning algorithm.

The question is: how do we represent this input effectively?

Is there a specific format our data must be in before we can pass it to machine learning algorithms?

YES! In supervised machine learning, we typically work with tabular data.

Here is a toy example of tabular data.

The task here is to predict the quiz2 grade given all this information.

Rows are examples
Columns are features and one of the columns is typically the target.
Features are relevant characteristics of the problem (usually suggested by experts).
To a machine, column names (features) have no meaning. Only feature values and how they vary across examples mean something.
Training a model can also be called learning or fitting a model.
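The vocabulary above can be checked on a tiny table. The following is a minimal sketch, assuming pandas is installed; it rebuilds three rows of the quiz2 toy data shown in this section in memory, and the names `toy_df`, `target_column`, and `anonymous_df` are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Three rows copied from the quiz2 toy data shown in this section.
toy_df = pd.DataFrame(
    {
        "ml_experience": [1, 1, 0],
        "class_attendance": [1, 0, 0],
        "lab1": [92, 94, 78],
        "lab2": [93, 90, 85],
        "lab3": [84, 80, 83],
        "lab4": [91, 83, 80],
        "quiz1": [92, 91, 80],
        "quiz2": ["A+", "not A+", "not A+"],
    }
)

# Rows are examples; every column except the target is a feature.
target_column = "quiz2"
n_examples, n_columns = toy_df.shape
feature_names = [c for c in toy_df.columns if c != target_column]
print(n_examples, "examples,", len(feature_names), "features, target:", target_column)

# Replace the column names with meaningless labels (col0, col1, ...).
anonymous_df = toy_df.copy()
anonymous_df.columns = [f"col{i}" for i in range(n_columns)]

# The values, which are all an algorithm works with, are unchanged.
same_values = anonymous_df.to_numpy().tolist() == toy_df.to_numpy().tolist()
print("values unchanged after renaming:", same_values)
```

Renaming the columns leaves every value in place, which is the sense in which column names carry no meaning for a machine.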

All of these will be used in the course so it’s important to get familiar with the vocabulary now.

You will see a lot of varied terminology in machine learning and statistics, and it can sometimes be confusing. See the MDS terminology resource to clear up any confusion.

Example 1: Tabular data for the housing price prediction problem

import pandas as pd

df = pd.read_csv("data/kc_house_data.csv")
df.head(3)

	price	bedrooms	bathrooms	sqft_living	...	lat	long	sqft_living15	sqft_lot15
0	425000.0	3	1.00	1180	...	47.7121	-122.333	2240	7875
1	230500.0	2	1.00	740	...	47.4793	-122.336	850	8775
2	372000.0	4	2.75	2610	...	47.3747	-122.124	2390	7852

3 rows × 19 columns

df.shape

(1000, 19)
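In this table, `price` plays the role of the target and the remaining columns are features. The sketch below separates the two. It assumes pandas is installed and uses only the three rows and the visible columns from the preview above (the columns hidden behind `...` are left out rather than guessed); the names `housing_sample`, `X_housing`, and `y_housing` are hypothetical.

```python
import pandas as pd

# Three rows and only the visible columns from the preview above.
housing_sample = pd.DataFrame(
    {
        "price": [425000.0, 230500.0, 372000.0],
        "bedrooms": [3, 2, 4],
        "bathrooms": [1.00, 1.00, 2.75],
        "sqft_living": [1180, 740, 2610],
        "lat": [47.7121, 47.4793, 47.3747],
        "long": [-122.333, -122.336, -122.124],
        "sqft_living15": [2240, 850, 2390],
        "sqft_lot15": [7875, 8775, 7852],
    }
)

# Features: everything except the column we want to predict.
X_housing = housing_sample.drop(columns=["price"])
# Target: the column we want to predict.
y_housing = housing_sample["price"]

print(X_housing.shape)  # (number of examples, number of features)
print(y_housing.shape)  # one target value per example
# Here the target is a number rather than a label such as "A+".
print(pd.api.types.is_numeric_dtype(y_housing))
```

On the full file the same two lines (`drop` and column selection) would apply to `df`; only the number of rows and feature columns would differ.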

Example 2: Tabular data for quiz2 classification problem

classification_df = pd.read_csv("data/quiz2-grade-toy-classification.csv")
classification_df.head(3)

	ml_experience	class_attendance	lab1	lab2	lab3	lab4	quiz1	quiz2
0	1	1	92	93	84	91	92	A+
1	1	0	94	90	80	83	91	not A+
2	0	0	78	85	83	80	80	not A+

classification_df.shape

(21, 8)

X = classification_df.drop(columns=["quiz2"])
y = classification_df["quiz2"]
X.head()

	ml_experience	class_attendance	lab1	lab2	lab3	lab4	quiz1
0	1	1	92	93	84	91	92
1	1	0	94	90	80	83	91
2	0	0	78	85	83	80	80
3	0	1	91	94	92	91	89
4	0	1	77	83	90	92	85

y.head()

0        A+
1    not A+
2    not A+
3        A+
4        A+
Name: quiz2, dtype: object
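After splitting a table into `X` and `y`, it can help to confirm that the two still line up row by row and to see how many examples fall under each target value. This is a minimal sketch, assuming pandas is installed; it uses only the five rows printed above, and `X_sample`, `y_sample`, and `class_counts` are hypothetical names.

```python
import pandas as pd

# The five feature rows printed by X.head() above.
X_sample = pd.DataFrame(
    {
        "ml_experience": [1, 1, 0, 0, 0],
        "class_attendance": [1, 0, 0, 1, 1],
        "lab1": [92, 94, 78, 91, 77],
        "lab2": [93, 90, 85, 94, 83],
        "lab3": [84, 80, 83, 92, 90],
        "lab4": [91, 83, 80, 91, 92],
        "quiz1": [92, 91, 80, 89, 85],
    }
)
# The five target values printed by y.head() above.
y_sample = pd.Series(["A+", "not A+", "not A+", "A+", "A+"], name="quiz2")

# Each example needs exactly one target, matched by row label.
same_length = len(X_sample) == len(y_sample)
same_index = X_sample.index.equals(y_sample.index)
print("one target per example:", same_length and same_index)

# How many examples carry each target value.
class_counts = y_sample.value_counts()
print(class_counts)
```

These counts describe only the five previewed rows, not the full 21-row file.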


Let’s apply what we learned!
