Tabular Data and Terminology

Terminology

In supervised ML, several terms are used interchangeably:

  • examples = rows
  • features = inputs
  • targets = outputs
  • training = learning = fitting
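
To make the mapping concrete, here is a minimal sketch on a toy table. The name `toy_df` is hypothetical, and the three rows reuse a few values from the quiz2 example in this section; treating `quiz2` as the target is the assumption being illustrated.

```python
import pandas as pd

# Hypothetical toy table: each row is one example.
toy_df = pd.DataFrame(
    {
        "lab1": [92, 94, 78],
        "quiz1": [92, 91, 80],
        "quiz2": ["A+", "not A+", "not A+"],
    }
)

target_name = "quiz2"  # target = output we want to predict
feature_names = [c for c in toy_df.columns if c != target_name]  # features = inputs
n_examples = toy_df.shape[0]  # examples = rows

print(f"{n_examples} examples, features={feature_names}, target={target_name}")
```

Each row is one example, every column except the target is a feature, and the remaining column holds the targets.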

[Figure: Supervised machine learning terminology]

Example 1: Tabular data for the housing price prediction problem

import pandas as pd

df = pd.read_csv("data/kc_house_data.csv")
df.head(3)
      price  bedrooms  bathrooms  sqft_living  ...      lat     long  sqft_living15  sqft_lot15
0  425000.0         3       1.00         1180  ...  47.7121 -122.333           2240        7875
1  230500.0         2       1.00          740  ...  47.4793 -122.336            850        8775
2  372000.0         4       2.75         2610  ...  47.3747 -122.124           2390        7852

3 rows × 19 columns

df.shape
(1000, 19)
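
As a sketch of how this table splits into features and a target, the block below rebuilds the three rows printed above (four of the 19 columns) in memory instead of reading the CSV. The names `housing_df`, `X_housing`, and `y_housing` are hypothetical, and choosing `price` as the target is an assumption based on the problem being price prediction.

```python
import pandas as pd

# Stand-in for the CSV: the three rows and four of the columns shown above.
housing_df = pd.DataFrame(
    {
        "price": [425000.0, 230500.0, 372000.0],
        "bedrooms": [3, 2, 4],
        "bathrooms": [1.00, 1.00, 2.75],
        "sqft_living": [1180, 740, 2610],
    }
)

# Assumption: `price` is the target, since the task is price prediction.
X_housing = housing_df.drop(columns=["price"])  # features (inputs)
y_housing = housing_df["price"]                 # target (output)

print(X_housing.shape)  # (number of examples, number of features)
print(y_housing.dtype)  # numeric dtype: the target is a number, not a label
```

Here the target is numeric, unlike the categorical quiz2 grade in Example 2.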

Example 2: Tabular data for quiz2 classification problem

classification_df = pd.read_csv("data/quiz2-grade-toy-classification.csv")
classification_df.head(3)
   ml_experience  class_attendance  lab1  lab2  lab3  lab4  quiz1   quiz2
0              1                 1    92    93    84    91     92      A+
1              1                 0    94    90    80    83     91  not A+
2              0                 0    78    85    83    80     80  not A+

classification_df.shape
(21, 8)

X = classification_df.drop(columns=["quiz2"])
y = classification_df["quiz2"]
X.head()
   ml_experience  class_attendance  lab1  lab2  lab3  lab4  quiz1
0              1                 1    92    93    84    91     92
1              1                 0    94    90    80    83     91
2              0                 0    78    85    83    80     80
3              0                 1    91    94    92    91     89
4              0                 1    77    83    90    92     85

y.head()
0        A+
1    not A+
2    not A+
3        A+
4        A+
Name: quiz2, dtype: object
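
After a split like this, a few quick checks can catch common mistakes. The sketch below rebuilds only the five rows printed above rather than the full 21-row file, so the class counts describe that subset only; `class_counts` is a hypothetical name.

```python
import pandas as pd

# Stand-in for the CSV: the five rows printed above.
classification_df = pd.DataFrame(
    {
        "ml_experience": [1, 1, 0, 0, 0],
        "class_attendance": [1, 0, 0, 1, 1],
        "lab1": [92, 94, 78, 91, 77],
        "lab2": [93, 90, 85, 94, 83],
        "lab3": [84, 80, 83, 92, 90],
        "lab4": [91, 83, 80, 91, 92],
        "quiz1": [92, 91, 80, 89, 85],
        "quiz2": ["A+", "not A+", "not A+", "A+", "A+"],
    }
)

X = classification_df.drop(columns=["quiz2"])
y = classification_df["quiz2"]

# X and y must describe the same examples, in the same order.
assert X.shape[0] == y.shape[0]
assert X.index.equals(y.index)

# The target column must not remain among the features.
assert "quiz2" not in X.columns

# Count how many examples fall into each class.
class_counts = y.value_counts()
print(class_counts)
```

`drop` returns a new DataFrame, so `classification_df` itself still contains the `quiz2` column after the split.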

Let’s apply what we learned!