{ "cells": [ { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "# Lecture 4: Class demo" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Imports, Announcements, LOs" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "### Imports" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "# import the libraries\n", "import os\n", "import sys\n", "sys.path.append(os.path.join(os.path.abspath(\"../\"), \"code\"))\n", "from plotting_functions import *\n", "from utils import *\n", "\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "\n", "%matplotlib inline\n", "\n", "pd.set_option(\"display.max_colwidth\", 200)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Do you recall [the restaurants survey](https://ubc.ca1.qualtrics.com/jfe/form/SV_73VuZiuwM1eDVrw) you completed at the start of the course?\n", "\n", "Let's use that data for this demo. You'll find a [wrangled version](https://github.ubc.ca/MDS-2023-24/DSCI_571_sup-learn-1_students/blob/master/lectures/data/cleaned_restaurant_data.csv) in the course repository." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('../data/cleaned_restaurant_data.csv')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " north_america eat_out_freq age n_people price food_type \\\n", "0 Yes 3.0 29 10.0 120.0 Italian \n", "1 Yes 2.0 23 3.0 20.0 Canadian/American \n", "2 Yes 2.0 21 20.0 15.0 Chinese \n", "3 No 2.0 24 14.0 18.0 Other \n", "4 Yes 5.0 23 30.0 20.0 Chinese \n", ".. ... ... ... ... ... ... \n", "959 No 10.0 22 NaN NaN NaN \n", "960 Yes 1.0 20 NaN NaN NaN \n", "961 No 1.0 22 40.0 50.0 Chinese \n", "962 Yes 3.0 21 NaN NaN NaN \n", "963 Yes 3.0 27 20.0 22.0 Other \n", "\n", " noise_level good_server \\\n", "0 medium Yes \n", "1 no music No \n", "2 medium Yes \n", "3 medium No \n", "4 medium Yes \n", ".. ... ... \n", "959 NaN NaN \n", "960 NaN NaN \n", "961 medium Yes \n", "962 NaN NaN \n", "963 medium Yes \n", "\n", " comments \\\n", "0 Ambience \n", "1 food tastes bad \n", "2 bad food \n", "3 Overall vibe on the restaurant \n", "4 A bad day \n", ".. ... \n", "959 NaN \n", "960 NaN \n", "961 The self service sauce table is very clean and the sauces were always filled up. \n", "962 NaN \n", "963 Lots of meat that was very soft and tasty. Hearty and amazing broth. Good noodle thickness and consistency \n", "\n", " restaurant_name target \n", "0 NaN dislike \n", "1 NaN dislike \n", "2 NaN dislike \n", "3 NaN dislike \n", "4 NaN dislike \n", ".. ... ... \n", "959 NaN like \n", "960 NaN like \n", "961 Haidilao like \n", "962 NaN like \n", "963 Uno Beef Noodle like \n", "\n", "[964 rows x 11 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " eat_out_freq age n_people price\n", "count 964.000000 964.000000 6.960000e+02 696.000000\n", "mean 2.585187 23.975104 1.439254e+04 1472.179152\n", "std 2.246486 4.556716 3.790481e+05 37903.575636\n", "min 0.000000 10.000000 -2.000000e+00 0.000000\n", "25% 1.000000 21.000000 1.000000e+01 18.000000\n", "50% 2.000000 22.000000 2.000000e+01 25.000000\n", "75% 3.000000 26.000000 3.000000e+01 40.000000\n", "max 15.000000 46.000000 1.000000e+07 1000000.000000" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Are there any unusual values in this data that you notice?\n", "Let's get rid of these outliers. " ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(942, 11)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "upperbound_price = 200\n", "lowerbound_people = 1\n", "df = df[~(df['price'] > 200)]\n", "restaurant_df = df[~(df['n_people'] < lowerbound_people)]\n", "restaurant_df.shape" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " eat_out_freq age n_people price\n", "count 942.000000 942.000000 674.000000 674.000000\n", "mean 2.598057 23.992569 24.973294 34.023279\n", "std 2.257787 4.582570 22.016660 29.018622\n", "min 0.000000 10.000000 1.000000 0.000000\n", "25% 1.000000 21.000000 10.000000 18.000000\n", "50% 2.000000 22.000000 20.000000 25.000000\n", "75% 3.000000 26.000000 30.000000 40.000000\n", "max 15.000000 46.000000 200.000000 200.000000" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "restaurant_df.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data splitting \n", "\n", "We aim to predict whether a restaurant is liked or disliked." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# Separate `X` and `y`. \n", "\n", "X = restaurant_df.drop(columns=['target'])\n", "y = restaurant_df['target']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below I'm perturbing this data just to demonstrate a few concepts. Don't do it in real life. " ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "X.at[459, 'food_type'] = 'Quebecois'\n", "X['price'] = X['price'] * 100" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# Split the data\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### EDA " ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "X_train.hist(bins=20, figsize=(12, 8));" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Do you see anything interesting in these plots? " ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "food_type\n", "Other 189\n", "Canadian/American 131\n", "Chinese 102\n", "Indian 36\n", "Italian 32\n", "Thai 20\n", "Fusion 18\n", "Mexican 17\n", "fusion 3\n", "Quebecois 1\n", "Name: count, dtype: int64" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train['food_type'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Error in data collection? Probably \"Fusion\" and \"fusion\" categories should be combined?" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "X_train['food_type'] = X_train['food_type'].replace(\"fusion\", \"Fusion\")\n", "X_test['food_type'] = X_test['food_type'].replace(\"fusion\", \"Fusion\")" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "food_type\n", "Other 189\n", "Canadian/American 131\n", "Chinese 102\n", "Indian 36\n", "Italian 32\n", "Fusion 21\n", "Thai 20\n", "Mexican 17\n", "Quebecois 1\n", "Name: count, dtype: int64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train['food_type'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again, usually we should spend lots of time in EDA, but let's stop here so that we have time to learn about transformers and pipelines. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dummy Classifier" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " fit_time score_time test_score train_score\n", "0 0.000778 0.000574 0.516556 0.514950\n", "1 0.000650 0.000469 0.516556 0.514950\n", "2 0.000593 0.000535 0.516556 0.514950\n", "3 0.000607 0.000395 0.513333 0.515755\n", "4 0.000512 0.000368 0.513333 0.515755" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.dummy import DummyClassifier\n", "\n", "dummy = DummyClassifier()\n", "scores = cross_validate(dummy, X_train, y_train, return_train_score=True)\n", "pd.DataFrame(scores)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have a relatively balanced distribution of both 'like' and 'dislike' classes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

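As a rough sketch of what the baseline score reflects: with its default settings, `DummyClassifier` predicts the most frequent training class, so its accuracy is roughly the share of that class. The labels below are made up for illustration and are not the survey data.

```python
from collections import Counter

# Hypothetical toy labels (not the survey data)
y_toy = ['like', 'dislike', 'like', 'like', 'dislike', 'like', 'dislike', 'like']

# 'Fit': find the most frequent class in the training labels
majority_class, majority_count = Counter(y_toy).most_common(1)[0]

# 'Predict': always return the majority class, regardless of the features
predictions = [majority_class] * len(y_toy)

# Accuracy equals the proportion of the majority class
accuracy = sum(p == t for p, t in zip(predictions, y_toy)) / len(y_toy)
print(majority_class, accuracy)
```

A score near 0.5 from this kind of baseline is what we would expect when the two classes are roughly balanced.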
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Let's try KNN on this data\n", "\n", "Do you think KNN would work directly on `X_train` and `y_train`?" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "# Preprocessing and pipeline\n", "from sklearn.neighbors import KNeighborsClassifier\n", "\n", "knn = KNeighborsClassifier()\n", "# knn.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We need to preprocess the data before passing it to ML models. What are the different types of features in the data? " ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " north_america eat_out_freq age n_people price food_type \\\n", "80 No 2.0 21 30.0 2200.0 Chinese \n", "934 Yes 4.0 21 30.0 3000.0 Canadian/American \n", "911 No 4.0 20 40.0 2500.0 Canadian/American \n", "459 Yes 5.0 21 NaN NaN Quebecois \n", "62 Yes 2.0 24 20.0 3000.0 Indian \n", "\n", " noise_level good_server \\\n", "80 high No \n", "934 low Yes \n", "911 medium Yes \n", "459 NaN NaN \n", "62 high Yes \n", "\n", " comments \\\n", "80 The environment was very not clean. The food tasted awful. \n", "934 The building and the room gave a very comfy feeling. Immediately after sitting down it felt like we were right at home. \n", "911 I was hungry \n", "459 NaN \n", "62 bad taste \n", "\n", " restaurant_name \n", "80 NaN \n", "934 NaN \n", "911 Chambar \n", "459 NaN \n", "62 east is east " ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- What all transformations we need to apply before training a machine learning model? \n", "- Can we group features based on what type of transformations we would like to apply?" 
] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['north_america', 'eat_out_freq', 'age', 'n_people', 'price',\n", " 'food_type', 'noise_level', 'good_server', 'comments',\n", " 'restaurant_name'],\n", " dtype='object')" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train.columns" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "good_server\n", "Yes 396\n", "No 148\n", "Name: count, dtype: int64" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train['good_server'].value_counts()" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "noise_level\n", "medium 232\n", "low 186\n", "high 75\n", "no music 37\n", "crazy loud 18\n", "Name: count, dtype: int64" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train['noise_level'].value_counts()" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "numeric_feats = ['age', 'n_people', 'price'] # Continuous and quantitative features\n", "categorical_feats = ['north_america', 'food_type'] # Discrete and qualitative features\n", "binary_feats = ['good_server'] # Categorical features with only two possible values \n", "ordinal_feats = ['noise_level'] # Some natural ordering in the categories \n", "noise_cats = ['no music', 'low', 'medium', 'high', 'crazy loud']\n", "drop_feats = ['comments', 'restaurant_name', 'eat_out_freq'] # Dropping text feats and `eat_out_freq` because it's not that useful" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's begin with numeric features. What if we just use numeric features to train a KNN model? Would it work? " ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "X_train_num = X_train[numeric_feats]\n", "X_test_num = X_test[numeric_feats]\n", "# knn.fit(X_train_num, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We need to deal with NaN values. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### sklearn's `SimpleImputer` " ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "# Impute numeric features using SimpleImputer\n", "from sklearn.impute import SimpleImputer\n", "\n", "imputer = SimpleImputer(strategy='median')\n", "imputer.fit(X_train_num)\n", "X_train_num_imp = imputer.transform(X_train_num)\n", "X_test_num_imp = imputer.transform(X_test_num)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "KNeighborsClassifier()" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "knn.fit(X_train_num_imp, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "No more errors. It worked! Let's try cross validation. " ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.6706507304116865" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "knn.score(X_train_num_imp, y_train)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.49206349206349204" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "knn.score(X_test_num_imp, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have slightly improved results in comparison to the dummy model. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Discussion questions \n", "\n", "- What's the difference between sklearn estimators and transformers? \n", "- Can you think of a better way to impute missing values? \n", "\n", "



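To make the fit/transform pattern behind the first question concrete, here is a minimal pure-Python sketch of median imputation. `ToyMedianImputer` is a hypothetical helper written for illustration, not sklearn's implementation, and the price values are made up. The point is that the median is learned from the training column only and then reused on the test column.

```python
import math
from statistics import median


class ToyMedianImputer:
    # Hypothetical stand-in for the idea behind SimpleImputer(strategy='median')
    def fit(self, column):
        # Learn the median from the non-missing training values only
        observed = [v for v in column if not math.isnan(v)]
        self.median_ = median(observed)
        return self

    def transform(self, column):
        # Replace missing values with the median learned in fit
        return [self.median_ if math.isnan(v) else v for v in column]


train_price = [18.0, 25.0, float('nan'), 40.0]
test_price = [float('nan'), 30.0]

imputer = ToyMedianImputer().fit(train_price)
train_imputed = imputer.transform(train_price)
# The test column is filled with the training median, never its own statistics
test_imputed = imputer.transform(test_price)
print(train_imputed, test_imputed)
```

A transformer learns something in `fit` and changes the data in `transform`, whereas an estimator such as `KNeighborsClassifier` learns in `fit` and produces labels in `predict`.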
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Do we need to scale the data? " ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " age n_people price\n", "80 21 30.0 2200.0\n", "934 21 30.0 3000.0\n", "911 20 40.0 2500.0\n", "459 21 NaN NaN\n", "62 24 20.0 3000.0\n", ".. ... ... ...\n", "106 27 10.0 1500.0\n", "333 24 12.0 800.0\n", "393 20 5.0 1500.0\n", "376 20 NaN NaN\n", "525 20 50.0 3000.0\n", "\n", "[753 rows x 3 columns]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train[numeric_feats]" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "# Scale the imputed data \n", "\n", "from sklearn.preprocessing import StandardScaler\n", "scaler = StandardScaler()\n", "scaler.fit(X_train_num_imp)\n", "X_train_num_imp_scaled = scaler.transform(X_train_num_imp)\n", "X_test_num_imp_scaled = scaler.transform(X_test_num_imp)" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "### What are some alternative methods for scaling?\n", "- [MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html): Transform each feature to a desired range\n", "- [RobustScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html): Scale features using median and quantiles. Robust to outliers. \n", "- [Normalizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html): Works on rows rather than columns. Normalize examples individually to unit norm.\n", "- [MaxAbsScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html): A scaler that scales each feature by its maximum absolute value.\n", " - What would happen when you apply `StandardScaler` to sparse data? \n", "- You can also apply custom scaling on columns using [`FunctionTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html). 
For example, when a column follows a power-law distribution (a handful of values account for many data points whereas most other values have few), log scaling is helpful. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- For now, let's focus on `StandardScaler`. Let's carry out cross-validation." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0.55629139, 0.49006623, 0.56953642, 0.54 , 0.53333333])" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cross_val_score(knn, X_train_num_imp_scaled, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case, we don't see a big difference with `StandardScaler`. But scaling is usually a good idea for distance-based models such as KNN. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- This worked, but are we doing anything wrong here? \n", "- What's the problem with calling `cross_val_score` with preprocessed data? \n", "- How would you do it properly?\n", "



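One way to see the issue, sketched with made-up numbers: the scaler above was fit on all of the training data, so inside cross-validation each validation fold has already influenced the mean and standard deviation used to scale it. The values and the fold split below are hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical price column; the last two rows play the role of a validation fold
prices = [1500.0, 2200.0, 2500.0, 3000.0, 800.0, 12000.0]
train_fold, valid_fold = prices[:4], prices[4:]

# Improper: statistics computed on every row, including the validation fold
leaky_mean, leaky_std = mean(prices), pstdev(prices)

# Proper: statistics computed on the training part of the split only
clean_mean, clean_std = mean(train_fold), pstdev(train_fold)

# The same validation rows get different scaled values under the two schemes
leaky_scaled = [(v - leaky_mean) / leaky_std for v in valid_fold]
clean_scaled = [(v - clean_mean) / clean_std for v in valid_fold]
print(leaky_mean, clean_mean)
print(leaky_scaled, clean_scaled)
```

In the improper scheme the validation rows are not truly unseen, so the cross-validation scores can be optimistic.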
" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plot_improper_processing(\"kNN\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Enter sklearn pipelines to do it properly. " ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "# Create a pipeline \n", "pipe_knn = make_pipeline(\n", " SimpleImputer(strategy=\"median\"),\n", " StandardScaler(), \n", " KNeighborsClassifier()\n", ") " ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5245916114790287" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cross_val_score(pipe_knn, X_train_num, y_train).mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- What all things are happening under the hood? \n", "- Why is this a better approach? " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", " \n", "[Source](https://amueller.github.io/COMS4995-s20/slides/aml-04-preprocessing/#18)" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plot_proper_processing(\"kNN\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "



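As a small self-contained sketch of what the pipeline does during cross-validation, the example below uses synthetic data (an assumption, not the survey data). For each fold, `cross_val_score` clones the pipeline, fits the imputer and scaler on that fold's training portion only, and then scores on the held-out portion.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic numeric data on very different scales (made up for illustration)
rng = np.random.default_rng(123)
X_toy = rng.normal(size=(100, 3)) * np.array([1.0, 10.0, 1000.0])
y_toy = np.where(X_toy[:, 0] > 0, 'like', 'dislike')

# Knock out roughly 10% of the entries to mimic missing survey answers
X_toy[rng.random(X_toy.shape) < 0.1] = np.nan

pipe = make_pipeline(
    SimpleImputer(strategy='median'),
    StandardScaler(),
    KNeighborsClassifier(),
)

# Preprocessing is refit inside every fold, so no validation rows leak into it
scores = cross_val_score(pipe, X_toy, y_toy, cv=5)
print(scores.mean())
```

Passing the raw `X_toy` (missing values included) is deliberate: the pipeline, not the caller, is responsible for all preprocessing.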
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Categorical features\n", "\n", "Let's assess the scores using categorical features." ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "food_type\n", "Other 189\n", "Canadian/American 131\n", "Chinese 102\n", "Indian 36\n", "Italian 32\n", "Fusion 21\n", "Thai 20\n", "Mexican 17\n", "Quebecois 1\n", "Name: count, dtype: int64" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train['food_type'].value_counts()" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " north_america food_type\n", "80 No Chinese\n", "934 Yes Canadian/American\n", "911 No Canadian/American\n", "459 Yes Quebecois\n", "62 Yes Indian\n", ".. ... ...\n", "106 No Chinese\n", "333 No Other\n", "393 Yes Canadian/American\n", "376 Yes NaN\n", "525 Don't want to share Chinese\n", "\n", "[753 rows x 2 columns]" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train[categorical_feats]" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "north_america\n", "Yes 415\n", "No 330\n", "Don't want to share 8\n", "Name: count, dtype: int64" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train['north_america'].value_counts()" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "food_type\n", "Other 189\n", "Canadian/American 131\n", "Chinese 102\n", "Indian 36\n", "Italian 32\n", "Fusion 21\n", "Thai 20\n", "Mexican 17\n", "Quebecois 1\n", "Name: count, dtype: int64" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train['food_type'].value_counts()" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "X_train_cat = X_train[categorical_feats]\n", "X_test_cat = X_test[categorical_feats]" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "# One-hot encoding of categorical features \n", "from sklearn.preprocessing import OneHotEncoder\n", "# Define and fit OneHotEncoder\n", "ohe = OneHotEncoder(sparse_output=False)\n", "ohe.fit(X_train_cat)\n", "X_train_cat_ohe = ohe.transform(X_train_cat) # transform the train set\n", "X_test_cat_ohe = ohe.transform(X_test_cat) # transform the test set" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0., 1., 0., 
..., 0., 0., 0.],\n", " [0., 0., 1., ..., 0., 0., 0.],\n", " [0., 1., 0., ..., 0., 0., 0.],\n", " ...,\n", " [0., 0., 1., ..., 0., 0., 0.],\n", " [0., 0., 1., ..., 0., 0., 1.],\n", " [1., 0., 0., ..., 0., 0., 0.]])" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train_cat_ohe" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- The matrix is mostly zeros, i.e., sparse in content. \n", "- Why? We passed `sparse_output=False` here, so we get a dense NumPy array. What would happen with the default `sparse_output=True`? Why might we want one or the other? " ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " north_america_Don't want to share north_america_No north_america_Yes \\\n", "0 0.0 1.0 0.0 \n", "1 0.0 0.0 1.0 \n", "2 0.0 1.0 0.0 \n", "3 0.0 0.0 1.0 \n", "4 0.0 0.0 1.0 \n", ".. ... ... ... \n", "748 0.0 1.0 0.0 \n", "749 0.0 1.0 0.0 \n", "750 0.0 0.0 1.0 \n", "751 0.0 0.0 1.0 \n", "752 1.0 0.0 0.0 \n", "\n", " food_type_Canadian/American food_type_Chinese food_type_Fusion \\\n", "0 0.0 1.0 0.0 \n", "1 1.0 0.0 0.0 \n", "2 1.0 0.0 0.0 \n", "3 0.0 0.0 0.0 \n", "4 0.0 0.0 0.0 \n", ".. ... ... ... \n", "748 0.0 1.0 0.0 \n", "749 0.0 0.0 0.0 \n", "750 1.0 0.0 0.0 \n", "751 0.0 0.0 0.0 \n", "752 0.0 1.0 0.0 \n", "\n", " food_type_Indian food_type_Italian food_type_Mexican food_type_Other \\\n", "0 0.0 0.0 0.0 0.0 \n", "1 0.0 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 0.0 \n", "3 0.0 0.0 0.0 0.0 \n", "4 1.0 0.0 0.0 0.0 \n", ".. ... ... ... ... \n", "748 0.0 0.0 0.0 0.0 \n", "749 0.0 0.0 0.0 1.0 \n", "750 0.0 0.0 0.0 0.0 \n", "751 0.0 0.0 0.0 0.0 \n", "752 0.0 0.0 0.0 0.0 \n", "\n", " food_type_Quebecois food_type_Thai food_type_nan \n", "0 0.0 0.0 0.0 \n", "1 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 \n", "3 1.0 0.0 0.0 \n", "4 0.0 0.0 0.0 \n", ".. ... ... ... \n", "748 0.0 0.0 0.0 \n", "749 0.0 0.0 0.0 \n", "750 0.0 0.0 0.0 \n", "751 0.0 0.0 1.0 \n", "752 0.0 0.0 0.0 \n", "\n", "[753 rows x 13 columns]" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Get the OHE feature names \n", "ohe_feats = ohe.get_feature_names_out().tolist()\n", "ohe_feats\n", "pd.DataFrame(X_train_cat_ohe, columns = ohe_feats)" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0.53642384, 0.53642384, 0.50993377, 0.51333333, 0.47333333])" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cross_val_score(knn, X_train_cat_ohe, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- What's wrong here? 
\n", "- How can we fix this? " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Are we breaking the golden rule here? Let's do this properly with a pipeline. " ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "# Code to create a pipeline for OHE and KNN\n", "pipe_ohe_knn = make_pipeline(\n", "    OneHotEncoder(sparse_output=False, handle_unknown=\"ignore\"),\n", "    KNeighborsClassifier()\n", ")" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0.53642384, 0.53642384, 0.50993377, 0.51333333, 0.47333333])" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cross_val_score(pipe_ohe_knn, X_train_cat, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Ordinal features\n", "\n", "Let's now look at ordinal features such as `noise_level`, whose categories have a natural order." ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "noise_level\n", "medium 232\n", "low 186\n", "high 75\n", "no music 37\n", "crazy loud 18\n", "Name: count, dtype: int64" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train['noise_level'].value_counts()" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import OrdinalEncoder\n", "noise_ordering = ['no music', 'low', 'medium', 'high', 'crazy loud']\n", "\n", "ordinal_transformer = make_pipeline(SimpleImputer(strategy=\"most_frequent\"), \n", "                                    OrdinalEncoder(categories=[noise_ordering]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "


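Why pass an explicit ordering to `OrdinalEncoder`? The sketch below uses a small hypothetical toy column (not the survey data) to compare the integer codes produced with `noise_ordering` against the codes produced by the default behaviour, which sorts the categories alphabetically. The names `toy_noise`, `explicit_codes`, and `default_codes` are made up for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

noise_ordering = ['no music', 'low', 'medium', 'high', 'crazy loud']

# Hypothetical toy column, not the survey data
toy_noise = pd.DataFrame(
    {'noise_level': ['low', 'crazy loud', 'no music', 'high', 'medium']}
)

# Explicit ordering: each code is the position of the value in noise_ordering
explicit_enc = OrdinalEncoder(categories=[noise_ordering])
explicit_codes = explicit_enc.fit_transform(toy_noise).ravel().tolist()
print(explicit_codes)  # [1.0, 4.0, 0.0, 3.0, 2.0]

# Default: categories are sorted alphabetically, so the codes no longer
# follow the loudness order ('crazy loud' becomes 0, 'no music' becomes 4)
default_enc = OrdinalEncoder()
default_codes = default_enc.fit_transform(toy_noise).ravel().tolist()
print(default_codes)  # [2.0, 0.0, 4.0, 1.0, 3.0]
```

With the explicit ordering, a larger code always means a louder restaurant, which is the property distance-based models such as k-NN or SVC with an RBF kernel can make use of.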
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Right now we are working with numeric and categorical features separately. But ideally, when we create a model, we need to use all these features together. \n", "\n", "**Enter column transformer!**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How can we horizontally stack \n", "- preprocessed numeric features, \n", "- preprocessed binary features, \n", "- preprocessed ordinal features, and \n", "- preprocessed categorical features?\n", "\n", "Let's define a column transformer. " ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [], "source": [ "from sklearn.compose import make_column_transformer\n", "\n", "numeric_transformer = make_pipeline(SimpleImputer(strategy=\"median\"),\n", "                                    StandardScaler()) \n", "binary_transformer = make_pipeline(SimpleImputer(strategy=\"most_frequent\"), \n", "                                   OneHotEncoder(drop=\"if_binary\"))\n", "ordinal_transformer = make_pipeline(SimpleImputer(strategy=\"most_frequent\"), \n", "                                    OrdinalEncoder(categories=[noise_ordering]))\n", "categorical_transformer = make_pipeline(SimpleImputer(strategy=\"most_frequent\"), \n", "                                        OneHotEncoder(sparse_output=False, handle_unknown=\"ignore\"))\n", "\n", "preprocessor = make_column_transformer(\n", "    (numeric_transformer, numeric_feats), \n", "    (binary_transformer, binary_feats), \n", "    (ordinal_transformer, ordinal_feats),\n", "    (categorical_transformer, categorical_feats),\n", "    (\"drop\", drop_feats)\n", ")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What does the transformed data look like? 
" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['north_america', 'food_type']" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "categorical_feats" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[\"north_america_Don't want to share\",\n", " 'north_america_No',\n", " 'north_america_Yes',\n", " 'food_type_Canadian/American',\n", " 'food_type_Chinese',\n", " 'food_type_Fusion',\n", " 'food_type_Indian',\n", " 'food_type_Italian',\n", " 'food_type_Mexican',\n", " 'food_type_Other',\n", " 'food_type_Quebecois',\n", " 'food_type_Thai']" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ohe_feat_names" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(753, 17)" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "transformed = preprocessor.fit_transform(X_train)\n", "transformed.shape" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
ColumnTransformer(transformers=[('pipeline-1',\n",
       "                                 Pipeline(steps=[('simpleimputer',\n",
       "                                                  SimpleImputer(strategy='median')),\n",
       "                                                 ('standardscaler',\n",
       "                                                  StandardScaler())]),\n",
       "                                 ['age', 'n_people', 'price']),\n",
       "                                ('pipeline-2',\n",
       "                                 Pipeline(steps=[('simpleimputer',\n",
       "                                                  SimpleImputer(strategy='most_frequent')),\n",
       "                                                 ('onehotencoder',\n",
       "                                                  OneHotEncoder(drop='if_binary'))]),\n",
       "                                 ['good_server']),\n",
       "                                ('pipeline-3',...\n",
       "                                                  OrdinalEncoder(categories=[['no '\n",
       "                                                                              'music',\n",
       "                                                                              'low',\n",
       "                                                                              'medium',\n",
       "                                                                              'high',\n",
       "                                                                              'crazy '\n",
       "                                                                              'loud']]))]),\n",
       "                                 ['noise_level']),\n",
       "                                ('pipeline-4',\n",
       "                                 Pipeline(steps=[('simpleimputer',\n",
       "                                                  SimpleImputer(strategy='most_frequent')),\n",
       "                                                 ('onehotencoder',\n",
       "                                                  OneHotEncoder(handle_unknown='ignore',\n",
       "                                                                sparse_output=False))]),\n",
       "                                 ['north_america', 'food_type']),\n",
       "                                ('drop', 'drop',\n",
       "                                 ['comments', 'restaurant_name',\n",
       "                                  'eat_out_freq'])])
" ], "text/plain": [ "ColumnTransformer(transformers=[('pipeline-1',\n", " Pipeline(steps=[('simpleimputer',\n", " SimpleImputer(strategy='median')),\n", " ('standardscaler',\n", " StandardScaler())]),\n", " ['age', 'n_people', 'price']),\n", " ('pipeline-2',\n", " Pipeline(steps=[('simpleimputer',\n", " SimpleImputer(strategy='most_frequent')),\n", " ('onehotencoder',\n", " OneHotEncoder(drop='if_binary'))]),\n", " ['good_server']),\n", " ('pipeline-3',...\n", " OrdinalEncoder(categories=[['no '\n", " 'music',\n", " 'low',\n", " 'medium',\n", " 'high',\n", " 'crazy '\n", " 'loud']]))]),\n", " ['noise_level']),\n", " ('pipeline-4',\n", " Pipeline(steps=[('simpleimputer',\n", " SimpleImputer(strategy='most_frequent')),\n", " ('onehotencoder',\n", " OneHotEncoder(handle_unknown='ignore',\n", " sparse_output=False))]),\n", " ['north_america', 'food_type']),\n", " ('drop', 'drop',\n", " ['comments', 'restaurant_name',\n", " 'eat_out_freq'])])" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "preprocessor" ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[\"north_america_Don't want to share\",\n", " 'north_america_No',\n", " 'north_america_Yes',\n", " 'food_type_Canadian/American',\n", " 'food_type_Chinese',\n", " 'food_type_Fusion',\n", " 'food_type_Indian',\n", " 'food_type_Italian',\n", " 'food_type_Mexican',\n", " 'food_type_Other',\n", " 'food_type_Quebecois',\n", " 'food_type_Thai']" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Getting feature names from a column transformer\n", "ohe_feat_names = preprocessor.named_transformers_['pipeline-4']['onehotencoder'].get_feature_names_out(categorical_feats).tolist()\n", "ohe_feat_names" ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['age', 'n_people', 'price']" ] }, "execution_count": 78, "metadata": {}, 
"output_type": "execute_result" } ], "source": [ "numeric_feats" ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [], "source": [ "feat_names = numeric_feats + binary_feats + ordinal_feats + ohe_feat_names" ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[-0.66941678, 0.31029469, -0.36840629, ..., 0. ,\n", " 0. , 0. ],\n", " [-0.66941678, 0.31029469, -0.05422496, ..., 0. ,\n", " 0. , 0. ],\n", " [-0.89515383, 0.82336432, -0.25058829, ..., 0. ,\n", " 0. , 0. ],\n", " ...,\n", " [-0.89515383, -0.97237936, -0.64331495, ..., 0. ,\n", " 0. , 0. ],\n", " [-0.89515383, -0.20277493, -0.25058829, ..., 1. ,\n", " 0. , 0. ],\n", " [-0.89515383, 1.33643394, -0.05422496, ..., 0. ,\n", " 0. , 0. ]])" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "transformed" ] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n",
"<table>\n",
"<thead>\n",
"<tr><th></th><th>age</th><th>n_people</th><th>price</th><th>good_server</th><th>noise_level</th><th>north_america_Don't want to share</th><th>north_america_No</th><th>north_america_Yes</th><th>food_type_Canadian/American</th><th>food_type_Chinese</th><th>food_type_Fusion</th><th>food_type_Indian</th><th>food_type_Italian</th><th>food_type_Mexican</th><th>food_type_Other</th><th>food_type_Quebecois</th><th>food_type_Thai</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>-0.669417</td><td>0.310295</td><td>-0.368406</td><td>0.0</td><td>3.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>\n",
"<tr><th>1</th><td>-0.669417</td><td>0.310295</td><td>-0.054225</td><td>1.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>\n",
"<tr><th>2</th><td>-0.895154</td><td>0.823364</td><td>-0.250588</td><td>1.0</td><td>2.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>\n",
"<tr><th>3</th><td>-0.669417</td><td>-0.202775</td><td>-0.250588</td><td>1.0</td><td>2.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td></tr>\n",
"<tr><th>4</th><td>0.007794</td><td>-0.202775</td><td>-0.054225</td><td>1.0</td><td>3.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>\n",
"<tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"<tr><th>748</th><td>0.685006</td><td>-0.715845</td><td>-0.643315</td><td>1.0</td><td>2.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>\n",
"<tr><th>749</th><td>0.007794</td><td>-0.613231</td><td>-0.918224</td><td>1.0</td><td>2.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td></tr>\n",
"<tr><th>750</th><td>-0.895154</td><td>-0.972379</td><td>-0.643315</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>\n",
"<tr><th>751</th><td>-0.895154</td><td>-0.202775</td><td>-0.250588</td><td>1.0</td><td>2.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td></tr>\n",
"<tr><th>752</th><td>-0.895154</td><td>1.336434</td><td>-0.054225</td><td>1.0</td><td>3.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>753 rows × 17 columns</p>
" ], "text/plain": [ " age n_people price good_server noise_level \\\n", "0 -0.669417 0.310295 -0.368406 0.0 3.0 \n", "1 -0.669417 0.310295 -0.054225 1.0 1.0 \n", "2 -0.895154 0.823364 -0.250588 1.0 2.0 \n", "3 -0.669417 -0.202775 -0.250588 1.0 2.0 \n", "4 0.007794 -0.202775 -0.054225 1.0 3.0 \n", ".. ... ... ... ... ... \n", "748 0.685006 -0.715845 -0.643315 1.0 2.0 \n", "749 0.007794 -0.613231 -0.918224 1.0 2.0 \n", "750 -0.895154 -0.972379 -0.643315 0.0 1.0 \n", "751 -0.895154 -0.202775 -0.250588 1.0 2.0 \n", "752 -0.895154 1.336434 -0.054225 1.0 3.0 \n", "\n", " north_america_Don't want to share north_america_No north_america_Yes \\\n", "0 0.0 1.0 0.0 \n", "1 0.0 0.0 1.0 \n", "2 0.0 1.0 0.0 \n", "3 0.0 0.0 1.0 \n", "4 0.0 0.0 1.0 \n", ".. ... ... ... \n", "748 0.0 1.0 0.0 \n", "749 0.0 1.0 0.0 \n", "750 0.0 0.0 1.0 \n", "751 0.0 0.0 1.0 \n", "752 1.0 0.0 0.0 \n", "\n", " food_type_Canadian/American food_type_Chinese food_type_Fusion \\\n", "0 0.0 1.0 0.0 \n", "1 1.0 0.0 0.0 \n", "2 1.0 0.0 0.0 \n", "3 0.0 0.0 0.0 \n", "4 0.0 0.0 0.0 \n", ".. ... ... ... \n", "748 0.0 1.0 0.0 \n", "749 0.0 0.0 0.0 \n", "750 1.0 0.0 0.0 \n", "751 0.0 0.0 0.0 \n", "752 0.0 1.0 0.0 \n", "\n", " food_type_Indian food_type_Italian food_type_Mexican food_type_Other \\\n", "0 0.0 0.0 0.0 0.0 \n", "1 0.0 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 0.0 \n", "3 0.0 0.0 0.0 0.0 \n", "4 1.0 0.0 0.0 0.0 \n", ".. ... ... ... ... \n", "748 0.0 0.0 0.0 0.0 \n", "749 0.0 0.0 0.0 1.0 \n", "750 0.0 0.0 0.0 0.0 \n", "751 0.0 0.0 0.0 1.0 \n", "752 0.0 0.0 0.0 0.0 \n", "\n", " food_type_Quebecois food_type_Thai \n", "0 0.0 0.0 \n", "1 0.0 0.0 \n", "2 0.0 0.0 \n", "3 1.0 0.0 \n", "4 0.0 0.0 \n", ".. ... ... 
\n", "748 0.0 0.0 \n", "749 0.0 0.0 \n", "750 0.0 0.0 \n", "751 0.0 0.0 \n", "752 0.0 0.0 \n", "\n", "[753 rows x 17 columns]" ] }, "execution_count": 81, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(transformed, columns = feat_names)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have new columns for the categorical features. Let's create a pipeline with the preprocessor and SVC. " ] }, { "cell_type": "code", "execution_count": 82, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.686569536423841" ] }, "execution_count": 82, "metadata": {}, "output_type": "execute_result" } ], "source": [ "svc_all_pipe = make_pipeline(preprocessor, SVC())\n", "cross_val_score(svc_all_pipe, X_train, y_train).mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are getting better results: a mean cross-validation accuracy of about 0.69 with all features, compared with about 0.51 for k-NN on the one-hot encoded categorical features alone. \n", "

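To put a cross-validation score in context, it helps to compare it with a baseline that ignores the features. The following is a minimal sketch on hypothetical synthetic data (the names `toy_X`, `toy_y`, and `toy_preprocessor`, the columns, and the labels are all assumptions for this illustration, not the survey data). Because the preprocessor sits inside the pipeline, it is refit on the training portion of each fold, which keeps the golden rule intact.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# Hypothetical synthetic data standing in for X_train and y_train
rng = np.random.default_rng(0)
n = 60
toy_X = pd.DataFrame({
    'price': rng.normal(50, 15, size=n),
    'food_type': rng.choice(['Thai', 'Indian', 'Italian'], size=n),
})
# Made-up target that depends only on price
toy_y = np.where(toy_X['price'] > 50, 'dislike', 'like')

toy_preprocessor = make_column_transformer(
    (StandardScaler(), ['price']),
    (OneHotEncoder(handle_unknown='ignore'), ['food_type']),
)

# Baseline that always predicts the most frequent class
dummy_scores = cross_val_score(
    DummyClassifier(strategy='most_frequent'), toy_X, toy_y, cv=5
)
# Preprocessing and model in one pipeline, refit on every fold
svc_scores = cross_val_score(
    make_pipeline(toy_preprocessor, SVC()), toy_X, toy_y, cv=5
)
print(dummy_scores.mean(), svc_scores.mean())
```

The same comparison could be run on the real data by passing `X_train` and `y_train` together with `svc_all_pipe` instead of the toy objects.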

" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "571", "language": "python", "name": "571" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.0" } }, "nbformat": 4, "nbformat_minor": 4 }