k-Nearest Neighbours Regressor

Regression with k-nearest neighbours (k-NNs)
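
A k-NN regressor predicts, for a query point x, an aggregate of the targets of its k nearest training points; with uniform weights this is the plain average

$$\hat{y}(x) = \frac{1}{k} \sum_{i \in N_k(x)} y_i,$$

where $N_k(x)$ is the set of indices of the $k$ training points closest to $x$. We start with a small synthetic dataset.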

import numpy as np
import pandas as pd

np.random.seed(0)
n = 50
# Synthetic feature: 50 roughly evenly spaced lengths with a little noise
X_1 = np.linspace(0, 2, n) + np.random.randn(n) * 0.01
X = pd.DataFrame(X_1[:, None], columns=['length'])
X.head()
     length
0  0.017641
1  0.044818
2  0.091420
3  0.144858
4  0.181941


# Synthetic target: weight grows with length, plus positive noise
y = abs(np.random.randn(n, 1)) * 2 + X_1[:, None] * 5
y = pd.DataFrame(y, columns=['weight'])
y.head()
     weight
0  1.879136
1  0.997894
2  1.478710
3  3.085554
4  0.966069
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
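
With n = 50 rows and test_size=0.2, this holds out 10 rows for testing:

X_train.shape, X_test.shape  # ((40, 1), (10, 1))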


import altair as alt
source = pd.concat([X_train, y_train], axis=1)

scatter = alt.Chart(source, width=500, height=300).mark_point(filled=True, color='green').encode(
    alt.X('length:Q'),
    alt.Y('weight:Q'))

scatter
[Scatter plot of the training data: weight vs. length]
from sklearn.neighbors import KNeighborsRegressor

# k = 1: each prediction comes from the single nearest training point
knnr = KNeighborsRegressor(n_neighbors=1, weights="uniform")
knnr.fit(X_train, y_train);


predicted = knnr.predict(X_train)
predicted[:5]
array([[ 4.57636104],
       [13.20245224],
       [ 3.03671796],
       [10.74123618],
       [ 1.82820801]])
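
As a quick check (a sketch using the fitted model's kneighbors method): with n_neighbors=1 each training point's nearest neighbour is itself, so the prediction is simply that point's own target.

dist, idx = knnr.kneighbors(X_train[:5])
y_train.iloc[idx.ravel()].to_numpy()  # matches predicted[:5]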


knnr.score(X_train, y_train)
1.0

With n_neighbors=1, each training point is its own nearest neighbour, so the model reproduces the training targets exactly and the default regression score (R²) on the training set is a perfect 1.0.
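
A perfect training score like this is a sign of memorization rather than of a good model; a more honest check (a sketch, the exact value depends on the random split) is the held-out test score:

knnr.score(X_test, y_test)
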
knnr = KNeighborsRegressor(n_neighbors=10, weights="uniform")
knnr.fit(X_train, y_train);


knnr.score(X_train, y_train)
0.9254540554756747
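
The training score drops below 1.0 because each prediction is now averaged with neighbouring points instead of copied from the point itself. As a sketch of what uniform weighting computes, one prediction redone by hand is just the plain mean of the 10 nearest training targets:

dist, idx = knnr.kneighbors(X_train[:1])
y_train.iloc[idx.ravel()].to_numpy().mean()  # equals knnr.predict(X_train[:1])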


Using distance weighting

# weights="distance": closer neighbours get proportionally more weight
knnr = KNeighborsRegressor(n_neighbors=10, weights="distance")
knnr.fit(X_train, y_train);


knnr.score(X_train, y_train)
1.0

Each training point is at distance zero from itself, so under distance weighting its own target receives all the weight: every training prediction is exact, and the training score is back to 1.0. Again, this says nothing about performance on unseen data.
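
Under the hood, weights="distance" weights each neighbour's target by the inverse of its distance and normalizes. A sketch of one test prediction redone by hand (assuming the query point does not coincide exactly with a training point, which would give a zero distance):

dist, idx = knnr.kneighbors(X_test[:1])
w = 1.0 / dist.ravel()
(w * y_train.iloc[idx.ravel()].to_numpy().ravel()).sum() / w.sum()  # ≈ knnr.predict(X_test[:1])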

Pros and Cons of k-Nearest Neighbours


Pros:

  • Easy to understand and interpret.
  • A single hyperparameter, k (n_neighbors), controls the fundamental tradeoff.
  • Can learn very complex functions given enough data.
  • Lazy learning: fitting takes almost no time, since the model simply stores the training data.


Cons:

  • Prediction can be very slow, since each query naively requires comparing against every training point.
  • Test accuracy is often worse than that of more modern approaches.
  • You should scale your features; a minimal sketch follows this list, and we'll look at scaling properly in the next lecture.
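
As a preview (a minimal sketch; StandardScaler is one common choice, and scaling is covered properly next lecture), the scaler can be bundled with the model in a pipeline so it is applied consistently to training and test data:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=10))
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)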

Let's apply what we learned!