Finding the Nearest Neighbour

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

cities_df = pd.read_csv("data/canada_usa_cities.csv")
train_df, test_df = train_test_split(cities_df, test_size=0.2, random_state=123)
train_df.head(3)
     longitude  latitude country
160   -76.4813   44.2307  Canada
127   -81.2496   42.9837  Canada
169   -66.0580   45.2788  Canada


from sklearn.metrics.pairwise import euclidean_distances
dists = euclidean_distances(train_df[["longitude", "latitude"]])
dists
array([[ 0.        ,  4.92866046, ...,  3.13968038,  9.58476504],
       [ 4.92866046,  0.        , ...,  1.80868018, 14.45684087],
       ...,
       [ 3.13968038,  1.80868018, ...,  0.        , 12.70774745],
       [ 9.58476504, 14.45684087, ..., 12.70774745,  0.        ]], shape=(167, 167))


dists.shape
(167, 167)

pd.DataFrame(dists).loc[:5, :5]
           0          1          2          3          4          5
0   0.000000   4.928660  10.475863   3.402295   9.046000  44.329135
1   4.928660   0.000000  15.363990   8.326614  13.965788  39.839439
2  10.475863  15.363990   0.000000   7.195350   2.653738  54.549042
3   3.402295   8.326614   7.195350   0.000000   5.643921  47.391337
4   9.046000  13.965788   2.653738   5.643921   0.000000  52.532333
5  44.329135  39.839439  54.549042  47.391337  52.532333   0.000000


# Each city is at distance 0 from itself; set the diagonal to inf
# so that argmin never picks a city as its own nearest neighbour.
np.fill_diagonal(dists, np.inf)
pd.DataFrame(dists).loc[:5, :5]
           0          1          2          3          4          5
0        inf   4.928660  10.475863   3.402295   9.046000  44.329135
1   4.928660        inf  15.363990   8.326614  13.965788  39.839439
2  10.475863  15.363990        inf   7.195350   2.653738  54.549042
3   3.402295   8.326614   7.195350        inf   5.643921  47.391337
4   9.046000  13.965788   2.653738   5.643921        inf  52.532333
5  44.329135  39.839439  54.549042  47.391337  52.532333        inf

Feature vector for city 0:

train_df.iloc[0]
longitude   -76.4813
latitude     44.2307
country       Canada
Name: 160, dtype: object


Distances from city 0 to the first 5 cities in the training set (the first entry is city 0 itself, hence inf):

dists[0][:5]
array([        inf,  4.92866046, 10.47586257,  3.40229467,  9.04600003])

train_df.iloc[[0]]
     longitude  latitude country
160   -76.4813   44.2307  Canada


Position of the city closest to city 0:

np.argmin(dists[0])
np.int64(157)


The closest city:

train_df.iloc[[157]]
    longitude  latitude country
96   -76.3019    44.211  Canada


Its distance from city 0:

dists[0][157]
np.float64(0.18047839205805613)
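
The steps above (pairwise distances, masking the diagonal, then argmin) can be sketched on a small made-up dataset. The four points below are hypothetical, not rows from the cities data, and the distance matrix is built with plain NumPy broadcasting rather than euclidean_distances, purely so the arithmetic is visible.

```python
import numpy as np

# Hypothetical toy data: four points in (longitude, latitude) order.
points = np.array([
    [0.0, 0.0],
    [3.0, 4.0],
    [0.0, 1.0],
    [10.0, 10.0],
])

# Pairwise Euclidean distances via broadcasting; result has shape (4, 4).
diffs = points[:, None, :] - points[None, :, :]
toy_dists = np.sqrt((diffs ** 2).sum(axis=2))

# Mask the zero self-distances so a point is never its own neighbour.
np.fill_diagonal(toy_dists, np.inf)

# Nearest neighbour of every point at once: argmin along each row.
nearest = toy_dists.argmin(axis=1)
nearest_dist = toy_dists.min(axis=1)

print(nearest)        # [2 2 0 1]
print(nearest_dist)
```

Using axis=1 finds the neighbour of every row in one call, whereas np.argmin(dists[0]) above looks at a single city.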

Finding the distances to a query point

query_point = [[-80, 25]]

dists = euclidean_distances(train_df[["longitude", "latitude"]], query_point)
dists[0:5]
array([[19.54996348],
       [18.02706204],
       [24.60912622],
       [21.39718237],
       [25.24111312]])


We can find the city closest to the query point (-80, 25) using:

np.argmin(dists)
np.int64(147)


The distance between the query point and the closest city is:

dists[np.argmin(dists)].item()
3.8383922936564634

scikit-learn's NearestNeighbors gives the same distance and index:

from sklearn.neighbors import NearestNeighbors
nn = NearestNeighbors(n_neighbors=1)
nn.fit(train_df[['longitude', 'latitude']]);
nn.kneighbors([[-80, 25]])
(array([[3.83839229]]), array([[147]]))
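
As a small check of that agreement, the sketch below runs both routes on a hypothetical three-point training set (the coordinates are invented for illustration) and compares the results. kneighbors returns a (distances, indices) pair, each a 2D array with one row per query point.

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.neighbors import NearestNeighbors

# Hypothetical training points and a single query, in (longitude, latitude) order.
toy_train = np.array([[-75.0, 45.0], [-80.0, 26.0], [-120.0, 50.0]])
toy_query = np.array([[-80.0, 25.0]])

# Route 1: explicit distances to the query, then argmin.
manual_dists = euclidean_distances(toy_train, toy_query)  # shape (3, 1)
manual_idx = int(np.argmin(manual_dists))

# Route 2: NearestNeighbors with a single neighbour.
toy_nn = NearestNeighbors(n_neighbors=1).fit(toy_train)
nn_dists, nn_idx = toy_nn.kneighbors(toy_query)           # each has shape (1, 1)

print(manual_idx, manual_dists[manual_idx].item())
print(nn_idx[0, 0], nn_dists[0, 0])
```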

The same steps work with more than two features. Here we use the Pokémon dataset:

pokemon_df = pd.read_csv("data/pokemon.csv")
X = pokemon_df.drop(columns = ['deck_no', 'name','total_bs', 'type', 'legendary'])
y = pokemon_df[['legendary']]
X_train, X_test, y_train,  y_test = train_test_split(X, y, test_size=0.2, random_state=123)

X_train.head()
     attack  defense  sp_attack  sp_defense  speed  capture_rt  gen
362      40       50         55          50     25         255    3
132      55       50         45          65     55          45    1
704      75       53         83         113     60          45    6
9        30       35         20          20     45         255    1
687      52       67         39          56     50         120    6

dists = euclidean_distances(X_train[:3])
dists
array([[  0.        , 213.4338305 , 226.54138695],
       [213.4338305 ,   0.        ,  64.86139067],
       [226.54138695,  64.86139067,   0.        ]])


dists[0,2]
np.float64(226.54138694728607)
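
To see where these numbers come from, the sketch below recomputes the distance between the first two training rows by hand, with the feature values copied from the X_train.head() output above. The variable names are illustrative only.

```python
import numpy as np

# First two rows of X_train as shown above:
# attack, defense, sp_attack, sp_defense, speed, capture_rt, gen
row_362 = np.array([40, 50, 55, 50, 25, 255, 3], dtype=float)
row_132 = np.array([55, 50, 45, 65, 55, 45, 1], dtype=float)

# Euclidean distance: square root of the sum of squared differences.
sq_diffs = (row_362 - row_132) ** 2
manual_dist = np.sqrt(sq_diffs.sum())

# Share of the squared distance contributed by each feature.
contribution = sq_diffs / sq_diffs.sum()

print(manual_dist)       # about 213.43, matching dists[0, 1]
print(contribution[5])   # share due to capture_rt
```

In this particular pair, the capture_rt difference (255 versus 45) accounts for most of the squared distance, because that feature spans a much larger numeric range than the others.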


nn = NearestNeighbors(n_neighbors=1)
nn.fit(X_train);
nn.kneighbors(X_test.iloc[[1]])
(array([[15.5241747]]), array([[143]]))


X_test.to_numpy().shape
(161, 7)

kneighbors expects a 2D array, so passing a single row as a 1D Series raises an error:

nn = NearestNeighbors(n_neighbors=5)
nn.fit(X_train);
nn.kneighbors(X_test.iloc[1])
ValueError: Expected 2D array, got 1D array instead:
array=[605  55  55  85  55  30 255 335   5].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.


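Indexing with a scalar returns a 1D Series, while indexing with a list keeps the row as a 2D DataFrame:

X_test.iloc[1].shape
(7,)

X_test.iloc[[1]].shape
(1, 7)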
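
The shape difference can be reproduced on a tiny hypothetical frame (the column names and values here are made up for illustration). It also shows the reshape(1, -1) route that the error message suggests for a single sample.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame with three features.
toy_df = pd.DataFrame({"attack": [40, 55], "defense": [50, 50], "speed": [25, 55]})

as_series = toy_df.iloc[1]     # scalar index -> 1D Series
as_frame = toy_df.iloc[[1]]    # list index   -> 2D DataFrame with one row

# Alternative: turn the 1D values into a single-sample 2D array.
reshaped = as_series.to_numpy().reshape(1, -1)

print(as_series.shape)   # (3,)
print(as_frame.shape)    # (1, 3)
print(reshaped.shape)    # (1, 3)
```

Both 2D forms hold the same values; the DataFrame form additionally keeps the column names.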


nn = NearestNeighbors(n_neighbors=5)
nn.fit(X_train);
nn.kneighbors(X_test.iloc[[1]])
(array([[15.5241747 , 25.90366769, 27.91057147, 33.3166625 , 34.69870315]]),
 array([[143, 364, 515, 638,   0]]))
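
With n_neighbors=5, the returned distances are sorted from closest to farthest, and the indices line up with them position by position. A minimal sketch on randomly generated data (hypothetical, not the Pokémon features) compares kneighbors against sorting all distances by hand:

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.neighbors import NearestNeighbors

# Hypothetical data: 20 training points and one query, 3 features each.
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(20, 3))
toy_q = rng.normal(size=(1, 3))
k = 5

# Manual route: all distances to the query, then the k smallest.
all_d = euclidean_distances(toy_q, toy_X).ravel()
order = np.argsort(all_d)[:k]

# NearestNeighbors route.
toy_nn = NearestNeighbors(n_neighbors=k).fit(toy_X)
k_dists, k_idx = toy_nn.kneighbors(toy_q)   # each has shape (1, k)

print(order)
print(k_idx[0])
```

The two index lists should agree as long as no two training points are exactly equidistant from the query.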

Let’s apply what we learned!