Predicting Probabilities

import pandas as pd
from sklearn.model_selection import train_test_split

cities_df = pd.read_csv("data/canada_usa_cities.csv")
train_df, test_df = train_test_split(cities_df, test_size=0.2, random_state=123)
X_train, y_train = train_df.drop(columns=["country"]), train_df["country"]
X_test, y_test = test_df.drop(columns=["country"]), test_df["country"]

train_df.head()
     longitude  latitude country
160   -76.4813   44.2307  Canada
127   -81.2496   42.9837  Canada
169   -66.0580   45.2788  Canada
188   -73.2533   45.3057  Canada
187   -67.9245   47.1652  Canada


from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(X_train, y_train);


lr.predict(X_test[:1])
array(['Canada'], dtype=object)


lr.predict_proba(X_test[:1])
array([[0.87849316, 0.12150684]])
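
To read this output it helps to know which column belongs to which class. The sketch below is only an illustration on made-up data: `X_toy`, `y_toy`, and `toy_lr` are hypothetical stand-ins, not the cities dataset. It assumes scikit-learn's default settings and checks three things: the columns of `predict_proba` follow the order of `classes_`, each row sums to 1, and `predict` returns the class with the larger probability.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data standing in for (longitude, latitude) features
rng = np.random.default_rng(0)
X_toy = np.vstack([
    rng.normal(loc=[-80, 48], scale=3, size=(50, 2)),  # "Canada"-like points
    rng.normal(loc=[-90, 36], scale=3, size=(50, 2)),  # "USA"-like points
])
y_toy = np.array(["Canada"] * 50 + ["USA"] * 50)

toy_lr = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

# One row per example, one column per class
proba = toy_lr.predict_proba(X_toy[:1])

# classes_ gives the column order of predict_proba
print(toy_lr.classes_)

# The probabilities in each row sum to 1
print(proba.sum(axis=1))

# Picking the column with the largest probability reproduces predict
manual_pred = toy_lr.classes_[np.argmax(proba, axis=1)]
print(manual_pred, toy_lr.predict(X_toy[:1]))
```

On the cities model, the same check would be `lr.classes_`, which tells you whether the first number in the array above is the probability of Canada or of USA.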

How is this being done?

For linear regression we used something like this:

predicted(value) = coefficient_feature1 × feature1 + coefficient_feature2 × feature2 + … + intercept

But this won’t work for probabilities: a weighted sum can be any real number, whereas a probability must lie between 0 and 1.

Sigmoid function (optional)
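
Logistic regression passes the weighted sum of the features through the sigmoid function, 1 / (1 + e^(-z)), which maps any real number into the range 0 to 1. The following is a minimal sketch on made-up data (`X_toy`, `y_toy`, `toy_lr`, and the helper `sigmoid` are hypothetical names, not part of the cities example). It assumes scikit-learn's default binary setup, where the sigmoid of the raw score gives the probability of the second class in `classes_`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1 / (1 + np.exp(-z))

# Hypothetical toy data standing in for (longitude, latitude) features
rng = np.random.default_rng(0)
X_toy = np.vstack([
    rng.normal(loc=[-80, 48], scale=3, size=(50, 2)),  # "Canada"-like points
    rng.normal(loc=[-90, 36], scale=3, size=(50, 2)),  # "USA"-like points
])
y_toy = np.array(["Canada"] * 50 + ["USA"] * 50)

toy_lr = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

# Same weighted sum as in linear regression: coefficients x features + intercept
raw_score = X_toy @ toy_lr.coef_.ravel() + toy_lr.intercept_[0]

# The sigmoid turns the raw score into the probability of classes_[1] ("USA")
manual_proba_usa = sigmoid(raw_score)

# Compare with what scikit-learn reports for the second column
sklearn_proba_usa = toy_lr.predict_proba(X_toy)[:, 1]
print(np.allclose(manual_proba_usa, sklearn_proba_usa))
```

A raw score of 0 maps to a probability of exactly 0.5, which is why examples near the decision boundary get probabilities close to 50/50.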

predict_y = lr.predict(X_train)
predict_y[-5:]
array(['Canada', 'Canada', 'USA', 'Canada', 'Canada'], dtype=object)


y_proba = lr.predict_proba(X_train)
y_proba[-5:]
array([[0.69849181, 0.30150819],
       [0.76971285, 0.23028715],
       [0.05301371, 0.94698629],
       [0.63295092, 0.36704908],
       [0.81540984, 0.18459016]])
data_dict = {"y":y_train, 
             "pred y": predict_y.tolist(),
             "probabilities": y_proba.tolist()}
pd.DataFrame(data_dict).tail(10)
          y  pred y                               probabilities
96   Canada  Canada   [0.7047665661828637, 0.29523343381713635]
57      USA     USA  [0.031212358424745568, 0.9687876415752544]
123  Canada  Canada    [0.653686673894087, 0.34631332610591303]
...     ...     ...                                         ...
66      USA     USA  [0.053013708112636726, 0.9469862918873633]
126  Canada  Canada   [0.6329509233541488, 0.36704907664585124]
109  Canada  Canada   [0.8154098423831041, 0.18459015761689593]

10 rows × 3 columns


# Column 0 of y_proba holds the probability of the first class, here Canada
lr_targets = pd.DataFrame({"y": y_train,
                           "pred y": predict_y.tolist(),
                           "probability_canada": y_proba[:, 0].tolist()})
lr_targets.head(3)
          y  pred y  probability_canada
160  Canada  Canada            0.704614
127  Canada  Canada            0.563022
169  Canada  Canada            0.838976


lr_targets.sort_values(by='probability_canada')
          y  pred y  probability_canada
37      USA     USA            0.006547
78      USA     USA            0.007685
34      USA     USA            0.008317
...     ...     ...                 ...
0       USA  Canada            0.932481
165  Canada  Canada            0.951096
1       USA  Canada            0.961898

167 rows × 3 columns

X_train.loc[[1,37]]
    longitude  latitude
1   -134.4197   58.3019
37   -98.4951   29.4246


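# Absolute gap between the two class probabilities:
# values near 0 mean the model is least certain about the prediction
lr_targets = pd.DataFrame({"y": y_train,
                           "pred y": predict_y.tolist(),
                           "prob_difference": abs(y_proba[:, 0] - y_proba[:, 1]).tolist()})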
lr_targets.sort_values(by="prob_difference").head()
          y  pred y  prob_difference
61      USA     USA         0.001715
13      USA     USA         0.020016
54      USA     USA         0.020016
130  Canada     USA         0.022225
92   Canada     USA         0.022225

X_train.loc[[61, 54]]
    longitude  latitude
61   -87.9225   43.0350
54   -83.0466   42.3316

Let’s apply what we learned!