Logistic Regression

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

cities_df = pd.read_csv("data/canada_usa_cities.csv")
train_df, test_df = train_test_split(cities_df, test_size=0.2, random_state=123)
X_train, y_train = train_df.drop(columns=["country"]), train_df["country"]
X_test, y_test = test_df.drop(columns=["country"]), test_df["country"]

train_df.head()
longitude latitude country
160 -76.4813 44.2307 Canada
127 -81.2496 42.9837 Canada
169 -66.0580 45.2788 Canada
188 -73.2533 45.3057 Canada
187 -67.9245 47.1652 Canada

Setting the stage

from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate

dc = DummyClassifier(strategy="prior")

scores = pd.DataFrame(cross_validate(dc, X_train, y_train, return_train_score=True))
scores
fit_time score_time test_score train_score
0 0.000718 0.000722 0.588235 0.601504
1 0.000507 0.000569 0.588235 0.601504
2 0.000482 0.000557 0.606061 0.597015
3 0.000452 0.000569 0.606061 0.597015
4 0.000452 0.000566 0.606061 0.597015
from sklearn.linear_model import LogisticRegression


lr = LogisticRegression()
scores = pd.DataFrame(cross_validate(lr, X_train, y_train, return_train_score=True))
scores
fit_time score_time test_score train_score
0 0.006823 0.001266 0.852941 0.827068
1 0.004122 0.001170 0.823529 0.827068
2 0.003635 0.001100 0.696970 0.858209
3 0.003740 0.001091 0.787879 0.843284
4 0.003710 0.001083 0.939394 0.805970

Visualizing our model
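The original notes show a decision-boundary plot here. As a stand-in, below is a minimal sketch of how such a plot can be drawn: evaluate the fitted model on a dense grid and shade each region by its predicted class. The synthetic `X_demo`/`y_demo` arrays are assumptions standing in for the cities training data, chosen only so the example is self-contained.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs anywhere
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for (longitude, latitude) data: two well-separated blobs
rng = np.random.default_rng(123)
X_demo = np.vstack([rng.normal([-80, 45], 3, size=(50, 2)),
                    rng.normal([-95, 38], 3, size=(50, 2))])
y_demo = np.array(["Canada"] * 50 + ["USA"] * 50)

lr_demo = LogisticRegression().fit(X_demo, y_demo)

# Predict on a 200x200 grid covering the data, then draw filled contours
xx, yy = np.meshgrid(
    np.linspace(X_demo[:, 0].min() - 1, X_demo[:, 0].max() + 1, 200),
    np.linspace(X_demo[:, 1].min() - 1, X_demo[:, 1].max() + 1, 200),
)
Z = (lr_demo.predict(np.c_[xx.ravel(), yy.ravel()]) == "Canada").reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X_demo[:, 0], X_demo[:, 1], c=(y_demo == "Canada"))
plt.xlabel("longitude")
plt.ylabel("latitude")
```

Because logistic regression is a linear classifier, the boundary between the shaded regions is a straight line in feature space.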



Coefficients

lr = LogisticRegression()
lr.fit(X_train, y_train);


print("Model coefficients:", lr.coef_)
print("Model intercept:", lr.intercept_)
Model coefficients: [[-0.04108378 -0.33683087]]
Model intercept: [10.886759]


data = {'features': X_train.columns, 'coefficients': lr.coef_[0]}
pd.DataFrame(data)
features coefficients
0 longitude -0.041084
1 latitude -0.336831

Predictions

lr.classes_
array(['Canada', 'USA'], dtype=object)


example = X_test.iloc[0,:]
example.tolist()
[-64.8001, 46.098]


(example.tolist() * lr.coef_).sum(axis=1) + lr.intercept_  # raw score: w·x + b
array([-1.97823755])


lr.predict([example])
array(['Canada'], dtype=object)
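To see why the raw score of about -1.978 leads to a "Canada" prediction: passing the score through the sigmoid gives the probability of the positive class, which in scikit-learn is `lr.classes_[1]` ("USA" here). A quick check of the arithmetic:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

raw = -1.97823755             # raw score computed above
p_usa = sigmoid(raw)          # probability of the positive class, 'USA'
p_canada = 1 - p_usa
print(p_usa, p_canada)        # roughly 0.12 and 0.88, so 'Canada' wins
```

This matches what `lr.predict_proba([example])` would report for the two classes.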

Hyperparameter: C

C is the inverse of regularization strength: smaller values of C mean stronger regularization (simpler models, possible underfitting), while larger values mean weaker regularization (possible overfitting).

scores_dict = {
    "C": 10.0 ** np.arange(-6, 2, 1),
    "train_score": [],
    "cv_score": [],
}
for C in scores_dict['C']:
    lr_model = LogisticRegression(C=C)
    results = cross_validate(lr_model, X_train, y_train, return_train_score=True)
    scores_dict['train_score'].append(results["train_score"].mean())
    scores_dict['cv_score'].append(results["test_score"].mean())
pd.DataFrame(scores_dict)
C train_score cv_score
0 0.000001 0.598810 0.598930
1 0.000010 0.598810 0.598930
2 0.000100 0.664707 0.658645
... ... ... ...
5 0.100000 0.832320 0.820143
6 1.000000 0.832320 0.820143
7 10.000000 0.832320 0.820143

8 rows × 3 columns

import scipy
from sklearn.model_selection import RandomizedSearchCV

param_grid = {
    "C": scipy.stats.uniform(0, 100)}

lr = LogisticRegression()
grid_search = RandomizedSearchCV(lr, param_grid, cv=5, return_train_score=True, verbose=1, n_jobs=-1)
grid_search.fit(X_train, y_train);
Fitting 5 folds for each of 10 candidates, totalling 50 fits


grid_search.best_params_
{'C': np.float64(1.5229072268322374)}


grid_search.best_score_
np.float64(0.8201426024955436)
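One caveat with `scipy.stats.uniform(0, 100)`: since we usually search C over several orders of magnitude (as in the loop above), a log-uniform distribution often explores the range more evenly. A small sketch of this alternative, using SciPy's `loguniform`:

```python
from scipy.stats import loguniform

# Log-uniform prior over the same range as the manual sweep above
dist = loguniform(1e-6, 1e2)
samples = dist.rvs(size=1000, random_state=123)

# Samples stay inside [1e-6, 1e2] but are spread evenly on a log scale,
# so values like 1e-4 are as likely to be tried as values like 10
print(samples.min(), samples.max())
```

Passing `{"C": loguniform(1e-6, 1e2)}` to `RandomizedSearchCV` works the same way as the uniform distribution used above.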

Logistic regression with text data

X = [
    "URGENT!! As a valued network customer you have been selected to receive a £900 prize reward!",
    "Lol you are always so convincing.",
    "Nah I don't think he goes to usf, he lives around here though",
    "URGENT! You have won a 1 week FREE membership in our £100000 prize Jackpot!",
    "Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030",
    "As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune"]

y = ["spam", "non spam", "non spam", "spam", "spam", "non spam"]
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
X_transformed = vec.fit_transform(X);
bow_df = pd.DataFrame(X_transformed.toarray(), columns=sorted(vec.vocabulary_), index=X)
bow_df
08002986030 100000 11 900 ... with won you your
URGENT!! As a valued network customer you have been selected to receive a £900 prize reward! 0 0 0 1 ... 0 0 1 0
Lol you are always so convincing. 0 0 0 0 ... 0 0 1 0
Nah I don't think he goes to usf, he lives around here though 0 0 0 0 ... 0 0 0 0
URGENT! You have won a 1 week FREE membership in our £100000 prize Jackpot! 0 1 0 0 ... 0 1 1 0
Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030 1 0 1 0 ... 1 0 0 1
As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune 0 0 0 0 ... 0 0 0 3

6 rows × 72 columns
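One thing worth remembering about this representation: at transform time, `CountVectorizer` silently drops any word that was not seen during `fit`. A tiny self-contained sketch (the toy texts below are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

train_texts = ["free prize now", "see you at lunch"]
vec_demo = CountVectorizer()
vec_demo.fit(train_texts)

# Only "free" is in the training vocabulary; the other words vanish
new_counts = vec_demo.transform(["totally unseen words plus free"])
print(new_counts.sum())  # 1: only "free" was counted
```

So a spam message full of words the model has never seen contributes very little signal, which is a real limitation of bag-of-words features.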

lr_text_model = LogisticRegression()
lr_text_model.fit(X_transformed, y);


pd.DataFrame({'feature': vec.get_feature_names_out(),
              'coefficient': lr_text_model.coef_[0]})
feature coefficient
0 08002986030 0.083694
1 100000 0.147269
2 11 0.083694
... ... ...
69 won 0.147269
70 you 0.111708
71 your -0.149143

72 rows × 2 columns

Let’s apply what we learned!