Logistic Regression

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

cities_df = pd.read_csv("data/canada_usa_cities.csv")
train_df, test_df = train_test_split(cities_df, test_size=0.2, random_state=123)
X_train, y_train = train_df.drop(columns=["country"]), train_df["country"]
X_test, y_test = test_df.drop(columns=["country"]), test_df["country"]

train_df.head()
longitude latitude country
160 -76.4813 44.2307 Canada
127 -81.2496 42.9837 Canada
169 -66.0580 45.2788 Canada
188 -73.2533 45.3057 Canada
187 -67.9245 47.1652 Canada

Setting the stage

from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate

dc = DummyClassifier(strategy="prior")

scores = pd.DataFrame(cross_validate(dc, X_train, y_train, return_train_score=True))
scores
fit_time score_time test_score train_score
0 0.000718 0.000722 0.588235 0.601504
1 0.000507 0.000569 0.588235 0.601504
2 0.000482 0.000557 0.606061 0.597015
3 0.000452 0.000569 0.606061 0.597015
4 0.000452 0.000566 0.606061 0.597015
from sklearn.linear_model import LogisticRegression


lr = LogisticRegression()
scores = pd.DataFrame(cross_validate(lr, X_train, y_train, return_train_score=True))
scores
fit_time score_time test_score train_score
0 0.006823 0.001266 0.852941 0.827068
1 0.004122 0.001170 0.823529 0.827068
2 0.003635 0.001100 0.696970 0.858209
3 0.003740 0.001091 0.787879 0.843284
4 0.003710 0.001083 0.939394 0.805970

Visualizing our model
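The original notes show a decision-boundary plot here. As a stand-in, below is a minimal sketch of how such a plot can be drawn: evaluate the fitted model on a dense grid and shade each region by its predicted class. The synthetic `X_demo`/`y_demo` arrays are assumptions standing in for the cities training data, chosen only so the example is self-contained.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs anywhere
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for (longitude, latitude) data: two well-separated blobs
rng = np.random.default_rng(123)
X_demo = np.vstack([rng.normal([-80, 45], 3, size=(50, 2)),
                    rng.normal([-95, 38], 3, size=(50, 2))])
y_demo = np.array(["Canada"] * 50 + ["USA"] * 50)

lr_demo = LogisticRegression().fit(X_demo, y_demo)

# Predict on a 200x200 grid covering the data, then draw filled contours
xx, yy = np.meshgrid(
    np.linspace(X_demo[:, 0].min() - 1, X_demo[:, 0].max() + 1, 200),
    np.linspace(X_demo[:, 1].min() - 1, X_demo[:, 1].max() + 1, 200),
)
Z = (lr_demo.predict(np.c_[xx.ravel(), yy.ravel()]) == "Canada").reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X_demo[:, 0], X_demo[:, 1], c=(y_demo == "Canada"))
plt.xlabel("longitude")
plt.ylabel("latitude")
```

Because logistic regression is a linear classifier, the boundary between the shaded regions is a straight line in feature space.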



Coefficients

lr = LogisticRegression()
lr.fit(X_train, y_train);


print("Model coefficients:", lr.coef_)
print("Model intercept:", lr.intercept_)
Model coefficients: [[-0.04108378 -0.33683087]]
Model intercept: [10.886759]


data = {'features': X_train.columns, 'coefficients': lr.coef_[0]}
pd.DataFrame(data)
features coefficients
0 longitude -0.041084
1 latitude -0.336831

Predictions

lr.classes_
array(['Canada', 'USA'], dtype=object)


example = X_test.iloc[0,:]
example.tolist()
[-64.8001, 46.098]


(example.tolist() * lr.coef_).sum(axis=1) + lr.intercept_  # raw score: w·x + b
array([-1.97823755])


lr.predict([example])
array(['Canada'], dtype=object)
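To see why the raw score of about -1.978 leads to a "Canada" prediction: passing the score through the sigmoid gives the probability of the positive class, which in scikit-learn is `lr.classes_[1]` ("USA" here). A quick check of the arithmetic:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

raw = -1.97823755             # raw score computed above
p_usa = sigmoid(raw)          # probability of the positive class, 'USA'
p_canada = 1 - p_usa
print(p_usa, p_canada)        # roughly 0.12 and 0.88, so 'Canada' wins
```

This matches what `lr.predict_proba([example])` would report for the two classes.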

Hyperparameter: C

C is the inverse of regularization strength: smaller values of C mean stronger regularization (simpler models, possible underfitting), while larger values mean weaker regularization (possible overfitting).

scores_dict = {
    "C": 10.0 ** np.arange(-6, 2, 1),
    "train_score": [],
    "cv_score": [],
}
for C in scores_dict['C']:
    lr_model = LogisticRegression(C=C)
    results = cross_validate(lr_model, X_train, y_train, return_train_score=True)
    scores_dict['train_score'].append(results["train_score"].mean())
    scores_dict['cv_score'].append(results["test_score"].mean())
pd.DataFrame(scores_dict)
C train_score cv_score
0 0.000001 0.598810 0.598930
1 0.000010 0.598810 0.598930
2 0.000100 0.664707 0.658645
... ... ... ...
5 0.100000 0.832320 0.820143
6 1.000000 0.832320 0.820143
7 10.000000 0.832320 0.820143

8 rows × 3 columns

import scipy
from sklearn.model_selection import RandomizedSearchCV

param_grid = {
    "C": scipy.stats.uniform(0, 100)}

lr = LogisticRegression()
grid_search = RandomizedSearchCV(lr, param_grid, cv=5, return_train_score=True, verbose=1, n_jobs=-1)
grid_search.fit(X_train, y_train);
Fitting 5 folds for each of 10 candidates, totalling 50 fits


grid_search.best_params_
{'C': np.float64(1.5229072268322374)}


grid_search.best_score_
np.float64(0.8201426024955436)
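One caveat with `scipy.stats.uniform(0, 100)`: since we usually search C over several orders of magnitude (as in the loop above), a log-uniform distribution often explores the range more evenly. A small sketch of this alternative, using SciPy's `loguniform`:

```python
from scipy.stats import loguniform

# Log-uniform prior over the same range as the manual sweep above
dist = loguniform(1e-6, 1e2)
samples = dist.rvs(size=1000, random_state=123)

# Samples stay inside [1e-6, 1e2] but are spread evenly on a log scale,
# so values like 1e-4 are as likely to be tried as values like 10
print(samples.min(), samples.max())
```

Passing `{"C": loguniform(1e-6, 1e2)}` to `RandomizedSearchCV` works the same way as the uniform distribution used above.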

Logistic regression with text data

X = [
    "URGENT!! As a valued network customer you have been selected to receive a £900 prize reward!",
    "Lol you are always so convincing.",
    "Nah I don't think he goes to usf, he lives around here though",
    "URGENT! You have won a 1 week FREE membership in our £100000 prize Jackpot!",
    "Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030",
    "As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune"]

y = ["spam", "non spam", "non spam", "spam", "spam", "non spam"]
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
X_transformed = vec.fit_transform(X);
bow_df = pd.DataFrame(X_transformed.toarray(), columns=sorted(vec.vocabulary_), index=X)
bow_df
08002986030 100000 11 900 ... with won you your
URGENT!! As a valued network customer you have been selected to receive a £900 prize reward! 0 0 0 1 ... 0 0 1 0
Lol you are always so convincing. 0 0 0 0 ... 0 0 1 0
Nah I don't think he goes to usf, he lives around here though 0 0 0 0 ... 0 0 0 0
URGENT! You have won a 1 week FREE membership in our £100000 prize Jackpot! 0 1 0 0 ... 0 1 1 0
Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030 1 0 1 0 ... 1 0 0 1
As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune 0 0 0 0 ... 0 0 0 3

6 rows × 72 columns
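One thing worth remembering about this representation: at transform time, `CountVectorizer` silently drops any word that was not seen during `fit`. A tiny self-contained sketch (the toy texts below are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

train_texts = ["free prize now", "see you at lunch"]
vec_demo = CountVectorizer()
vec_demo.fit(train_texts)

# Only "free" is in the training vocabulary; the other words vanish
new_counts = vec_demo.transform(["totally unseen words plus free"])
print(new_counts.sum())  # 1: only "free" was counted
```

So a spam message full of words the model has never seen contributes very little signal, which is a real limitation of bag-of-words features.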

lr_text_model = LogisticRegression()
lr_text_model.fit(X_transformed, y);


pd.DataFrame({'feature': vec.get_feature_names_out(),
              'coefficient': lr_text_model.coef_[0]})
feature coefficient
0 08002986030 0.083694
1 100000 0.147269
2 11 0.083694
... ... ...
69 won 0.147269
70 you 0.111708
71 your -0.149143

72 rows × 2 columns

Let’s apply what we learned!