Tutorial

DSCI524_Group36_MLpipeline is a Python package that simplifies several steps of the machine learning workflow and provides a model comparison function. Its functions cover simple tasks in exploratory data analysis (EDA), pipeline creation, metric computation, and model comparison. The package helps machine learning practitioners streamline these common workflow steps without needing to rely on multiple external libraries.

Here is a demo that uses the package to build an ML pipeline for the Iris dataset:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
from sklearn.tree import DecisionTreeClassifier

from dsci524_group36_mlpipeline.eda import eda
from dsci524_group36_mlpipeline.model_pipeline import create_model_pipeline
from dsci524_group36_mlpipeline.compute_model_metrics import compute_model_metrics
from dsci524_group36_mlpipeline.model_comparison import model_comparison

# Load example dataset
df = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv")

X = df.drop(columns=["species"])
y = df["species"]

# Run EDA on the sepal_length column; eda returns summary statistics and a matplotlib Axes
stats, ax = eda(df, "sepal_length")
print("Summary Statistics:")
print(stats)

# Customize the plot using the returned matplotlib Axes object
ax.set_title("Distribution of Sepal Length")
plt.show()

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

# Build the model pipeline (model='lr') and fit it on the training data
log = create_model_pipeline(X_train, numerical_feat=['sepal_length'], model='lr')
log.fit(X_train, y_train)

# Define metrics
metrics = {
    "accuracy": accuracy_score,
    "f1_macro": lambda y_true, y_pred: f1_score(y_true, y_pred, average="macro"),
}

# Compute metrics
results = compute_model_metrics(log, X_test, y_test, metrics)
print(results)


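# Compare the fitted pipeline against a decision tree using model_comparison
dt = DecisionTreeClassifier(random_state=123)  # fixed seed so the comparison is reproducible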
dt.fit(X_train, y_train)
best_model = model_comparison([log, dt], X, y, metric='accuracy_score')
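
For readers who want to see what the last two steps amount to without the package, here is a minimal sketch that uses only scikit-learn. It is not the package's implementation: the helper names `score_model` and `pick_best` are hypothetical, and ranking models by 5-fold cross-validated accuracy is an assumption, since the demo above does not state how `model_comparison` scores its candidates. The sketch loads the copy of Iris bundled with scikit-learn so that it runs offline.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


def score_model(model, X_test, y_test, metrics):
    """Apply each metric callable to a fitted model's predictions (hypothetical helper)."""
    y_pred = model.predict(X_test)
    return {name: fn(y_test, y_pred) for name, fn in metrics.items()}


def pick_best(models, X, y, cv=5):
    """Rank models by mean cross-validated accuracy (assumed criterion, hypothetical helper)."""
    mean_scores = {
        name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
        for name, model in models.items()
    }
    best_name = max(mean_scores, key=mean_scores.get)
    return best_name, mean_scores


# Bundled copy of the Iris data, returned as a DataFrame and a Series
X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

models = {
    "lr": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "dt": DecisionTreeClassifier(random_state=123),
}

metrics = {
    "accuracy": accuracy_score,
    "f1_macro": lambda y_true, y_pred: f1_score(y_true, y_pred, average="macro"),
}

# Held-out metrics for a single fitted model
lr = models["lr"].fit(X_train, y_train)
results = score_model(lr, X_test, y_test, metrics)
print(results)

# Cross-validated comparison; cross_val_score clones each estimator before fitting
best_name, mean_scores = pick_best(models, X, y)
print(best_name, mean_scores)
```

Passing metrics as a dictionary of callables keeps the scoring step independent of any particular metric, which is why the lambda wrapper is enough to fix `average="macro"` for the F1 score.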