Survey Cleaner
Tutorial: Cleaning a Survey Dataset
This tutorial walks through cleaning a small survey dataset with the four functions in survey_cleaner: handle_emptyStrings, normalize_binary, word_to_ordinal, and remove_duplicates.

```python
import pandas as pd
from survey_cleaner import (
    handle_emptyStrings,
    normalize_binary,
    word_to_ordinal,
    remove_duplicates,
)

# Example raw survey data
df = pd.DataFrame({
    "respondent_id": [1, 1, 2, 3],
    "completed_at": [
        "2024-01-01 10:00",
        "2024-01-01 11:00",
        "2024-01-01 12:00",
        "2024-01-01 13:00",
    ],
    "satisfaction": [
        " strongly agree ",
        "Agree",
        " disagree ",
        "Neither agree nor disagree",
    ],
    "recommend": ["Yes", "No", "Yes", "True"],
})
df
```

Expected Output:

| respondent_id | completed_at | satisfaction | recommend |
|---|---|---|---|
| 1 | 2024-01-01 10:00 | " strongly agree " | Yes |
| 1 | 2024-01-01 11:00 | Agree | No |
| 2 | 2024-01-01 12:00 | " disagree " | Yes |
| 3 | 2024-01-01 13:00 | Neither agree nor disagree | True |

Step 1: Clean whitespace

```python
df["satisfaction"] = df["satisfaction"].apply(handle_emptyStrings)
df
```

Expected Output:

| respondent_id | completed_at | satisfaction | recommend |
|---|---|---|---|
| 1 | 2024-01-01 10:00 | strongly agree | Yes |
| 1 | 2024-01-01 11:00 | Agree | No |
| 2 | 2024-01-01 12:00 | disagree | Yes |
| 3 | 2024-01-01 13:00 | Neither agree nor disagree | True |
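
The tutorial does not show how handle_emptyStrings works internally. As a rough illustration, a comparable per-value cleaner could be sketched as follows, assuming it strips surrounding whitespace and treats whitespace-only answers as missing. The name strip_or_none and the choice to return None are hypothetical and not part of survey_cleaner.

```python
from typing import Any


def strip_or_none(value: Any) -> Any:
    """Hypothetical stand-in for handle_emptyStrings (behaviour assumed)."""
    if not isinstance(value, str):
        # Leave numbers, None, and other non-string values untouched.
        return value
    stripped = value.strip()
    # Assumption: an answer that is empty after stripping counts as missing.
    return stripped if stripped else None


raw = [" strongly agree ", "Agree", " disagree ", "   "]
cleaned = [strip_or_none(v) for v in raw]
print(cleaned)  # ['strongly agree', 'Agree', 'disagree', None]
```

Because the function takes a single value, it could be passed to Series.apply in the same way as handle_emptyStrings in the step above.
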

Step 2: Normalize binary responses

```python
df["recommend"] = df["recommend"].apply(normalize_binary)
df
```

Expected Output:

| respondent_id | completed_at | satisfaction | recommend |
|---|---|---|---|
| 1 | 2024-01-01 10:00 | strongly agree | 1 |
| 1 | 2024-01-01 11:00 | Agree | 0 |
| 2 | 2024-01-01 12:00 | disagree | 1 |
| 3 | 2024-01-01 13:00 | Neither agree nor disagree | 1 |
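
The output above shows "Yes" and "True" both becoming 1 and "No" becoming 0. A minimal sketch of that kind of mapping is given here for illustration only. The helper name to_binary, the accepted vocabularies, and the decision to return None for unrecognized answers are all assumptions; the real normalize_binary may behave differently.

```python
from typing import Any, Optional

# Assumed vocabularies; the real normalize_binary may accept more or fewer tokens.
TRUTHY = {"yes", "y", "true", "1"}
FALSY = {"no", "n", "false", "0"}


def to_binary(value: Any) -> Optional[int]:
    """Hypothetical stand-in for normalize_binary (behaviour assumed)."""
    # Compare case-insensitively and ignore surrounding whitespace.
    token = str(value).strip().lower()
    if token in TRUTHY:
        return 1
    if token in FALSY:
        return 0
    # Assumption: anything unrecognized is reported as missing, not guessed.
    return None


answers = ["Yes", "No", "Yes", "True"]
encoded = [to_binary(a) for a in answers]
print(encoded)  # [1, 0, 1, 1]
```

Returning None for unknown tokens keeps ambiguous answers such as "maybe" visible as missing data instead of silently coercing them to 0.
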

Step 3: Convert ordinal responses to numeric

```python
df["satisfaction_score"] = word_to_ordinal(df["satisfaction"], likert="agreement")
df
```

Expected Output:

| respondent_id | completed_at | satisfaction | recommend | satisfaction_score |
|---|---|---|---|---|
| 1 | 2024-01-01 10:00 | strongly agree | 1 | 5 |
| 1 | 2024-01-01 11:00 | Agree | 0 | 4 |
| 2 | 2024-01-01 12:00 | disagree | 1 | 2 |
| 3 | 2024-01-01 13:00 | Neither agree nor disagree | 1 | 3 |
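
The scores above are consistent with a standard five-point agreement scale matched case-insensitively. The lookup below is a sketch of that idea, not the library's implementation: the names AGREEMENT_SCALE and agreement_score are hypothetical, and the scores for labels that do not appear in the example data (such as "strongly disagree") are assumed.

```python
from typing import Optional

# Assumed five-point agreement scale. Only the values 2 to 5 are confirmed by
# the tutorial output; "strongly disagree" = 1 is an assumption.
AGREEMENT_SCALE = {
    "strongly disagree": 1,
    "disagree": 2,
    "neither agree nor disagree": 3,
    "agree": 4,
    "strongly agree": 5,
}


def agreement_score(label: str) -> Optional[int]:
    """Hypothetical per-value version of word_to_ordinal(..., likert="agreement")."""
    # Normalize case and whitespace before the lookup; unknown labels give None.
    return AGREEMENT_SCALE.get(label.strip().lower())


labels = ["strongly agree", "Agree", "disagree", "Neither agree nor disagree"]
scores = [agreement_score(label) for label in labels]
print(scores)  # [5, 4, 2, 3]
```

With pandas, the same function could be applied column-wise through Series.map, which is one plausible way a Series-level function like word_to_ordinal might be built.
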
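
Step 4: Remove duplicate responses

Respondent 1 submitted twice. As the output shows, only the later submission (11:00) is kept.

```python
df = remove_duplicates(df, "respondent_id", "completed_at")
df
```

Expected Output:
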
| respondent_id | completed_at | satisfaction | recommend | satisfaction_score |
|---|---|---|---|---|
| 1 | 2024-01-01 11:00 | Agree | 0 | 4 |
| 2 | 2024-01-01 12:00 | disagree | 1 | 2 |
| 3 | 2024-01-01 13:00 | Neither agree nor disagree | 1 | 3 |
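
The result above keeps the most recent row per respondent. For readers who want to see the same effect in plain pandas, here is a sketch that assumes "keep the latest completed_at for each respondent_id" is the intended rule. It is not necessarily how remove_duplicates is implemented, and the reduced example frame is for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    "respondent_id": [1, 1, 2, 3],
    "completed_at": [
        "2024-01-01 10:00",
        "2024-01-01 11:00",
        "2024-01-01 12:00",
        "2024-01-01 13:00",
    ],
    "satisfaction_score": [5, 4, 2, 3],
})

# Parse timestamps so that sorting is chronological rather than lexicographic.
df["completed_at"] = pd.to_datetime(df["completed_at"])

deduped = (
    df.sort_values("completed_at")                            # oldest first
      .drop_duplicates(subset="respondent_id", keep="last")   # keep the newest row per respondent
      .sort_values("respondent_id")
      .reset_index(drop=True)
)
print(deduped["satisfaction_score"].tolist())  # [4, 2, 3]
```

Sorting before drop_duplicates matters here: keep="last" refers to row order, so the rows must already be in chronological order for the newest submission to survive.
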