Getting Started with sanityzeR
The goal of sanityzeR
Data scientists often need to remove or redact Personally Identifiable Information (PII) from their data. This package provides utilities to spot and redact PII in R data frames and tibbles.
PII is any information that can be used to uniquely identify a person, including names, addresses, credit card numbers, phone numbers, email addresses, and social security numbers. Regulations such as the European Union’s General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) therefore require that PII be removed or redacted from data sets before they are shared and further processed.
Functionalities
This document introduces you to the fundamental tools of sanityzeR and shows you how to apply them with data frames.
The package provides three groups of functionality: data cleaning, credit card number handling, and email address handling.
Installation
You can install the development version of sanityzeR from GitHub with:
# install.packages("devtools")
devtools::install_github("UBC-MDS/sanityzeR")
Setup
library(sanityzeR)
Create a dummy data frame:
df <- data.frame(
Name = c("My email address is 123456abcd@yahoo.com and zzzzz123@yahoo.mail Thank you.",
"Bill for: 4556129404313766",
"Maria",
"Ben",
"Tina"),
Age = c(23, 41, 32, 16, 26)
)
df
## Name
## 1 My email address is 123456abcd@yahoo.com and zzzzz123@yahoo.mail Thank you.
## 2 Bill for: 4556129404313766
## 3 Maria
## 4 Ben
## 5 Tina
## Age
## 1 23
## 2 41
## 3 32
## 4 16
## 5 26
Clean PII by redacting
Replace each detected PII value with a fixed placeholder string.
clean_data_frame(df, spotters_redacted)
## Name Age
## 1 My email address is EMAILADDRS and EMAILADDRS Thank you. 23
## 2 Bill for: CREDITCARD 41
## 3 Maria 32
## 4 Ben 16
## 5 Tina 26
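To make the idea of fixed-string redaction concrete, here is a minimal base R sketch. It is not sanityzeR's implementation: the regular expressions and the helpers redact_text() and redact_data_frame() are hypothetical and chosen only for illustration. The placeholder strings mirror the ones in the output above.

```r
# Illustrative patterns only (assumptions, not taken from sanityzeR).
email_pattern <- "[[:alnum:]._%+-]+@[[:alnum:].-]+\\.[[:alpha:]]{2,}"
card_pattern <- "\\b[0-9]{16}\\b"  # 16 consecutive digits

# redact_text() is a hypothetical helper, not a sanityzeR function.
redact_text <- function(x) {
  x <- gsub(email_pattern, "EMAILADDRS", x, perl = TRUE)
  gsub(card_pattern, "CREDITCARD", x, perl = TRUE)
}

# Apply the helper to every character column and leave other columns untouched.
redact_data_frame <- function(data) {
  data[] <- lapply(data, function(col) {
    if (is.character(col)) redact_text(col) else col
  })
  data
}

example <- data.frame(
  Name = c("Write to zzzzz123@yahoo.mail", "Bill for: 4556129404313766", "Maria"),
  Age = c(23, 41, 32),
  stringsAsFactors = FALSE
)
redacted <- redact_data_frame(example)
```

A simple 16-digit pattern like this one would miss card numbers written with spaces or dashes and would flag any other 16-digit number, so a real spotter is likely to need stricter checks.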
Clean PII by hashing
Replace each detected PII value with a hash of that value.
clean_data_frame(df, spotters_hashed)
## Name
## 1 My email address is 00345d02eb20733e49077c9618f0d598 and ba68a57288bf24140628f37aadbb7920 Thank you.
## 2 Bill for: e93723ee0d38e30a68902aef6b0033de
## 3 Maria
## 4 Ben
## 5 Tina
## Age
## 1 23
## 2 41
## 3 32
## 4 16
## 5 26
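Hash-based replacement can be sketched in a similar way. The example below is an assumption-laden illustration rather than the package's own code: hash_matches() is a hypothetical helper, the email pattern is illustrative, and MD5 via the digest package (which must be installed) is assumed only because it yields 32-character hexadecimal strings like those shown above.

```r
# hash_matches() is a hypothetical helper, not a sanityzeR function.
# It replaces every match of `pattern` in `x` with the MD5 hash of the matched text.
hash_matches <- function(x, pattern) {
  m <- gregexpr(pattern, x, perl = TRUE)
  regmatches(x, m) <- lapply(regmatches(x, m), function(hits) {
    # Hash each matched substring as a raw string (serialize = FALSE).
    vapply(hits, digest::digest, character(1),
           algo = "md5", serialize = FALSE, USE.NAMES = FALSE)
  })
  x
}

# Illustrative email pattern (an assumption, not taken from sanityzeR).
email_pattern <- "[[:alnum:]._%+-]+@[[:alnum:].-]+\\.[[:alpha:]]{2,}"

texts <- c("Contact: 123456abcd@yahoo.com", "Maria")
hashed <- hash_matches(texts, email_pattern)
```

Because the same input always produces the same hash, records that share an email address can still be linked after cleaning, which a fixed placeholder does not allow. An unsalted hash of a low-entropy value such as a card number can be reversed by brute force, so hashing alone may not be sufficient for sensitive data.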