
The goal of sanityzeR

Data scientists often need to remove or redact Personally Identifiable Information (PII) from their data. This package provides utilities to spot and redact PII in R data frames and tibbles.

PII is information that can be used to uniquely identify a person, such as names, addresses, credit card numbers, phone numbers, email addresses, and social security numbers. Regulations such as the European Union’s General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) therefore require that PII be removed or redacted from data sets before they are shared and further processed.

Functionalities

This document introduces you to the fundamental tools of sanityzeR and shows you how to apply them with data frames.

The package provides three functions: clean_data_frame() for cleaning a data frame, redact_creditcardnumber() for handling credit card numbers, and redact_email() for handling email addresses.

Installation

You can install the development version of sanityzeR from GitHub with:

# install.packages("devtools")
devtools::install_github("UBC-MDS/sanityzeR")

Setup

Create a dummy dataframe

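# Load the package so that the functions used in the following steps are available
library(sanityzeR)

# Example data: free text that contains two email addresses and a credit card number
df <- data.frame(
  Name = c(
    "My email address is 123456abcd@yahoo.com and zzzzz123@yahoo.mail Thank you.",
    "Bill for: 4556129404313766",
    "Maria",
    "Ben",
    "Tina"
  ),
  Age = c(23, 41, 32, 16, 26)
)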
df
##                                                                          Name
## 1 My email address is 123456abcd@yahoo.com and zzzzz123@yahoo.mail Thank you.
## 2                                                  Bill for: 4556129404313766
## 3                                                                       Maria
## 4                                                                         Ben
## 5                                                                        Tina
##   Age
## 1  23
## 2  41
## 3  32
## 4  16
## 5  26

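Create spotters with fixed-string redaction

The following spotters replace each detected PII value with a fixed string. In these examples, each spotter is a list of three elements: the redaction function, a flag that is FALSE for fixed-string replacement, and the replacement string.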

spotter_1_r <- list(redact_email, FALSE, "EMAILADDRS")
spotter_2_r <- list(redact_creditcardnumber, FALSE, "CREDITCARD")
spotters_redacted <- list(spotter_2_r, spotter_1_r)
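
To make the idea of fixed-string replacement concrete, here is a minimal base R sketch that does not use sanityzeR at all. The two regular expressions and the helper redact_fixed() are assumptions made for illustration; the patterns that redact_email and redact_creditcardnumber actually use are not shown in this document and may differ.

```r
# Hypothetical patterns for illustration only; not taken from sanityzeR.
email_pattern <- "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}"
card_pattern  <- "\\b[0-9]{13,16}\\b"

# Hypothetical helper: replace every match of `pattern` in `text`
# with the fixed string `replacement`.
redact_fixed <- function(text, pattern, replacement) {
  gsub(pattern, replacement, text, perl = TRUE)
}

x <- c(
  "My email address is 123456abcd@yahoo.com and zzzzz123@yahoo.mail Thank you.",
  "Bill for: 4556129404313766",
  "Maria"
)

# Apply the credit card pattern first, then the email pattern.
x <- redact_fixed(x, card_pattern, "CREDITCARD")
x <- redact_fixed(x, email_pattern, "EMAILADDRS")
x
# [1] "My email address is EMAILADDRS and EMAILADDRS Thank you."
# [2] "Bill for: CREDITCARD"
# [3] "Maria"
```

Strings without a match, such as "Maria", pass through unchanged, which is the same behaviour the cleaned data frame shows for the rows without PII.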

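Create spotters with hashing

The following spotters replace each detected PII value with a hash of that value. Here the second element is TRUE, and the third element is 0 in place of a replacement string.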

spotter_1_h <- list(redact_email, TRUE, 0)
spotter_2_h <- list(redact_creditcardnumber, TRUE, 0)
spotters_hashed <- list(spotter_2_h, spotter_1_h)
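
Replacing each match with its own hash can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: md5_string() and hash_matches() are hypothetical helpers, the email pattern is made up for the example, and MD5 is only a guess. The hashed values printed in this document are 32 hexadecimal characters long, which is the length of an MD5 digest, but the algorithm sanityzeR uses is not stated here.

```r
# Hypothetical helper: MD5 of a single string using only base R.
# tools::md5sum() hashes files, so the string's bytes go to a temp file first.
md5_string <- function(x) {
  tmp <- tempfile()
  on.exit(unlink(tmp), add = TRUE)
  writeBin(charToRaw(x), tmp)
  unname(tools::md5sum(tmp))
}

# Hypothetical helper: replace every match of `pattern` with the hash
# of the matched text, leaving the rest of each string untouched.
hash_matches <- function(text, pattern) {
  m <- gregexpr(pattern, text, perl = TRUE)
  regmatches(text, m) <- lapply(
    regmatches(text, m),
    function(hits) vapply(hits, md5_string, character(1), USE.NAMES = FALSE)
  )
  text
}

# Assumed email pattern, for illustration only.
email_pattern <- "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}"

out <- hash_matches(c("Contact: a@example.com", "Maria"), email_pattern)
out
```

Because the hash is deterministic, the same email address always maps to the same token, so rows that shared a value before cleaning still share one afterwards.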

Clean PII with fixed-string redaction

Passing spotters_redacted to clean_data_frame() replaces each detected PII value with its fixed string.

clean_data_frame(df, spotters_redacted)
##                                                       Name Age
## 1 My email address is EMAILADDRS and EMAILADDRS Thank you.  23
## 2                                     Bill for: CREDITCARD  41
## 3                                                    Maria  32
## 4                                                      Ben  16
## 5                                                     Tina  26
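
The output above shows that only the text column changes while Age is left alone. A rough base R sketch of that kind of column-wise cleaning is given below. The function apply_spotters(), the pattern/replacement shape of its spotters, and both regular expressions are assumptions for illustration. They are not the sanityzeR API, whose spotters are lists of a function, a flag, and a replacement value.

```r
# Hypothetical helper: run every spotter over every character column.
# Each spotter here is a list with `pattern` and `replacement` (assumed shape).
apply_spotters <- function(data, spotters) {
  is_text <- vapply(data, is.character, logical(1))
  for (col in names(data)[is_text]) {
    for (s in spotters) {
      # Spotters run in sequence, so later ones see earlier replacements.
      data[[col]] <- gsub(s$pattern, s$replacement, data[[col]], perl = TRUE)
    }
  }
  data
}

# Assumed patterns, for illustration only.
spotters <- list(
  list(pattern = "\\b[0-9]{13,16}\\b", replacement = "CREDITCARD"),
  list(pattern = "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}",
       replacement = "EMAILADDRS")
)

people <- data.frame(
  Name = c("Bill for: 4556129404313766", "Write to ben@example.org", "Tina"),
  Age = c(41, 16, 26),
  stringsAsFactors = FALSE
)

cleaned <- apply_spotters(people, spotters)
cleaned
```

Numeric columns are skipped entirely in this sketch, so a value such as an age can never be mistaken for part of a card number.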

Clean PII with hashing

Passing spotters_hashed to clean_data_frame() replaces each detected PII value with a hash of that value.

clean_data_frame(df, spotters_hashed)
##                                                                                                   Name
## 1 My email address is 00345d02eb20733e49077c9618f0d598 and ba68a57288bf24140628f37aadbb7920 Thank you.
## 2                                                           Bill for: e93723ee0d38e30a68902aef6b0033de
## 3                                                                                                Maria
## 4                                                                                                  Ben
## 5                                                                                                 Tina
##   Age
## 1  23
## 2  41
## 3  32
## 4  16
## 5  26