Getting Started with sanityze

If you are developing locally

After checking out the repository, run the following command to install the package and its dependencies:

$ poetry install

Open this notebook in JupyterLab and run the cells in the “Examples” section.

If you are using the public package

To use sanityze in a project, you can install it from PyPI:

$ pip install sanityze

Then, you can import it in your code:

from sanityze.cleanser import *
from sanityze.spotters import *

Examples

# set up a dummy test dataframe
import pandas as pd
data = {'product_name': ['laptop', 'printer foo@gaga.com', 'tablet', 'desk 5555 5555 5555 4444', 'chair'],
        'price': [1200, 150, 300, 450, 200]
        }
df = pd.DataFrame(data)
df.head()

   product_name              price
0  laptop                    1200
1  printer foo@gaga.com      150
2  tablet                    300
3  desk 5555 5555 5555 4444  450
4  chair                     200

from sanityze.cleanser import *
c = Cleanser()
c.clean(df, verbose=False)

   product_name              price
0  laptop                    1200
1  printer EMAILADDRS        150
2  tablet                    300
3  desk 5555 5555 5555 4444  450
4  chair                     200
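
The replacement seen in the output above, where an email address becomes the EMAILADDRS token, can be illustrated with a small standard-library sketch. This is not sanityze's implementation: the regular expression and the helper name `mask_emails` are assumptions made purely for illustration.

```python
import re

# Simplified email pattern, assumed for illustration only; real-world
# address matching is more involved than this.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_emails(text: str, token: str = "EMAILADDRS") -> str:
    """Replace every email-like substring in `text` with `token`."""
    return EMAIL_PATTERN.sub(token, text)


print(mask_emails("printer foo@gaga.com"))  # prints: printer EMAILADDRS
print(mask_emails("laptop"))                # prints: laptop
```

Text without a match passes through unchanged, which mirrors how the rows without an email address are left as they were.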

Testing with dummy data

df_with_pii = pd.read_csv("../tests/data_with_pii.csv")
df_with_pii.head()

   first_name  last_name  email_address                                      visa_cc                                   master_cc         balance  active_member  age
0  Jacob       King       the following is my email address JacobKing100...  this is my credit card: 4658481398602920  5339168719695860  100      1              24
1  Chloe       Lavoie     the following is my email address ChloeLavoie2...  this is my credit card: 4532546510575280  5284482559079650  200      0              36
2  Myles       Clark      MylesClark300@hotmail.com                          this is my credit card: 4539650939655290  5338287181016540  300      1              23
3  Daniel      Murray     DanielMurray400@outlook.ca                         this is my credit card: 4716505160113470  5581255820397210  400      0              28
4  Lucy        Landry     LucyLandry500@ubc.ca                               this is my credit card: 4716908400371550  5453813871212040  500      1              37

c.clean(df_with_pii)

    first_name  last_name  email_address                                 visa_cc                             master_cc                                  balance  active_member  age
0   Jacob       King       the following is my email address EMAILADDRS  this is my credit card: CREDITCARD  CREDITCARD                                 100      1              24
1   Chloe       Lavoie     the following is my email address EMAILADDRS  this is my credit card: CREDITCARD  CREDITCARD                                 200      0              36
2   Myles       Clark      EMAILADDRS                                    this is my credit card: CREDITCARD  CREDITCARD                                 300      1              23
3   Daniel      Murray     EMAILADDRS                                    this is my credit card: CREDITCARD  CREDITCARD                                 400      0              28
4   Lucy        Landry     EMAILADDRS                                    this is my credit card: CREDITCARD  CREDITCARD                                 500      1              37
5   Austin      Cote       EMAILADDRS                                    this is my credit card: CREDITCARD  CREDITCARD                                 600      1              31
6   Leo         Leblanc    EMAILADDRS                                    this is my credit card: CREDITCARD  this is my master card number: CREDITCARD  700      0              41
7   Luke        Cote       EMAILADDRS                                    CREDITCARD                          CREDITCARD                                 800      1              43
8   Chloe       Martin     EMAILADDRS                                    CREDITCARD                          CREDITCARD                                 900      0              58
9   Sophia      Taylor     EMAILADDRS                                    CREDITCARD                          CREDITCARD                                 1000     1              67
10  Sebastian   Li         EMAILADDRS                                    CREDITCARD                          CREDITCARD                                 1100     0              25
11  Theodore    Walker     EMAILADDRS                                    CREDITCARD                          CREDITCARD                                 1200     1              29
12  Grayson     Moore      EMAILADDRS                                    CREDITCARD                          CREDITCARD                                 1300     0              38
13  Madelyn     Ross       EMAILADDRS                                    CREDITCARD                          CREDITCARD                                 1400     1              64
14  Charlie     Johnson    EMAILADDRS                                    CREDITCARD                          CREDITCARD                                 1500     0              66
15  Isaac       Davis      EMAILADDRS                                    CREDITCARD                          CREDITCARD                                 1600     1              55
16  Grace       Thomas     EMAILADDRS                                    CREDITCARD                          CREDITCARD                                 1700     0              43
17  Kayden      Thomas     EMAILADDRS                                    CREDITCARD                          CREDITCARD                                 1800     1              48
18  Peyton      Bergeron   EMAILADDRS                                    CREDITCARD                          CREDITCARD                                 1900     0              58
19  Evelyn      Johnston   EMAILADDRS                                    CREDITCARD                          CREDITCARD                                 2000     0              29

df_without_pii = pd.read_csv("../tests/data_without_pii.csv")
df_without_pii.head()

   first_name  last_name  balance  active_member  age
0  Jacob       King       100      1              24
1  Chloe       Lavoie     200      0              36
2  Myles       Clark      300      1              23
3  Daniel      Murray     400      0              28
4  Lucy        Landry     500      1              37

c.clean(df_without_pii)

    first_name  last_name  balance  active_member  age
0   Jacob       King       100      1              24
1   Chloe       Lavoie     200      0              36
2   Myles       Clark      300      1              23
3   Daniel      Murray     400      0              28
4   Lucy        Landry     500      1              37
5   Austin      Cote       600      1              31
6   Leo         Leblanc    700      0              41
7   Luke        Cote       800      1              43
8   Chloe       Martin     900      0              58
9   Sophia      Taylor     1000     1              67
10  Sebastian   Li         1100     0              25
11  Theodore    Walker     1200     1              29
12  Grayson     Moore      1300     0              38
13  Madelyn     Ross       1400     1              64
14  Charlie     Johnson    1500     0              66
15  Isaac       Davis      1600     1              55
16  Grace       Thomas     1700     0              43
17  Kayden      Thomas     1800     1              48
18  Peyton      Bergeron   1900     0              58
19  Evelyn      Johnston   2000     0              29

Testing with dummy data (hashing)

# create a Cleanser without the default spotters, then add spotters configured to hash their matches
c = Cleanser(include_default_spotters=False)
s1 = EmailSpotter("EMAILS", True)
s2 = CreditCardSpotter("CREDITCARDS", True)
c.add_spotter(s1)
c.add_spotter(s2)
c.clean(df_with_pii)

    first_name  last_name  email_address                                      visa_cc                                            master_cc                                          balance  active_member  age
0   Jacob       King       the following is my email address d3ebf4160b78...  this is my credit card: 49bd7b28d230d310f17685...  c077ea8331f1357b655546c0c6dd030c                   100      1              24
1   Chloe       Lavoie     the following is my email address ccc33e19aed8...  this is my credit card: c44d39f19fdd5f1b07c2e4...  1beee7a164fd2c76e8b4e588131d64c9                   200      0              36
2   Myles       Clark      8a7ab41909ffabb0e5bd3e759e926550                   this is my credit card: b31abe94871fbe42c4805a...  ff05ef02f05938f288220125c64d72e3                   300      1              23
3   Daniel      Murray     e786525bbe5cdbd0e1b0723580c0be35                   this is my credit card: 1d5378ff80e8dbdf12c3d1...  8c33e7d6d9f5b8b3e9f68b11416e9fe3                   400      0              28
4   Lucy        Landry     66aa0f296907a96a0a15a64c54a0271c                   this is my credit card: f2d2c9b2a8fdfd376aa806...  2fe45c134eebaf4476d23fea7642824a                   500      1              37
5   Austin      Cote       1aac2a4671c819e7ba448156eb2a945d                   this is my credit card: e0690ff65144d6ff24fc2c...  9e3172495cc711edec297ef8a5f095a5                   600      1              31
6   Leo         Leblanc    b575f97818252f3ae9de320f38e6a26b                   this is my credit card: bee6a0d7b4c91337eef146...  this is my master card number: dfcc41738316c7d...  700      0              41
7   Luke        Cote       726ff593c99ae2734b61f7ba08cc0339                   c68e0b958a085134e22585ac92c08e3c                   6bb66b6d89abbbf0718f66720547ce57                   800      1              43
8   Chloe       Martin     40b148d282dfa8d609ab18207188463a                   22dabc63d739da192ef30a2bbcb06e61                   a8f9b7c5b5e4c4b3f8a4d37d51078cac                   900      0              58
9   Sophia      Taylor     a6c9c2015492ed280d2da2b94f3f37a4                   444225bea558baa2a4c006f688853d1f                   6ad1e9a84f5dbd02512391a07c001ed5                   1000     1              67
10  Sebastian   Li         026dec7828d784304966748f8aac14e4                   10fc94534572f34b028c61523f4605fe                   d45be3c831abd474467bdd0b5d9e4dd6                   1100     0              25
11  Theodore    Walker     0c89577b11f335687d51f6769baf809b                   b28d6de54411d3884d8b24d55e285004                   52ccea324a0c1a50a266a33283645042                   1200     1              29
12  Grayson     Moore      4e7ca558fbc639c4cb24f8588c118b3d                   ba76eb7fc1be6293e5feba89f5c7639f                   39c34c75de06effa59da842781653c4a                   1300     0              38
13  Madelyn     Ross       94cdc647d86ef5a8b54d5ff54b4c35b4                   211acd46154d7a437dcc03b3ce46e5ce                   bdefeaadfd160b9ff82eb33bd2726b1d                   1400     1              64
14  Charlie     Johnson    3c235fdb2aa343a247c0f51ceda5eabe                   ba97f2ad35d4d9f6fe00ed50e2762327                   9fe8835d0d95deccf7dfd99c02c9c294                   1500     0              66
15  Isaac       Davis      ffd1f67fff8581d26db24b85ce1d479a                   53309f3853ff954ef7ed621b38501e28                   e2737b551ebe7a0f4842f3f11bc2aa87                   1600     1              55
16  Grace       Thomas     8d54befa39a3f13bea178f38a8fc67de                   99a0625ae373ff242d7ed9c76930b836                   8aba9728ea64663867b50a17c10bf729                   1700     0              43
17  Kayden      Thomas     e71770b14ccf5aa8587750c5c5318f4a                   779c725caf15e67c16b59536eaa5b862                   92610a6913a995c2d9f5e08bfcd6c105                   1800     1              48
18  Peyton      Bergeron   b7299528a41c8f5baf74ecc541b7aa4e                   060783327b0e977a61614fa2129a7328                   68321411ad37a3dccabe7902620ef7d0                   1900     0              58
19  Evelyn      Johnston   95473fc56071e41d16b3b769a07d17ad                   a22af5a670e749c4e8529a840088c372                   83d71d2e14d5de862d8bcd28c23c5417                   2000     0              29
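
To make the hashing mode more concrete, here is a minimal standard-library sketch of replacing each matched email address with a hex digest while leaving the surrounding text intact. It is not sanityze's implementation. The 32-character values in the output above have the same length as an MD5 hex digest, but the choice of MD5 here is an assumption, as are the simplified regular expression and the helper name `hash_emails`.

```python
import hashlib
import re

# Simplified email pattern, assumed for illustration only.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def hash_emails(text: str) -> str:
    """Replace every email-like substring in `text` with its MD5 hex digest."""

    def _digest(match: re.Match) -> str:
        # MD5 is an assumption; it is used here only because it yields
        # 32 hexadecimal characters.
        return hashlib.md5(match.group(0).encode("utf-8")).hexdigest()

    return EMAIL_PATTERN.sub(_digest, text)


masked = hash_emails("the following is my email address foo@gaga.com")
print(masked)
```

Unlike a fixed placeholder token, a deterministic digest maps equal inputs to equal outputs, so cleaned rows that shared an email address can still be grouped or joined on the hashed value.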