Getting Started with sanityze
If you are developing locally
After checking out the repository, run the following command to install the package and its dependencies:
$ poetry install
Then open this notebook in JupyterLab and run the cells in the “Examples” section.
If you are using the public package
To use sanityze in a project, you can install it from PyPI:
$ pip install sanityze
Then, you can import it in your code:
from sanityze.cleanser import *
from sanityze.spotters import *
Examples
# Set up a dummy test DataFrame
import pandas as pd

data = {
    'product_name': ['laptop', 'printer foo@gaga.com', 'tablet', 'desk 5555 5555 5555 4444', 'chair'],
    'price': [1200, 150, 300, 450, 200],
}
df = pd.DataFrame(data)
df.head()
| | product_name | price |
---|---|---|
0 | laptop | 1200 |
1 | printer foo@gaga.com | 150 |
2 | tablet | 300 |
3 | desk 5555 5555 5555 4444 | 450 |
4 | chair | 200 |
from sanityze.cleanser import *
c = Cleanser()
c.clean(df, verbose=False)
| | product_name | price |
---|---|---|
0 | laptop | 1200 |
1 | printer EMAILADDRS | 150 |
2 | tablet | 300 |
3 | desk 5555 5555 5555 4444 | 450 |
4 | chair | 200 |
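The masking shown in the output above can be illustrated with a minimal sketch that uses only the standard library. This is not sanityze's implementation: the pattern `EMAIL_PATTERN` and the helper `mask_emails` are hypothetical names introduced here, and the regular expression is deliberately simple.

```python
import re

# Hypothetical, deliberately simple email pattern (an assumption, not sanityze's own).
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_emails(value, placeholder="EMAILADDRS"):
    """Replace email-like substrings in a string; return other values unchanged."""
    if isinstance(value, str):
        return EMAIL_PATTERN.sub(placeholder, value)
    return value


# Strings are scanned and masked; non-string values such as prices pass through.
product_names = ["laptop", "printer foo@gaga.com", "tablet"]
masked = [mask_emails(name) for name in product_names]
prices = [mask_emails(price) for price in [1200, 150, 300]]
```

With pandas, the same helper could be applied to a single column through `df['product_name'].map(mask_emails)`.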
Testing with dummy data
df_with_pii = pd.read_csv("../tests/data_with_pii.csv")
df_with_pii.head()
| | first_name | last_name | email_address | visa_cc | master_cc | balance | active_member | age |
---|---|---|---|---|---|---|---|---|
0 | Jacob | King | the following is my email address JacobKing100... | this is my credit card: 4658481398602920 | 5339168719695860 | 100 | 1 | 24 |
1 | Chloe | Lavoie | the following is my email address ChloeLavoie2... | this is my credit card: 4532546510575280 | 5284482559079650 | 200 | 0 | 36 |
2 | Myles | Clark | MylesClark300@hotmail.com | this is my credit card: 4539650939655290 | 5338287181016540 | 300 | 1 | 23 |
3 | Daniel | Murray | DanielMurray400@outlook.ca | this is my credit card: 4716505160113470 | 5581255820397210 | 400 | 0 | 28 |
4 | Lucy | Landry | LucyLandry500@ubc.ca | this is my credit card: 4716908400371550 | 5453813871212040 | 500 | 1 | 37 |
c.clean(df_with_pii)
| | first_name | last_name | email_address | visa_cc | master_cc | balance | active_member | age |
---|---|---|---|---|---|---|---|---|
0 | Jacob | King | the following is my email address EMAILADDRS | this is my credit card: CREDITCARD | CREDITCARD | 100 | 1 | 24 |
1 | Chloe | Lavoie | the following is my email address EMAILADDRS | this is my credit card: CREDITCARD | CREDITCARD | 200 | 0 | 36 |
2 | Myles | Clark | EMAILADDRS | this is my credit card: CREDITCARD | CREDITCARD | 300 | 1 | 23 |
3 | Daniel | Murray | EMAILADDRS | this is my credit card: CREDITCARD | CREDITCARD | 400 | 0 | 28 |
4 | Lucy | Landry | EMAILADDRS | this is my credit card: CREDITCARD | CREDITCARD | 500 | 1 | 37 |
5 | Austin | Cote | EMAILADDRS | this is my credit card: CREDITCARD | CREDITCARD | 600 | 1 | 31 |
6 | Leo | Leblanc | EMAILADDRS | this is my credit card: CREDITCARD | this is my master card number: CREDITCARD | 700 | 0 | 41 |
7 | Luke | Cote | EMAILADDRS | CREDITCARD | CREDITCARD | 800 | 1 | 43 |
8 | Chloe | Martin | EMAILADDRS | CREDITCARD | CREDITCARD | 900 | 0 | 58 |
9 | Sophia | Taylor | EMAILADDRS | CREDITCARD | CREDITCARD | 1000 | 1 | 67 |
10 | Sebastian | Li | EMAILADDRS | CREDITCARD | CREDITCARD | 1100 | 0 | 25 |
11 | Theodore | Walker | EMAILADDRS | CREDITCARD | CREDITCARD | 1200 | 1 | 29 |
12 | Grayson | Moore | EMAILADDRS | CREDITCARD | CREDITCARD | 1300 | 0 | 38 |
13 | Madelyn | Ross | EMAILADDRS | CREDITCARD | CREDITCARD | 1400 | 1 | 64 |
14 | Charlie | Johnson | EMAILADDRS | CREDITCARD | CREDITCARD | 1500 | 0 | 66 |
15 | Isaac | Davis | EMAILADDRS | CREDITCARD | CREDITCARD | 1600 | 1 | 55 |
16 | Grace | Thomas | EMAILADDRS | CREDITCARD | CREDITCARD | 1700 | 0 | 43 |
17 | Kayden | Thomas | EMAILADDRS | CREDITCARD | CREDITCARD | 1800 | 1 | 48 |
18 | Peyton | Bergeron | EMAILADDRS | CREDITCARD | CREDITCARD | 1900 | 0 | 58 |
19 | Evelyn | Johnston | EMAILADDRS | CREDITCARD | CREDITCARD | 2000 | 0 | 29 |
df_without_pii = pd.read_csv("../tests/data_without_pii.csv")
df_without_pii.head()
| | first_name | last_name | balance | active_member | age |
---|---|---|---|---|---|
0 | Jacob | King | 100 | 1 | 24 |
1 | Chloe | Lavoie | 200 | 0 | 36 |
2 | Myles | Clark | 300 | 1 | 23 |
3 | Daniel | Murray | 400 | 0 | 28 |
4 | Lucy | Landry | 500 | 1 | 37 |
c.clean(df_without_pii)
| | first_name | last_name | balance | active_member | age |
---|---|---|---|---|---|
0 | Jacob | King | 100 | 1 | 24 |
1 | Chloe | Lavoie | 200 | 0 | 36 |
2 | Myles | Clark | 300 | 1 | 23 |
3 | Daniel | Murray | 400 | 0 | 28 |
4 | Lucy | Landry | 500 | 1 | 37 |
5 | Austin | Cote | 600 | 1 | 31 |
6 | Leo | Leblanc | 700 | 0 | 41 |
7 | Luke | Cote | 800 | 1 | 43 |
8 | Chloe | Martin | 900 | 0 | 58 |
9 | Sophia | Taylor | 1000 | 1 | 67 |
10 | Sebastian | Li | 1100 | 0 | 25 |
11 | Theodore | Walker | 1200 | 1 | 29 |
12 | Grayson | Moore | 1300 | 0 | 38 |
13 | Madelyn | Ross | 1400 | 1 | 64 |
14 | Charlie | Johnson | 1500 | 0 | 66 |
15 | Isaac | Davis | 1600 | 1 | 55 |
16 | Grace | Thomas | 1700 | 0 | 43 |
17 | Kayden | Thomas | 1800 | 1 | 48 |
18 | Peyton | Bergeron | 1900 | 0 | 58 |
19 | Evelyn | Johnston | 2000 | 0 | 29 |
Testing with dummy data (hashing)
# Create a Cleanser without the default spotters, then add spotters set up for hashing
c = Cleanser(include_default_spotters=False)
s1 = EmailSpotter("EMAILS", True)
s2 = CreditCardSpotter("CREDITCARDS", True)
c.add_spotter(s1)
c.add_spotter(s2)
c.clean(df_with_pii)
| | first_name | last_name | email_address | visa_cc | master_cc | balance | active_member | age |
---|---|---|---|---|---|---|---|---|
0 | Jacob | King | the following is my email address d3ebf4160b78... | this is my credit card: 49bd7b28d230d310f17685... | c077ea8331f1357b655546c0c6dd030c | 100 | 1 | 24 |
1 | Chloe | Lavoie | the following is my email address ccc33e19aed8... | this is my credit card: c44d39f19fdd5f1b07c2e4... | 1beee7a164fd2c76e8b4e588131d64c9 | 200 | 0 | 36 |
2 | Myles | Clark | 8a7ab41909ffabb0e5bd3e759e926550 | this is my credit card: b31abe94871fbe42c4805a... | ff05ef02f05938f288220125c64d72e3 | 300 | 1 | 23 |
3 | Daniel | Murray | e786525bbe5cdbd0e1b0723580c0be35 | this is my credit card: 1d5378ff80e8dbdf12c3d1... | 8c33e7d6d9f5b8b3e9f68b11416e9fe3 | 400 | 0 | 28 |
4 | Lucy | Landry | 66aa0f296907a96a0a15a64c54a0271c | this is my credit card: f2d2c9b2a8fdfd376aa806... | 2fe45c134eebaf4476d23fea7642824a | 500 | 1 | 37 |
5 | Austin | Cote | 1aac2a4671c819e7ba448156eb2a945d | this is my credit card: e0690ff65144d6ff24fc2c... | 9e3172495cc711edec297ef8a5f095a5 | 600 | 1 | 31 |
6 | Leo | Leblanc | b575f97818252f3ae9de320f38e6a26b | this is my credit card: bee6a0d7b4c91337eef146... | this is my master card number: dfcc41738316c7d... | 700 | 0 | 41 |
7 | Luke | Cote | 726ff593c99ae2734b61f7ba08cc0339 | c68e0b958a085134e22585ac92c08e3c | 6bb66b6d89abbbf0718f66720547ce57 | 800 | 1 | 43 |
8 | Chloe | Martin | 40b148d282dfa8d609ab18207188463a | 22dabc63d739da192ef30a2bbcb06e61 | a8f9b7c5b5e4c4b3f8a4d37d51078cac | 900 | 0 | 58 |
9 | Sophia | Taylor | a6c9c2015492ed280d2da2b94f3f37a4 | 444225bea558baa2a4c006f688853d1f | 6ad1e9a84f5dbd02512391a07c001ed5 | 1000 | 1 | 67 |
10 | Sebastian | Li | 026dec7828d784304966748f8aac14e4 | 10fc94534572f34b028c61523f4605fe | d45be3c831abd474467bdd0b5d9e4dd6 | 1100 | 0 | 25 |
11 | Theodore | Walker | 0c89577b11f335687d51f6769baf809b | b28d6de54411d3884d8b24d55e285004 | 52ccea324a0c1a50a266a33283645042 | 1200 | 1 | 29 |
12 | Grayson | Moore | 4e7ca558fbc639c4cb24f8588c118b3d | ba76eb7fc1be6293e5feba89f5c7639f | 39c34c75de06effa59da842781653c4a | 1300 | 0 | 38 |
13 | Madelyn | Ross | 94cdc647d86ef5a8b54d5ff54b4c35b4 | 211acd46154d7a437dcc03b3ce46e5ce | bdefeaadfd160b9ff82eb33bd2726b1d | 1400 | 1 | 64 |
14 | Charlie | Johnson | 3c235fdb2aa343a247c0f51ceda5eabe | ba97f2ad35d4d9f6fe00ed50e2762327 | 9fe8835d0d95deccf7dfd99c02c9c294 | 1500 | 0 | 66 |
15 | Isaac | Davis | ffd1f67fff8581d26db24b85ce1d479a | 53309f3853ff954ef7ed621b38501e28 | e2737b551ebe7a0f4842f3f11bc2aa87 | 1600 | 1 | 55 |
16 | Grace | Thomas | 8d54befa39a3f13bea178f38a8fc67de | 99a0625ae373ff242d7ed9c76930b836 | 8aba9728ea64663867b50a17c10bf729 | 1700 | 0 | 43 |
17 | Kayden | Thomas | e71770b14ccf5aa8587750c5c5318f4a | 779c725caf15e67c16b59536eaa5b862 | 92610a6913a995c2d9f5e08bfcd6c105 | 1800 | 1 | 48 |
18 | Peyton | Bergeron | b7299528a41c8f5baf74ecc541b7aa4e | 060783327b0e977a61614fa2129a7328 | 68321411ad37a3dccabe7902620ef7d0 | 1900 | 0 | 58 |
19 | Evelyn | Johnston | 95473fc56071e41d16b3b769a07d17ad | a22af5a670e749c4e8529a840088c372 | 83d71d2e14d5de862d8bcd28c23c5417 | 2000 | 0 | 29 |
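The idea of replacing a match with a hash rather than a fixed placeholder can be sketched as follows. The replacement values in the output above are 32 hexadecimal characters long, which is the length of an MD5 hex digest; that sanityze actually uses MD5 is an assumption here, and `EMAIL_PATTERN` and `hash_emails` are hypothetical names, not part of the package.

```python
import hashlib
import re

# Hypothetical, deliberately simple email pattern (an assumption, not sanityze's own).
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def hash_emails(text):
    """Replace each email-like substring with the MD5 hex digest of that substring."""
    def to_digest(match):
        # Assumption: MD5 is used only because its digest length matches the output above.
        return hashlib.md5(match.group(0).encode("utf-8")).hexdigest()

    return EMAIL_PATTERN.sub(to_digest, text)


original = "the following is my email address someone@example.com"
hashed = hash_emails(original)
```

Because the same input always produces the same digest, hashed values can still be used to group or join records that share an address, which a fixed placeholder does not allow.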