data_correction

A module that replaces the values in a column with their resolved names. Requires pandas and rapidfuzz.

Functions

Name Description
resolve_string_value For each value in the column_name of the df, find the most similar element in resolved_names and replace the value in place if the similarity score meets the threshold.

resolve_string_value

data_correction.resolve_string_value(df, column_name, resolved_names, threshold)

For each value in the column_name of the df, find the element in resolved_names with the highest similarity score, computed with fuzz.WRatio. The comparison is case sensitive, meaning that “Google” and “google” will not have a score of 100. The similarity score is then compared with the threshold to decide whether to replace the value with that resolved name. The replacement is applied in place.

Parameters

Name Type Description Default
df pandas.DataFrame The DataFrame of interest. required
column_name str The column to consolidate. The column must exist in df and be of type string. required
resolved_names list A list of standard names to which the column’s values are mapped. required
threshold The minimum similarity score (between 0 and 100) required to replace a value with a resolved name. required

Returns

Name Type Description
None

Raises

Name Type Description
ValueError If column_name is not in df, if resolved_names is empty, or if threshold is below 0 or above 100.
TypeError If the dtype of df[column_name] is not string.
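The checks listed above could be written roughly as below. This is a sketch under stated assumptions, not the module's own code: the helper name validate_inputs, the order of the checks, the error messages, and the use of pandas.api.types.is_string_dtype for the dtype test are all illustrative choices.

```python
import pandas as pd
from pandas.api.types import is_string_dtype


def validate_inputs(df, column_name, resolved_names, threshold):
    """Hypothetical helper mirroring the documented error conditions."""
    if column_name not in df.columns:
        raise ValueError(f"{column_name!r} is not a column of df")
    if len(resolved_names) == 0:
        raise ValueError("resolved_names must not be empty")
    if not 0 <= threshold <= 100:
        raise ValueError("threshold must be between 0 and 100")
    # Assumption: "type string" is checked with pandas' own dtype helper.
    if not is_string_dtype(df[column_name]):
        raise TypeError(f"column {column_name!r} must have a string dtype")


def raised(exc_type, *args):
    """Return True if validate_inputs(*args) raises exc_type."""
    try:
        validate_inputs(*args)
    except exc_type:
        return True
    return False


df = pd.DataFrame({"company_name": ["Gogle"], "num_searches": [1]})
```

With this DataFrame, a valid call passes silently, while a missing column, an empty list, or an out-of-range threshold raises ValueError, and the integer column num_searches raises TypeError.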

Examples

>>> import pandas as pd
>>> from csvplus.data_correction import resolve_string_value
>>> data = pd.DataFrame({
...     "company_name": ["Google", "Google Inc.",
...     "Gogle", "Microsoftt", "Micro-soft"],
...     "num_searches": [1, 2, 3, 4, 5]
... })
>>> resolve_string_value(data, "company_name", ["Google", "Microsoft"], 80)
>>> print(data)
   company_name  num_searches
0        Google             1
1        Google             2
2        Google             3
3     Microsoft             4
4     Microsoft             5