For each value in the column_name column of df, find the element of resolved_names with the highest similarity score, computed with fuzz.WRatio. The comparison is case-sensitive, so "Google" and "google" do not score 100. If the highest score meets the threshold, the value is replaced in place with that resolved name; otherwise it is left unchanged.
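The matching rule described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the library's implementation: it uses `difflib.SequenceMatcher` from the standard library as a stand-in for `fuzz.WRatio` (so the actual scores will differ), it works on a plain list rather than a DataFrame column, and it returns a new list instead of modifying anything in place. The helper names `similarity` and `resolve_values` are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-sensitive similarity on a 0-100 scale (stand-in for fuzz.WRatio)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def resolve_values(values, resolved_names, threshold):
    """Replace each value with its best-scoring resolved name when that
    score reaches the threshold; otherwise keep the value unchanged."""
    if not resolved_names:
        raise ValueError("resolved_names must not be empty")
    if not 0 <= threshold <= 100:
        raise ValueError("threshold must be between 0 and 100")
    result = []
    for value in values:
        # Pick the resolved name with the highest similarity to this value.
        best = max(resolved_names, key=lambda name: similarity(value, name))
        result.append(best if similarity(value, best) >= threshold else value)
    return result


names = ["Google", "Gogle", "Microsoftt", "Apple"]
print(resolve_values(names, ["Google", "Microsoft"], 80))
# prints ['Google', 'Google', 'Microsoft', 'Apple']
```

Because no lowercasing is applied before scoring, `similarity("Google", "google")` stays below 100 in this sketch, which mirrors the case-sensitive behavior described above.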
Parameters
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | pandas.DataFrame | The DataFrame of interest. | required |
| column_name | str | The column to consolidate. It must exist in df and be of string type. | required |
| resolved_names | list | The standard names to which the column's values are mapped. | required |
| threshold | | The minimum similarity score (between 0 and 100) required to replace a value with a resolved name. | required |
Returns
| Name | Type | Description |
| --- | --- | --- |
| | None | Nothing is returned; df is modified in place. |
Raises
| Name | Type | Description |
| --- | --- | --- |
| | ValueError | If column_name is not in df, if resolved_names is empty, or if threshold is below 0 or above 100. |
| | TypeError | If the dtype of df[column_name] is not string. |
Examples
>>> import pandas as pd
>>> from csvplus.resolve_string_value import resolve_string_value
>>> data = pd.DataFrame({
...     "company_name": ["Google", "Google Inc.",
...                      "Gogle", "Microsoftt", "Micro-soft"],
...     "num_searches": [1, 2, 3, 4, 5]
... })
>>> resolve_string_value(data, "company_name", ["Google", "Microsoft"], 80)
>>> print(data)
  company_name  num_searches
0       Google             1
1       Google             2
2       Google             3
3    Microsoft             4
4    Microsoft             5