This function takes a univariate approach to outlier detection: each column is examined on its own, and a value counts as an outlier when it lies 2 or more standard deviations from that column's mean. For each column that contains outliers, the function records the row indices of those outliers and the total number of outliers in the column.
Note: This function works best for small datasets with unimodal variable distributions.
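The 2-standard-deviation rule described above can be sketched in base R for a single column. This is a minimal illustration and not the package's own implementation: `flag_outliers` and its `cutoff` argument are hypothetical names introduced here, and the sketch assumes the sample standard deviation from `sd()` is the one intended.

```r
# Hypothetical helper (not part of the package): return the positions of
# values lying `cutoff` or more standard deviations from the mean of `x`.
flag_outliers <- function(x, cutoff = 2) {
  # Absolute distance from the mean, in units of the sample standard deviation
  z <- abs(x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
  # which() drops NA/NaN, so missing values and constant columns yield no outliers
  which(z >= cutoff)
}

# Same shape as column 'A' in the example: 29 ones and a single 10 in row 4
a <- c(1, 1, 1, 10, rep(1, 26))

idx <- flag_outliers(a)
idx          # row index of the outlier: 4
length(idx)  # total number of outliers: 1
```

Because a single extreme value inflates both the mean and the standard deviation, a rule like this can miss outliers when several extreme values are present or when the distribution has more than one mode, which is consistent with the note above.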
find_bad_apples(df)
| df | A dataframe containing numeric data |
|---|---|
A dataframe with columns for 'variable' (dataframe column name), 'total_outliers' (number of outliers in the column), and 'indices' (list of row indices with outliers)
df <- data.frame(
  'A' = c(1, 1, 1, 10, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
  'B' = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 10)
)
find_bad_apples(df)
#> # A tibble: 2 x 3
#> # Groups:   variable, total_outliers [2]
#>   variable total_outliers indices
#>   <chr>             <dbl> <list>
#> 1 A                     1 <tibble [1 × 1]>
#> 2 B                     1 <tibble [1 × 1]>