Takes a data frame and returns a list of three elements. The first element holds descriptive statistics for the numeric variables, the second holds summary data for the categorical variables, and the third holds the count and proportion of distinct values in each categorical column. The third element can be used to decide which categorical variables to drop because their proportion of unique values is high relative to the input threshold.

summary_suggestions(df, threshold = 0.8)

Arguments

df

The dataframe on which the function will operate

threshold

A numeric value that sets the threshold for the proportion of unique values in a categorical column. Defaults to 0.8.

Value

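A list of three elements: descriptive statistics for the numeric variables, summary data for the categorical variables, and the count and proportion of distinct values in each categorical column.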

Examples

library(palmerpenguins)
summary_suggestions(penguins)
#> [[1]]
#>              bill_length_mm bill_depth_mm flipper_length_mm  body_mass_g
#> nbr.val        3.420000e+02   342.0000000      3.420000e+02 3.420000e+02
#> nbr.null       0.000000e+00     0.0000000      0.000000e+00 0.000000e+00
#> nbr.na         2.000000e+00     2.0000000      2.000000e+00 2.000000e+00
#> min            3.210000e+01    13.1000000      1.720000e+02 2.700000e+03
#> max            5.960000e+01    21.5000000      2.310000e+02 6.300000e+03
#> range          2.750000e+01     8.4000000      5.900000e+01 3.600000e+03
#> sum            1.502130e+04  5865.7000000      6.871300e+04 1.437000e+06
#> median         4.445000e+01    17.3000000      1.970000e+02 4.050000e+03
#> mean           4.392193e+01    17.1511696      2.009152e+02 4.201754e+03
#> SE.mean        2.952205e-01     0.1067846      7.603704e-01 4.336473e+01
#> CI.mean.0.95   5.806825e-01     0.2100394      1.495607e+00 8.529605e+01
#> var            2.980705e+01     3.8998080      1.977318e+02 6.431311e+05
#> std.dev        5.459584e+00     1.9747932      1.406171e+01 8.019545e+02
#> coef.var       1.243020e-01     0.1151404      6.998830e-02 1.908618e-01
#>                      year
#> nbr.val      3.440000e+02
#> nbr.null     0.000000e+00
#> nbr.na       0.000000e+00
#> min          2.007000e+03
#> max          2.009000e+03
#> range        2.000000e+00
#> sum          6.907620e+05
#> median       2.008000e+03
#> mean         2.008029e+03
#> SE.mean      4.412279e-02
#> CI.mean.0.95 8.678531e-02
#> var          6.697064e-01
#> std.dev      8.183559e-01
#> coef.var     4.075419e-04
#> 
#> [[2]]
#> dplyr::select_if(df, function(col) is.character(col) | is.factor(col)) 
#> 
#>  3  Variables      344  Observations
#> --------------------------------------------------------------------------------
#> species 
#>        n  missing distinct 
#>      344        0        3 
#>                                         
#> Value         Adelie Chinstrap    Gentoo
#> Frequency        152        68       124
#> Proportion     0.442     0.198     0.360
#> --------------------------------------------------------------------------------
#> island 
#>        n  missing distinct 
#>      344        0        3 
#>                                         
#> Value         Biscoe     Dream Torgersen
#> Frequency        168       124        52
#> Proportion     0.488     0.360     0.151
#> --------------------------------------------------------------------------------
#> sex 
#>        n  missing distinct 
#>      333       11        2 
#>                         
#> Value      female   male
#> Frequency     165    168
#> Proportion  0.495  0.505
#> --------------------------------------------------------------------------------
#> 
#> [[3]]
#> # A tibble: 0 × 0
#> 

"summary statistics for numeric variables,
summary statistics for categorical variables,
percentage of unique values for categorical variables,
list of variables with percentage of unique values higher than the threshold"
#> [1] "summary statistics for numeric variables,\nsummary statistics for categorical variables,\npercentage of unique values for categorical variables,\nlist of variables with percentage of unique values higher than the threshold"
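
The penguins data has no categorical column with a high proportion of unique values, so the third element above is empty. The following base R sketch illustrates the idea behind the threshold on a toy data frame. It is not the package's implementation: `unique_proportions()` and `toy` are hypothetical names introduced here, the comparison is assumed to be strictly greater than the threshold, and missing values are counted as a distinct value for simplicity.

```r
# Hypothetical helper (not part of the package): proportion of distinct
# values in each character or factor column of a data frame.
unique_proportions <- function(df) {
  is_cat <- vapply(
    df,
    function(col) is.character(col) || is.factor(col),
    logical(1)
  )
  vapply(
    df[is_cat],
    function(col) length(unique(col)) / length(col),
    numeric(1)
  )
}

# Toy data: `id` is unique in every row, `group` has only two levels,
# and `value` is numeric, so it is ignored by the helper.
toy <- data.frame(
  id = c("a1", "b2", "c3", "d4", "e5"),
  group = factor(c("x", "y", "x", "y", "x")),
  value = c(1.5, 2.0, 3.2, 4.8, 5.1),
  stringsAsFactors = FALSE
)

props <- unique_proportions(toy)
threshold <- 0.8

# Columns whose proportion of unique values exceeds the threshold
# (assumed strict comparison) are candidates for dropping.
names(props)[props > threshold]
#> [1] "id"
```

Here `id` has a proportion of 1 (five distinct values in five rows) and `group` has 0.4, so only `id` is flagged at the default threshold of 0.8.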