summary_suggestions.Rd
Takes a data frame and returns a nested list of three elements. The first element holds descriptive statistics for the numeric variables, the second holds summary data for the categorical variables, and the third holds the count and proportion of distinct values in each categorical column. The third element can be used to decide which categorical variables to drop because their proportion of unique values is high relative to the input threshold.
summary_suggestions(df, threshold = 0.8)
df: The data frame on which the function will operate.
threshold: A numeric value that sets the threshold for the proportion of unique values (default 0.8).
list
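The threshold check described above can be illustrated with a minimal base R sketch. It is not the package's implementation: `unique_proportions()` and the `toy` data frame are hypothetical names introduced for illustration, and two details are assumptions, namely that the proportion is computed against the total number of rows and that a column is flagged when its proportion is strictly greater than the threshold.

```r
# Hypothetical helper, not part of the package: count and proportion of
# distinct values for each categorical (character or factor) column.
unique_proportions <- function(df) {
  # Identify categorical columns
  is_cat <- vapply(df, function(col) is.character(col) || is.factor(col), logical(1))
  cat_df <- df[, is_cat, drop = FALSE]

  # Count distinct non-missing values per categorical column
  n_distinct <- vapply(cat_df, function(col) length(unique(col[!is.na(col)])), integer(1))

  data.frame(
    variable = names(cat_df),
    distinct = unname(n_distinct),
    # Assumption: the denominator is the total number of rows
    proportion = unname(n_distinct) / nrow(df),
    stringsAsFactors = FALSE
  )
}

# Toy data: `id` is unique in every row, `group` has only two levels
toy <- data.frame(
  id = c("a", "b", "c", "d", "e"),
  group = c("x", "x", "y", "y", "y"),
  value = c(1.5, 2.0, 3.1, 4.8, 5.2),
  stringsAsFactors = FALSE
)

props <- unique_proportions(toy)

# Assumption: a column is a drop candidate when its proportion exceeds the threshold
threshold <- 0.8
to_drop <- props$variable[props$proportion > threshold]
print(to_drop)
```

In this toy data only `id` is flagged, since every row holds a different value; `group` has two distinct values across five rows and stays well below the threshold.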
library(palmerpenguins)
summary_suggestions(penguins)
#> [[1]]
#>              bill_length_mm bill_depth_mm flipper_length_mm  body_mass_g
#> nbr.val        3.420000e+02   342.0000000      3.420000e+02 3.420000e+02
#> nbr.null       0.000000e+00     0.0000000      0.000000e+00 0.000000e+00
#> nbr.na         2.000000e+00     2.0000000      2.000000e+00 2.000000e+00
#> min            3.210000e+01    13.1000000      1.720000e+02 2.700000e+03
#> max            5.960000e+01    21.5000000      2.310000e+02 6.300000e+03
#> range          2.750000e+01     8.4000000      5.900000e+01 3.600000e+03
#> sum            1.502130e+04  5865.7000000      6.871300e+04 1.437000e+06
#> median         4.445000e+01    17.3000000      1.970000e+02 4.050000e+03
#> mean           4.392193e+01    17.1511696      2.009152e+02 4.201754e+03
#> SE.mean        2.952205e-01     0.1067846      7.603704e-01 4.336473e+01
#> CI.mean.0.95   5.806825e-01     0.2100394      1.495607e+00 8.529605e+01
#> var            2.980705e+01     3.8998080      1.977318e+02 6.431311e+05
#> std.dev        5.459584e+00     1.9747932      1.406171e+01 8.019545e+02
#> coef.var       1.243020e-01     0.1151404      6.998830e-02 1.908618e-01
#>                      year
#> nbr.val      3.440000e+02
#> nbr.null     0.000000e+00
#> nbr.na       0.000000e+00
#> min          2.007000e+03
#> max          2.009000e+03
#> range        2.000000e+00
#> sum          6.907620e+05
#> median       2.008000e+03
#> mean         2.008029e+03
#> SE.mean      4.412279e-02
#> CI.mean.0.95 8.678531e-02
#> var          6.697064e-01
#> std.dev      8.183559e-01
#> coef.var     4.075419e-04
#>
#> [[2]]
#> dplyr::select_if(df, function(col) is.character(col) | is.factor(col))
#>
#> 3 Variables 344 Observations
#> --------------------------------------------------------------------------------
#> species
#> n missing distinct
#> 344 0 3
#>
#> Value Adelie Chinstrap Gentoo
#> Frequency 152 68 124
#> Proportion 0.442 0.198 0.360
#> --------------------------------------------------------------------------------
#> island
#> n missing distinct
#> 344 0 3
#>
#> Value Biscoe Dream Torgersen
#> Frequency 168 124 52
#> Proportion 0.488 0.360 0.151
#> --------------------------------------------------------------------------------
#> sex
#> n missing distinct
#> 333 11 2
#>
#> Value female male
#> Frequency 165 168
#> Proportion 0.495 0.505
#> --------------------------------------------------------------------------------
#>
#> [[3]]
#> # A tibble: 0 × 0
#>
"summary statistics for numeric variables,
summary statistics for categorical variables,
percentage of unique values for categorical variables,
list of variables with percentage of unique values higher than the threshold"
#> [1] "summary statistics for numeric variables,\nsummary statistics for categorical variables,\npercentage of unique values for categorical variables,\nlist of variables with percentage of unique values higher than the threshold"