preprocess.Rd
Preprocess data stored in txt, csv, Excel, or similar files by handling missing values in numeric columns.
preprocess(
path,
method = NULL,
fill_value = NULL,
read_func = readr::read_csv,
...
)
path: string, path to the data file.
method: one of c(NULL, 'most_frequent', 'mean', 'median', 'constant'), default=NULL
The imputation method.
If NULL, then missing values are treated as NaN.
If 'mean', then replace missing values using the mean along each column.
If 'median', then replace missing values using the median along each column.
If 'most_frequent', then replace missing values using the most frequent value along each column. If there is more than one such value, only the smallest is used.
If 'constant', then replace missing values with fill_value.
fill_value: NULL or a numeric value, default=NULL
When method='constant', fill_value is used to replace all occurrences of missing values. If left at the default, fill_value will be 0 when imputing numerical data.
read_func: a readr::read_* function, default=readr::read_csv
...: any additional arguments accepted by read_func.
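The imputation rules listed for method and fill_value can be illustrated on a single numeric vector. The following is only a minimal sketch in base R under stated assumptions: `impute_column` is a hypothetical helper written for this illustration, not a function of the package, and the actual implementation of preprocess may differ.

```r
# Hypothetical helper (not part of the package): imputes one numeric vector
# following the rules described for `method` and `fill_value`.
impute_column <- function(x, method = NULL, fill_value = NULL) {
  if (is.null(method)) {
    return(x)  # no imputation: missing values are left as they are
  }
  observed <- x[!is.na(x)]
  replacement <- switch(
    method,
    mean = mean(observed),
    median = stats::median(observed),
    most_frequent = {
      counts <- table(observed)
      # keep every value that reaches the top count, then take the smallest
      min(as.numeric(names(counts)[counts == max(counts)]))
    },
    # assumption taken from the text: a NULL fill_value falls back to 0
    constant = if (is.null(fill_value)) 0 else fill_value,
    stop("unknown method: ", method)
  )
  x[is.na(x)] <- replacement
  x
}

x <- c(1, 2, 2, 3, 3, NA)
impute_column(x, "mean")           # NA replaced by 2.2
impute_column(x, "most_frequent")  # 2 and 3 tie, so NA is replaced by 2
impute_column(x, "constant")       # NA replaced by 0
```

The tie in the 'most_frequent' call shows the "smallest value" rule: both 2 and 3 occur twice, and the smaller one is used.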
a tibble
preprocess(readr::readr_example("mtcars.csv"))
#> Rows: 32 Columns: 11
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (11): mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 32 × 11
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#> # … with 22 more rows
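The usage signature suggests that the extra arguments in `...` are handed to the reader given as read_func. The sketch below shows that forwarding pattern in isolation. It is an assumption-laden illustration: `read_with` is a hypothetical wrapper, and base R's `utils::read.csv` stands in for the readr default so the snippet runs without extra packages.

```r
# Hypothetical wrapper (not part of the package), shown only to illustrate
# how a reader function and its extra arguments can be passed through.
read_with <- function(path, read_func = utils::read.csv, ...) {
  read_func(path, ...)  # everything in `...` goes to the reader unchanged
}

# Write a small semicolon-separated file with one empty numeric field.
tmp <- tempfile(fileext = ".csv")
writeLines(c("a;b", "1;2", "3;"), tmp)

# `sep` is not an argument of read_with; it is forwarded to read.csv.
df <- read_with(tmp, sep = ";")

# Swapping the reader works the same way; read.csv2 uses ";" by default.
df2 <- read_with(tmp, read_func = utils::read.csv2)
```

With the package itself, the same idea would presumably apply to any reader passed as read_func, with its own arguments supplied through `...`.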