Read a data file (txt, csv, Excel, etc.) and handle missing values in its numeric columns.

preprocess(
  path,
  method = NULL,
  fill_value = NULL,
  read_func = readr::read_csv,
  ...
)

Arguments

path

string, path to the data file.

method

c(NULL, 'most_frequent', 'mean', 'median', 'constant'), default=NULL

The imputation method.

If NULL, then missing values are treated as NaN.

If 'mean', then replace missing values using the mean along each column.

If 'median', then replace missing values using the median along each column.

If 'most_frequent', then replace missing values using the most frequent value along each column. If there is more than one such value, only the smallest is used.

If 'constant', then replace missing values with fill_value.
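
The replacement value that each method yields for a single numeric column can be sketched in base R as follows. This is only an illustration of the rules described above, not the package's implementation: impute_value is a hypothetical helper, and the fallback to 0 for 'constant' follows the documented default of fill_value.

```r
# Hypothetical helper (not part of the package): returns the value that
# would replace the missing entries of one numeric column `x`.
impute_value <- function(x, method, fill_value = NULL) {
  switch(
    method,
    mean = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    most_frequent = {
      counts <- table(x)  # table() ignores NA by default
      # On ties, keep the smallest of the most frequent values
      min(as.numeric(names(counts)[counts == max(counts)]))
    },
    constant = if (is.null(fill_value)) 0 else fill_value,
    stop("unknown method: ", method)
  )
}

x <- c(1, 2, 2, 3, 3, NA)
impute_value(x, "mean")           # 2.2
impute_value(x, "median")         # 2
impute_value(x, "most_frequent")  # 2 (2 and 3 tie, the smallest wins)
impute_value(x, "constant")       # 0

# Fill the missing entries with the chosen value
x[is.na(x)] <- impute_value(x, "most_frequent")
```

After the last line, x no longer contains missing values; the former NA is now 2.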

fill_value

NULL or numeric_value, default=NULL

When method='constant', fill_value is used to replace all occurrences of missing values. If left to the default, fill_value will be 0 when imputing numerical data.

read_func

readr::read_* function name, default=readr::read_csv

The function used to read the file at path.

...

Any additional keyword arguments accepted by read_func.
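
How extra arguments might reach the reader function can be sketched as below. This is an assumption about the forwarding pattern, not the package's code: read_with is a hypothetical wrapper, and base R's utils::read.csv stands in for a readr::read_* function so the snippet runs without extra packages.

```r
# Hypothetical wrapper (not part of the package): everything passed in `...`
# is handed on unchanged to the chosen reader function.
read_with <- function(path, read_func = utils::read.csv, ...) {
  read_func(path, ...)
}

# A small semicolon-delimited file with one missing value
path <- tempfile(fileext = ".csv")
writeLines(c("a;b", "1;2", "NA;4"), path)

# `sep` is not an argument of read_with; it is forwarded to read.csv
df <- read_with(path, sep = ";")
```

With this pattern, any option the reader understands (a delimiter, column types, rows to skip) can be set at the call site without the wrapper having to name it.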

Value

a tibble

Examples

preprocess(readr::readr_example("mtcars.csv"))
#> Rows: 32 Columns: 11
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (11): mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 32 × 11
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#> # … with 22 more rows