Guide to using Simpute
What is Simpute?
Simpute was designed to be a simple imputation package. If you have data where some values are missing (blank or “N/A”), you can use this package to fill in those gaps and make your data frame complete. It was made to be easy to understand and is aimed at users who are just learning R. If that’s you, please read on to learn how to use it.
To get started, please load our package using the code below.
library(simpute)
Imputation Functions
We have several different types of imputation functions you can use, depending on the kind of data you would like to impute. These are as follows:
1. num_imputer
- This function is used for any column that contains integer or decimal numbers. It will impute any empty (or “N/A”) values with the average value of that column.
2. cat_imputer
- This function is used for any column with categorical values. It will impute any blank values with the most frequent category found in that column.
3. bol_imputer
- This function is for use with a boolean (“TRUE” or “FALSE”) column in your data frame. With the default method argument, ‘mode’, it will impute any blank values with whichever of TRUE or FALSE occurs most often in the column.
4. date_imputer
- This final function is used to impute any blank or empty dates in your data frame. With the default method argument, ‘median’, it will populate blank date cells with the median date.
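To make the strategies above concrete, here is a minimal base R sketch of mean, mode, and median imputation on plain vectors. It only illustrates the ideas and is not simpute's source code; the helper names impute_mean, impute_mode, and impute_median_date are hypothetical, and the package's own handling of ties and types may differ.

```r
# Hypothetical helpers illustrating each strategy; not part of simpute.

# Mean imputation: replace missing numbers with the average of the rest.
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Mode imputation: replace missing values with the most frequent value.
# Works for character and logical vectors; on a tie, the value seen first wins.
impute_mode <- function(x) {
  observed <- x[!is.na(x)]
  values <- unique(observed)
  counts <- tabulate(match(observed, values))
  x[is.na(x)] <- values[which.max(counts)]
  x
}

# Median imputation for dates: replace missing dates with the median date.
impute_median_date <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

impute_mean(c(1, NA, 3))                  # the NA becomes 2
impute_mode(c("a", "b", "b", NA))         # the NA becomes "b"
impute_mode(c(TRUE, FALSE, FALSE, NA))    # the NA becomes FALSE
impute_median_date(as.Date(c("2020-01-01", NA, "2020-01-03", "2020-01-05")))
```

Each helper leaves the observed values untouched and only fills the positions where is.na() is TRUE.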
Function Arguments
All our functions use the same arguments in order to maintain simplicity. Each function takes data and "column", and the bol_imputer and date_imputer functions also take method.
The data entered must be a data frame or tibble. The column name must be supplied as a string and must be found within the data provided; otherwise an error will appear. Lastly, the bol_imputer and date_imputer functions also have a method argument, which defaults to ‘mode’ for bol_imputer and ‘median’ for date_imputer. We hope to add more flexibility and more methods in the future.
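The input requirements above can be pictured as a few simple checks. The sketch below is an assumption about what such validation might look like in base R; check_inputs is a hypothetical helper, and simpute's own checks and error messages may differ.

```r
# Hypothetical helper showing the kind of input checks described above;
# simpute's real checks and error messages may be different.
check_inputs <- function(data, column) {
  # Tibbles are data frames too, so this accepts both.
  if (!is.data.frame(data)) {
    stop("`data` must be a data frame or tibble.")
  }
  if (!is.character(column) || length(column) != 1) {
    stop("`column` must be a single string.")
  }
  if (!column %in% names(data)) {
    stop("Column '", column, "' was not found in `data`.")
  }
  invisible(TRUE)
}

df <- data.frame(Speed = c(NA, 2, 1))
check_inputs(df, "Speed")        # passes silently
# check_inputs(df, "Colour")     # would stop: the column is not in `df`
# check_inputs(df, 1)            # would stop: the column name is not a string
```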
Examples of How to Use It
To try out our functions, we encourage you to follow along with the exercise below. First, we will create a test data frame to use.
test_df <- data.frame(
  'Origin' = c("Canada", "Japan", "Japan", "Japan", "Germany", NA),
  'Speed' = c(NA, 2, 2, 2, 1, 1),
  'OnTheMarket' = c(FALSE, FALSE, NA, TRUE, FALSE, NA),
  'Date' = lubridate::as_date(
    c("4/2/2013", "4/2/2014", "01/29/2023", "4/2/2016", NA, "01/29/2023"),
    format = "%m/%d/%Y"
  )
)
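Before imputing, it can help to see how many values are missing in each column. The sketch below rebuilds the same test data with base R's as.Date (an assumption made only so the snippet runs without lubridate) and counts the NA values per column.

```r
# Same toy data as above, rebuilt with base R so the snippet runs on its own.
test_df <- data.frame(
  Origin = c("Canada", "Japan", "Japan", "Japan", "Germany", NA),
  Speed = c(NA, 2, 2, 2, 1, 1),
  OnTheMarket = c(FALSE, FALSE, NA, TRUE, FALSE, NA),
  Date = as.Date(
    c("4/2/2013", "4/2/2014", "01/29/2023", "4/2/2016", NA, "01/29/2023"),
    format = "%m/%d/%Y"
  )
)

# Number of missing values in each column.
missing_counts <- colSums(is.na(test_df))
print(missing_counts)
```

Here Origin, Speed, and Date each have one missing value and OnTheMarket has two, so every imputer has something to fill.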
We start by applying the categorical imputer to the first column.
impu_cat <- cat_imputer(test_df, "Origin")
print(impu_cat)
#>    Origin Speed OnTheMarket       Date
#> 1  Canada    NA       FALSE 2013-04-02
#> 2   Japan     2       FALSE 2014-04-02
#> 3   Japan     2          NA 2023-01-29
#> 4   Japan     2        TRUE 2016-04-02
#> 5 Germany     1       FALSE       <NA>
#> 6   Japan     1          NA 2023-01-29
Next, we will apply the numeric imputer by running the following and printing the output.
impu_num <- num_imputer(test_df, "Speed")
print(impu_num)
#>    Origin Speed OnTheMarket       Date
#> 1  Canada     2       FALSE 2013-04-02
#> 2   Japan     2       FALSE 2014-04-02
#> 3   Japan     2          NA 2023-01-29
#> 4   Japan     2        TRUE 2016-04-02
#> 5 Germany     1       FALSE       <NA>
#> 6    <NA>     1          NA 2023-01-29
Test the boolean imputer by running the following and printing the output.
impu_bol <- bol_imputer(test_df, 'OnTheMarket')
print(impu_bol)
#>    Origin Speed OnTheMarket       Date
#> 1  Canada    NA       FALSE 2013-04-02
#> 2   Japan     2       FALSE 2014-04-02
#> 3   Japan     2        TRUE 2023-01-29
#> 4   Japan     2        TRUE 2016-04-02
#> 5 Germany     1       FALSE       <NA>
#> 6    <NA>     1        TRUE 2023-01-29
Last but not least, test out the date imputer. Please note that for the date imputer to work, the column must already be in a date format, so make sure it is not stored as a string.
impu_dat <- date_imputer(test_df, "Date")
print(impu_dat)
#>    Origin Speed OnTheMarket       Date
#> 1  Canada    NA       FALSE 2013-04-02
#> 2   Japan     2       FALSE 2014-04-02
#> 3   Japan     2          NA 2023-01-29
#> 4   Japan     2        TRUE 2016-04-02
#> 5 Germany     1       FALSE 2023-01-29
#> 6    <NA>     1          NA 2023-01-29
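If your dates were read in as strings, they need to be converted before calling date_imputer. The sketch below shows one way to do this with base R's as.Date; raw_df is a hypothetical data frame made up for this illustration, and the format string must match how your own dates are written.

```r
# Hypothetical data where the dates were read in as plain strings.
raw_df <- data.frame(
  Date = c("4/2/2013", NA, "01/29/2023"),
  stringsAsFactors = FALSE
)
class(raw_df$Date)   # "character"

# Convert to Date, stating the format the strings are written in
# (month/day/four-digit year here).
raw_df$Date <- as.Date(raw_df$Date, format = "%m/%d/%Y")
class(raw_df$Date)   # "Date"

# Missing entries stay NA after conversion, ready to be imputed,
# for example with date_imputer(raw_df, "Date").
```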