Tools to make EDA easier!

About

This package aims to make exploratory data analysis (EDA) more efficient. We found that taking a first look at a data set involves a lot of repetitive work. To avoid repeating the same procedures, our team developed a toolkit that includes the following functions:

  1. Clean the data and replace missing values using the preferred method.
  2. Describe the data, for example the distribution of each column.
  3. Automatically plot the correlations between numeric columns.
  4. Combine the plots and make them suitable for a report.
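
As a rough illustration of the numbers behind item 3, the sketch below computes a correlation matrix for the numeric columns of the built-in iris data set using base R only. It is not EDAhelperR code; the variable names (dat, is_num, numeric_dat, corr) are chosen here for illustration.

```r
# Base R sketch (not EDAhelperR code): the correlation matrix that a
# correlation plot of numeric columns would visualise.
dat <- iris

# Keep only the numeric columns; Species is a factor and is dropped.
is_num <- vapply(dat, is.numeric, logical(1))
numeric_dat <- dat[, is_num, drop = FALSE]

# Pearson correlations, using pairwise complete observations so that a
# missing value in one column does not discard the whole row.
corr <- cor(numeric_dat, use = "pairwise.complete.obs")
print(round(corr, 2))
```

Selecting numeric columns first matters because cor() raises an error when it is given a factor or character column.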

Contributors

  • Rowan Sivanandam
  • Steven Leung
  • Vera Cui
  • Jennifer Hoang

Feature specifications

  1. preprocess(path, method=NULL, fill_value=NULL, read_func=readr::read_csv, ...) :
    Preprocesses data from a txt or csv file by handling missing values. Five values are available for method: NULL, 'most_frequent', 'mean', 'median', and 'constant'. Returns the processed data as a tibble.
  2. column_stats(data, columns) :
    Computes summary statistics for the given column(s), including count, mean, median, mode, Q1, Q3, variance, standard deviation, correlation, and covariance. Returns the results as a tibble.
  3. numeric_plots(df) :
    Generates a scatter plot matrix of the numeric features for EDA. Returns a GGally plot object containing the scatter plot matrix.
  4. plot_histogram(data, columns = "all", num_bins = 30) :
    Creates histograms of the numerical features in a data frame using ggplot2. Returns a ggplot object.
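
To make the imputation options listed for preprocess more concrete, here is a minimal base R sketch of what each strategy does to a single vector. The helper impute_column is hypothetical and written only for this illustration. It is not the package's implementation, and preprocess may handle ties, types, and edge cases differently.

```r
# Hypothetical helper, NOT part of EDAhelperR: fill the NAs of one vector.
impute_column <- function(x,
                          method = c("mean", "median", "most_frequent", "constant"),
                          fill_value = NULL) {
  method <- match.arg(method)
  if (!anyNA(x)) {
    return(x)  # nothing to impute
  }

  observed <- x[!is.na(x)]
  if (length(observed) == 0 && method != "constant") {
    stop("cannot impute: the vector has no observed values")
  }

  replacement <- switch(method,
    mean = mean(observed),      # numeric vectors only
    median = median(observed),  # numeric vectors only
    most_frequent = {
      # Count each distinct value; ties go to the value seen first.
      values <- unique(observed)
      values[which.max(tabulate(match(observed, values)))]
    },
    constant = {
      if (is.null(fill_value)) {
        stop("fill_value is required when method = 'constant'")
      }
      fill_value
    }
  )

  x[is.na(x)] <- replacement
  x
}

x <- c(2, NA, 4, 4, NA, 10)
impute_column(x, "mean")
impute_column(x, "median")
impute_column(x, "most_frequent")
impute_column(x, "constant", fill_value = 0)
```

The mean and median strategies only make sense for numeric columns, while the most frequent value and a constant also work for character columns.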

EDA is not a new topic for data scientists, and quite a few packages on CRAN do similar work. However, most of them cover only a narrow set of functions, such as descriptive statistics. Our proposal is more of an all-in-one toolkit for EDA. Below is a list of related projects.

  • brinton
    A Graphical EDA Tool
  • correlationfunnel
    Speed Up Exploratory Data Analysis (EDA) with the Correlation Funnel
  • ezEDA
    Task Oriented Interface for Exploratory Data Analysis

Installation

You can install EDAhelperR from this repo at the R console (this requires the devtools package):

devtools::install_github('UBC-MDS/EDAhelperR')

Usage

Example usage:

library(EDAhelperR)

preprocess(readr::readr_example("mtcars.csv"))

column_stats(iris, c('Sepal.Length', 'Sepal.Width', 'Petal.Length'))

numeric_plots(iris)

plot_histogram(mtcars)
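
One way to sanity-check the output of the column_stats call above is to compute a few of the same statistics with base R. The sketch below does this for the same three iris columns. It is independent of EDAhelperR, and the layout of the stats data frame is an assumption made here, not the format the package returns.

```r
# Base R cross-check (not EDAhelperR code) for the columns used above.
cols <- c("Sepal.Length", "Sepal.Width", "Petal.Length")
x <- iris[, cols]

# One row per column; NAs are ignored in every statistic.
stats <- data.frame(
  column   = cols,
  count    = vapply(x, function(v) sum(!is.na(v)), integer(1)),
  mean     = vapply(x, function(v) mean(v, na.rm = TRUE), numeric(1)),
  median   = vapply(x, function(v) median(v, na.rm = TRUE), numeric(1)),
  q1       = vapply(x, function(v) unname(quantile(v, 0.25, na.rm = TRUE)), numeric(1)),
  q3       = vapply(x, function(v) unname(quantile(v, 0.75, na.rm = TRUE)), numeric(1)),
  variance = vapply(x, function(v) var(v, na.rm = TRUE), numeric(1)),
  sd       = vapply(x, function(v) sd(v, na.rm = TRUE), numeric(1)),
  row.names = NULL
)
print(stats)

# Pairwise correlation and covariance between the selected columns.
correlation <- cor(x, use = "pairwise.complete.obs")
covariance  <- cov(x, use = "pairwise.complete.obs")
```

Note that quantile() uses its default interpolation method (type 7) here, so quartiles computed by other tools may differ slightly.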

Contributing

Interested in contributing? Check out the contributing guidelines. Please note that this project is released with a Code of Conduct. By contributing to this project, you agree to abide by its terms.

License

EDAhelperR was created by Rowan Sivanandam, Steven Leung, Vera Cui, and Jennifer Hoang. It is licensed under the terms of the MIT license.

Credits

EDAhelperR was created with usethis.