When learning statistical inference, it is important to

  • define important statistical sampling concepts like population distribution, sample distribution, and sampling distribution
  • explain the similarities and differences between the population distribution, sample distribution and sampling distribution.
  • understand how sample sizes may affect the distribution.

The samplingsimulatorr package aims to make these steps easy by taking care of the coding part, and so you can focus more on the learning part. The samplingsimulatorr package provides the following functions that will:

  • generate a virtual population;
  • draw samples of different sizes;
  • plot the histograms of the sample distribution for each sample size;
  • plot the histograms of the sampling distribution of the mean.

This document introduces to you the basic tools of samplingsimulatorr package and shows you how to use those tools.

Use Case Examples

generate_virtual_pop

To start learning sample distribution and sampling distribution, we need first to generate a virtual population. The generate_virtual_pop helps you generate a group of virtual population with the distribution of your choice. You just need to fill the size of the population you want to generate, the variable name of that population, and the distribution the population comes from. The function would then produce a nice tibble of the virtual population you sepcified.

library(samplingsimulatorr)

# generate population
pop <- generate_virtual_pop(1000, "height", rnorm, 0, 1)
head(pop)
#> # A tibble: 6 x 1
#>   height
#>    <dbl>
#> 1  0.306
#> 2  0.813
#> 3 -3.10 
#> 4 -0.630
#> 5  1.24 
#> 6  0.618

draw_samples

After we have the virtual population, the next thing we need to do is to draw samples from that population. draw_samples function helps you draw samples of different sizes from that population. You can also repeatedly draw the samples of the same sizes multiple times to create a sampling distribution.

# the number of replication for each sample size 
reps <- 100
# the sample sizes for each one of the samples
sample_size <- c(10, 50, 100)
# create samples
samples <- draw_samples(pop, reps, sample_size)
head(samples)
#> # A tibble: 6 x 4
#> # Groups:   replicate [1]
#>   replicate  height  size rep_size
#>       <int>   <dbl> <dbl>    <dbl>
#> 1         1 -1.57      10      100
#> 2         1  0.587     10      100
#> 3         1 -0.217     10      100
#> 4         1  0.0926    10      100
#> 5         1  0.156     10      100
#> 6         1  0.515     10      100

plot_sample_hist

After having the samples, we can then plot the sample histograms for different sample sizes using plot_sample_hist function.

# plot sample histogram
plot_sample_hist(pop, samples, height, sample_size)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

plot_sampling_hist

Since we have drawn the samples of the same size multiple times, we can then plot a nice sampling histogram. The plot_sampling_hist creates a grid of sampling distribution histogram of the mean of different sample sizes.

plot_sampling_hist(samples, height, sample_size)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

stat_summary

Finally, we have both population and samples, the stat_summary creates a summary of the statistical for the arameters of interest.

stat_summary(pop, samples, c('mean', 'sd'))
#>         data         mean        sd
#> 1 population -0.005111638 0.9827272
#> 2    samples -0.021341198 0.9816514