library(Kmeans)
#> 
#> Attaching package: 'Kmeans'
#> The following object is masked from 'package:stats':
#> 
#>     predict
library(tidyverse)
#> ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
#> ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
#> ✓ tibble  2.1.3     ✓ dplyr   0.8.3
#> ✓ tidyr   1.0.0     ✓ stringr 1.4.0
#> ✓ readr   1.3.1     ✓ forcats 0.4.0
#> ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag()    masks stats::lag()

Introduction to the Kmeans Package

This document introduces you to the Kmeans package. It is distinct from the kmeans() function available in base R (in the stats package). The package was created as part of a course project to learn the fundamentals of collaborative software development. It implements the K-Means clustering algorithm, works on any dataset with valid numerical features, and includes fit, predict, and cluster_summary functions, as well as elbow and silhouette methods for optimizing the hyperparameter “k”.

Data

To explore the Kmeans package, we will use a randomly generated dataset with 3 clusters. Let’s explore how to cluster this dataset using the Kmeans package.

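# Arbitrary seed so the random data is reproducible
set.seed(123)

# Three groups of 50 draws with means 1, 2, and 3 (standard deviation 0.2)
x1 <- rnorm(50, 1, 0.2)
x2 <- rnorm(50, 2, 0.2)
x3 <- rnorm(50, 3, 0.2)
x <- c(x1, x2, x3)
# y reuses x2 and x1, so the groups are centered near (1, 2), (2, 1), and (3, 1)
y <- c(x2, x1, x1)

X_train <- data.frame(x, y)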

Evaluate different numbers of clusters using the elbow() function

The inertia plot has a sharp bend at k = 3, suggesting that the optimal number of clusters in our dataset is 3.
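
To make the idea behind an elbow plot concrete, here is a minimal sketch that computes the inertia (total within-cluster sum of squares) for several values of k. It uses base R's stats::kmeans() as a stand-in and does not call this package's elbow() function, whose arguments are not shown in this document. The seed, the range of k, and nstart = 10 are assumptions chosen for illustration.

```r
# Sketch only: base R's stats::kmeans() stands in for the package's elbow().
set.seed(42)  # assumed seed, for reproducibility

# Same data layout as in the Data section
x1 <- rnorm(50, 1, 0.2)
x2 <- rnorm(50, 2, 0.2)
x3 <- rnorm(50, 3, 0.2)
X_train <- data.frame(x = c(x1, x2, x3), y = c(x2, x1, x1))

# Inertia for each candidate k (assumed range 1 to 6)
ks <- 1:6
inertia <- sapply(ks, function(k) {
  stats::kmeans(X_train, centers = k, nstart = 10)$tot.withinss
})

# The "elbow" is the k after which adding clusters reduces inertia only slightly
plot(ks, inertia, type = "b",
     xlab = "Number of clusters k",
     ylab = "Inertia (total within-cluster sum of squares)")
```

Inertia always tends to fall as k grows, so the useful signal is where the curve flattens, not where it reaches its minimum.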

Evaluate different numbers of clusters using the silhouette() function

As we can see, the silhouette score is highest at k = 3, which suggests that 3 is the optimal number of clusters in our dataset.
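
The silhouette comparison can be sketched in the same way. The example below is an illustration under stated assumptions, not this package's silhouette() function: it clusters with base R's stats::kmeans() and scores each k with cluster::silhouette() from the recommended cluster package. The seed, the range of k, and nstart = 10 are again assumptions.

```r
# Sketch only: stats::kmeans() plus cluster::silhouette(), not the package's own API.
library(cluster)  # recommended package that ships with standard R installations

set.seed(42)  # assumed seed, for reproducibility

# Same data layout as in the Data section
x1 <- rnorm(50, 1, 0.2)
x2 <- rnorm(50, 2, 0.2)
x3 <- rnorm(50, 3, 0.2)
X_train <- data.frame(x = c(x1, x2, x3), y = c(x2, x1, x1))

# Pairwise Euclidean distances, computed once and reused for every k
d <- dist(X_train)

# Average silhouette width for each candidate k (undefined for k = 1)
ks <- 2:6
avg_sil <- sapply(ks, function(k) {
  labels <- stats::kmeans(X_train, centers = k, nstart = 10)$cluster
  mean(cluster::silhouette(labels, d)[, "sil_width"])
})

# The candidate k with the highest average silhouette width
best_k <- ks[which.max(avg_sil)]
```

Silhouette widths range from -1 to 1, and values closer to 1 indicate points that sit well inside their own cluster and far from neighboring ones.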