Lecture 02: Clustering class demo#

To run the code below, you need to install pytorch, torchvision, and sentence-transformers in the course conda environment.

conda install pytorch torchvision -c pytorch
pip install sentence-transformers

import numpy as np
import pandas as pd
import sys
import os
import torch, torchvision
from torchvision import datasets, models, transforms, utils
from PIL import Image
import matplotlib.pyplot as plt
sys.path.append(os.path.join(os.path.abspath(".."), "code"))
from plotting_functions import *

Hierarchical clustering on recipes dataset#

Let’s create a dendrogram on a more realistic dataset.

We’ll use Kaggle’s Food.com recipes corpus, which contains 180K+ recipes and 700K+ recipe reviews. In this demo, we’ll focus only on the recipes, which are in RAW_recipes.csv, and not on the reviews. Our goal is to find categories or groupings of recipes in this corpus based on their names.

orig_recipes_df = pd.read_csv("../data/RAW_recipes.csv")
orig_recipes_df.shape
(231637, 12)
def get_recipes_sample(orig_recipes_df, n_tags=300, min_len=5):
    orig_recipes_df = orig_recipes_df.dropna()  # Remove rows with NaNs.    
    orig_recipes_df = orig_recipes_df.drop_duplicates(
        "name"
    )  # Remove rows with duplicate names.
    orig_recipes_df = orig_recipes_df[orig_recipes_df["name"].apply(len) >= min_len] # Remove rows where recipe names are too short (< 5 characters).
    first_n = orig_recipes_df["tags"].value_counts()[0:n_tags].index.tolist() # Only consider the rows where tags are one of the most frequent n tags.
    recipes_df = orig_recipes_df[orig_recipes_df["tags"].isin(first_n)]
    return recipes_df
recipes_df = get_recipes_sample(orig_recipes_df)
recipes_df.shape
(9100, 12)
recipes_df["name"]
42            i yam what i yam  muffins
101             to your health  muffins
129       250 00 chocolate chip cookies
138                       lplermagronen
163             california roll   salad
                      ...              
231430      zucchini wheat germ cookies
231514         zucchini blueberry bread
231547           zucchini salsa burgers
231596                    zuppa toscana
231629                     zydeco salad
Name: name, Length: 9100, dtype: object
from sentence_transformers import SentenceTransformer
embedder = SentenceTransformer('all-MiniLM-L6-v2')
recipe_names = recipes_df["name"].tolist()
embeddings = embedder.encode(recipe_names)
recipe_names_embeddings = pd.DataFrame(
    embeddings,
    index=recipes_df.index,
)
recipe_names_embeddings
9100 rows × 384 columns (one 384-dimensional embedding vector per recipe name; numeric output truncated)
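Each recipe name is now represented by a 384-dimensional vector. As a quick sanity check (a minimal sketch; the three names below are made-up examples, not necessarily from the corpus), semantically similar names should get a high cosine similarity, which is why we pass metric="cosine" to linkage below.

from sklearn.metrics.pairwise import cosine_similarity

sample = embedder.encode(
    ["chocolate chip cookies", "chocolate chunk cookies", "zuppa toscana"]
)
# The two cookie recipes should be much more similar to each other than to the soup.
cosine_similarity(sample).round(3)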



def plot_dendrogram(Z, labels, w=15, h=20, p=10):
    fig, ax = plt.subplots(figsize=(w, h))
    dendrogram(
        Z,
        p=p,
        truncate_mode="level",
        orientation="left",
        leaf_font_size=12,
        labels=labels,
        ax=ax,
    )
    ax.set_xlabel("Cluster distances", fontsize=w)
from scipy.cluster.hierarchy import dendrogram, linkage

recipe_names = recipes_df["name"].tolist()
Z = linkage(embeddings, method="complete", metric="cosine")
plot_dendrogram(Z, labels=recipe_names, w=15, h=230, p=10)
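The dendrogram is hard to digest at this scale. Another way to inspect the groupings (a quick sketch; cutting the tree into 20 flat clusters is an arbitrary choice) is to use fcluster and look at a few recipe names per cluster:

from scipy.cluster.hierarchy import fcluster

cluster_labels = fcluster(Z, 20, criterion="maxclust")  # cut the tree into 20 flat clusters
recipes_df.assign(cluster=cluster_labels).groupby("cluster")["name"].apply(
    lambda names: names.head(5).tolist()  # a few example recipe names per cluster
)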





Let’s cluster images!!#

For this demo, I’m going to use the following image datasets:

  1. A tiny subset of Food-101 from last lecture (available here).

  2. A small subset of Human Faces dataset (available here).

Let’s start with the small subset of the food dataset. You can experiment with a bigger dataset if you like.

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
import random
def set_seed(seed=42):
    torch.manual_seed(seed)
    np.random.seed(seed)
    random.seed(seed)
set_seed(seed=42)
import glob
IMAGE_SIZE = 224
def read_img_dataset(data_dir, BATCH_SIZE):     
    data_transforms = transforms.Compose(
        [
            transforms.Resize((IMAGE_SIZE, IMAGE_SIZE)),     
            transforms.ToTensor(),
            transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]),            
        ])
               
    image_dataset = datasets.ImageFolder(root=data_dir, transform=data_transforms)
    dataloader = torch.utils.data.DataLoader(
         image_dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=0
    )
    # Since batch_size equals the dataset size, one batch contains every image.
    inputs, classes = next(iter(dataloader))
    return inputs, classes
def plot_sample_imgs(inputs):
    plt.figure(figsize=(10, 70)); plt.axis("off"); plt.title("Sample Training Images")
    plt.imshow(np.transpose(utils.make_grid(inputs, padding=1, normalize=True),(1, 2, 0)));
def get_features(model, inputs):
    """Extract the model's output features for a batch of inputs."""
    model.eval()
    with torch.no_grad():  # disable gradient tracking for inference
        Z = model(inputs).detach().numpy()
    return Z
densenet = models.densenet121(weights="DenseNet121_Weights.IMAGENET1K_V1")
densenet.classifier = torch.nn.Identity()  # remove that last "classification" layer
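With the classifier replaced by Identity, the model now outputs the 1024-dimensional pooled features of its last layer instead of ImageNet logits. A quick sanity check on a random input tensor (just a shape check, not part of the demo data):

densenet.eval()
with torch.no_grad():
    print(densenet(torch.randn(1, 3, IMAGE_SIZE, IMAGE_SIZE)).shape)  # torch.Size([1, 1024])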
data_dir = "../data/food"
file_names = [image_file for image_file in glob.glob(data_dir + "/*/*.jpg")]
n_images = len(file_names)
BATCH_SIZE = n_images  # because our dataset is quite small
food_inputs, food_classes = read_img_dataset(data_dir, BATCH_SIZE)
n_images
350
X_food = food_inputs.numpy()
plot_sample_imgs(food_inputs[0:24,:,:,:])
Z_food = get_features(
    densenet, food_inputs, 
)
Z_food.shape
(350, 1024)
from sklearn.cluster import KMeans

k = 7
km = KMeans(n_clusters=k, n_init='auto', random_state=123)
km.fit(Z_food)
KMeans(n_clusters=7, random_state=123)
for cluster in range(k):
    get_cluster_images(km, Z_food, X_food, cluster, n_img=6)
82
Image indices:  [ 82 334 265  35 114 343]
276
Image indices:  [276 191 186  29 265 343]
161
Image indices:  [161 102 190 241  56  76]
76
Image indices:  [ 76  39 259 295 313 138]
124
Image indices:  [124 253  25  27 223 280]
214
Image indices:  [214  20 141 249 203 274]
48
Image indices:  [ 48  76  61   5  89 138]
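We picked k=7 somewhat arbitrarily. If you want a more principled choice (a rough sketch; the candidate range of k values is arbitrary), you can compare silhouette scores across values of k:

from sklearn.metrics import silhouette_score

for k_cand in range(3, 10):
    km_cand = KMeans(n_clusters=k_cand, n_init="auto", random_state=123).fit(Z_food)
    print(f"k={k_cand}: silhouette={silhouette_score(Z_food, km_cand.labels_):.3f}")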

Let’s try DBSCAN.

from sklearn.cluster import DBSCAN

dbscan = DBSCAN()

labels = dbscan.fit_predict(Z_food)
print("Unique labels: {}".format(np.unique(labels)))
Unique labels: [-1]

It identified all points as noise. This makes sense: DBSCAN’s default eps is 0.5, which is tiny compared to the distances between our 1024-dimensional feature vectors, so no point has any neighbours within eps. Let’s explore the distances between points.

from sklearn.metrics.pairwise import euclidean_distances

dists = euclidean_distances(Z_food)
np.fill_diagonal(dists, np.inf)
dists_df = pd.DataFrame(dists)
dists_df
350 rows × 350 columns of pairwise Euclidean distances (diagonal set to inf; numeric output truncated)

dists.min(), np.nanmax(dists[dists != np.inf]), np.mean(dists[dists != np.inf])
(10.067173, 36.652683, 24.538565)
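A common heuristic for picking eps is to plot each point’s distance to its k-th nearest neighbour in sorted order and look for an “elbow” where the distances start growing quickly. A sketch (with k matching the min_samples=3 we use below; note that each point’s nearest neighbour is the point itself):

from sklearn.neighbors import NearestNeighbors

nn = NearestNeighbors(n_neighbors=3).fit(Z_food)
knn_dists, _ = nn.kneighbors(Z_food)
plt.plot(np.sort(knn_dists[:, -1]))  # sorted distance to each point's 3rd neighbour (incl. itself)
plt.xlabel("Points sorted by neighbour distance")
plt.ylabel("Distance to 3rd nearest neighbour");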
for eps in range(13, 20):
    print("\neps={}".format(eps))
    dbscan = DBSCAN(eps=eps, min_samples=3)
    labels = dbscan.fit_predict(Z_food)
    print("Number of clusters: {}".format(len(np.unique(labels))))
    print("Cluster sizes: {}".format(np.bincount(labels + 1)))
eps=13
Number of clusters: 2
Cluster sizes: [347   3]

eps=14
Number of clusters: 5
Cluster sizes: [334   3   6   4   3]

eps=15
Number of clusters: 4
Cluster sizes: [299  26   8  17]

eps=16
Number of clusters: 4
Cluster sizes: [248  86   3  13]

eps=17
Number of clusters: 2
Cluster sizes: [205 145]

eps=18
Number of clusters: 2
Cluster sizes: [160 190]

eps=19
Number of clusters: 2
Cluster sizes: [116 234]
dbscan = DBSCAN(eps=14, min_samples=3)
dbscan_labels = dbscan.fit_predict(Z_food)
print("Number of clusters: {}".format(len(np.unique(dbscan_labels))))
print("Cluster sizes: {}".format(np.bincount(dbscan_labels + 1)))
print("Unique labels: {}".format(np.unique(dbscan_labels)))
Number of clusters: 5
Cluster sizes: [334   3   6   4   3]
Unique labels: [-1  0  1  2  3]
print_dbscan_clusters(Z_food, food_inputs, dbscan_labels)

Let’s examine noise points identified by DBSCAN.

print_dbscan_noise_images(Z_food, food_inputs, dbscan_labels)



Now let’s try another dataset with human faces: a small subset of the Human Faces dataset (available here).

data_dir = "../data/people"
file_names = [image_file for image_file in glob.glob(data_dir + "/*/*.jpg")]
n_images = len(file_names)
BATCH_SIZE = n_images  # because our dataset is quite small
faces_inputs, classes = read_img_dataset(data_dir, BATCH_SIZE)
n_images
367
X_faces = faces_inputs.numpy()
X_faces.shape
(367, 3, 224, 224)
plot_sample_imgs(faces_inputs[0:24,:,:,:])
Z_faces = get_features(
    densenet, faces_inputs, 
)
Z_faces.shape
(367, 1024)
from sklearn.cluster import KMeans

k = 7
km = KMeans(n_clusters=k, n_init='auto', random_state=123)
km.fit(Z_faces)
KMeans(n_clusters=7, random_state=123)
km.cluster_centers_.shape
(7, 1024)
for cluster in range(k):
    get_cluster_images(km, Z_faces, X_faces, cluster, n_img=6)
128
Image indices:  [128 146  67  10  18  56]
67
Image indices:  [146  67 210 241 321  24]
92
Image indices:  [211  92 331 170   0 274]
272
Image indices:  [272  28 249  38  87 243]
150
Image indices:  [150 193 320 308 302 105]
318
Image indices:  [318  83 109 366  36 272]
362
Image indices:  [362 164 242  42  69 186]



Clustering faces with DBSCAN#

dbscan = DBSCAN()
labels = dbscan.fit_predict(Z_faces)
print("Unique labels: {}".format(np.unique(labels)))
Unique labels: [-1]
dists = euclidean_distances(Z_faces)
np.fill_diagonal(dists, np.inf)

dist_df = pd.DataFrame(
    dists
)

dist_df.iloc[10:20, 10:20]
10 11 12 13 14 15 16 17 18 19
10 inf 14.686935 21.547512 13.875407 18.288515 12.808964 20.555487 16.590254 9.884141 14.243562
11 14.686935 inf 20.484268 16.409140 16.682758 15.932351 21.362122 17.187832 14.117254 15.732210
12 21.547512 20.484268 inf 21.281721 21.584352 21.747551 22.890703 15.583789 20.823399 18.850285
13 13.875407 16.409140 21.281721 inf 19.584244 14.986351 24.546141 20.170977 12.813548 16.071207
14 18.288515 16.682758 21.584352 19.584244 inf 16.372135 22.501001 19.135139 18.132500 18.606798
15 12.808964 15.932351 21.747551 14.986351 16.372135 inf 22.083078 17.554756 12.090057 14.520963
16 20.555487 21.362122 22.890703 24.546141 22.501001 22.083078 inf 17.042555 21.822445 20.690748
17 16.590254 17.187832 15.583789 20.170977 19.135139 17.554756 17.042555 inf 17.113522 16.881115
18 9.884141 14.117254 20.823399 12.813548 18.132500 12.090057 21.822445 17.113522 inf 14.368977
19 14.243562 15.732210 18.850285 16.071207 18.606798 14.520963 20.690748 16.881115 14.368977 inf
dists.min(), np.nanmax(dists[dists != np.inf]), np.mean(dists[dists != np.inf])
(0.0, 32.12697, 18.71584)
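Note that the minimum pairwise distance is exactly 0.0, which suggests the dataset contains duplicate (or near-duplicate) images. We can locate such pairs (a quick sketch; the 1e-6 threshold is an arbitrary choice):

dup_pairs = np.argwhere(dists < 1e-6)
dup_pairs[dup_pairs[:, 0] < dup_pairs[:, 1]]  # each duplicate pair listed once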
for eps in [8, 9, 10, 11, 12, 13, 14, 15, 16, 17]:
    print("\neps={}".format(eps))
    dbscan = DBSCAN(eps=eps, min_samples=3)
    labels = dbscan.fit_predict(Z_faces)
    print("Number of clusters: {}".format(len(np.unique(labels))))
    print("Cluster sizes: {}".format(np.bincount(labels + 1)))
eps=8
Number of clusters: 6
Cluster sizes: [352   3   3   3   3   3]

eps=9
Number of clusters: 9
Cluster sizes: [340   3   3   3   3   6   3   3   3]

eps=10
Number of clusters: 10
Cluster sizes: [299  17  15  16   3   4   3   3   3   4]

eps=11
Number of clusters: 6
Cluster sizes: [226 128   4   3   3   3]

eps=12
Number of clusters: 3
Cluster sizes: [164 200   3]

eps=13
Number of clusters: 2
Cluster sizes: [115 252]

eps=14
Number of clusters: 2
Cluster sizes: [ 87 280]

eps=15
Number of clusters: 3
Cluster sizes: [ 62 299   6]

eps=16
Number of clusters: 3
Cluster sizes: [ 37 327   3]

eps=17
Number of clusters: 2
Cluster sizes: [ 25 342]
dbscan = DBSCAN(eps=10, min_samples=3)
dbscan_labels = dbscan.fit_predict(Z_faces)
print("Number of clusters: {}".format(len(np.unique(dbscan_labels))))
print("Cluster sizes: {}".format(np.bincount(dbscan_labels + 1)))
print("Unique labels: {}".format(np.unique(dbscan_labels)))
Number of clusters: 10
Cluster sizes: [299  17  15  16   3   4   3   3   3   4]
Unique labels: [-1  0  1  2  3  4  5  6  7  8]
print_dbscan_clusters(Z_faces, faces_inputs, dbscan_labels)

Let’s examine noise images identified by DBSCAN.

print_dbscan_noise_images(Z_faces, faces_inputs, dbscan_labels)
  • We can guess why these images were flagged as noise: odd angles, cropping, sunglasses, hands near faces, etc.

Hierarchical clustering#

from scipy.cluster.hierarchy import fcluster, ward

set_seed(seed=42)
plt.figure(figsize=(20, 15))
Z_hrch = ward(Z_faces)
dendrogram(Z_hrch, p=7, truncate_mode="level", no_labels=True)
plt.xlabel("Sample index")
plt.ylabel("Cluster distance");
cluster_labels = fcluster(Z_hrch, 30, criterion="maxclust")  # let's get flat clusters
plt.close()
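Alternatively (a sketch; the threshold of 40 is arbitrary and is best read off the dendrogram above), we can cut the tree at a fixed distance instead of asking for a fixed number of clusters:

labels_by_dist = fcluster(Z_hrch, t=40, criterion="distance")
len(np.unique(labels_by_dist))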
hand_picked_clusters = np.arange(2, 30)
print_hierarchical_clusters(
    faces_inputs, Z_faces, cluster_labels, hand_picked_clusters
)
  • Some clusters correspond to people with distinct faces, ages, facial expressions, hair colour and style, lighting, and skin tone.