Recipe Generation using Transformers#
This notebook demonstrates how to build a Transformer-based model to generate recipe titles. You’ll learn about tokenization, preparing datasets, building and training the model, and generating new text.
Imports#
import torch
import torch.nn as nn
import numpy as np
from torch.utils.data import Dataset, DataLoader
import pandas as pd
import os
import re
import sys
from collections import Counter, defaultdict
from urllib.request import urlopen
import math
This is a demo of recipe generation using PyTorch and Hugging Face Transformers. For the purposes of this demo, we’ll sample 10,000 recipe titles from the corpus.
Data#
orig_recipes_df = pd.read_csv("../data/RAW_recipes.csv")
orig_recipes_df = orig_recipes_df.dropna()
recipes_df = orig_recipes_df.sample(10_000)
recipes_df
name | id | minutes | contributor_id | submitted | tags | nutrition | n_steps | steps | description | ingredients | n_ingredients | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
217580 | tuna stuff | 18264 | 20 | 29110 | 2002-01-28 | ['30-minutes-or-less', 'time-to-make', 'course... | [280.2, 14.0, 23.0, 25.0, 29.0, 18.0, 11.0] | 7 | ['boil noodles', 'drain and return to pan', 'a... | when i first saw this i thought it sounded so ... | ['macaroni and cheese mix', 'corn', 'tuna', 'b... | 6 |
190826 | soetkoekies sweet wine and spice south afric... | 309794 | 30 | 539686 | 2008-06-17 | ['30-minutes-or-less', 'time-to-make', 'course... | [99.0, 4.0, 36.0, 3.0, 3.0, 5.0, 5.0] | 16 | ['preheat the oven to 350 degrees', 'spray two... | i got this from a very old (1970) african cook... | ['butter', 'all-purpose flour', 'cooking spray... | 14 |
21969 | berry fruit dip | 181343 | 65 | 341355 | 2006-08-10 | ['time-to-make', 'course', 'main-ingredient', ... | [134.0, 0.0, 75.0, 4.0, 15.0, 1.0, 8.0] | 3 | ['combine yogurt , orange rind , orange juice ... | berries, orange, and a touch of almond flavori... | ['non-fat strawberry yogurt', 'orange rind', '... | 4 |
189302 | slow cooked texas style beef brisket | 485907 | 1515 | 4439 | 2012-08-24 | ['time-to-make', 'course', 'main-ingredient', ... | [316.3, 20.0, 31.0, 17.0, 76.0, 23.0, 2.0] | 8 | ['place the beef brisket in a large slow cooke... | this is a unique method of making lush, succul... | ['beef brisket', 'strong black coffee', 'ketch... | 8 |
215708 | tortellini salad and basil dressing | 93075 | 160 | 133174 | 2004-06-10 | ['time-to-make', 'course', 'main-ingredient', ... | [246.2, 7.0, 19.0, 10.0, 23.0, 11.0, 13.0] | 11 | ['in a small bowl whisk together basil , pecti... | this salad is so pretty. perfect for a ladies ... | ['fresh basil', 'powdered fruit pectin', 'dijo... | 15 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
127459 | lucky leprechaun smoothie | 405618 | 5 | 628076 | 2009-12-29 | ['15-minutes-or-less', 'time-to-make', 'course... | [155.4, 4.0, 93.0, 5.0, 20.0, 8.0, 7.0] | 1 | ['combine all ingredients in a shaking contain... | this recipe came from studio 5 - who could res... | ['low-fat vanilla yogurt', 'low-fat milk', 'in... | 4 |
86899 | fresh fruit pudding milk mixer | 407115 | 15 | 57042 | 2010-01-05 | ['weeknight', '15-minutes-or-less', 'time-to-m... | [79.3, 3.0, 34.0, 2.0, 8.0, 7.0, 3.0] | 4 | ['place all ingredients in blender container',... | i found this chemung county dairy princess rec... | ['2% low-fat milk', 'vanilla flavor instant pu... | 4 |
111516 | inside out pizza dilla margerita | 205925 | 45 | 37779 | 2007-01-17 | ['60-minutes-or-less', 'time-to-make', 'course... | [583.0, 51.0, 20.0, 47.0, 62.0, 82.0, 12.0] | 16 | ['heat a skillet over medium heat', 'add olive... | rachael ray | ['extra virgin olive oil', 'garlic cloves', 'r... | 10 |
112585 | italian eggplant aubergine crepes | 21508 | 120 | 15718 | 2002-03-05 | ['weeknight', 'time-to-make', 'course', 'main-... | [228.7, 18.0, 27.0, 17.0, 22.0, 23.0, 6.0] | 21 | ['cut eggplant lengthwise into thin slices', '... | delicious italian/ mediterranean-style eggplan... | ['eggplant', 'seasoned flour', 'olive oil', 'p... | 22 |
133045 | mediterranean herb baked chicken | 112720 | 540 | 73836 | 2005-03-05 | ['time-to-make', 'course', 'main-ingredient', ... | [467.8, 25.0, 12.0, 30.0, 138.0, 19.0, 2.0] | 9 | ['combine the parsley , cilantro , garlic , cu... | the selection of spices used in this dish crea... | ['fresh parsley', 'fresh cilantro', 'fresh cil... | 16 |
10000 rows × 12 columns
# Set the appropriate device depending upon your hardware.
# device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device = torch.device('mps' if torch.backends.mps.is_available() else 'cpu')
print(device)
mps
recipes = recipes_df['name'].tolist()
recipes[:10]
['tuna stuff',
'soetkoekies sweet wine and spice south african cookies',
'berry fruit dip',
'slow cooked texas style beef brisket',
'tortellini salad and basil dressing',
'brussels sprouts and carrots',
'camarones en chile salsa shrimp in chili gravy',
'sirloin burgers with blue cheese mayo and sherry vidalia onions',
'turnips and greens',
'bacon wrapped parmesan breadsticks']
Tokenization#
Let’s start with tokenization.
We create a tokenizer wrapper to convert recipe names into tokens using the tokenizer of a pre-trained language model (like BERT), which knows a very large set of words and subwords. For our specific dataset of recipe titles, however, we only need a much smaller dictionary: just the tokens that actually show up in our data.
So this code helps us:
Use the tokenizer from a big pre-trained model.
Go through our dataset and extract just the tokens we need.
Build a mini vocabulary just for our data.
Be able to tokenize and decode texts using this mini vocab.
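Before building the wrapper, it can help to peek at what the raw pre-trained tokenizer produces on its own. The short sketch below is illustrative only: it loads the same bert-base-cased checkpoint the wrapper uses, and the sample title is just one of the recipes shown above.
from transformers import AutoTokenizer
# Peek at the raw BERT tokenizer output before we remap its IDs to a smaller custom vocabulary.
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
sample_title = "bacon wrapped parmesan breadsticks"
encoded = bert_tokenizer(sample_title)
print(encoded.input_ids)  # IDs from BERT's full vocabulary, with [CLS] and [SEP] added
print(bert_tokenizer.convert_ids_to_tokens(encoded.input_ids))  # the corresponding subword pieces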
from transformers import AutoTokenizer
from tqdm import trange
class TokenizerWrapper:
"""Wraps AutoTokenizer with a custom vocabulary mapping."""
def __init__(self, model_name="bert-base-cased"):
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
# Initialize mappings with special tokens: [PAD] -> 0, [CLS] -> 1, [SEP] -> 2
self.token_id_to_vocab_id = {0: 0, 101: 1, 102: 2}
self.vocab_id_to_token_id = {0: 0, 1: 101, 2: 102}
self.vocab_id = 3 # Start after special tokens
self.padding_len = None
def build_dictionary(self, recipes: list[str]):
"""Builds vocabulary from a list of recipes and sets padding length."""
tokenized = self.tokenizer(recipes, padding='longest').input_ids
self.padding_len = len(tokenized[0])
for tokens in tokenized:
for token_id in tokens:
if token_id not in self.token_id_to_vocab_id:
self.token_id_to_vocab_id[token_id] = self.vocab_id
self.vocab_id_to_token_id[self.vocab_id] = token_id
self.vocab_id += 1
def get_vocab_size(self) -> int:
"""Returns the size of the custom vocabulary."""
assert len(self.token_id_to_vocab_id) == len(self.vocab_id_to_token_id)
return self.vocab_id
def tokenize(self, text: str) -> list[int]:
"""Tokenizes text using custom vocabulary (requires build_dictionary first)."""
assert self.padding_len is not None, "Call build_dictionary() before tokenizing."
token_ids = self.tokenizer(text, padding='max_length', max_length=self.padding_len).input_ids
return [self.token_id_to_vocab_id[token_id] for token_id in token_ids]
def decode(self, vocab_ids: list[int]) -> str:
"""Decodes a list of custom vocab IDs into a string."""
token_ids = [self.vocab_id_to_token_id[vocab_id] for vocab_id in vocab_ids]
# decoded_string = self.tokenizer.decode(token_ids, skip_special_tokens=True)
decoded_string = self.tokenizer.decode(token_ids, skip_special_tokens=False)
return decoded_string
# Build the dictionary for our tokenizer
from tqdm import tqdm, trange
tokenizer_wrapper = TokenizerWrapper()
tokenizer_wrapper.build_dictionary(recipes_df["name"].to_list())
recipe_tokens = tokenizer_wrapper.tokenize(recipes_df['name'].iloc[10])
decoded_recipe = tokenizer_wrapper.decode(recipe_tokens)
print('Recipe:', recipes_df['name'].iloc[10])
print('Tokens:', recipe_tokens)
print('Decoded recipe:', decoded_recipe)
Recipe: roast teriyaki broccoli
Tokens: [1, 90, 91, 28, 92, 93, 33, 94, 95, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Decoded recipe: [CLS] roast teriyaki broccoli [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
vocab_size = tokenizer_wrapper.get_vocab_size()
vocab_size
3699
❓❓ Questions for you#
Shouldn’t we just have a few meaningful indices above? What’s going on?
Why might we want to build a smaller custom vocabulary from our dataset instead of using the full vocabulary from a large pre-trained model?
What do you think the impact would be on memory usage?
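If you want to explore the last question concretely, here is a rough back-of-the-envelope sketch. It assumes d_model = 256 (the value used later in this notebook) and float32 weights; note that both the embedding layer and the final output layer scale with the vocabulary size.
# Rough comparison: embedding-table size with the full BERT vocabulary vs. our custom vocabulary.
# Assumes d_model = 256 and float32 parameters (4 bytes each).
d_model = 256
full_vocab = tokenizer_wrapper.tokenizer.vocab_size  # full bert-base-cased vocabulary
custom_vocab = tokenizer_wrapper.get_vocab_size()    # vocabulary built from our recipe titles
for name, v in [("full BERT vocab", full_vocab), ("custom vocab", custom_vocab)]:
    n_params = v * d_model
    print(f"{name}: {v} tokens -> {n_params:,} embedding parameters (~{n_params * 4 / 1e6:.1f} MB)")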
Dataset preparation#
We split the dataset into training and test sets and convert each recipe name into a token sequence.
def build_data(data_df, tokenizer_wrapper):
dataset = []
for row_id in trange(len(data_df)):
        recipe_tokens = torch.tensor(tokenizer_wrapper.tokenize(data_df['name'].iloc[row_id]))
        dataset.append({'token': recipe_tokens})
return dataset
Let’s create train and test datasets by calling build_data on the train and test splits.
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(recipes_df, test_size=0.2, random_state=123)
train_data = build_data(train_df, tokenizer_wrapper)
test_data = build_data(test_df, tokenizer_wrapper)
100%|███████████████████████████████████| 8000/8000 [00:00<00:00, 19073.31it/s]
100%|███████████████████████████████████| 2000/2000 [00:00<00:00, 23614.18it/s]
train_data[:5]
[{'token': tensor([ 1, 304, 110, 342, 1229, 2, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0])},
{'token': tensor([ 1, 54, 61, 161, 48, 251, 69, 443, 2, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])},
{'token': tensor([ 1, 588, 665, 788, 1095, 831, 40, 1027, 405, 1120, 106, 31,
2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0])},
{'token': tensor([ 1, 99, 198, 336, 223, 1316, 2, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0])},
{'token': tensor([ 1, 1273, 59, 2, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0])}]
Custom PyTorch dataset and batching#
We define a PytorchDataset class to provide input-target token sequences for autoregressive training. We prepare the input and target so that the model predicts the next token given the previous ones.
class PytorchDataset(Dataset):
def __init__(self, data, pad_vocab_id=0):
self.data = data
self.pad_tensor = torch.tensor([pad_vocab_id])
def __len__(self):
return len(self.data)
def __getitem__(self, ind):
        # Build the target by shifting the input one position to the left:
        # drop the first token and append a padding token at the end.
target_sequence = torch.cat([self.data[ind]['token'][1:], self.pad_tensor])
return self.data[ind]['token'], target_sequence
train_dataset = PytorchDataset(train_data)
test_dataset = PytorchDataset(test_data)
train_dataloader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=50, shuffle=False)
Now let’s get a batch of data from the DataLoader.
train_text, train_target = next(iter(train_dataloader))
train_text = train_text.to(device)
train_text.shape
torch.Size([64, 25])
train_text[0]
tensor([ 1, 48, 267, 645, 113, 968, 1491, 1897, 2, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0], device='mps:0')
train_target[0]
tensor([ 48, 267, 645, 113, 968, 1491, 1897, 2, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0])
tokenizer_wrapper.decode(train_text[0].tolist())
'[CLS] carrot apple chicken nuggets [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]'
tokenizer_wrapper.decode(train_target[0].tolist())
'carrot apple chicken nuggets [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]'
The target is shifted one position to the left for autoregressive training.
Transformer Decoder Model#
We are now ready to define a transformer-based decoder-only model with positional encoding to generate text.
Let’s begin with positional encoding. Transformers don’t have any built-in notion of word order (unlike RNNs), so we need to explicitly tell the model the position of each word in the sequence.
In the interest of time, we won’t dive deep into the math, but we’ll use a standard implementation inspired by the Attention Is All You Need paper.
The code below adds these position signals to token embeddings so the model can learn not just what the tokens are, but where they appear in the sequence.
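For reference, the sinusoidal encoding from the paper assigns position $pos$ and embedding dimensions $2i$ and $2i+1$ the values
$$
\mathrm{PE}_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad
\mathrm{PE}_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right),
$$
so each position receives a unique pattern of sines and cosines at different frequencies.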
# The PositionalEncoding model is already defined for you. Do not change this class.
# We'll use this class in this exercise as well as the next exercise.
class PositionalEncoding(nn.Module):
"""
Implements sinusoidal positional encoding as described in "Attention is All You Need".
Args:
d_model (int): Dimension of the embedding space.
dropout (float): Dropout rate after adding positional encodings.
max_len (int): Maximum length of supported input sequences.
"""
def __init__(self, d_model: int, dropout: float = 0.1, max_len: int = 5000):
super().__init__()
self.dropout = nn.Dropout(p=dropout)
# Create a (max_len, 1) position tensor: [[0], [1], ..., [max_len-1]]
positions = torch.arange(max_len).unsqueeze(1)
# Compute the scaling terms for each dimension (even indices only)
scale_factors = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
# Initialize the positional encoding matrix with shape (max_len, 1, d_model)
pe = torch.zeros(max_len, 1, d_model)
pe[:, 0, 0::2] = torch.sin(positions * scale_factors) # Apply sine to even indices
pe[:, 0, 1::2] = torch.cos(positions * scale_factors) # Apply cosine to odd indices
# Register as buffer (not a trainable parameter)
self.register_buffer("pe", pe)
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""
Adds positional encoding to the input tensor.
Args:
x (torch.Tensor): Input tensor of shape (seq_len, batch_size, d_model)
Returns:
torch.Tensor: Tensor with positional encoding added.
"""
seq_len = x.size(0)
x = x + self.pe[:seq_len]
return self.dropout(x)
Model architecture#
Now we’re ready to define our model architecture! It’s going to include several key components that work together to generate text one token at a time:
nn.Embedding layer: turns token IDs into dense vector representations.
PositionalEncoding: adds information about the position of each token in the sequence.
TransformerDecoder: the core of the model that processes the input using attention mechanisms.
Causal mask: ensures the model only attends to earlier positions when generating text, so it doesn’t “peek ahead”.
Output layer (nn.Linear): maps decoder outputs to vocabulary logits so we can predict the next token.
Weight initialization: helps the model start training with reasonable values instead of random chaos.
We’ll walk through each part step by step in the code below.
class RecipeGenerator(nn.Module):
def __init__(self, d_model, n_heads, num_layers, vocab_size, device, dropout=0.1):
"""
Initialize the RecipeGenerator which uses a transformer decoder architecture
for generating recipes.
Parameters:
d_model (int): The number of expected features in the encoder/decoder inputs.
n_heads (int): The number of heads in the multiheadattention models.
num_layers (int): The number of sub-decoder-layers in the transformer.
vocab_size (int): The size of the vocabulary.
device (torch.device): The device on which the model will be trained.
dropout (float): The dropout value used in PositionalEncoding and TransformerDecoderLayer.
"""
super(RecipeGenerator, self).__init__()
self.d_model = d_model
self.device = device
# Embedding layer for converting input text tokens into vectors
        self.text_embedding = nn.Embedding(vocab_size, d_model)
# Positional Encoding to add position information to input embeddings
self.pos_encoding = PositionalEncoding(d_model=d_model, dropout=dropout)
# Define the Transformer decoder
        decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=n_heads, dropout=dropout)
self.TransformerDecoder = nn.TransformerDecoder(
decoder_layer,
num_layers=num_layers
)
# Final linear layer to map the output of the transformer decoder to vocabulary size
self.linear_layer = nn.Linear(d_model, vocab_size)
# Initialize the weights of the model
self.init_weights()
def init_weights(self):
"""
Initialize weights of the model to small random values.
"""
initrange = 0.1
self.text_embedding.weight.data.uniform_(-initrange, initrange)
self.linear_layer.bias.data.zero_()
self.linear_layer.weight.data.uniform_(-initrange, initrange)
def forward(self, text):
        # Get the embedded input
encoded_text = self.embed_text(text)
# Get transformer output
transformer_output = self.decode(encoded_text)
# Final linear layer (unembedding layer)
return self.linear_layer(transformer_output)
def embed_text(self, text):
embedding = self.text_embedding(text) * math.sqrt(self.d_model)
return self.pos_encoding(embedding.permute(1, 0, 2))
def decode(self, encoded_text):
        # Get the length of the sequences to be decoded. This is needed to generate the causal mask.
seq_len = encoded_text.size(0)
causal_mask = self.generate_mask(seq_len)
dummy_memory = torch.zeros_like(encoded_text)
return self.TransformerDecoder(tgt=encoded_text, memory=dummy_memory, tgt_mask=causal_mask)
def generate_mask(self, size):
mask = torch.triu(torch.ones(size, size, device=self.device), 1)
return mask.float().masked_fill(mask == 1, float('-inf'))
# A quick demo of the causal mask for a sequence of length 10:
# the -inf entries mark future positions the model is not allowed to attend to.
size = 10
mask = torch.triu(torch.ones(size, size), 1)
mask.float().masked_fill(mask == 1, float('-inf'))
tensor([[0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],
[0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],
[0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf],
[0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf],
[0., 0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf],
[0., 0., 0., 0., 0., 0., -inf, -inf, -inf, -inf],
[0., 0., 0., 0., 0., 0., 0., -inf, -inf, -inf],
[0., 0., 0., 0., 0., 0., 0., 0., -inf, -inf],
[0., 0., 0., 0., 0., 0., 0., 0., 0., -inf],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
Let’s instantiate the model.
# Define the hyperparameters and initialize the model. Feel free to change these hyperparameters.
d_model = 256
n_heads = 4
num_layers = 8
model = RecipeGenerator(d_model=d_model, n_heads=n_heads, num_layers=num_layers, vocab_size=vocab_size, device=device).to(device)
Model Training#
We define the loss function and optimizer and train the model using cross-entropy loss while applying gradient clipping.
train_text
tensor([[ 1, 48, 267, ..., 0, 0, 0],
[ 1, 56, 1135, ..., 0, 0, 0],
[ 1, 142, 488, ..., 0, 0, 0],
...,
[ 1, 693, 970, ..., 0, 0, 0],
[ 1, 684, 685, ..., 0, 0, 0],
[ 1, 14, 427, ..., 0, 0, 0]], device='mps:0')
train_text.shape
torch.Size([64, 25])
# pass inputs to your model
output = model(train_text)
output.shape
torch.Size([25, 64, 3699])
vocab_size
3699
def trainer(
model,
criterion,
optimizer,
train_dataloader,
test_dataloader,
epochs=5,
patience=5,
clip_norm=1.0
):
"""
Trains and evaluates the transformer model over multiple epochs using the provided dataloaders.
Args:
model: The Transformer model to train.
criterion: Loss function (e.g., CrossEntropyLoss).
optimizer: Optimizer (e.g., Adam).
train_dataloader: DataLoader for training data.
test_dataloader: DataLoader for validation data.
epochs: Number of training epochs.
patience: Early stopping patience – stop if validation loss increases `patience` times in a row.
clip_norm: Maximum norm for gradient clipping to avoid exploding gradients.
Returns:
train_losses: List of average training losses for each epoch.
test_losses: List of average test losses for each epoch.
"""
train_losses = []
test_losses = []
early_stopping_counter = 0
for epoch in range(epochs):
# Training phase
model.train()
total_train_loss = 0
for batch_inputs, batch_targets in train_dataloader:
# Move inputs and targets to the correct device (GPU or CPU)
batch_inputs, batch_targets = batch_inputs.to(device), batch_targets.to(device)
optimizer.zero_grad()
# Forward pass
predictions = model(batch_inputs) # shape: (seq_len, batch_size, vocab_size)
predictions = predictions.permute(1, 2, 0) # shape: (batch_size, vocab_size, seq_len)
loss = criterion(predictions, batch_targets)
loss.backward()
# Clip gradients to prevent exploding gradients
torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
optimizer.step()
total_train_loss += loss.item()
avg_train_loss = total_train_loss / len(train_dataloader)
train_losses.append(avg_train_loss)
# Evaluation phase
model.eval()
total_test_loss = 0
with torch.no_grad():
for batch_inputs, batch_targets in test_dataloader:
batch_inputs, batch_targets = batch_inputs.to(device), batch_targets.to(device)
predictions = model(batch_inputs).permute(1, 2, 0)
loss = criterion(predictions, batch_targets)
total_test_loss += loss.item()
avg_test_loss = total_test_loss / len(test_dataloader)
test_losses.append(avg_test_loss)
print(f"Epoch {epoch+1}: Train Loss = {avg_train_loss:.4f}, Test Loss = {avg_test_loss:.4f}")
# Early stopping check
if epoch > 0 and avg_test_loss > test_losses[-2] * (1 + 1e-5):
early_stopping_counter += 1
else:
early_stopping_counter = 0
if early_stopping_counter >= patience:
print(f"Early stopping triggered at epoch {epoch+1}")
break
return train_losses, test_losses
# Define the optimizer and the loss function. Feel free to change the hyperparameters.
num_epoch = 20
clip_norm = 1.0
lr = 5e-5
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
criterion = torch.nn.CrossEntropyLoss(ignore_index=0) # Ignore the padding index
train_losses, test_losses = trainer(model, criterion, optimizer, train_dataloader, test_dataloader, epochs=num_epoch)
Epoch 1: Train Loss = 6.9585, Test Loss = 6.3770
Epoch 2: Train Loss = 6.0016, Test Loss = 5.6101
Epoch 3: Train Loss = 5.3564, Test Loss = 5.1428
Epoch 4: Train Loss = 4.9545, Test Loss = 4.8622
Epoch 5: Train Loss = 4.6845, Test Loss = 4.6780
Epoch 6: Train Loss = 4.4856, Test Loss = 4.5558
Epoch 7: Train Loss = 4.3282, Test Loss = 4.4594
Epoch 8: Train Loss = 4.1998, Test Loss = 4.3868
Epoch 9: Train Loss = 4.0888, Test Loss = 4.3136
Epoch 10: Train Loss = 3.9922, Test Loss = 4.2542
Epoch 11: Train Loss = 3.9038, Test Loss = 4.2114
Epoch 12: Train Loss = 3.8230, Test Loss = 4.1818
Epoch 13: Train Loss = 3.7493, Test Loss = 4.1433
Epoch 14: Train Loss = 3.6828, Test Loss = 4.1118
Epoch 15: Train Loss = 3.6258, Test Loss = 4.0892
Epoch 16: Train Loss = 3.5637, Test Loss = 4.0803
Epoch 17: Train Loss = 3.5074, Test Loss = 4.0499
Epoch 18: Train Loss = 3.4555, Test Loss = 4.0367
Epoch 19: Train Loss = 3.4072, Test Loss = 4.0278
Epoch 20: Train Loss = 3.3545, Test Loss = 4.0141
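To see the trend at a glance, you can plot the loss lists returned by trainer. Below is a minimal sketch, assuming matplotlib is available in your environment.
import matplotlib.pyplot as plt
# Plot the per-epoch losses returned by trainer() to inspect convergence.
plt.plot(train_losses, label="train loss")
plt.plot(test_losses, label="test loss")
plt.xlabel("epoch")
plt.ylabel("cross-entropy loss")
plt.legend()
plt.show()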
Recipe Generation#
We generate a new recipe by sampling tokens one by one from the trained model.
def generate_recipe(model, device, max_recipe_length=39, seed=[206], end_vocab=2):
"""
    Generates a recipe title using the specified model and device.
Parameters:
model (torch.nn.Module): The trained model used for generating tokens.
device (torch.device): Device to run the model on.
max_recipe_length (int): Maximum number of tokens to generate.
seed (list[int]): A list of one or more token IDs to start generation with.
end_vocab (int): Token ID that indicates the end of the sequence.
Returns:
numpy.ndarray: A 1D array of token IDs representing the generated recipe.
"""
# Ensure seed is a list and convert to tensor of shape [1, len(seed)]
context = torch.tensor([seed], device=device)
# Generate tokens until max length or end token is reached
for _ in range(max_recipe_length - len(seed)): # subtract len(seed) to cap total length
logits = model(context)[-1] # Get logits for the last position
probabilities = torch.softmax(logits, dim=-1).flatten(start_dim=1)
next_vocab = torch.multinomial(probabilities, num_samples=1)
context = torch.cat([context, next_vocab], dim=1)
if next_vocab.item() == end_vocab:
break
return context.cpu().numpy().flatten()
recipe = generate_recipe(model, device, max_recipe_length=20)
generated_recipe = tokenizer_wrapper.decode(recipe)
generated_recipe
'chocolate chip chocolate frosting [SEP]'
The generation quality might not be great, but the purpose here is to demonstrate the different components involved in text generation with transformers.
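If you want to experiment further, a common tweak is to scale the logits by a temperature before sampling: lower temperatures make the model pick high-probability tokens more often, while higher temperatures produce more varied (and noisier) titles. The sketch below is an illustrative variation on generate_recipe, not part of the original demo; the temperature parameter and the helper name are assumptions.
# Illustrative sketch (not part of the demo above): temperature-scaled sampling.
def generate_with_temperature(model, device, temperature=0.8, max_len=20, seed=[206], end_vocab=2):
    model.eval()
    context = torch.tensor([seed], device=device)
    with torch.no_grad():
        for _ in range(max_len - len(seed)):
            logits = model(context)[-1] / temperature  # divide logits by the temperature before softmax
            probabilities = torch.softmax(logits, dim=-1)
            next_vocab = torch.multinomial(probabilities, num_samples=1)
            context = torch.cat([context, next_vocab], dim=1)
            if next_vocab.item() == end_vocab:
                break
    return tokenizer_wrapper.decode(context.cpu().numpy().flatten().tolist())

# Sample a few titles; expect short, sometimes nonsensical output from this small model.
for _ in range(3):
    print(generate_with_temperature(model, device))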