Recipe Generation using Transformers#
This notebook demonstrates how to build a Transformer-based model to generate recipe titles. You’ll learn about tokenization, preparing datasets, building and training the model, and generating new text.
Imports#
import torch
import torch.nn as nn
import numpy as np
from torch.utils.data import Dataset, DataLoader
import pandas as pd
import os
import re
import sys
from collections import Counter, defaultdict
from urllib.request import urlopen
import math
This is a demo of recipe generation using PyTorch and Hugging Face Transformers. For the purposes of this demo, we’ll sample 10,000 recipe titles from the corpus.
Data#
orig_recipes_df = pd.read_csv("../data/RAW_recipes.csv")
orig_recipes_df = orig_recipes_df.dropna()
recipes_df = orig_recipes_df.sample(10_000)
recipes_df
name | id | minutes | contributor_id | submitted | tags | nutrition | n_steps | steps | description | ingredients | n_ingredients | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
217580 | tuna stuff | 18264 | 20 | 29110 | 2002-01-28 | ['30-minutes-or-less', 'time-to-make', 'course... | [280.2, 14.0, 23.0, 25.0, 29.0, 18.0, 11.0] | 7 | ['boil noodles', 'drain and return to pan', 'a... | when i first saw this i thought it sounded so ... | ['macaroni and cheese mix', 'corn', 'tuna', 'b... | 6 |
190826 | soetkoekies sweet wine and spice south afric... | 309794 | 30 | 539686 | 2008-06-17 | ['30-minutes-or-less', 'time-to-make', 'course... | [99.0, 4.0, 36.0, 3.0, 3.0, 5.0, 5.0] | 16 | ['preheat the oven to 350 degrees', 'spray two... | i got this from a very old (1970) african cook... | ['butter', 'all-purpose flour', 'cooking spray... | 14 |
21969 | berry fruit dip | 181343 | 65 | 341355 | 2006-08-10 | ['time-to-make', 'course', 'main-ingredient', ... | [134.0, 0.0, 75.0, 4.0, 15.0, 1.0, 8.0] | 3 | ['combine yogurt , orange rind , orange juice ... | berries, orange, and a touch of almond flavori... | ['non-fat strawberry yogurt', 'orange rind', '... | 4 |
189302 | slow cooked texas style beef brisket | 485907 | 1515 | 4439 | 2012-08-24 | ['time-to-make', 'course', 'main-ingredient', ... | [316.3, 20.0, 31.0, 17.0, 76.0, 23.0, 2.0] | 8 | ['place the beef brisket in a large slow cooke... | this is a unique method of making lush, succul... | ['beef brisket', 'strong black coffee', 'ketch... | 8 |
215708 | tortellini salad and basil dressing | 93075 | 160 | 133174 | 2004-06-10 | ['time-to-make', 'course', 'main-ingredient', ... | [246.2, 7.0, 19.0, 10.0, 23.0, 11.0, 13.0] | 11 | ['in a small bowl whisk together basil , pecti... | this salad is so pretty. perfect for a ladies ... | ['fresh basil', 'powdered fruit pectin', 'dijo... | 15 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
127459 | lucky leprechaun smoothie | 405618 | 5 | 628076 | 2009-12-29 | ['15-minutes-or-less', 'time-to-make', 'course... | [155.4, 4.0, 93.0, 5.0, 20.0, 8.0, 7.0] | 1 | ['combine all ingredients in a shaking contain... | this recipe came from studio 5 - who could res... | ['low-fat vanilla yogurt', 'low-fat milk', 'in... | 4 |
86899 | fresh fruit pudding milk mixer | 407115 | 15 | 57042 | 2010-01-05 | ['weeknight', '15-minutes-or-less', 'time-to-m... | [79.3, 3.0, 34.0, 2.0, 8.0, 7.0, 3.0] | 4 | ['place all ingredients in blender container',... | i found this chemung county dairy princess rec... | ['2% low-fat milk', 'vanilla flavor instant pu... | 4 |
111516 | inside out pizza dilla margerita | 205925 | 45 | 37779 | 2007-01-17 | ['60-minutes-or-less', 'time-to-make', 'course... | [583.0, 51.0, 20.0, 47.0, 62.0, 82.0, 12.0] | 16 | ['heat a skillet over medium heat', 'add olive... | rachael ray | ['extra virgin olive oil', 'garlic cloves', 'r... | 10 |
112585 | italian eggplant aubergine crepes | 21508 | 120 | 15718 | 2002-03-05 | ['weeknight', 'time-to-make', 'course', 'main-... | [228.7, 18.0, 27.0, 17.0, 22.0, 23.0, 6.0] | 21 | ['cut eggplant lengthwise into thin slices', '... | delicious italian/ mediterranean-style eggplan... | ['eggplant', 'seasoned flour', 'olive oil', 'p... | 22 |
133045 | mediterranean herb baked chicken | 112720 | 540 | 73836 | 2005-03-05 | ['time-to-make', 'course', 'main-ingredient', ... | [467.8, 25.0, 12.0, 30.0, 138.0, 19.0, 2.0] | 9 | ['combine the parsley , cilantro , garlic , cu... | the selection of spices used in this dish crea... | ['fresh parsley', 'fresh cilantro', 'fresh cil... | 16 |
10000 rows × 12 columns
# Set the appropriate device depending upon your hardware.
# device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device = torch.device('mps' if torch.backends.mps.is_available() else 'cpu')
print(device)
mps
recipes = recipes_df['name'].tolist()
recipes[:10]
['tuna stuff',
'soetkoekies sweet wine and spice south african cookies',
'berry fruit dip',
'slow cooked texas style beef brisket',
'tortellini salad and basil dressing',
'brussels sprouts and carrots',
'camarones en chile salsa shrimp in chili gravy',
'sirloin burgers with blue cheese mayo and sherry vidalia onions',
'turnips and greens',
'bacon wrapped parmesan breadsticks']
Tokenization#
Let’s start with tokenization.
We create a tokenizer wrapper to convert recipe names into tokens using the tokenizer of a pre-trained language model (like BERT), which knows a very large set of words and subwords. For our specific dataset of recipe titles, however, we only need a much smaller dictionary: just the tokens that actually show up in our data.
So this code helps us:
Use the tokenizer from a big pre-trained model.
Go through our dataset and extract just the tokens we need.
Build a mini vocabulary just for our data.
Be able to tokenize and decode texts using this mini vocab.
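Before building the wrapper, it can help to peek at what the raw pre-trained tokenizer produces on its own. The short sketch below is illustrative only: it loads the same bert-base-cased checkpoint the wrapper uses, and the sample title is just one of the recipes shown above.
from transformers import AutoTokenizer
# Peek at the raw BERT tokenizer output before we remap its IDs to a smaller custom vocabulary.
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
sample_title = "bacon wrapped parmesan breadsticks"
encoded = bert_tokenizer(sample_title)
print(encoded.input_ids)  # IDs from BERT's full vocabulary, with [CLS] and [SEP] added
print(bert_tokenizer.convert_ids_to_tokens(encoded.input_ids))  # the corresponding subword pieces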
from transformers import AutoTokenizer
from tqdm import trange
class TokenizerWrapper:
"""Wraps AutoTokenizer with a custom vocabulary mapping."""
def __init__(self, model_name="bert-base-cased"):
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
# Initialize mappings with special tokens: [PAD] -> 0, [CLS] -> 1, [SEP] -> 2
self.token_id_to_vocab_id = {0: 0, 101: 1, 102: 2}
self.vocab_id_to_token_id = {0: 0, 1: 101, 2: 102}
self.vocab_id = 3 # Start after special tokens
self.padding_len = None
def build_dictionary(self, recipes: list[str]):
"""Builds vocabulary from a list of recipes and sets padding length."""
tokenized = self.tokenizer(recipes, padding='longest').input_ids
self.padding_len = len(tokenized[0])
for tokens in tokenized:
for token_id in tokens:
if token_id not in self.token_id_to_vocab_id:
self.token_id_to_vocab_id[token_id] = self.vocab_id
self.vocab_id_to_token_id[self.vocab_id] = token_id
self.vocab_id += 1
def get_vocab_size(self) -> int:
"""Returns the size of the custom vocabulary."""
assert len(self.token_id_to_vocab_id) == len(self.vocab_id_to_token_id)
return self.vocab_id
def tokenize(self, text: str) -> list[int]:
"""Tokenizes text using custom vocabulary (requires build_dictionary first)."""
assert self.padding_len is not None, "Call build_dictionary() before tokenizing."
token_ids = self.tokenizer(text, padding='max_length', max_length=self.padding_len).input_ids
return [self.token_id_to_vocab_id[token_id] for token_id in token_ids]
def decode(self, vocab_ids: list[int]) -> str:
"""Decodes a list of custom vocab IDs into a string."""
token_ids = [self.vocab_id_to_token_id[vocab_id] for vocab_id in vocab_ids]
# decoded_string = self.tokenizer.decode(token_ids, skip_special_tokens=True)
decoded_string = self.tokenizer.decode(token_ids, skip_special_tokens=False)
return decoded_string
# Build the dictionary for our tokenizer
from tqdm import tqdm, trange
tokenizer_wrapper = TokenizerWrapper()
tokenizer_wrapper.build_dictionary(recipes_df["name"].to_list())
recipe_tokens = tokenizer_wrapper.tokenize(recipes_df['name'].iloc[10])
decoded_recipe = tokenizer_wrapper.decode(recipe_tokens)
print('Recipe:', recipes_df['name'].iloc[10])
print('Tokens:', recipe_tokens)
print('Decoded recipe:', decoded_recipe)
Recipe: roast teriyaki broccoli
Tokens: [1, 90, 91, 28, 92, 93, 33, 94, 95, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Decoded recipe: [CLS] roast teriyaki broccoli [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
vocab_size = tokenizer_wrapper.get_vocab_size()
vocab_size
3699
❓❓ Questions for you#
Shouldn’t we just have a few meaningful indices above? What’s going on?
Why might we want to build a smaller custom vocabulary from our dataset instead of using the full vocabulary from a large pre-trained model?
What do you think the impact would be on memory usage?
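If you want to explore the last question concretely, here is a rough back-of-the-envelope sketch. It assumes d_model = 256 (the value used later in this notebook) and float32 weights; note that both the embedding layer and the final output layer scale with the vocabulary size.
# Rough comparison: embedding-table size with the full BERT vocabulary vs. our custom vocabulary.
# Assumes d_model = 256 and float32 parameters (4 bytes each).
d_model = 256
full_vocab = tokenizer_wrapper.tokenizer.vocab_size  # full bert-base-cased vocabulary
custom_vocab = tokenizer_wrapper.get_vocab_size()    # vocabulary built from our recipe titles
for name, v in [("full BERT vocab", full_vocab), ("custom vocab", custom_vocab)]:
    n_params = v * d_model
    print(f"{name}: {v} tokens -> {n_params:,} embedding parameters (~{n_params * 4 / 1e6:.1f} MB)")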
Dataset preparation#
We split the dataset into training and test sets and convert each recipe name into a token sequence.
def build_data(data_df, tokenizer_wrapper):
dataset = []
for row_id in trange(len(data_df)):
        recipe_tokens = torch.tensor(tokenizer_wrapper.tokenize(data_df['name'].iloc[row_id]))
        dataset.append({'token': recipe_tokens})
return dataset
Let’s create train and test datasets by calling build_data on the train and test splits.
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(recipes_df, test_size=0.2, random_state=123)
train_data = build_data(train_df, tokenizer_wrapper)
test_data = build_data(test_df, tokenizer_wrapper)
100%|███████████████████████████████████| 8000/8000 [00:00<00:00, 19073.31it/s]
100%|███████████████████████████████████| 2000/2000 [00:00<00:00, 23614.18it/s]
train_data[:5]
[{'token': tensor([ 1, 304, 110, 342, 1229, 2, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0])},
{'token': tensor([ 1, 54, 61, 161, 48, 251, 69, 443, 2, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])},
{'token': tensor([ 1, 588, 665, 788, 1095, 831, 40, 1027, 405, 1120, 106, 31,
2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0])},
{'token': tensor([ 1, 99, 198, 336, 223, 1316, 2, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0])},
{'token': tensor([ 1, 1273, 59, 2, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0])}]
Custom PyTorch dataset and batching#
We define a PytorchDataset class to provide input-target token sequences for autoregressive training. We prepare the input and target so that the model predicts the next token given the previous ones.
class PytorchDataset(Dataset):
def __init__(self, data, pad_vocab_id=0):
self.data = data
self.pad_tensor = torch.tensor([pad_vocab_id])
def __len__(self):
return len(self.data)
def __getitem__(self, ind):
        # Build the target by shifting the input one position to the left:
        # drop the first token and append a padding token at the end.
target_sequence = torch.cat([self.data[ind]['token'][1:], self.pad_tensor])
return self.data[ind]['token'], target_sequence
train_dataset = PytorchDataset(train_data)
test_dataset = PytorchDataset(test_data)
train_dataloader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=50, shuffle=False)
Now let’s get a batch of data from the DataLoader.
train_text, train_target = next(iter(train_dataloader))
train_text = train_text.to(device)
train_text.shape
torch.Size([64, 25])
train_text[0]
tensor([ 1, 48, 267, 645, 113, 968, 1491, 1897, 2, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0], device='mps:0')
train_target[0]
tensor([ 48, 267, 645, 113, 968, 1491, 1897, 2, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0])
tokenizer_wrapper.decode(train_text[0].tolist())
'[CLS] carrot apple chicken nuggets [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]'
tokenizer_wrapper.decode(train_target[0].tolist())
'carrot apple chicken nuggets [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]'
The target is shifted one position to the left for autoregressive training.
Transformer Decoder Model#
We are now ready to define a transformer-based decoder-only model with positional encoding to generate text.
Let’s begin with positional encoding. Transformers don’t have any built-in notion of word order (unlike RNNs), so we need to explicitly tell the model the position of each word in the sequence.
In the interest of time, we won’t dive deep into the math, but we’ll use a standard implementation inspired by the Attention Is All You Need paper.
The code below adds these position signals to token embeddings so the model can learn not just what the tokens are, but where they appear in the sequence.
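For reference, the sinusoidal encoding from the paper assigns position $pos$ and embedding dimensions $2i$ and $2i+1$ the values
$$
\mathrm{PE}_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad
\mathrm{PE}_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right),
$$
so each position receives a unique pattern of sines and cosines at different frequencies.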
# The PositionalEncoding model is already defined for you. Do not change this class.
# We'll use this class in this exercise as well as the next exercise.
class PositionalEncoding(nn.Module):
"""
Implements sinusoidal positional encoding as described in "Attention is All You Need".
Args:
d_model (int): Dimension of the embedding space.
dropout (float): Dropout rate after adding positional encodings.
max_len (int): Maximum length of supported input sequences.
"""
def __init__(self, d_model: int, dropout: float = 0.1, max_len: int = 5000):
super().__init__()
self.dropout = nn.Dropout(p=dropout)
# Create a (max_len, 1) position tensor: [[0], [1], ..., [max_len-1]]
positions = torch.arange(max_len).unsqueeze(1)
# Compute the scaling terms for each dimension (even indices only)
scale_factors = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
# Initialize the positional encoding matrix with shape (max_len, 1, d_model)
pe = torch.zeros(max_len, 1, d_model)
pe[:, 0, 0::2] = torch.sin(positions * scale_factors) # Apply sine to even indices
pe[:, 0, 1::2] = torch.cos(positions * scale_factors) # Apply cosine to odd indices
# Register as buffer (not a trainable parameter)
self.register_buffer("pe", pe)
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""
Adds positional encoding to the input tensor.
Args:
x (torch.Tensor): Input tensor of shape (seq_len, batch_size, d_model)
Returns:
torch.Tensor: Tensor with positional encoding added.
"""
seq_len = x.size(0)
x = x + self.pe[:seq_len]
return self.dropout(x)
Model architecture#
Now we’re ready to define our model architecture! It’s going to include several key components that work together to generate text one token at a time:
nn.Embedding layer: turns token IDs into dense vector representations.
PositionalEncoding: adds information about the position of each token in the sequence.
TransformerDecoder: the core of the model that processes the input using attention mechanisms.
Causal mask: ensures the model only attends to earlier positions when generating text, so it doesn’t “peek ahead”.
Output layer (nn.Linear): maps decoder outputs to vocabulary logits so we can predict the next token.
Weight initialization: helps the model start training with reasonable values instead of random chaos.
We’ll walk through each part step by step in the code below.
class RecipeGenerator(nn.Module):
def __init__(self, d_model, n_heads, num_layers, vocab_size, device, dropout=0.1):
"""
Initialize the RecipeGenerator which uses a transformer decoder architecture
for generating recipes.
Parameters:
d_model (int): The number of expected features in the encoder/decoder inputs.
n_heads (int): The number of heads in the multiheadattention models.
num_layers (int): The number of sub-decoder-layers in the transformer.
vocab_size (int): The size of the vocabulary.
device (torch.device): The device on which the model will be trained.
dropout (float): The dropout value used in PositionalEncoding and TransformerDecoderLayer.
"""
super(RecipeGenerator, self).__init__()
self.d_model = d_model
self.device = device
# Embedding layer for converting input text tokens into vectors
        self.text_embedding = nn.Embedding(vocab_size, d_model)
# Positional Encoding to add position information to input embeddings
self.pos_encoding = PositionalEncoding(d_model=d_model, dropout=dropout)
# Define the Transformer decoder
        decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=n_heads, dropout=dropout)
self.TransformerDecoder = nn.TransformerDecoder(
decoder_layer,
num_layers=num_layers
)
# Final linear layer to map the output of the transformer decoder to vocabulary size
self.linear_layer = nn.Linear(d_model, vocab_size)
# Initialize the weights of the model
self.init_weights()
def init_weights(self):
"""
Initialize weights of the model to small random values.
"""
initrange = 0.1
self.text_embedding.weight.data.uniform_(-initrange, initrange)
self.linear_layer.bias.data.zero_()
self.linear_layer.weight.data.uniform_(-initrange, initrange)
def forward(self, text):
        # Get the embedded input
encoded_text = self.embed_text(text)
# Get transformer output
transformer_output = self.decode(encoded_text)
# Final linear layer (unembedding layer)
return self.linear_layer(transformer_output)
def embed_text(self, text):
embedding = self.text_embedding(text) * math.sqrt(self.d_model)
return self.pos_encoding(embedding.permute(1, 0, 2))
def decode(self, encoded_text):
        # Get the length of the sequences to be decoded. This is needed to generate the causal mask.
seq_len = encoded_text.size(0)
causal_mask = self.generate_mask(seq_len)
dummy_memory = torch.zeros_like(encoded_text)
return self.TransformerDecoder(tgt=encoded_text, memory=dummy_memory, tgt_mask=causal_mask)
def generate_mask(self, size):
mask = torch.triu(torch.ones(size, size, device=self.device), 1)
return mask.float().masked_fill(mask == 1, float('-inf'))
# A quick demo of the causal mask for a sequence of length 10:
# the -inf entries mark future positions the model is not allowed to attend to.
size = 10
mask = torch.triu(torch.ones(size, size), 1)
mask.float().masked_fill(mask == 1, float('-inf'))
tensor([[0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],
[0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],
[0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf],
[0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf],
[0., 0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf],
[0., 0., 0., 0., 0., 0., -inf, -inf, -inf, -inf],
[0., 0., 0., 0., 0., 0., 0., -inf, -inf, -inf],
[0., 0., 0., 0., 0., 0., 0., 0., -inf, -inf],
[0., 0., 0., 0., 0., 0., 0., 0., 0., -inf],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
Let’s instantiate the model.
# Define the hyperparameters and initialize the model. Feel free to change these hyperparameters.
d_model = 256
n_heads = 4
num_layers = 8
model = RecipeGenerator(d_model=d_model, n_heads=n_heads, num_layers=num_layers, vocab_size=vocab_size, device=device).to(device)
Model Training#
We define the loss function and optimizer and train the model using cross-entropy loss while applying gradient clipping.
train_text
tensor([[ 1, 48, 267, ..., 0, 0, 0],
[ 1, 56, 1135, ..., 0, 0, 0],
[ 1, 142, 488, ..., 0, 0, 0],
...,
[ 1, 693, 970, ..., 0, 0, 0],
[ 1, 684, 685, ..., 0, 0, 0],
[ 1, 14, 427, ..., 0, 0, 0]], device='mps:0')
train_text.shape
torch.Size([64, 25])
# pass inputs to your model
output = model(train_text)
output.shape
torch.Size([25, 64, 3699])
vocab_size
3699
def trainer(
model,
criterion,
optimizer,
train_dataloader,
test_dataloader,
epochs=5,
patience=5,
clip_norm=1.0
):
"""
Trains and evaluates the transformer model over multiple epochs using the provided dataloaders.
Args:
model: The Transformer model to train.
criterion: Loss function (e.g., CrossEntropyLoss).
optimizer: Optimizer (e.g., Adam).
train_dataloader: DataLoader for training data.
test_dataloader: DataLoader for validation data.
epochs: Number of training epochs.
patience: Early stopping patience – stop if validation loss increases `patience` times in a row.
clip_norm: Maximum norm for gradient clipping to avoid exploding gradients.
Returns:
train_losses: List of average training losses for each epoch.
test_losses: List of average test losses for each epoch.
"""
train_losses = []
test_losses = []
early_stopping_counter = 0
for epoch in range(epochs):
# Training phase
model.train()
total_train_loss = 0
for batch_inputs, batch_targets in train_dataloader:
# Move inputs and targets to the correct device (GPU or CPU)
batch_inputs, batch_targets = batch_inputs.to(device), batch_targets.to(device)
optimizer.zero_grad()
# Forward pass
predictions = model(batch_inputs) # shape: (seq_len, batch_size, vocab_size)
predictions = predictions.permute(1, 2, 0) # shape: (batch_size, vocab_size, seq_len)
loss = criterion(predictions, batch_targets)
loss.backward()
# Clip gradients to prevent exploding gradients
torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
optimizer.step()
total_train_loss += loss.item()
avg_train_loss = total_train_loss / len(train_dataloader)
train_losses.append(avg_train_loss)
# Evaluation phase
model.eval()
total_test_loss = 0
with torch.no_grad():
for batch_inputs, batch_targets in test_dataloader:
batch_inputs, batch_targets = batch_inputs.to(device), batch_targets.to(device)
predictions = model(batch_inputs).permute(1, 2, 0)
loss = criterion(predictions, batch_targets)
total_test_loss += loss.item()
avg_test_loss = total_test_loss / len(test_dataloader)
test_losses.append(avg_test_loss)
print(f"Epoch {epoch+1}: Train Loss = {avg_train_loss:.4f}, Test Loss = {avg_test_loss:.4f}")
# Early stopping check
if epoch > 0 and avg_test_loss > test_losses[-2] * (1 + 1e-5):
early_stopping_counter += 1
else:
early_stopping_counter = 0
if early_stopping_counter >= patience:
print(f"Early stopping triggered at epoch {epoch+1}")
break
return train_losses, test_losses
# Define the optimizer and the loss function. Feel free to change the hyperparameters.
num_epoch = 20
clip_norm = 1.0
lr = 5e-5
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
criterion = torch.nn.CrossEntropyLoss(ignore_index=0) # Ignore the padding index
train_losses, test_losses = trainer(model, criterion, optimizer, train_dataloader, test_dataloader, epochs=num_epoch)
Epoch 1: Train Loss = 6.9585, Test Loss = 6.3770
Epoch 2: Train Loss = 6.0016, Test Loss = 5.6101
Epoch 3: Train Loss = 5.3564, Test Loss = 5.1428
Epoch 4: Train Loss = 4.9545, Test Loss = 4.8622
Epoch 5: Train Loss = 4.6845, Test Loss = 4.6780
Epoch 6: Train Loss = 4.4856, Test Loss = 4.5558
Epoch 7: Train Loss = 4.3282, Test Loss = 4.4594
Epoch 8: Train Loss = 4.1998, Test Loss = 4.3868
Epoch 9: Train Loss = 4.0888, Test Loss = 4.3136
Epoch 10: Train Loss = 3.9922, Test Loss = 4.2542
Epoch 11: Train Loss = 3.9038, Test Loss = 4.2114
Epoch 12: Train Loss = 3.8230, Test Loss = 4.1818
Epoch 13: Train Loss = 3.7493, Test Loss = 4.1433
Epoch 14: Train Loss = 3.6828, Test Loss = 4.1118
Epoch 15: Train Loss = 3.6258, Test Loss = 4.0892
Epoch 16: Train Loss = 3.5637, Test Loss = 4.0803
Epoch 17: Train Loss = 3.5074, Test Loss = 4.0499
Epoch 18: Train Loss = 3.4555, Test Loss = 4.0367
Epoch 19: Train Loss = 3.4072, Test Loss = 4.0278
Epoch 20: Train Loss = 3.3545, Test Loss = 4.0141
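To see the trend at a glance, you can plot the loss lists returned by trainer. Below is a minimal sketch, assuming matplotlib is available in your environment.
import matplotlib.pyplot as plt
# Plot the per-epoch losses returned by trainer() to inspect convergence.
plt.plot(train_losses, label="train loss")
plt.plot(test_losses, label="test loss")
plt.xlabel("epoch")
plt.ylabel("cross-entropy loss")
plt.legend()
plt.show()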
Recipe Generation#
We generate a new recipe by sampling tokens one by one from the trained model.
def generate_recipe(model, device, max_recipe_length=39, seed=[206], end_vocab=2):
"""
    Generates a recipe title using the specified model and device.
Parameters:
model (torch.nn.Module): The trained model used for generating tokens.
device (torch.device): Device to run the model on.
max_recipe_length (int): Maximum number of tokens to generate.
seed (list[int]): A list of one or more token IDs to start generation with.
end_vocab (int): Token ID that indicates the end of the sequence.
Returns:
numpy.ndarray: A 1D array of token IDs representing the generated recipe.
"""
# Ensure seed is a list and convert to tensor of shape [1, len(seed)]
context = torch.tensor([seed], device=device)
# Generate tokens until max length or end token is reached
for _ in range(max_recipe_length - len(seed)): # subtract len(seed) to cap total length
logits = model(context)[-1] # Get logits for the last position
probabilities = torch.softmax(logits, dim=-1).flatten(start_dim=1)
next_vocab = torch.multinomial(probabilities, num_samples=1)
context = torch.cat([context, next_vocab], dim=1)
if next_vocab.item() == end_vocab:
break
return context.cpu().numpy().flatten()
recipe = generate_recipe(model, device, max_recipe_length=20)
generated_recipe = tokenizer_wrapper.decode(recipe)
generated_recipe
'chocolate chip chocolate frosting [SEP]'
The generation quality might not be great, but the purpose here is to demonstrate the different components involved in text generation with transformers.
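If you want to experiment further, a common tweak is to scale the logits by a temperature before sampling: lower temperatures make the model pick high-probability tokens more often, while higher temperatures produce more varied (and noisier) titles. The sketch below is an illustrative variation on generate_recipe, not part of the original demo; the temperature parameter and the helper name are assumptions.
# Illustrative sketch (not part of the demo above): temperature-scaled sampling.
def generate_with_temperature(model, device, temperature=0.8, max_len=20, seed=[206], end_vocab=2):
    model.eval()
    context = torch.tensor([seed], device=device)
    with torch.no_grad():
        for _ in range(max_len - len(seed)):
            logits = model(context)[-1] / temperature  # divide logits by the temperature before softmax
            probabilities = torch.softmax(logits, dim=-1)
            next_vocab = torch.multinomial(probabilities, num_samples=1)
            context = torch.cat([context, next_vocab], dim=1)
            if next_vocab.item() == end_vocab:
                break
    return tokenizer_wrapper.decode(context.cpu().numpy().flatten().tolist())

# Sample a few titles; expect short, sometimes nonsensical output from this small model.
for _ in range(3):
    print(generate_with_temperature(model, device))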