Lecture 5: Topic Modeling#
UBC Master of Data Science program, 2023-24
Instructor: Varada Kolhatkar
Lecture plan, imports, LO#
Imports#
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
pd.set_option("display.max_colwidth", 200)
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
Learning outcomes#
From this lesson you will be able to
Explain the general idea of topic modeling.
Identify and name the two main approaches for topic modeling.
Describe the key differences between Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA).
Outline the process of data generation within an LDA topic model.
Explain the process of updating topic assignments in the LDA model at a high level.
Explain the importance of preprocessing in topic modeling.
Visualize the topics extracted through topic modeling.
Interpret the output of topic modeling.
Name a few possible ways to evaluate a topic model.
Use coherence scores as a guiding tool to choose the number of topics.
Use the Top2Vec package for topic modeling.
1. Why topic modeling?#
1.1 Topic modeling motivation#
Topic modeling introduction activity (~5 mins)
Consider the following documents.
toy_df = pd.read_csv("data/toy_clustering.csv")
toy_df
text | |
---|---|
0 | famous fashion model |
1 | elegant fashion model |
2 | fashion model at famous probabilistic topic model conference |
3 | fresh elegant fashion model |
4 | famous elegant fashion model |
5 | probabilistic conference |
6 | creative probabilistic model |
7 | model diet apple kiwi nutrition |
8 | probabilistic model |
9 | kiwi health nutrition |
10 | fresh apple kiwi health diet |
11 | health nutrition |
12 | fresh apple kiwi juice nutrition |
13 | probabilistic topic model conference |
14 | probabilistic topic model |
Discuss the following questions with your neighbour and write your answers in this Google doc.
Suppose you are asked to identify topics discussed in these documents manually. How many topics would you identify?
What are the prominent words associated with each topic?
Are there documents which are a mixture of multiple topics?
Humans are pretty good at reading and understanding a document and answering questions such as
What is it about?
Which documents is it related to?
What if you’re given a large collection of documents on a variety of topics?
Example: A corpus of news articles
Example: A corpus of food magazines
Example: A corpus of scientific articles
(Credit: Dave Blei’s presentation)
It would take years to read all documents and organize and categorize them so that they are easy to search.
You need an automated way
to get an idea of what’s going on in the data or
to pull documents related to a certain topic
Topic modeling gives you an ability to summarize the major themes in a large collection of documents (corpus).
Example: The major themes in a collection of news articles could be
politics
entertainment
sports
technology
…
Topic modeling is a great EDA tool to get a sense of what’s going on in a large corpus.
Some examples
If you want to pull documents related to a particular lawsuit.
You want to examine people’s sentiment towards a particular candidate and/or political party, so you want to pull tweets or Facebook posts related to the election.
1.2 How do you do topic modeling?#
A common tool to solve such problems is unsupervised ML methods.
Given the hyperparameter \(K\), the goal of topic modeling is to describe a set of documents using \(K\) “topics”.
In the unsupervised setting, the input of topic modeling is
A large collection of documents
A value for the hyperparameter \(K\) (e.g., \(K = 3\))
and the output is
Topic-words association
For each topic, what words describe that topic?
Document-topics association
For each document, what topics are expressed by the document?
Topic modeling: example 1
Topic-words association
For each topic, what words describe that topic?
A topic is a mixture of words.
Document-topics association
For each document, what topics are expressed by the document?
A document is a mixture of topics.
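To make these two outputs concrete, here is a tiny made-up illustration (the topics, words, and probabilities below are invented purely for exposition, not produced by any model):
# Made-up example of topic modeling output
# Topic-words association: each topic is a distribution over words
topic_words = {
    "topic 0": {"game": 0.09, "team": 0.07, "player": 0.06},     # a human might label this "sports"
    "topic 1": {"election": 0.08, "party": 0.07, "vote": 0.05},  # "politics"
    "topic 2": {"film": 0.10, "music": 0.06, "actor": 0.05},     # "entertainment"
}
# Document-topics association: each document is a distribution over topics
document_topics = {
    "doc 0": {"topic 0": 0.85, "topic 1": 0.10, "topic 2": 0.05},
    "doc 1": {"topic 0": 0.05, "topic 1": 0.15, "topic 2": 0.80},
}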
Topic modeling example 2: Scientific papers
Topic modeling: Input
Credit: David Blei’s presentation
Topic modeling: output
(Credit: David Blei’s presentation)
Assigning labels is a human thing.
(Credit: David Blei’s presentation)
Topic modeling example 3: LDA topics in Yale Law Journal
(Credit: David Blei’s paper)
Topic modeling example 4: LDA topics in social media
(Credit: Health topics in social media)
Based on the tools in your toolbox what would you use for topic modeling?
There are two main approaches for topic modeling
Topic modeling as matrix factorization (LSA)
Latent Dirichlet Allocation (LDA)
Topic modeling as matrix factorization
We have seen this before in DSCI 563!
You can think of topic modeling as a matrix factorization problem: \(X_{n \times d} \approx Z_{n \times k} W_{k \times d}\)
Where
\(n \rightarrow \) Number of documents
\(k \rightarrow \) Number of topics
\(d \rightarrow \) Number of features (e.g., the size of vocabulary)
\(Z\) gives us document-topic assignments and \(W\) gives us topic-word assignments.
This is LSA. We used sklearn’s TruncatedSVD for matrix factorization using SVD (without centering the data). The idea is to reduce the dimensionality of the data to extract some semantically meaningful components.
Let’s try LSA on the toy dataset above.
toy_df
text | |
---|---|
0 | famous fashion model |
1 | elegant fashion model |
2 | fashion model at famous probabilistic topic model conference |
3 | fresh elegant fashion model |
4 | famous elegant fashion model |
5 | probabilistic conference |
6 | creative probabilistic model |
7 | model diet apple kiwi nutrition |
8 | probabilistic model |
9 | kiwi health nutrition |
10 | fresh apple kiwi health diet |
11 | health nutrition |
12 | fresh apple kiwi juice nutrition |
13 | probabilistic topic model conference |
14 | probabilistic topic model |
Let’s get BOW representation of the documents.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(stop_words="english")
bow = cv.fit_transform(toy_df["text"]).toarray()
bow_df = pd.DataFrame(bow, columns=cv.get_feature_names_out(), index=toy_df["text"])
bow_df
apple | conference | creative | diet | elegant | famous | fashion | fresh | health | juice | kiwi | model | nutrition | probabilistic | topic | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
text | |||||||||||||||
famous fashion model | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
elegant fashion model | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
fashion model at famous probabilistic topic model conference | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 2 | 0 | 1 | 1 |
fresh elegant fashion model | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
famous elegant fashion model | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
probabilistic conference | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
creative probabilistic model | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
model diet apple kiwi nutrition | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
probabilistic model | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
kiwi health nutrition | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
fresh apple kiwi health diet | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
health nutrition | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
fresh apple kiwi juice nutrition | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 |
probabilistic topic model conference | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 |
probabilistic topic model | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 |
Let’s extract latent features from this data.
from sklearn.pipeline import make_pipeline
lsa_pipe = make_pipeline(
CountVectorizer(stop_words="english"), TruncatedSVD(n_components=3)
)
Z = lsa_pipe.fit_transform(toy_df["text"]);
pd.DataFrame(
np.round(Z, 4),
columns=["latent topic 1 ", "latent topic 2", "latent topic 3"],
index=toy_df["text"],
)
latent topic 1 | latent topic 2 | latent topic 3 | |
---|---|---|---|
text | |||
famous fashion model | 1.3170 | -0.1541 | 0.7692 |
elegant fashion model | 1.2495 | -0.0921 | 0.9877 |
fashion model at famous probabilistic topic model conference | 2.8596 | -0.4921 | -0.3107 |
fresh elegant fashion model | 1.3314 | 0.2235 | 1.1163 |
famous elegant fashion model | 1.4893 | -0.1560 | 1.2093 |
probabilistic conference | 0.5718 | -0.2149 | -0.8377 |
creative probabilistic model | 1.1374 | -0.1828 | -0.5709 |
model diet apple kiwi nutrition | 0.9899 | 1.6467 | -0.1801 |
probabilistic model | 1.0893 | -0.1682 | -0.4951 |
kiwi health nutrition | 0.1632 | 1.2857 | -0.2039 |
fresh apple kiwi health diet | 0.3019 | 1.8677 | -0.0742 |
health nutrition | 0.0888 | 0.7542 | -0.1332 |
fresh apple kiwi juice nutrition | 0.3019 | 1.8677 | -0.0742 |
probabilistic topic model conference | 1.5426 | -0.3380 | -1.0799 |
probabilistic topic model | 1.3320 | -0.2547 | -0.7839 |
Let’s examine the components learned by LSA.
How much variance is covered by the components?
lsa_pipe.named_steps["truncatedsvd"].explained_variance_ratio_.sum()
0.6558779404028763
W = lsa_pipe.named_steps['truncatedsvd'].components_
W
array([[ 0.06747017, 0.21056851, 0.04815134, 0.05468907, 0.17230783,
0.23986307, 0.34912148, 0.0819255 , 0.02344878, 0.01278109,
0.07437837, 0.72804564, 0.06535676, 0.3612223 , 0.24275193],
[ 0.42900399, -0.08329132, -0.01457154, 0.28013182, -0.00196643,
-0.06394139, -0.05347004, 0.31555664, 0.31147456, 0.14887217,
0.53148783, -0.03665847, 0.44273421, -0.13157525, -0.08646649],
[-0.04364109, -0.29598848, -0.07583183, -0.03377992, 0.44010496,
0.22151674, 0.50099603, 0.1285553 , -0.05464361, -0.00986116,
-0.07072982, 0.04664235, -0.07856237, -0.54170558, -0.28884399]])
print("W.shape: {}".format(W.shape))
vocab = lsa_pipe.named_steps['countvectorizer'].get_feature_names_out()
W.shape: (3, 15)
import sys
sys.path.append("code/.")
from plotting_functions import *
plot_lda_w_vectors(W, ['topic 0', 'topic 1', 'topic 2'], vocab, width=800, height=600)
Let’s print the topics:
W_sorted = np.argsort(W, axis=1)[:, ::-1]
feature_names = np.array(
lsa_pipe.named_steps["countvectorizer"].get_feature_names_out()
)
print_topics(
topics=[0, 1, 2], feature_names=feature_names, sorting=W_sorted, n_words=5
)
topic 0 topic 1 topic 2
-------- -------- --------
model kiwi fashion
probabilistic nutrition elegant
fashion apple famous
topic fresh fresh
famous health model
Our features are word counts.
LSA has learned useful “concepts” or latent features from word count features.
2. Topic modeling with Latent Dirichlet Allocation (LDA)#
Attribution: Material and presentation in the next slides is adapted from Jordan Boyd-Graber’s excellent material on LDA.
Another common approach to topic modeling
A Bayesian, probabilistic, and generative approach
Similar to Gaussian Mixture Models (GMMs), you can think of this as a procedure to generate the data
Developed by David Blei and colleagues in 2003.
One of the most cited papers in the last 20 years.
Produces topics which are more interpretable and distinct from each other.
DISCLAIMER
We won’t go into the math because it’s a bit involved and we do not have the time.
My goal is to give you an intuition of the model and show you how to use it to solve topic modeling problems.
Suppose you are asked to create a probabilistic generative story of how our toy data came to be.
toy_df
text | |
---|---|
0 | famous fashion model |
1 | elegant fashion model |
2 | fashion model at famous probabilistic topic model conference |
3 | fresh elegant fashion model |
4 | famous elegant fashion model |
5 | probabilistic conference |
6 | creative probabilistic model |
7 | model diet apple kiwi nutrition |
8 | probabilistic model |
9 | kiwi health nutrition |
10 | fresh apple kiwi health diet |
11 | health nutrition |
12 | fresh apple kiwi juice nutrition |
13 | probabilistic topic model conference |
14 | probabilistic topic model |
2.1 Possible approaches to generate documents#
LDA is a generative model, which means that it comes up with a story that tells us from what distributions our data might have been generated. In other words, our goal is to effectively convey the process of generating observable data (documents, in this case) from hidden or latent variables (topics and their distributions within documents).
Let’s brainstorm some possible ways documents might have been generated.
Let’s assume all documents have \(d\) words and word order doesn’t matter.
Approach 1: Multinomial distribution of words
Assume that each word \(w_j\) comes from a multinomial distribution over all words.
To generate a document with \(d\) words, sample \(d\) words from the multinomial distribution over words.
Drawback: Misses that some documents are about different topics
Want: We want the word distribution to depend upon the topic
Approach 2: Mixture of multinomial distributions of words
How about using a mixture model of “topics”?
Each component of a mixture has its own multinomial distribution over words
To generate a document with \(d\) words
sample a topic \(z\) from the mixture of topics
sample a word \(d\) times from topic \(z\)’s distribution
Drawback: Misses that documents usually contain multiple topics
Approach 3: Multi-topic mixture of multinomial
Let’s introduce a vector \(\theta\) of topic proportions associated with each document
To generate a document with \(d\) words given probabilities \(\theta\)
sample \(d\) topics from \(\theta\)
for each sampled topic, sample a word from the topic distribution
How do we compute \(\theta\) for a new document?
Approach 4: Latent Dirichlet Allocation (LDA)
LDA puts a prior on topic proportions \(\theta\)
To generate a document with \(d\) words
sample document topic proportions \(\theta\) from its prior
sample \(d\) topics from \(\theta\)
for each sampled topic, sample a word from the topic distribution.
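Here is a minimal NumPy sketch of this generative story for a single document. The vocabulary, number of topics, prior value, and topic-word distributions below are all made up for illustration:
import numpy as np

rng = np.random.default_rng(0)

vocab = ["fashion", "model", "probabilistic", "topic", "kiwi", "nutrition"]
K, d = 2, 5        # number of topics and words per document (illustrative)
alpha = 0.5        # symmetric Dirichlet prior on topic proportions

# Made-up topic-word distributions phi (each row sums to 1)
phi = np.array([[0.40, 0.30, 0.10, 0.10, 0.05, 0.05],   # a "fashion" topic
                [0.05, 0.15, 0.30, 0.30, 0.10, 0.10]])  # a "topic modeling" topic

# 1. Sample document-topic proportions theta from the Dirichlet prior
theta = rng.dirichlet(alpha * np.ones(K))

# 2. For each word slot, sample a topic from theta, then a word from that topic
doc = []
for _ in range(d):
    z = rng.choice(K, p=theta)             # sample a topic
    w = rng.choice(len(vocab), p=phi[z])   # sample a word from that topic's distribution
    doc.append(vocab[w])

print(np.round(theta, 2), doc)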
2.2 Generative story of LDA#
The story that tells us how our data was generated.
The generative story of LDA to create Document 1 below:
Pick a topic from the topic distribution for Document 1.
Pick a word from the selected topic’s word distribution.
Not a realistic story but a mathematically useful story.
Similar to LSA, we fix the number of topics \(K\)
Topic-word distribution: Each topic is a multinomial distribution over words
Document-topic distribution: Each document is a multinomial distribution over corpus-wide topics
(Credit: David Blei’s presentation)
Topic-word distribution
If I pick a word from the pink topic, the probability of it being life is 0.02, evolve is 0.01, and organism is 0.01.
Equivalent to the components \(W\) in LSA.
Document-topic distribution
If I pick a random word from the document, how likely is it to come from the yellow topic vs. the pink topic vs. the cyan topic?
Equivalent to the latent representation \(Z\) in LSA.
2.3 LDA inference (high-level)#
Our goal is to infer the underlying distributional structure that explains the observed data.
How can we estimate the following latent variables that characterize the underlying topic structure of the document collection?
the topic assignments for each word in a given document
the topic-word distributions
the document-topic distributions
Typically fit using Bayesian inference
Most common algorithm is MCMC
Check out Appendix D for more details.
Note
(Optional) The Dirichlet distribution is used as a prior in the Bayesian inference process for estimating the topic-word and document-topic distributions because it has some desirable properties. In particular, it is a conjugate prior: the posterior has the same form as the prior, which lets us compute the posterior distribution of the topic-word and document-topic distributions, given the observed data (i.e., the words in the documents), analytically in closed form, simplifying the inference process.
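As a quick illustration of this prior, smaller Dirichlet concentration parameters produce sparse topic proportions (most mass on a few topics) while larger ones spread mass more evenly; the values below are arbitrary:
import numpy as np

rng = np.random.default_rng(42)
K = 5  # number of topics (illustrative)

# Small concentration parameter -> sparse topic proportions
print(np.round(rng.dirichlet(0.1 * np.ones(K), size=3), 2))

# Large concentration parameter -> near-uniform topic proportions
print(np.round(rng.dirichlet(10 * np.ones(K), size=3), 2))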
How do we find the posterior distribution?
We are interested in the posterior distribution, i.e., the updated probability distribution of the topic assignments for each word in the document, given the observed words in the document and the current estimate of the topic proportions for that document
How do we find it?
Gibbs sampling (very accurate but very slow, appropriate for small datasets)
Variational inference (faster but less accurate, extension of expectation maximization, appropriate for medium to large datasets)
Next let’s look at an intuition of Gibbs sampling for topic modeling.
Main Steps in Gibbs Sampling for LDA
Initialization: Assign each word in each document randomly to a topic. This initial assignment gives us a starting point for our counts in the “word-topic” and “document-topic” count tables.
Update Word-Topic Assignment: For each word in each document, perform the following steps:
Decrease the counts in our count tables (“word-topic counts” and “document-topic counts”) for the current word and its current topic assignment.
Calculate the probability of each topic for this word, based on the current state of counts (both the document’s topic distribution and the word’s topic distribution across all documents).
Sample a new topic for the word based on these probabilities.
Increase the counts in the count tables to reflect this new topic assignment.
Iterate: Repeat the above step for each word in a pass. Multiple passes are done until the assignments converge, which means the changes in topic assignments between sweeps are minimal or below a certain threshold.
After many passes, we can estimate the document-topic distributions and the topic-word distributions from the count tables. These distributions are our model that tells us the topics present in the documents and the words associated with each topic.
Gibbs sampling example
Suppose the number of topics is \(3\) and the topics are numbered \(1, 2, 3\)
initialization: Randomly assign each word in each document to one of the topics.
The same word in the vocabulary may have different topic assignments in different instances.
With the current topic assignment, here are the topic counts in our document
And these are word topic counts in the corpus, i.e., how much each word in the document is liked by each topic.
For each word in our current document (Document 10), calculate how often that word occurs with each topic in all documents.
Note that the counts in the toy example below are made-up counts
How to update Word-Topic Assignment?
Suppose our sampled word-topic assignment is the word probabilistic in Document 10 with assigned topic 3.
Goal: update this word topic assignment.
Step 1: Decrement counts
We want to update the word topic assignment of probabilistic and Topic 3.
How often does Topic 3 occur in Document 10? Once.
How often does the word probabilistic occur with Topic 3 in the corpus? Twice.
Decrement the count of the word from the word-topic counts.
Step 2: Calculating conditional probability distribution
What is the topic preference of Document 10?
What is the preference of the word probabilistic for each topic?
Document 10 likes Topic 1
Topic 1 likes the word probabilistic – it tends to occur a lot in Topic 1.
Step 3: Updating topic assignment
So update the topic of the current word probabilistic in document 10 to topic 1
Update the document-topic and word-topic counts accordingly.
Conclusion
In one pass, the algorithm repeats the above steps for each word in the corpus
If you do this for several passes, meaningful topics emerge.
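Below is a compact NumPy sketch of collapsed Gibbs sampling for LDA on a tiny made-up corpus of word ids, following the decrement, re-sample, increment loop described above. The corpus, priors, and number of sweeps are all illustrative:
import numpy as np

rng = np.random.default_rng(42)

# Toy corpus: each document is a list of word ids (made up)
docs = [[0, 1, 2], [0, 2, 3], [3, 4, 5], [4, 5, 5]]
V = 6                     # vocabulary size
K = 2                     # number of topics
alpha, eta = 0.1, 0.01    # symmetric priors (illustrative values)

# Count tables
doc_topic = np.zeros((len(docs), K))   # document-topic counts
topic_word = np.zeros((K, V))          # word-topic counts
topic_totals = np.zeros(K)             # total words assigned to each topic

# Initialization: assign each word token a random topic
assignments = []
for d, doc in enumerate(docs):
    z_d = []
    for w in doc:
        z = rng.integers(K)
        z_d.append(z)
        doc_topic[d, z] += 1
        topic_word[z, w] += 1
        topic_totals[z] += 1
    assignments.append(z_d)

# Gibbs sweeps
for sweep in range(200):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            z_old = assignments[d][i]
            # Step 1: decrement counts for the current assignment
            doc_topic[d, z_old] -= 1
            topic_word[z_old, w] -= 1
            topic_totals[z_old] -= 1
            # Step 2: conditional probability of each topic for this word
            p = (doc_topic[d] + alpha) * (topic_word[:, w] + eta) / (topic_totals + V * eta)
            p /= p.sum()
            # Step 3: sample a new topic and increment counts
            z_new = rng.choice(K, p=p)
            assignments[d][i] = z_new
            doc_topic[d, z_new] += 1
            topic_word[z_new, w] += 1
            topic_totals[z_new] += 1

# Estimate document-topic and topic-word distributions from the final counts
theta = (doc_topic + alpha) / (doc_topic + alpha).sum(axis=1, keepdims=True)
phi = (topic_word + eta) / (topic_word + eta).sum(axis=1, keepdims=True)
print(np.round(theta, 2))   # document-topic distributions
print(np.round(phi, 2))     # topic-word distributions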
❓❓ Questions for you#
Exercise 5.1: Select all of the following statements which are True (iClicker)#
iClicker join link: https://join.iclicker.com/ZTLY
(A) Latent Dirichlet Allocation (LDA) is an unsupervised approach.
(B) The assumption in topic modeling is that different topics tend to use different words.
(C) We could use LSA for topic modeling, where \(Z\) gives us the mixture of topics per document.
(D) In LDA topic model, a document is a mixture of multiple topics.
(E) If I train a topic model on a large collection of news articles with K = 10, I would get 10 topic labels (e.g., sports, culture, politics, finance) as output.
Exercise 5.1: V’s Solutions!
A, B, C, D
Exercise 5.2: Questions for class discussion#
What’s the difference between topic modeling and document clustering using algorithms such as K-Means clustering?
Imagine that you are using LSA for topic modeling, where \(X\) is the bag-of-words representation of \(N\) documents and \(V\) features (the number of words in the vocabulary). Explain which matrix would give you the word distribution over topics and which matrix would give you the topic distribution over documents. \(X_{N \times V} = Z_{N \times K} W_{K \times V}\)
Exercise 5.2: V’s Solutions!
In clustering with K-Means, documents are grouped based on a similarity measure, and each document is assigned to a single cluster. In contrast, topic modeling produces distributions over words for each topic, as well as distributions over topics per document. Topic modeling can be thought of as a form of soft clustering, where documents are assigned to multiple topics based on their probability distributions rather than a single cluster.
\(Z\) would give distribution over topics per document and \(W\) would give distribution over words per topic.
3. Topic modeling with Python#
3.1 Topic modeling pipeline#
There are three components in the topic modeling pipeline.
Preprocess your corpus.
Train LDA using sklearn or Gensim.
Interpret your topics.
Let’s work with a slightly bigger corpus. We’ll extract text from a few Wikipedia articles and apply topic modeling on these texts.
import wikipedia
queries = [
"Artificial Intelligence",
"Supervised Machine Learning",
"Supreme Court of Canada",
"Peace, Order, and Good Government",
"Canadian constitutional law",
"ice hockey",
"Google company",
"Google litigation",
]
wiki_dict = {"wiki query": [], "text": []}
for i in range(len(queries)):
text = wikipedia.page(queries[i]).content
wiki_dict["text"].append(text)
print(len(text))
wiki_dict["wiki query"].append(queries[i])
wiki_df = pd.DataFrame(wiki_dict)
wiki_df
71729
18529
20767
16157
16998
58485
57572
19942
wiki query | text | |
---|---|---|
0 | Artificial Intelligence | Artificial intelligence (AI), in its broadest sense, is intelligence exhibited by machines, particularly computer systems. It is a field of research in computer science that develops and studies m... |
1 | Supervised Machine Learning | Supervised learning (SL) is a paradigm in machine learning where input objects (for example, a vector of predictor variables) and a desired output value (also known as human-labeled supervisory si... |
2 | Supreme Court of Canada | The Supreme Court of Canada (SCC; French: Cour suprême du Canada, CSC) is the highest court in the judicial system of Canada. It comprises nine justices, whose decisions are the ultimate applicati... |
3 | Peace, Order, and Good Government | In many Commonwealth jurisdictions, the phrase "peace, order, and good government" (POGG) is an expression used in law to express the legitimate objects of legislative powers conferred by statute.... |
4 | Canadian constitutional law | Canadian constitutional law (French: droit constitutionnel du Canada) is the area of Canadian law relating to the interpretation and application of the Constitution of Canada by the courts. All la... |
5 | ice hockey | Ice hockey (or simply hockey) is a team sport played on ice skates, usually on an ice skating rink with lines and markings specific to the sport. It belongs to a family of sports called hockey. In... |
6 | Google company | Google LLC ( , GOO-ghəl) is an American multinational corporation and technology company focusing on online advertising, search engine technology, cloud computing, computer software, quantum compu... |
7 | Google litigation | Google has been involved in multiple lawsuits over issues such as privacy, advertising, intellectual property and various Google services such as Google Books and YouTube. The company's legal depa... |
Preprocessing the corpus
Preprocessing is crucial!
Tokenization, converting text to lower case
Removing punctuation and stopwords
Discarding words with length < threshold or word frequency < threshold
Possibly lemmatization: Consider the lemmas instead of inflected forms.
Depending upon your application, restrict to specific part of speech;
For example, only consider nouns, verbs, and adjectives
We’ll use spaCy for preprocessing.
# !python -m spacy download en_core_web_md
import spacy
nlp = spacy.load("en_core_web_md")
def preprocess_spacy(
doc,
min_token_len=2,
irrelevant_pos=["ADV", "PRON", "CCONJ", "PUNCT", "PART", "DET", "ADP"],
):
"""
Given text, min_token_len, and irrelevant_pos carry out preprocessing of the text
and return a preprocessed string.
Parameters
-------------
doc : (spaCy doc object)
the spacy doc object of the text
min_token_len : (int)
min_token_length required
irrelevant_pos : (list)
a list of irrelevant pos tags
Returns
-------------
(str) the preprocessed text
"""
clean_text = []
for token in doc:
if (
token.is_stop == False # Check if it's not a stopword
and len(token) > min_token_len # Check if the word meets minimum threshold
and token.pos_ not in irrelevant_pos
): # Check if the POS is in the acceptable POS tags
lemma = token.lemma_ # Take the lemma of the word
clean_text.append(lemma.lower())
return " ".join(clean_text)
wiki_df["text_pp"] = [preprocess_spacy(text) for text in nlp.pipe(wiki_df["text"])]
wiki_df
wiki query | text | text_pp | |
---|---|---|---|
0 | Artificial Intelligence | Artificial intelligence (AI), in its broadest sense, is intelligence exhibited by machines, particularly computer systems. It is a field of research in computer science that develops and studies m... | artificial intelligence broad sense intelligence exhibit machine computer system field research computer science develop study method software enable machine perceive environment use learning inte... |
1 | Supervised Machine Learning | Supervised learning (SL) is a paradigm in machine learning where input objects (for example, a vector of predictor variables) and a desired output value (also known as human-labeled supervisory si... | supervised learning paradigm machine learning input object example vector predictor variable desire output value know human label supervisory signal train model training datum process build functi... |
2 | Supreme Court of Canada | The Supreme Court of Canada (SCC; French: Cour suprême du Canada, CSC) is the highest court in the judicial system of Canada. It comprises nine justices, whose decisions are the ultimate applicati... | supreme court canada scc french cour suprême canada csc high court judicial system canada comprise justice decision ultimate application canadian law grant permission litigant year appeal decision... |
3 | Peace, Order, and Good Government | In many Commonwealth jurisdictions, the phrase "peace, order, and good government" (POGG) is an expression used in law to express the legitimate objects of legislative powers conferred by statute.... | commonwealth jurisdiction phrase peace order good government pogg expression law express legitimate object legislative power confer statute phrase appear imperial acts parliament letters patent co... |
4 | Canadian constitutional law | Canadian constitutional law (French: droit constitutionnel du Canada) is the area of Canadian law relating to the interpretation and application of the Constitution of Canada by the courts. All la... | canadian constitutional law french droit constitutionnel canada area canadian law relate interpretation application constitution canada court law canada provincial federal conform constitution law... |
5 | ice hockey | Ice hockey (or simply hockey) is a team sport played on ice skates, usually on an ice skating rink with lines and markings specific to the sport. It belongs to a family of sports called hockey. In... | ice hockey hockey team sport play ice skate ice skating rink line marking specific sport belong family sport call hockey ice hockey oppose team use ice hockey stick control advance shoot closed vu... |
6 | Google company | Google LLC ( , GOO-ghəl) is an American multinational corporation and technology company focusing on online advertising, search engine technology, cloud computing, computer software, quantum compu... | google llc goo ghəl american multinational corporation technology company focus online advertising search engine technology cloud computing computer software quantum computing commerce consumer el... |
7 | Google litigation | Google has been involved in multiple lawsuits over issues such as privacy, advertising, intellectual property and various Google services such as Google Books and YouTube. The company's legal depa... | google involve multiple lawsuit issue privacy advertising intellectual property google service google books youtube company legal department expand 100 lawyer year business 2014 grow 400 lawyer go... |
Training LDA with sklearn
docs = wiki_df["text_pp"]
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(wiki_df["text_pp"])
from sklearn.decomposition import LatentDirichletAllocation
n_topics = 3
lda = LatentDirichletAllocation(
n_components=n_topics, learning_method="batch", max_iter=10, random_state=0
)
document_topics = lda.fit_transform(dtm)
print("lda.components_.shape: {}".format(lda.components_.shape))
lda.components_.shape: (3, 4858)
sorting = np.argsort(lda.components_, axis=1)[:, ::-1]
feature_names = np.array(vectorizer.get_feature_names_out())
print_topics(
topics=range(3),
feature_names=feature_names,
sorting=sorting,
topics_per_chunk=5,
n_words=10,
)
topic 0 topic 1 topic 2
-------- -------- --------
court learning google
law intelligence hockey
provincial algorithm player
federal machine team
government learn ice
power problem play
canada displaystyle league
justice datum game
case human company
supreme artificial puck
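Note that the rows of lda.components_ are unnormalized pseudo-counts rather than probabilities. If you want proper topic-word probability distributions from the sklearn model fitted above, you can normalize each row yourself; a quick sketch:
# Normalize each row of components_ so it sums to 1 (topic-word probabilities)
topic_word_dist = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
print(topic_word_dist.sum(axis=1))   # each row now sums to ~1

# The document-topic matrix returned by fit_transform is already normalized per document
print(document_topics.sum(axis=1))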
Training LDA with gensim
I’ll also show you how to do topic modeling with gensim, which we used for Word2Vec, because it’s more flexible.
To train an LDA model with gensim, you need
Document-term matrix
Dictionary (vocabulary)
The number of topics (\(K\)): num_topics
The number of passes: passes
Let’s first create a dictionary using corpora.Dictionary. It has two useful methods:
token2id gives the token-to-index mapping
doc2bow creates a bag-of-words representation of a given document
import gensim
from gensim.corpora import Dictionary
corpus = [doc.split() for doc in wiki_df["text_pp"].tolist()]
dictionary = Dictionary(corpus) # Create a vocabulary for the lda model
list(dictionary.token2id.items())[:4]
[('"criticism', 0),
('"generative', 1),
('0070087705.these', 2),
('0134610993', 3)]
pd.DataFrame(
dictionary.token2id.keys(), index=dictionary.token2id.values(), columns=["Word"]
)
Word | |
---|---|
0 | "criticism |
1 | "generative |
2 | 0070087705.these |
3 | 0134610993 |
4 | 127 |
... | ... |
4988 | window |
4989 | windows |
4990 | wiretap |
4991 | writer |
4992 | yuan |
4993 rows × 1 columns
Now let’s convert our corpus into a document-term matrix for LDA using dictionary.doc2bow. For each document, it stores the frequency of each token in the document in the format (token_id, frequency).
doc_term_matrix = [dictionary.doc2bow(doc) for doc in corpus]
doc_term_matrix[1][:20]
[(59, 4),
(73, 1),
(77, 3),
(78, 1),
(84, 1),
(85, 1),
(88, 3),
(121, 51),
(128, 1),
(141, 4),
(143, 1),
(157, 1),
(158, 1),
(159, 4),
(160, 3),
(173, 1),
(177, 1),
(184, 2),
(193, 2),
(207, 3)]
Now we are ready to train an LDA model with Gensim!
from gensim.models import LdaModel
num_topics = 3
lda = LdaModel(
corpus=doc_term_matrix,
id2word=dictionary,
num_topics=num_topics,
random_state=42,
passes=10,
)
We can examine the topics and topic distribution for a document in our LDA model
lda.print_topics(num_words=5) # Topics
[(0,
'0.011*"algorithm" + 0.009*"function" + 0.009*"learn" + 0.008*"learning" + 0.008*"training"'),
(1,
'0.031*"google" + 0.013*"hockey" + 0.011*"court" + 0.008*"player" + 0.008*"team"'),
(2,
'0.009*"intelligence" + 0.007*"machine" + 0.007*"problem" + 0.006*"artificial" + 0.006*"human"')]
print("Document label: ", wiki_df.iloc[1,0])
print("Topic assignment for document: ", lda[doc_term_matrix[7]]) # Topic distribution
Document label: Supervised Machine Learning
Topic assignment for document: [(1, 0.99961084)]
topic_assignment = sorted(lda[doc_term_matrix[1]], key=lambda x: x[1], reverse=True)
df = pd.DataFrame(topic_assignment, columns=["topic id", "probability"])
df
topic id | probability | |
---|---|---|
0 | 0 | 0.999517 |
new_doc = "After the court yesterday lawyers working on Google lawsuits went to a ice hockey game."
pp_new_doc = preprocess_spacy(nlp(new_doc))
pp_new_doc
'court yesterday lawyer work google lawsuit go ice hockey game'
bow_vector = dictionary.doc2bow(pp_new_doc.split())
topic_assignment = sorted(lda[bow_vector], key=lambda x: x[1], reverse=True)
df = pd.DataFrame(topic_assignment, columns=["topic id", "probability"])
df
topic id | probability | |
---|---|---|
0 | 1 | 0.931061 |
1 | 2 | 0.034830 |
2 | 0 | 0.034109 |
Main hyperparameters:
num_topics or K: The number of requested latent topics to be extracted from the training corpus.
alpha: A-priori belief on the document-topic distribution. When you set alpha as a scalar, it applies symmetrically across all topics, meaning each topic has the same prior importance in documents before observing the actual words. In such cases:
A lower alpha encourages the model to assign fewer topics to each document, leading to a sparser representation.
A higher alpha suggests that documents are likely to contain a mixture of more topics.
eta (often referred to as beta in the literature): A-priori belief on the topic-word distribution. When you set eta as a scalar, it applies symmetrically across all words, meaning each word has the same prior importance in topics before observing the actual data. In such cases:
A lower eta indicates that each topic is formed by a smaller set of words.
A higher eta allows topics to be more broadly defined by a larger set of words.
There is also an auto option for both hyperparameters alpha and eta, which learns an asymmetric prior from the data. Check the documentation.
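For example, here is how you might pass these priors to Gensim's LdaModel, reusing the doc_term_matrix and dictionary created above (the specific values are illustrative, not recommendations):
from gensim.models import LdaModel

lda_sparse = LdaModel(
    corpus=doc_term_matrix,
    id2word=dictionary,
    num_topics=3,
    alpha=0.1,    # low alpha: few topics per document
    eta=0.01,     # low eta: topics concentrated on fewer words
    passes=10,
    random_state=42,
)

# Or let Gensim learn asymmetric priors from the data
lda_auto = LdaModel(
    corpus=doc_term_matrix,
    id2word=dictionary,
    num_topics=3,
    alpha="auto",
    eta="auto",
    passes=10,
    random_state=42,
)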
3.2 A few comments on evaluation#
In topic modeling we would like each topic to have some semantic theme which is interpretable by humans.
The idea of a topic model is to tell a story to humans and that’s what we should care about and evaluate.
(Credit: Health topics in social media)
A few ways to evaluate a topic model
Eyeballing the top \(n\) words or documents with high probability for each topic.
Human judgments
Ask humans to evaluate a topic model and whether they are able to tell a coherent story about the collection of documents using the topic model.
Word intrusion
Extrinsic evaluation
If you are using topic modeling as features in tasks such as sentiment analysis or machine translation, evaluate whether topic model with the current hyperparameters improves the results of that task or not.
Quantifying topic interpretability
Coherence scores
Human judgments: Word intrusion
Word intrusion: Take high probability words from a topic. Now add a high probability word from another topic to this topic.
Likely words from old topic
dentist, appointment, doctors, tooth
Likely words from old topic + word intrusion
dentist, mom, appointment, doctors, tooth
Are we able to distinguish this odd word out from the original topic?
Coherence models provide a framework to evaluate the coherence of the topic by measuring the degree of semantic similarity between most likely words in the topic.
You can calculate coherence scores using Gensim’s CoherenceModel.
Coherence scores range from -1 to 1.
Roughly, a high coherence score means the topics are semantically interpretable and a low coherence score means they are not.
Let’s try it out on our Wikipedia corpus.
# Compute Coherence Score
from gensim.models import CoherenceModel
K = [1, 2, 3, 4, 5, 6]
coherence_scores = []
for num_topics in K:
lda = LdaModel(
corpus=doc_term_matrix,
id2word=dictionary,
num_topics=num_topics,
random_state=42,
passes=10,
)
coherence_model_lda = CoherenceModel(
model=lda, texts=corpus, dictionary=dictionary, coherence="c_v"
)
coherence_scores.append(coherence_model_lda.get_coherence())
df = pd.DataFrame(coherence_scores, index=K, columns=["Coherence score"])
df
Coherence score | |
---|---|
1 | 0.228784 |
2 | 0.361770 |
3 | 0.329059 |
4 | 0.497291 |
5 | 0.665912 |
6 | 0.557379 |
df.plot(title="Coherence scores", xlabel="num_topics", ylabel="Coherence score");
We are getting the best coherence score with num_topics = 5. That said, similar to other methods introduced as guiding tools for choosing hyperparameters, take this with a grain of salt. The model with the highest coherence score is not necessarily going to be the best topic model. It’s always a good idea to manually examine the topics produced by the chosen number of topics and check whether you are able to make sense of them before finalizing the number of topics. In the end our goal is to get human-interpretable topics, and that’s what we should focus on.
If you want to jointly optimize num_topics, alpha, and eta, you can do grid search or random search with the coherence score as the evaluation metric, as sketched below.
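Here is a minimal random-search sketch along those lines, using the coherence score as the selection metric; the candidate values are arbitrary and only meant to show the pattern:
import itertools
import random

from gensim.models import CoherenceModel, LdaModel

random.seed(42)
param_grid = {
    "num_topics": [3, 4, 5],
    "alpha": [0.1, 0.5, "symmetric"],
    "eta": [0.01, 0.1, "symmetric"],
}
candidates = list(itertools.product(*param_grid.values()))

results = []
for num_topics, alpha, eta in random.sample(candidates, k=10):
    model = LdaModel(
        corpus=doc_term_matrix,
        id2word=dictionary,
        num_topics=num_topics,
        alpha=alpha,
        eta=eta,
        passes=10,
        random_state=42,
    )
    cm = CoherenceModel(model=model, texts=corpus, dictionary=dictionary, coherence="c_v")
    results.append((num_topics, alpha, eta, cm.get_coherence()))

best = max(results, key=lambda r: r[-1])
print("Best (num_topics, alpha, eta, coherence):", best)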
(Optional) Visualize topics
You can also visualize the topics using pyLDAvis.
pip install pyLDAvis
Do not install it using conda. They have made some changes in the recent version and conda build is not available for this version yet.
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
# import gensim
# import pyLDAvis.gensim_models as gensimvis
# import pyLDAvis
# vis = gensimvis.prepare(lda, doc_term_matrix, dictionary, sort_topics=False)
# pyLDAvis.display(vis)
4. (Optional) Some recent topic modeling tools#
4.1 Top2Vec#
What are some limitations of LDA and LSA?
Need to know the number of topics in advance, which is hard for large and unfamiliar datasets
Quality of topics is very much dependent upon the quality of preprocessing
They use bag-of-words (BOW) representations of documents as input which ignore word semantics.
The words study and studying would be treated as different words despite their semantic similarity
Stemming or lemmatization can address this, but it doesn’t help recognize that words like height and tall are related.
How does Top2Vec work?
STEP 1: Create jointly embedded document and word vectors using Doc2Vec or Universal Sentence Encoder or BERT Sentence Transformer.
STEP 2: Create lower dimensional embedding of document vectors using UMAP.
STEP 3: Find dense areas of documents using HDBSCAN.
STEP 4: For each dense area calculate the centroid of document vectors in original dimension, this is the topic vector.
STEP 5: Find n-closest word vectors to the resulting topic vector.
Let’s try it out to identify topics from Ted talk titles.
talks_df = pd.read_csv('data/20221013_ted_talks.csv')
talks_df['title']
0 Averting the climate crisis
1 Simplicity sells
2 Greening the ghetto
3 The best stats you've ever seen
4 Do schools kill creativity?
...
5696 Is inequality inevitable?
5697 4 ways to design a disability-friendly future
5698 Can exercise actually "boost" your metabolism?
5699 How did they build the Great Pyramid of Giza?
5700 How to squeeze all the juice out of retirement
Name: title, Length: 5701, dtype: object
documents = talks_df['title'].to_list()
from top2vec import Top2Vec
model = Top2Vec(documents,embedding_model='distiluse-base-multilingual-cased', min_count=5)
2024-04-08 18:25:27,172 - top2vec - INFO - Pre-processing documents for training
2024-04-08 18:25:27,373 - top2vec - INFO - Downloading distiluse-base-multilingual-cased model
2024-04-08 18:25:30,472 - top2vec - INFO - Creating joint document/word embedding
2024-04-08 18:25:40,379 - top2vec - INFO - Creating lower dimension embedding of documents
2024-04-08 18:25:52,296 - top2vec - INFO - Finding dense areas of documents
2024-04-08 18:25:52,497 - top2vec - INFO - Finding topics
# How many topics has it learned?
model.get_num_topics()
64
# What are the topic sizes?
topic_sizes, topic_nums = model.get_topic_sizes()
topic_sizes
array([311, 278, 222, 204, 179, 146, 139, 133, 131, 130, 128, 122, 121,
120, 119, 118, 106, 106, 103, 101, 99, 99, 99, 96, 96, 94,
90, 88, 84, 83, 79, 76, 74, 70, 70, 69, 68, 67, 67,
64, 57, 56, 55, 55, 53, 53, 52, 51, 51, 49, 47, 47,
46, 44, 44, 39, 39, 38, 33, 32, 31, 30, 27, 23])
# What are the topic words?
topic_words, word_scores, topic_nums = model.get_topics(1)
topic_words
array([['why', 'reasons', 'reason', 'causes', 'how', 'como', 'autism',
'anxiety', 'co', 'diseases', 'habits', 'depression', 'for',
'disease', 'fear', 'what', 'hate', 'addiction', 'never',
'genetic', 'poverty', 'sick', 'instead', 'myths', 'hunger',
'fallacy', 'vaccines', 'grief', 'when', 'cancer', 'que',
'always', 'about', 'pain', 'empathy', 'tips', 'suicide', 'drugs',
'crisis', 'avoid', 'mysteries', 'para', 'vaccine', 'pandemics',
'as', 'riddle', 'ourselves', 'don', 'matters', 'strangers']],
dtype='<U14')
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
topic_words, word_scores, topic_scores, topic_nums = model.search_topics(keywords=["politics"],
num_topics=5)
for topic in topic_nums:
model.generate_topic_wordcloud(topic)
documents, document_scores, document_ids = model.search_documents_by_topic(topic_num=26, num_docs=5)
for doc, score, doc_id in zip(documents, document_scores, document_ids):
print(f"Document: {doc_id}, Score: {score}")
print("-----------")
print(doc)
print("-----------")
print()
Document: 1583, Score: 0.7956469655036926
-----------
Life that doesn't end with death
-----------
Document: 865, Score: 0.7733122110366821
-----------
Inspiring a life of immersion
-----------
Document: 2891, Score: 0.7583013772964478
-----------
A rite of passage for late life
-----------
Document: 477, Score: 0.7572578191757202
-----------
Life lessons through tinkering
-----------
Document: 7, Score: 0.7478621006011963
-----------
A life of purpose
-----------
documents, document_scores, document_ids = model.search_documents_by_keywords(keywords=["human", "love"], num_docs=5)
for doc, score, doc_id in zip(documents, document_scores, document_ids):
print(f"Document: {doc_id}, Score: {score}")
print("-----------")
print(doc)
print("-----------")
print()
Document: 264, Score: 0.6784869056838551
-----------
On humanity
-----------
Document: 5309, Score: 0.588455842303502
-----------
"Being Human"
-----------
Document: 258, Score: 0.5522820827921512
-----------
The brain in love
-----------
Document: 584, Score: 0.5464831966261261
-----------
The uniqueness of humans
-----------
Document: 4532, Score: 0.5446537580036017
-----------
The language of being human
-----------
topic_words, topic_vectors, topic_labels = model.get_topics()
topic_words
array([['why', 'reasons', 'reason', ..., 'don', 'matters', 'strangers'],
['how', 'ways', 'como', ..., 'ethical', 'paradox', 'emotional'],
['compassion', 'empathy', 'consciousness', ..., 'psychology',
'perspective', 'minds'],
...,
['why', 'read', 'reading', ..., 'tale', 'pandemics', 'quantum'],
['smartphone', 'phone', 'mobile', ..., 'watch', 'magic', 'your'],
['cars', 'driving', 'car', ..., 'surveillance', 'walk',
'themselves']], dtype='<U14')
Another recent tool to know is BERTopic, which leverages transformers.
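A minimal sketch of what getting started with BERTopic might look like, assuming the bertopic package is installed and reusing the Ted talk titles in documents (the min_topic_size value is illustrative):
from bertopic import BERTopic

topic_model = BERTopic(min_topic_size=10)             # min_topic_size is an illustrative choice
topics, probs = topic_model.fit_transform(documents)  # discover topics in the talk titles
topic_model.get_topic_info().head()                   # overview of the discovered topics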
Final comments and summary#
Important ideas to know#
Topic modeling is a tool to uncover important themes in a large collection of documents.
The overall goal is to tell a high-level story about a large collection of documents to humans.
(Credit: Health topics in social media)
Latent Dirichlet Allocation (LDA) is a commonly used model for topic modeling; it is a Bayesian, probabilistic, and generative model.
A topic is something that influences the choice of vocabulary of a document. For each document, we assume that several topics are active with varying importance.
The primary idea of the model is
A document is a mixture of topics
A topic is a mixture of words in the vocabulary
You can carry out topic modeling with LDA in Python using the Gensim library.
Preprocessing is extremely important in topic modeling.
Some of the most common steps in preprocessing for topic modeling include
Sentence segmentation, tokenization, lemmatization, stopword and punctuation removal
You can visualize a topic model using pyLDAvis.
There are more advanced and practical tools for topic modeling out there. Some examples include Top2Vec and BERTopic.
Some useful resources and links#
If you want to learn more about practical aspects of LDA