LDA details

(Optional) Plate notation

  • Used in Bayesian inference for representing variables that repeat.

  • It shows the generative process of the LDA model. It also shows the dependency structure in the probability distribution.

  • We won't go into the details, but I would like you to be familiar with this picture at a high level because you are likely to see it in the context of topic modeling.

  • \(\beta_k \rightarrow\) Distribution over words for topic \(k\)

  • \(\theta_d \rightarrow\) Distribution over topics for document \(d\)

  • \(w_n \rightarrow\) Observed word at position \(n\)

  • \(Z_n \rightarrow\) Hidden topic assigned to the word at position \(n\)

  • \(N \rightarrow\) Number of words (word positions) in a document

  • \(M \rightarrow\) Number of documents

  • \(\lambda \rightarrow\) Hyperparameter for word proportion

  • \(\alpha\rightarrow\) Hyperparameter for topic proportion

(Optional) Mathematical presentation of the generative story (plate diagram)

  • For each topic \(k \in \{1, \dots, K\}\) draw a multinomial distribution \(\beta_k\) from a Dirichlet distribution with parameter \(\lambda\).

  • For each document \(d \in \{1, \dots, M\}\), draw a multinomial distribution \(\theta_d\) from a Dirichlet distribution with parameter \(\alpha\).

  • For each word position \(n \in \{1, \dots, N\}\) in document \(d\), select a hidden topic \(Z_n\) from the multinomial distribution parameterized by \(\theta_d\).

  • Choose the observed word \(w_n\) from the distribution \(\beta_{Z_n}\).
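The generative story above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not a reference implementation: the function name `generate_corpus`, the toy sizes, the argument `vocab_size`, and the use of symmetric scalar values for \(\alpha\) and \(\lambda\) are all assumptions made for the example.

```python
import numpy as np

def generate_corpus(K=3, M=4, N=8, vocab_size=10, alpha=0.5, lam=0.1, seed=0):
    """Sample a toy corpus by following the LDA generative story.

    Assumes symmetric Dirichlet priors (scalar alpha and lam) and the same
    number of word positions N in every document.
    """
    rng = np.random.default_rng(seed)

    # One word distribution beta_k per topic, drawn from Dirichlet(lam).
    beta = rng.dirichlet(np.full(vocab_size, lam), size=K)   # shape (K, vocab_size)

    # One topic distribution theta_d per document, drawn from Dirichlet(alpha).
    theta = rng.dirichlet(np.full(K, alpha), size=M)         # shape (M, K)

    docs, topics = [], []
    for d in range(M):
        # Hidden topic Z_n for every word position n in document d.
        z = rng.choice(K, size=N, p=theta[d])
        # Observed word w_n drawn from the word distribution of topic Z_n.
        w = np.array([rng.choice(vocab_size, p=beta[k]) for k in z])
        topics.append(z)
        docs.append(w)
    return beta, theta, docs, topics
```

Calling `generate_corpus()` returns the sampled \(\beta\) and \(\theta\) matrices together with word ids and their hidden topics. In real topic modeling only the word ids are observed, and inference has to recover everything else.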


(Optional) LDA Inference

  • Infer the underlying topic structure in the documents. In particular,

    • Learn the discrete probability distributions of topics in each document

    • Learn the discrete probability distributions of words in each topic

  • We are interested in the posterior distribution \(P(z, \beta, \theta \mid w, \alpha, \lambda)\), where \(w\) denotes all the observed words.

  • Observations: words. Everything else is hidden (latent).

  • \(\lambda\): Hyperparameter for word proportion

    • High \(\lambda\) → every topic contains a mixture of most of the words

    • Low \(\lambda\) → every topic contains a mixture of only a few words

  • \(\alpha\): Hyperparameter for topic proportion

    • High \(\alpha\) → every document contains a mixture of most of the topics

    • Low \(\alpha\) → every document is representative of only a few topics
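To get a feel for how the Dirichlet hyperparameter controls sparsity, here is a small sketch that draws many document-topic distributions with a low and a high \(\alpha\) and compares how much probability mass the single largest topic receives on average. The number of topics, the two \(\alpha\) values, and the helper name `mean_dominant_share` are arbitrary choices for illustration. The same experiment applies to \(\lambda\) and the topic-word distributions.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 10  # hypothetical number of topics

def mean_dominant_share(alpha, n_docs=2000):
    """Average probability mass of the largest topic across sampled documents."""
    theta = rng.dirichlet(np.full(K, alpha), size=n_docs)  # shape (n_docs, K)
    return theta.max(axis=1).mean()

low = mean_dominant_share(0.1)    # low alpha: a few topics dominate each document
high = mean_dominant_share(10.0)  # high alpha: mass is spread over most topics

print(f"low alpha:  largest topic holds {low:.2f} of the mass on average")
print(f"high alpha: largest topic holds {high:.2f} of the mass on average")
```

With a low \(\alpha\) the largest topic should take most of the mass in a typical document. With a high \(\alpha\) its share should stay close to the uniform value \(1/K\).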

(Optional) Calculating the conditional probability

  • Gibbs sampling sets up a Markov chain that converges to the posterior distribution of the model parameters or word–topic assignments.

  • Two components

    • How much this document likes topic \(k\): \(\frac{n_{d,k} + \alpha_k}{\sum_{i=1}^{K} (n_{d,i} + \alpha_i)}\)

    • How much this topic likes word \(w_{d,n}\): \(\frac{V_{k, w_{d,n}} + \lambda_{w_{d,n}}}{\sum_i (V_{k,i} + \lambda_i)}\)

  • The conditional probability of word topic assignment given everything else in the model:

\[\frac{n_{d,k} + \alpha_k}{\sum_{i=1}^{K} (n_{d,i} + \alpha_i)} \cdot \frac{V_{k, w_{d,n}} + \lambda_{w_{d,n}}}{\sum_i (V_{k,i} + \lambda_i)}\]

  • \(n_{d,k} \rightarrow\) number of times document \(d\) uses topic \(k\)

  • \(V_{k, w_{d,n}} \rightarrow\) number of times topic \(k\) uses word type \(w_{d,n}\)

  • \(\alpha_k \rightarrow\) Dirichlet parameter for document to topic distribution

  • \(\lambda_{w_{d,n}} \rightarrow\) Dirichlet parameter for topic to word distribution
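As a small worked example, the conditional probability above can be computed directly from the two count matrices. The function name `topic_conditional`, the array layout, and the toy counts are assumptions made for this sketch. The counts are expected to already exclude the word occurrence being resampled.

```python
import numpy as np

def topic_conditional(n_dk, V_kw, d, w, alpha, lam):
    """Normalized probability of each topic for one occurrence of word w in document d.

    n_dk:  (M, K) array, number of times document d uses topic k
    V_kw:  (K, vocab) array, number of times topic k uses word type w
    alpha: (K,) Dirichlet parameters for the document-topic distribution
    lam:   (vocab,) Dirichlet parameters for the topic-word distribution
    """
    # How much document d likes each topic k.
    doc_part = (n_dk[d] + alpha) / (n_dk[d] + alpha).sum()
    # How much each topic k likes word w.
    word_part = (V_kw[:, w] + lam[w]) / (V_kw + lam).sum(axis=1)
    p = doc_part * word_part
    return p / p.sum()

# Toy counts: 1 document, 2 topics, 3 word types.
n_dk = np.array([[3.0, 1.0]])
V_kw = np.array([[2.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0]])
alpha = np.ones(2)
lam = np.ones(3)

print(topic_conditional(n_dk, V_kw, d=0, w=0, alpha=alpha, lam=lam))  # [0.8 0.2]
```

Here the document part is \((4/6, 2/6)\) and the word part is \((3/6, 1/4)\), so the unnormalized products are \((1/3, 1/12)\), which normalize to \((0.8, 0.2)\).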

(Optional) LDA algorithm

  • Suppose \(K\) is the number of topics

  • For each iteration \(i\)

    • For each document \(d\) and word \(n\) currently assigned to topic \(Z_{old}\)

      • Decrement \(n_{d,Z_{old}}\) and \(V_{Z_{old}, w_{d,n}}\)

      • Sample \(Z_{new} = k\) with probability proportional to \(\frac{n_{d,k} + \alpha_k}{\sum_{i=1}^{K} (n_{d,i} + \alpha_i)} \cdot \frac{V_{k, w_{d,n}} + \lambda_{w_{d,n}}}{\sum_i (V_{k,i} + \lambda_i)}\)

      • Increment \(n_{d, Z_{new}}\) and \(V_{Z_{new}, w_{d,n}}\)
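The loop above can be sketched as a short Gibbs sampler in NumPy. Several things here are assumptions rather than part of the notes: the function name `gibbs_lda`, the random initialization of topic assignments before the first iteration, symmetric scalar values for \(\alpha\) and \(\lambda\), and documents represented as lists of integer word ids. The document-side denominator is the same for every \(k\), so the sketch drops it before normalizing.

```python
import numpy as np

def gibbs_lda(docs, K, vocab_size, alpha=0.1, lam=0.01, n_iter=50, seed=0):
    """Toy Gibbs sampler for LDA; docs is a list of lists of word ids."""
    rng = np.random.default_rng(seed)
    M = len(docs)
    n_dk = np.zeros((M, K))           # times document d uses topic k
    V_kw = np.zeros((K, vocab_size))  # times topic k uses word type w
    V_k = np.zeros(K)                 # total words assigned to topic k

    # Assumed initialization: give every word a random topic and count it.
    z = [rng.integers(K, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = z[d][n]
            n_dk[d, k] += 1
            V_kw[k, w] += 1
            V_k[k] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                # Remove the current assignment from the counts.
                k_old = z[d][n]
                n_dk[d, k_old] -= 1
                V_kw[k_old, w] -= 1
                V_k[k_old] -= 1

                # Unnormalized conditional: document part times word part.
                p = (n_dk[d] + alpha) * (V_kw[:, w] + lam) / (V_k + vocab_size * lam)
                k_new = rng.choice(K, p=p / p.sum())

                # Record the new assignment in the counts.
                z[d][n] = k_new
                n_dk[d, k_new] += 1
                V_kw[k_new, w] += 1
                V_k[k_new] += 1
    return z, n_dk, V_kw

# Tiny made-up corpus over a vocabulary of 6 word ids.
docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 5, 3]]
z, n_dk, V_kw = gibbs_lda(docs, K=2, vocab_size=6)
```

After sampling, normalizing the rows of `n_dk + alpha` gives an estimate of each document's topic distribution, and normalizing the rows of `V_kw + lam` gives an estimate of each topic's word distribution.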