# HMM supervised POS tagging

In [2]:
import os
import re
import sys

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import IPython
from IPython.display import HTML, display
from nltk.tag.hmm import HiddenMarkovModelTrainer

### Part-of-speech tagging task

- Given a text, assign part-of-speech tags to the words in the text.

- Input sentence: 
<blockquote>
    MDS students are hard-working .
</blockquote>    

- POS-tagged sentence: 
<blockquote>
    MDS/<span style="color:green">PROPER_NOUN</span> students/<span style="color:green">NOUN</span> are/<span style="color:green">VERB</span> hard-working/<span style="color:green">ADJECTIVE</span> ./<span style="color:green">PUNCTUATION</span>
</blockquote>    


In [3]:
words = ["book", "that", "flight", "like", "I", "."]
POS = ["Noun", "Verb", "Punct", "Pron"]

In [4]:
corpus = [
    [("book", "Verb"), ("that", "Pron"), ("flight", "Noun"), (".", "Punct")],
    [
        ("I", "Pron"),
        ("like", "Verb"),
        ("that", "Pron"),
        ("book", "Noun"),
        (".", "Punct"),
    ],
    [("book", "Verb"), ("flight", "Noun"), (".", "Punct")],
    [("book", "Verb"), ("like", "Noun"), ("flight", "Noun")],
    [("I", "Pron"), ("book", "Verb"), ("flight", "Noun"), (".", "Punct")],
    [
        ("I", "Pron"),
        ("like", "Verb"),
        ("that", "Pron"),
        ("book", "Noun"),
        (".", "Punct"),
    ],
]

The syntax is a bit unusual: each training sentence is a list of `(word, tag)` tuples. This is just for demonstration purposes; you're unlikely to use this when you carry out POS tagging in practice.

In [5]:
trainer = HiddenMarkovModelTrainer(POS, words)
hmm = trainer.train_supervised(
    corpus,
)

In [6]:
# Private NLTK internals, used here only to inspect the trained model.
hmm._create_cache()
P, O, X, S = hmm._cache

The cache arrays are filled as follows, where `si` is the `i`-th state:

    P[i] = self._priors.logprob(si)
    X[i, j] = self._transitions[si].logprob(self._states[j])
    O[i, k] = self._output_logprob(si, self._symbols[k])


### From the documentation: 

The cache is a tuple (P, O, X, S) where:

- S maps symbols to integers, i.e., it is the inverse mapping from `self._symbols`; for each symbol `s` in `self._symbols`, the following is true:

  `self._symbols[S[s]] == s`

- O is the log output probabilities:

  `O[i,k] = log( P(token[t]=sym[k]|tag[t]=state[i]) )`

- X is the log transition probabilities:

  `X[i,j] = log( P(tag[t]=state[j]|tag[t-1]=state[i]) )`

- P is the log prior probabilities:

  `P[i] = log( P(tag[0]=state[i]) )`


#### Mapping from observations (symbols) to integers

In [7]:
S

{'book': 0, 'that': 1, 'flight': 2, 'like': 3, 'I': 4, '.': 5}
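The inverse-mapping property quoted from the documentation can be checked with a small standalone sketch. The names `symbols` and `S` below are local stand-ins that mimic the cached objects; they are not the NLTK internals themselves.

```python
# Stand-in for the symbol list that was passed to the trainer.
symbols = ["book", "that", "flight", "like", "I", "."]

# Build the symbol -> index mapping, mirroring the cached dictionary S.
S = {symbol: index for index, symbol in enumerate(symbols)}

# Looking a symbol up in S and indexing back into the list is a round trip.
assert all(symbols[S[s]] == s for s in symbols)
print(S)
```

The integer indices are what select the columns of the output-probability matrix `O`.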

#### HMM states

In [8]:
hmm._states

['Noun', 'Verb', 'Punct', 'Pron']

#### Log prior probabilities 
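- $\pi_0$ for all states (the logs are base 2, so $-1.0$ corresponds to a probability of $0.5$ and $-\infty$ to a probability of $0$)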

In [9]:
pd.DataFrame(P, index=hmm._states, columns=["pi_0"])

| | pi_0 |
|---|---|
| Noun | -inf |
| Verb | -1.0 |
| Punct | -inf |
| Pron | -1.0 |
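As a sanity check, the prior values can be reproduced by counting which tag starts each training sentence. This is a minimal sketch, assuming plain relative-frequency estimates and base-2 logarithms (both are consistent with the values shown); `log2_or_neg_inf` and `log_priors` are illustrative names, not part of NLTK.

```python
import math
from collections import Counter

# Same toy corpus as in the training cell, written compactly.
corpus = [
    [("book", "Verb"), ("that", "Pron"), ("flight", "Noun"), (".", "Punct")],
    [("I", "Pron"), ("like", "Verb"), ("that", "Pron"), ("book", "Noun"), (".", "Punct")],
    [("book", "Verb"), ("flight", "Noun"), (".", "Punct")],
    [("book", "Verb"), ("like", "Noun"), ("flight", "Noun")],
    [("I", "Pron"), ("book", "Verb"), ("flight", "Noun"), (".", "Punct")],
    [("I", "Pron"), ("like", "Verb"), ("that", "Pron"), ("book", "Noun"), (".", "Punct")],
]
states = ["Noun", "Verb", "Punct", "Pron"]

# Count how often each tag appears in sentence-initial position.
start_counts = Counter(sentence[0][1] for sentence in corpus)


def log2_or_neg_inf(p):
    """Return log2(p), using -inf for a zero probability."""
    return math.log2(p) if p > 0 else float("-inf")


# Relative frequency of each tag as the first tag of a sentence.
log_priors = {
    tag: log2_or_neg_inf(start_counts[tag] / len(corpus)) for tag in states
}
print(log_priors)
```

Three of the six sentences start with a verb and three with a pronoun, which gives a probability of 0.5 (log value -1.0) for each; no sentence starts with a noun or punctuation, hence `-inf`.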


#### Log output probabilities

- log(P(observation | tag)) for all observations and tags. 

In [10]:
pd.DataFrame(O, index=hmm._states, columns=S.keys())

| | book | that | flight | like | I | . |
|---|---|---|---|---|---|---|
| Noun | -1.807355 | -inf | -0.807355 | -2.807355 | -inf | -inf |
| Verb | -0.584962 | -inf | -inf | -1.584962 | -inf | -inf |
| Punct | -inf | -inf | -inf | -inf | -inf | 0.0 |
| Pron | -inf | -1.0 | -inf | -inf | -1.0 | -inf |
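The output (emission) values can be recovered the same way, by counting how often each word occurs with each tag. The sketch below again assumes unsmoothed relative frequencies and base-2 logs; `emission_counts` and `log_emission` are hypothetical helper names introduced only for this illustration.

```python
import math
from collections import Counter, defaultdict

corpus = [
    [("book", "Verb"), ("that", "Pron"), ("flight", "Noun"), (".", "Punct")],
    [("I", "Pron"), ("like", "Verb"), ("that", "Pron"), ("book", "Noun"), (".", "Punct")],
    [("book", "Verb"), ("flight", "Noun"), (".", "Punct")],
    [("book", "Verb"), ("like", "Noun"), ("flight", "Noun")],
    [("I", "Pron"), ("book", "Verb"), ("flight", "Noun"), (".", "Punct")],
    [("I", "Pron"), ("like", "Verb"), ("that", "Pron"), ("book", "Noun"), (".", "Punct")],
]

# emission_counts[tag][word] = number of times `word` was tagged `tag`.
emission_counts = defaultdict(Counter)
for sentence in corpus:
    for word, tag in sentence:
        emission_counts[tag][word] += 1


def log_emission(tag, word):
    """log2 P(word | tag) by relative frequency; -inf if never observed."""
    total = sum(emission_counts[tag].values())
    count = emission_counts[tag][word]
    return math.log2(count / total) if count else float("-inf")


print(log_emission("Noun", "flight"))  # 4 of the 7 Noun tokens are "flight"
print(log_emission("Verb", "book"))    # 4 of the 6 Verb tokens are "book"
print(log_emission("Punct", "book"))   # never observed
```

Each row of the matrix is a probability distribution over words, so exponentiating a row (base 2) and summing gives 1.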


#### Log transition probabilities 

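- Transition matrix: row $i$ is the previous tag and column $j$ is the next tag, so each entry is log(P(next tag | previous tag)).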

In [11]:
pd.DataFrame(X, index=hmm._states, columns=hmm._states)

| | Noun | Verb | Punct | Pron |
|---|---|---|---|---|
| Noun | -2.584963 | -inf | -0.263034 | -inf |
| Verb | -1.0 | -inf | -inf | -1.0 |
| Punct | -inf | -inf | -inf | -inf |
| Pron | -1.0 | -1.0 | -inf | -inf |
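The transition values follow from counting adjacent tag pairs inside each training sentence. As before, this is a sketch under the assumption of unsmoothed relative frequencies and base-2 logs; `transition_counts` and `log_transition` are illustrative names only.

```python
import math
from collections import Counter, defaultdict

corpus = [
    [("book", "Verb"), ("that", "Pron"), ("flight", "Noun"), (".", "Punct")],
    [("I", "Pron"), ("like", "Verb"), ("that", "Pron"), ("book", "Noun"), (".", "Punct")],
    [("book", "Verb"), ("flight", "Noun"), (".", "Punct")],
    [("book", "Verb"), ("like", "Noun"), ("flight", "Noun")],
    [("I", "Pron"), ("book", "Verb"), ("flight", "Noun"), (".", "Punct")],
    [("I", "Pron"), ("like", "Verb"), ("that", "Pron"), ("book", "Noun"), (".", "Punct")],
]

# transition_counts[prev_tag][next_tag] = number of times the pair occurs.
transition_counts = defaultdict(Counter)
for sentence in corpus:
    tags = [tag for _, tag in sentence]
    for prev_tag, next_tag in zip(tags, tags[1:]):
        transition_counts[prev_tag][next_tag] += 1


def log_transition(prev_tag, next_tag):
    """log2 P(next_tag | prev_tag) by relative frequency; -inf if unseen."""
    total = sum(transition_counts[prev_tag].values())
    count = transition_counts[prev_tag][next_tag]
    return math.log2(count / total) if count else float("-inf")


print(log_transition("Noun", "Punct"))  # 5 of the 6 transitions out of Noun
print(log_transition("Noun", "Noun"))   # 1 of the 6 transitions out of Noun
print(log_transition("Punct", "Noun"))  # Punct is never followed by anything
```

The all `-inf` row for `Punct` reflects that punctuation only ever ends a sentence in this corpus, so no transition out of it was observed.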


### Tagging a sentence 

In [12]:
hmm.tag(["book", "flight", "."])

[('book', 'Verb'), ('flight', 'Noun'), ('.', 'Punct')]
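To see why this tagging wins, one can score every possible tag sequence by brute force and keep the most probable one. This is only an illustration for a toy example, since the number of sequences grows exponentially with sentence length, and it is not a description of how `hmm.tag` works internally. The probabilities below are the non-zero entries of the tables above converted back from base-2 logs; `joint_probability` is a hypothetical helper.

```python
import math
from itertools import product

states = ["Noun", "Verb", "Punct", "Pron"]

# Non-zero probabilities recovered from the log2 tables (2 ** value);
# any entry that is missing has probability 0.
prior = {"Verb": 1 / 2, "Pron": 1 / 2}
emission = {
    "Noun": {"book": 2 / 7, "flight": 4 / 7, "like": 1 / 7},
    "Verb": {"book": 4 / 6, "like": 2 / 6},
    "Punct": {".": 1.0},
    "Pron": {"that": 1 / 2, "I": 1 / 2},
}
transition = {
    "Noun": {"Noun": 1 / 6, "Punct": 5 / 6},
    "Verb": {"Noun": 1 / 2, "Pron": 1 / 2},
    "Pron": {"Noun": 1 / 2, "Verb": 1 / 2},
}


def joint_probability(words, tags):
    """P(words, tags) = prior * product of transitions and emissions."""
    prob = prior.get(tags[0], 0.0) * emission[tags[0]].get(words[0], 0.0)
    for i in range(1, len(words)):
        prob *= transition.get(tags[i - 1], {}).get(tags[i], 0.0)
        prob *= emission[tags[i]].get(words[i], 0.0)
    return prob


words = ["book", "flight", "."]

# Exhaustively score all 4 ** 3 = 64 candidate tag sequences.
best_tags = max(
    product(states, repeat=len(words)),
    key=lambda tags: joint_probability(words, tags),
)
best_prob = joint_probability(words, best_tags)
print(best_tags, math.log2(best_prob))
```

Here `Verb, Noun, Punct` is the only sequence with non-zero probability: a sentence cannot start with a noun under the learned priors, and "flight" and "." are each emitted by a single tag.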

We'll see the algorithm used for such tagging in the next lecture.

### Let's try it out on a bigger dataset

- You don't have to understand the code. 

In [13]:
import sys
sys.path.append("code/.")
from hmm_pos_demo import *

In [14]:
import nltk
# nltk.download('brown')

In [15]:
hmm = demo_pos_supervised()


HMM POS tagging demo

Training HMM...
Testing...
Test: the/AT fulton/NP county/NN grand/JJ jury/NN said/VBD friday/NR an/AT investigation/NN of/IN atlanta's/NP$ recent/JJ primary/NN election/NN produced/VBD ``/`` no/AT evidence/NN ''/'' that/CS any/DTI irregularities/NNS took/VBD place/NN ./.

Untagged: the fulton county grand jury said friday an investigation of atlanta's recent primary election produced `` no evidence '' that any irregularities took place .

HMM-tagged: the/AT fulton/NP county/NN grand/JJ jury/NN said/VBD friday/NR an/AT investigation/NN of/IN atlanta's/NP$ recent/JJ primary/NN election/NN produced/VBD ``/`` no/AT evidence/NN ''/'' that/CS any/DTI irregularities/NNS took/VBD place/NN ./.

Entropy: 18.7331739704536

------------------------------------------------------------
Test: the/AT jury/NN further/RBR said/VBD in/IN term-end/NN presentments/NNS that/CS the/AT city/NN executive/JJ committee/NN ,/, which/WDT had/HVD over-all/JJ charge/NN of/IN the/AT election/NN

### Explanation of the output

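- What do these tags (e.g., NN, AT, IN, NNS) mean? Where do they come from?
    - The demo is trained on the Brown corpus, so these are Brown corpus tags. For example, AT marks an article, NN a singular common noun, NNS a plural common noun, and IN a preposition.
    - They are not identical to the tags of [the Penn Treebank Project](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html), a later tagset of 36 POS tags for English. The two tagsets share some labels (e.g., NN, NNS, IN) but not others (e.g., AT).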
- Entropy measures the degree of uncertainty or ambiguity in the tagging process.
    - Lower entropy $\rightarrow$ the tagger is relatively certain about the tags
    - Higher entropy $\rightarrow$ the tagger is less certain about the tags
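The intuition behind "lower entropy means more certainty" can be illustrated with the general Shannon entropy formula. How the demo arrives at its reported number is not shown here, so the sketch below uses two made-up probability distributions over four candidate taggings; `entropy_bits` is a hypothetical helper, not an NLTK function.

```python
import math


def entropy_bits(probabilities):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


# Hypothetical distributions over four candidate taggings of one sentence.
confident = [0.97, 0.01, 0.01, 0.01]  # one tagging dominates
uncertain = [0.25, 0.25, 0.25, 0.25]  # all taggings equally likely

print(entropy_bits(confident))
print(entropy_bits(uncertain))
```

The uniform distribution has the larger entropy of the two (exactly 2 bits for four equally likely options), while the peaked distribution is close to 0.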

### Let's try it out on a new unseen sentence

In [16]:
hmm.tag(["keep", "the", "book", "on", "the", "table", "."])

[('keep', 'VB'),
 ('the', 'AT'),
 ('book', 'NN'),
 ('on', 'IN'),
 ('the', 'AT'),
 ('table', 'NN'),
 ('.', '.')]

### Why not use traditional ML models? 

- We could extract features and treat it as a multi-class classification problem of predicting the POS tag for each word. Some example features could be:
    - Whether the word ends with "ing" (a hint for verbs)
    - What the previous word is
    - Whether the word occurs at the beginning or end of a sentence
- But coming up with such features is time-consuming, and any hand-picked feature set is limited. It can get unwieldy quickly and tends to produce fragile, overfit models.
- HMMs provide a more elegant way to model sequences: the dependency between neighbouring tags is captured directly by the transition probabilities rather than by hand-crafted features.
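The hand-crafted features listed above could be sketched as a small extractor function. This is an illustrative assumption of what such a function might look like; `word_features`, its feature names, and the `<START>` placeholder are all made up for this example.

```python
def word_features(sentence, i):
    """Hand-crafted features for the word at position i of a tokenized sentence."""
    word = sentence[i]
    return {
        "ends_with_ing": word.lower().endswith("ing"),
        # Use a placeholder when there is no previous word.
        "prev_word": sentence[i - 1] if i > 0 else "<START>",
        "is_first": i == 0,
        "is_last": i == len(sentence) - 1,
    }


sentence = ["MDS", "students", "are", "hard-working", "."]
for i in range(len(sentence)):
    print(sentence[i], word_features(sentence, i))
```

Note that "hard-working" triggers `ends_with_ing` even though it is an adjective in this sentence, which is the kind of fragility mentioned above.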