HMM supervised POS tagging#

import os
import re
import sys

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import IPython
from IPython.display import HTML, display
from nltk.tag.hmm import HiddenMarkovModelTrainer

Part-of-speech tagging task

  • Given a text, assign part-of-speech tags to the words in the text.

  • Input sentence:

MDS students are hard-working .
  • POS-tagged sentence:

MDS/PROPER_NOUN students/NOUN are/VERB hard-working/ADJECTIVE ./PUNCTUATION
words = ["book", "that", "flight", "like", "I", "."]
POS = ["Noun", "Verb", "Punct", "Pron"]
corpus = [
    [("book", "Verb"), ("that", "Pron"), ("flight", "Noun"), (".", "Punct")],
    [
        ("I", "Pron"),
        ("like", "Verb"),
        ("that", "Pron"),
        ("book", "Noun"),
        (".", "Punct"),
    ],
    [("book", "Verb"), ("flight", "Noun"), (".", "Punct")],
    [("book", "Verb"), ("like", "Noun"), ("flight", "Noun")],
    [("I", "Pron"), ("book", "Verb"), ("flight", "Noun"), (".", "Punct")],
    [
        ("I", "Pron"),
        ("like", "Verb"),
        ("that", "Pron"),
        ("book", "Noun"),
        (".", "Punct"),
    ],
]

The syntax is a bit awkward. This is just for demonstration purposes; you’re unlikely to use this interface directly when you carry out POS tagging.

# Train a supervised HMM tagger on the toy corpus:
# states = POS tags, symbols = vocabulary of observed words
trainer = HiddenMarkovModelTrainer(POS, words)
hmm = trainer.train_supervised(
    corpus,
)
# Build the cached log-probability matrices used by the tagger
hmm._create_cache()
P, O, X, S = hmm._cache
/Users/kvarada/miniconda3/envs/575/lib/python3.12/site-packages/nltk/tag/hmm.py:332: RuntimeWarning: overflow encountered in cast
  P[i] = self._priors.logprob(si)
/Users/kvarada/miniconda3/envs/575/lib/python3.12/site-packages/nltk/tag/hmm.py:334: RuntimeWarning: overflow encountered in cast
  X[i, j] = self._transitions[si].logprob(self._states[j])
/Users/kvarada/miniconda3/envs/575/lib/python3.12/site-packages/nltk/tag/hmm.py:336: RuntimeWarning: overflow encountered in cast
  O[i, k] = self._output_logprob(si, self._symbols[k])

From the documentation:#

The cache is a tuple (P, O, X, S) where:

  • S maps symbols to integers, i.e., it is the inverse mapping of self._symbols; for each symbol s in self._symbols, the following is true:

    self._symbols[S[s]] == s

  • O is the log output probabilities:

    O[i,k] = log( P(token[t]=sym[k]|tag[t]=state[i]) )

  • X is the log transition probabilities:

    X[i,j] = log( P(tag[t]=state[j]|tag[t-1]=state[i]) )

  • P is the log prior probabilities:

    P[i] = log( P(tag[0]=state[i]) )

  • The mapping from observations (symbols) to integers:

S
{'book': 0, 'that': 1, 'flight': 2, 'like': 3, 'I': 4, '.': 5}
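
As a quick sanity check of the inverse-mapping property quoted above, we can verify it directly. This is a minimal sketch that reuses the hmm and S objects created in the cells above:

# every symbol's index maps back to the same symbol, so this should evaluate to True by construction
all(hmm._symbols[S[s]] == s for s in hmm._symbols)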

HMM states#

hmm._states
['Noun', 'Verb', 'Punct', 'Pron']

Log prior probabilities#

  • \(\pi_0\) for all states, i.e., the log probability that a sentence starts with each tag

pd.DataFrame(P, index=hmm._states, columns=["pi_0"])
        pi_0
Noun    -inf
Verb    -1.0
Punct   -inf
Pron    -1.0
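
These values can be reproduced by hand: with the default (unsmoothed) maximum-likelihood estimates, the prior of a tag is just the relative frequency with which it starts a sentence in the toy corpus, and NLTK reports log probabilities in base 2. A minimal sketch reusing the corpus and POS objects defined above:

first_tags = [sent[0][1] for sent in corpus]  # tag of the first token of each sentence
{t: np.log2(first_tags.count(t) / len(corpus)) if first_tags.count(t) else float("-inf") for t in POS}
# expected: {'Noun': -inf, 'Verb': -1.0, 'Punct': -inf, 'Pron': -1.0}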

Log output probabilities#

  • log(P(observation | tag)) for all observations and tags.

pd.DataFrame(O, index=hmm._states, columns=S.keys())
           book  that    flight      like     I     .
Noun  -1.807355  -inf -0.807355 -2.807355  -inf  -inf
Verb  -0.584962  -inf      -inf -1.584962  -inf  -inf
Punct      -inf  -inf      -inf      -inf  -inf   0.0
Pron       -inf  -1.0      -inf      -inf  -1.0  -inf
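
The output probabilities follow from the same counting: the toy corpus contains seven Noun tokens, two of which are "book", so log2(P(book | Noun)) = log2(2/7) ≈ -1.807, matching the table above. A small sketch:

noun_tokens = [w for sent in corpus for (w, t) in sent if t == "Noun"]  # 7 Noun tokens
np.log2(noun_tokens.count("book") / len(noun_tokens))  # log2(2/7) ≈ -1.807355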

Log transition probabilities#

  • Transition matrix: log(P(tag[t]=state[j] | tag[t-1]=state[i])); rows correspond to the previous tag, columns to the current tag.

pd.DataFrame(X, index=hmm._states, columns=hmm._states)
           Noun  Verb     Punct  Pron
Noun  -2.584963  -inf -0.263034  -inf
Verb  -1.000000  -inf      -inf  -1.0
Punct      -inf  -inf      -inf  -inf
Pron  -1.000000  -1.0      -inf  -inf
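
Transition probabilities are estimated the same way from adjacent tag pairs within each sentence. A Noun is followed by another token six times in the toy corpus, five of those times by Punct, giving log2(5/6) ≈ -0.263, as in the table. A sketch:

tag_pairs = [(sent[i][1], sent[i + 1][1]) for sent in corpus for i in range(len(sent) - 1)]
after_noun = [nxt for prev, nxt in tag_pairs if prev == "Noun"]  # tags that follow a Noun
np.log2(after_noun.count("Punct") / len(after_noun))  # log2(5/6) ≈ -0.263034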

Tagging a sentence#

hmm.tag(["book", "flight", "."])
[('book', 'Verb'), ('flight', 'Noun'), ('.', 'Punct')]

We’ll see the algorithm used for such tagging in the next lecture.

Let’s try it out on a bigger dataset#

  • You don’t have to understand the code.

import sys
sys.path.append("code/.")  # folder containing the demo script
from hmm_pos_demo import *
import nltk
# nltk.download('brown')  # the demo uses NLTK's tagged Brown corpus
hmm = demo_pos_supervised()
HMM POS tagging demo

Training HMM...
Testing...
Test: the/AT fulton/NP county/NN grand/JJ jury/NN said/VBD friday/NR an/AT investigation/NN of/IN atlanta's/NP$ recent/JJ primary/NN election/NN produced/VBD ``/`` no/AT evidence/NN ''/'' that/CS any/DTI irregularities/NNS took/VBD place/NN ./.

Untagged: the fulton county grand jury said friday an investigation of atlanta's recent primary election produced `` no evidence '' that any irregularities took place .

HMM-tagged: the/AT fulton/NP county/NN grand/JJ jury/NN said/VBD friday/NR an/AT investigation/NN of/IN atlanta's/NP$ recent/JJ primary/NN election/NN produced/VBD ``/`` no/AT evidence/NN ''/'' that/CS any/DTI irregularities/NNS took/VBD place/NN ./.

Entropy: 18.7331739704536

------------------------------------------------------------
Test: the/AT jury/NN further/RBR said/VBD in/IN term-end/NN presentments/NNS that/CS the/AT city/NN executive/JJ committee/NN ,/, which/WDT had/HVD over-all/JJ charge/NN of/IN the/AT election/NN ,/, ``/`` deserves/VBZ the/AT praise/NN and/CC thanks/NNS of/IN the/AT city/NN of/IN atlanta/NP ''/'' for/IN the/AT manner/NN in/IN which/WDT the/AT election/NN was/BEDZ conducted/VBN ./.

Untagged: the jury further said in term-end presentments that the city executive committee , which had over-all charge of the election , `` deserves the praise and thanks of the city of atlanta '' for the manner in which the election was conducted .

HMM-tagged: the/AT jury/NN further/RBR said/VBD in/IN term-end/AT presentments/NN that/CS the/AT city/NN executive/NN committee/NN ,/, which/WDT had/HVD over-all/VBN charge/NN of/IN the/AT election/NN ,/, ``/`` deserves/VBZ the/AT praise/NN and/CC thanks/NNS of/IN the/AT city/NN of/IN atlanta/NP ''/'' for/IN the/AT manner/NN in/IN which/WDT the/AT election/NN was/BEDZ conducted/VBN ./.

Entropy: 27.07087255188224

------------------------------------------------------------
Test: the/AT september-october/NP term/NN jury/NN had/HVD been/BEN charged/VBN by/IN fulton/NP superior/JJ court/NN judge/NN durwood/NP pye/NP to/TO investigate/VB reports/NNS of/IN possible/JJ ``/`` irregularities/NNS ''/'' in/IN the/AT hard-fought/JJ primary/NN which/WDT was/BEDZ won/VBN by/IN mayor-nominate/NN ivan/NP allen/NP jr./NP ./.

Untagged: the september-october term jury had been charged by fulton superior court judge durwood pye to investigate reports of possible `` irregularities '' in the hard-fought primary which was won by mayor-nominate ivan allen jr. .

HMM-tagged: the/AT september-october/JJ term/NN jury/NN had/HVD been/BEN charged/VBN by/IN fulton/NP superior/JJ court/NN judge/NN durwood/TO pye/VB to/TO investigate/VB reports/NNS of/IN possible/JJ ``/`` irregularities/NNS ''/'' in/IN the/AT hard-fought/JJ primary/NN which/WDT was/BEDZ won/VBN by/IN mayor-nominate/NP ivan/NP allen/NP jr./NP ./.

Entropy: 33.82818742373867

------------------------------------------------------------
Test: ``/`` only/RB a/AT relative/JJ handful/NN of/IN such/JJ reports/NNS was/BEDZ received/VBN ''/'' ,/, the/AT jury/NN said/VBD ,/, ``/`` considering/IN the/AT widespread/JJ interest/NN in/IN the/AT election/NN ,/, the/AT number/NN of/IN voters/NNS and/CC the/AT size/NN of/IN this/DT city/NN ''/'' ./.

Untagged: `` only a relative handful of such reports was received '' , the jury said , `` considering the widespread interest in the election , the number of voters and the size of this city '' .

HMM-tagged: ``/`` only/RB a/AT relative/JJ handful/NN of/IN such/JJ reports/NNS was/BEDZ received/VBN ''/'' ,/, the/AT jury/NN said/VBD ,/, ``/`` considering/IN the/AT widespread/JJ interest/NN in/IN the/AT election/NN ,/, the/AT number/NN of/IN voters/NNS and/CC the/AT size/NN of/IN this/DT city/NN ''/'' ./.

Entropy: 11.437819859633125

------------------------------------------------------------
Test: the/AT jury/NN said/VBD it/PPS did/DOD find/VB that/CS many/AP of/IN georgia's/NP$ registration/NN and/CC election/NN laws/NNS ``/`` are/BER outmoded/JJ or/CC inadequate/JJ and/CC often/RB ambiguous/JJ ''/'' ./.

Untagged: the jury said it did find that many of georgia's registration and election laws `` are outmoded or inadequate and often ambiguous '' .

HMM-tagged: the/AT jury/NN said/VBD it/PPS did/DOD find/VB that/CS many/AP of/IN georgia's/NP$ registration/NN and/CC election/NN laws/NNS ``/`` are/BER outmoded/VBG or/CC inadequate/JJ and/CC often/RB ambiguous/VB ''/'' ./.

Entropy: 20.816362319185643

------------------------------------------------------------
Test: it/PPS recommended/VBD that/CS fulton/NP legislators/NNS act/VB ``/`` to/TO have/HV these/DTS laws/NNS studied/VBN and/CC revised/VBN to/IN the/AT end/NN of/IN modernizing/VBG and/CC improving/VBG them/PPO ''/'' ./.

Untagged: it recommended that fulton legislators act `` to have these laws studied and revised to the end of modernizing and improving them '' .

HMM-tagged: it/PPS recommended/VBD that/CS fulton/NP legislators/NNS act/VB ``/`` to/TO have/HV these/DTS laws/NNS studied/VBD and/CC revised/VBD to/IN the/AT end/NN of/IN modernizing/NP and/CC improving/VBG them/PPO ''/'' ./.

Entropy: 20.32449212032506

------------------------------------------------------------
Test: the/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN of/IN other/AP topics/NNS ,/, among/IN them/PPO the/AT atlanta/NP and/CC fulton/NP county/NN purchasing/VBG departments/NNS which/WDT it/PPS said/VBD ``/`` are/BER well/QL operated/VBN and/CC follow/VB generally/RB accepted/VBN practices/NNS which/WDT inure/VB to/IN the/AT best/JJT interest/NN of/IN both/ABX governments/NNS ''/'' ./.

Untagged: the grand jury commented on a number of other topics , among them the atlanta and fulton county purchasing departments which it said `` are well operated and follow generally accepted practices which inure to the best interest of both governments '' .

HMM-tagged: the/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN of/IN other/AP topics/NNS ,/, among/IN them/PPO the/AT atlanta/NP and/CC fulton/NP county/NN purchasing/NN departments/NNS which/WDT it/PPS said/VBD ``/`` are/BER well/RB operated/VBN and/CC follow/VB generally/RB accepted/VBN practices/NNS which/WDT inure/VBZ to/IN the/AT best/JJT interest/NN of/IN both/ABX governments/NNS ''/'' ./.

Entropy: 31.383423146867017

------------------------------------------------------------
Test: merger/NN proposed/VBN

Untagged: merger proposed

HMM-tagged: merger/PPS proposed/VBD

Entropy: 5.671820394600896

------------------------------------------------------------
Test: however/WRB ,/, the/AT jury/NN said/VBD it/PPS believes/VBZ ``/`` these/DTS two/CD offices/NNS should/MD be/BE combined/VBN to/TO achieve/VB greater/JJR efficiency/NN and/CC reduce/VB the/AT cost/NN of/IN administration/NN ''/'' ./.

Untagged: however , the jury said it believes `` these two offices should be combined to achieve greater efficiency and reduce the cost of administration '' .

HMM-tagged: however/WRB ,/, the/AT jury/NN said/VBD it/PPS believes/VBZ ``/`` these/DTS two/CD offices/NNS should/MD be/BE combined/VBN to/TO achieve/VB greater/JJR efficiency/NN and/CC reduce/VB the/AT cost/NN of/IN administration/NN ''/'' ./.

Entropy: 8.2754594390854

------------------------------------------------------------
Test: the/AT city/NN purchasing/VBG department/NN ,/, the/AT jury/NN said/VBD ,/, ``/`` is/BEZ lacking/VBG in/IN experienced/VBN clerical/JJ personnel/NNS as/CS a/AT result/NN of/IN city/NN personnel/NNS policies/NNS ''/'' ./.

Untagged: the city purchasing department , the jury said , `` is lacking in experienced clerical personnel as a result of city personnel policies '' .

HMM-tagged: the/AT city/NN purchasing/NN department/NN ,/, the/AT jury/NN said/VBD ,/, ``/`` is/BEZ lacking/VBG in/IN experienced/AT clerical/JJ personnel/NNS as/CS a/AT result/NN of/IN city/NN personnel/NNS policies/NNS ''/'' ./.

Entropy: 16.762253727845472

------------------------------------------------------------
accuracy over 284 tokens: 92.96

Explanation of the output#

  • What do these tags (e.g., NN, AT, IN, NNS) mean? Where do they come from?

    • These tags come from the Brown corpus tagset; the demo above is trained and tested on NLTK’s tagged Brown corpus.

    • A closely related and widely used tagset is the Penn Treebank tagset, which consists of 36 POS tags for labelling the different parts of speech of English words.

  • Entropy is a common metric used to measure the degree of uncertainty or ambiguity in the tagging process.

    • Lower entropy \(\rightarrow\) the tagger is relatively certain about the tags

    • Higher entropy \(\rightarrow\) the tagger is less certain about the tags
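
Concretely, and assuming the demo reports the standard Shannon entropy (in bits) of the model’s distribution over tag sequences, the quantity has the form \(H = -\sum_{y} P(y \mid x) \log_2 P(y \mid x)\), where \(x\) is the input sentence and \(y\) ranges over the possible tag sequences. The more the probability mass concentrates on a single tagging, the lower the entropy.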

Let’s try it out on a new unseen sentence#

hmm.tag(["keep", "the", "book", "on", "the", "table", "."])
[('keep', 'VB'),
 ('the', 'AT'),
 ('book', 'NN'),
 ('on', 'IN'),
 ('the', 'AT'),
 ('table', 'NN'),
 ('.', '.')]

Why not use traditional ML models?#

  • We could extract features and treat POS tagging as a multi-class classification problem of predicting a tag for each word. Some example features could be (a rough sketch follows this list):

    • Whether the word ends with an “ing” (for verbs)

    • What’s the previous word?

    • Or whether the word occurs at the beginning or end of a sentence

  • But coming up with such features by hand is time-consuming and limited; it can get unwieldy quite quickly and tends to lead to fragile, overfit models.

  • HMMs provide a much more elegant way to model sequences and are often a preferred approach for sequence labelling tasks like POS tagging.
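
For illustration, here is a minimal sketch of the kind of hand-crafted feature extraction this approach would require; the function name and feature names are made up for this example:

def word_features(sentence, i):
    """Hypothetical hand-crafted features for the word at position i of a tokenized sentence."""
    word = sentence[i]
    return {
        "word": word.lower(),
        "ends_with_ing": word.endswith("ing"),  # suggestive of verbs
        "prev_word": sentence[i - 1].lower() if i > 0 else "<START>",
        "is_first": i == 0,  # position in the sentence
        "is_last": i == len(sentence) - 1,
    }

word_features(["MDS", "students", "are", "hard-working", "."], 3)
# {'word': 'hard-working', 'ends_with_ing': True, 'prev_word': 'are', 'is_first': False, 'is_last': False}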