
/nltk/tag/hmm.py

https://github.com/BrucePHill/nltk
Python | 1268 lines | 959 code | 52 blank | 257 comment
Possible License(s): Apache-2.0

Large files are truncated; this listing ends partway through the file.

# Natural Language Toolkit: Hidden Markov Model
#
# Copyright (C) 2001-2013 NLTK Project
# Author: Trevor Cohn <tacohn@csse.unimelb.edu.au>
#         Philip Blunsom <pcbl@csse.unimelb.edu.au>
#         Tiago Tresoldi <tiago@tresoldi.pro.br> (fixes)
#         Steven Bird <stevenbird1@gmail.com> (fixes)
#         Joseph Frazee <jfrazee@mail.utexas.edu> (fixes)
# URL: <http://www.nltk.org/>
# For license information, see LICENSE.TXT
  11. """
  12. Hidden Markov Models (HMMs) largely used to assign the correct label sequence
  13. to sequential data or assess the probability of a given label and data
  14. sequence. These models are finite state machines characterised by a number of
  15. states, transitions between these states, and output symbols emitted while in
  16. each state. The HMM is an extension to the Markov chain, where each state
  17. corresponds deterministically to a given event. In the HMM the observation is
  18. a probabilistic function of the state. HMMs share the Markov chain's
  19. assumption, being that the probability of transition from one state to another
  20. only depends on the current state - i.e. the series of states that led to the
  21. current state are not used. They are also time invariant.
  22. The HMM is a directed graph, with probability weighted edges (representing the
  23. probability of a transition between the source and sink states) where each
  24. vertex emits an output symbol when entered. The symbol (or observation) is
  25. non-deterministically generated. For this reason, knowing that a sequence of
  26. output observations was generated by a given HMM does not mean that the
  27. corresponding sequence of states (and what the current state is) is known.
  28. This is the 'hidden' in the hidden markov model.
  29. Formally, a HMM can be characterised by:
  30. - the output observation alphabet. This is the set of symbols which may be
  31. observed as output of the system.
  32. - the set of states.
  33. - the transition probabilities *a_{ij} = P(s_t = j | s_{t-1} = i)*. These
  34. represent the probability of transition to each state from a given state.
  35. - the output probability matrix *b_i(k) = P(X_t = o_k | s_t = i)*. These
  36. represent the probability of observing each symbol in a given state.
  37. - the initial state distribution. This gives the probability of starting
  38. in each state.
  39. To ground this discussion, take a common NLP application, part-of-speech (POS)
  40. tagging. An HMM is desirable for this task as the highest probability tag
  41. sequence can be calculated for a given sequence of word forms. This differs
  42. from other tagging techniques which often tag each word individually, seeking
  43. to optimise each individual tagging greedily without regard to the optimal
  44. combination of tags for a larger unit, such as a sentence. The HMM does this
  45. with the Viterbi algorithm, which efficiently computes the optimal path
  46. through the graph given the sequence of words forms.
  47. In POS tagging the states usually have a 1:1 correspondence with the tag
  48. alphabet - i.e. each state represents a single tag. The output observation
  49. alphabet is the set of word forms (the lexicon), and the remaining three
  50. parameters are derived by a training regime. With this information the
  51. probability of a given sentence can be easily derived, by simply summing the
  52. probability of each distinct path through the model. Similarly, the highest
  53. probability tagging sequence can be derived with the Viterbi algorithm,
  54. yielding a state sequence which can be mapped into a tag sequence.
  55. This discussion assumes that the HMM has been trained. This is probably the
  56. most difficult task with the model, and requires either MLE estimates of the
  57. parameters or unsupervised learning using the Baum-Welch algorithm, a variant
  58. of EM.
  59. For more information, please consult the source code for this module,
  60. which includes extensive demonstration code.
  61. """

from __future__ import print_function, unicode_literals

import re
import types

try:
    from numpy import zeros, ones, float32, float64, log2, hstack, array, argmax
except ImportError:
    pass

from nltk.probability import (FreqDist, ConditionalFreqDist,
                              ConditionalProbDist, DictionaryProbDist,
                              DictionaryConditionalProbDist,
                              LidstoneProbDist, MutableProbDist,
                              MLEProbDist, UniformProbDist)
from nltk.metrics import accuracy
from nltk.util import LazyMap, LazyConcatenation, LazyZip
from nltk.compat import python_2_unicode_compatible
from nltk.tag.api import TaggerI, HiddenMarkovModelTaggerTransformI

# _NINF = float('-inf')  # won't work on Windows
_NINF = float('-1e300')

_TEXT = 0  # index of text in a tuple
_TAG = 1   # index of tag in a tuple


@python_2_unicode_compatible
class HiddenMarkovModelTagger(TaggerI):
    """
    Hidden Markov model class, a generative model for labelling sequence
    data. These models define the joint probability of a sequence of symbols
    and their labels (state transitions) as the product of the starting
    state probability, the probability of each state transition, and the
    probability of each observation being generated from each state. This is
    described in more detail in the module documentation.

    This implementation is based on the HMM description in Chapter 8 of
    Huang, Acero and Hon, Spoken Language Processing, and includes an
    extension for training shallow HMM parsers or specialized HMMs as in
    Molina et al., 2002. A specialized HMM modifies training data by
    applying a specialization function to create a new training set that is
    more appropriate for sequential tagging with an HMM. A typical use case
    is chunking.

    :param symbols: the set of output symbols (alphabet)
    :type symbols: seq of any
    :param states: a set of states representing state space
    :type states: seq of any
    :param transitions: transition probabilities; Pr(s_i | s_j) is the
        probability of transition to state i given the model is in state j
    :type transitions: ConditionalProbDistI
    :param outputs: output probabilities; Pr(o_k | s_i) is the probability
        of emitting symbol k when entering state i
    :type outputs: ConditionalProbDistI
    :param priors: initial state distribution; Pr(s_i) is the probability
        of starting in state i
    :type priors: ProbDistI
    :param transform: an optional function for transforming training
        instances, defaults to the identity function.
    :type transform: function or HiddenMarkovModelTaggerTransform
    """
    def __init__(self, symbols, states, transitions, outputs, priors, **kwargs):
        self._symbols = list(set(symbols))
        self._states = list(set(states))
        self._transitions = transitions
        self._outputs = outputs
        self._priors = priors
        self._cache = None

        self._transform = kwargs.get('transform', IdentityTransform())
        if isinstance(self._transform, types.FunctionType):
            self._transform = LambdaTransform(self._transform)
        elif not isinstance(self._transform,
                            HiddenMarkovModelTaggerTransformI):
            # a bare ``raise`` here would fail with "no active exception";
            # raise a descriptive error instead
            raise TypeError("transform must be a function or a "
                            "HiddenMarkovModelTaggerTransformI instance")

    @classmethod
    def _train(cls, labeled_sequence, test_sequence=None,
               unlabeled_sequence=None, **kwargs):
        transform = kwargs.get('transform', IdentityTransform())
        if isinstance(transform, types.FunctionType):
            transform = LambdaTransform(transform)
        elif not isinstance(transform, HiddenMarkovModelTaggerTransformI):
            raise TypeError("transform must be a function or a "
                            "HiddenMarkovModelTaggerTransformI instance")

        estimator = kwargs.get('estimator',
                               lambda fd, bins: LidstoneProbDist(fd, 0.1, bins))

        labeled_sequence = LazyMap(transform.transform, labeled_sequence)
        symbols = list(set(word for sent in labeled_sequence
                           for word, tag in sent))
        tag_set = list(set(tag for sent in labeled_sequence
                           for word, tag in sent))

        trainer = HiddenMarkovModelTrainer(tag_set, symbols)
        hmm = trainer.train_supervised(labeled_sequence, estimator=estimator)
        hmm = cls(hmm._symbols, hmm._states, hmm._transitions, hmm._outputs,
                  hmm._priors, transform=transform)

        if test_sequence:
            hmm.test(test_sequence, verbose=kwargs.get('verbose', False))

        if unlabeled_sequence:
            max_iterations = kwargs.get('max_iterations', 5)
            hmm = trainer.train_unsupervised(unlabeled_sequence, model=hmm,
                                             max_iterations=max_iterations)
            if test_sequence:
                hmm.test(test_sequence, verbose=kwargs.get('verbose', False))

        return hmm

    @classmethod
    def train(cls, labeled_sequence, test_sequence=None,
              unlabeled_sequence=None, **kwargs):
        """
        Train a new HiddenMarkovModelTagger using the given labeled and
        unlabeled training instances. Testing will be performed if test
        instances are provided.

        :return: a hidden Markov model tagger
        :rtype: HiddenMarkovModelTagger
        :param labeled_sequence: a sequence of labeled training instances,
            i.e. a list of sentences represented as tuples
        :type labeled_sequence: list(list)
        :param test_sequence: a sequence of labeled test instances
        :type test_sequence: list(list)
        :param unlabeled_sequence: a sequence of unlabeled training
            instances, i.e. a list of sentences represented as words
        :type unlabeled_sequence: list(list)
        :param transform: an optional function for transforming training
            instances, defaults to the identity function, see ``transform()``
        :type transform: function
        :param estimator: an optional function or class that maps a
            condition's frequency distribution to its probability
            distribution, defaults to a Lidstone distribution with gamma = 0.1
        :type estimator: class or function
        :param verbose: boolean flag indicating whether training should be
            verbose or include printed output
        :type verbose: bool
        :param max_iterations: number of Baum-Welch iterations to perform
        :type max_iterations: int
        """
        return cls._train(labeled_sequence, test_sequence,
                          unlabeled_sequence, **kwargs)
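
    # For example (illustrative only; ``tagged_sents`` and ``raw_sents`` are
    # hypothetical variables holding labelled sentences and unlabelled
    # sentences of (word, None) pairs respectively):
    #
    #   hmm = HiddenMarkovModelTagger.train(tagged_sents[100:],
    #                                       test_sequence=tagged_sents[:100],
    #                                       unlabeled_sequence=raw_sents,
    #                                       max_iterations=5)
    #
    # trains supervised first, then refines the model with Baum-Welch on the
    # unlabelled sentences (see ``_train`` above).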

    def probability(self, sequence):
        """
        Returns the probability of the given symbol sequence. If the
        sequence is labelled, then returns the joint probability of the
        symbol, state sequence. Otherwise, uses the forward algorithm to
        find the probability over all label sequences.

        :return: the probability of the sequence
        :rtype: float
        :param sequence: the sequence of symbols which must contain the TEXT
            property, and optionally the TAG property
        :type sequence: Token
        """
        return 2**(self.log_probability(self._transform.transform(sequence)))

    def log_probability(self, sequence):
        """
        Returns the log-probability of the given symbol sequence. If the
        sequence is labelled, then returns the joint log-probability of the
        symbol, state sequence. Otherwise, uses the forward algorithm to
        find the log-probability over all label sequences.

        :return: the log-probability of the sequence
        :rtype: float
        :param sequence: the sequence of symbols which must contain the TEXT
            property, and optionally the TAG property
        :type sequence: Token
        """
        sequence = self._transform.transform(sequence)

        T = len(sequence)
        N = len(self._states)

        if T > 0 and sequence[0][_TAG]:
            last_state = sequence[0][_TAG]
            p = self._priors.logprob(last_state) + \
                self._outputs[last_state].logprob(sequence[0][_TEXT])
            for t in range(1, T):
                state = sequence[t][_TAG]
                p += self._transitions[last_state].logprob(state) + \
                     self._outputs[state].logprob(sequence[t][_TEXT])
                last_state = state
            return p
        else:
            alpha = self._forward_probability(sequence)
            p = _log_add(*alpha[T-1, :])
            return p
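
    # For a fully labelled sequence, the value returned above is the joint
    # log-probability (base 2)
    #
    #   log P(O, S) = log P(s_0) + log P(o_0 | s_0)
    #                 + sum_{t=1..T-1} [ log P(s_t | s_{t-1}) + log P(o_t | s_t) ]
    #
    # while the unlabelled branch marginalises over all state sequences
    # using the forward probabilities.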

    def tag(self, unlabeled_sequence):
        """
        Tags the sequence with the highest probability state sequence. This
        uses the best_path method to find the Viterbi path.

        :return: a labelled sequence of symbols
        :rtype: list
        :param unlabeled_sequence: the sequence of unlabeled symbols
        :type unlabeled_sequence: list
        """
        unlabeled_sequence = self._transform.transform(unlabeled_sequence)
        return self._tag(unlabeled_sequence)

    def _tag(self, unlabeled_sequence):
        path = self._best_path(unlabeled_sequence)
        return list(zip(unlabeled_sequence, path))

    def _output_logprob(self, state, symbol):
        """
        :return: the log probability of the symbol being observed in the
            given state
        :rtype: float
        """
        return self._outputs[state].logprob(symbol)

    def _create_cache(self):
        """
        The cache is a tuple (P, O, X, S) where:

        - S maps symbols to integers.  I.e., it is the inverse
          mapping from self._symbols; for each symbol s in
          self._symbols, the following is true::

            self._symbols[S[s]] == s

        - O is the log output probabilities::

            O[i,k] = log( P(token[t]=sym[k]|tag[t]=state[i]) )

        - X is the log transition probabilities::

            X[i,j] = log( P(tag[t]=state[j]|tag[t-1]=state[i]) )

        - P is the log prior probabilities::

            P[i] = log( P(tag[0]=state[i]) )
        """
        if not self._cache:
            N = len(self._states)
            M = len(self._symbols)
            P = zeros(N, float32)
            X = zeros((N, N), float32)
            O = zeros((N, M), float32)
            for i in range(N):
                si = self._states[i]
                P[i] = self._priors.logprob(si)
                for j in range(N):
                    X[i, j] = self._transitions[si].logprob(self._states[j])
                for k in range(M):
                    O[i, k] = self._outputs[si].logprob(self._symbols[k])
            S = {}
            for k in range(M):
                S[self._symbols[k]] = k
            self._cache = (P, O, X, S)

    def _update_cache(self, symbols):
        # add new symbols to the symbol table and repopulate the output
        # probabilities and symbol table mapping
        if symbols:
            self._create_cache()
            P, O, X, S = self._cache
            for symbol in symbols:
                if symbol not in self._symbols:
                    self._cache = None
                    self._symbols.append(symbol)

            # don't bother with the work if there aren't any new symbols
            if not self._cache:
                N = len(self._states)
                M = len(self._symbols)
                Q = O.shape[1]
                # add new columns to the output probability table without
                # destroying the old probabilities
                O = hstack([O, zeros((N, M - Q), float32)])
                for i in range(N):
                    si = self._states[i]
                    # only calculate probabilities for new symbols
                    for k in range(Q, M):
                        O[i, k] = self._outputs[si].logprob(self._symbols[k])
                # only create symbol mappings for new symbols
                for k in range(Q, M):
                    S[self._symbols[k]] = k
                self._cache = (P, O, X, S)

    def best_path(self, unlabeled_sequence):
        """
        Returns the state sequence of the optimal (most probable) path
        through the HMM. Uses the Viterbi algorithm to calculate this part
        by dynamic programming.

        :return: the state sequence
        :rtype: sequence of any
        :param unlabeled_sequence: the sequence of unlabeled symbols
        :type unlabeled_sequence: list
        """
        unlabeled_sequence = self._transform.transform(unlabeled_sequence)
        return self._best_path(unlabeled_sequence)

    def _best_path(self, unlabeled_sequence):
        T = len(unlabeled_sequence)
        N = len(self._states)
        self._create_cache()
        self._update_cache(unlabeled_sequence)
        P, O, X, S = self._cache

        V = zeros((T, N), float32)
        B = ones((T, N), int) * -1

        V[0] = P + O[:, S[unlabeled_sequence[0]]]
        for t in range(1, T):
            for j in range(N):
                vs = V[t-1, :] + X[:, j]
                best = argmax(vs)
                V[t, j] = vs[best] + O[j, S[unlabeled_sequence[t]]]
                B[t, j] = best
        current = argmax(V[T-1, :])
        sequence = [current]
        for t in range(T-1, 0, -1):
            last = B[t, current]
            sequence.append(last)
            current = last
        sequence.reverse()
        return list(map(self._states.__getitem__, sequence))
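
    # In terms of the cache built above, the recurrence filled in by
    # ``_best_path`` is (all quantities are base-2 log-probabilities):
    #
    #   V[0, j] = P[j] + O[j, S[o_0]]
    #   V[t, j] = max_i ( V[t-1, i] + X[i, j] ) + O[j, S[o_t]]
    #
    # with B[t, j] recording the maximising predecessor i; the final loop
    # follows these back-pointers to recover the best state sequence.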

    def best_path_simple(self, unlabeled_sequence):
        """
        Returns the state sequence of the optimal (most probable) path
        through the HMM. Uses the Viterbi algorithm to calculate this part
        by dynamic programming. This uses a simple, direct method, and is
        included for teaching purposes.

        :return: the state sequence
        :rtype: sequence of any
        :param unlabeled_sequence: the sequence of unlabeled symbols
        :type unlabeled_sequence: list
        """
        unlabeled_sequence = self._transform.transform(unlabeled_sequence)
        return self._best_path_simple(unlabeled_sequence)

    def _best_path_simple(self, unlabeled_sequence):
        T = len(unlabeled_sequence)
        N = len(self._states)
        V = zeros((T, N), float64)
        B = {}

        # find the starting log probabilities for each state
        symbol = unlabeled_sequence[0]
        for i, state in enumerate(self._states):
            V[0, i] = self._priors.logprob(state) + \
                      self._output_logprob(state, symbol)
            B[0, state] = None

        # find the maximum log probabilities for reaching each state at time t
        for t in range(1, T):
            symbol = unlabeled_sequence[t]
            for j in range(N):
                sj = self._states[j]
                best = None
                for i in range(N):
                    si = self._states[i]
                    va = V[t-1, i] + self._transitions[si].logprob(sj)
                    if not best or va > best[0]:
                        best = (va, si)
                V[t, j] = best[0] + self._output_logprob(sj, symbol)
                B[t, sj] = best[1]

        # find the highest probability final state
        best = None
        for i in range(N):
            val = V[T-1, i]
            if not best or val > best[0]:
                best = (val, self._states[i])

        # traverse the back-pointers B to find the state sequence
        current = best[1]
        sequence = [current]
        for t in range(T-1, 0, -1):
            last = B[t, current]
            sequence.append(last)
            current = last

        sequence.reverse()
        return sequence

    def random_sample(self, rng, length):
        """
        Randomly sample the HMM to generate a sentence of a given length.
        This samples the prior distribution then the observation
        distribution and transition distribution for each subsequent
        observation and state. This will mostly generate unintelligible
        garbage, but can provide some amusement.

        :return: the randomly created state/observation sequence,
            generated according to the HMM's probability
            distributions. The SUBTOKENS have TEXT and TAG
            properties containing the observation and state
            respectively.
        :rtype: list
        :param rng: random number generator
        :type rng: Random (or any object with a random() method)
        :param length: desired output length
        :type length: int
        """
        # sample the starting state and symbol prob dists
        tokens = []
        state = self._sample_probdist(self._priors, rng.random(),
                                      self._states)
        symbol = self._sample_probdist(self._outputs[state],
                                       rng.random(), self._symbols)
        tokens.append((symbol, state))

        for i in range(1, length):
            # sample the state transition and symbol prob dists
            state = self._sample_probdist(self._transitions[state],
                                          rng.random(), self._states)
            symbol = self._sample_probdist(self._outputs[state],
                                           rng.random(), self._symbols)
            tokens.append((symbol, state))

        return tokens

    def _sample_probdist(self, probdist, p, samples):
        cum_p = 0
        for sample in samples:
            add_p = probdist.prob(sample)
            if cum_p <= p <= cum_p + add_p:
                return sample
            cum_p += add_p
        raise Exception('Invalid probability distribution - '
                        'does not sum to one')

    def entropy(self, unlabeled_sequence):
        """
        Returns the entropy over labellings of the given sequence. This is
        given by::

            H(O) = - sum_S Pr(S | O) log Pr(S | O)

        where the summation ranges over all state sequences, S. Let
        *Z = Pr(O) = sum_S Pr(S, O)* where the summation ranges over all
        state sequences and O is the observation sequence. As such the
        entropy can be re-expressed as::

            H = - sum_S Pr(S | O) log [ Pr(S, O) / Z ]
              = log Z - sum_S Pr(S | O) log Pr(S, O)
              = log Z - sum_S Pr(S | O) [ log Pr(S_0) + sum_t log Pr(S_t | S_{t-1}) + sum_t log Pr(O_t | S_t) ]

        The order of summation for the log terms can be flipped, allowing
        dynamic programming to be used to calculate the entropy.
        Specifically, we use the forward and backward probabilities (alpha,
        beta) giving::

            H = log Z - sum_s0 alpha_0(s0) beta_0(s0) / Z * log Pr(s0)
                + sum_t,si,sj alpha_t(si) Pr(sj | si) Pr(O_t+1 | sj) beta_t(sj) / Z * log Pr(sj | si)
                + sum_t,st alpha_t(st) beta_t(st) / Z * log Pr(O_t | st)

        This simply uses alpha and beta to find the probabilities of partial
        sequences, constrained to include the given state(s) at some point
        in time.
        """
        unlabeled_sequence = self._transform.transform(unlabeled_sequence)

        T = len(unlabeled_sequence)
        N = len(self._states)

        alpha = self._forward_probability(unlabeled_sequence)
        beta = self._backward_probability(unlabeled_sequence)
        normalisation = _log_add(*alpha[T-1, :])

        entropy = normalisation

        # starting state, t = 0
        for i, state in enumerate(self._states):
            p = 2**(alpha[0, i] + beta[0, i] - normalisation)
            entropy -= p * self._priors.logprob(state)
            #print 'p(s_0 = %s) =' % state, p

        # state transitions
        for t0 in range(T - 1):
            t1 = t0 + 1
            for i0, s0 in enumerate(self._states):
                for i1, s1 in enumerate(self._states):
                    p = 2**(alpha[t0, i0] + self._transitions[s0].logprob(s1) +
                            self._outputs[s1].logprob(
                                unlabeled_sequence[t1][_TEXT]) +
                            beta[t1, i1] - normalisation)
                    entropy -= p * self._transitions[s0].logprob(s1)
                    #print 'p(s_%d = %s, s_%d = %s) =' % (t0, s0, t1, s1), p

        # symbol emissions
        for t in range(T):
            for i, state in enumerate(self._states):
                p = 2**(alpha[t, i] + beta[t, i] - normalisation)
                entropy -= p * self._outputs[state].logprob(
                    unlabeled_sequence[t][_TEXT])
                #print 'p(s_%d = %s) =' % (t, state), p

        return entropy

    def point_entropy(self, unlabeled_sequence):
        """
        Returns the pointwise entropy over the possible states at each
        position in the chain, given the observation sequence.
        """
        unlabeled_sequence = self._transform.transform(unlabeled_sequence)

        T = len(unlabeled_sequence)
        N = len(self._states)

        alpha = self._forward_probability(unlabeled_sequence)
        beta = self._backward_probability(unlabeled_sequence)
        normalisation = _log_add(*alpha[T-1, :])

        entropies = zeros(T, float64)
        probs = zeros(N, float64)
        for t in range(T):
            for s in range(N):
                probs[s] = alpha[t, s] + beta[t, s] - normalisation

            for s in range(N):
                entropies[t] -= 2**(probs[s]) * probs[s]

        return entropies

    def _exhaustive_entropy(self, unlabeled_sequence):
        unlabeled_sequence = self._transform.transform(unlabeled_sequence)

        T = len(unlabeled_sequence)
        N = len(self._states)

        labellings = [[state] for state in self._states]
        for t in range(T - 1):
            current = labellings
            labellings = []
            for labelling in current:
                for state in self._states:
                    labellings.append(labelling + [state])

        log_probs = []
        for labelling in labellings:
            labelled_sequence = unlabeled_sequence[:]
            for t, label in enumerate(labelling):
                labelled_sequence[t] = (labelled_sequence[t][_TEXT], label)
            lp = self.log_probability(labelled_sequence)
            log_probs.append(lp)
        normalisation = _log_add(*log_probs)

        #ps = zeros((T, N), float64)
        #for labelling, lp in zip(labellings, log_probs):
            #for t in range(T):
                #ps[t, self._states.index(labelling[t])] += \
                #    2**(lp - normalisation)

        #for t in range(T):
            #print 'prob[%d] =' % t, ps[t]

        entropy = 0
        for lp in log_probs:
            lp -= normalisation
            entropy -= 2**(lp) * lp

        return entropy

    def _exhaustive_point_entropy(self, unlabeled_sequence):
        unlabeled_sequence = self._transform.transform(unlabeled_sequence)

        T = len(unlabeled_sequence)
        N = len(self._states)

        labellings = [[state] for state in self._states]
        for t in range(T - 1):
            current = labellings
            labellings = []
            for labelling in current:
                for state in self._states:
                    labellings.append(labelling + [state])

        log_probs = []
        for labelling in labellings:
            labelled_sequence = unlabeled_sequence[:]
            for t, label in enumerate(labelling):
                labelled_sequence[t] = (labelled_sequence[t][_TEXT], label)
            lp = self.log_probability(labelled_sequence)
            log_probs.append(lp)

        normalisation = _log_add(*log_probs)

        probabilities = zeros((T, N), float64)
        probabilities[:] = _NINF
        for labelling, lp in zip(labellings, log_probs):
            lp -= normalisation
            for t, label in enumerate(labelling):
                index = self._states.index(label)
                probabilities[t, index] = _log_add(probabilities[t, index],
                                                   lp)

        entropies = zeros(T, float64)
        for t in range(T):
            for s in range(N):
                entropies[t] -= 2**(probabilities[t, s]) * \
                                probabilities[t, s]

        return entropies

    def _forward_probability(self, unlabeled_sequence):
        """
        Return the forward probability matrix, a T by N array of
        log-probabilities, where T is the length of the sequence and N is
        the number of states. Each entry (t, s) gives the probability of
        being in state s at time t after observing the partial symbol
        sequence up to and including t.

        :param unlabeled_sequence: the sequence of unlabeled symbols
        :type unlabeled_sequence: list
        :return: the forward log probability matrix
        :rtype: array
        """
        T = len(unlabeled_sequence)
        N = len(self._states)
        alpha = zeros((T, N), float64)

        symbol = unlabeled_sequence[0][_TEXT]
        for i, state in enumerate(self._states):
            alpha[0, i] = self._priors.logprob(state) + \
                          self._outputs[state].logprob(symbol)
        for t in range(1, T):
            symbol = unlabeled_sequence[t][_TEXT]
            for i, si in enumerate(self._states):
                alpha[t, i] = _NINF
                for j, sj in enumerate(self._states):
                    alpha[t, i] = _log_add(alpha[t, i],
                                           alpha[t-1, j] +
                                           self._transitions[sj].logprob(si))
                alpha[t, i] += self._outputs[si].logprob(symbol)

        return alpha
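
    # The recurrence computed above, in base-2 log space, is
    #
    #   alpha[0, i] = log P(s_0 = i) + log P(o_0 | i)
    #   alpha[t, i] = log( sum_j 2**(alpha[t-1, j] + log P(i | j)) ) + log P(o_t | i)
    #
    # so that 2**alpha[t, i] = P(o_0, ..., o_t, s_t = i).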

    def _backward_probability(self, unlabeled_sequence):
        """
        Return the backward probability matrix, a T by N array of
        log-probabilities, where T is the length of the sequence and N is
        the number of states. Each entry (t, s) gives the probability of
        observing the partial symbol sequence from t+1 to T, given that the
        model is in state s at time t.

        :return: the backward log probability matrix
        :rtype: array
        :param unlabeled_sequence: the sequence of unlabeled symbols
        :type unlabeled_sequence: list
        """
        T = len(unlabeled_sequence)
        N = len(self._states)
        beta = zeros((T, N), float64)

        # initialise the backward values
        beta[T-1, :] = log2(1)

        # inductively calculate remaining backward values
        for t in range(T-2, -1, -1):
            symbol = unlabeled_sequence[t+1][_TEXT]
            for i, si in enumerate(self._states):
                beta[t, i] = _NINF
                for j, sj in enumerate(self._states):
                    beta[t, i] = _log_add(beta[t, i],
                                          self._transitions[si].logprob(sj) +
                                          self._outputs[sj].logprob(symbol) +
                                          beta[t + 1, j])
        return beta
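
    # Mirroring the forward pass, 2**beta[t, i] = P(o_{t+1}, ..., o_{T-1} | s_t = i),
    # with beta[T-1, i] = log 1 = 0 and
    #
    #   beta[t, i] = log( sum_j 2**(log P(j | i) + log P(o_{t+1} | j) + beta[t+1, j]) ).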

    def test(self, test_sequence, **kwargs):
        """
        Tests the HiddenMarkovModelTagger instance.

        :param test_sequence: a sequence of labeled test instances
        :type test_sequence: list(list)
        :param verbose: boolean flag indicating whether training should be
            verbose or include printed output
        :type verbose: bool
        """
        def words(sent):
            return [word for (word, tag) in sent]

        def tags(sent):
            return [tag for (word, tag) in sent]

        test_sequence = LazyMap(self._transform.transform, test_sequence)
        predicted_sequence = LazyMap(self._tag, LazyMap(words, test_sequence))

        if kwargs.get('verbose', False):
            # This will be used again later for accuracy so there's no sense
            # in tagging it twice.
            test_sequence = list(test_sequence)
            predicted_sequence = list(predicted_sequence)

            for test_sent, predicted_sent in zip(test_sequence,
                                                 predicted_sequence):
                print('Test:',
                      ' '.join('%s/%s' % (token, tag)
                               for (token, tag) in test_sent))
                print()
                print('Untagged:',
                      ' '.join("%s" % token for (token, tag) in test_sent))
                print()
                print('HMM-tagged:',
                      ' '.join('%s/%s' % (token, tag)
                               for (token, tag) in predicted_sent))
                print()
                print('Entropy:',
                      self.entropy([(token, None) for
                                    (token, tag) in predicted_sent]))
                print()
                print('-' * 60)

        test_tags = LazyConcatenation(LazyMap(tags, test_sequence))
        predicted_tags = LazyConcatenation(LazyMap(tags, predicted_sequence))

        acc = accuracy(test_tags, predicted_tags)
        count = sum(len(sent) for sent in test_sequence)
        print('accuracy over %d tokens: %.2f' % (count, acc * 100))

    def __repr__(self):
        return ('<HiddenMarkovModelTagger %d states and %d output symbols>'
                % (len(self._states), len(self._symbols)))


class HiddenMarkovModelTrainer(object):
    """
    Algorithms for learning HMM parameters from training data. These
    include both supervised learning (MLE) and unsupervised learning
    (Baum-Welch).

    Creates an HMM trainer to induce an HMM with the given states and
    output symbol alphabet. A supervised and unsupervised training
    method may be used. If either of the states or symbols are not given,
    these may be derived from supervised training.

    :param states: the set of state labels
    :type states: sequence of any
    :param symbols: the set of observation symbols
    :type symbols: sequence of any
    """
    def __init__(self, states=None, symbols=None):
        self._states = (states if states else [])
        self._symbols = (symbols if symbols else [])

    def train(self, labelled_sequences=None, unlabeled_sequences=None,
              **kwargs):
        """
        Trains the HMM using both (or either of) supervised and
        unsupervised techniques.

        :return: the trained model
        :rtype: HiddenMarkovModelTagger
        :param labelled_sequences: the supervised training data, a set of
            labelled sequences of observations
        :type labelled_sequences: list
        :param unlabeled_sequences: the unsupervised training data, a set of
            sequences of observations
        :type unlabeled_sequences: list
        :param kwargs: additional arguments to pass to the training methods
        """
        assert labelled_sequences or unlabeled_sequences
        model = None
        if labelled_sequences:
            model = self.train_supervised(labelled_sequences, **kwargs)
        if unlabeled_sequences:
            if model:
                kwargs['model'] = model
            model = self.train_unsupervised(unlabeled_sequences, **kwargs)
        return model

    def train_unsupervised(self, unlabeled_sequences, **kwargs):
        """
        Trains the HMM using the Baum-Welch algorithm to maximise the
        probability of the data sequence. This is a variant of the EM
        algorithm, and is unsupervised in that it doesn't need the state
        sequences for the symbols. The code is based on 'A Tutorial on
        Hidden Markov Models and Selected Applications in Speech
        Recognition', Lawrence Rabiner, IEEE, 1989.

        :return: the trained model
        :rtype: HiddenMarkovModelTagger
        :param unlabeled_sequences: the training data, a set of
            sequences of observations
        :type unlabeled_sequences: list

        kwargs may include the following parameters:

        :param model: a HiddenMarkovModelTagger instance used to begin
            the Baum-Welch algorithm
        :param max_iterations: the maximum number of EM iterations
        :param convergence_logprob: the maximum change in log probability to
            allow convergence
        """
        N = len(self._states)
        M = len(self._symbols)
        symbol_dict = dict((self._symbols[i], i) for i in range(M))

        # create a uniform HMM, which will be iteratively refined, unless
        # given an existing model
        model = kwargs.get('model')
        if not model:
            priors = UniformProbDist(self._states)
            transitions = DictionaryConditionalProbDist(
                dict((state, UniformProbDist(self._states))
                     for state in self._states))
            output = DictionaryConditionalProbDist(
                dict((state, UniformProbDist(self._symbols))
                     for state in self._states))
            model = HiddenMarkovModelTagger(self._symbols, self._states,
                                            transitions, output, priors)

        # update model prob dists so that they can be modified
        model._priors = MutableProbDist(model._priors, self._states)
        model._transitions = DictionaryConditionalProbDist(
            dict((s, MutableProbDist(model._transitions[s], self._states))
                 for s in self._states))
        model._outputs = DictionaryConditionalProbDist(
            dict((s, MutableProbDist(model._outputs[s], self._symbols))
                 for s in self._states))

        # iterate until convergence
        converged = False
        last_logprob = None
        iteration = 0
        max_iterations = kwargs.get('max_iterations', 1000)
        epsilon = kwargs.get('convergence_logprob', 1e-6)
        while not converged and iteration < max_iterations:
            A_numer = ones((N, N), float64) * _NINF
            B_numer = ones((N, M), float64) * _NINF
            A_denom = ones(N, float64) * _NINF
            B_denom = ones(N, float64) * _NINF

            logprob = 0
            for sequence in unlabeled_sequences:
                sequence = list(sequence)
                if not sequence:
                    continue

                # compute forward and backward probabilities
                alpha = model._forward_probability(sequence)
                beta = model._backward_probability(sequence)

                # find the log probability of the sequence
                T = len(sequence)
                lpk = _log_add(*alpha[T-1, :])
                logprob += lpk

                # now update A and B (transition and output probabilities)
                # using the alpha and beta values. Please refer to Rabiner's
                # paper for details, it's too hard to explain in comments
                local_A_numer = ones((N, N), float64) * _NINF
                local_B_numer = ones((N, M), float64) * _NINF
                local_A_denom = ones(N, float64) * _NINF
                local_B_denom = ones(N, float64) * _NINF

                # for each position, accumulate sums for A and B
                for t in range(T):
                    x = sequence[t][_TEXT]  # not found? FIXME
                    if t < T - 1:
                        xnext = sequence[t+1][_TEXT]  # not found? FIXME
                    xi = symbol_dict[x]
                    for i in range(N):
                        si = self._states[i]
                        if t < T - 1:
                            for j in range(N):
                                sj = self._states[j]
                                local_A_numer[i, j] = \
                                    _log_add(local_A_numer[i, j],
                                             alpha[t, i] +
                                             model._transitions[si].logprob(sj) +
                                             model._outputs[sj].logprob(xnext) +
                                             beta[t+1, j])
                            local_A_denom[i] = _log_add(local_A_denom[i],
                                alpha[t, i] + beta[t, i])
                        else:
                            local_B_denom[i] = _log_add(local_A_denom[i],
                                alpha[t, i] + beta[t, i])

                        local_B_numer[i, xi] = _log_add(local_B_numer[i, xi],
                            alpha[t, i] + beta[t, i])

                # add these sums to the global A and B values
                for i in range(N):
                    for j in range(N):
                        A_numer[i, j] = _log_add(A_numer[i, j],
                                                 local_A_numer[i, j] - lpk)
                    for k in range(M):
                        B_numer[i, k] = _log_add(B_numer[i, k],
                                                 local_B_numer[i, k] - lpk)

                    A_denom[i] = _log_add(A_denom[i], local_A_denom[i] - lpk)
                    B_denom[i] = _log_add(B_denom[i], local_B_denom[i] - lpk)

            # use the calculated values to update the transition and output
            # probability values
            for i in range(N):
                si = self._states[i]
                for j in range(N):
                    sj = self._states[j]
                    model._transitions[si].update(sj, A_numer[i, j] -
                                                  A_denom[i])
                for k in range(M):
                    ok = self._symbols[k]
                    model._outputs[si].update(ok, B_numer[i, k] - B_denom[i])
                # Rabiner says the priors don't need to be updated. I don't
                # believe him. FIXME

            # test for convergence
            if iteration > 0 and abs(logprob - last_logprob) < epsilon:
                converged = True

            print('iteration', iteration, 'logprob', logprob)
            iteration += 1
            last_logprob = logprob

        return model
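
    # In Rabiner's notation the accumulators above correspond (up to the
    # 1/P(O) scaling applied via the ``- lpk`` terms) to
    #
    #   A_numer[i, j] ~ sum_t xi_t(i, j)              A_denom[i] ~ sum_{t < T-1} gamma_t(i)
    #   B_numer[i, k] ~ sum_{t : o_t = k} gamma_t(i)  B_denom[i] ~ sum_t gamma_t(i)
    #
    # where gamma_t(i) = P(s_t = i | O) and xi_t(i, j) = P(s_t = i, s_{t+1} = j | O),
    # giving the usual re-estimates a_ij = A_numer / A_denom and
    # b_i(k) = B_numer / B_denom. (At the final time step ``local_B_denom``
    # is seeded from ``local_A_denom`` so that its sum runs over every
    # position 0..T-1.)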

    def train_supervised(self, labelled_sequences, **kwargs):
        """
        Supervised training maximising the joint probability of the symbol
        and state sequences. This is done via collecting frequencies of
        transitions between states, symbol observations while within each
        state and which states start a sentence. These frequency
        distributions are then normalised into probability estimates, which
        can be smoothed if desired.

        :return: the trained model
        :rtype: HiddenMarkovModelTagger
        :param labelled_sequences: the training data, a set of
            labelled sequences of observations
        :type labelled_sequences: list
        :param kwargs: may include an 'estimator' parameter, a function
            taking a FreqDist and a number of bins and returning a
            ProbDistI; otherwise a MLE estimate is used
        """
        # default to the MLE estimate
        estimator = kwargs.get('estimator')
        if estimator is None:
            estimator = lambda fdist, bins: MLEProbDist(fdist)

        # count occurrences of starting states, transitions out of each
        # state and output symbols observed in each state
        starting = FreqDist()
        transitions = ConditionalFreqDist()
        outputs = ConditionalFreqDist()
        for sequence in labelled_sequences:
            lasts = None
            for token in sequence:
                state = token[_TAG]
                symbol = token[_TEXT]
                if lasts is None:
                    starting.inc(state)
                else:
                    transitions[lasts].inc(state)
                outputs[state].inc(symbol)
                lasts = state

                # update the state and symbol lists
                if state not in self._states:
                    self._states.append(state)
                if symbol not in self._symbols:
                    self._symbols.append(symbol)

        # create probability distributions (with smoothing)
        N = len(self._states)
        pi = estimator(starting, N)
        A = ConditionalProbDist(transitions, estimator, N)
        B = ConditionalProbDist(outputs, estimator, len(self._symbols))
        return HiddenMarkovModelTagger(self._symbols, self._states, A, B, pi)
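

# A minimal end-to-end sketch of supervised training (editorial
# illustration; the function name and toy corpus are hypothetical and not
# part of NLTK). It relies only on names already imported or defined above.
def _toy_supervised_demo():
    # two hand-tagged sentences of (word, tag) pairs
    corpus = [[('the', 'DT'), ('dog', 'NN'), ('barks', 'VB')],
              [('the', 'DT'), ('cat', 'NN'), ('sleeps', 'VB')]]
    trainer = HiddenMarkovModelTrainer()
    # Lidstone smoothing (gamma=0.1) keeps unseen (state, symbol) pairs
    # from getting zero probability
    tagger = trainer.train_supervised(
        corpus, estimator=lambda fd, bins: LidstoneProbDist(fd, 0.1, bins))
    print(tagger.tag(['the', 'dog', 'sleeps']))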


class HiddenMarkovModelTaggerTransform(HiddenMarkovModelTaggerTransformI):
    """
    An abstract subclass of HiddenMarkovModelTaggerTransformI.
    """
    def __init__(self):
        if self.__class__ == HiddenMarkovModelTaggerTransform:
            raise NotImplementedError("Abstract classes can't be instantiated")


class LambdaTransform(HiddenMarkovModelTaggerTransform):
    """
    A subclass of HiddenMarkovModelTaggerTransform that is backed by an
    arbitrary user-defined function, instance method, or lambda function.
    """
    def __init__(self, transform):
        """
        :param func: a user-defined or lambda transform function
        :type func: function
        """
        self._transform = transform

    def transform(self, labeled_symbols):
        return self._transform(labeled_symbols)


class IdentityTransform(HiddenMarkovModelTaggerTransform):
    """
    A subclass of HiddenMarkovModelTaggerTransform that implements
    transform() as the identity function, i.e. symbols passed to
    transform() are returned unmodified.
    """
    def transform(self, labeled_symbols):
        return labeled_symbols


def _log_add(*values):
    """
    Adds the logged values, returning the logarithm of the addition.
    """
    x = max(values)
    if x > _NINF:
        sum_diffs = 0
        for value in values:
            sum_diffs += 2**(value - x)
        return x + log2(sum_diffs)
    else:
        return x
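
# ``_log_add`` is the usual log-sum-exp trick in base 2: shifting by the
# maximum keeps each 2**(value - x) term in a representable range. For
# example, _log_add(log2(0.25), log2(0.25)) evaluates to -1.0 == log2(0.5).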


def demo():
    # demonstrates HMM probability calculation

    print()
    print("HMM probability calculation demo")
    print()

    # example taken from page 381, Huang et al
    symbols = ['up', 'down', 'unchanged']
    states = ['bull', 'bear', 'static']

    def pd(values, samples):
        d = {}
        for value, item in zip(values, samples):
            d[item] = value
        return DictionaryProbDist(d)

    def cpd(array, conditions, samples):
        d = {}
        for values, condition in zip(array, conditions):
            d[condition] = pd(values, samples)
        return DictionaryConditionalProbDist(d)

    A = array([[0.6, 0.2, 0.2], [0.5, 0.3, 0.2], [0.4, 0.1, 0.5]], float64)
    A = cpd(A, states, states)
    B = array([[0.7, 0.1, 0.2], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]], float64)
    B = cpd(B, states, symbols)
    pi = array([0.5, 0.2, 0.3], float64)
    pi = pd(pi, states)

    model = HiddenMarkovModelTagger(symbols=symbols, states=states,
                                    transitions=A, outputs=B, priors=pi)

    print('Testing', model)

    for test in [['up', 'up'], ['up', 'down', 'up'],
                 ['down'] * 5, ['unchanged'] * 5 + ['up']]:

        sequence = [(t, None) for t in test]

        print('Testing with state sequence', test)
        print('probability =', model.probability(sequence))
        print('tagging =    ', model.tag([word for (word, tag) in sequence]))
        print('p(tagged) =  ', model.probability(sequence))
        print('H =          ', model.entropy(sequence))
        print('H_exh =      ', model._exhaustive_entropy(sequence))
        print('H(point) =   ', model.point_entropy(sequence))
        print('H_exh(point)=', model._exhaustive_point_entropy(sequence))
        print()


def load_pos(num_sents):
    from nltk.corpus import brown

    sentences = brown.tagged_sents(categories='news')[:num_sents]

    tag_re = re.compile(r'[*]|--|[^+*-]+')
    tag_set = set()
    symbols = set()

    cleaned_sentences = []
    for sentence in sentences:
        for i in range(len(sentence)):
            word, tag = sentence[i]
            word = word.lower()  # normalize
            symbols.add(word)    # log this word
            # Clean up the tag.
            tag = tag_re.match(tag).group()
            tag_set.add(tag)
            sentence[i] = (word, tag)  # store cleaned-up tagged token
        cleaned_sentences += [sentence]

    return cleaned_sentences, list(tag_set), list(symbols)


def demo_pos():
    # demonstrates POS tagging using supervised training

    print()
    print("HMM POS tagging demo")
    print()

    print('Training HMM...')
    labelled_sequences, tag_set, symbols = load_pos(200)
    trainer = HiddenMarkovModelTrainer(tag_set, symbols)
    hmm = trainer.train_supervised(labelled_sequences[10:],
        estimator=lambda fd, bins: LidstoneProbDist(fd, 0.1, bins))

    print('Testing...')
    hmm.test(labelled_sequences[:10], verbose=True)


def _untag(sentences):
    unlabeled = []
    for sentence in sentences:
        # use a list (not a one-shot generator) so the result can be
        # iterated over in every Baum-Welch iteration
        unlabeled.append([(token[_TEXT], None) for token in sentence])
    return unlabeled


def demo_pos_bw():
    # demonstrates the Baum-Welch algorithm in POS tagging

    print()
    print("Baum-Welch demo for POS tagging")
    print()

    print('Training HMM (supervised)...')

    sentences, tag_set, symbols = load_pos(210)
    symbols = set()
    for sentence in sentences:
        for token in sentence:
            symbols.add(token[_TEXT])

    trainer = HiddenMarkovModelTrainer(tag_set, list(symbols))
    hmm = trainer.train_supervised(sentences[10:200],
        estimator=lambda fd, bins: LidstoneProbDist(fd, 0.1, bins))

    print('Training (unsupervised)...')
    # it's rather slow - so only use 10 samples
    unlabeled = _untag(sentences[200:210])
    hmm = trainer.train_unsupervised(unlabeled, model=hmm, max_iterations=5)
    hmm.test(sentences[:10], verbose=True)


def demo_bw():
    # demo Baum-Welch by generating some sequences and then performing
    # unsupervised training on them

    print()
    print("Baum-Welch demo for market example")
    print()

    # example taken from page 381, Huang et al
    symbols = ['up', 'down', 'unchanged']
    states = ['bull', 'bear', 'static']

    def pd(values, samples):
        d = {}
        for value, item in zip(values, samples):
            d[item] = value
        return DictionaryProbDist(d)

    def cpd(array, conditions, samples):
        d = {}
        for values, condition in zip(array, conditions):
            d[condition] = pd(values, samples)
        return DictionaryConditionalProbDist(d)

    A = array([[0.6, 0.2, 0.2], [0.5, 0.3, 0.2], [0.4, 0.1, 0.5]], float64)
    A = cpd(A, states, states)
    B = array([[0.7, 0.1, 0.2], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]], float64)
    B = cpd(B, states, symbols)
    pi = array([0.5, 0.2, 0.3], flo
