\def\bs{$\backslash$}
\documentclass{worksheet}
\usepackage{url}
\title{Practical NLP using Python: Worksheet 1}
\heading{Trevor Cohn \& Steven Bird}{ALTA Summer School 2003}
\begin{document}
\maketitle

Python and the Natural Language Toolkit (NLTK) are installed on
the workstations in the UG10 laboratory. Log in to the guest
account (username=pcguest, password=ALTss\&w3, logon-to=CSSE). Start
IDLE, Python's interactive environment, from the Start menu, Programs
submenu, Python 2.3 submenu.
If you wish to install Python and NLTK on your own machine, you will
find distributions on the ALTA CD. These may also be downloaded from
\texttt{http://www.python.org} and
\texttt{http://nltk.sourceforge.net}.

\section*{Basic Python}

Introductory material on Python is available at
\url{http://www.python.org/doc/Intros.html}.
The following examples concern lists and strings. Please try them
out, and experiment with variations. You will need to do this from inside IDLE:
\begin{verbatim}
>>> a = ['colourless', 'green', 'ideas']
>>> a
['colourless', 'green', 'ideas']
>>> len(a)
3
>>> a[1]
'green'
>>> a[1:]
['green', 'ideas']
>>> b = a + ['sleep', 'furiously']
>>> b
['colourless', 'green', 'ideas', 'sleep', 'furiously']
>>> b[-1]
'furiously'
>>> b.sort()
>>> b
['colourless', 'furiously', 'green', 'ideas', 'sleep']
>>> b.reverse()
>>> b
['sleep', 'ideas', 'green', 'furiously', 'colourless']
>>> b[2] + b[1]
'greenideas'
>>> for w in b:
...     print w[0]
...
s
i
g
f
c
>>> b[2][1]
'r'
>>> b.index('green')
2
>>> b[5]
IndexError: list index out of range
>>> b[0] * 3
'sleepsleepsleep'
>>> c = ' '.join(b)
>>> c
'sleep ideas green furiously colourless'
>>> c.split('r')
['sleep ideas g', 'een fu', 'iously colou', 'less']
>>> map(lambda x: len(x), b)
[5, 5, 5, 9, 10]
>>> [(x, len(x)) for x in b]
[('sleep', 5), ('ideas', 5), ('green', 5), ('furiously', 9), ('colourless', 10)]
\end{verbatim}
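If you would like a few more things to try, the operations below are all
standard Python; the values shown assume \texttt{b} still holds the
reversed list from above.
\begin{verbatim}
>>> 'ideas' in b
True
>>> b[::2]
['sleep', 'green', 'colourless']
>>> 'colourless'.count('l')
2
>>> '-'.join(b[:3])
'sleep-ideas-green'
\end{verbatim}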
Experiment further with lists and strings until you are comfortable with using them.

Next we'll take a look at Python ``dictionaries'' (or associative arrays).
\begin{verbatim}
>>> d = {}
>>> d['colourless'] = 'adj'
>>> d['furiously'] = 'adv'
>>> d['ideas'] = 'n'
>>> d.keys()
['furiously', 'colourless', 'ideas']
>>> d.values()
['adv', 'adj', 'n']
>>> d
{'furiously': 'adv', 'colourless': 'adj', 'ideas': 'n'}
>>> d.has_key('ideas')
True
>>> for w in d.keys():
...     print "%s [%s]," % (w, d[w]),
...
furiously [adv], colourless [adj], ideas [n],
>>>
\end{verbatim}
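Dictionaries are also handy for simple counting. The short loop below
(an illustrative sketch, not one of the set exercises) tallies how often
each word occurs in a list, using the standard \texttt{get()} method to
supply a default count of zero:
\begin{verbatim}
>>> counts = {}
>>> for w in ['green', 'ideas', 'green', 'sleep']:
...     counts[w] = counts.get(w, 0) + 1
...
>>> counts['green']
2
>>> counts['sleep']
1
\end{verbatim}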
Finally, try out Python's regular expression module for substituting and
searching within strings.
\begin{verbatim}
>>> import re
>>> s = "colourless green ideas sleep furiously"
>>> re.sub('l', 's', s)
'cosoursess green ideas sseep furioussy'
>>> re.sub('green', 'red', s)
'colourless red ideas sleep furiously'
>>> re.findall('[^aeiou][aeiou]', s)
['co', 'lo', 'le', 're', ' i', 'de', 'le', 'fu', 'ri']
>>> re.findall('([^aeiou])([aeiou])', s)
[('c', 'o'), ('l', 'o'), ('l', 'e'), ('r', 'e'), (' ', 'i'), ('d', 'e'),
('l', 'e'), ('f', 'u'), ('r', 'i')]
>>> re.findall(r'(.).\1', s)
['o', 's']
>>>
\end{verbatim}
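When you only need the first match, \texttt{re.search} is useful: it
returns a match object (or \texttt{None}) from which the matched text and
any groups can be read. For example:
\begin{verbatim}
>>> m = re.search('(\w+)ly', s)
>>> m.group()
'furiously'
>>> m.group(1)
'furious'
>>> print re.search('purple', s)
None
\end{verbatim}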
Please see the NLTK regular expression tutorial, or the
Python documentation, for further information on regular expressions:\\
\url{http://nltk.sf.net/tutorial/regexps/}\\
\url{http://www.python.org/doc/current/lib/module-re.html}.

\pagebreak
\section*{Using NLTK}

NLTK is installed on the workstations and accessible from IDLE.
During these exercises you may find it helpful to consult the online
documentation at \url{http://nltk.sf.net/docs.html}.

In order to use NLTK, you will need to import the appropriate
symbols. For these problems, the following commands (in IDLE) will
suffice:
\begin{verbatim}
>>> from nltk.tokenizer import *
>>> from nltk.tagger import *
>>> from nltk.corpus import *
>>> from nltk.probability import *
\end{verbatim}
\section*{Tokenization}

NLTK Tokenizers convert a string into a list of \texttt{Token}s.
\begin{enumerate}
\item Try creating some tokens using the built-in whitespace tokenizer:
\begin{verbatim}
>>> ws = WSTokenizer()
>>> tokens = ws.tokenize('My dog likes your dog')
>>> tokens
['My'@[0w], 'dog'@[1w], 'likes'@[2w], 'your'@[3w], 'dog'@[4w]]
\end{verbatim}
Extract the third token's type and location using the
\texttt{type()} and \texttt{loc()} methods.
\item Next, use the corpus module to extract some tokenized data.
\begin{verbatim}
>>> gutenberg.items()
['austen-emma.txt', 'bible-kjv.txt', ..., 'chesterton-thursday.txt', ...]
>>> ttext = gutenberg.tokenize('chesterton-thursday.txt')
\end{verbatim}
The \texttt{ttext} variable contains the entire novel as a list of tokens.
Extract tokens 2631-2643.
\item The regular expression tokenizer \texttt{RETokenizer()} was
discussed in the lecture. Provide a regular expression to match `words'
containing punctuation (one candidate pattern is sketched just after this
list).
\begin{verbatim}
>>> t = RETokenizer('your regular expression goes here')
>>> t.tokenize("OK, we'll email $20.95 payment to Amazon.com.")
\end{verbatim}
\item Try out your tokenizer on the header of a Gutenberg corpus file,
which contains a lot of punctuation. How many tokens are there? Do you
get the same number as other people?
\begin{verbatim}
>>> text = gutenberg.read('chesterton-thursday.txt')[:1200]
>>> ttext = t.tokenize(text)
>>> len(ttext)
\end{verbatim}
\end{enumerate}
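For the regular expression tokenizer exercise above, one candidate pattern
(just a sketch, and certainly not the only reasonable answer) keeps
internal punctuation such as the full stop, the apostrophe and the dollar
sign inside a token, and treats any other punctuation character as a token
of its own. It is checked here with the plain \texttt{re} module; the same
pattern string can be passed to \texttt{RETokenizer}, though you should
confirm that the tokenizer treats it the same way.
\begin{verbatim}
>>> pattern = r"[\w$]+(?:[.'][\w$]+)*|[^\w\s]"
>>> re.findall(pattern, "OK, we'll email $20.95 payment to Amazon.com.")
['OK', ',', "we'll", 'email', '$20.95', 'payment', 'to', 'Amazon.com', '.']
>>> t = RETokenizer(pattern)
\end{verbatim}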
\section*{Part-of-speech tagging}

Tokens are often ``tagged'' with additional information, such as their
part-of-speech. The \texttt{TaggedType} class is used to associate tags with
word types.
\begin{enumerate}
\item Try creating a few tagged types, and embedding them inside tokens:
\begin{verbatim}
>>> chair_type = TaggedType('chair', 'NN')
>>> chair_type
'chair'/'NN'
>>> chair_token = Token(chair_type, Location(1, unit='w'))
>>> chair_token
'chair'/'NN'@[1w]
\end{verbatim}
Extract the token's type, and the type's tag and base, using the token's
\texttt{type()} method and the tagged type's \texttt{tag()} and
\texttt{base()} methods.
\item Use the \texttt{TaggedTokenizer} to tokenize a tagged sentence into a
list of tagged tokens. The input is a string of the form:
\begin{verbatim}
>>> input = 'I/NP saw/VB a/DT man/NN'
\end{verbatim}
You will need to use the tokenizer's \texttt{tokenize} method.
\item Use the corpus module to extract some tagged data. Both the Brown and
Penn Treebank corpora contain tagged text. These corpora can be accessed
using \texttt{brown} and \texttt{treebank}. Each corpus is divided into
groups and items. Items are the logical units, usually files, into which
the corpus has been split. Groups are logical groupings or subdivisions of
the corpus, corresponding for instance to different sources, genres or
markup. The items may be listed exhaustively, or limited to only those
belonging to a given group:
\begin{verbatim}
>>> brown
<Corpus: brown (contains 500 items; 15 groups)>
>>> brown.groups()
['skill and hobbies', 'humor', 'popular lore', 'fiction: mystery',
'belles-lettres', ...
>>> brown.items('popular lore')
('cf01', 'cf02', 'cf03', 'cf04', 'cf05', 'cf06', 'cf07', 'cf08',
'cf09', 'cf10', ...
>>> brown.read('cf04')
'\n\n\t``/`` The/at food/nn is/bez wonderful/jj and/cc
it/pps ...
>>> brown.tokenize('cf04')[:10]
['``'/'``'@[0w], 'The'/'AT'@[1w], 'food'/'NN'@[2w], 'is'/'BEZ'@[3w],
...
\end{verbatim}
Try extracting some tagged text from other items and groups of the Brown
corpus and the Penn Treebank. You will need to use the \texttt{'tagged'}
group of the Treebank. The tag sets used differ between the two
corpora. See
\texttt{http://www.scs.leeds.ac.uk/amalgam/tagsets/brown.html} and
\texttt{http://www.scs.leeds.ac.uk/amalgam/tagsets/upenn.html} for
descriptions of the tag sets.
\item Use the \texttt{NN\_CD\_Tagger} to tag a sequence of tokens. First
extract some tagged text, remove all tags using the \texttt{untag()}
function, then apply the tagger. Tagging accuracy can be measured using
the \texttt{accuracy()} function:
\begin{verbatim}
>>> tagged_tokens = brown.tokenize('cf04')
>>> untagged_tokens = untag(tagged_tokens)
>>> nn_cd_tagger = NN_CD_Tagger()
>>> tagged_tokens[205:209]
['home'/'NN'@[205w], 'to'/'IN'@[206w], '60'/'CD'@[207w], 'children'/'NNS'@[208w]]
>>> nn_cd_tagger.tag(untagged_tokens[205:209])
['home'/'NN'@[205w], 'to'/'NN'@[206w], '60'/'CD'@[207w], 'children'/'NN'@[208w]]
>>> retagged_tokens = nn_cd_tagger.tag(untagged_tokens)
>>> accuracy(tagged_tokens, retagged_tokens)
0.17139061116031887
\end{verbatim}
Inspect the output (\texttt{retagged\_tokens}) by hand, comparing it to
the original in order to see what kind of errors were made.
\item Use the \texttt{UnigramTagger} and \texttt{NthOrderTagger} with varying
order (2 or more) on the same data. These taggers need to be trained in
order to initialise their probability estimates. It is best to train on
different data from that used for testing, so we'll use a different item:
\begin{verbatim}
>>> training_tokens = brown.tokenize('cf01')
>>> unigram = UnigramTagger()
>>> unigram.train(training_tokens)
>>> retagged_tokens = unigram.tag(untagged_tokens)
\end{verbatim}
This tagger doesn't perform very well, as it hasn't seen much training data
and thus its probability estimates are quite biased. See whether, and by how
much, the accuracy can be improved by increasing the amount of training data
(while ensuring that you're not using the training data for testing).
\item You may have noticed that the \texttt{NthOrderTagger} failed miserably
for high orders, whereas the \texttt{UnigramTagger} was quite robust. Why do
you think this happens?
Use a \texttt{BackoffTagger} with a unigram tagger and an
\texttt{NN\_CD\_Tagger}, as shown below. Add some higher-order taggers
(e.g.\ second or third order) to the start of the list of taggers. Does
performance improve?
\begin{verbatim}
>>> backoff = BackoffTagger([unigram, nn_cd_tagger])
>>> retagged = backoff.tag(untagged_tokens)
\end{verbatim}
\item Find the 10 most common tags in a group of items of the Brown corpus.
Use the \texttt{FreqDist} class to count the number of instances of each
tag, calling the \texttt{inc()} method on the tag of each token as it is
processed. You may need to refer to the lecture slides and the NLTK
documentation on the \texttt{FreqDist} class. (A rough sketch of one
possible approach is given just after this list.)
\end{enumerate}
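For the last exercise, a rough sketch of one possible approach is shown
below: it tallies the tag of every token in one group of the Brown corpus
into a \texttt{FreqDist} via \texttt{inc()}, then sorts (count, tag) pairs
by hand. The \texttt{samples()} and \texttt{count()} accessors used to read
the counts back out are assumptions based on the NLTK documentation, so
check the \texttt{FreqDist} documentation and adjust the names if they
differ there.
\begin{verbatim}
>>> freq_dist = FreqDist()
>>> for item in brown.items('humor'):
...     for token in brown.tokenize(item):
...         freq_dist.inc(token.type().tag())
...
>>> # samples() and count() are assumed accessor names; see the docs.
>>> pairs = [(freq_dist.count(tag), tag) for tag in freq_dist.samples()]
>>> pairs.sort()
>>> pairs.reverse()
>>> pairs[:10]
\end{verbatim}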

Note: NLTK is installed in the directories
\texttt{C:{\bs}python23{\bs}site-packages{\bs}nltk} and
\texttt{C:{\bs}python23{\bs}nltk}. The first contains the NLTK source
code, and the second contains the corpus data and documentation.

\end{document}