\documentclass{worksheet}
\usepackage{url}
\def\bs{$\backslash$}
\title{Practical NLP using Python: Worksheet 1}
\heading{Trevor Cohn \& Steven Bird}{ALTA Summer School 2003}
\begin{document}
\maketitle
Python and the Natural Language Toolkit (NLTK) are installed on
the workstations in the UG10 laboratory. Log in to the guest
account (username=pcguest, password=ALTss\&w3, logon-to=CSSE). Start
IDLE, the Python interactive interpreter, from the Start menu,
Programs submenu, Python 2.3 submenu.

If you wish to install Python and NLTK on your own machine, you will
find distributions on the ALTA CD. These may also be downloaded from
\url{http://www.python.org} and
\url{http://nltk.sourceforge.net}.
\section*{Basic Python}

Introductory material on Python is available at
\url{http://www.python.org/doc/Intros.html}.
The following examples concern lists and strings. Please try them
out, and experiment with variations. You will need to do this from
inside IDLE:
\begin{verbatim}
>>> a = ['colourless', 'green', 'ideas']
>>> a
['colourless', 'green', 'ideas']
>>> len(a)
3
>>> a[1]
'green'
>>> a[1:]
['green', 'ideas']
>>> b = a + ['sleep', 'furiously']
>>> b
['colourless', 'green', 'ideas', 'sleep', 'furiously']
>>> b[-1]
'furiously'
>>> b.sort()
>>> b
['colourless', 'furiously', 'green', 'ideas', 'sleep']
>>> b.reverse()
>>> b
['sleep', 'ideas', 'green', 'furiously', 'colourless']
>>> b[2] + b[1]
'greenideas'
>>> for w in b:
...     print w[0]
...
s
i
g
f
c
>>> b[2][1]
'r'
>>> b.index('green')
2
>>> b[5]
IndexError: list index out of range
>>> b[0] * 3
'sleepsleepsleep'
>>> c = ' '.join(b)
>>> c
'sleep ideas green furiously colourless'
>>> c.split('r')
['sleep ideas g', 'een fu', 'iously colou', 'less']
>>> map(lambda x: len(x), b)
[5, 5, 5, 9, 10]
>>> [(x, len(x)) for x in b]
[('sleep', 5), ('ideas', 5), ('green', 5), ('furiously', 9), ('colourless', 10)]
\end{verbatim}
Experiment further with lists and strings until you are comfortable
with using them.

Next we'll take a look at Python ``dictionaries'' (or associative arrays).
\begin{verbatim}
>>> d = {}
>>> d['colourless'] = 'adj'
>>> d['furiously'] = 'adv'
>>> d['ideas'] = 'n'
>>> d.keys()
['furiously', 'colourless', 'ideas']
>>> d.values()
['adv', 'adj', 'n']
>>> d
{'furiously': 'adv', 'colourless': 'adj', 'ideas': 'n'}
>>> d.has_key('ideas')
True
>>> for w in d.keys():
...     print "%s [%s]," % (w, d[w]),
...
furiously [adv], colourless [adj], ideas [n],
>>>
\end{verbatim}
Finally, try out Python's regular expression module, for substituting
and searching within strings.
\begin{verbatim}
>>> import re
>>> s = "colourless green ideas sleep furiously"
>>> re.sub('l', 's', s)
'cosoursess green ideas sseep furioussy'
>>> re.sub('green', 'red', s)
'colourless red ideas sleep furiously'
>>> re.findall('[^aeiou][aeiou]', s)
['co', 'lo', 'le', 're', ' i', 'de', 'le', 'fu', 'ri']
>>> re.findall('([^aeiou])([aeiou])', s)
[('c', 'o'), ('l', 'o'), ('l', 'e'), ('r', 'e'), (' ', 'i'), ('d', 'e'),
('l', 'e'), ('f', 'u'), ('r', 'i')]
>>> re.findall(r'(.).\1', s)
['o', 's']
>>>
\end{verbatim}
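The examples above cover substitution (\texttt{re.sub}) and finding all
matches (\texttt{re.findall}); for searching, \texttt{re.search} returns
a match object for the leftmost match. For example:
\begin{verbatim}
>>> m = re.search('i.*s', s)
>>> m.group()
'ideas sleep furious'
\end{verbatim}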
Please see the NLTK regular expression tutorial, or the
Python documentation, for further information on regular expressions:\\
\url{http://nltk.sf.net/tutorial/regexps/}\\
\url{http://www.python.org/doc/current/lib/module-re.html}.
\pagebreak
\section*{Using NLTK}

NLTK is installed on the workstations and accessible from IDLE.
During these exercises you may find it helpful to consult the online
documentation at \url{http://nltk.sf.net/docs.html}.

In order to use NLTK, you will need to import the appropriate
symbols. For these problems, the following commands (in IDLE) will
suffice:
\begin{verbatim}
>>> from nltk.tokenizer import *
>>> from nltk.tagger import *
>>> from nltk.corpus import *
>>> from nltk.probability import *
\end{verbatim}
\section*{Tokenization}

NLTK tokenizers convert a string into a list of \texttt{Token}s.
\begin{enumerate}
\item Try creating some tokens using the built-in whitespace tokenizer:
\begin{verbatim}
>>> ws = WSTokenizer()
>>> tokens = ws.tokenize('My dog likes your dog')
>>> tokens
['My'@[0w], 'dog'@[1w], 'likes'@[2w], 'your'@[3w], 'dog'@[4w]]
\end{verbatim}
Extract the third token's type and location using the
\texttt{type()} and \texttt{loc()} methods.
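As a sketch (the exact printed form of the location may differ from
what is shown here):
\begin{verbatim}
>>> tokens[2].type()
'likes'
>>> tokens[2].loc()
@[2w]
\end{verbatim}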
\item Next, use the corpus module to extract some tokenized data.
\begin{verbatim}
>>> gutenberg.items()
['austen-emma.txt', 'bible-kjv.txt', ..., 'chesterton-thursday.txt', ...]
>>> ttext = gutenberg.tokenize('chesterton-thursday.txt')
\end{verbatim}
The \texttt{ttext} variable contains the entire novel as a list of tokens.
Extract tokens 2631--2643.
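One way to do this is with list slicing; note that Python slices
exclude the end index, so the following covers tokens 2631--2643
(output omitted):
\begin{verbatim}
>>> ttext[2631:2644]
\end{verbatim}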
\item The regular expression tokenizer \texttt{RETokenizer()} was
discussed in the lecture. Provide a regular expression to match `words'
containing punctuation.
\begin{verbatim}
>>> t = RETokenizer('your regular expression goes here')
>>> t.tokenize("OK, we'll email $20.95 payment to Amazon.com.")
\end{verbatim}
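As a starting point, here is one possible (imperfect) expression, which
keeps currency amounts whole, matches alphanumeric words, and treats
runs of other punctuation as tokens; you will probably want to refine
it:
\begin{verbatim}
>>> t = RETokenizer(r'\$[0-9.]+|\w+|[^\w\s]+')
>>> t.tokenize("OK, we'll email $20.95 payment to Amazon.com.")
\end{verbatim}
Try it and inspect the output: does it split \texttt{Amazon.com} the
way you would want?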
\item Try out your tokenizer on the header of a Gutenberg corpus file,
which contains a lot of punctuation. How many tokens are there? Is it
the same as what other people get?
\begin{verbatim}
>>> text = gutenberg.read('chesterton-thursday.txt')[:1200]
>>> ttext = t.tokenize(text)
>>> len(ttext)
\end{verbatim}
\end{enumerate}
\section*{Part-of-speech tagging}

Tokens are often ``tagged'' with additional information, such as their
part-of-speech. The \texttt{TaggedType} class is used to associate tags
with word types.
\begin{enumerate}
\item Try creating a few tagged types, and embedding them inside tokens:
\begin{verbatim}
>>> chair_type = TaggedType('chair', 'NN')
>>> chair_type
'chair'/'NN'
>>> chair_token = Token(chair_type, Location(1, unit='w'))
>>> chair_token
'chair'/'NN'@[1w]
\end{verbatim}
Extract the token's type, and the type's tag and base, using the
token's \texttt{type()} method and the tagged type's \texttt{tag()} and
\texttt{base()} methods.
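Based on the methods named above, the calls should look roughly like
this:
\begin{verbatim}
>>> chair_token.type()
'chair'/'NN'
>>> chair_token.type().tag()
'NN'
>>> chair_token.type().base()
'chair'
\end{verbatim}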
\item Use the \texttt{TaggedTokenizer} to tokenize a tagged sentence into a
list of tagged tokens. The input is a string of the form:
\begin{verbatim}
>>> input = 'I/NP saw/VB a/DT man/NN'
\end{verbatim}
You will need to use the tokenizer's \texttt{tokenize} method.
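A sketch, assuming the tokenizer needs no constructor arguments:
\begin{verbatim}
>>> tt = TaggedTokenizer()
>>> tt.tokenize(input)
['I'/'NP'@[0w], 'saw'/'VB'@[1w], 'a'/'DT'@[2w], 'man'/'NN'@[3w]]
\end{verbatim}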
\item Use the corpus module to extract some tagged data. Both the Brown and
Penn Treebank corpora contain tagged text. These corpora can be accessed
using \texttt{brown} and \texttt{treebank}. Each corpus is divided into
groups and items. Items are the logical units, usually files, into which
the corpus has been split. Groups are logical subdivisions of the corpus,
corresponding, for instance, to different sources, genres or markup. The
items may be listed exhaustively, or limited to only those belonging to a
given group:
\begin{verbatim}
>>> brown
<Corpus: brown (contains 500 items; 15 groups)>
>>> brown.groups()
['skill and hobbies', 'humor', 'popular lore', 'fiction: mystery',
'belles-lettres', ...
>>> brown.items('popular lore')
('cf01', 'cf02', 'cf03', 'cf04', 'cf05', 'cf06', 'cf07', 'cf08',
'cf09', 'cf10', ...
>>> brown.read('cf04')
'\n\n\t``/`` The/at food/nn is/bez wonderful/jj and/cc
it/pps ...
>>> brown.tokenize('cf04')[:10]
['``'/'``'@[0w], 'The'/'AT'@[1w], 'food'/'NN'@[2w], 'is'/'BEZ'@[3w],
...
\end{verbatim}
Try extracting some tagged text from other items and groups of the Brown
corpus and the Penn Treebank. You will need to use the \texttt{'tagged'}
group of the Treebank. The tag sets used differ between the two
corpora. See
\url{http://www.scs.leeds.ac.uk/amalgam/tagsets/brown.html} and
\url{http://www.scs.leeds.ac.uk/amalgam/tagsets/upenn.html} for
descriptions of the tag sets.
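For example, you might explore the Treebank's tagged material along
these lines (output omitted, and the exact calls may need adjusting):
\begin{verbatim}
>>> treebank.groups()
>>> treebank.items('tagged')
>>> treebank.tokenize(treebank.items('tagged')[0])[:10]
\end{verbatim}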
\item Use the \texttt{NN\_CD\_Tagger} to tag a sequence of tokens. First
extract some tagged text, remove all tags using the \texttt{untag()}
method, then apply the tagger. Tagging accuracy can be measured using
the \texttt{accuracy()} method:
\begin{verbatim}
>>> tagged_tokens = brown.tokenize('cf04')
>>> untagged_tokens = untag(tagged_tokens)
>>> nn_cd_tagger = NN_CD_Tagger()
>>> tagged_tokens[205:209]
['home'/'NN'@[205w], 'to'/'IN'@[206w], '60'/'CD'@[207w], 'children'/'NNS'@[208w]]
>>> nn_cd_tagger.tag(untagged_tokens[205:209])
['home'/'NN'@[205w], 'to'/'NN'@[206w], '60'/'CD'@[207w], 'children'/'NN'@[208w]]
>>> retagged_tokens = nn_cd_tagger.tag(untagged_tokens)
>>> accuracy(tagged_tokens, retagged_tokens)
0.17139061116031887
\end{verbatim}
Inspect the output (\texttt{retagged\_tokens}) by hand, comparing it to
the original in order to see what kind of errors were made.
\item Use the \texttt{UnigramTagger} and \texttt{NthOrderTagger} with varying
order (2 or more) on the same data. These taggers need to be trained in
order to initialise their probability estimates. It is best to train on
different data from that used for testing, hence we'll train on a
different item:
\begin{verbatim}
>>> training_tokens = brown.tokenize('cf01')
>>> unigram = UnigramTagger()
>>> unigram.train(training_tokens)
>>> retagged_tokens = unigram.tag(untagged_tokens)
\end{verbatim}
This tagger doesn't perform very well, as it hasn't seen much training
data and thus its probability estimates are quite biased. See if, and by
how much, the accuracy can be improved by increasing the amount of
training data (while ensuring that you're not using the training data
for testing).
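One simple way to enlarge the training set is to concatenate the token
lists from several items, keeping \texttt{cf04} (our test item) out of
the training data; accuracy output is omitted here:
\begin{verbatim}
>>> training_tokens = []
>>> for item in ['cf01', 'cf02', 'cf03']:
...     training_tokens += brown.tokenize(item)
...
>>> unigram = UnigramTagger()
>>> unigram.train(training_tokens)
>>> accuracy(tagged_tokens, unigram.tag(untagged_tokens))
\end{verbatim}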
\item You may have noticed that the \texttt{NthOrderTagger} failed miserably
for high orders, where the \texttt{UnigramTagger} was quite robust. Why do
you think this happens?

Use a \texttt{BackoffTagger} with a unigram tagger and an
\texttt{NN\_CD\_Tagger}, as shown below. Add some higher order taggers
(e.g.\ second or third order) to the start of the list of taggers. Does
performance improve?
\begin{verbatim}
>>> backoff = BackoffTagger([unigram, nn_cd_tagger])
>>> retagged = backoff.tag(untagged_tokens)
\end{verbatim}
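To add a higher order tagger to the front of the list, assuming the
order is passed to the \texttt{NthOrderTagger} constructor:
\begin{verbatim}
>>> second = NthOrderTagger(2)
>>> second.train(training_tokens)
>>> backoff = BackoffTagger([second, unigram, nn_cd_tagger])
>>> retagged = backoff.tag(untagged_tokens)
\end{verbatim}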
\item Find the 10 most common tags in a group of items of the Brown corpus.
Use the \texttt{FreqDist} class to count the number of instances of each
tag, calling the \texttt{inc()} method with the tag of each token as it
is processed. You may need to refer to the lecture slides and the NLTK
documentation on the \texttt{FreqDist} class.
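As a sketch, assuming \texttt{FreqDist} also provides \texttt{samples()}
and \texttt{count()} methods (check the documentation for the exact
names):
\begin{verbatim}
>>> fd = FreqDist()
>>> for item in brown.items('humor'):
...     for token in brown.tokenize(item):
...         fd.inc(token.type().tag())
...
>>> counts = [(fd.count(tag), tag) for tag in fd.samples()]
>>> counts.sort()
>>> counts.reverse()
>>> counts[:10]
\end{verbatim}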
\end{enumerate}
Note: NLTK is installed in the directories
\texttt{C:{\bs}python23{\bs}site-packages{\bs}nltk} and
\texttt{C:{\bs}python23{\bs}nltk}. The former contains the NLTK source
code, and the latter contains the corpus data and documentation.

\end{document}