Module for extracting tags from text documents.

Copyright (C) 2011 by Alessandro Presta

Configuration
=============

Dependencies:
python2.7, stemming, nltk (optional), lxml (optional)

You can install the stemming package with::

    $ easy_install stemming

Usage
=====

Tagging a text document from Python::

    import pickle
    import tagger

    weights = pickle.load(open('data/dict.pkl', 'rb')) # or your own dictionary
    myreader = tagger.Reader() # or your own reader class
    mystemmer = tagger.Stemmer() # or your own stemmer class
    myrater = tagger.Rater(weights) # or your own... (you get the idea)
    mytagger = tagger.Tagger(myreader, mystemmer, myrater)
    best_3_tags = mytagger(text_string, 3)

Running the module as a script::

    $ ./tagger.py <text document(s) to tag>

For example::

    $ ./tagger.py tests/*
    Loading dictionary...
    Tags for  tests/bbc1.txt :
    ['bin laden', 'obama', 'pakistan', 'killed', 'raid']
    Tags for  tests/bbc2.txt :
    ['jo yeates', 'bristol', 'vincent tabak', 'murder', 'strangled']
    Tags for  tests/bbc3.txt :
    ['snp', 'party', 'election', 'scottish', 'labour']
    Tags for  tests/guardian1.txt :
    ['bin laden', 'al-qaida', 'killed', 'pakistan', 'al-fawwaz']
    Tags for  tests/guardian2.txt :
    ['clegg', 'tory', 'lib dem', 'party', 'coalition']
    Tags for  tests/post1.txt :
    ['sony', 'stolen', 'playstation network', 'hacker attack', 'lawsuit']
    Tags for  tests/wikipedia1.txt :
    ['universe', 'anthropic principle', 'observed', 'cosmological', 'theory']
    Tags for  tests/wikipedia2.txt :
    ['beetroot', 'beet', 'betaine', 'blood pressure', 'dietary nitrate']
    Tags for  tests/wikipedia3.txt :
    ['the lounge lizards', 'jazz', 'john lurie', 'musical', 'albums']

A brief explanation
===================

Extracting tags from a text document involves at least three steps: splitting the document into words, grouping together variants of the same word, and ranking them according to their relevance.
These three tasks are carried out respectively by the **Reader**, **Stemmer** and **Rater** classes, and their work is put together by the **Tagger** class.

A **Reader** object may accept as input a document in some format, perform some normalisation of the text (such as turning everything into lower case), analyse the structure of the phrases and punctuation, and return a list of words that respects their order in the text, perhaps with some additional information such as which ones look like proper nouns, or occur at the end of a phrase.
A very straightforward way of doing this is to match all the words with a regular expression, and this is indeed what the **SimpleReader** class does.
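
A toy illustration of that idea (hypothetical code, not the actual **SimpleReader** implementation):

```python
import re

def extract_words(text):
    # lower-case the text and match runs of letters and apostrophes,
    # preserving their order of appearance in the document
    return re.findall(r"[a-z']+", text.lower())

print(extract_words("Bin Laden was killed in a raid."))
# ['bin', 'laden', 'was', 'killed', 'in', 'a', 'raid']
```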

The **Stemmer** tries to recognise the root of a word, in order to identify slightly different forms. This is already quite a complicated task, and it is clearly language-specific.
The *stem* module in the NLTK package provides algorithms for many languages
and integrates nicely with the tagger::

    import nltk

    # an English stemmer using Lancaster's algorithm
    mystemmer = Stemmer(nltk.stem.LancasterStemmer)

    # an Italian stemmer
    class MyItalianStemmer(Stemmer):
        def __init__(self):
            Stemmer.__init__(self, nltk.stem.ItalianStemmer)

        def preprocess(self, string):
            # do something with the string before passing it to nltk's stemmer
            return string

The **Rater** takes the list of words contained in the document, together with any additional information gathered at the previous stages, and returns a list of tags (i.e. words or small units of text) ordered by some idea of "relevance".

It turns out that working only on the information contained in the document itself is not enough, because it tells us nothing about the frequency of a term in the language. For this reason, an early "off-line" phase of the algorithm consists in analysing a *corpus* (i.e. a sample of documents written in the same language) to build a dictionary of known words. This is taken care of by the **build_dict()** function.
It is advisable to build your own dictionaries, and the **build_dict_from_nltk()** function in the *extras* module enables you to use the corpora included in NLTK::

    build_dict_from_nltk(output_file, nltk.corpus.brown,
                         nltk.corpus.stopwords.words('english'), measure='ICF')

So far, we may define the relevance of a word as the product of two distinct functions: one that depends on the document itself, and one that depends on the corpus.
A standard measure in information retrieval is TF-IDF (*term frequency-inverse document frequency*): the frequency of the word in the document multiplied by the logarithm of the inverse of its frequency in the corpus (i.e. the cardinality of the corpus divided by the number of documents where the word is found).
If we treat the whole corpus as a single document, and count the total occurrences of the term instead, we obtain ICF (*inverse collection frequency*).
Both of these are implemented in the *build_dict* module, and any other reasonable measure should be fine, provided that it is normalised to the interval [0,1]. The dictionary is passed to the **Rater** object as the *weights* argument of its constructor.
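
In miniature, the two measures could be computed like this (a toy sketch assuming the corpus is a list of tokenised documents; this is not the *build_dict* module's actual code):

```python
import math

def idf_weights(corpus):
    # IDF: log(corpus size / number of documents containing the word),
    # divided by log(corpus size) so that weights fall in [0, 1]
    n = len(corpus)
    vocab = set(word for doc in corpus for word in doc)
    return {w: math.log(n / sum(1 for doc in corpus if w in doc)) / math.log(n)
            for w in vocab}

def icf_weights(corpus):
    # ICF: treat the whole corpus as one document and count
    # total occurrences of each word instead
    words = [word for doc in corpus for word in doc]
    total = len(words)
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return {w: math.log(total / c) / math.log(total) for w, c in counts.items()}

corpus = [['the', 'cat'], ['the', 'dog']]
print(idf_weights(corpus))  # 'the' -> 0.0; 'cat' and 'dog' -> 1.0
```

A word appearing everywhere gets weight 0, a word appearing in a single document gets weight 1, so the weights are already normalised as required.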

We might also want to define the first term of the product in a different way, and this is done by overriding the **rate_tags()** method (which by default calculates TF for each word and multiplies it by its weight)::

    class MyRater(Rater):
        def rate_tags(self, tags):
            # set each tag's rating as you wish
            pass

If we were not too picky about the results, these few bits would already make an acceptable tagger.
However, tags formed by single words only are quite limited: while "obama" and "barack obama" are both reasonable tags (and it is quite easy to treat such cases so that they are regarded as equal), having "laden" and "bin" as two separate tags is definitely misleading and not acceptable.
Compare the results on the same document using the **NaiveRater** class (defined in the *extras* module) instead of the standard one.

The *multitag_size* parameter in the **Rater**'s constructor defines the maximum number of words that can constitute a tag. Multitags are generated in the **create_multitags()** method; if additional information about the position of a word in the phrase is available (i.e. the **terminal** member of the **Tag** class), this can be done in a more accurate way.
The rating of a **MultiTag** is computed from the ratings of its unit tags.
By default, the **combined_rating()** method uses the geometric mean, with special treatment of proper nouns if that information is available too (in the **proper** member).
This method can also be overridden, so there is room for experimentation.
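
Both steps can be sketched in a few lines (hypothetical stand-alone functions mirroring the roles of **create_multitags()** and **combined_rating()**, not the module's actual code):

```python
def create_multitags(words, multitag_size):
    # candidate multitags: contiguous groups of up to multitag_size words
    tags = []
    for size in range(1, multitag_size + 1):
        for i in range(len(words) - size + 1):
            tags.append(' '.join(words[i:i + size]))
    return tags

def combined_rating(ratings):
    # geometric mean of the unit tags' ratings
    product = 1.0
    for r in ratings:
        product *= r
    return product ** (1.0 / len(ratings))

print(create_multitags(['bin', 'laden', 'raid'], 2))
# ['bin', 'laden', 'raid', 'bin laden', 'laden raid']
print(combined_rating([0.8, 0.2]))  # geometric mean, approximately 0.4
```

The geometric mean penalises a multitag containing even one poorly rated unit, which is one reason it is a sensible default here.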

With a few "common sense" heuristics, the results are greatly improved.
The final stage of the default rating algorithm involves discarding redundant tags (i.e. tags that contain, or are contained in, other less relevant tags).
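
That final stage can be sketched as follows (assuming, purely for illustration, tags represented as strings sorted from most to least relevant; not the module's actual code):

```python
def discard_redundant(tags):
    # keep a tag only if it neither contains nor is contained in
    # a more relevant tag that has already been kept
    kept = []
    for tag in tags:
        if not any(tag in k or k in tag for k in kept):
            kept.append(tag)
    return kept

print(discard_redundant(['bin laden', 'pakistan', 'laden', 'bin']))
# ['bin laden', 'pakistan']
```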

It should be stressed that the default implementation makes no assumptions about the type of document being tagged (except that it is written in English) or about the kinds of tags that should be given priority (which can sometimes be a matter of taste, or depend on the particular task the tags are used for).
With some additional assumptions and careful treatment of corner cases, the tagger can be tailored to suit the user's needs.

This is proof-of-concept software and extensive experimentation is encouraged. The design of the base classes should allow for this, and the few examples in the *extras* module are a good starting point for customising the algorithm.