Natural Language Processing (NLP) in Python with NLTK: An Introduction

Learn the basics of NLP with Python and NLTK, from tokenization to vocabulary, and prepare text data for analysis or machine learning applications.
Author: John Henrich
Affiliation: Tensorscience.com
Published: February 9, 2023

Introduction

This tutorial is on natural language processing (NLP) in Python with the excellent NLTK package. Natural language processing (NLP) is the domain of artificial intelligence concerned with developing applications and services that have the ability to parse and understand natural (or human) languages. Some examples of this are speech recognition in devices like Alexa and Google Home, sentiment analysis of tweets on Twitter to gauge the mood of investors, etc.

The inherent difficulty in understanding natural languages is that the data is unstructured. It doesn’t come in a format that is easy to analyse and interpret. Natural language processing is the process of transforming a piece of text into a structured format that a computer can process and interpret. It is a very exciting field of machine learning.

Benefits of NLP

Every day there are millions of new words written on blogs, social websites and web pages. Companies use this data to understand their users and their interests, and to adjust their strategies in real time.

Say a person loves traveling and regularly posts on social media about his travel plans. These user interactions are used to provide him with relevant advertisements by online hotel and flight booking apps.

There are many more ways NLP can manifest itself. Online retailers like Amazon may analyse product reviews to assist their sorting algorithms; governments may create action plans on the basis of large-scale surveys; and journalists may want to gauge the sentiments of voters towards certain topics or candidates during voting season.

NLP Implementations

Some of the most prevalent implementations of NLP are:

  • Search engines like Google and Yahoo. Search engines learn from your history that you are interested in, say, technology, and tailor their search results accordingly.

  • Social feeds like Facebook and Twitter. The algorithm that generates your feed uses natural language processing to understand your interests and shows you relevant ads and posts.

  • Speech engines like Apple’s Siri and Amazon’s Alexa.

  • Spam filters in your email inbox. Rather than look for certain keywords, modern spam filters process the email’s content and then decide if it’s spam.

As you can see, there are many areas in which natural language processing is beneficial. In this tutorial we will go through some of the key elements involved.

Installing the NLTK package in Python

In this tutorial, we will use Python 3. First we check that Python is installed, and which version, by opening a terminal window and running the following command.

python --version

We set up a new directory for our project and navigate into it.

mkdir nlp-tutorial
cd nlp-tutorial

We can now install the Python library we will be using, called the Natural Language Toolkit (NLTK). The NLTK package is supported by an active open-source community and contains many language processing tools that help us format our data. We can use the pip command to install it.

pip install nltk

We now need to download the necessary data within NLTK. We do this by including the following code in a new Python file.

import nltk
nltk.download()

An NLTK downloader window should open. As it’s a large library, we can choose to download only the parts we need. Here we select the book collection, and click download.
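
If you prefer to skip the graphical downloader (for example on a remote server), you can instead fetch resources by name. Below is a minimal sketch that downloads only the corpora and models this tutorial relies on; the identifiers are NLTK download names, and the exact set may vary slightly between NLTK versions.

import nltk

# Download only the resources used in this tutorial, without the GUI.
for resource in ["movie_reviews",               # the movie review corpus
                 "punkt",                       # sentence and word tokenizer models
                 "stopwords",                   # stop word lists
                 "wordnet",                     # WordNet data for the lemmatizer
                 "averaged_perceptron_tagger",  # part-of-speech tagger
                 "tagsets"]:                    # documentation for the PoS tags
    nltk.download(resource)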

This completes our initial setup, and we can begin working with NLTK in Python.

Dataset

To begin our introduction to natural language processing, we first need some data to work with. In practice, we can create our own datasets by scraping the web or downloading existing files. But for now we will use one of the large datasets that come built into the NLTK library. These include classic novels, film scripts, encyclopedias, and even conversations overheard in London. We will use a set of movie reviews for our analysis.

We begin by importing the dataset and calling an associated ‘readme’ function to understand the underlying structure of the data.

from nltk.corpus import movie_reviews
movie_reviews.readme()
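
Before printing any text, it can also help to get a quick feel for how the corpus is organised. A brief, optional exploration (the counts in the comments reflect the standard NLTK movie_reviews corpus):

# Each review is stored in its own file, under one of two sentiment categories.
print(movie_reviews.categories())    # ['neg', 'pos']
print(len(movie_reviews.fileids()))  # 2000 reviews in total
print(movie_reviews.fileids()[0])    # e.g. 'neg/cv000_29416.txt'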

To begin with, we will try to understand the data by simply printing words and their frequencies. We use the raw function to get the entire collection of movie reviews as a single chunk of data:

raw = movie_reviews.raw()
print(raw)
plot : two teen couples go to a church party , drink and then drive . 
they get into an accident . 
one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . 
what's the deal ? 
watch the movie and " sorta " find out . . . 
critique : a mind-fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . 
which is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn't snag this one correctly . 
they seem to have taken this pretty neat concept , but executed it terribly . 

As you can see, this outputs the data with no formatting at all. We can also choose to print just the first element. The output is a single character.

print(raw[0])
p

Clearly, a single chunk isn’t the format we want our data in. We use the ‘words’ function to split our data into individual words, and store them all in one ‘dictionary’.

dictionary = movie_reviews.words()
print(dictionary)
> ['plot', ':', 'two', 'teen', 'couples', 'go', 'to', ...]

Now we can start to analyse statistics such as individual words and their frequencies within the dictionary. We calculate the frequency distribution (counting the number of times each word is used in the dictionary), and then print or plot the top 50 words. This shows us which words occur most frequently.

from nltk import FreqDist
freq_dist = FreqDist(dictionary)
print(freq_dist)
FreqDist with 39768 samples and 1583820 outcomes
print(freq_dist.most_common(50))

>[(',', 77717), ('the', 76529), ('.', 65876), ('a', 38106), ('and', 35576), ('of', 34123), ('to', 31937), ("'", 30585), ('is', 25195), ('in', 21822), ('s', 18513), ('"', 17612), ('it', 16107), ('that', 15924), ('-', 15595), (')', 11781), ('(', 11664), ('as', 11378), ('with', 10792), ('for', 9961), ('his', 9587), ('this', 9578), ('film', 9517), ('i', 8889), ('he', 8864), ('but', 8634), ('on', 7385), ('are', 6949), ('t', 6410), ('by', 6261), ('be', 6174), ('one', 5852), ('movie', 5771), ('an', 5744), ('who', 5692), ('not', 5577), ('you', 5316), ('from', 4999), ('at', 4986), ('was', 4940), ('have', 4901), ('they', 4825), ('has', 4719), ('her', 4522), ('all', 4373), ('?', 3771), ('there', 3770), ('like', 3690), ('so', 3683), ('out', 3637)]
freq_dist.plot(50)

This tells us how many total words are in our dictionary (1,583,820 outcomes), and how many unique words it contains (39,768 samples). We also get a look at the 50 most frequently appearing words, and their frequency plot.

However, we begin to realize that simple analysis of the text in its natural format does not yield many useful results – the frequent use of punctuation and common words like ‘the’ and ‘of’ do not help us understand the context or meaning of the text. In the next few sections, we will explain some common and useful steps involved in NLP, which allow us to transform a text document into a format that can be analysed more effectively.

Tokenization

Earlier, when we split the raw text into individual words, we used the concept of tokenization. The idea of tokenization is to take a text document and break it down into individual ‘tokens’ (often words or groups of words), and store these separately. Before storing these tokens, we could also apply some minor formatting, for example removing punctuation.

Let’s test this on an individual review. First we split our raw text by review, instead of by word as before.

reviews = []
for fileid in movie_reviews.fileids():
    reviews.append(movie_reviews.raw(fileid))

In our raw document, each movie review is assigned a unique file ID. So this ‘for’ loop runs through each of the file IDs, and appends the raw text associated with that ID to a ‘reviews’ list. We can now access individual reviews, for example, to look at the first review by itself:

print(reviews[0])
plot : two teen couples go to a church party , drink and then drive . 
they get into an accident . 
one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . 
what's the deal ? 
watch the movie and " sorta " find out . . . 
critique : a mind-fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . 
which is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn't snag this one correctly . 
they seem to have taken this pretty neat concept , but executed it terribly .

Now that we have the individual reviews in their natural form, we can break them down even further using tokenizer methods in the NLTK library. The ‘sent_tokenize’ function splits a review into sentence tokens, and ‘word_tokenize’ splits it into word tokens.

from nltk.tokenize import word_tokenize, sent_tokenize
sentences = sent_tokenize(reviews[0])
words = word_tokenize(reviews[0])
print(sentences[0])
plot : two teen couples go to a church party , drink and then drive .

There are lots of options for tokenizing in NLTK, which you can read about in the NLTK API documentation. For our purposes, we will remove punctuation with a regular expression tokenizer and use Python’s built-in lower() method to transform strings to lowercase, to build our final tokenizer.

from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(reviews[0].lower())
print(tokens)

['plot', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party', 'drink', 'and', 'then', 'drive', 'they', 'get', 'into', 'an', 'accident', 'one', 'of', 'the', 'guys', 'dies', 'but', 'his', 'girlfriend', 'continues', 'to', 'see', 'him', 'in', 'her', 'life', 'and', 'has', 'nightmares', 'what', 's', 'the', 'deal', 'watch', 'the', 'movie', 'and', 'sorta', 'find', 'out', 'critique', 'a', 'mind', 'fuck', 'movie', 'for', 'the', 'teen', 'generation', 'that', 'touches', 'on', 'a', 'very', 'cool', 'idea', 'but', 'presents', 'it', 'in', 'a', 'very', 'bad', 'package', 'which', 'is', 'what', 'makes', 'this', 'review', ...]

Stop Words

We’re now getting a more concise representation of the reviews, one which only holds the useful parts of the raw text. We have successfully broken down our reviews into tokens: words and sentences. A further key step in NLP is to remove so-called ‘stop words’, for example ‘the’, ‘and’, ‘to’. These words appear so frequently in all kinds of text that they add little value in terms of context or meaning. To implement this, we compare our text document against a predefined list of stop words and remove the ones that match.

from nltk.corpus import stopwords
tokens = [token for token in tokens if token not in stopwords.words('english')]
print(tokens)

['plot', 'two', 'teen', 'couples', 'go', 'church', 'party', 'drink', 'drive', 'get', 'accident', 'one', 'guys', 'dies', 'girlfriend', 'continues', 'see', 'life', 'nightmares', 'deal', 'watch', 'movie', 'sorta', 'find', 'critique', 'mind', 'fuck', 'movie', 'teen', 'generation', 'touches', 'cool', 'idea', 'presents', 'bad', 'package', 'makes', 'review', 'even', 'harder', 'one', 'write', 'since', 'generally', 'applaud', 'films', 'attempt' ...]

Focusing on just the first review, we find that this reduces the number of tokens from 726 to 343 – so over half the words in this review were essentially redundant.
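
As a quick sanity check, these counts can be reproduced directly. The exact numbers may vary slightly between NLTK versions, since the built-in stop word list changes over time.

# Token counts for the first review, before and after stop word removal.
all_tokens = tokenizer.tokenize(reviews[0].lower())
filtered_tokens = [t for t in all_tokens if t not in stopwords.words('english')]
print(len(all_tokens), len(filtered_tokens))  # 726 and 343 in our run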

Lemmatization and Stemming

English is a funny language. Often different inflections of a word have the same general meaning (at least for the purposes of our analysis), and so it may be wise to group them together as one. For example, instead of handling the words ‘talk’, ‘talks’, ‘talked’, ‘talking’ as separate instances, we could choose to treat them all as the same word. There are two common approaches to implementing this concept – lemmatization and stemming.

Lemmatization mirrors what a human would do. It takes an inflected form of a word and returns its base form – the lemma. In order to achieve this, we need some context for the word’s usage, such as whether it is a noun or adjective, verb or adverb. Stemming is a much cruder attempt at generating a base form, often returning a word which is simply the first few characters that are shared by any form of the word (but not always a real word itself). Let’s see an example.

First we import and define a stemmer and lemmatizer from the NLTK library. Next we print the base word generated by each approach.

from nltk.stem import PorterStemmer, WordNetLemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
test_word = "worrying"
word_stem = stemmer.stem(test_word)
word_lemmatise = lemmatizer.lemmatize(test_word)
word_lemmatise_verb = lemmatizer.lemmatize(test_word, pos="v")
word_lemmatise_adj = lemmatizer.lemmatize(test_word, pos="a")
print("Stem:",word_stem,"\nLemma:", word_lemmatise, word_lemmatise_verb, word_lemmatise_adj)
Stem: worri   
Lemma: worrying worry worrying

As we suspected, the stemmer has settled on ‘worri’, which is not a word recognised by the Oxford Dictionary but rather a stemmed version of the various inflections of the word ‘worry’. Our lemmatizer, on the other hand, needs to know whether the word is used as an adjective or a verb in order to lemmatize it correctly; once it gets this information, it works as expected. The process of attaching this extra information in the form of contextual tags is called part-of-speech tagging and is outlined in the next section. Though this little exercise might make stemming look like a less than ideal approach, in practice it often performs comparably, or only negligibly worse. You can see this by testing other words, such as ‘walking’, for which the stemmer and lemmatizer give the same result.
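
To see this for yourself, here is a small side-by-side comparison using the stemmer and lemmatizer defined above; for ‘walking’ both produce ‘walk’, while for ‘worrying’ only the lemmatizer returns a dictionary word.

# Compare the stemmer and the verb-aware lemmatizer on a few inflected words.
for word in ["walking", "talked", "worrying"]:
    print(word, "->", stemmer.stem(word), "/", lemmatizer.lemmatize(word, pos="v"))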

Part-of-Speech Tagging

As mentioned above, part-of-speech tagging is the act of tagging a word with a grammatical category. In order to do so, we need context from the sentence in which the word appears – for example, which words are adjacent to it and how its meaning depends on this proximity. Fortunately, the NLTK library has a function that can be called to view all possible part-of-speech tags.

nltk.help.upenn_tagset()

The complete list includes PoS tags such as VB (verb in base form), VBG (verb as present participle), etc.
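
If you only need the definition of a particular tag rather than the whole list, the same helper accepts a regular-expression pattern, for example:

# Show only the verb-related tags (VB, VBD, VBG, VBN, VBP, VBZ) and their descriptions.
nltk.help.upenn_tagset('VB.*')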

We simply call the ‘pos_tag’ function to generate these tags for our tokens:

from nltk import pos_tag
pos_tokens = pos_tag(tokens)
print(pos_tokens)
[('plot', 'NN'), ('two', 'CD'), ('teen', 'NN'), ('couples', 'NNS'), ('go', 'VBP'), ('church', 'NN'), ('party', 'NN'), ('drink', 'VBP'), ('drive', 'JJ'), ('get', 'NN'), ('accident', 'JJ'), ('one', 'CD'), ('guys', 'NN'), ('dies', 'VBZ'), ('girlfriend', 'VBP'), ('continues', 'VBZ'), ('see', 'VBP'), ('life', 'NN'), ('nightmares', 'NNS'), ('deal', 'VBP'), ('watch', 'JJ'), ('movie', 'NN'), ('sorta', 'NN'), ('find', 'VBP'), ('critique', 'JJ'), ('mind', 'NN'), ('fuck', 'JJ'), ('movie', 'NN'), ('teen', 'JJ'), ('generation', 'NN'), ('touches', 'NNS'), ('cool', 'VBP'), ('idea', 'NN'), ('presents', 'NNS'), ('bad', 'JJ'), ('package', 'NN'), ('makes', 'VBZ'), ('review', 'VB'), ('even', 'RB'), ('harder', 'RBR'), ('one', 'CD'), ('write', 'NN'), ('since', 'IN'), ('generally', 'RB'), ('applaud', 'VBN'), ('films', 'NNS'), ('attempt', 'VB'), ('break', 'JJ'), ('mold', 'NN'), ('mess', 'NN'), ('head', 'NN'), ('lost', 'VBD'), ('highway', 'RB'), ('memento', 'JJ'), ('good', 'JJ'), ('bad', 'JJ'), ('ways', 'NNS'), ('making', 'VBG'), ('types', 'NNS'), ('films', 'NNS'), ('folks', 'NNS'), ('snag', 'VBP'), ('one', 'CD'), ('correctly', 'RB'), ('seem', 'VBP'), ('taken', 'VBN'), ('pretty', 'RB'), ('neat', 'JJ'), ('concept', 'NN'), ('executed', 'VBD'), ('terribly', 'RB'), ('problems', 'NNS'), ('movie', 'NN'), ('well', 'RB'), ('main', 'JJ'), ('problem', 'NN'), ('simply', 'RB'), ('jumbled', 'VBD'), ('starts', 'NNS'), ('normal', 'JJ'), ('downshifts', 'NNS'), ('fantasy', 'JJ'), ('world', 'NN'), ('audience', 'NN'), ('member', 'NN'), ('idea', 'NN'), ('going', 'VBG'), ('dreams', 'JJ'), ('characters', 'NNS'), ('coming', 'VBG'), ('back', 'RB'), ('dead', 'JJ'), ('others', 'NNS'), ('look', 'VBP'), ('like', 'IN'), ('dead', 'JJ'), ('strange', 'JJ') ...]
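
These tags are also the missing piece for the lemmatizer from the previous section. One way to combine the two, sketched below, is to map each Penn Treebank tag to the corresponding WordNet part of speech and pass it to the lemmatizer. Note that the penn_to_wordnet helper is our own illustration, not part of NLTK.

from nltk.corpus import wordnet

def penn_to_wordnet(tag):
    # Map a Penn Treebank tag to the WordNet POS expected by the lemmatizer,
    # defaulting to noun for anything else.
    if tag.startswith('J'):
        return wordnet.ADJ
    if tag.startswith('V'):
        return wordnet.VERB
    if tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

lemmas = [lemmatizer.lemmatize(word, penn_to_wordnet(tag)) for word, tag in pos_tokens]
print(lemmas[:20])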

Vocabulary

Up until now, we have built and tested our NLP tools on individual reviews, but the real test is to use all the reviews together. If we did this, our dictionary would contain a grand total of 1,336,782 word tokens. But this resulting list may contain multiple instances of the same word. For instance, the sentence “They talked and talked” generates the tokens [‘They’, ‘talked’, ‘and’, ‘talked’]. This was very useful when counting frequencies, but it may be more useful to have a collection of all the unique words used in the dictionary, for example [‘They’, ‘talked’, ‘and’]. To this end, we can use a built-in Python function called ‘set’.

dictionary_tokens = tokenizer.tokenize(raw.lower())
print(dictionary_tokens)
['plot', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party', 'drink', 'and', 'then', 'drive', 'they', 'get', 'into', 'an', 'accident', 'one', 'of', 'the', 'guys', 'dies', 'but', 'his', 'girlfriend', 'continues', 'to', 'see', 'him', 'in', 'her', 'life', 'and', 'has', 'nightmares', 'what', 's', 'the', 'deal', 'watch', 'the', 'movie', 'and', 'sorta', 'find', 'out', 'critique', 'a', 'mind', 'fuck', 'movie', 'for', 'the', 'teen', 'generation', 'that', 'touches', 'on', 'a', 'very', 'cool', 'idea', 'but', 'presents', 'it', 'in', 'a', 'very', 'bad', 'package', 'which', 'is', 'what', 'makes', 'this', 'review', …]
vocab = sorted(set(dictionary_tokens))

This introduces us to the idea of the vocabulary of a dictionary, and we can use it to compare the vocabulary sizes of different texts, or the percentage of the vocabulary that refers to a certain topic. For example, whilst the full dictionary of reviews contains 1,336,782 word tokens (after tokenization), the vocabulary contains only 39,696 unique words. This smaller collection of words may be a better representation of the raw text in certain situations.

print("Tokens:", len(dictionary_tokens))
Tokens: 1336782
print("Vocabulary:", len(vocab))
Vocabulary: 39696
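
One simple statistic that follows from these two numbers is the lexical diversity of the corpus, i.e. the share of unique words among all tokens. Here it is roughly 0.03, meaning each word appears about 34 times on average.

# Lexical diversity: unique words as a fraction of all tokens.
print("Lexical diversity:", len(vocab) / len(dictionary_tokens))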

Conclusion

In this tutorial, we have transformed a dataset of movie reviews from a raw text document into a structured format that a computer can understand and analyze. This structured form lends itself readily to data analysis, or as further input into machine learning algorithms that home in on the topics discussed, analyse the mood of the review writer, or infer hidden meanings. We go into this in the next natural language processing tutorial, on sentiment analysis.

If you are not continuing with that tutorial, feel free to soldier ahead and use the NLP concepts we discussed to start working with text-based data of your own! The movie reviews dataset we used is already labelled with positive and negative reviews, so a worthwhile project would be to use the tools from this tutorial to build a machine learning classifier that predicts the label of a new movie review from, say, IMDb.