A basic guide to using NLP for corpus analysis with R (Part 1): Installing Python, spaCy, and cleanNLP

Jan. 9, 2018

This is Part 1 of a basic guide for setting up and using a natural language processing (NLP) tool with R. I specifically utilize the spaCy “industrial strength natural language processing” Python library, and an R wrapper called cleanNLP that provides tools for annotating texts and obtaining data tables. In this post, I will explain step by step how to get ready to work with text data using these two packages, and show a snippet of what they can do.

I decided to write this post because, for someone like me who has no background in computer science but wants to get their feet wet with NLP, it can be difficult to get started with these tools: you don’t know where to start or exactly what to look for. I have not yet seen any webpage that describes what I’m trying to do all in one place. It took a while for me to hunt down the information I needed to carry out the kinds of analyses I wanted, and it took many failed attempts to finally get the programs working. In my experience as a beginner, spaCy and cleanNLP used within R have so far been the easiest to learn and work with.

In this post, I assume you have some experience with R, but you should be able to follow along even if you haven’t used it before (though you might not understand exactly what’s going on).

In case you need to install R and RStudio:
Download R for Windows
Download RStudio

Before you begin, check your computer system environment and compare it with mine. My guidelines might not work in your setting and you might experience issues that I don’t mention here (use Google search!). My system environment is: Windows 10 64-bit operating system.
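If you already have R and RStudio installed, you can check your own setup right from the R console:

#show your R version and operating system
sessionInfo()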

1 . Installing Python

If you have never installed Python on your computer, great. You start with a clean slate, and it might work best. If you already have Python installed (it must be Python 2.7 or later) but are not sure whether it works, skip to the end of this section (1.3) to test it.

Here, I’m installing the latest version of Python, 3.6.4, which works with my OS. If you don’t use Python for anything else, just following these instructions will probably be your best bet. There are other ways to install Python, for example, using Anaconda, which provides a GUI and may be better in the long run, but I won’t use it here.

1.1 . Download Python

First step, download Python for Windows from here. Make sure to get an installer that matches your operating system and your R installation. In other words, if you have 32-bit R installed, download the 32-bit version (labelled “Windows x86”). With 64-bit R, download 64-bit Python (labelled “Windows x86-64”).
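If you’re not sure whether your R is 32-bit or 64-bit, you can check from the R console:

#check whether R is 32- or 64-bit
R.version$arch           # "x86_64" means 64-bit
.Machine$sizeof.pointer  # 8 means 64-bit, 4 means 32-bit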

1.2 . Install Python

After the download is complete, run the installer. Two important things. First, check the box that says “add to PATH” when you see it. (For Python 2.7.x, this option has a different appearance: not a checkbox but an icon to click somewhere during the setup.) If you accidentally neglect to check the “add to PATH” box, you will need to manually add the Python path to the system environment variables; since it can be difficult to find what works for your system, you might as well just uninstall, reboot, and reinstall with the above-mentioned option enabled. Second, I recommend just using the default directory provided as the install location. In my case, Python did not work properly when I installed it in a different location.

1.3 . Test if Python works

Open up a Command Prompt by searching for “command …” or “cmd” in the Windows Start menu. Preferably, open the program by right-clicking it and selecting “Run as administrator”. If you don’t run it as admin, it’s fine for now, but you will have to run it as admin when you get to Section 2.2, Download language models.

Type python and hit enter.

Your installation was successful if you see something like this:
> Python 3.6.4 (v3.6.4:d48eceb, Dec 19 2017, 06:54:40) [MSC v.1900 64 bit (AMD64)] on win32
> Type “help”, “copyright”, “credits” or “license” for more information.

If you get a message that says “‘python’ is not recognized as an internal or external command” (which I don’t think you will, as long as you have followed the instructions exactly so far), it means trouble. If you don’t know much about these issues, the simplest solution is to uninstall, reboot, and reinstall. Hopefully you have not encountered any issues and have Python up and running now.
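By the way, you can also check from within R whether Python is visible on your PATH:

#returns the path to python.exe if Python is on your PATH, "" otherwise
Sys.which("python")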

Now type in exit() to move on to installing spaCy.

2 . Installing NLP backend: spaCy

To put it very simply, spaCy is a library that processes and analyzes language. cleanNLP can use this specific Python library as its backend. Another option with cleanNLP is coreNLP (Java). I find spaCy lighter and easier to set up than coreNLP (a point also noted by the creators). Here I only demonstrate installing spaCy. See the “Backends” section from here for more information.

2.1 . Install spaCy

In your Command Prompt, enter the following code:
pip install -U spacy

It can take several minutes to finish installing. If it fails to install with the error message that says Microsoft Visual C++ 14.0 is required, go download Visual C++ Build Tools and retry.

2.2 . Download language models

The next step is to install the language models (distributed as Python packages) that you want. SpaCy provides models for a number of languages: see the models for languages other than English. In this example, I will install English models that I will use to process texts in English. One thing to note is that several different English models are offered. Go to this page and scroll down to read about them. The larger the model, the more accurate it is, though the differences don’t seem to be that big.

Enter the following at the command line:
python -m spacy download en_core_web_sm

The default model is en_core_web_sm. You can install multiple models if you wish. I installed the largest model:
python -m spacy download en_core_web_lg

If you see an error message saying that linking was unsuccessful, it could be because you didn’t run the Command Prompt as administrator.

After downloading and linking are complete, test whether the installed model works by trying the following, one line at a time (the first line launches Python from the Command Prompt; the rest are entered at the Python prompt):
python
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'This is a sentence.')
print(doc)

Did the last line print “This is a sentence.”? If yes, yay! It’s probably safe to assume everything is running smoothly. Quit the Command Prompt and start RStudio. (Restart RStudio if you have had it open while installing spaCy.)

3 . Getting ready with RStudio

3.1 . Install all requirements

This portion has been updated on December 21, 2018 to accommodate changes made to the cleanNLP package.

Create a new R script. The required packages for cleanNLP are:

#install packages: cleanNLP, reticulate
install.packages("cleanNLP"); install.packages("reticulate")
#load the packages
library(cleanNLP)
library(reticulate)

To get started with processing texts, you need to initialize spaCy:

#initialize the NLP backend
cnlp_init_spacy(model_name = "en_core_web_sm")

Now, if this doesn’t work, try the following code to see whether there are any problems with the Python configuration on your computer.

#check your Python configuration
py_config()
## python:         C:\Users\susie.DESKTOP-TF1TRHP\AppData\Local\Programs\Python\Python37\python.exe
## libpython:      C:/Users/susie.DESKTOP-TF1TRHP/AppData/Local/Programs/Python/Python37/python37.dll
## pythonhome:     C:\Users\susie.DESKTOP-TF1TRHP\AppData\Local\Programs\Python\Python37
## version:        3.7.4 (tags/v3.7.4:e09359112e, Jul  8 2019, 20:34:20) [MSC v.1916 64 bit (AMD64)]
## Architecture:   64bit
## numpy:          C:\Users\susie.DESKTOP-TF1TRHP\AppData\Local\Programs\Python\Python37\lib\site-packages\numpy
## numpy_version:  1.17.2
## 
## python versions found: 
##  C:\Users\susie.DESKTOP-TF1TRHP\AppData\Local\Programs\Python\Python37\python.exe
##  C:\Users\susie.DESKTOP-TF1TRHP\AppData\Local\Programs\Python\Python37\\python.exe
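If py_config() shows that reticulate has picked up the wrong Python installation, you can point it at the correct one before initializing the backend. The path below is only an example; substitute the one from your own py_config() output, and run this in a fresh R session before anything has touched Python:

#tell reticulate which Python to use (example path - replace with your own)
library(reticulate)
use_python("C:/Users/yourname/AppData/Local/Programs/Python/Python37/python.exe", required = TRUE)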

A little anecdote: when I first tested the automatic annotation, I used a line from “Stairway to Heaven”: “There’s a lady who’s sure all that glitters is gold, and she’s buying a stairway to heaven.” The small and medium-sized models marked “glitters” as a noun; the large model correctly marked it as a verb. After seeing this, I’ve been using the large model.
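If you downloaded the large model in Section 2.2 and want to use it instead, initialize the backend with its name:

#initialize with the large model
cnlp_init_spacy(model_name = "en_core_web_lg")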

3.2 . Processing a text string

This section will serve as a simple test and demonstration of the basic use of the cleanNLP package. It is based on the procedure described on the cleanNLP webpage.

#make up a text
text <- "Let me be the one you call. If you jump, I'll break your fall."

Now we have a text to analyze. Give R a command to automatically annotate the text.

#run the annotator
obj <- cnlp_annotate(text)

Now you should have a new object named obj, populated with lists of various data.
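If you’re curious what the object holds, you can peek at its components (exactly what you see depends on your version of cleanNLP, so treat this as a quick sanity check rather than a fixed structure):

#list the components of the annotation object
names(obj)

Now let’s view the data table of annotated tokens.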

#view the annotated data
cnlp_get_token(obj)
##      id sid tid  word  lemma  upos  pos cid
## 2  doc1   1   1   Let    let  VERB   VB   0
## 3  doc1   1   2    me -PRON-  PRON  PRP   4
## 4  doc1   1   3    be     be  VERB   VB   7
## 5  doc1   1   4   the    the   DET   DT  10
## 6  doc1   1   5   one    one  NOUN   NN  14
## 7  doc1   1   6   you -PRON-  PRON  PRP  18
## 8  doc1   1   7  call   call  VERB  VBP  22
## 9  doc1   1   8     .      . PUNCT    .  26
## 11 doc1   2   1    If     if   ADP   IN  28
## 12 doc1   2   2   you -PRON-  PRON  PRP  31
## 13 doc1   2   3  jump   jump  VERB  VBP  35
## 14 doc1   2   4     ,      , PUNCT    ,  39
## 15 doc1   2   5     I -PRON-  PRON  PRP  41
## 16 doc1   2   6   'll   will   AUX   MD  42
## 17 doc1   2   7 break  break  VERB   VB  46
## 18 doc1   2   8  your -PRON-   DET PRP$  52
## 19 doc1   2   9  fall   fall  NOUN   NN  57
## 20 doc1   2  10     .      . PUNCT    .  61

Each row of the table contains information about one token. Here are some detailed descriptions of what each column means:

  • id: Text id. Here the ids are all ‘doc1’ because there is only one source text.
  • sid: Sentence id (or sentence number) nested under ‘id’. We can see that we have 2 sentences in this text.
  • tid: Token id nested under ‘sid’. Note that for each sentence, ‘tid’ starts from 0 to mark ‘ROOT’, which indicates the start of a sentence. The default code hides the 0s, but they are there, and will be useful for corpus analysis.
  • word: The raw word as is in the source text.
  • lemma: The basic form of the token with no morphological changes.
  • upos: Universal part-of-speech, more general than ‘pos’.
  • pos: Language-specific part-of-speech. For example, you can see tokens labelled ‘VERB’ under the ‘upos’ column are further specified as base-form verbs (VB), present-tense verbs (VBP), and modals (MD).

See ?cnlp_get_token for more information.
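As a quick example of putting the token table to use, here is a minimal sketch that counts tokens by universal part of speech (it assumes you have the dplyr package installed):

#count tokens by universal part of speech
library(dplyr)
cnlp_get_token(obj) %>% count(upos, sort = TRUE)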

Here is another data table you might be interested in:

cnlp_get_dependency(obj)
##      id sid tid tid_target relation relation_full
## 1  doc1   1   0          1     ROOT            NA
## 2  doc1   1   3          2    nsubj            NA
## 3  doc1   1   1          3    ccomp            NA
## 4  doc1   1   5          4      det            NA
## 5  doc1   1   3          5     attr            NA
## 6  doc1   1   7          6    nsubj            NA
## 7  doc1   1   5          7    relcl            NA
## 8  doc1   1   1          8    punct            NA
## 9  doc1   2   3          1     mark            NA
## 10 doc1   2   3          2    nsubj            NA
## 11 doc1   2   7          3    advcl            NA
## 12 doc1   2   7          4    punct            NA
## 13 doc1   2   7          5    nsubj            NA
## 14 doc1   2   7          6      aux            NA
## 15 doc1   2   0          7     ROOT            NA
## 16 doc1   2   9          8     poss            NA
## 17 doc1   2   7          9     dobj            NA
## 18 doc1   2   7         10    punct            NA

This data frame contains annotations produced by grammatically parsing the text. In other words, they describe the role each element serves in its sentence. Though this is valuable information, I’m skeptical as to how accurate the parser is with learner language, so I won’t be dealing with this feature as much.
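If you do want to explore the dependencies, it helps to attach the actual words to the relations. Here is a minimal sketch, again assuming you have dplyr installed, that joins each relation to its target token:

#attach the target words to the dependency relations
library(dplyr)
cnlp_get_dependency(obj) %>%
  left_join(cnlp_get_token(obj), by = c("id", "sid", "tid_target" = "tid")) %>%
  select(id, sid, word, relation)

The ‘word’ column now shows the token each relation points to (for example, the ROOT of sentence 1 is “Let”).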

This concludes Part 1 of the series. In Part 2, I will go through how to process text files - single file, multiple files, and all files under a specific directory. Please contact me if you have any questions, corrections, or suggestions!