This is Part 1 of a basic guide for setting up and using a natural language processing (NLP) tool with R. I specifically utilze the spaCy “industrial strength natural language processing” Python library, and an R wrapper called cleanNLP that provides tools for annotating texts and obtaining data tables. In this post, I will explain step-by-step how to get ready to work with text data using these two packages, and a snippet of what they can do.
I decided to write this post because for someone like me who has no background in computer science but wants to get their feet wet with NLP. It could be difficult to get started with these tools because you don’t know where to start or what to exactly look for. I have not yet seen any webpages that describe what I’m trying to do all in one place. It took a while for me to hunt down the information that I needed to carry out the kind of analyses that I wanted to do, and took many failed attempts to finally get the computer programs working. SpaCy and cleanNLP utilized in R have so far been the easiest to learn and work with in my experience as a beginner.
In this post, I assume you have some experience with R, but you should be able to follow along (though might not be able to understand what’s going on exactly) even if you haven’t used it before.
In case you need to install R and RStudio:
Download R for Windows
Download RStudio
Before you begin, check your computer system environment and compare it with mine. My guidelines might not work in your setting and you might experience issues that I don’t mention here (use Google search!). My system environment is: Windows 10 64-bit operating system.
If you have never installed Python on your computer, great. You start with a clean slate and it might work the best. If you already have Python installed (must be Python 2.7 and up) but not sure if it works, skip to the end of this section (1.3) to test it.
Here, I’m installing the latest version of Python, 3.6.4, which works with my OS. If you don’t use Python for anything else, just following these instructions will probably be your best bet. There are different ways to install Python, for example, using Anaconda which provides GUI and may be better in the long run, but I won’t use it here.
First step, download Python from here link for Windows. Make sure to get an installer that matches your operating system and R. In other words, if you have 32-bit R installed, download the 32-bit version (labelled “Windows x86”). With 64-bit R, download 64-bit Python (labelled “Windows x86-64”).
After the download is complete, execute the installer. Two important things. First, check the box that says “add to PATH” when you see it. For Python 2.7.x, this option has a different appearance - not as a checkbox but an icon to click somewhere during the setup. If you have accidentally neglected to check the “add to PATH” box, you will need to manually add the path for Python to the system environment variables (but it might be difficult to find what works for your system so you might as well just uninstall, reboot, and reinstall, enabling the above-mentioned option). Second, I recommend just use the default directory provided as the location. In my case, Python did not work properly when I installed it under a different location.
Open up a Command Prompt by searching “command …” or “cmd” from the windows start menu. Preferably, open the program by right-clicking the mouse and selecting “run as administrator”. If you don’t run it as admin, it’s fine for now but you have to run it as admin when you execute Section 2.2 Download language models.
Type python
and hit enter.
Your installation was successful if you see something like this:
> Python 3.6:d48eceb, Dec 19 2017, 06:54:40 64 bit (AMD64)] on win32
> Type “help”, “copyright”, “credits” or “license” for more information.
If you get a message that says “python is not recognized as internal or external command”, which I don’t think you will, as long as you have followed the instructions exactly so far, it means trouble. If you don’t know well about these issues, the simplest solution is to uninstall, reboot, and reinstall. Hopefully you have not encountered any issues and have Python up and running now.
Now type in exit()
to move on to installing spaCy.
To put it very simply, spaCy is a library that processes and analyzes languages. CleanNLP utilizes this specific Python library. Another option with cleanNLP is coreNLP (Java). I find spaCy lighter and easier to set up than coreNLP (a point also noted by the creators). Here I only demonstrate installing spaCy. See the “Backends” section from here for more information related to this.
In your Command Prompt, enter the following code:
pip install -U spacy
It can take several minutes to finish installing. If it fails to install with the error message that says Microsoft Visual C++ 14.0 is required, go download Visual C++ Build Tools and retry.
Next step is to install the language models (as in Python packages) that you want. SpaCy provides packages for a number of languages: see the models for languages other than English. In this example, I will install English models that I will use to process texts in English. One thing to note is that within English, different models are offered. Go to this page and scroll down to read about the models. The larger the model is, the more accurate it is, though the differences don’t seem to be that big.
Enter the following code to the command line:
python -m spacy download en_core_web_sm
The default model is en_core_web_sm. You can install multiple models if you wish. I installed the largest model:
python -m spacy download en_core_web_lg
If you see an error message saying that linking was unsuccessful, it could be because you didn’t run the Command as administrator.
After downloading and linking is complete, test if the installed modules work by trying the following codes, one line at a time:
python
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'This is a sentence.')
print(doc)
Did the last code return “This is a sentence.”? If yes, yay! - it’s probably safe to assume everything is running smoothly. Quit the Command Prompt and start RStudio. (Restart RStudio if you have had it open while installing spacy.)
This portion has been updated on December 21, 2018 to accommodate changes made to the cleanNLP package.
Create a new R script. The required packages for cleanNLP are:
#install packages: cleanNLP, reticulate
install.packages("cleanNLP"); install.packages("reticulate")
#load the packages
library(cleanNLP)
library(reticulate)
To get started with processing texts, you need to initialize spacy:
#initalized the NLP backend
cnlp_init_spacy(model_name = "en_core_web_sm")
Now, if this doesn’t work, try the following code to see whether there’s any problems with Python configuration on your computer.
#check your Python configuration
py_config()
## python: C:\Users\susie.DESKTOP-TF1TRHP\AppData\Local\Programs\Python\Python37\python.exe
## libpython: C:/Users/susie.DESKTOP-TF1TRHP/AppData/Local/Programs/Python/Python37/python37.dll
## pythonhome: C:\Users\susie.DESKTOP-TF1TRHP\AppData\Local\Programs\Python\Python37
## version: 3.7.4 (tags/v3.7.4:e09359112e, Jul 8 2019, 20:34:20) [MSC v.1916 64 bit (AMD64)]
## Architecture: 64bit
## numpy: C:\Users\susie.DESKTOP-TF1TRHP\AppData\Local\Programs\Python\Python37\lib\site-packages\numpy
## numpy_version: 1.17.2
##
## python versions found:
## C:\Users\susie.DESKTOP-TF1TRHP\AppData\Local\Programs\Python\Python37\python.exe
## C:\Users\susie.DESKTOP-TF1TRHP\AppData\Local\Programs\Python\Python37\\python.exe
A little anecdote is that when I first tested the automatic annotation, I used a line from “The Stairway to Heaven”: “There’s a lady who’s sure all that glitters is gold, and she’s buying a stairway to heaven.” The default model, the small and medium sized models marked “glitters” as a noun. The large model marked the word correctly as a verb. After seeing this, I’ve been using the large model.
This section will serve as a simple test and demonstration of the basic use of the cleanNLP package. It is based on the procedure described on the cleanNLP webpage
#make up a text
text <- "Let me be the one you call. If you jump, I'll break your fall."
Now we have a text to analyze. Give R a command to automatically annotate the text.
#run the annotator
obj <- cnlp_annotate(text)
Now you should have new data named obj
. It’s populated with lists of various data. Let’s view this data table with annotated objects.
#view the annotated data
cnlp_get_token(obj)
## id sid tid word lemma upos pos cid
## 2 doc1 1 1 Let let VERB VB 0
## 3 doc1 1 2 me -PRON- PRON PRP 4
## 4 doc1 1 3 be be VERB VB 7
## 5 doc1 1 4 the the DET DT 10
## 6 doc1 1 5 one one NOUN NN 14
## 7 doc1 1 6 you -PRON- PRON PRP 18
## 8 doc1 1 7 call call VERB VBP 22
## 9 doc1 1 8 . . PUNCT . 26
## 11 doc1 2 1 If if ADP IN 28
## 12 doc1 2 2 you -PRON- PRON PRP 31
## 13 doc1 2 3 jump jump VERB VBP 35
## 14 doc1 2 4 , , PUNCT , 39
## 15 doc1 2 5 I -PRON- PRON PRP 41
## 16 doc1 2 6 \\'ll will AUX MD 42
## 17 doc1 2 7 break break VERB VB 46
## 18 doc1 2 8 your -PRON- DET PRP$ 52
## 19 doc1 2 9 fall fall NOUN NN 57
## 20 doc1 2 10 . . PUNCT . 61
The table returns rows consisted of information regarding each token. Here are some detailed descriptions of what each column means:
See ?cnlp_get_token
for more information.
Here is another data table you might be interested in:
cnlp_get_dependency(obj)
## id sid tid tid_target relation relation_full
## 1 doc1 1 0 1 ROOT NA
## 2 doc1 1 3 2 nsubj NA
## 3 doc1 1 1 3 ccomp NA
## 4 doc1 1 5 4 det NA
## 5 doc1 1 3 5 attr NA
## 6 doc1 1 7 6 nsubj NA
## 7 doc1 1 5 7 relcl NA
## 8 doc1 1 1 8 punct NA
## 9 doc1 2 3 1 mark NA
## 10 doc1 2 3 2 nsubj NA
## 11 doc1 2 7 3 advcl NA
## 12 doc1 2 7 4 punct NA
## 13 doc1 2 7 5 nsubj NA
## 14 doc1 2 7 6 aux NA
## 15 doc1 2 0 7 ROOT NA
## 16 doc1 2 9 8 poss NA
## 17 doc1 2 7 9 dobj NA
## 18 doc1 2 7 10 punct NA
This data frame contains annotations produced by grammatically parsing the text. In other words, they describe what roles the elements serve in each sentence. Though it is valuable information, I’m skeptical as to how accurate this parser is with learner language, so I won’t be dealing with this feature as much.
This concludes Part 1 of the series. In Part 2, I will go through how to process text files - single file, multiple files, and all files under a specific directory. Please contact me if you have any questions, corrections, or suggestions!