About

applied linguist, Korean language teacher


Latest posts

Jan. 12, 2019

Web Scraping with R: A Great Resource for Language Learning and Teaching

Contents:
1. Introduction
1.1. Tools Needed
2. Example 1: Scraping Webpages
2.1. Wikipedia entries
2.2. More ideas
3. Example 2: Scraping Blogs
3.1. The Big Bang Theory transcripts
3.2. More ideas
4. Example 3: Scraping Online Newspapers
4.1. Social Commentary from CNN
4.2. More ideas
5. Summary

1. Introduction

Recently, I helped a colleague scrape text from Wikipedia for a class project.

Jan. 1, 2019

Using Stanford CoreNLP with R: Bigram and Trigram Analysis

Contents:
1. Preparation
1.1. Install Java
1.2. Install cleanNLP and language model
2. Annotation Using Stanford CoreNLP
3. Example Text Analysis: Creating Bigrams and Trigrams
3.1. With tidytext
3.2. Manually Creating Bigrams and Trigrams
3.3. Example Analysis: Be + words

Forget my previous posts on using the Stanford NLP engine via the command line and retrieving information from XML files in R… I’ve found that everything can be done in RStudio (though at least I learned more about how to work with XML in R).

Dec. 25, 2018

Comparing tools for obtaining word token and type

Contents:
1. Text files
2. Working with R packages
2.1. Quanteda
2.2. Tidytext
3. Results from Natural Language Processing Tools
3.1. spacy
3.2. Stanford CoreNLP
4. Comparisons
4.1. Tokens
4.2. Types

When analyzing texts in any context, the most basic linguistic characteristics of the corpus (i.e., texts) to describe are word tokens (i.e., the number of words) and types (i.e., the number of distinct words).
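
To make the token/type distinction concrete, here is a minimal sketch in plain Python. It is not the method used in the post: the function name `count_tokens_and_types`, the sample sentence, and the tokenization rule (lowercase the text, then treat any run of letters or apostrophes as one word) are all assumptions made for illustration. The tools compared in the post each apply their own tokenization rules, so their counts for the same text may differ.

```python
import re


def count_tokens_and_types(text):
    """Return (number of tokens, number of types) for a text.

    Assumption: a word is any run of ASCII letters or apostrophes,
    compared case-insensitively. Real tokenizers are more careful.
    """
    # Tokens: every word occurrence, in order of appearance.
    tokens = re.findall(r"[a-z']+", text.lower())
    # Types: the distinct words among those tokens.
    types = set(tokens)
    return len(tokens), len(types)


# Hypothetical sample text, not taken from the post's corpus.
sample = "The cat sat on the mat. The mat was flat."
n_tokens, n_types = count_tokens_and_types(sample)
print(f"tokens: {n_tokens}, types: {n_types}")
```

Because "the" occurs three times and "mat" twice, the type count comes out lower than the token count. Changing the rule (for example, keeping case or splitting on apostrophes) changes both numbers, which is one reason different tools can disagree.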
