class: center, middle, inverse, title-slide

# Taking text data to the next level
## Using supervised and unsupervised approaches in NLP
### Cosima Meyer
### 2020-12-08

---
class: inverse, middle, center

<img src="computer2.png" width=200 height=150>

<svg style="height:0.8em;top:.04em;position:relative;fill:white;" viewBox="0 0 496 512"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg> Code and slides at <br><br> [bit.ly/nlp-talk](http://bit.ly/nlp-talk)

---

## Researcher and data lover

.left-column[
<img src="book.png" width=100 height=100>
<br>
<img src="computer.png" width=100 height=100>
<br>
<img src="lamp.png" width=100 height=100>
]

.right-column[
**PhD Candidate** and **researcher** at the University of Mannheim
<br><br><br>
**Co-founder** and **co-editor** of the data science blog [**Methods Bites**](https://www.mzes.uni-mannheim.de/socialsciencedatalab/)
<br><br><br><br><br>
**Maintainer** and **author** of the CRAN package [**overviewR**](https://cosimameyer.github.io/overviewR/)
]

---
layout: true

.footer[bit.ly/nlp-talk]

---

## Where do you find text data?

- Short answer: **everywhere!**

<br>
<br>
<br>

- It can be a speech, a peace agreement, a treaty, a newspaper article, a diary report, an open survey answer, some archival notes, tweets, website text, ...



---

## What is text data?

.pull-left[


]

.pull-right[
Text data can be **documents**
]

---

## What is text data?

.pull-left[


]

.pull-right[
Text data can be documents, **paragraphs**
]

---

## What is text data?

.pull-left[


]

.pull-right[
Text data can be documents, paragraphs, **sentences**
]

---

## What is text data?

.pull-left[


]

.pull-right[
Text data can be documents, paragraphs, sentences, or also **single words**
]

---

## What is text data?

.pull-left[


]

.pull-right[
**Corpus**: collection of documents
]

---

## What is text data?

.pull-left[



]

.pull-right[
**Corpus**: collection of documents
<br><br><br><br><br><br><br><br><br><br><br>
**Tokens**: each individual word in a text (but it could also be a sentence, paragraph, or character)
]

---

## What is text data?

.pull-left[


]

---

## What is text data?

.pull-left[


]

---

## What is text data?

.pull-left[


]

---
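## What is text data?

A minimal sketch in `quanteda` of the two concepts we just saw — the two mini-documents are invented for illustration (not the UN data we use later):

```r
# Load the package
library(quanteda)

# Two invented mini-documents
toy <- c(doc1 = "We want peace.",
         doc2 = "Peace takes time.")

# A corpus is the collection of these documents ...
toy_corpus <- corpus(toy)

# ... and tokens() splits each document into its single words
tokens(toy_corpus)
```

---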
.pull-left[  ] .pull-right[ **Tokenization**: Creating a **bag of words** ] --- ## What is text data? .pull-left[   ] .pull-right[ **Tokenization**: Creating a **bag of words** </br> </br> </br> </br> </br> </br> </br> </br> **Document-feature matrix (DFM)**: First split the text into its single terms (tokens), then count how frequently each token occurs in each document ] --- ## What is text data? .pull-left[  ] .pull-right[ **Stemming**: Getting the stem of the word ] --- ## What is text data? .pull-left[  </br> </br>  ] .pull-right[ **Stemming**: Getting the stem of the word </br> </br> </br> </br> </br> </br> </br> </br> **Lemmatization**: Getting the meaningful stem of the word ] --- layout: true .footer[ ] --- ## Speeches at the UNGD .centering[] <center><small>Photo taken by me (October 2019)</small></center> Speeches are based on the data set by [**Mikhaylov et al. (2017)**](https://doi.org/10.7910/DVN/0TJX8Y) --- layout: true .footer[bit.ly/nlp-talk] --- class: inverse background-image: url("overview_new.jpg") background-size: 600px --- class: inverse background-image: url("overview_new_1.jpg") background-size: 600px --- class: inverse background-image: url("overview_new_2.jpg") background-size: 600px --- class: inverse background-image: url("overview_new_3.jpg") background-size: 600px --- class: inverse background-image: url("overview_new_4.jpg") background-size: 600px --- class: inverse background-image: url("overview_new_5.jpg") background-size: 600px --- class: inverse, center, middle # Quanteda --- ## Quanteda - `quanteda`: **qu**antitative **an**alysis of **te**xtual **da**ta - Fully-featured package that allows you to easily perform NLP tasks - What are alternatives? - `tm`: simpler grammar but fewer features - `tidytext`: good integration with the `tidyverse` - `koRpus`: good for part-of-speech tagging - ... --- ## How do we use quanteda? 1) **Import** the data 2) Build a **corpus** 3) **Pre-process your data** 4) Calculate a **document-feature matrix** (DFM) --  <center><small><a href="https://www.mzes.uni-mannheim.de/socialsciencedatalab/article/advancing-text-mining/#stm">Methods Bites (2019)</a></small></center> --- ## How do we use quanteda? 1) **Import** the data ```r # Load packages library(quanteda) # For NLP load("../data/UN-data.RData") ``` --- ## How do we use quanteda? 2) Build a **corpus** ```r # Generate a corpus from our data set mycorpus <- corpus(un_data) # Assigns a unique identifier to each text docvars(mycorpus, "Textno") <- sprintf("%02d", 1:ndoc(mycorpus)) ``` --- ## How do we use quanteda? 3) **Preprocess the text** .pull-left[Steps in the pre-processing world: - Remove numbers - Remove punctuation - Remove symbols - Remove URLS - Remove hyphens - But keep doc vars ] .pull-right[ ```r # Create tokens token <- tokens( mycorpus, remove_numbers = TRUE, remove_punct = TRUE, remove_symbols = TRUE, remove_url = TRUE, split_hyphens = TRUE, include_docvars = TRUE ) ``` ] --- ## How do we use quanteda? ```r # Clean tokens created by OCR token_ungd <- tokens_select( token, c("[\\d-]", "[[:punct:]]", "^.{1,2}$"), selection = "remove", valuetype = "regex", verbose = TRUE ) ``` ``` ## removed 1,619 features ``` --- layout: true .footer[ ] --- ## How do we use quanteda? 
## How do we use quanteda?

4) Calculate a **document-feature matrix** (DFM)

.pull-left[
Here, we also:
- Lowercase the text
- Stem the words
- Remove stop words
]

.pull-right[
```r
mydfm <- dfm(
  token_ungd,
  tolower = TRUE,
  stem = TRUE,
  remove = stopwords("english")
)
```
]

--

```r
head(mydfm)
```

--

```
Document-feature matrix of: 6 documents, 19,265 features (96.6% sparse) and 4 docvars.
                 features
docs              way assembl hall inform suprem state council islam
  AFG_55_2000.txt   1       8    1      2      1    14       9    16
  AGO_55_2000.txt   5       2    0      0      0     2       5     0
  ALB_55_2000.txt   2       1    0      0      0     1       2     0
  AND_55_2000.txt   2       2    1      1      0     7       1     0
  ARE_55_2000.txt   0       2    0      1      0    10       6     4
  ARG_55_2000.txt   4       4    0      1      0    11       7     0
```

---
layout: true

.footer[bit.ly/nlp-talk]

---

## How do we use quanteda?

Trim the DFM: remove all terms that occur in fewer than 7.5% or in more than 90% of all documents

```r
mydfm.trim <- dfm_trim(
  mydfm,
  min_docfreq = 0.075, # min 7.5%
  max_docfreq = 0.90,  # max 90%
  docfreq_type = "prop"
)
```

---

## Visualizing text data

.pull-left[
```r
quanteda::textplot_wordcloud(
  mydfm,
  min_count = 3,
  max_words = 500,
  color = "darkblue"
)
```
]

.pull-right[

]

---
class: inverse, center, middle

# Practice 💻

---
class: inverse, center, middle

# Supervised approaches

---

## Supervised approach:<br><br>Dictionary-based approach <br><br> and <br><br> sentiment analysis



---

### Supervised approach:<br>Dictionary-based approach



---

### Supervised approach:<br>Dictionary-based approach



---

### Supervised approach:<br>Dictionary-based approach



---

### Supervised approach:<br>Dictionary-based approach

.pull-left[


]

.pull-right[
```r
# Define dictionary
dict <- dictionary(file = "dict")

# Apply the dictionary to our DFM
dfm_dict <- dfm(mydfm.trim,
                groups = "country",
                dictionary = dict)
```
]

---
layout: true

.footer[ ]

---

### Supervised approach:<br>Dictionary-based approach

.pull-left[


]

.pull-right[
```r
# Define dictionary
dict <- dictionary(file = "dict")

# Apply the dictionary to our DFM
dfm_dict <- dfm(mydfm.trim,
                groups = "country",
                dictionary = dict)
```
]

```r
head(dfm_dict)
```

```
Document-feature matrix of: 6 documents, 28 features (35.7% sparse) and 1 docvar.
      features
docs   macroeconomics civil_rights healthcare agriculture
  AFG               3           14         13           0
  AGO              11            4         10           0
```

---
layout: true

.footer[bit.ly/nlp-talk]

---
class: inverse, center, middle

# Practice 💻

---

### Supervised approach:<br>Sentiment analysis

Often works in a similar way to the dictionary-based approach



---

### Supervised approach:<br>Sentiment analysis

Often works in a similar way to the dictionary-based approach



---

### Supervised approach:<br>Sentiment analysis

Often works in a similar way to the dictionary-based approach



---

### Supervised approach:<br>Sentiment analysis

Often works in a similar way to the dictionary-based approach



---
layout: true

.footer[ ]

---

### Supervised approach:<br>Sentiment analysis

.pull-left[


]

.pull-right[
```r
# Call a dictionary
dict <- data_dictionary_LSD2015

dfmat_lsd <- dfm(mydfm.trim,
                 dictionary = dict[1:2])
```
]

---

### Supervised approach:<br>Sentiment analysis

.pull-left[


]

.pull-right[
```r
# Call a dictionary
dict <- data_dictionary_LSD2015

dfmat_lsd <- dfm(mydfm.trim,
                 dictionary = dict[1:2])
```
]

```r
head(dfmat_lsd, 2)
```

```
Document-feature matrix of: 2 documents, 2 features (0.0% sparse) and 4 docvars.
                 features
docs              negative positive
  AFG_55_2000.txt       84       68
  AGO_55_2000.txt       88       95
```

---
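### Supervised approach:<br>Sentiment analysis

A possible next step (a sketch, not part of the original slides): combine the two counts into a single net sentiment score per speech:

```r
# Turn the scored DFM into a data frame
sentiment_df <- convert(dfmat_lsd, to = "data.frame")

# Net sentiment: positive minus negative counts per document
sentiment_df$net_sentiment <-
  sentiment_df$positive - sentiment_df$negative

head(sentiment_df, 2)
```

---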
class: inverse, center, middle

# Practice 💻

---
class: inverse, center, middle

# Unsupervised approaches

---
layout: true

.footer[bit.ly/nlp-talk]

---

## Unsupervised approach:<br>Topic models



---

## Unsupervised approach:<br>Topic models



---

## Unsupervised approach:<br>Topic models



---

## Unsupervised approach:<br>Topic models



---

## Unsupervised approach:<br>Topic models



---

## Unsupervised approach:<br>Topic models

- [**Latent Dirichlet Allocation (LDA)**](https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-latent-dirichlet-allocation-437c81220158)
- [**Structural Topic Models**](http://www.luigicurini.com/uploads/6/7/9/8/67985527/stm_paper_ajps12103.pdf): an extension of LDA that also takes context variables (e.g., information about the document) into account

---

## Unsupervised approach:<br>Topic models

.pull-left[
]

.pull-right[
```r
# Load package
library(stm)

# Assign the number of topics
topic.count <- 5

# Convert the DFM to an STM object
dfm2stm <- convert(mydfm.trim, to = "stm")

# Run the topic model
model.stm <- stm(
  dfm2stm$documents,
  dfm2stm$vocab,
  K = topic.count,
  prevalence = ~ country,
  data = dfm2stm$meta,
  init.type = "Spectral"
)
```
]

---
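## Unsupervised approach:<br>Topic models

Once the model has run, two quick ways to inspect it (a sketch, not part of the original slides; both functions ship with the `stm` package):

```r
# Most probable words for each of the K topics
labelTopics(model.stm, n = 10)

# Expected topic proportions across all speeches
plot(model.stm, type = "summary")
```

---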
class: inverse, center, middle

# Practice 💻

---
class: inverse, center, middle

# How can you use NLP in your life?

---

## Build your own search engine...

[](https://cosima-meyer.shinyapps.io/coro2vid-19-shinyapp/)

---

## ... or your own chatbot

[](https://medium.com/@kumaramanjha2901/building-a-chatbot-in-python-using-chatterbot-and-deploying-it-on-web-7a66871e1d9b)

---
class: inverse
background-image: url("computer2.png")
background-size: 200px
background-position: 95% 8%

<br><br><br>

# More resources

.pull-left[.small[
- Quanteda
  - [Kohei Watanabe and Stefan Müller: Quanteda Tutorials](https://tutorials.quanteda.io)
  - [Quanteda Cheat Sheet](https://muellerstefan.net/files/quanteda-cheatsheet.pdf)
- More on text mining and NLP
  - [Cosima Meyer and Cornelius Puschmann: Advancing Text Mining with R and quanteda](https://www.mzes.uni-mannheim.de/socialsciencedatalab/article/advancing-text-mining/)
  - [Justin Grimmer and Brandon Stewart: Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts](https://www.cambridge.org/core/journals/political-analysis/article/text-as-data-the-promise-and-pitfalls-of-automatic-content-analysis-methods-for-political-texts/F7AAC8B2909441603FEB25C156448F20)
  - [Dan Jurafsky and James H. Martin: Speech and Language Processing](https://web.stanford.edu/~jurafsky/slp3/)
]]

.pull-right[.small[
- Sentiment analysis
  - [sentimentr](https://github.com/trinker/sentimentr)
  - [Hammerschmidt/Meyer 2020: Money Makes the World Go Frowned - Analyzing the Impact of Chinese Foreign Aid on States' Sentiment Using Natural Language Processing](https://www.tectum-shop.de/titel/chinas-rolle-in-einer-neuen-weltordnung-id-97867/)
- More general resources
  - [Data Science & Society](https://dssoc.github.io/schedule/)
  - [RegEx Cheat Sheet](https://rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf)
  - [Stringr Cheat Sheet](https://github.com/rstudio/cheatsheets/blob/master/strings.pdf)
- Model validation
  - [oolong: Validation of dictionary approaches and topic models](https://cran.r-project.org/web/packages/oolong/index.html)
  - [stminsights](https://github.com/cschwem2er/stminsights)
]]

---
layout: true

.footer[ ]

---
class: inverse, middle, center

<img src="computer2.png" width=200 height=150>

<svg style="height:0.8em;top:.04em;position:relative;fill:white;" viewBox="0 0 512 512"><path d="M459.37 151.716c.325 4.548.325 9.097.325 13.645 0 138.72-105.583 298.558-298.558 298.558-59.452 0-114.68-17.219-161.137-47.106 8.447.974 16.568 1.299 25.34 1.299 49.055 0 94.213-16.568 130.274-44.832-46.132-.975-84.792-31.188-98.112-72.772 6.498.974 12.995 1.624 19.818 1.624 9.421 0 18.843-1.3 27.614-3.573-48.081-9.747-84.143-51.98-84.143-102.985v-1.299c13.969 7.797 30.214 12.67 47.431 13.319-28.264-18.843-46.781-51.005-46.781-87.391 0-19.492 5.197-37.36 14.294-52.954 51.655 63.675 129.3 105.258 216.365 109.807-1.624-7.797-2.599-15.918-2.599-24.04 0-57.828 46.782-104.934 104.934-104.934 30.213 0 57.502 12.67 76.67 33.137 23.715-4.548 46.456-13.32 66.599-25.34-7.798 24.366-24.366 44.833-46.132 57.827 21.117-2.273 41.584-8.122 60.426-16.243-14.292 20.791-32.161 39.308-52.628 54.253z"/></svg> [@cosima_meyer](https://twitter.com/cosima_meyer)
<svg style="height:0.8em;top:.04em;position:relative;fill:white;" viewBox="0 0 448 512"><path d="M416 32H31.9C14.3 32 0 46.5 0 64.3v383.4C0 465.5 14.3 480 31.9 480H416c17.6 0 32-14.5 32-32.3V64.3c0-17.8-14.4-32.3-32-32.3zM135.4 416H69V202.2h66.5V416zm-33.2-243c-21.3 0-38.5-17.3-38.5-38.5S80.9 96 102.2 96c21.2 0 38.5 17.3 38.5 38.5 0 21.3-17.2 38.5-38.5 38.5zm282.1 243h-66.4V312c0-24.8-.5-56.7-34.5-56.7-34.6 0-39.9 27-39.9 54.9V416h-66.4V202.2h63.7v29.2h.9c8.9-16.8 30.6-34.5 62.9-34.5 67.2 0 79.7 44.3 79.7 101.9V416z"/></svg> [cosimameyer](https://www.linkedin.com/in/cosimameyer/)
<svg style="height:0.8em;top:.04em;position:relative;fill:white;" viewBox="0 0 496 512"><path d="M336.5 160C322 70.7 287.8 8 248 8s-74 62.7-88.5 152h177zM152 256c0 22.2 1.2 43.5 3.3 64h185.3c2.1-20.5 3.3-41.8 3.3-64s-1.2-43.5-3.3-64H155.3c-2.1 20.5-3.3 41.8-3.3 64zm324.7-96c-28.6-67.9-86.5-120.4-158-141.6 24.4 33.8 41.2 84.7 50 141.6h108zM177.2 18.4C105.8 39.6 47.8 92.1 19.3 160h108c8.7-56.9 25.5-107.8 49.9-141.6zM487.4 192H372.7c2.1 21 3.3 42.5 3.3 64s-1.2 43-3.3 64h114.6c5.5-20.5 8.6-41.8 8.6-64s-3.1-43.5-8.5-64zM120 256c0-21.5 1.2-43 3.3-64H8.6C3.2 212.5 0 233.8 0 256s3.2 43.5 8.6 64h114.6c-2-21-3.2-42.5-3.2-64zm39.5 96c14.5 89.3 48.7 152 88.5 152s74-62.7 88.5-152h-177zm159.3 141.6c71.4-21.2 129.4-73.7 158-141.6h-108c-8.8 56.9-25.6 107.8-50 141.6zM19.3 352c28.6 67.9 86.5 120.4 158 141.6-24.4-33.8-41.2-84.7-50-141.6h-108z"/></svg> [cosimameyer.rbind.io](http://cosimameyer.rbind.io)
<svg style="height:0.8em;top:.04em;position:relative;fill:white;" viewBox="0 0 496 512"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg> Code and slides at <br><br> [bit.ly/nlp-talk](http://bit.ly/nlp-talk)

---

.footer[Illustrations are either created by me or provided by www.canva.com]