class: center, middle, inverse, title-slide

# Taking text data to the next level
## Using supervised and unsupervised approaches in NLP
### Cosima Meyer
### 2020-12-08

---
class: inverse, middle, center

<img src="computer2.png" width=200 height=150>

<svg style="height:0.8em;top:.04em;position:relative;fill:white;" viewBox="0 0 496 512"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg> Code and slides at <br><br> [bit.ly/nlp-talk](http://bit.ly/nlp-talk)

---

## Researcher and data lover

.left-column[
<img src="book.png" width=100 height=100>
<br>
<img src="computer.png" width=100 height=100>
<br>
<img src="lamp.png" width=100 height=100>
]

.right-column[
**PhD Candidate** and **researcher** at the University of Mannheim
<br><br><br>
**Co-founder** and **co-editor** of the data science blog [**Methods Bites**](https://www.mzes.uni-mannheim.de/socialsciencedatalab/)
<br><br><br><br><br>
**Maintainer** and **author** of the CRAN package [**overviewR**](https://cosimameyer.github.io/overviewR/)
]

---
layout: true

.footer[bit.ly/nlp-talk]

---

## Where do you find text data?

- Short answer: **everywhere!**

<br>
<br>
<br>

- It can be a speech, a peace agreement, a treaty, a newspaper article, a diary report, an open survey answer, some archival notes, tweets, website text, ...



---

## What is text data?

.pull-left[


]

.pull-right[
Text data can be **documents**
]

---

## What is text data?

.pull-left[


]

.pull-right[
Text data can be documents, **paragraphs**
]

---

## What is text data?

.pull-left[


]

.pull-right[
Text data can be documents, paragraphs, **sentences**
]

---

## What is text data?

.pull-left[


]

.pull-right[
Text data can be documents, paragraphs, sentences, or also **single words**
]

---

## What is text data?

.pull-left[


]

.pull-right[
**Corpus**: collection of documents
]

---

## What is text data?

.pull-left[



]

.pull-right[
**Corpus**: collection of documents
<br><br><br><br><br><br><br><br><br><br><br>
**Tokens**: each individual word in a text (but it could also be a sentence, paragraph, or character)
]

---

## What is text data?

.pull-left[


]

---

## What is text data?

.pull-left[


]

---

## What is text data?

.pull-left[


]

---
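## What is text data?

A minimal sketch in `quanteda` of the two concepts we just saw — the two mini-documents are invented for illustration (not the UN data we use later):

```r
# Load the package
library(quanteda)

# Two invented mini-documents
toy <- c(doc1 = "We want peace.",
         doc2 = "Peace takes time.")

# A corpus is the collection of these documents ...
toy_corpus <- corpus(toy)

# ... and tokens() splits each document into its single words
tokens(toy_corpus)
```

---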
.pull-left[  ] .pull-right[ **Tokenization**: Creating a **bag of words** ] --- ## What is text data? .pull-left[   ] .pull-right[ **Tokenization**: Creating a **bag of words** </br> </br> </br> </br> </br> </br> </br> </br> **Document-feature matrix (DFM)**: First split the text into its single terms (tokens), then count how frequently each token occurs in each document ] --- ## What is text data? .pull-left[  ] .pull-right[ **Stemming**: Getting the stem of the word ] --- ## What is text data? .pull-left[  </br> </br>  ] .pull-right[ **Stemming**: Getting the stem of the word </br> </br> </br> </br> </br> </br> </br> </br> **Lemmatization**: Getting the meaningful stem of the word ] --- layout: true .footer[ ] --- ## Speeches at the UNGD .centering[] <center><small>Photo taken by me (October 2019)</small></center> Speeches are based on the data set by [**Mikhaylov et al. (2017)**](https://doi.org/10.7910/DVN/0TJX8Y) --- layout: true .footer[bit.ly/nlp-talk] --- class: inverse background-image: url("overview_new.jpg") background-size: 600px --- class: inverse background-image: url("overview_new_1.jpg") background-size: 600px --- class: inverse background-image: url("overview_new_2.jpg") background-size: 600px --- class: inverse background-image: url("overview_new_3.jpg") background-size: 600px --- class: inverse background-image: url("overview_new_4.jpg") background-size: 600px --- class: inverse background-image: url("overview_new_5.jpg") background-size: 600px --- class: inverse, center, middle # Quanteda --- ## Quanteda - `quanteda`: **qu**antitative **an**alysis of **te**xtual **da**ta - Fully-featured package that allows you to easily perform NLP tasks - What are alternatives? - `tm`: simpler grammar but fewer features - `tidytext`: good integration with the `tidyverse` - `koRpus`: good for part-of-speech tagging - ... --- ## How do we use quanteda? 1) **Import** the data 2) Build a **corpus** 3) **Pre-process your data** 4) Calculate a **document-feature matrix** (DFM) --  <center><small><a href="https://www.mzes.uni-mannheim.de/socialsciencedatalab/article/advancing-text-mining/#stm">Methods Bites (2019)</a></small></center> --- ## How do we use quanteda? 1) **Import** the data ```r # Load packages library(quanteda) # For NLP load("../data/UN-data.RData") ``` --- ## How do we use quanteda? 2) Build a **corpus** ```r # Generate a corpus from our data set mycorpus <- corpus(un_data) # Assigns a unique identifier to each text docvars(mycorpus, "Textno") <- sprintf("%02d", 1:ndoc(mycorpus)) ``` --- ## How do we use quanteda? 3) **Preprocess the text** .pull-left[Steps in the pre-processing world: - Remove numbers - Remove punctuation - Remove symbols - Remove URLS - Remove hyphens - But keep doc vars ] .pull-right[ ```r # Create tokens token <- tokens( mycorpus, remove_numbers = TRUE, remove_punct = TRUE, remove_symbols = TRUE, remove_url = TRUE, split_hyphens = TRUE, include_docvars = TRUE ) ``` ] --- ## How do we use quanteda? ```r # Clean tokens created by OCR token_ungd <- tokens_select( token, c("[\\d-]", "[[:punct:]]", "^.{1,2}$"), selection = "remove", valuetype = "regex", verbose = TRUE ) ``` ``` ## removed 1,619 features ``` --- layout: true .footer[ ] --- ## How do we use quanteda? 
## How do we use quanteda?

4) Calculate a **document-feature matrix** (DFM)

.pull-left[
Here, we also:
- Lowercase the text
- Stem the words
- Remove stop words
]

.pull-right[
```r
mydfm <- dfm(
  token_ungd,
  tolower = TRUE,
  stem = TRUE,
  remove = stopwords("english")
)
```
]

--

```r
head(mydfm)
```

--

```
Document-feature matrix of: 6 documents, 19,265 features (96.6% sparse) and 4 docvars.
                 features
docs              way assembl hall inform suprem state council islam
  AFG_55_2000.txt   1       8    1      2      1    14       9    16
  AGO_55_2000.txt   5       2    0      0      0     2       5     0
  ALB_55_2000.txt   2       1    0      0      0     1       2     0
  AND_55_2000.txt   2       2    1      1      0     7       1     0
  ARE_55_2000.txt   0       2    0      1      0    10       6     4
  ARG_55_2000.txt   4       4    0      1      0    11       7     0
```

---
layout: true

.footer[bit.ly/nlp-talk]

---

## How do we use quanteda?

Trim the DFM: remove all terms that occur in fewer than 7.5% or in more than 90% of all documents

```r
mydfm.trim <- dfm_trim(
  mydfm,
  min_docfreq = 0.075, # min 7.5%
  max_docfreq = 0.90,  # max 90%
  docfreq_type = "prop"
)
```

---

## Visualizing text data

.pull-left[
```r
quanteda::textplot_wordcloud(
  mydfm,
  min_count = 3,
  max_words = 500,
  color = "darkblue"
)
```
]

.pull-right[

]

---
class: inverse, center, middle

# Practice 💻

---
class: inverse, center, middle

# Supervised approaches

---

## Supervised approach:<br><br>Dictionary-based approach <br><br> and <br><br> sentiment analysis



---

### Supervised approach:<br>Dictionary-based approach



---

### Supervised approach:<br>Dictionary-based approach



---

### Supervised approach:<br>Dictionary-based approach



---

### Supervised approach:<br>Dictionary-based approach

.pull-left[


]

.pull-right[
```r
# Define dictionary
dict <- dictionary(file = "dict")

# Apply the dictionary to our DFM
dfm_dict <- dfm(mydfm.trim,
                groups = "country",
                dictionary = dict)
```
]

---
layout: true

.footer[ ]

---

### Supervised approach:<br>Dictionary-based approach

.pull-left[


]

.pull-right[
```r
# Define dictionary
dict <- dictionary(file = "dict")

# Apply the dictionary to our DFM
dfm_dict <- dfm(mydfm.trim,
                groups = "country",
                dictionary = dict)
```
]

```r
head(dfm_dict)
```

```
Document-feature matrix of: 6 documents, 28 features (35.7% sparse) and 1 docvar.
      features
docs   macroeconomics civil_rights healthcare agriculture
  AFG               3           14         13           0
  AGO              11            4         10           0
```

---
layout: true

.footer[bit.ly/nlp-talk]

---
class: inverse, center, middle

# Practice 💻

---

### Supervised approach:<br>Sentiment analysis

Often works in a similar way to the dictionary-based approach



---

### Supervised approach:<br>Sentiment analysis

Often works in a similar way to the dictionary-based approach



---

### Supervised approach:<br>Sentiment analysis

Often works in a similar way to the dictionary-based approach



---

### Supervised approach:<br>Sentiment analysis

Often works in a similar way to the dictionary-based approach



---
layout: true

.footer[ ]

---

### Supervised approach:<br>Sentiment analysis

.pull-left[


]

.pull-right[
```r
# Call a dictionary
dict <- data_dictionary_LSD2015

dfmat_lsd <- dfm(mydfm.trim,
                 dictionary = dict[1:2])
```
]

---

### Supervised approach:<br>Sentiment analysis

.pull-left[


]

.pull-right[
```r
# Call a dictionary
dict <- data_dictionary_LSD2015

dfmat_lsd <- dfm(mydfm.trim,
                 dictionary = dict[1:2])
```
]

```r
head(dfmat_lsd, 2)
```

```
Document-feature matrix of: 2 documents, 2 features (0.0% sparse) and 4 docvars.
                 features
docs              negative positive
  AFG_55_2000.txt       84       68
  AGO_55_2000.txt       88       95
```

---
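### Supervised approach:<br>Sentiment analysis

A possible next step (a sketch, not part of the original slides): combine the two counts into a single net sentiment score per speech:

```r
# Turn the scored DFM into a data frame
sentiment_df <- convert(dfmat_lsd, to = "data.frame")

# Net sentiment: positive minus negative counts per document
sentiment_df$net_sentiment <-
  sentiment_df$positive - sentiment_df$negative

head(sentiment_df, 2)
```

---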
class: inverse, center, middle

# Practice 💻

---
class: inverse, center, middle

# Unsupervised approaches

---
layout: true

.footer[bit.ly/nlp-talk]

---

## Unsupervised approach:<br>Topic models



---

## Unsupervised approach:<br>Topic models



---

## Unsupervised approach:<br>Topic models



---

## Unsupervised approach:<br>Topic models



---

## Unsupervised approach:<br>Topic models



---

## Unsupervised approach:<br>Topic models

- [**Latent Dirichlet Allocation (LDA)**](https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-latent-dirichlet-allocation-437c81220158)
- [**Structural Topic Models**](http://www.luigicurini.com/uploads/6/7/9/8/67985527/stm_paper_ajps12103.pdf): an extension of LDA that also takes context variables (e.g., information about the document) into account

---

## Unsupervised approach:<br>Topic models

.pull-left[
]

.pull-right[
```r
# Load package
library(stm)

# Assign the number of topics
topic.count <- 5

# Convert the DFM to an STM object
dfm2stm <- convert(mydfm.trim, to = "stm")

# Run the topic model
model.stm <- stm(
  dfm2stm$documents,
  dfm2stm$vocab,
  K = topic.count,
  prevalence = ~ country,
  data = dfm2stm$meta,
  init.type = "Spectral"
)
```
]

---
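## Unsupervised approach:<br>Topic models

Once the model has run, two quick ways to inspect it (a sketch, not part of the original slides; both functions ship with the `stm` package):

```r
# Most probable words for each of the K topics
labelTopics(model.stm, n = 10)

# Expected topic proportions across all speeches
plot(model.stm, type = "summary")
```

---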
class: inverse, center, middle

# Practice 💻

---
class: inverse, center, middle

# How can you use NLP in your life?

---

## Build your own search engine...

[](https://cosima-meyer.shinyapps.io/coro2vid-19-shinyapp/)

---

## ... or your own chatbot

[](https://medium.com/@kumaramanjha2901/building-a-chatbot-in-python-using-chatterbot-and-deploying-it-on-web-7a66871e1d9b)

---
class: inverse
background-image: url("computer2.png")
background-size: 200px
background-position: 95% 8%

<br><br><br>

# More resources

.pull-left[.small[
- Quanteda
  - [Kohei Watanabe and Stefan Müller: Quanteda Tutorials](https://tutorials.quanteda.io)
  - [Quanteda Cheat Sheet](https://muellerstefan.net/files/quanteda-cheatsheet.pdf)
- More on text mining and NLP
  - [Cosima Meyer and Cornelius Puschmann: Advancing Text Mining with R and quanteda](https://www.mzes.uni-mannheim.de/socialsciencedatalab/article/advancing-text-mining/)
  - [Justin Grimmer and Brandon Stewart: Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts](https://www.cambridge.org/core/journals/political-analysis/article/text-as-data-the-promise-and-pitfalls-of-automatic-content-analysis-methods-for-political-texts/F7AAC8B2909441603FEB25C156448F20)
  - [Dan Jurafsky and James H. Martin: Speech and Language Processing](https://web.stanford.edu/~jurafsky/slp3/)
]]

.pull-right[.small[
- Sentiment analysis
  - [sentimentr](https://github.com/trinker/sentimentr)
  - [Hammerschmidt/Meyer 2020: Money Makes the World Go Frowned - Analyzing the Impact of Chinese Foreign Aid on States' Sentiment Using Natural Language Processing](https://www.tectum-shop.de/titel/chinas-rolle-in-einer-neuen-weltordnung-id-97867/)
- More general resources
  - [Data Science & Society](https://dssoc.github.io/schedule/)
  - [RegEx Cheat Sheet](https://rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf)
  - [Stringr Cheat Sheet](https://github.com/rstudio/cheatsheets/blob/master/strings.pdf)
- Model validation
  - [oolong: Validation of dictionary approaches and topic models](https://cran.r-project.org/web/packages/oolong/index.html)
  - [stminsights](https://github.com/cschwem2er/stminsights)
]]

---
layout: true

.footer[ ]

---
class: inverse, middle, center

<img src="computer2.png" width=200 height=150>

<svg style="height:0.8em;top:.04em;position:relative;fill:white;" viewBox="0 0 512 512"><path d="M459.37 151.716c.325 4.548.325 9.097.325 13.645 0 138.72-105.583 298.558-298.558 298.558-59.452 0-114.68-17.219-161.137-47.106 8.447.974 16.568 1.299 25.34 1.299 49.055 0 94.213-16.568 130.274-44.832-46.132-.975-84.792-31.188-98.112-72.772 6.498.974 12.995 1.624 19.818 1.624 9.421 0 18.843-1.3 27.614-3.573-48.081-9.747-84.143-51.98-84.143-102.985v-1.299c13.969 7.797 30.214 12.67 47.431 13.319-28.264-18.843-46.781-51.005-46.781-87.391 0-19.492 5.197-37.36 14.294-52.954 51.655 63.675 129.3 105.258 216.365 109.807-1.624-7.797-2.599-15.918-2.599-24.04 0-57.828 46.782-104.934 104.934-104.934 30.213 0 57.502 12.67 76.67 33.137 23.715-4.548 46.456-13.32 66.599-25.34-7.798 24.366-24.366 44.833-46.132 57.827 21.117-2.273 41.584-8.122 60.426-16.243-14.292 20.791-32.161 39.308-52.628 54.253z"/></svg> [@cosima_meyer](https://twitter.com/cosima_meyer)
<svg style="height:0.8em;top:.04em;position:relative;fill:white;" viewBox="0 0 448 512"><path d="M416 32H31.9C14.3 32 0 46.5 0 64.3v383.4C0 465.5 14.3 480 31.9 480H416c17.6 0 32-14.5 32-32.3V64.3c0-17.8-14.4-32.3-32-32.3zM135.4 416H69V202.2h66.5V416zm-33.2-243c-21.3 0-38.5-17.3-38.5-38.5S80.9 96 102.2 96c21.2 0 38.5 17.3 38.5 38.5 0 21.3-17.2 38.5-38.5 38.5zm282.1 243h-66.4V312c0-24.8-.5-56.7-34.5-56.7-34.6 0-39.9 27-39.9 54.9V416h-66.4V202.2h63.7v29.2h.9c8.9-16.8 30.6-34.5 62.9-34.5 67.2 0 79.7 44.3 79.7 101.9V416z"/></svg> [cosimameyer](https://www.linkedin.com/in/cosimameyer/)
<svg style="height:0.8em;top:.04em;position:relative;fill:white;" viewBox="0 0 496 512"><path d="M336.5 160C322 70.7 287.8 8 248 8s-74 62.7-88.5 152h177zM152 256c0 22.2 1.2 43.5 3.3 64h185.3c2.1-20.5 3.3-41.8 3.3-64s-1.2-43.5-3.3-64H155.3c-2.1 20.5-3.3 41.8-3.3 64zm324.7-96c-28.6-67.9-86.5-120.4-158-141.6 24.4 33.8 41.2 84.7 50 141.6h108zM177.2 18.4C105.8 39.6 47.8 92.1 19.3 160h108c8.7-56.9 25.5-107.8 49.9-141.6zM487.4 192H372.7c2.1 21 3.3 42.5 3.3 64s-1.2 43-3.3 64h114.6c5.5-20.5 8.6-41.8 8.6-64s-3.1-43.5-8.5-64zM120 256c0-21.5 1.2-43 3.3-64H8.6C3.2 212.5 0 233.8 0 256s3.2 43.5 8.6 64h114.6c-2-21-3.2-42.5-3.2-64zm39.5 96c14.5 89.3 48.7 152 88.5 152s74-62.7 88.5-152h-177zm159.3 141.6c71.4-21.2 129.4-73.7 158-141.6h-108c-8.8 56.9-25.6 107.8-50 141.6zM19.3 352c28.6 67.9 86.5 120.4 158 141.6-24.4-33.8-41.2-84.7-50-141.6h-108z"/></svg> [cosimameyer.rbind.io](http://cosimameyer.rbind.io)
<svg style="height:0.8em;top:.04em;position:relative;fill:white;" viewBox="0 0 496 512"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg> Code and slides at <br><br> [bit.ly/nlp-talk](http://bit.ly/nlp-talk)

---

.footer[Illustrations are either created by me or provided by www.canva.com]