I had the great pleasure to talk about NLP at R-Ladies Bergen yesterday. Thanks to everyone for making this event so much fun!
The talk covers both unsupervised and supervised approaches and introduces you to
quanteda, an R package that allows you to perform NLP tasks.
Here are some further insights into the talk:
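The snippets below rely on a document-feature matrix called `mydfm` that is not constructed in this post. As a rough sketch only, here is one way such an object could be built with quanteda. The toy texts and the preprocessing choices (removing punctuation and English stop words) are my assumptions for illustration, not the corpus or pipeline used in the talk.

```r
library(quanteda)

# Hypothetical toy texts standing in for the real corpus
texts <- c(
  doc1 = "Peace and security are the foundation of development.",
  doc2 = "Development requires peace, cooperation and security.",
  doc3 = "Climate change threatens peace and development."
)

# Tokenize the corpus and drop punctuation
toks <- tokens(corpus(texts), remove_punct = TRUE)

# Lowercase the tokens and remove English stop words
toks <- tokens_remove(tokens_tolower(toks), stopwords("en"))

# Build the document-feature matrix used by the plotting code
mydfm <- dfm(toks)

# Inspect the most frequent features
topfeatures(mydfm, 5)
```

With a real corpus you would pass your own character vector or data frame to `corpus()` and adapt the preprocessing to your research question.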
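# Load the packages used below
library(quanteda)
library(wesanderson)

# Plot a word cloud
# (in quanteda >= 3.0, textplot_wordcloud() lives in the quanteda.textplots package)
quanteda::textplot_wordcloud(
  # Load the DFM object
  mydfm,
  # Define the minimum number of times a word has to occur
  min_count = 3,
  # Define the maximum number of words to plot
  max_words = 500,
  # Define a color
  color = wes_palette("Darjeeling1")
)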
# This code is heavily inspired by Julia Silge's blog post
# (https://juliasilge.com/blog/sherlock-holmes-stm/)

# This code is heavily inspired by this blog post:
# (https://www.mzes.uni-mannheim.de/socialsciencedatalab/article/advancing-text-mining/)
# Load the packages used for data wrangling and plotting
library(dplyr)
library(ggplot2)
library(countrycode)

data %>%
  # Generate the country name for each country using the
  # `countrycode()` command
  dplyr::mutate(countryname = countrycode(ccode, "iso3c", "country.name")) %>%
  # Filter and only select specific countries that we want to compare
  dplyr::filter(countryname %in% c(
    "Germany", "France", "United Kingdom", "Norway", "Spain", "Sweden"
  )) %>%
  # Now comes the plotting part :-)
  ggplot() +
  # We do a bar plot that has the years on the x-axis and the level of the
  # net sentiment on the y-axis
  # We also color it so that all the net sentiments greater than 0 get a
  # different color
  geom_col(aes(
    x = year,
    y = net_perc,
    fill = (net_perc > 0)
  )) +
  # Here we define the colors as well as the labels and title of the legend
  scale_fill_manual(
    name = "Sentiment",
    labels = c("Negative", "Positive"),
    values = c("#C93312", "#446455")
  ) +
  # Now we add the axes labels
  xlab("Time") +
  ylab("Net sentiment") +
  # And do a facet_wrap by country to get a more meaningful visualization
  facet_wrap(~ countryname)
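The plot above uses a `net_perc` column that is not computed in this post. One common way to define a net sentiment score is the number of positive terms minus the number of negative terms, expressed as a percentage of all tokens. The base R sketch below illustrates that idea. The mini dictionary, the `net_sentiment()` helper, and the example speeches are all hypothetical and are not the data or the exact measure used in the talk.

```r
# Hypothetical mini dictionary; a real analysis would use a full lexicon
positive_words <- c("peace", "cooperation", "progress")
negative_words <- c("war", "crisis", "threat")

# Hypothetical helper: (positive - negative) terms as a percentage of all tokens
net_sentiment <- function(text) {
  # Split on anything that is not a letter and lowercase the result
  tokens <- tolower(unlist(strsplit(text, "[^[:alpha:]]+")))
  # Drop empty strings that can appear at the start of the split
  tokens <- tokens[nzchar(tokens)]
  n_pos <- sum(tokens %in% positive_words)
  n_neg <- sum(tokens %in% negative_words)
  100 * (n_pos - n_neg) / length(tokens)
}

# Hypothetical example data with the columns the plotting code expects
speeches <- data.frame(
  ccode = c("DEU", "NOR"),
  year = c(2000, 2000),
  text = c(
    "Peace and cooperation bring progress.",
    "The crisis is a threat to peace."
  ),
  stringsAsFactors = FALSE
)

# Compute the net sentiment for every speech
speeches$net_perc <- vapply(
  speeches$text, net_sentiment, numeric(1), USE.NAMES = FALSE
)
```

In practice you would likely rely on an established sentiment dictionary and quanteda's tokenization rather than a hand-written word list.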
# Inspired here: https://bit.ly/37MCEHg
# Get the 30 top features from the DFM
freq_feature <- topfeatures(mydfm, 30)

# Create a data.frame for ggplot
data <- data.frame(list(
  term = names(freq_feature),
  frequency = unname(freq_feature)
))

# Plot the plot
data %>%
  # Call ggplot
  ggplot() +
  # Add geom_segment (this will give us the lines of the lollipops)
  geom_segment(aes(
    x = reorder(term, frequency),
    xend = reorder(term, frequency),
    y = 0,
    yend = frequency
  ), color = "grey") +
  # Call a point plot with the terms on the x-axis
  # and the frequency on the y-axis
  geom_point(aes(x = reorder(term, frequency), y = frequency)) +
  # Flip the plot
  coord_flip() +
  # Add labels for the axes
  xlab("") +
  ylab("Absolute frequency of the features")
data %>%
  # Generate the continent for each country using the `countrycode()` command
  dplyr::mutate(continent = countrycode(ccode, "iso3c", "continent",
    custom_match = c("YUG" = "Europe")
  )) %>%
  # We group by continent and year to generate the average sentiment by
  # continent and year
  group_by(continent, year) %>%
  dplyr::mutate(avg = mean(net_perc)) %>%
  # We now plot it
  ggplot() +
  # Using a line chart with year on the x-axis, the average sentiment
  # by continent on the y-axis and colored by continent
  geom_line(aes(x = year, y = avg, col = continent)) +
  # Define the colors
  scale_color_manual(name = "", values = wes_palette("Darjeeling1")) +
  # Label the axes
  xlab("Time") +
  ylab("Average net sentiment")
The figures above show the output of the basic supervised and unsupervised NLP models that we covered during the talk. As you work more with textual data, you will see that there is much more to the field of NLP, including document similarity, text generation, and even chatbots, all of which you can build starting from the same simple steps that I presented in the talk 👩🏼‍💻
If you want more resources, you can access them here:
More on text mining and NLP
More general resource