The following blog post is based on a workshop that I delivered as part of Women in Data Science 2023.

For additional workshop material, please check the GitHub repository, where you can run the code yourself.

You can also find the slide deck to flip through here.

What can you take away from this blog post?

  1. This blog post shows one version of a typical data science workflow.
  2. It will cover how you can use both Python and R in your data workflows.
  3. It shows how both languages can complement each other to successfully perform your daily tasks.
  4. If you know one of the two languages and are new to the other, this post will also help you compare how each step is performed in either language.

Throughout the post, we follow this workflow:

[Image: the data science workflow, from question to data access, wrangling, data viz, ML & stats, and communication]

Where the snake appears, we will be using Python; the letter R stands for examples where we use R.

It covers the following steps:

  • Logistics 👷‍♀️
  • Question 🙋‍♀️
  • Data access 📖
  • EDA & data wrangling 🛠️
  • Visualization 👩‍🎨

Logistics 👷‍♀️

Before we get started, we need to make sure that all required libraries are installed.

The standard package index for Python is called PyPI; for R, it is CRAN.

  • Python 🐍
  • R 🔵

For Python, we first install the packages using pip. If you’re working in a Jupyter notebook, this can happen directly in a separate code chunk using the following code:

!pip install sweetviz
!pip install pandas
!pip install rpy2==3.5.1
!pip install countrycode
!pip install plotnine
!pip install patchworklib
!pip install ydata-profiling

Once installed, we load the packages in Python:

import pandas as pd
import sweetviz as sv
import patchworklib as pw
from ydata_profiling import ProfileReport
from countrycode import countrycode
from plotnine import *

There are different ways to use R and Python in one project. We are using rpy2 here, which allows us to run R code chunks inside Jupyter notebooks. Alternatively, you can turn to Quarto, which you can run in your local IDE (for instance RStudio Desktop or VS Code). Quarto is great - I also used it to generate the slides that accompanied the workshop.

%load_ext rpy2.ipython

If we now call R inside the Jupyter notebook, we always have to put %%R as cell magic at the beginning of the code chunk. We’ll do this here (but leave it out for the rest of the post to increase readability). As a side note: within the Jupyter notebook, you can not only use the languages in separate code chunks but also use objects generated by Python in R (and vice versa). For this, all you have to do is put %%R -i object_name at the beginning of the chunk. object_name is replaced by the name of your object (for instance df_python), which is then ready for you to use in R.
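
To make this concrete, here is a minimal sketch of that handover (it assumes a pandas data frame called df_python, which we will only create further below):

%%R -i df_python
# `df_python` was created in Python and is now
# available as a data frame in R
head(df_python)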

But back to installing libraries: similar to Python, some packages come pre-installed, so we don’t need to install all of them here.

%%R
install.packages("countrycode")
install.packages("skimr")
install.packages("patchwork")

We can now also load the packages:

%%R
# Package for data wrangling
library(dplyr)
# Package for exploratory data analysis
library(skimr)
# Package for converting country codes 
# (and also identify continents)
library(countrycode)
# Package for visualization
library(ggplot2)
# Package that allows you to arrange
# multiple plots the way you want
library(patchwork)

Question 🙋‍♀️

Every data science task usually begins with a question, which may be revised and adjusted throughout the process. To showcase the data science process, we will work with the following question:

Are there differences across European countries when it comes to requesting mental health treatment?


Data access 📖

Green snake sitting on a book, wearing glasses and holding a light bulb.

To study the question, we work with data from a Mental Health Survey in Tech, provided by Kaggle. Before we dive into the data, we first load it. There are ways to directly access a Kaggle dataset in Google Colab using access tokens, but we will go another way and load it as if it were stored locally. This also helps us understand how we would load data from our own machines.
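
For the sake of completeness, the token-based route could look roughly like this (a sketch, not used here: it assumes the Kaggle CLI is installed and a kaggle.json API token is configured, and the dataset slug is our assumption):

# Download the dataset via the Kaggle CLI (token-based alternative)
!kaggle datasets download -d osmi/mental-health-in-tech-survey --unzip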

  • Python 🐍
  • R 🔵

We store the URL of the data in an object called path. Next, we use pandas’ read_csv function to open the data and store it in df_python.

path = "https://raw.githubusercontent.com/cosimameyer/r-python-talk/main/data/survey.csv"

df_python = pd.read_csv(path)

We can also use R to load the data. Multiple options are possible here; we will use read.csv from base R. If you prefer the tidyverse, you can also use readr::read_csv. We store the data frame in df_r.

path <- "https://raw.githubusercontent.com/cosimameyer/r-python-talk/main/data/survey.csv"

df_r <- read.csv(path)

EDA & data wrangling 🛠️

Exploratory data analysis with built-in functions 🕵‍♀️

Green snake holding up tools (pliers, hammer and wrench) in its tail.

Now that we have loaded our data, we can look at it. This step is also called exploratory data analysis (or EDA). It is an essential step that doesn’t only happen at the beginning of a data analysis - you will often come back to it throughout the data science process.

If you click on the Kaggle link, you will learn more about the data themselves, including a list of all variables and a data explorer. This already gives you a first overview of the data.

Both Python and R also have built-in functionality to get a first understanding of your data.

  • Python 🐍
  • R 🔵

In Python, we can use shape. This way, we access the dimensions of the data frame. This gives us a good understanding of the number of rows (1,259) and columns (27).

df_python.shape
(1259, 27)

In your own notebook, the next step would be to “print” the head of the data - that means, to look at the first lines of the data frame:

df_python.head()

[Screenshot: the first five rows of df_python]

Next, we use df_python.info() to get a general overview of the dataset:

df_python.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1259 entries, 0 to 1258
Data columns (total 27 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   Timestamp                  1259 non-null   object
 1   Age                        1259 non-null   int64 
 2   Gender                     1259 non-null   object
 3   Country                    1259 non-null   object
 4   state                      744 non-null    object
 5   self_employed              1241 non-null   object
 6   family_history             1259 non-null   object
 7   treatment                  1259 non-null   object
 8   work_interfere             995 non-null    object
 9   no_employees               1259 non-null   object
 10  remote_work                1259 non-null   object
 11  tech_company               1259 non-null   object
 12  benefits                   1259 non-null   object
 13  care_options               1259 non-null   object
 14  wellness_program           1259 non-null   object
 15  seek_help                  1259 non-null   object
 16  anonymity                  1259 non-null   object
 17  leave                      1259 non-null   object
 18  mental_health_consequence  1259 non-null   object
 19  phys_health_consequence    1259 non-null   object
 20  coworkers                  1259 non-null   object
 21  supervisor                 1259 non-null   object
 22  mental_health_interview    1259 non-null   object
 23  phys_health_interview      1259 non-null   object
 24  mental_vs_physical         1259 non-null   object
 25  obs_consequence            1259 non-null   object
 26  comments                   164 non-null    object
dtypes: int64(1), object(26)
memory usage: 265.7+ KB

This tells us a lot about the data. We see the column names (= our features/variables), whether there are NAs (Non-Null Count), and we also learn more about the data types (Dtype).

We can, of course ☺️, also do these steps in R. To get the number of rows and columns, we call dim():

dim(df_r)
[1] 1259   27

To print the head, R has head():

head(df_r)
            Timestamp Age Gender        Country state self_employed
1 2014-08-27 11:29:31  37 Female  United States    IL          <NA>
2 2014-08-27 11:29:37  44      M  United States    IN          <NA>
3 2014-08-27 11:29:44  32   Male         Canada  <NA>          <NA>
4 2014-08-27 11:29:46  31   Male United Kingdom  <NA>          <NA>
5 2014-08-27 11:30:22  31   Male  United States    TX          <NA>
6 2014-08-27 11:31:22  33   Male  United States    TN          <NA>
  family_history treatment work_interfere   no_employees remote_work
1             No       Yes          Often           6-25          No
2             No        No         Rarely More than 1000          No
3             No        No         Rarely           6-25          No
4            Yes       Yes          Often         26-100          No
5             No        No          Never        100-500         Yes
6            Yes        No      Sometimes           6-25          No
  tech_company   benefits care_options wellness_program  seek_help  anonymity
1          Yes        Yes     Not sure               No        Yes        Yes
2           No Don't know           No       Don't know Don't know Don't know
3          Yes         No           No               No         No Don't know
4          Yes         No          Yes               No         No         No
5          Yes        Yes           No       Don't know Don't know Don't know
6          Yes        Yes     Not sure               No Don't know Don't know
               leave mental_health_consequence phys_health_consequence
1      Somewhat easy                        No                      No
2         Don't know                     Maybe                      No
3 Somewhat difficult                        No                      No
4 Somewhat difficult                       Yes                     Yes
5         Don't know                        No                      No
6         Don't know                        No                      No
     coworkers supervisor mental_health_interview phys_health_interview
1 Some of them        Yes                      No                 Maybe
2           No         No                      No                    No
3          Yes        Yes                     Yes                   Yes
4 Some of them         No                   Maybe                 Maybe
5 Some of them        Yes                     Yes                   Yes
6          Yes        Yes                      No                 Maybe
  mental_vs_physical obs_consequence comments
1                Yes              No     <NA>
2         Don't know              No     <NA>
3                 No              No     <NA>
4                 No             Yes     <NA>
5         Don't know              No     <NA>
6         Don't know              No     <NA>

str() and summary() help us to understand the general structure of our data.

str(df_r)
'data.frame':	1259 obs. of  27 variables:
 $ Timestamp                : chr  "2014-08-27 11:29:31" "2014-08-27 11:29:37" "2014-08-27 11:29:44" "2014-08-27 11:29:46" ...
 $ Age                      : num  37 44 32 31 31 33 35 39 42 23 ...
 $ Gender                   : chr  "Female" "M" "Male" "Male" ...
 $ Country                  : chr  "United States" "United States" "Canada" "United Kingdom" ...
 $ state                    : chr  "IL" "IN" NA NA ...
 $ self_employed            : chr  NA NA NA NA ...
 $ family_history           : chr  "No" "No" "No" "Yes" ...
 $ treatment                : chr  "Yes" "No" "No" "Yes" ...
 $ work_interfere           : chr  "Often" "Rarely" "Rarely" "Often" ...
 $ no_employees             : chr  "6-25" "More than 1000" "6-25" "26-100" ...
 $ remote_work              : chr  "No" "No" "No" "No" ...
 $ tech_company             : chr  "Yes" "No" "Yes" "Yes" ...
 $ benefits                 : chr  "Yes" "Don't know" "No" "No" ...
 $ care_options             : chr  "Not sure" "No" "No" "Yes" ...
 $ wellness_program         : chr  "No" "Don't know" "No" "No" ...
 $ seek_help                : chr  "Yes" "Don't know" "No" "No" ...
 $ anonymity                : chr  "Yes" "Don't know" "Don't know" "No" ...
 $ leave                    : chr  "Somewhat easy" "Don't know" "Somewhat difficult" "Somewhat difficult" ...
 $ mental_health_consequence: chr  "No" "Maybe" "No" "Yes" ...
 $ phys_health_consequence  : chr  "No" "No" "No" "Yes" ...
 $ coworkers                : chr  "Some of them" "No" "Yes" "Some of them" ...
 $ supervisor               : chr  "Yes" "No" "Yes" "No" ...
 $ mental_health_interview  : chr  "No" "No" "Yes" "Maybe" ...
 $ phys_health_interview    : chr  "Maybe" "No" "Yes" "Maybe" ...
 $ mental_vs_physical       : chr  "Yes" "Don't know" "No" "No" ...
 $ obs_consequence          : chr  "No" "No" "No" "Yes" ...
 $ comments                 : chr  NA NA NA NA ...

And the summary:

summary(df_r)
Timestamp              Age                Gender            Country         
 Length:1259        Min.   :-1.726e+03   Length:1259        Length:1259       
 Class :character   1st Qu.: 2.700e+01   Class :character   Class :character  
 Mode  :character   Median : 3.100e+01   Mode  :character   Mode  :character  
                    Mean   : 7.943e+07                                        
                    3rd Qu.: 3.600e+01                                        
                    Max.   : 1.000e+11                                        
    state           self_employed      family_history      treatment        
 Length:1259        Length:1259        Length:1259        Length:1259       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
 work_interfere     no_employees       remote_work        tech_company      
 Length:1259        Length:1259        Length:1259        Length:1259       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
   benefits         care_options       wellness_program    seek_help        
 Length:1259        Length:1259        Length:1259        Length:1259       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
  anonymity            leave           mental_health_consequence
 Length:1259        Length:1259        Length:1259              
 Class :character   Class :character   Class :character         
 Mode  :character   Mode  :character   Mode  :character         
                                                                
                                                                
                                                                
 phys_health_consequence  coworkers          supervisor       
 Length:1259             Length:1259        Length:1259       
 Class :character        Class :character   Class :character  
 Mode  :character        Mode  :character   Mode  :character  
                                                              
                                                              
                                                              
 mental_health_interview phys_health_interview mental_vs_physical
 Length:1259             Length:1259           Length:1259       
 Class :character        Class :character      Class :character  
 Mode  :character        Mode  :character      Mode  :character  
                                                                 
                                                                 
                                                                 
 obs_consequence      comments        
 Length:1259        Length:1259       
 Class :character   Class :character  
 Mode  :character   Mode  :character  

Exploratory data analysis with external libraries 🔎

While the built-in functions are already an excellent starting point, both languages have more libraries that help you get even more out of your data.

  • Python 🐍
  • R 🔵

To get an even better idea of the data (with more information and also some visualizations), we use a small (but powerful) package inside the notebook. It’s called sweetviz and it generates nice EDA reports with just two lines of code!

At the top, we see a general overview of the data frame (including rows, duplicates, the number of variables (= features), the distribution of the variable types (categorical, numerical, and text), and the size of the dataset). If you then scroll down, you will see a visual representation of each variable, including its distribution, the number of distinct values, as well as missing values. If you click on a single tab, it will expand and give you even more information.

This is an excellent starting point for every data analysis.

report = sv.analyze(df_python)
report.show_notebook(layout="vertical", w=800, h=700, scale=0.8)

[Screenshot: the sweetviz report rendered inside the notebook]

To access the full functionality, try it out yourself using the Jupyter notebook.

There are more libraries like this. If you are curious, you can also try autoviz, pandas-profiling (continued as ydata-profiling, which we installed above) or dtale.
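
Since we already installed and imported ydata-profiling at the beginning, here is a minimal sketch of how a similar report could be generated with it (the title string is arbitrary):

# Generate an EDA report with ydata-profiling (imported above)
profile = ProfileReport(df_python, title="Mental Health Survey EDA")
# Render the report inside the notebook
profile.to_notebook_iframe()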

The built-in R functions, however, don’t provide information on the amount of missing data. Here, the library skimr can help: it provides a more detailed output. If you want to know what else is out there for exploratory data analysis in R, have a look at a recent publication that compares more packages.

skimr::skim(df_r)
── Data Summary ────────────────────────
                           Values
Name                       df_r  
Number of rows             1259  
Number of columns          27    
_______________________          
Column type frequency:           
  character                26    
  numeric                  1     
________________________         
Group variables            None  

── Variable type: character ────────────────────────────────────────────────────
   skim_variable             n_missing complete_rate min  max empty n_unique
 1 Timestamp                         0         1      19   19     0     1246
 2 Gender                            0         1       1   46     0       49
 3 Country                           0         1       5   22     0       48
 4 state                           515         0.591   2    2     0       45
 5 self_employed                    18         0.986   2    3     0        2
 6 family_history                    0         1       2    3     0        2
 7 treatment                         0         1       2    3     0        2
 8 work_interfere                  264         0.790   5    9     0        4
 9 no_employees                      0         1       3   14     0        6
10 remote_work                       0         1       2    3     0        2
11 tech_company                      0         1       2    3     0        2
12 benefits                          0         1       2   10     0        3
13 care_options                      0         1       2    8     0        3
14 wellness_program                  0         1       2   10     0        3
15 seek_help                         0         1       2   10     0        3
16 anonymity                         0         1       2   10     0        3
17 leave                             0         1       9   18     0        5
18 mental_health_consequence         0         1       2    5     0        3
19 phys_health_consequence           0         1       2    5     0        3
20 coworkers                         0         1       2   12     0        3
21 supervisor                        0         1       2   12     0        3
22 mental_health_interview           0         1       2    5     0        3
23 phys_health_interview             0         1       2    5     0        3
24 mental_vs_physical                0         1       2   10     0        3
25 obs_consequence                   0         1       2    3     0        2
26 comments                       1095         0.130   1 3548     0      160
   whitespace
 1          0
 2          0
 3          0
 4          0
 5          0
 6          0
 7          0
 8          0
 9          0
10          0
11          0
12          0
13          0
14          0
15          0
16          0
17          0
18          0
19          0
20          0
21          0
22          0
23          0
24          0
25          0
26          1

── Variable type: numeric ──────────────────────────────────────────────────────
  skim_variable n_missing complete_rate      mean          sd    p0 p25 p50 p75
1 Age                   0             1 79428148. 2818299443. -1726  27  31  36
         p100 hist 
1 99999999999 ▇▁▁▁▁

Once we have an idea of the data, we continue and subset the data frame to the variables that we need - in our case Country and treatment. Looking at the Kaggle codebook, we learn that:

  • Country stands for the country the respondent lives in
  • treatment provides the answer to the question “Have you sought treatment for a mental health condition?” (see the quick check below)
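
As a quick sanity check (a minimal sketch), we can look at how the answers to treatment are distributed before subsetting:

# How are the answers to `treatment` distributed overall?
df_python['treatment'].value_counts()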

Data wrangling 👩‍💻

  • Python 🐍
  • R 🔵

To wrangle the data, we again rely on pandas - a go-to library that makes working with data easy.

# Subset to the two columns that we need; .copy() avoids a
# SettingWithCopyWarning when we add columns below
df_python_small = df_python[['Country', 'treatment']].copy()

# We then again print the head to understand the changes 
# in our data frame
df_python_small.head()                
          Country treatment
0   United States       Yes
1   United States        No
2          Canada        No
3  United Kingdom       Yes
4   United States        No

We see that only the columns Country and treatment remain. Since we are only interested in countries in Europe, we need a way to select these countries. It would be possible to compile a list of European countries ourselves - but that would take up much of our time. Luckily, other people thought the same and developed a tool that does the work for us. The Python library countrycode is a wrapper for the R library of the same name. While the syntax is slightly different, the logic remains the same:

# Get the region
df_python_small['region'] = countrycode.countrycode(
    codes=df_python_small['Country'], 
    origin='country_name', 
    target='region'
)

# Get an ISO3 country code
df_python_small['country'] = countrycode.countrycode(
    codes=df_python_small['Country'], 
    origin='country_name', 
    target='iso3c'
)

# Print the head again
df_python_small.head()
          Country treatment            region country
0   United States       Yes  Northern America     USA
1   United States        No  Northern America     USA
2          Canada        No  Northern America     CAN
3  United Kingdom       Yes   Northern Europe     GBR
4   United States        No  Northern America     USA

Next, we subset the data and keep only those countries that are located in Europe.

# For this, we define the range of the regions
region_value = ["Northern Europe", "Western Europe", 
                "Eastern Europe", "Southern Europe"]

# Use the information from `region_value` and keep only those 
# countries where the region is in `region_value`
df_python_small2 = df_python_small[
    df_python_small['region'].isin(region_value)
]

# Last but not least, we again print the head:
df_python_small2.head()             
           Country treatment           region country
3   United Kingdom       Yes  Northern Europe     GBR
11        Bulgaria        No   Eastern Europe     BGR
16  United Kingdom       Yes  Northern Europe     GBR
19          France        No   Western Europe     FRA
29  United Kingdom        No  Northern Europe     GBR

Next, we count the treatment occurrences (“Yes” vs. “No”) by country and store the result in df_python_clean.

# Count the distinct treatment answer by country
df_python_clean = df_python_small2.value_counts(
    ['country', 'treatment']
).reset_index(name='n')

# Print the head
df_python_clean.head()
  country treatment   n
0     GBR       Yes  93
1     GBR        No  92
2     DEU        No  24
3     DEU       Yes  21
4     NLD        No  18

We then calculate the sum of n (the count that we generated above) by country. We will use this information later to calculate the percentage share.

# Calculate the total answers by country 
# (irrespective of the distinct answer)
df_python_clean2 = (
    df_python_clean
    .groupby('country')['n']
    .sum()
    .reset_index(name='total')
)

# Print the head
df_python_clean2.head()
  country  total
0     AUT      3
1     BEL      6
2     BGR      4
3     BIH      1
4     CHE      7

We now have two data frames (df_python_clean and df_python_clean2). To bring them back together, we use a merge.

# Merge df_python_clean2 and df_python_clean
df_python_cleaned = df_python_clean2.merge(
    df_python_clean, 
    on='country'
)

# Print the head
df_python_cleaned.head()
  country  total treatment  n
0     AUT      3        No  3
1     BEL      6        No  5
2     BEL      6       Yes  1
3     BGR      4        No  2
4     BGR      4       Yes  2

In a last step, we calculate the percentage share.

# Calculate the percentage share
df_python_cleaned['percent'] = round(
    (df_python_cleaned['n'] / df_python_cleaned['total']) * 100, 
    2
)

# Print the head
df_python_cleaned.head()
  country  total treatment  n  percent
0     AUT      3        No  3   100.00
1     BEL      6        No  5    83.33
2     BEL      6       Yes  1    16.67
3     BGR      4        No  2    50.00
4     BGR      4       Yes  2    50.00

While we executed the steps one by one here, we can also chain the methods and execute everything in one go:

# Define regions
region_value = ["Northern Europe", "Western Europe", 
                "Eastern Europe", "Southern Europe"]

# Execute the chain
df_clean1 = (
    df_python[['Country', 'treatment']]
    .assign(
        region=lambda d: countrycode.countrycode(
            codes=d['Country'], 
            origin='country_name', 
            target='region'
        )
    )
    .assign(
        country=lambda d: countrycode.countrycode(
            codes=d['Country'], 
            origin='country_name', 
            target='iso3c'
        )
    )
    .loc[lambda x: x['region'].isin(region_value)]
    .value_counts(['country', 'treatment'])
    .reset_index(name='n')
)

# Calculate the total answers by country 
# (irrespective of the distinct answer)
df_clean2 = (
    df_clean1
    .groupby('country')['n']
    .sum()
    .reset_index(name='total')
)

# Merge df_clean2 and df_clean1
df_python_cleaned = df_clean2.merge(df_clean1, on='country')

# Calculate the percentage share
df_python_cleaned['percent'] = round(
    (df_python_cleaned['n'] / df_python_cleaned['total']) * 100, 
    2
)

# And again print the head
df_python_cleaned.head()
  country  total treatment  n  percent
0     AUT      3        No  3   100.00
1     BEL      6        No  5    83.33
2     BEL      6       Yes  1    16.67
3     BGR      4        No  2    50.00
4     BGR      4       Yes  2    50.00

Next, we perform the same data wrangling in R. The term data wrangling is rather broad and can include various things, depending on the use case and the goal of your wrangling: it can mean pre-processing text data, or reshaping the data completely to make them fit your purpose. We’ll repeat the steps that we performed in Python and create the df_r_clean dataset. For this, we first create new country codes using the countrycode package, then filter for the required regions (based on the World Bank classification). We then select country and treatment, group by these variables, and generate a count; afterwards we group again, this time only by country. In the last step, we generate the percentage and store it in percent.

To do this, the package dplyr is a gem! It makes data wrangling easy and allows you to write easily readable code. You can see an example below.

R has a special operator called a pipe (%>%). It comes from the package magrittr, which is a dependency of dplyr.

df_r_clean <- df_r %>%
  # Generate the ISO3 country name (`country`)
  # and the region (`region`)
  dplyr::mutate(
    country = countrycode::countrycode(
      Country, 'country.name', 'iso3c'
    ),
    region = countrycode::countrycode(
      Country, 'country.name', 'region23'
    )
  ) %>%
  # We now filter for countries in Europe
  dplyr::filter(
    region %in% c(
      "Northern Europe", "Western Europe",
      "Eastern Europe", "Southern Europe"
    )
  ) %>%
  # Select the required variables
  dplyr::select(country, treatment) %>%
  # Group by them...
  dplyr::group_by(country, treatment) %>%
  # ... and generate the count
  count() %>%
  # Now we group again by country
  dplyr::group_by(country) %>%
  # Calculate the sum of the count (`n`) by country
  # and use this information to calculate the
  # percentage share for each category by country
  dplyr::mutate(
    total = sum(n),
    percent = (n / total) * 100
  )

df_r_clean
# A tibble: 39 × 5
# Groups:   country [27]
   country treatment     n total percent
   <chr>   <chr>     <int> <int>   <dbl>
 1 AUT     No            3     3   100  
 2 BEL     No            5     6    83.3
 3 BEL     Yes           1     6    16.7
 4 BGR     No            2     4    50  
 5 BGR     Yes           2     4    50  
 6 BIH     No            1     1   100  
 7 CHE     No            4     7    57.1
 8 CHE     Yes           3     7    42.9
 9 CZE     No            1     1   100  
10 DEU     No           24    45    53.3
# … with 29 more rows
# ℹ Use `print(n = ...)` to see more rows

Printing the data set again, we see that percentages can often be misleading if we don’t know the total count by country. This is important information that we need to keep in mind when interpreting the data later.
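
As a small sketch of how we could guard against this, we can list the countries with very few respondents before interpreting the percentages (the threshold of 5 total answers is an arbitrary assumption):

# List countries with very few respondents
# (the threshold of 5 is an arbitrary assumption)
df_r_clean %>%
  dplyr::filter(total < 5) %>%
  dplyr::distinct(country, total)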


Visualization 👩‍🎨

A blue R with color spots and an artist hat on the head.

As you have already seen, both R and Python are extremely versatile, and both are good candidates for these tasks. It often comes down to personal taste, what your collaborators use, and of course a certain path dependency - be it which language you learned first, what the company or team you work with prefers to use, etc.

For visualization inspiration, my go-to resource is data-to-viz.com. It’s a comprehensive website with an easily accessible overview that provides code snippets in both R and Python.

We’ll switch the order here and start with R to visualize the data. If you then switch to the Python tab, you’ll see how to mimic these steps in Python using the plotnine library as well as Python’s equivalent to patchwork, called patchworklib.

  • R 🔵
  • Python 🐍

When it comes to visualizations, my personal preference is ggplot2 (from R). Preferences may of course differ, depending on who you ask 🤓

To visualize our data, we create a function. This is not required - we could also copy-paste the code as often as we want - but writing functions increases replicability and is good practice 👍

Copy-pasting your code increases the chance of running into “copy-paste” errors and (believe me) this can be really frustrating to maintain and debug 👎

While it may seem intimidating at first, it’s often not too difficult and a great asset in the long-run!

generate_plot <- function(df, value, region_value) {
  
  plot <- df %>%
    # Generate the region again (lost during wrangling)
    dplyr::mutate(
      region = countrycode::countrycode(
        country, 'iso3c', 'region23'
      )
    ) %>%
    # Filter by region_value and keep only "Yes"
    dplyr::filter(
      region == !! ensym(region_value),
      !! ensym(value) == "Yes"
    ) %>%
    # Now the plotting magic begins 🔮
    ggplot2::ggplot(aes(x = country, y = percent)) +
    ggplot2::geom_segment(
      aes(
        x = reorder(country, percent),
        xend = country, y = 0, yend = percent
      ),
      color = "#a7a9ac"
    ) +
    ggplot2::geom_point(
      color = "#88398a", size = 4, alpha = 0.8
    ) +
    # Flip coordinates
    ggplot2::coord_flip() +
    # Apply a minimalist theme
    ggplot2::theme_minimal() +
    ggplot2::theme(
      plot.title = element_blank(),
      axis.title.x = element_blank(),
      axis.title.y = element_blank()
    )
  
  return(plot)
}

In the next step, we then iterate over the regions in Europe (“Western Europe”, “Eastern Europe”, “Southern Europe”, and “Northern Europe”) and apply our function generate_plot:

# Store the regions in the object `regions`
regions <- c("Western Europe", "Eastern Europe", 
             "Southern Europe", "Northern Europe")

# Iterate over regions and generate plots
for (region_name in regions) {
    # This allows us to generate a short version of 
    # the regions name
    name <- gsub("([A-Za-z]+).*", "\\1", tolower(region_name))
    # Generate the plot and store it in an object 
    # that is called "western",
    # "eastern", "southern", and "northern"
    assign(name,
           generate_plot(df_r_clean, value = treatment,
                         region_value = region_name))
}

We now have four objects (called northern, southern, western, and eastern). We could call each object and look at the plot separately:

northern

[Plot: the lollipop chart for Northern Europe]

But this way it’s not easy to compare them - and wouldn’t it be great to have them all in one plot? This is where patchwork comes in. patchwork is a fantastic package that allows you to combine single plots in a very convenient way. Here’s one example:

# We combine the plots
(northern + southern) / (western + eastern) +
# And add an annotation
plot_annotation(title = 'Requesting mental health treatment',
                subtitle = 'Percentage of persons having sought mental health treatment',
                caption = 'Data reference: Kaggle')

[Plot: the four regional panels combined with patchwork, titled ‘Requesting mental health treatment’]

Python also offers visualization libraries. The most famous are matplotlib and seaborn (which builds on top of matplotlib). If you are looking for a ggplot2 equivalent, plotnine could become your new best friend!

# Similar to R, we add the region
df_python_cleaned['region'] = countrycode.countrycode(codes=df_python_cleaned['country'],
                                                      origin='iso3c',
                                                      target='region')

# Keep only those where the answer is "Yes"
df_python_reduced = df_python_cleaned[df_python_cleaned['treatment'] == "Yes"] \
                    [['country', 'percent', 'region']]

# Print the head of the data
df_python_reduced.head()
   country  percent           region
2      BEL    16.67   Western Europe
4      BGR    50.00   Eastern Europe
7      CHE    42.86   Western Europe
10     DEU    46.67   Western Europe
11     DNK   100.00  Northern Europe

Next, we finally create the plots:

# And now the plotting magic begins 🔮
plot = {}

for region in region_value:
    # Subset data to the required region only
    df = df_python_reduced[df_python_reduced['region'] == region]
    # Plot the data and store the plot into a dictionary
    plot[region] = (ggplot(df, aes(x='country',y='percent'))+
    geom_segment(mapping=aes(x='reorder(country,percent)',xend='country', y=0, yend='percent'),
                  color="#a7a9ac")+
    geom_point(color="#88398a", size=4, alpha=0.8)+
    # Flip the coordinates
    coord_flip()+
    # And twist the theme a bit
    theme_minimal()+
    labs(y="", x="")
    )

And here patchworklib (a Python port of the patchwork package that we used in R) helps a lot.

g1 = pw.load_ggplot(plot['Northern Europe'], figsize=(2,2))
g2 = pw.load_ggplot(plot['Southern Europe'], figsize=(2,2))
g3 = pw.load_ggplot(plot['Western Europe'], figsize=(2,2))
g4 = pw.load_ggplot(plot['Eastern Europe'], figsize=(2,2))

(g1 | g2) / (g3 | g4)

[Plot: the four regional panels combined with patchworklib]

Unlike with the original patchwork package in R, we cannot (yet) easily add and define titles, subtitles, and captions here. But that’s often the case when we use a port of a library.
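
One possible workaround (a sketch, not part of the original workshop) is to give each plotnine panel its own title via labs() before loading it into patchworklib:

# Workaround sketch: title each panel via plotnine's labs()
# before handing it to patchworklib
g1 = pw.load_ggplot(
    plot['Northern Europe'] + labs(title='Northern Europe'),
    figsize=(2, 2)
)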

While the plots may look good, their function is more important 🤓 What did we learn about the data, from the data wrangling process to plotting the visualization?

Let us summarize what we know:

  • We only included European countries
  • The sample size is limited (and skewed). Some countries definitely have more observations than others.
  • We further filtered only for “Yes” (this means we naturally introduce a bias in our visualization). An alternative may be to also include the “No” answers - but this still doesn’t solve the “small and skewed sample” dilemma.

And here the “data science circle” starts - we are now asked to think about possible solutions, make different decisions based on our newly gained knowledge, and (possibly) re-iterate the steps made above. Data science processes are often not linear - but that’s what we already discussed at the beginning ✨

[Image: two arrows, one straight and one going in circles; each starts at a question mark and ends at sparkling elements symbolizing the ‘magic’ output]


More resources