The following blog post is based on a workshop that I delivered as part of Women in Data Science 2023.
To access additional material from the workshop, please check the GitHub repository and run the code yourself.
You can also find the slide deck to flip through here:
What can you take away from this blog post?
Throughout the post, we follow this workflow:
[Image: the data science workflow, from question to data access, data wrangling, data visualization, and ML & stats, through to communication]
Where the snake appears, we will be using Python; the letter R stands for examples where we use R.
It covers the following steps:
Before we get started, we need to make sure that all required libraries are installed.
The standard package index in Python is called PyPI, and in R it is CRAN.
For Python, we first install the packages using pip. If you’re working in a Jupyter notebook, this can happen directly in a separate code chunk using the following code:
!pip install sweetviz
!pip install pandas
!pip install rpy2==3.5.1
!pip install countrycode
!pip install plotnine
!pip install patchworklib
!pip install ydata-profiling
Once installed, we load the packages in Python:
import pandas as pd
import sweetviz as sv
import patchworklib as pw
from ydata_profiling import ProfileReport
from countrycode import countrycode
from plotnine import *
There are different ways to use R and Python in one project. We are using rpy2 here, which allows us to run R code chunks in Jupyter notebooks. Alternatively, you can also turn to Quarto, which you can run in your local IDE (for instance in RStudio Desktop or VS Code). Quarto is great - I also used it to generate the slides that accompanied the workshop.
%load_ext rpy2.ipython
If we are calling R inside the Jupyter notebook, we always have to put %%R as cell magic at the beginning of the code chunk. We’ll do this here (but leave it out for the rest of the post to increase readability). As a side note: within the Jupyter notebook, you can not only use the languages in separate code chunks but also use objects generated by Python in R (and vice versa). For this, all you have to do is put %%R -i object_name at the beginning of the chunk. object_name will be replaced by the object (and could be df_python), which is then ready for you to use in R.
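For illustration, here is a minimal sketch of such a cell, assuming a pandas data frame called df_python (which we will create further below) already exists in the Python session:

%%R -i df_python
# The pandas data frame df_python is converted by rpy2 and
# is available here as a regular R data frame
head(df_python)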
But back to installing libraries: similar to Python, some packages come pre-installed, so we don’t need to install them here and only add the missing ones.
%%R
install.packages("countrycode")
install.packages("skimr")
install.packages("patchwork")
We can now also load the packages:
%%R
# Package for data wrangling
library(dplyr)
# Package for exploratory data analysis
library(skimr)
# Package for converting country codes
# (and also identify continents)
library(countrycode)
# Package for visualization
library(ggplot2)
# Package that allows you to arrange
# multiple plots the way you want
library(patchwork)
At the very beginning of every data science task usually stands a question. The question may be revised and adjusted throughout the process. To showcase the data science process, we will work with the following question:
Are there differences across European countries when it comes to requesting mental health treatment?
To study the question, we work with data from a Mental Health Survey in Tech, provided by Kaggle. Before we dive into the data, we first load it. There are ways to directly access a Kaggle dataset in Google Colab using access tokens, but we will go another way and load it as if it were on our local machines. This also helps us understand how we would load data from our local machines.
We store the URL of the data in an object called path. In the next step, we use pandas’ read_csv function to open the data and store it in df_python.
path = "https://raw.githubusercontent.com/cosimameyer/r-python-talk/main/data/survey.csv"
df_python = pd.read_csv(path)
We can also use R to load the data. Here, multiple options are possible. We will be using read.csv from base R. If you prefer the tidyverse, you can also use readr::read_csv. We store the data frame in df_r.
path <- "https://raw.githubusercontent.com/cosimameyer/r-python-talk/main/data/survey.csv"
df_r <- read.csv(path)
Now that we have loaded our data, we can look at it. This step is also called exploratory data analysis (or EDA). It is an essential step that doesn’t only happen at the beginning of every data analysis; you will often come back to it throughout the data science process.
If you click on the Kaggle link, you will learn more about the data themselves, including a list of all variables and a data explorer. This already gives you a first overview of the data.
Both Python and R also have built-in functionality to get a first understanding of your data.
In Python, we can use shape. This way, we access the dimensions of the data frame. This gives us a good understanding of the number of rows (1,259) and columns (27).
df_python.shape
(1259, 27)
In your own notebook, a next step would be to “print” the head of the data - that means to look at the first lines of the data frame:
df_python.head()
In the next step, we use df_python.info() to get a general overview of the dataset:
df_python.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1259 entries, 0 to 1258
Data columns (total 27 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Timestamp 1259 non-null object
1 Age 1259 non-null int64
2 Gender 1259 non-null object
3 Country 1259 non-null object
4 state 744 non-null object
5 self_employed 1241 non-null object
6 family_history 1259 non-null object
7 treatment 1259 non-null object
8 work_interfere 995 non-null object
9 no_employees 1259 non-null object
10 remote_work 1259 non-null object
11 tech_company 1259 non-null object
12 benefits 1259 non-null object
13 care_options 1259 non-null object
14 wellness_program 1259 non-null object
15 seek_help 1259 non-null object
16 anonymity 1259 non-null object
17 leave 1259 non-null object
18 mental_health_consequence 1259 non-null object
19 phys_health_consequence 1259 non-null object
20 coworkers 1259 non-null object
21 supervisor 1259 non-null object
22 mental_health_interview 1259 non-null object
23 phys_health_interview 1259 non-null object
24 mental_vs_physical 1259 non-null object
25 obs_consequence 1259 non-null object
26 comments 164 non-null object
dtypes: int64(1), object(26)
memory usage: 265.7+ KB
This tells us a lot about the data. We see the column names (= our features/variables), whether there are NAs (Non-Null Count), and we also learn more about the data type (Dtype).
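If you want the number of missing values per column directly, a single pandas call gives you that; a quick sketch based on the data frame above:

# Count missing values per column
# (for example, comments has 1259 - 164 = 1095 missing entries)
df_python.isna().sum()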
We can, of course ☺️, also do these steps in R; there are similar approaches in the R universe. To get the number of rows and columns, we call dim():
dim(df_r)
[1] 1259 27
To print the head, R has head():
head(df_r)
Timestamp Age Gender Country state self_employed
1 2014-08-27 11:29:31 37 Female United States IL <NA>
2 2014-08-27 11:29:37 44 M United States IN <NA>
3 2014-08-27 11:29:44 32 Male Canada <NA> <NA>
4 2014-08-27 11:29:46 31 Male United Kingdom <NA> <NA>
5 2014-08-27 11:30:22 31 Male United States TX <NA>
6 2014-08-27 11:31:22 33 Male United States TN <NA>
family_history treatment work_interfere no_employees remote_work
1 No Yes Often 6-25 No
2 No No Rarely More than 1000 No
3 No No Rarely 6-25 No
4 Yes Yes Often 26-100 No
5 No No Never 100-500 Yes
6 Yes No Sometimes 6-25 No
tech_company benefits care_options wellness_program seek_help anonymity
1 Yes Yes Not sure No Yes Yes
2 No Don't know No Don't know Don't know Don't know
3 Yes No No No No Don't know
4 Yes No Yes No No No
5 Yes Yes No Don't know Don't know Don't know
6 Yes Yes Not sure No Don't know Don't know
leave mental_health_consequence phys_health_consequence
1 Somewhat easy No No
2 Don't know Maybe No
3 Somewhat difficult No No
4 Somewhat difficult Yes Yes
5 Don't know No No
6 Don't know No No
coworkers supervisor mental_health_interview phys_health_interview
1 Some of them Yes No Maybe
2 No No No No
3 Yes Yes Yes Yes
4 Some of them No Maybe Maybe
5 Some of them Yes Yes Yes
6 Yes Yes No Maybe
mental_vs_physical obs_consequence comments
1 Yes No <NA>
2 Don't know No <NA>
3 No No <NA>
4 No Yes <NA>
5 Don't know No <NA>
6 Don't know No <NA>
str() and summary() help us to understand the general structure of our data.
str(df_r)
'data.frame': 1259 obs. of 27 variables:
$ Timestamp : chr "2014-08-27 11:29:31" "2014-08-27 11:29:37" "2014-08-27 11:29:44" "2014-08-27 11:29:46" ...
$ Age : num 37 44 32 31 31 33 35 39 42 23 ...
$ Gender : chr "Female" "M" "Male" "Male" ...
$ Country : chr "United States" "United States" "Canada" "United Kingdom" ...
$ state : chr "IL" "IN" NA NA ...
$ self_employed : chr NA NA NA NA ...
$ family_history : chr "No" "No" "No" "Yes" ...
$ treatment : chr "Yes" "No" "No" "Yes" ...
$ work_interfere : chr "Often" "Rarely" "Rarely" "Often" ...
$ no_employees : chr "6-25" "More than 1000" "6-25" "26-100" ...
$ remote_work : chr "No" "No" "No" "No" ...
$ tech_company : chr "Yes" "No" "Yes" "Yes" ...
$ benefits : chr "Yes" "Don't know" "No" "No" ...
$ care_options : chr "Not sure" "No" "No" "Yes" ...
$ wellness_program : chr "No" "Don't know" "No" "No" ...
$ seek_help : chr "Yes" "Don't know" "No" "No" ...
$ anonymity : chr "Yes" "Don't know" "Don't know" "No" ...
$ leave : chr "Somewhat easy" "Don't know" "Somewhat difficult" "Somewhat difficult" ...
$ mental_health_consequence: chr "No" "Maybe" "No" "Yes" ...
$ phys_health_consequence : chr "No" "No" "No" "Yes" ...
$ coworkers : chr "Some of them" "No" "Yes" "Some of them" ...
$ supervisor : chr "Yes" "No" "Yes" "No" ...
$ mental_health_interview : chr "No" "No" "Yes" "Maybe" ...
$ phys_health_interview : chr "Maybe" "No" "Yes" "Maybe" ...
$ mental_vs_physical : chr "Yes" "Don't know" "No" "No" ...
$ obs_consequence : chr "No" "No" "No" "Yes" ...
$ comments : chr NA NA NA NA ...
And the summary:
summary(df_r)
Timestamp Age Gender Country
Length:1259 Min. :-1.726e+03 Length:1259 Length:1259
Class :character 1st Qu.: 2.700e+01 Class :character Class :character
Mode :character Median : 3.100e+01 Mode :character Mode :character
Mean : 7.943e+07
3rd Qu.: 3.600e+01
Max. : 1.000e+11
state self_employed family_history treatment
Length:1259 Length:1259 Length:1259 Length:1259
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
work_interfere no_employees remote_work tech_company
Length:1259 Length:1259 Length:1259 Length:1259
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
benefits care_options wellness_program seek_help
Length:1259 Length:1259 Length:1259 Length:1259
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
anonymity leave mental_health_consequence
Length:1259 Length:1259 Length:1259
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
phys_health_consequence coworkers supervisor
Length:1259 Length:1259 Length:1259
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
mental_health_interview phys_health_interview mental_vs_physical
Length:1259 Length:1259 Length:1259
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
obs_consequence comments
Length:1259 Length:1259
Class :character Class :character
Mode :character Mode :character
While the built-in functions are already an excellent starting point, both languages have more libraries that help you to get even more out of your data.
To get an even better idea of the data (with more information and also some visualizations), we use a small (but powerful) package inside the notebook. It’s called sweetviz and helps to generate nice EDA reports with just two lines of code!
We see a general overview of the data frame at the top, including rows, duplicates, the number of variables (= features), the distribution of variable types (categorical, numerical, and text), and the size of the dataset. If you then scroll down, you will see a visual representation of each variable, including the distribution, the number of distinct values, as well as missing values. If you click on a single tab, it will expand and give you even more information.
This is an excellent starting point for every data analysis.
report = sv.analyze(df_python)
report.show_notebook(layout="vertical", w=800, h=700, scale=0.8)
To access the full functionality, try it out yourself using the Jupyter notebook.
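If you are not working inside a notebook, sweetviz can also write the report to an HTML file; a minimal sketch (with a hypothetical file name) looks like this:

# Write the EDA report to an HTML file and open it in the browser
report.show_html("sweetviz_report.html")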
There are more libraries like this. If you are curious, you can also try autoviz, ydata-profiling (formerly pandas-profiling), or dtale.
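Since we already installed and imported ydata-profiling above, a minimal sketch of generating its report could look like this (the title is just a placeholder):

# Generate a profiling report for the survey data
profile = ProfileReport(df_python, title="Mental Health in Tech Survey")
# Render the report directly inside the notebook
profile.to_notebook_iframe()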
Coming back to R: the built-in functions don’t provide information on the number of missing values. Here, the library skimr can help. It provides you with a more detailed output. If you want to know what else is out there for exploratory data analysis in R, have a look at a recent publication that compares more packages.
skimr::skim(df_r)
── Data Summary ────────────────────────
Values
Name df_r
Number of rows 1259
Number of columns 27
_______________________
Column type frequency:
character 26
numeric 1
________________________
Group variables None
── Variable type: character ────────────────────────────────────────────────────
skim_variable n_missing complete_rate min max empty n_unique
1 Timestamp 0 1 19 19 0 1246
2 Gender 0 1 1 46 0 49
3 Country 0 1 5 22 0 48
4 state 515 0.591 2 2 0 45
5 self_employed 18 0.986 2 3 0 2
6 family_history 0 1 2 3 0 2
7 treatment 0 1 2 3 0 2
8 work_interfere 264 0.790 5 9 0 4
9 no_employees 0 1 3 14 0 6
10 remote_work 0 1 2 3 0 2
11 tech_company 0 1 2 3 0 2
12 benefits 0 1 2 10 0 3
13 care_options 0 1 2 8 0 3
14 wellness_program 0 1 2 10 0 3
15 seek_help 0 1 2 10 0 3
16 anonymity 0 1 2 10 0 3
17 leave 0 1 9 18 0 5
18 mental_health_consequence 0 1 2 5 0 3
19 phys_health_consequence 0 1 2 5 0 3
20 coworkers 0 1 2 12 0 3
21 supervisor 0 1 2 12 0 3
22 mental_health_interview 0 1 2 5 0 3
23 phys_health_interview 0 1 2 5 0 3
24 mental_vs_physical 0 1 2 10 0 3
25 obs_consequence 0 1 2 3 0 2
26 comments 1095 0.130 1 3548 0 160
whitespace
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 0
16 0
17 0
18 0
19 0
20 0
21 0
22 0
23 0
24 0
25 0
26 1
── Variable type: numeric ──────────────────────────────────────────────────────
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75
1 Age 0 1 79428148. 2818299443. -1726 27 31 36
p100 hist
1 99999999999 ▇▁▁▁▁
Once we have an idea of the data, we continue and subset the data frame to the variables that we need - in our case Country and treatment. Looking at the Kaggle codebook, we learn that:
- Country stands for the country the respondent lives in
- treatment provides the answer to the question “Have you sought treatment for a mental health condition?”
To wrangle the data, we again rely on pandas - a go-to library that makes working with data easy.
df_python_small = df_python[['Country', 'treatment']]
# We then again print the head to understand the changes
# in our data frame
df_python_small.head()
Country Treatment
0 United States Yes
1 United States No
2 Canada No
3 United Kingdom Yes
4 United States No
We see that only the columns Country and treatment remain. Since we are only interested in countries in Europe, we need to identify a way to select these countries. It would be possible to generate a list of European countries ourselves - but that would take up much of our time. Luckily, other people thought the same and developed a tool that does the work for us. The Python library countrycode is the counterpart of the R library with the same name. While the syntax is slightly different, the logic remains the same:
# Get the region
df_python_small['region'] = countrycode.countrycode(
codes=df_python_small['Country'],
origin='country_name',
target='region'
)
# Get an ISO3 country code
df_python_small['country'] = countrycode.countrycode(
codes=df_python_small['Country'],
origin='country_name',
target='iso3c'
)
# Print the head again
df_python_small.head()
Country Treatment Region Country Code
0 United States Yes Northern America USA
1 United States No Northern America USA
2 Canada No Northern America CAN
3 United Kingdom Yes Northern Europe GBR
4 United States No Northern America USA
In the next step, we subset the data and keep only those countries that are located within Europe.
# For this, we define the range of the regions
region_value = ["Northern Europe", "Western Europe",
"Eastern Europe", "Southern Europe"]
# Use the information from `region_value` and keep only those
# countries where the region is in `region_value`
df_python_small2 = df_python_small[
df_python_small['region'].isin(region_value)
]
# Last but not least, we again print the head:
df_python_small2.head()
Country Treatment Region Country Code
3 United Kingdom Yes Northern Europe GBR
11 Bulgaria No Eastern Europe BGR
16 United Kingdom Yes Northern Europe GBR
19 France No Western Europe FRA
29 United Kingdom No Northern Europe GBR
Next, we count the number of treatment occurrences (“Yes” vs “No”) by country and store the result in df_python_clean.
# Count the distinct treatment answer by country
df_python_clean = df_python_small2.value_counts(
['country', 'treatment']
).reset_index(name='n')
# Print the head
df_python_clean.head()
Country Treatment N
0 GBR Yes 93
1 GBR No 92
2 DEU No 24
3 DEU Yes 21
4 NLD No 18
In the next step, we calculate the sum of n (the count that we generated above) by country. We will use this information later to calculate the percentage share.
# Calculate the total answers by country
# (irrespective of the distinct answer)
df_python_clean2 = df_python_clean.groupby(['country']).n. \
agg('sum').reset_index(name='total')
# Print the head
df_python_clean2.head()
Country Total
0 AUT 3
1 BEL 6
2 BGR 4
3 BIH 1
4 CHE 7
We now have two data frames (df_python_clean and df_python_clean2). To bring them back together, we use a merge.
# Merge df_python_clean2 and df_python_clean
df_python_cleaned = df_python_clean2.merge(
df_python_clean,
on='country'
)
# Print the head
df_python_cleaned.head()
Country Total Treatment N
0 AUT 3 No 3
1 BEL 6 No 5
2 BEL 6 Yes 1
3 BGR 4 No 2
4 BGR 4 Yes 2
In the last step, we calculate the percentage share.
# Calculate the percentage share
df_python_cleaned['percent'] = round(
(df_python_cleaned['n'] / df_python_cleaned['total']) * 100,
2
)
# Print the head
df_python_cleaned.head()
Country Total Treatment N Percent
0 AUT 3 No 3 100.00
1 BEL 6 No 5 83.33
2 BEL 6 Yes 1 16.67
3 BGR 4 No 2 50.00
4 BGR 4 Yes 2 50.00
While we executed single steps here, we can also introduce chains and execute them in one longer step:
# Define regions
region_value = ["Northern Europe", "Western Europe",
"Eastern Europe", "Southern Europe"]
# Execute the chain
df_clean1 = (
df_python[['Country', 'treatment']]
.assign(
region=lambda d: countrycode.countrycode(
codes=d['Country'],
origin='country_name',
target='region'
)
)
.assign(
country=lambda d: countrycode.countrycode(
codes=d['Country'],
origin='country_name',
target='iso3c'
)
)
.loc[lambda x: x['region'].isin(region_value)]
.value_counts(['country', 'treatment'])
.reset_index(name='n')
)
# Calculate the total answers by country
# (irrespective of the distinct answer)
df_clean2 = df_clean1.groupby(['country']).n. \
agg('sum').reset_index(name='total')
# Merge df_clean2 and df_clean1
df_python_cleaned = df_clean2.merge(df_clean1, on='country')
# Calculate the percentage share
df_python_cleaned['percent'] = round(
(df_python_cleaned['n'] / df_python_cleaned['total']) * 100,
2
)
# And again print the head
df_python_cleaned.head()
Country Total Treatment N Percent
0 AUT 3 No 3 100.00
1 BEL 6 No 5 83.33
2 BEL 6 Yes 1 16.67
3 BGR 4 No 2 50.00
4 BGR 4 Yes 2 50.00
In the next step, we perform some data wrangling. The term data wrangling is rather broad and can include various things, depending on the use case and the goal of your wrangling. It can mean that you need to pre-process text data if you work with it, or that you need to reshape the data completely to make them fit your purpose. We’ll repeat the steps that we performed in Python in R and create the df_r_clean dataset. For this, we first create new country codes using the countrycode package, then filter for the required regions (based on the World Bank classification). In the next step, we select country and treatment. We group by these variables and generate a count, then group again - only by country this time. In the last step, we generate the percentage and store it in percent.
To do this, the package dplyr is a gem! It makes data wrangling easy and allows you to write easily readable code. You can see an example below.
R has a special operator called a pipe (%>%). It comes from the package magrittr, which is a dependency of dplyr.
df_r_clean <- df_r %>%
# Generate the ISO3 country name (`country`)
# and the region (`region`)
dplyr::mutate(
country = countrycode::countrycode(
Country, 'country.name', 'iso3c'
),
region = countrycode::countrycode(
Country, 'country.name', 'region23'
)
) %>%
# We now filter for countries in Europe
dplyr::filter(
region %in% c(
"Northern Europe", "Western Europe",
"Eastern Europe", "Southern Europe"
)
) %>%
# Select the required variables
dplyr::select(country, treatment) %>%
# Group by them...
dplyr::group_by(country, treatment) %>%
# ... and generate the count
count() %>%
# Now we group again by country
dplyr::group_by(country) %>%
# Calculate the sum of the count (`n`) by country
# and use this information to calculate the
# percentage share for each category by country
dplyr::mutate(
total = sum(n),
percent = (n / total) * 100
)
df_r_clean
# A tibble: 39 × 5
# Groups: country [27]
country treatment n total percent
<chr> <chr> <int> <int> <dbl>
1 AUT No 3 3 100
2 BEL No 5 6 83.3
3 BEL Yes 1 6 16.7
4 BGR No 2 4 50
5 BGR Yes 2 4 50
6 BIH No 1 1 100
7 CHE No 4 7 57.1
8 CHE Yes 3 7 42.9
9 CZE No 1 1 100
10 DEU No 24 45 53.3
# … with 29 more rows
# ℹ Use `print(n = ...)` to see more rows
Printing the data set again, we see that percentages can often be misleading if we don’t know the total count by country. This is important information that we need to keep in mind when we later interpret the data.
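To make this concrete, here is a quick sketch in Python (using the df_python_cleaned frame from above) that flags countries with very few respondents, where a share of 100% can rest on a single answer:

# Countries with fewer than 5 respondents in total;
# their percentage shares should be interpreted with caution
df_python_cleaned[df_python_cleaned['total'] < 5]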
As you have already seen, both R and Python are extremely versatile, and both are good candidates for these tasks. It often comes down to a matter of personal taste, what your collaborators use, and/or a certain path dependency - be it what you learned first, what the company or team you are working with prefers to use, etc.
For getting an inspiration for visualizations, my go-to resource is data-to-viz.com. It’s a comprehensive website with an easily accessible overview that provides code snippets in both R and Python.
We’ll switch the order here and start with R to visualize the data. Afterwards, you’ll find a way to mimic these steps in Python using the plotnine library as well as Python’s equivalent to patchwork, called patchworklib.
When it comes to visualizations, my personal preference is ggplot2 (from R). The preference may of course differ, depending on who you ask 🤓
To visualize our data, we create a function. This is not required (we could also copy-paste the code as often as we want), but writing functions increases replicability and is good practice 👍
Copy-pasting your code increases the chance of running into “copy-paste” errors and (believe me) this can be really frustrating to maintain and debug 👎
While writing functions may seem intimidating at first, it’s often not too difficult and a great asset in the long run!
generate_plot <- function(df, value, region_value) {
plot <- df %>%
# Generate the region again (lost during wrangling)
dplyr::mutate(
region = countrycode::countrycode(
country, 'iso3c', 'region23'
)
) %>%
# Filter by region_value and keep only "Yes"
dplyr::filter(
region == !! ensym(region_value),
!! ensym(value) == "Yes"
) %>%
# Now the plotting magic begins 🔮
ggplot2::ggplot(aes(x = country, y = percent)) +
ggplot2::geom_segment(
aes(
x = reorder(country, percent),
xend = country, y = 0, yend = percent
),
color = "#a7a9ac"
) +
ggplot2::geom_point(
color = "#88398a", size = 4, alpha = 0.8
) +
# Flip coordinates
ggplot2::coord_flip() +
# Apply a minimalist theme
ggplot2::theme_minimal() +
ggplot2::theme(
plot.title = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank()
)
return(plot)
}
In the next step, we iterate over the regions in Europe (“Western Europe”, “Eastern Europe”, “Southern Europe”, and “Northern Europe”) and apply our function generate_plot:
# Store the regions in the object `regions`
regions <- c("Western Europe", "Eastern Europe",
"Southern Europe", "Northern Europe")
# Iterate over regions and generate plots
for (region_name in regions) {
# This allows us to generate a short version of
# the regions name
name <- gsub("([A-Za-z]+).*", "\\1", tolower(region_name))
# Generate the plot and store it in an object
# that is called "western",
# "eastern", "southern", and "northern"
assign(name,
generate_plot(df_r_clean, value = treatment,
region_value=region_name))
}
We now have four objects (called northern, southern, western, and eastern). We could call each object and look at the plots separately:
northern
But this way it’s not easy to compare them, and wouldn’t it be great to have them all in one plot? This is where patchwork comes in. patchwork is a fantastic package that allows you to combine single plots in a very convenient way. Here’s one example:
# We combine the plots
(northern + southern) / (western + eastern) +
# And add an annotation
plot_annotation(title = 'Requesting mental health treatment',
subtitle = 'Percentage of persons having sought mental health treatment',
caption = 'Data reference: Kaggle')
Python also offers visualization libraries. The most famous are matplotlib and seaborn (which builds upon matplotlib). If you are looking for a ggplot2 equivalent, plotnine could become your new best friend!
# Similar to R, we add the region
df_python_cleaned['region'] = countrycode.countrycode(codes=df_python_cleaned['country'],
origin='iso3c',
target='region')
# Keep only those where the answer is "Yes"
df_python_reduced = df_python_cleaned[df_python_cleaned['treatment'] == "Yes"] \
[['country', 'percent', 'region']]
# Print the head of the data
df_python_reduced.head()
Country Percent Region
2 BEL 16.67 Western Europe
4 BGR 50.00 Eastern Europe
7 CHE 42.86 Western Europe
10 DEU 46.67 Western Europe
11 DNK 100.00 Northern Europe
In the next step, we finally create the plots:
# And now the plotting magic begins 🔮
plot = {}
for region in region_value:
# Subset data to the required region only
df = df_python_reduced[df_python_reduced['region'] == region]
# Plot the data and store the plot into a dictionary
plot[region] = (ggplot(df, aes(x='country',y='percent'))+
geom_segment(mapping=aes(x='reorder(country,percent)',xend='country', y=0, yend='percent'),
color="#a7a9ac")+
geom_point(color="#88398a", size=4, alpha=0.8)+
# Flip the coordinates
coord_flip()+
# And twist the theme a bit
theme_minimal()+
labs(y="", x="")
)
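Similar to the single objects in R, we could look at one of the plots by accessing the dictionary:

# Display the plot for one region
plot['Northern Europe']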
And here patchworklib (the Python counterpart of the patchwork package that we used in R) helps a lot.
g1 = pw.load_ggplot(plot['Northern Europe'], figsize=(2,2))
g2 = pw.load_ggplot(plot['Southern Europe'], figsize=(2,2))
g3 = pw.load_ggplot(plot['Western Europe'], figsize=(2,2))
g4 = pw.load_ggplot(plot['Eastern Europe'], figsize=(2,2))
(g1 | g2) / (g3 | g4)
Unlike the original patchwork package in R, here we cannot (yet) easily add and define the title, subtitles, and captions. But that’s often the case when we use a package that has been ported from another language.
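What we can still do is store the composed figure in an object and save it to disk; a minimal sketch (reusing g1 to g4 from above, with a hypothetical file name):

# Compose the four plots and save the result as a PNG file
combined = (g1 | g2) / (g3 | g4)
combined.savefig("mental_health_treatment_europe.png")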
While the plots may be good-looking, their function is more important 🤓 What did we learn about the data, from the data wrangling process up to plotting the visualization?
Let us summarize what we know:
And here the “data science circle” starts - we are now asked to think about possible solutions, make different decisions based on our newly gained knowledge and to (possibly) re-iterate the steps made above. Data science processes are often not linear - but that’s what we already discussed in the beginning ✨
[Image: two arrows, one straight and one going in circles. Each arrow starts at a question mark and ends at sparkling elements, symbolizing the ‘magic’ output that will be created.]
- dovpanda - directions overlays in pandas
- siuba - tidy-like data wrangling in Python
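If you like the dplyr-style syntax, here is a minimal sketch of what the counting step from above could look like with siuba (assuming the package is installed; the exact verbs may vary between versions):

from siuba import _, count

# Tidyverse-style wrangling on a pandas data frame:
# count the treatment answers ("Yes"/"No") per country
df_python_small2 >> count(_.country, _.treatment)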