Python and R: combined workflows

tools
Published

January 20, 2023

Objective

Plenty of articles discuss when to choose Python or R for data science. Here I summarize how I currently use both languages together, and how I plan to.

Data collection

As usual, a clear picture of the situation starts with some data. I started learning R on DataCamp in mid-2018 and estimate that I have been coding professionally in R for at least 1/3 of my time since 2020. As my working year is 200 × 8 h = 1’600 h/year, this makes approximately 500 h/year over 3 years, or about 1’500 h.

Status end 2022:

89h - DataCamp Data Scientist with R
650h - EPFL data science with R
1500h - on the job in R
2239h - total in R

66h - DataCamp Data Analyst with Python
66h - total in Python
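
The estimate and the totals above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the variable names are illustrative, and rounding one third of a working year down to 500 h is the same approximation used in the text.

```python
# Rough reconstruction of the hour estimates above (names are illustrative).
HOURS_PER_YEAR = 200 * 8      # 200 working days of 8 h = 1600 h/year
R_SHARE = 1 / 3               # estimated share of working time spent in R
YEARS_ON_THE_JOB = 3          # 2020, 2021 and 2022

exact_per_year = HOURS_PER_YEAR * R_SHARE   # a bit above 500 h/year
rounded_per_year = 500                      # rounded down, as in the text
on_the_job_r = rounded_per_year * YEARS_ON_THE_JOB

hours = {
    "R": {
        "DataCamp Data Scientist with R": 89,
        "EPFL data science with R": 650,
        "on the job": on_the_job_r,
    },
    "Python": {
        "DataCamp Data Analyst with Python": 66,
    },
}

# Sum the entries per language.
totals = {language: sum(items.values()) for language, items in hours.items()}

if __name__ == "__main__":
    print(f"One third of a working year: {exact_per_year:.0f} h")
    for language, total in totals.items():
        print(f"{language}: {total} h")
    print(f"R/Python ratio: {totals['R'] / totals['Python']:.0f}x")
```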

Status

The volume of hours in R is still much higher than in Python, but this doesn’t tell the whole story. I’ve been learning and using other languages such as Bash, JavaScript and some C++, and have read up on the history of programming languages. I’m now familiar with many programming concepts, which makes learning Python much easier. It is also interesting that the experience is not new to me: I learned Spanish and French after Portuguese, and some German after English, and each time it felt like an extension of previous knowledge, the underlying structures being the same.

Now, the more I learn Python, the more I tend (or try) to see R as a domain-specific language for statistics and Python as a general-purpose language. The overlap is very large, though, and it is hard to draw the border, as R provides all sorts of tools for OS handling, web applications and just about anything else we can think of. I’ve been experimenting with tidymodels and other R machine learning packages but am not entirely satisfied.

Next steps

As I am now moving much deeper into machine learning, and doing so professionally, I’ve decided to make Python my first language.

This means that for typical data science tasks like loading data, wrangling data frames and so on, I will be using pandas and Matplotlib instead of the tidyverse. I don’t expect the transition to take much effort: the concepts are the same, and since syntax is hard to memorize in any language, I keep referring to the tidyverse cheatsheets anyway (GitHub Copilot may one day change this). R will remain useful for specific statistical tasks such as analysing designs of experiments.
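
To illustrate how closely the concepts map, here is a minimal sketch of a typical wrangling step in pandas, with the dplyr verb I would otherwise reach for noted in the comments. The data and column names are made up for the example, and the pairing of verbs is a rough correspondence rather than an exact equivalence.

```python
import pandas as pd

# Made-up measurements; in practice this would come from pd.read_csv(...).
df = pd.DataFrame(
    {
        "machine": ["A", "A", "B", "B", "B"],
        "thickness_mm": [1.02, 0.98, 1.10, 1.07, 0.80],
    }
)

summary = (
    df[df["thickness_mm"] > 0.9]                               # filter()
    .assign(thickness_um=lambda d: d["thickness_mm"] * 1000)   # mutate()
    .groupby("machine", as_index=False)                        # group_by()
    .agg(                                                      # summarise()
        n=("thickness_um", "size"),
        mean_um=("thickness_um", "mean"),
    )
    .sort_values("mean_um", ascending=False)                   # arrange(desc())
    .reset_index(drop=True)
)

print(summary)
```

Method chaining inside parentheses reads much like a `%>%` pipeline, which is a large part of why the switch feels like a change of vocabulary more than a change of approach.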

In this sense I expect Python to become my primary programming language. With Posit’s Quarto notebooks I have a working tool in which I can freely combine both. I see my workflow moving more and more towards doing most things in Python and reserving R for some advanced statistical analysis.

Summary

Below is a short summary of my workflow domains, current and future (a star identifies my selected packages or approaches):

| Domain | R | Python | Comments |
|---|---|---|---|
| Statistics | *stats | statsmodels | Adoption of Python under evaluation |
| Modelling | *stats, lme4 | statsmodels | Adoption of Python under evaluation |
| MSA | *SixSigma | | May remain in R as very specific |
| DoE | *FrF2 | | May remain in R as very specific |
| Process Capability | Custom functions in R | | Adoption of Python only after Statistics |
| Machine Learning | tidymodels, mlr3 | *scikit-learn | Initially selected tidymodels but still not satisfied |
| Data science | tidyverse, ggplot2 | *numpy, pandas, matplotlib, seaborn | Python becomes inevitable with the choice of scikit-learn |
| Dashboards | *shiny | flask | Strong investment in Shiny already. Consider embedding Python in the back end on a need basis, if ever. |
| Text | stringr | *str, builtins | |
| NLP, Cartography, APIs | - | - | If ever needed, will use the team’s language |
| OS | - | - | Directly in Linux |
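
For the Text row, here is a minimal sketch of how some common stringr calls map onto Python’s built-in `str` methods. The mapping is my own rough correspondence and not an exact equivalence: stringr patterns are regular expressions by default, whereas the `str` methods below match literal text. The sample string is invented.

```python
# Rough stringr -> built-in str correspondences (literal matching only).
raw = "  Batch-042; Line B; OK  "

trimmed = raw.strip()                              # str_trim(x)
has_ok = "OK" in trimmed                           # str_detect(x, fixed("OK"))
upper = trimmed.upper()                            # str_to_upper(x)
replaced = trimmed.replace(";", ",")               # str_replace_all(x, fixed(";"), ",")
fields = [f.strip() for f in trimmed.split(";")]   # str_split(x, ";") then str_trim()
batch_id = fields[0].removeprefix("Batch-")        # str_remove(x, "^Batch-"), Python 3.9+
width = len(trimmed)                               # str_length(x)
padded = batch_id.zfill(5)                         # str_pad(x, 5, pad = "0")

print(fields, batch_id, padded)
```

When a real pattern is needed, the standard library’s `re` module takes over the role that regular expressions play inside stringr.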

References

https://www.oreilly.com/library/view/python-and-r/9781492093398/