Authoring with Quarto: setup for combined Python and R

tools
Author

Joao Ramalho

Published

April 9, 2022

Introduction

Support for using Python in an R setup has now reached maturity. A couple of years ago it was still necessary to set up Python manually by creating a virtual environment with either venv or conda and then configuring the system paths correctly. The RStudio team has greatly simplified this, in particular with the packages {renv} and {reticulate}. This article, which is written in Rmarkdown, gives an example of how to do it, followed by a short data analysis in Python that confirms the libraries are available.

The benefit is clear: any R-native data scientist can easily explore and adopt code snippets and packages developed in Python, in full or in part, in their own work. To compare with human languages, this is like learning the basics of a foreign language: you can travel abroad, ask for directions, order food and potentially start new interactions with people who speak neither your language nor a common one such as English.

Setup

A dedicated R project needs to be created for the setup. The packages reticulate and renv need to be installed and renv activated. Once renv is activated, the Python setup can be triggered with renv::use_python. This function creates, inside the R project folder, a dedicated virtual environment along with folders for the Python libraries and Python engines. This needs to be done only once. The configuration can be checked with the function py_discover_config.

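# Initialise renv for the project, then install the R packages used here
renv::init()
renv::install("reticulate")
renv::install("here")
renv::install("languageserver")
renv::install("httpgd")
# Create a project-local conda environment for Python (needed only once)
renv::use_python(type = "conda")
# Check which Python installation reticulate will use
reticulate::py_discover_config()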

After some testing I’ve opted for the conda approach. Virtual environments can be configured with pip and venv too, but conda has been the one providing stronger reproducibility. Python packages are installed with conda_install, which is the equivalent of the R install.packages function. The new Python packages are installed locally in the project's conda environment. Again, the configuration can be checked with py_discover_config. The chunk closes with the installation of the R tidyverse packages and a call to renv::snapshot to record the configuration in the renv.lock lockfile.

library(here)
reticulate::conda_install("numpy", envname = here("renv/python/condaenvs/renv-python"))
reticulate::conda_install(c("pandas", "matplotlib", "seaborn"), envname = here("renv/python/condaenvs/renv-python"))
reticulate::conda_install("sqlalchemy", envname = here("renv/python/condaenvs/renv-python"))
reticulate::conda_install("requests", envname = here("renv/python/condaenvs/renv-python"))
reticulate::py_discover_config()
renv::install("tidyverse")
renv::snapshot()

Data analysis with Python

For regular use it is sufficient to run the chunk below, which loads the configuration.

library(reticulate)
library(here)
here() starts at /home/joao/JR-IA
#use_condaenv(condaenv = here("../../renv/python/condaenvs/renv-python"))
py_discover_config()
python:         /home/joao/JR-IA/renv/python/condaenvs/renv-python/bin/python
libpython:      /home/joao/JR-IA/renv/python/condaenvs/renv-python/lib/libpython3.7m.so
pythonhome:     /home/joao/JR-IA/renv/python/condaenvs/renv-python:/home/joao/JR-IA/renv/python/condaenvs/renv-python
version:        3.7.13 (default, Mar 29 2022, 02:18:16)  [GCC 7.5.0]
numpy:          /home/joao/JR-IA/renv/python/condaenvs/renv-python/lib/python3.7/site-packages/numpy
numpy_version:  1.21.6

NOTE: Python version was forced by RETICULATE_PYTHON
py_module_available("pandas")
[1] TRUE
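
The same check can also be made from the Python side, which can help when it is unclear which interpreter a chunk is using. The sketch below relies only on the Python standard library; `module_available` is a helper name introduced here for illustration and is not part of reticulate.

```python
import importlib.util
import sys


def module_available(name: str) -> bool:
    """Return True if `name` can be imported by the running interpreter."""
    return importlib.util.find_spec(name) is not None


# Path of the interpreter actually running this code; with the setup above
# it should point inside the project's conda environment.
print("python:", sys.executable)

# Packages installed earlier with conda_install.
for package in ["numpy", "pandas", "matplotlib", "seaborn", "sqlalchemy", "requests"]:
    print(f"{package}: {module_available(package)}")
```

If a package is reported as unavailable, the interpreter path printed first is usually the place to start looking.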

To test the Python installation and its packages, some data analysis is run in the chunks below using reticulate's support for Rmarkdown. Python is simply indicated in the chunk metadata, and the chunks can be run just like R chunks in the RStudio IDE.

A first step is to import the required python packages for this analysis:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

The data used for this example comes from ourworldindata.org, specifically from an article by Ritchie and Roser (2018). Data is loaded and cleaned for plotting using pandas:

raw_data = pd.read_csv("./data/plastic-waste-polymer.csv")
clean_data = raw_data.rename(columns={"Primary plastic waste generation (million tonnes)":"Weight"})
clean_data["MillionTons"] = clean_data["Weight"] / 1000000
plot_data = clean_data.drop([0,1,2,3,4,6,9,16,17,18])
plot_data = plot_data.sort_values(by = "MillionTons", ascending = False)
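
To make the cleaning steps easier to follow without the CSV file, here is a minimal sketch of the same rename, unit conversion, drop and sort on a tiny data frame. The rows and values are made up for illustration, and the "All polymers" aggregate row is a hypothetical stand-in for the rows removed in the real data.

```python
import pandas as pd

# Made-up rows standing in for the CSV; names and values are illustrative only.
raw_data = pd.DataFrame({
    "Entity": ["PET", "All polymers", "PS", "PP"],
    "Year": [2015, 2015, 2015, 2015],
    "Primary plastic waste generation (million tonnes)": [32e6, 104e6, 17e6, 55e6],
})

# Shorten the long column name, then convert tonnes to million tonnes.
clean_data = raw_data.rename(
    columns={"Primary plastic waste generation (million tonnes)": "Weight"}
)
clean_data["MillionTons"] = clean_data["Weight"] / 1_000_000

# drop() takes index labels; with the default index these equal row positions.
plot_data = clean_data.drop([1])
plot_data = plot_data.sort_values(by="MillionTons", ascending=False)

print(plot_data[["Entity", "MillionTons"]])
```

Because drop works on index labels, the hard-coded list in the chunk above depends on the row order of the CSV file staying the same.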

The R package reticulate keeps working in the background, making the data available in R if needed. Python objects can be accessed via the py object, for example py$plot_data. Reticulate offers many additional features, including calling R objects back from Python, object type conversion and a full set of important IDE features such as inspecting objects and plotting.

library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
py$plot_data %>% 
  glimpse()
Rows: 9
Columns: 5
$ Entity      <chr> "LD, LDPE", "PP", "PP&A fibers", "HDPE", "PET", "PS", "PUT…
$ Code        <dbl> NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN
$ Year        <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015
$ Weight      <dbl> 5.7e+07, 5.5e+07, 4.2e+07, 4.0e+07, 3.2e+07, 1.7e+07, 1.6e…
$ MillionTons <dbl> 57, 55, 42, 40, 32, 17, 16, 15, 11

In this final chunk the plotting functionality is tested:

plt.figure()
x = plot_data["Entity"]
y = plot_data["MillionTons"]
plt.bar(x,y)
<BarContainer object of 9 artists>
plt.title("Primary plastic generation by polymer, 2015")
plt.ylabel("Million tons")
plt.xticks(rotation=90)
([0, 1, 2, 3, 4, 5, 6, 7, 8], <a list of 9 Text major ticklabel objects>)
plt.tight_layout()
plt.show()
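
When the code runs outside an interactive IDE, plt.show() may have nowhere to display the figure. A possible variant, sketched here under that assumption, selects the non-interactive Agg backend and renders the chart to PNG bytes. It uses only the first three rows shown in the glimpse() output above.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, no display required
import matplotlib.pyplot as plt

# First three rows from the glimpse() output shown earlier.
entities = ["LD, LDPE", "PP", "PP&A fibers"]
million_tons = [57, 55, 42]

# The object-oriented interface keeps the figure and axes explicit.
fig, ax = plt.subplots()
bars = ax.bar(entities, million_tons)
ax.set_title("Primary plastic generation by polymer, 2015")
ax.set_ylabel("Million tons")
ax.tick_params(axis="x", labelrotation=90)
fig.tight_layout()

# Render to an in-memory PNG; pass a file path to savefig to write to disk instead.
buf = io.BytesIO()
fig.savefig(buf, format="png")
png_bytes = buf.getvalue()
plt.close(fig)

print(f"rendered {len(png_bytes)} bytes of PNG data")
```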

References

Ritchie, Hannah, and Max Roser. 2018. “Plastic Pollution.” Our World in Data.