Authoring with Quarto: qmd python template workbook

programming
Published

August 4, 2022

Objective

This article serves as a template for Python only in a quarto workbook. It uses a .qmd file and a conda installation that has been configured previously (see article from the 2022-04-09). Differently from the Python + R template, this workbook doesn’t use reticulate to activate the virtual environment. To run the code in the chunks indicate the virtual environment for the workbook (in visual studio code use Ctrl+Shift+P and then Python: Select Interpreter and put the path - here is /home/joao/JR-IA/renv/python/condaenvs/renv-python). Important to note not to install IPkernel as it is often recommended by the pop ups in vsc.

Introduction

In this article we’re exploring another open data source, this time the World Bank. The topic selected for testing the data collection and visualization is the share of renewable energy for selected countries in the last 25 years. The article also confirms the compatibility of the python programming language in this website which is build with the R programming language.

Setup

%conda activate /home/joao/JR-IA/renv/python/condaenvs/renv-python
%conda env list

CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'.
To initialize your shell, run

    $ conda init <SHELL_NAME>

Currently supported shells are:
  - bash
  - fish
  - tcsh
  - xonsh
  - zsh
  - powershell

See 'conda init --help' for more information and options.

IMPORTANT: You may need to close and restart your shell after running 'conda init'.


Note: you may need to restart the kernel to use updated packages.
# conda environments:
#
base                     /home/joao/.local/share/r-miniconda
r-reticulate             /home/joao/.local/share/r-miniconda/envs/r-reticulate
                      *  /home/joao/JR-IA/renv/python/condaenvs/renv-python
                         /home/joao/datascience/python/cookbookpy/renv/python/condaenvs/renv-python
                         /home/joao/datascience/python/datacamp_exercises/exercises/renv/python/condaenvs/renv-python
                         /home/joao/jr/renv/python/condaenvs/renv-python
                         /home/joao/miniconda3
                         /home/joao/miniconda3/envs/mlenv
                         /home/joao/miniconda3/envs/r-reticulate
                         /home/joao/miniconda3/envs/renv

Note: you may need to restart the kernel to use updated packages.

Install wbdata

The World Bank website suggests several methods to obtain the data from its databases. After some reading we’ve opted here for the python module wbdata. The possibilities to install the module are described in wbdata documentation. In our website the python configuration has been done with conda which is not available. We’ve then opted for a direct download from github and saved the library folder directly in our project folder. This has been enough to make it available. The chunk below confirms that python sees the module.

We can then import it in a python chunk directly as below:

import wbdata

Load modules

Besides loading wbdata we also need to load the common python libraries for data wrangling and plotting:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import json

Now we’re ready for data loading and exploration.

Renewable electricity output

Search country by code

As in similar articles we’ve opted to call the API as least as possible and to store the data objects locally. This avoids loading the entity servers and ultimately to see one’s access blocked. Our starting point has been the selection of countries for our analysis. The wbdata API provides a method to get countries called get_country. In the next chunk we show how we have queried once the data base for all the countries and save it as a json object.

renewables_country_list=wbdata.get_country()
with open('data/renewables_country_list.json', 'w') as json_file:
  json.dump(renewables_country_list, json_file)

Now we can load anytime we like from our local folder:

with open('data/renewables_country_list.json', 'r') as json_file:
  country_list = json.load(json_file)

The returned object is a list of dictionaries. We can check the entry for Switzerland with:

country_list[47]
{'id': 'CHE',
 'iso2Code': 'CH',
 'name': 'Switzerland',
 'region': {'id': 'ECS', 'iso2code': 'Z7', 'value': 'Europe & Central Asia'},
 'adminregion': {'id': '', 'iso2code': '', 'value': ''},
 'incomeLevel': {'id': 'HIC', 'iso2code': 'XD', 'value': 'High income'},
 'lendingType': {'id': 'LNX', 'iso2code': 'XX', 'value': 'Not classified'},
 'capitalCity': 'Bern',
 'longitude': '7.44821',
 'latitude': '46.948'}

Later in our query for renewable energy we will only need the country id. It can be obtained from the list by sub setting as follows:

country_list[47]['id']
'CHE'

To extract all countries codes we can run a for loop as follows:

country_codes = []
for i in range(0,len(country_list)):
  country_id = country_list[i]['id']
  country_codes.append(country_id)

print(country_codes[0:9])
['ABW', 'AFE', 'AFG', 'AFR', 'AFW', 'AGO', 'ALB', 'AND', 'ARB']

This was a possible approach but it requires knowing the countries codes upfront. In case they’re not know there’s another country search method that accepts various arguments the search_countries method shown below.

Search country by name

We’re using here the keyword argument to provide country names. We start by creating a list with the country names that we feed in the loop. Using a similar sub setting approach with the index [0] and then the [‘id’] we can extract the code and assign it to a list.

target_countries = ['United States', 'Portugal', 'Switzerland']

country_codes = []
for country in target_countries:
  country_entry = wbdata.search_countries(country)
  country_code = country_entry[0]['id']
  country_codes.append(country_code)

print(country_codes)
['USA', 'PRT', 'CHE']

This country_codes list is the one that we will use later in our query for renewable energy figures.

Search indicators

Now that our countries are selected we’re going to identify an indicator of interest on which to make the analysis. As before we do this only once and store the result to a json object.

indicators_list = wbdata.search_indicators('renewable energy')
with open('data/renewables_indicators_list.json', 'w') as json_file:
  json.dump(indicators_list, json_file)

Now for the purposes of analysis and investigation we reload the indicators object as often as we need:

with open('data/renewables_indicators_list.json', 'r') as json_file:
  indicators_list = json.load(json_file)

For the sake of testing we’re exploring here a different way of looking into the json file. Less intuitive but easier than the for loop is to use the pandas method json_normalize:

indicators = pd.json_normalize(indicators_list)
indicators[['id', 'name']]
id name
0 2.1_SHARE.TOTAL.RE.IN.TFEC Renewable energy consumption(% in TFEC)
1 3.1_RE.CONSUMPTION Renewable energy consumption (TJ)
2 4.1.2_REN.ELECTRICITY.OUTPUT Renewable energy electricity output (GWh)
3 EG.FEC.RNEW.ZS Renewable energy consumption (% of total final...

Using this information we prepare the variable for the final query. In this case it is sufficient to do it manually.

indicators = {'EG.FEC.RNEW.ZS':'renewable_energy_perc_total'}

Get data

There are various methods to get the data from the world bank database. Again there is a json approach and one directly with pandas. As we have only one indicator and three countries we’re opting for a simpler approach with pandas. This should be sufficient as we don’t expect deeply nested data which is one of the main benefits of the json format.

We query the database once and store the result in a local csv file.

renewables=wbdata.get_dataframe(indicators, country = country_codes, convert_date = True)
renewables.to_csv("data/renewables.csv")

We load it for our analysis:

renewables=pd.read_csv("data/renewables.csv")
print(renewables.head())
       country        date  renewable_energy_perc_total
0  Switzerland  2021-01-01                          NaN
1  Switzerland  2020-01-01                          NaN
2  Switzerland  2019-01-01                        24.76
3  Switzerland  2018-01-01                        24.20
4  Switzerland  2017-01-01                        24.99

Although not strictly needed for seaborn we’re removing the missing values, which we consider to be a good general practice:

renewables.dropna(axis=0,subset=['renewable_energy_perc_total'], inplace=True)
print(renewables.head(5))
       country        date  renewable_energy_perc_total
2  Switzerland  2019-01-01                        24.76
3  Switzerland  2018-01-01                        24.20
4  Switzerland  2017-01-01                        24.99
5  Switzerland  2016-01-01                        24.10
6  Switzerland  2015-01-01                        23.94

And a final step is to convert the date to the datetime format which will open several possibilities and ensure a much better looking plot:

renewables['date'].dtype
renewables['date'] = pd.to_datetime(renewables['date'], infer_datetime_format=True)
renewables['date'].dtype
dtype('<M8[ns]')

Plot

We’ve opted to use seaborn. In particular when we want to use the hue this is a good approach. We assign the plot to a variable to allow to call directly some matplotlib methods for axis configuration:

sns.set_theme(style="whitegrid")
g=sns.relplot(x='date', y='renewable_energy_perc_total', hue='country', data=renewables, kind='line')
g.set(xlabel = 'Time', ylabel = 'Renewable energy consumption \n(% of total final energy consumption)')
<seaborn.axisgrid.FacetGrid at 0x7fb00e703510>

g.fig.suptitle('Renewable energy in selected countries')
plt.tight_layout()
plt.show()
<Figure size 672x480 with 0 Axes>