import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

the data

URL = "http://www-sop.inria.fr/members/Arnaud.Legout/formationPython/Exos/covid-hospit-incid-reg-2021-06-08-19h09.csv"

# first attempt: read the csv with the default encoding (utf-8)
try:
    hos = pd.read_csv(URL)
except UnicodeDecodeError as e:
    print(f"decoding exception: {e}")
decoding exception: 'utf-8' codec can't decode byte 0xe9 in position 35814: invalid continuation byte

There is a decoding error, which means the csv is not encoded in utf-8. This is unfortunate, and the real encoding is not specified on the site. Instead of proceeding by trial and error, we will use chardet to detect the correct encoding.

%pip install chardet

import chardet
import urllib.request

rawdata = urllib.request.urlopen(URL).read()
chardet.detect(rawdata)
{'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}
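
To see why the utf-8 decoder rejects byte 0xe9, here is a minimal sketch on a made-up string (not taken from the csv). In ISO-8859-1 the letter é is the single byte 0xe9, whereas utf-8 treats 0xe9 as the start of a multi-byte sequence and expects continuation bytes after it.

```python
# Illustrative text only: it is an assumption, not a value read from the csv.
text = "La Réunion"

# In ISO-8859-1, "é" is encoded as the single byte 0xe9.
raw = text.encode("ISO-8859-1")
print(raw)

# utf-8 reads 0xe9 as the first byte of a multi-byte sequence,
# so the plain ASCII byte that follows is an invalid continuation byte.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as e:
    utf8_ok = False
    print(f"decoding exception: {e}")

# Decoding with the encoding used for writing recovers the text.
decoded = raw.decode("ISO-8859-1")
print(decoded)
```

Note that ISO-8859-1 maps every possible byte to a character, so decoding with it never raises. A successful decode therefore does not prove the guess is right, which is why it is worth inspecting the result.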

The encoding seems to be ISO-8859-1, with a confidence of 0.73 (that is, 73%). Let's try it.

try:
    hos = pd.read_csv(URL, encoding="ISO-8859-1")
except UnicodeDecodeError as e:
    print(f"decoding exception: {e}")

hos.head(2)

It works, but the csv file was not parsed correctly because its separator is a ; rather than the default comma.

# your code
hos = ...
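
As a hint, here is a minimal sketch of the effect of the separator, run on a tiny in-memory sample rather than on the real file. The column names jour, nomReg and incid_rea are the ones used in this notebook; the rows are invented for illustration.

```python
import io

import pandas as pd

# Invented rows, for illustration only (not taken from the real csv).
sample = (
    "jour;nomReg;incid_rea\n"
    "2021-01-04;Bretagne;3\n"
    "2021-01-05;Bretagne;5\n"
)

# With the default separator (a comma), each line ends up in a single column.
wrong = pd.read_csv(io.StringIO(sample))
print(wrong.shape)

# Passing sep=";" splits each line into the expected three columns.
right = pd.read_csv(io.StringIO(sample), sep=";")
print(right.shape)
print(list(right.columns))
```

The same sep argument can be combined with encoding="ISO-8859-1" when reading from URL.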

What is the dtype of each column?

# your code
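
One possible way to inspect the dtypes, sketched on the same kind of invented sample (the rows are an assumption, only the column names come from this notebook). The dtypes attribute gives one dtype per column.

```python
import io

import pandas as pd

# Invented rows, for illustration only.
sample = (
    "jour;nomReg;incid_rea\n"
    "2021-01-04;Bretagne;3\n"
    "2021-01-05;Bretagne;5\n"
)
df = pd.read_csv(io.StringIO(sample), sep=";")

# One dtype per column. Text columns are not parsed as dates by default:
# depending on the pandas version they show up as object or as a string dtype.
print(df.dtypes)
```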

Convert

jour and nomReg have the object dtype. It would be better to have jour as a DatetimeIndex and nomReg as a category.

# your code
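
A minimal sketch of one way to do the conversion, again on invented rows. It uses pd.to_datetime, astype("category") and set_index. Other routes exist, for example passing parse_dates and index_col to read_csv.

```python
import io

import pandas as pd

# Invented rows, for illustration only.
sample = (
    "jour;nomReg;incid_rea\n"
    "2021-01-04;Bretagne;3\n"
    "2021-01-05;Bretagne;5\n"
)
df = pd.read_csv(io.StringIO(sample), sep=";")

# Parse the date strings, then use them as the index.
df["jour"] = pd.to_datetime(df["jour"])
df = df.set_index("jour")

# Region names repeat a lot, so a categorical column is a natural fit.
df["nomReg"] = df["nomReg"].astype("category")

print(type(df.index).__name__)
print(df.dtypes)
```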

Compute the weekly sum of incid_rea and plot it as a bar plot (hint: use resample).

# your code
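
As a hint, here is a sketch of resample on a synthetic daily series (four weeks of ones, starting on a Monday). The data is an assumption chosen to make the weekly sums easy to check. With the "W" frequency, each bin is labelled by the Sunday that ends the week.

```python
import numpy as np
import pandas as pd

# Synthetic data: 28 days starting on Monday 2021-01-04, one admission per day.
days = pd.date_range("2021-01-04", periods=28, freq="D")
df = pd.DataFrame({"incid_rea": np.ones(28, dtype=int)}, index=days)

# Group the rows into weekly bins and sum each bin.
weekly = df["incid_rea"].resample("W").sum()
print(weekly)

# pandas bar plot of the weekly sums.
ax = weekly.plot.bar()
```

resample needs a DatetimeIndex (or a similar time-based index), which is why the conversion of jour comes first.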

It works, but the x-axis date representation is messy.

It is an issue specific to the bar plot in pandas. With a regular line plot, the x-axis is automatically optimized.

Let us see the solution together (found on stackoverflow...)

resampled_hos.plot()

To solve the issue with bar plots, we need to work with matplotlib

from matplotlib.dates import AutoDateLocator, ConciseDateFormatter, AutoDateFormatter

locator = AutoDateLocator()
# ConciseDateFormatter will infer the most compact date representation
formatter = ConciseDateFormatter(locator)

# AutoDateFormatter gives another representation
# formatter = AutoDateFormatter(locator)
ax = plt.gca()
ax.xaxis.set_major_locator(locator)
ax.xaxis.set_major_formatter(formatter)

# weekly sum of incid_rea (hos must have a DatetimeIndex)
resampled_hos = hos.resample('W')['incid_rea'].sum()
# we need to plot directly with matplotlib, otherwise the date formatting set above will not be taken into account
ax.bar(resampled_hos.index, resampled_hos, width=5)

# Uncomment if you use the AutoDateFormatter
# fig = plt.gcf()
# fig.autofmt_xdate()

Rolling

Now compute a 14-day rolling average of incid_rea (hint: use rolling) and plot it as a line plot.

# your code
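
As a hint, here is a sketch of rolling on a synthetic daily series (the values 0 to 27 are an assumption, chosen so the averages are easy to verify). With the real data, which appears to hold several regions per day, one would probably first aggregate to a single value per day, for example with resample("D").sum().

```python
import numpy as np
import pandas as pd

# Synthetic daily series: the values 0, 1, ..., 27 over 28 consecutive days.
days = pd.date_range("2021-01-04", periods=28, freq="D")
daily = pd.Series(np.arange(28, dtype=float), index=days, name="incid_rea")

# Window of 14 rows (here, 14 days): the first 13 results are NaN
# because the window is not yet full.
rolling_mean = daily.rolling(14).mean()
print(rolling_mean.head(15))

# A line plot handles the date axis without extra work.
ax = rolling_mean.plot()
```

A time-based window such as rolling("14D") is an alternative on a DatetimeIndex. It differs in that, by default, it also produces values for the first days, where the window is only partially filled.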