import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
The data
URL = "http://www-sop.inria.fr/members/Arnaud.Legout/formationPython/Exos/covid-hospit-incid-reg-2021-06-08-19h09.csv"
# we extract only 3 columns
try:
    hos = pd.read_csv(URL)
except UnicodeDecodeError as e:
    print(f"decoding exception: {e}")
decoding exception: 'utf-8' codec can't decode byte 0xe9 in position 35814: invalid continuation byte
There is a decoding error, which means the CSV is not encoded in UTF-8. This is unfortunate, and the real encoding is not specified on the site. Instead of proceeding by trial and error, we will use chardet to detect the correct encoding.
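The error message can be reproduced on a few bytes. The following is a minimal sketch, not part of the original notebook: the sample string and the choice of ISO-8859-1 as a candidate single-byte encoding are assumptions made for illustration. It shows that the byte 0xe9 is not valid UTF-8 on its own, while a single-byte encoding decodes it without complaint.

```python
# Hypothetical sample: "Réunion" encoded with a single-byte encoding.
# In ISO-8859-1 the character 'é' is stored as the single byte 0xe9.
raw = "Réunion".encode("iso-8859-1")

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as e:
    # In UTF-8, 0xe9 announces a multi-byte sequence, but the next byte
    # ('u') is not a continuation byte, hence the error.
    utf8_ok = False
    print(f"decoding exception: {e}")

# The same bytes decode cleanly with the single-byte encoding.
decoded = raw.decode("iso-8859-1")
print(decoded)
```

Note that a single-byte encoding accepts almost any byte, so a successful decode does not prove the encoding is the right one. This is why a detector that looks at byte statistics is useful.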
%pip install chardet
import chardet
import urllib.request
rawdata = urllib.request.urlopen(URL).read()
chardet.detect(rawdata)
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: chardet in /usr/lib/python3/dist-packages (5.2.0)
Note: you may need to restart the kernel to use updated packages.
{'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}
The encoding seems to be ISO-8859-1, with a confidence of 0.73 (that is, 73%). Let's try it.
try:
    hos = pd.read_csv(URL, encoding="ISO-8859-1")
except UnicodeDecodeError as e:
    print(f"decoding exception: {e}")
hos.head(2)
It works, but the CSV file was not parsed correctly because the separator is a semicolon (;)
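To make the symptom concrete, here is a small sketch on made-up data (the rows below are hypothetical; only the column names jour, nomReg and incid_rea come from this notebook). It compares the default comma separator with `sep=";"` on the same text.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated sample, for illustration only
sample = (
    "jour;nomReg;incid_rea\n"
    "2021-01-01;Bretagne;3\n"
    "2021-01-02;Bretagne;5\n"
)

# With the default separator (a comma), each line ends up in a single column
wrong = pd.read_csv(io.StringIO(sample))

# With sep=";", the three columns are split as expected
right = pd.read_csv(io.StringIO(sample), sep=";")

print(wrong.shape)
print(right.shape)
print(list(right.columns))
```

The same `sep` argument can be combined with `encoding` in a single `pd.read_csv` call.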
# your code
hos = ...
What is the dtype of each column?
# your code
Convert
jour and nomReg have the dtype object. It would be better to have jour as a DatetimeIndex and nomReg as a category.
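One possible way to do such a conversion is sketched below on a toy DataFrame. The values are invented, and this is only one approach among several (for instance, the parsing could also be done while reading the file).

```python
import pandas as pd

# Toy data with the same column names as the notebook; values are hypothetical
toy = pd.DataFrame(
    {
        "jour": ["2021-01-01", "2021-01-02", "2021-01-03"],
        "nomReg": ["Bretagne", "Bretagne", "Corse"],
        "incid_rea": [3, 5, 1],
    }
)

# Parse the date strings, then use them as the index
toy["jour"] = pd.to_datetime(toy["jour"])
toy = toy.set_index("jour")

# A column with few distinct values is a good fit for a categorical dtype
toy["nomReg"] = toy["nomReg"].astype("category")

print(type(toy.index).__name__)
print(toy["nomReg"].dtype)
```

Having the dates in the index is what later makes time-based operations such as resample available directly on the DataFrame.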
# your code
Compute the sum of incid_rea weekly and plot using a bar plot (hint: use resample)
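As a reminder of how resample groups rows, here is a minimal sketch on synthetic data (14 days with one admission per day, an assumption chosen so that the result is easy to check by hand). The "W" alias builds weekly bins that end on Sunday.

```python
import pandas as pd

# Synthetic daily series: 14 consecutive days starting on Monday 2021-01-04
idx = pd.date_range("2021-01-04", periods=14, freq="D")
toy = pd.DataFrame({"incid_rea": [1] * 14}, index=idx)

# Group the rows into weekly bins and sum each bin
weekly = toy.resample("W")["incid_rea"].sum()

print(weekly)
```

Each of the two weeks contains seven days, so each weekly sum is 7. The resulting Series can then be drawn with `weekly.plot.bar()`.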
# your code
It works, but the x-axis date representation is messy.
It is an issue specific to the bar plot in pandas. With a regular line plot, the x-axis is automatically optimized.
Let us look at the solution together (found on Stack Overflow).
resampled_hos.plot()
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[9], line 1
----> 1 resampled_hos.plot()
NameError: name 'resampled_hos' is not defined
To solve the issue with bar plots, we need to work with matplotlib
from matplotlib.dates import AutoDateLocator, ConciseDateFormatter, AutoDateFormatter
locator = AutoDateLocator()
# ConciseDateFormatter will infer the most compact date representation
formatter = ConciseDateFormatter(locator)
# AutoDateFormatter gives another representation
# formatter = AutoDateFormatter(locator)
ax=plt.gca()
ax.xaxis.set_major_locator(locator)
ax.xaxis.set_major_formatter(formatter)
resampled_hos = hos.resample('w')['incid_rea'].sum()
# we need to plot directly with matplotlib otherwise dates representation will not be taken into account.
ax.bar(resampled_hos.index, resampled_hos, width=5)
# To uncomment if we use the AutoDateFormatter
# fig = plt.gcf()
# fig.autofmt_xdate()
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[10], line 13
10 ax.xaxis.set_major_locator(locator)
11 ax.xaxis.set_major_formatter(formatter)
---> 13 resampled_hos = hos.resample('w')['incid_rea'].sum()
14 # we need to plot directly with matplotlib otherwise dates representation will not be taken into account.
15 ax.bar(resampled_hos.index, resampled_hos, width=5)
AttributeError: 'ellipsis' object has no attribute 'resample'
Rolling
Now compute a 14-day rolling average of incid_rea (hint: use rolling) and plot it as a line plot.
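The behaviour of rolling can be checked on a small synthetic series first. The sketch below assumes exactly one row per day (which may not hold for the real data if it has one row per region and per day) and uses the values 0 to 19 so that the averages are easy to verify.

```python
import pandas as pd

# Synthetic daily series with values 0..19 (assumption: one row per day)
idx = pd.date_range("2021-01-01", periods=20, freq="D")
toy = pd.Series(range(20), index=idx, name="incid_rea")

# Window of 14 rows; the first 13 positions have no full window yet
rolling_mean = toy.rolling(14).mean()

print(rolling_mean.iloc[13])   # mean of 0..13
print(rolling_mean.iloc[-1])   # mean of 6..19
```

The result can be drawn with `rolling_mean.plot()`. On a DatetimeIndex, a time-based window such as `rolling("14D")` is an alternative when rows are not evenly spaced.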
# your code