grouping by period and category

in this TP we work on

data that represents periods and not just one timestamp
checking for overlaps
grouping by period (week, month, year..)
then later on, grouping by period and category
and some simple visualization tools

here’s an example of the outputs we will obtain

imports¶

import pandas as pd
import matplotlib.pyplot as plt

make sure to use matplotlib in interactive mode - aka ipympl

# your code

optional: setup itables, so that we can have scrollable tables

# your code

the data¶

we have a table of events, each with a begin (beg) and end time; in addition each is attached to a country

leases = pd.read_csv("data/leases.csv")
leases.head(10)

adapt the type of each columns¶

surely the columns dtypes need some care

# your code

# check it

leases.dtypes

beg        object
end        object
country    object
dtype: object

raincheck¶

check that the data is well-formed, i.e. the end timestamp happens after beg

# your code

are there any overlapping events ?¶

it turns out there are no event overlap, but write a code that checks that this is true

# your code

timespan¶

What is the timespan covered by the dataset (earliest and latest events, and duration in-between) ?

# your code

aggregated duration¶

so, given that there is no overlap, we can assume this corresponds to “reservations” attached to a unique resource (hence the term lease)
write a code that computes the overall reservation time, as well as the average usage ratio over the overall timespan

# your code

visualization - grouping by period¶

usage by period¶

grouping by periods: by week, by month or by year, display the total usage in that period
(when ambiguous, use the beg column to determine if a lease is in a period or the other)

hint

There are at least 2 options to do this grouping, based on resample() and to_period()
advanced users may wish to write them both and to comment on their respective pros and cons

for now, just get the grouping right

you should produce something like e.g.

we’ll make cosmetic improvements below, and the final results look like this, but let’s not get ahead of ourselves

# your code

improve the title and bottom ticks¶

add a title to your visualisations

also, and particularly relevant in the case of the per-week visu, we don’t get to read the labels on the horizontal axis, because there are too many of them
to improve this, you can use matplotlib’s set_xticks() function; you can either figure out by yourself, or read the few tips below

a few tips

the object that receives the set_xticks() method is an instance of Axes (one X&Y axes system),
which is not the figure itself (a figure may contain several Axes)
ask google or chatgpt to find the way you can spot the Axes instance in your figure
it is not that clear in the docs, but all you need to do is to pass set_xticks a list of indices (integers)
i.e. if you have, say, a hundred bars, you could pass [0, 10, 20, ..., 100] and you will end up with one tick every 10 bars.
there are also means to use smaller fonts, which may help see more relevant info

# let's say as arule of thumb
LEGEND = {
    'W': "week",
    'M': "month",
    'Y': "year",
}

SPACES = {
    'W': 12,   # in the per-week visu, show one tick every 12 - so about one every 3 months
    'M': 3,    # one every 3 months
    'Y': 1,    # on all years
}

# your code

a function to convert to hours¶

you are to write a function that converts a pd.Timedelta into a number of hours

read and understand the test code for the details of what is expected
use it to test your own implementation

# your code

def convert_timedelta_to_hours(timedelta: pd.Timedelta) -> int:
    pass

# test it

# if an hour has started even by one second, it is counted
test_cases = ( 
    # input in seconds, expected result in hours
    (0, 0), 
    (1, 1),     (3599, 1),     (3600, 1), 
    (3601, 2),  (7199, 2),     (7200, 2), 
    # 2 hours + 1s -> 3 hours
    (7201, 3),  
    # 3 hours + 2 minutes -> 4 hours
    (pd.Timedelta(3, 'h') + pd.Timedelta(2, 'm'), 4),
    # 2 days -> 48 hours
    (pd.Timedelta(2, 'D'), 48),
)

def test_convert_timedelta_to_hours():
    for seconds, exp in test_cases:
        # convert into pd.Timedelta if not already one
        if not isinstance(seconds, pd.Timedelta):
            timedelta = pd.Timedelta(seconds=seconds)
        else:
            timedelta = seconds
        # compute and compare
        got = convert_timedelta_to_hours(timedelta)
        print(f"with {timedelta=} we get {got} and expected {exp} -> {got == exp}")

test_convert_timedelta_to_hours()

with timedelta=Timedelta('0 days 00:00:00') we get None and expected 0 -> False
with timedelta=Timedelta('0 days 00:00:01') we get None and expected 1 -> False
with timedelta=Timedelta('0 days 00:59:59') we get None and expected 1 -> False
with timedelta=Timedelta('0 days 01:00:00') we get None and expected 1 -> False
with timedelta=Timedelta('0 days 01:00:01') we get None and expected 2 -> False
with timedelta=Timedelta('0 days 01:59:59') we get None and expected 2 -> False
with timedelta=Timedelta('0 days 02:00:00') we get None and expected 2 -> False
with timedelta=Timedelta('0 days 02:00:01') we get None and expected 3 -> False
with timedelta=Timedelta('0 days 03:02:00') we get None and expected 4 -> False
with timedelta=Timedelta('2 days 00:00:00') we get None and expected 48 -> False

# for debugging; this should return 48

convert_timedelta_to_hours(pd.Timedelta(2, 'D'))

use it to display totals in hours¶

keep the same visu, but display the Y axis in hours
btw, what was the unit in the graphs above ?

# your code

grouping by period and region¶

the following table allows you to map each country into a region

# load it

countries = pd.read_csv("data/countries.csv")
countries.head(3)

a glimpse on regions¶

what’s the most effective way to see how many regions and how many countries per region ?

# your code

attach a region to each lease¶

your mission is to now show the same graphs, but we want to reflect the relative usage of each region, so we want to split each bar into several colors, one per region see expected result below

most likely your first move is to tag all leases with a region column

# your code

visu by period by region¶

you can now produce the target figures, again they look like this

# your code

pandas TPs

les données covid

pandas HOWTOs

groupements et décomptes