import numpy as np
import pandas as pd

dataset 1
we have the following dataset: these are the results of a survey; there is one line per answer, and each participant has answered 3 questions

- what is the overall rating of the brand, from 1 to 4
- what are their most-liked and least-liked sites (among a finite list)
df1 = pd.read_excel('data/groupby-ratings.xlsx')
df1.head()

objective
we want to compute, for each rating, the number of occurrences of each site among the favorite and least-liked columns
# here the dataset is small
grouped = df1.groupby('rating')
grouped.size()

rating
1    4
2    5
3    5
4    4
dtype: int64

this is a good opportunity to see that when calling GroupBy.aggregate, one can pass a dictionary that says how to deal with each column
# if you need to see the doc, uncomment this line
# grouped.aggregate?

# note that this is rather fragile though,
# any variation around that theme is likely to break
# e.g. replacing one value_counts with something like 'sum' issues a RuntimeWarning...
counts = grouped.aggregate({
    # this refers to Series.value_counts()
'favorite': ['value_counts'],
'least-liked': ['value_counts'],
})
counts

how to understand this? e.g. the first line means that, among the people who gave a rating of 1:

- 2 have mentioned oxford as their favorite,
- and 1 has mentioned oxford as their least-liked
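as a quick sanity check (not part of the pipeline), we can recompute these numbers directly by filtering on the rating:

# restrict to the answers with a rating of 1,
# then count the sites in each column separately
rating1 = df1[df1['rating'] == 1]
rating1['favorite'].value_counts(), rating1['least-liked'].value_counts()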
# here again, the resulting dataframe has a multi-index for both horizontal and vertical dimensions
len(counts.index.levels), len(counts.columns.levels)

(2, 2)
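as an aside, DataFrame.xs gives a cross-section on one level of a multi-index; for instance, to extract all the rows about 'london' whatever the rating (just an illustration, equivalent to several .loc lookups):

# cross-section on the second level (the site) of the row multi-index
counts.xs('london', level=1)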
# what's the performance of 'london' among the people who gave a rating of 3
counts.loc[(3, 'london')]

favorite     value_counts    1.0
least-liked  value_counts    4.0
Name: (3, london), dtype: float64
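as noted above, the aggregate-with-value_counts trick is fragile; for the record, pd.crosstab is a more robust way to build this kind of contingency table (an alternative sketch, not used below):

# count the (rating, favorite) pairs; the same call with 'least-liked'
# would give the other half of the table
pd.crosstab(df1['rating'], df1['favorite'])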
dataset 2

df2 = pd.DataFrame(
{
"A": ["foo", "far", "foo", "bar", "foo", "bar", "foo", "foo"],
"B": ["one", "one", "two", "three", "two", "two", "one", "three"],
"C": np.arange(1, 9),
"D": 10-np.arange(1, 9),
})
df2

simple groupings
we consider two groupings:

- by A only
- by A and B

for each of these groupings, compute:

- the mean and min for C
- the max for D
# single criteria
grouped_a = df2.groupby('A')
# dual criteria
grouped_ab = df2.groupby(['A', 'B'])

here again, calling aggregate with a dictionary comes in handy
# remember that aggregate can be shortened into just 'agg'
grouped_a.agg({'C': ['mean', 'min'], 'D': 'max'})
# same on the dual-grouping
grouped_ab.agg({'C': ['mean', 'min'], 'D': 'max'})
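as a side note, pandas also supports so-called named aggregation, where each keyword argument names an output column; a minimal sketch equivalent to the call above:

# each keyword becomes a column in the result: output_name=(column, function)
grouped_a.agg(C_mean=('C', 'mean'), C_min=('C', 'min'), D_max=('D', 'max'))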
inspecting a GroupBy object

# how many items in each group
grouped_a.size()

A
bar    2
far    1
foo    5
dtype: int64

# more details of what's in each group; only the indexes show up
grouped_a.groups

{'bar': [3, 5], 'far': [1], 'foo': [0, 2, 4, 6, 7]}

# here the keys are 2-tuples because we have grouped on 2 criteria
grouped_ab.groups

{('bar', 'three'): [3], ('bar', 'two'): [5], ('far', 'one'): [1], ('foo', 'one'): [0, 6], ('foo', 'three'): [7], ('foo', 'two'): [2, 4]}

# one value for the key gives access to one dataframe
grouped_a.get_group('foo')

# same here, with a tuple as the key
grouped_ab.get_group(('foo', 'two'))
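note in passing that a GroupBy object is iterable: iterating yields (key, sub-dataframe) pairs, which is handy for quick inspection:

# each iteration yields the group key and the corresponding sub-dataframe
for key, subdf in grouped_a:
    print(key, len(subdf))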
grouping from a Series

for advanced users
# we could even group based on finer-grained criteria
# here just on the first letter of the 'A' column
# so the 'foo' and 'far' values are grouped together
grouped_a1c = df2.groupby(df2['A'].str[0])

grouped_a1c.groups

{'b': [3, 5], 'f': [0, 1, 2, 4, 6, 7]}
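this grouping behaves like any other one; for instance, the usual aggregations apply (just an illustration):

# sum of the 'C' column within each first-letter group
grouped_a1c['C'].sum()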
# explanation: here the parameter to groupby is a Series
df2['A'].str[0]

0    f
1    f
2    f
3    b
4    f
5    b
6    f
7    f
Name: A, dtype: object
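in fact any Series aligned on the dataframe's index can serve as the grouping key; a sketch with completely made-up labels (the 'x'/'y' values are hypothetical, just for illustration):

# group df2 by an external, index-aligned Series of arbitrary labels
labels = pd.Series(list('xyxyxyxy'), index=df2.index)
df2.groupby(labels).size()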
using a function to decide on the grouping

this means we can group arbitrarily; for example we want to group in two categories, depending on whether the 'B' column contains an 'o' or not
df2.B.unique()

array(['one', 'two', 'three'], dtype=object)

so this should give two groups: on the one hand the rows where B is among one and two, on the other hand the ones where B is three
# here too this is a finer-grained criterion
# and this time we define it from a function
# here we group together the lines where the 'B' cell has an 'o'
# which results in 2 groups: 'one' and 'two' in group1, and 'three' in group2
grouped_bo = df2.groupby(df2['B'].apply(lambda b: 'o' in b))
# the way we have written this, 'True' means 'o' is in the 'B' column
grouped_bo.size()

B
False    2
True     6
dtype: int64
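and get_group works as usual; here the key True retrieves the rows whose 'B' cell does contain an 'o':

# the 6 rows where 'B' contains an 'o'
grouped_bo.get_group(True)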
using 2 functions to decide on the grouping
this can be extended arbitrarily; we could add a grouping in a separate dimension, depending on whether the 'A' column contains an 'f' or not
df2.A.unique()

array(['foo', 'far', 'bar'], dtype=object)

criteria = df2.apply(lambda row: ('f' in row.A, 'o' in row.B), axis=1)

grouped_af_bo = df2.groupby(criteria)

# not very legible
grouped_af_bo.size()

(False, False)    1
(False, True)     1
(True, False)     1
(True, True)      5
dtype: int64

# a little nicer
grouped_af_bo = df2.groupby(
    df2.apply(
        lambda row: ('f in A' if 'f' in row.A else 'nope',
                     'o in B' if 'o' in row.B else 'nope'),
        axis=1))

# again
grouped_af_bo.size()

(f in A, nope)      1
(f in A, o in B)    5
(nope, nope)        1
(nope, o in B)      1
dtype: int64

groupby using buckets
# we could even group by buckets
# so here the values in C are split in 3 groups which should have the same size
# (but because we have 8 lines, not a multiple of 3, it ends up as 3 + 2 + 3)
grouped_auto_buckets = df2.groupby(
    pd.qcut(x=df2['C'], q=3, labels=['low', 'mid', 'high']),
    # passing observed explicitly silences a FutureWarning
    # about an upcoming change of the default
    observed=False)
grouped_auto_buckets.size()

C
low     3
mid     2
high    3
dtype: int64

# or we could define the boundaries ourselves
grouped_buckets = df2.groupby(pd.cut(df2['C'], [0, 3.5, 6.5, 10.]),
                              observed=False)
grouped_buckets.size()

C
(0.0, 3.5]     3
(3.5, 6.5]     3
(6.5, 10.0]    2
dtype: int64
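buckets being just groups, all the usual operations apply; for example the mean of the D column within each bucket (a quick illustration):

# mean of the 'D' column within each bucket
grouped_buckets['D'].mean()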