Methods for Groups and Aggregations¶
Most data operations are done on groups defined by columns. GroupByStep
takes an existing DataFrame and converts it into a Pandas DataFrameGroupBy
where aggregation/summarization/mutation operations are performed "by group".
A group by operation involves some combination of:
- Splitting the object:
GroupByStep()
- Applying functions:
SummarizeStep()
orMutateStep()
- And combining the results into a DataFrame.
Summarizations / Aggregations¶
In order to create one or more numeric variables summarizing the columns of an existing group created by GroupByStep()
you need to use SummarizeStep()
that will result in one row in the output for each group.
On a simple example:
recipe = Recipe([
# Group the data by client_id and branch columns
GroupByStep(['client_id', 'branch_id']),
# Let's summarize the data:
SummarizeStep({
# The total sales are the sum of the ticket_total
'total_sales': AggSum('ticket_total'),
# Number of sales to the client in that branch
'number_of_sales': AggCount('ticket_id')
})
])
client_id branch_id total_sales number_of_sales
0 2121 3AX 3453.4 12
1 2122 3AX 202 2
1 1034 4BA 25345 42
The most common aggregations are:
- AggMean: Calculate the mean
- AggMedian: Calculate the median
- AggMax: Calculate the maximum
- AggMin: Calculate the minimum
- AggCount: Count occurrences
- AggCountDistinct: Count unique occurrences
AggMean¶
yeast.aggregations.AggMean
(column)Calculate the mean of the grouped numeric column
AggMedian¶
yeast.aggregations.AggMedian
(column)AggMax¶
yeast.aggregations.AggMax
(column)Calculate the max of the grouped numeric column
AggMin¶
yeast.aggregations.AggMin
(column)Calculate the min of the grouped numeric column
AggCount¶
yeast.aggregations.AggCount
(column)Calculate the count/size of the grouped numeric column
AggCountDistinct¶
yeast.aggregations.AggCountDistinct
(column)Calculate the unique count of the grouped numeric column