Methods for Groups and Aggregations

Most data operations are performed on groups defined by one or more columns. GroupByStep takes an existing DataFrame and converts it into a pandas DataFrameGroupBy, on which aggregation, summarization, and mutation operations are performed "by group".

A group by operation involves some combination of:

  • Splitting the object: GroupByStep()
  • Applying functions: SummarizeStep() or MutateStep()
  • Combining the results into a DataFrame
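
The three stages can be sketched in plain Python, independent of yeast and pandas. This is only an illustration of the split/apply/combine idea, not how GroupByStep is implemented; the rows and column values are made up.

```python
from collections import defaultdict

# Hypothetical sales rows; the values are made up for this sketch.
rows = [
    {"branch_id": "3AX", "ticket_total": 10.0},
    {"branch_id": "4BA", "ticket_total": 7.5},
    {"branch_id": "3AX", "ticket_total": 2.5},
]

# 1. Split: bucket the rows by the value of the grouping column.
groups = defaultdict(list)
for row in rows:
    groups[row["branch_id"]].append(row)

# 2. Apply: reduce each bucket to a single value (here, a sum).
totals = {
    branch: sum(r["ticket_total"] for r in members)
    for branch, members in groups.items()
}

# 3. Combine: assemble one output row per group.
result = [
    {"branch_id": branch, "total_sales": total}
    for branch, total in sorted(totals.items())
]
print(result)
```

Each stage maps onto one bullet in the list: the dictionary of buckets plays the role of the grouped object, and the final list plays the role of the resulting DataFrame.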

Summarizations / Aggregations

To create one or more numeric variables that summarize the columns of a group created by GroupByStep(), use SummarizeStep(). The output contains one row for each group.

A simple example:

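# Import paths other than yeast.aggregations are assumed from the package layout
from yeast import Recipe
from yeast.steps import GroupByStep, SummarizeStep
from yeast.aggregations import AggSum, AggCount

recipe = Recipe([
    # Group the data by the client_id and branch_id columns
    GroupByStep(['client_id', 'branch_id']),
    # Summarize each group:
    SummarizeStep({
        # Total sales: the sum of ticket_total
        'total_sales': AggSum('ticket_total'),
        # Number of sales to the client in that branch
        'number_of_sales': AggCount('ticket_id')
    })
])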

The result has one row per group:

   client_id  branch_id  total_sales  number_of_sales
0       2121        3AX       3453.4               12
1       2122        3AX          202                2
2       1034        4BA        25345               42
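
To make the semantics of the recipe above concrete, here is a rough plain-Python sketch of the same computation. The `summarize` helper and the ticket rows are hypothetical and exist only for this illustration; they are not part of yeast, and the rows are not the data behind the table above.

```python
from collections import defaultdict


def summarize(rows, keys, spec):
    """Group rows by `keys` and emit one summary row per group.

    `spec` maps an output column name to (function, source column),
    loosely mirroring the dict passed to SummarizeStep above.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row)

    out = []
    for key, members in groups.items():
        # Start each output row with the grouping columns.
        summary = dict(zip(keys, key))
        for name, (func, column) in spec.items():
            summary[name] = func([m[column] for m in members])
        out.append(summary)
    return out


# Made-up tickets for illustration only.
tickets = [
    {"client_id": 2121, "branch_id": "3AX", "ticket_id": 1, "ticket_total": 20.0},
    {"client_id": 2121, "branch_id": "3AX", "ticket_id": 2, "ticket_total": 5.0},
    {"client_id": 2122, "branch_id": "3AX", "ticket_id": 3, "ticket_total": 8.0},
]

summary = summarize(
    tickets,
    keys=["client_id", "branch_id"],
    spec={
        "total_sales": (sum, "ticket_total"),
        "number_of_sales": (len, "ticket_id"),
    },
)
for row in summary:
    print(row)
```

Three input rows collapse into two output rows, one for each distinct (client_id, branch_id) pair, which is the "one row per group" behavior described for SummarizeStep().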

The most common aggregations are:

AggMean

class yeast.aggregations.AggMean(column)

Calculate the mean of the grouped numeric column

AggMedian

class yeast.aggregations.AggMedian(column)

Calculate the median of the grouped numeric column

AggMax

class yeast.aggregations.AggMax(column)

Calculate the max of the grouped numeric column

AggMin

class yeast.aggregations.AggMin(column)

Calculate the min of the grouped numeric column

AggCount

class yeast.aggregations.AggCount(column)

Calculate the count/size of the grouped numeric column

AggCountDistinct

class yeast.aggregations.AggCountDistinct(column)

Calculate the number of unique values of the grouped numeric column
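
As a quick reference, the sketch below computes the plain-Python counterpart of each aggregation on a single group's values. It uses only the standard library and made-up numbers; it assumes there are no missing values, and how yeast itself treats missing values is not covered here.

```python
import statistics

# One group's values for a single column (made-up numbers).
values = [12.0, 7.5, 12.0, 3.0]

summary = {
    "mean": statistics.mean(values),      # cf. AggMean
    "median": statistics.median(values),  # cf. AggMedian
    "max": max(values),                   # cf. AggMax
    "min": min(values),                   # cf. AggMin
    "count": len(values),                 # cf. AggCount
    "count_distinct": len(set(values)),   # cf. AggCountDistinct
}
print(summary)
```

The count and the distinct count differ here because the value 12.0 appears twice in the group.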

What's next?

Join two DataFrames together