5 Minute Introduction¶
Eager to get started? This part of the documentation, begins with some background information about Yeast, then focuses on step-by-step instructions for Recipe development.
A Minimal Recipe¶
The idea of a Recipe
is to define a blueprint of data transformations that are executed
sequentially to process your data. A Recipe executes an ordered list of Steps
that are used to
bake
/transform your data. A minimal Yeast Recipe looks something like this:
from Yeast import Recipe, steps
recipe = Recipe([
# Convert all columns names to Snake Case:
steps.CleanColumnNamesStep('snake'),
# Drop duplicated rows
steps.DropDuplicatesStep(role="training")
])
# Prepare the recipe:
recipe = recipe.prepare(train)
# Process the training and testing set
baked_train = recipe.bake(train, role="training")
baked_test = recipe.bake(test, role="testing")
So what did that code do?
- First we imported the
Recipe
class and thesteps
module. An instance of a recipe will be our blueprint while the steps module contains a collection of well tested methods to simplify the job. - Next we created an instance of
Recipe
that will receive a list of steps that will be executed sequentially in order to process and clean the data. As an exampleCleanColumnNamesStep
will rename all columns to match the snake case convention. - Then we used
prepare(data)
to prepare the recipe before baking (processing). Some steps require preparation before be baked (executed). For example: calculate the mean before imputation. - Finally, we called
bake(data)
to execute all the steps on thedata
(a Pandas DataFrame).
What about the roles?
Roles are strings that can be used to define execution flows. By default all steps and recipes contains
the role all
, meaning that all steps are going to be executed during bake
. However, there are times
where we would like to circumvent some steps while baking on new data, for example, when some variables
are not available on the test set. The correct approach for this is to define a role in the step, in this
example was DropDuplicatesStep
with role="training"
indicating that this step will be executed when bake
has the role all
or training
. All other roles will skip the execution of this step, for example, testing
.
The Steps¶
Steps are a collection of well-tested instruments that you can use without too much study in your data processing flows. Each one has its own signature and specifics that you can discover on the API reference. This library will be evolving to cover more use cases. Some example:
from Yeast.steps import *
recipe = Recipe([
# Convert all columns names to Snake Case:
CleanColumnNamesStep('snake'),
# Create new columns/variables:
MutateStep({"name": StrToLower('name')}),
# Aggregate features based on groups
GroupByStep('city'), SummarizeStep({'total_sales': AggSum('sales')})
])
Selectors¶
Selectors are convenience classes that you can use to select columns based on some conditions, like their data type or name:
from Yeast.selectors import *
recipe = Recipe([
# Shortcut to keep only the numerical columns
SelectColumnsStep(AllNumeric()),
# Shortcut to keep only the numerical columns
SelectColumnsStep([AllString(), AllMatching('ed$')])
])
Filtering Data¶
In almost all of your data analysis you will need a method to filter out observations based on some
conditions. FilterStep
provides you a straightforward approach to define those rules and execute them:
from Yeast.selectors import *
recipe = Recipe([
# Filter out all ages less than 15 and where closed_account are 1
FilterStep('age < 15 & closed_account == 1')
])
Transforming Variables¶
Transformers are convenience classes that you can use to transform your data or create new variables based on those transformations. Some examples:
from Yeast.transformers import *
recipe = Recipe([
MutateStep({
# Replace all occurrences of "--" with "-"
'job_title': StrReplaceAll('--', '-'),
# Convert the name into Title Case
'fullname': StrToTitle(),
# Remove trail/leading whitespaces from the text
'description': StrTrim()
})
])
Aggregations¶
Most data operations are done on groups defined by columns. GroupByStep
takes an existing dataset
and converts it into a Pandas DataFrameGroupBy where aggregation/summarization/mutation operations are
performed "by group" using SummarizeStep
:
recipe = Recipe([
# Group the data by client_id and branch columns
GroupByStep(['client_id', 'branch_id']),
# Let's summarize the data:
SummarizeStep({
# The total sales are the sum of the ticket_total
'total_sales': AggSum('ticket_total'),
# Number of sales to the client in that branch
'number_of_sales': AggCount('ticket_id')
})
])
Merging Two Datasets¶
In practice, you’ll normally have many tables that contribute to an analysis, and you need flexible tools to combine them. Yeast supports four types of joins: left, right, inner and full:
recipe = Recipe([
# Left join with the `sales_df` dataset the current recipe
LeftJoinStep(sales_df, by="client_id")
])