Steps Reference¶
This module exposes a collection of well-tested steps that you can use directly in your data processing pipelines.
Columns Operations
Execute operations over columns or predictors:
- SelectColumnsStep / SelectStep: Select a group of columns
- MutateStep: Create or transform column values
- RenameColumnsStep: Rename column names
- CastColumnsStep / CastStep: Cast the columns data types
- DropColumnsStep: Drop/Remove columns
- DropZVColumnsStep: Drop all columns that contain only a single value
- CleanColumnNamesStep: Clean all column names
- ReplaceNAStep: Replace missing values
- OrdinalEncoderStep: Encode discrete features as integer numbers
Row Operations
Execute operations over rows or values:
- FilterRowsStep / FilterStep: Filter values based on an expression
- SortRowsStep / SortStep: Sort values based on columns
Aggregations
Aggregate or Summarize data:
- GroupByStep: Group by rows based on columns
- SummarizeStep: Summarize the group by data
- DropDuplicateRowsStep / DropDuplicatesStep: Drop duplicate rows
Imputation
Impute missing data:
- MeanImputeStep: Impute numeric data using the mean value
- MedianImputeStep: Impute numeric data using the median value
- ConstantImputeStep: Impute data using a constant value
Workflows
Arrange and merge workflows and recipes:
- LeftJoinStep: Left Join with a DataFrame or Recipe
- RightJoinStep: Right Join with a DataFrame or Recipe
- InnerJoinStep: Inner Join with a DataFrame or Recipe
- FullJoinStep: Full Outer Join with a DataFrame or Recipe
Extensions
Customize Yeast behavior for your project:
- CustomStep: Step to add your own functionality
Columns Operations¶
SelectColumnsStep¶
yeast.steps.SelectColumnsStep
(columns, role='all')
Step in charge of keeping columns based on their names or selectors.
Parameters:
columns
: list of string column names, selectors or combinations to keep
role
: String name of the role to control baking flows on new data. Default: all.
Usage:
# ['A', 'B', 'C', 'D']
SelectColumnsStep('B')
# ['B']
# ['A', 'B', 'C', 'D']
SelectColumnsStep(['B', 'C'])
# ['B', 'C']
# ['A', 'B', 'C', 'D']
SelectColumnsStep(AllNumeric())
# ['B']
# ['AA', 'AB', 'C', 'D']
SelectColumnsStep([AllMatching('^A'), 'D'])
# ['AA', 'AB', 'D']
Raises:
YeastValidationError
: if any column does not exist or any column name is invalid.
MutateStep¶
yeast.steps.MutateStep
(transformers, role='all')
Create or transform variables, maintaining the number of rows, by applying a list of transformers. If more than one transformer is passed for a column, they are executed in order. New variables overwrite existing variables of the same name.
Parameters:
transformers
: Dictionary of transformers using keys as column names and values as transformers, e.g. { column_name: Transformer }. It also supports lambda functions, e.g. { var: lambda df: df }, and lists of transformers (lambdas or Transformers), e.g. { var: [tx1, tx2, ...] }.
role
: String name of the role to control baking flows on new data. Default: all.
Usage:
# Create a variable using a lambda function
Recipe([
    MutateStep({
        'total_sales': lambda df: df.sales + df.fee
    })
])
# Create or update variables using Transformers
Recipe([
    MutateStep({
        "name": StrToLower('name'),
        "uid": StrToUpper('uid')
    })
])
# If the output column is the same as the input column you don't need to
# set the column name. The result will be the same
Recipe([
    MutateStep({
        "name": StrToLower(),  # name will be transformed to lower case
        "uid": StrToUpper()  # uid will be transformed to upper case
    })
])
# Create or update variables using multiple Transformers
# You can use Transformers or Lambda functions
Recipe([
    MutateStep({
        "name": [
            StrReplace('-1', ''),
            StrToTitle('name')
        ]
    })
])
# Create or update variables using Group Transformers
Recipe([
    # Group the rows first, then create or update variables per group
    GroupByStep('client_id'),
    MutateStep({
        "row_number": RowNumber(),
        "lag_sales": NumericLag('sales'),
        "lead_sales": NumericLead('sales')
    })
])
# Create or update a variable using a custom function
def new_variable(df):
    return df.sales / 1e6
Recipe([
    MutateStep({
        'mean_sales': new_variable,
    })
])
Raises:
YeastBakeError
: If there was an error executing any transformer
YeastValidationError
: xxx
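The rule that several transformers on one column run in order can be pictured without Yeast itself. The following is a minimal plain-Python sketch, not the library's implementation: `apply_in_order` is a hypothetical helper and the column is modeled as a plain list.

```python
def apply_in_order(values, transformers):
    """Apply each callable to the whole column, feeding each result to the next."""
    for transform in transformers:
        values = [transform(value) for value in values]
    return values

names = ["ana-1", "luis-1"]
cleaned = apply_in_order(names, [
    lambda s: s.replace("-1", ""),  # first: strip the suffix
    str.title,                      # then: title-case the stripped value
])
print(cleaned)  # ['Ana', 'Luis']
```

Swapping the two callables would title-case before stripping, which is why the order of the list matters.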
RenameColumnsStep¶
yeast.steps.RenameColumnsStep
(mapping, role='all')
Step in charge of renaming columns based on a mapping dictionary. Columns that don't exist are ignored.
Parameters:
mapping
: rename mapping as { 'old_name': 'new_name', ... }
role
: String name of the role to control baking flows on new data. Default: all.
Usage:
RenameColumnsStep({
    'old_column_name': 'new_column_name'
})
Raises:
YeastValidationError
: if any old or new column name is not a string.
CastColumnsStep¶
yeast.steps.CastColumnsStep
(mapping, role='all')
Step in charge of casting columns to a type based on a mapping dictionary.
Available Types:
- category
- string, str
- boolean, bool
- integer, int64, int32
- float, float64, float32
- date, datetime, datetime64
Parameters:
mapping
: Casting mapping as { 'column_name': 'type', ... }
role
: String name of the role to control baking flows on new data. Default: all.
Usage:
CastColumnsStep({
    'title': 'string',
    'year': 'integer',
    'aired': 'datetime',
})
Raises:
YeastValidationError
: if any column or type is not correct.
DropColumnsStep¶
yeast.steps.DropColumnsStep
(columns, role='all')
Step in charge of dropping columns based on their names.
Parameters:
columns
: list of string column names to drop or a selector.
role
: String name of the role to control baking flows on new data. Default: all.
Usage:
# DataFrame columns: ['A', 'B', 'C', 'D']
DropColumnsStep(['B', 'C'])
# DataFrame result columns: ['A', 'D']
Raises:
YeastValidationError
: if any column does not exist or any column name is invalid.
DropZVColumnsStep¶
yeast.steps.DropZVColumnsStep
(selector=None, naomit=False, role='all')
Drop all columns that contain only a single value (Zero Variance).
Notes:
The parameter naomit indicates whether NA should be considered a value.
If naomit=False, then [NA, 'a'] contains two values and the column is not removed.
If naomit=True, then [NA, 'a'] contains only one value and the column is removed, because NA is not considered.
Parameters:
selector
: list of string column names or a selector to check for zero variance.
naomit
: True if NA is not considered a value, False otherwise.
role
: String name of the role to control baking flows on new data. Default: all.
Usage:
# Remove all columns that contain zero variance:
recipe = Recipe([
    DropZVColumnsStep()
])
# Remove all numerical columns that contain zero variance:
recipe = Recipe([
    DropZVColumnsStep(AllNumeric())
])
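The effect of naomit described in the notes can be checked with a small plain-Python sketch. It is not Yeast's implementation: `is_zero_variance` is a hypothetical helper, and `None` stands in for NA.

```python
def is_zero_variance(values, naomit=False):
    """Return True when the column holds at most one distinct value."""
    if naomit:
        # NA (modeled here as None) is not counted as a value
        values = [value for value in values if value is not None]
    return len(set(values)) <= 1

column = [None, "a"]
print(is_zero_variance(column, naomit=False))  # False: NA and 'a' are two values
print(is_zero_variance(column, naomit=True))   # True: only 'a' is counted
```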
CleanColumnNamesStep¶
yeast.steps.CleanColumnNamesStep
(case='snake', role='all')
Step in charge of cleaning all column names.
Available cases:
- snake: from 'column Name' to 'column_name'
- lower_camel: from 'column Name' to 'columnName'
- upper_camel: from 'column Name' to 'ColumnName'
Parameters:
case
: case that will be used on the columns, snake by default.
role
: String name of the role to control baking flows on new data. Default: all.
Usage:
# DataFrame columns: ['TheName', 'B', 'the name', 'Other Name']
CleanColumnNamesStep('snake')
# DataFrame result columns: ['the_name', 'b', 'the_name', 'other_name']
Raises:
YeastValidationError
: if the case is not available.
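The three cases can be approximated in plain Python. This is only a rough sketch under stated assumptions, not the step's actual code: `split_words` and `clean_name` are hypothetical names, words are split on whitespace, underscores, hyphens and lower-to-upper boundaries, and the real step may handle edge cases differently.

```python
import re

def split_words(name):
    """Split a column name into lowercase words."""
    # Insert a space at lower-to-upper boundaries: 'TheName' -> 'The Name'
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", name)
    return [word.lower() for word in re.split(r"[\s_\-]+", spaced) if word]

def clean_name(name, case="snake"):
    words = split_words(name)
    if case == "snake":
        return "_".join(words)
    if case == "lower_camel":
        return words[0] + "".join(word.capitalize() for word in words[1:])
    if case == "upper_camel":
        return "".join(word.capitalize() for word in words)
    raise ValueError(f"unknown case: {case}")

columns = ["TheName", "B", "the name", "Other Name"]
print([clean_name(c, "snake") for c in columns])
# ['the_name', 'b', 'the_name', 'other_name']
```

As in the usage example above, two different source names can collapse to the same cleaned name.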
ReplaceNAStep¶
yeast.steps.ReplaceNAStep
(mapping, value=0, role='all')
Replace missing values.
Parameters:
mapping
: replacement mapping as {'column': replacement_value, ...} or a string column name.
value
: value to replace the NAs with if mapping is a string column name. Default: 0
role
: String name of the role to control baking flows on new data. Default: all.
Usage:
# Replace NA values in one column
ReplaceNAStep('factor', 1.0)
# Replace NA values on several columns
ReplaceNAStep({
    'factor': 1.00,
    'pending': 0.00,
    'category': 'other'
})
Raises:
YeastValidationError
: if a column does not exist on the dataframe
OrdinalEncoderStep¶
yeast.steps.OrdinalEncoderStep
(selector, role='all')
Encode categorical/string discrete features as integer numbers (0 to n - 1).
Parameters:
selector
: List of columns, column name or selector.
role
: String name of the role to control baking flows on new data. Default: all.
Usage:
recipe = Recipe([
    # Ordinal Encode the gender column
    OrdinalEncoderStep('gender')
])
# Extract the categories and the values on the prepare
recipe = recipe.prepare(train_data)
# Encode on new data without changing the values
test_data = recipe.bake(test_data)
# Example:
# Gender: 'Male', 'Female', 'Male', None, 'Male', 'Female'
# Encoded: 0, 1, 0, NA, 0, 1
Raises:
YeastValidationError
: If the column was not found
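The prepare/bake split matters here because the category-to-integer mapping is learned once and then reused. The sketch below illustrates that idea in plain Python. It is not the library's implementation: `OrdinalEncoderSketch` is hypothetical, categories are numbered in order of first appearance (which matches the example above), and mapping unseen categories to `None` is an assumption.

```python
class OrdinalEncoderSketch:
    def prepare(self, values):
        # Learn the categories from the training data, skipping missing values
        self.codes = {}
        for value in values:
            if value is not None and value not in self.codes:
                self.codes[value] = len(self.codes)
        return self

    def bake(self, values):
        # Reuse the learned codes; missing or unseen values stay as None
        return [self.codes.get(value) for value in values]

train = ["Male", "Female", "Male", None, "Male", "Female"]
encoder = OrdinalEncoderSketch().prepare(train)
print(encoder.bake(train))                # [0, 1, 0, None, 0, 1]
print(encoder.bake(["Female", "Male"]))   # [1, 0]
```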
Row Operations¶
FilterRowsStep¶
yeast.steps.FilterRowsStep
(expression, role='all', **kwargs)
Step in charge of filtering rows based on boolean conditions.
Operators:
- &, |, and, or, (, )
- in, not in, ==, !=, >, <, <=, >=
- +, ~, not
Notes:
- You can refer to column names with spaces or operators by surrounding them in backticks.
- You can refer to variables in the environment by prefixing them with an '@', like @a + b.
Parameters:
expression
: The query string to evaluate.
role
: String name of the role to control baking flows on new data. Default: all.
Usage:
# Subset a DataFrame based on a numeric variable
FilterRowsStep('age > 20')
# Subset a DataFrame based on a categorical / string variable
FilterRowsStep('category == "Sci-Fi"')
# Subset a DataFrame comparing two columns
FilterRowsStep('seasons > rating')
# Subset based on Multiple comparisons
FilterRowsStep('(watched == True) and seasons in [2, 7]')
# Subset referencing local variables inside the filter
was_watched = False
FilterRowsStep('watched == @was_watched')
# Subset referencing a column name that contains spaces, using backticks:
FilterRowsStep('`episode title` == "Hello"')
Raises:
YeastValidationError
: if the expression is an empty string.
SortRowsStep¶
yeast.steps.SortRowsStep
(columns, ascending=True, role='all')
Step in charge of sorting rows based on columns.
Parameters:
columns
: list of string column names to sort by
ascending
: boolean flag to sort ascending vs. descending
role
: String name of the role to control baking flows on new data. Default: all.
Usage:
SortRowsStep(['B', 'C'])
Raises:
YeastValidationError
: if any column does not exist or any column name is invalid.
DropDuplicateRowsStep¶
yeast.steps.DropDuplicateRowsStep
(columns=None, keep='first', role='all')
Step in charge of removing duplicate rows, optionally only considering certain columns.
Parameters:
columns
: list of string column names to look for duplicates or a selector
keep
: (first, last, none) Determines which duplicates (if any) to keep.
- first: Drop duplicates except for the first occurrence.
- last: Drop duplicates except for the last occurrence.
- none: Drop all duplicates.
role
: String name of the role to control baking flows on new data. Default: all.
Usage:
# Remove duplicates considering all columns, keep the first occurrence
DropDuplicatesStep()
# Remove duplicates considering columns B and C
DropDuplicatesStep(['B', 'C'], keep="none")
# Removing duplicates considering all columns starting with id_
DropDuplicatesStep(AllMatching('^id_'), keep="first")
Raises:
YeastValidationError
: if any column does not exist or any column name is invalid.
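The three keep options differ only in which member of each duplicate group survives. A minimal plain-Python sketch of those semantics follows. It is an illustration, not Yeast's code: `drop_duplicates` is a hypothetical function and rows are modeled as dicts.

```python
from collections import Counter

def drop_duplicates(rows, columns=None, keep="first"):
    """Drop duplicate rows (dicts); `columns` limits which keys define a duplicate."""
    if keep not in ("first", "last", "none"):
        raise ValueError(f"unknown keep option: {keep}")

    def key(row):
        names = columns if columns is not None else sorted(row)
        return tuple(row[name] for name in names)

    if keep == "none":
        # Keep only rows whose key appears exactly once
        counts = Counter(key(row) for row in rows)
        return [row for row in rows if counts[key(row)] == 1]

    # For 'last', scan from the end and restore the original order afterwards
    ordered = rows if keep == "first" else rows[::-1]
    seen, kept = set(), []
    for row in ordered:
        if key(row) not in seen:
            seen.add(key(row))
            kept.append(row)
    return kept if keep == "first" else kept[::-1]

rows = [
    {"B": 1, "C": "x", "D": 10},
    {"B": 1, "C": "x", "D": 20},
    {"B": 2, "C": "y", "D": 30},
]
print([r["D"] for r in drop_duplicates(rows, ["B", "C"], keep="first")])  # [10, 30]
print([r["D"] for r in drop_duplicates(rows, ["B", "C"], keep="last")])   # [20, 30]
print([r["D"] for r in drop_duplicates(rows, ["B", "C"], keep="none")])   # [30]
```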
Aggregations¶
GroupByStep¶
yeast.steps.GroupByStep
(columns, role='all')
Most data operations are done on groups defined by columns. GroupByStep takes an existing DataFrame and converts it into a Pandas DataFrameGroupBy where aggregation/summarization/mutation operations are performed "by group".
A groupby operation involves some combination of:
- Splitting the object: GroupByStep()
- Applying functions: SummarizeStep() or MutateStep()
- And combining the results into a DataFrame.
Parameters:
columns
: list of string column names to group by or a selector
role
: String name of the role to control baking flows on new data. Default: all.
Usage:
# Basic Group By and an Aggregation
recipe = Recipe([
    GroupByStep(['category', 'year']),
    SummarizeStep({
        'average_rating': AggMean('rating'),
        'unique_titles': AggCountDistinct('title')
    })
])
Raises:
YeastValidationError
: if a column does not exist on the DataFrame
SummarizeStep¶
yeast.steps.SummarizeStep
(aggregations, role='all')
Create one or more numeric variables summarizing the columns of an existing group created by GroupByStep(), resulting in one row in the output for each group. Please refer to the Aggregations documentation to see the complete list of supported aggregations.
The most used ones are: AggMean, AggMedian, AggCount, AggMax, AggMin.
Parameters:
aggregations
: dictionary with the aggregations to perform. The key is the new column name and the value is the specification of the aggregation to perform. For example: {'new_column_name': AggMean('column')}
role
: String name of the role to control baking flows on new data. Default: all.
Usage:
# Basic Summarization on a Group
recipe = Recipe([
    GroupByStep(['category', 'year']),
    SummarizeStep({
        'average_rating': AggMean('rating'),
        'unique_titles': AggCountDistinct('title')
    })
])
Raises:
YeastValidationError
: If there was not a GroupByStep before
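The split, apply and combine stages behind GroupByStep and SummarizeStep can be sketched in plain Python. This is an illustration only: `group_by` and `summarize` are hypothetical helpers, rows are dicts, and each aggregation is written as a (source column, function) pair, which is an assumption and not Yeast's aggregation API.

```python
from collections import defaultdict
from statistics import mean

def group_by(rows, columns):
    """Split: collect rows under the tuple of their grouping values."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[name] for name in columns)].append(row)
    return groups

def summarize(groups, columns, aggregations):
    """Apply each aggregation per group and combine into one row per group."""
    result = []
    for key, members in groups.items():
        out = dict(zip(columns, key))
        for new_name, (source, func) in aggregations.items():
            out[new_name] = func([member[source] for member in members])
        result.append(out)
    return result

shows = [
    {"category": "Sci-Fi", "year": 2019, "rating": 8, "title": "A"},
    {"category": "Sci-Fi", "year": 2019, "rating": 6, "title": "B"},
    {"category": "Drama", "year": 2020, "rating": 9, "title": "C"},
]
by = ["category", "year"]
summary = summarize(group_by(shows, by), by, {
    "average_rating": ("rating", mean),
    "unique_titles": ("title", lambda titles: len(set(titles))),
})
for row in summary:
    print(row)
```

The three input rows collapse into two output rows, one per (category, year) pair.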
Imputation¶
MeanImputeStep¶
yeast.steps.MeanImputeStep
(selector, role='all')
Impute numeric data using the mean.
MeanImputeStep estimates each variable's mean from the data passed to prepare, then replaces the NA values in new data sets with the calculated means.
Parameters:
selector
: list of string column names or a selector to impute
role
: String name of the role to control baking flows on new data. Default: all.
Usage:
# Impute the age and size columns using the mean from the training set
# Age : 20, 31, 65, NA, 45, 23, NA
# Size: 2, 5, 9, 3, 4, NA, NA
# to
# Age : 20, 31, 65, 36.8, 45, 23, 36.8 (mean=36.8)
# Size: 2, 5, 9, 3, 4, 4.6, 4.6 (mean=4.6)
MeanImputeStep(['age', 'size'])
# You can also use selectors:
MeanImputeStep(AllNumeric())
Raises:
YeastValidationError
: if a column does not exist
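The key point is that the mean is estimated once during prepare and reused on any later data. A small plain-Python sketch of that behavior, using the age column from the usage example, is shown below. `MeanImputeSketch` is hypothetical and `None` stands in for NA; it is not the step's implementation.

```python
from statistics import mean

class MeanImputeSketch:
    def prepare(self, train_values):
        # Estimate the mean from the non-missing training values only
        self.mean = mean([v for v in train_values if v is not None])
        return self

    def bake(self, values):
        # Fill missing values with the mean learned during prepare
        return [self.mean if v is None else v for v in values]

age = [20, 31, 65, None, 45, 23, None]
imputer = MeanImputeSketch().prepare(age)
print(imputer.bake(age))         # [20, 31, 65, 36.8, 45, 23, 36.8]
print(imputer.bake([None, 50]))  # [36.8, 50]
```

Baking new data does not recompute the mean, so the second call still uses the training value.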
MedianImputeStep¶
yeast.steps.MedianImputeStep
(selector, role='all')
Impute numeric data using the median.
MedianImputeStep estimates each variable's median from the data passed to prepare, then replaces the NA values in new data sets with the calculated medians.
Parameters:
selector
: list of string column names or a selector to impute
role
: String name of the role to control baking flows on new data. Default: all.
Usage:
# Impute the age and size columns using the median from the training set
# Age : 20, 31, 65, NA, 45, 23, NA
# Size: 2, 5, 9, 3, 4, NA, NA
# to
# Age : 20, 31, 65, 31, 45, 23, 31 (median=31)
# Size: 2, 5, 9, 3, 4, 4, 4 (median=4)
MedianImputeStep(['age', 'size'])
# You can also use selectors:
MedianImputeStep(AllNumeric())
Raises:
YeastValidationError
: if a column does not exist
ConstantImputeStep¶
yeast.steps.ConstantImputeStep
(selector, value, role='all')
Impute data using a constant value.
ConstantImputeStep replaces all NA values in the columns with a constant value. This step does not validate the column data type before imputing, so it can produce mixed types in a column.
Parameters:
selector
: list of string column names or a selector to impute
value
: constant value to replace with
role
: String name of the role to control baking flows on new data. Default: all.
Usage:
# Numerical:
# Impute the age and size columns using zero as value
# Age : 20, 31, 65, NA, 45, 23, NA
# Size: 2, 5, 9, 3, 4, NA, NA
# to
# Age : 20, 31, 65, 0, 45, 23, 0
# Size: 2, 5, 9, 3, 4, 0, 0
ConstantImputeStep(['age', 'size'], value=0)
# Categorical:
# Impute the security column with "other"
# security: 'stock', 'bond', 'etf', 'mf', NA
# to
# security: 'stock', 'bond', 'etf', 'mf', 'other'
ConstantImputeStep(['security'], value='other')
# You can also use selectors:
ConstantImputeStep(AllNumeric(), value=0)
Raises:
YeastValidationError
: if a column does not exist
Workflows¶
LeftJoinStep¶
yeast.steps.LeftJoinStep
(y, by=None, df=None, role='all')
Left Join two DataFrames together.
Return all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
Parameters:
y
: DataFrame or Recipe to merge with.
by
: optional column name list to merge by. Default: None
df
: optional df to be used as input if y is a Recipe
role
: String name of the role to control baking flows on new data. Default: all.
Usage:
# Left Join with another DataFrame
# sales_df and client_df are DataFrames, by argument is optional
Recipe([
    LeftJoinStep(sales_df, by="client_id")
]).bake(client_df)
# Left join with the DataFrame obtained from the execution of a Recipe
# sales_recipe will be executed using sales_df inside the client_recipe execution
sales_recipe = Recipe([
    RenameStep({'client_id': 'cid'})
])
client_recipe = Recipe([
    LeftJoinStep(sales_recipe, by=["client_id", "region_id"], df=sales_df)
])
client_recipe.prepare(client_df).bake(client_df)
Raises:
YeastValidationError
: if any of the validations is not correct.
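The left join rules above (every row of x kept, NA for unmatched rows, one output row per match) can be illustrated with a plain-Python sketch. It is not how Yeast performs the join: `left_join` is a hypothetical function, tables are lists of dicts, and `None` stands in for NA.

```python
def left_join(x, y, by):
    """Return every row of x, extended with matching rows of y."""
    def key(row):
        return tuple(row[name] for name in by)

    # Columns contributed by y, in first-seen order
    new_columns = list(dict.fromkeys(
        name for row in y for name in row if name not in by
    ))
    joined = []
    for left in x:
        matches = [right for right in y if key(right) == key(left)]
        if not matches:
            # No match in y: keep the x row and fill the new columns with None
            joined.append({**left, **{n: None for n in new_columns if n not in left}})
        for right in matches:
            # One output row per matching combination
            joined.append({**left, **right})
    return joined

clients = [{"client_id": 1, "name": "Ana"}, {"client_id": 2, "name": "Luis"}]
sales = [{"client_id": 1, "amount": 10}, {"client_id": 1, "amount": 20}]
for row in left_join(clients, sales, by=["client_id"]):
    print(row)
```

Client 1 appears twice (two matching sales) and client 2 appears once with a missing amount.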
RightJoinStep¶
yeast.steps.RightJoinStep
(y, by=None, df=None, role='all')
Right Join two DataFrames together.
Return all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
Parameters:
y
: DataFrame or Recipe to merge with.
by
: optional column name list to merge by. Default: None
df
: optional df to be used as input if y is a Recipe
role
: String name of the role to control baking flows on new data. Default: all.
Usage:
# Right Join with another DataFrame
# sales_df and client_df are DataFrames, by argument is optional
Recipe([
    RightJoinStep(sales_df, by="client_id")
]).bake(client_df)
# Right join with the DataFrame obtained from the execution of a Recipe
# sales_recipe will be executed using sales_df inside the client_recipe execution
sales_recipe = Recipe([
    RenameStep({'client_id': 'cid'})
])
client_recipe = Recipe([
    RightJoinStep(sales_recipe, by=["client_id", "region_id"], df=sales_df)
])
client_recipe.prepare(client_df).bake(client_df)
Raises:
YeastValidationError
: if any of the validations is not correct.
InnerJoinStep¶
yeast.steps.InnerJoinStep
(y, by=None, df=None, role='all')
Inner Join two DataFrames together.
Return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combinations of the matches are returned.
Parameters:
y
: DataFrame or Recipe to merge with.
by
: optional column name list to merge by. Default: None
df
: optional df to be used as input if y is a Recipe
role
: String name of the role to control baking flows on new data. Default: all.
Usage:
# Inner Join with another DataFrame
# sales_df and client_df are DataFrames, by argument is optional
Recipe([
    InnerJoinStep(sales_df, by="client_id")
]).bake(client_df)
# Inner join with the DataFrame obtained from the execution of a Recipe
# sales_recipe will be executed using sales_df inside the client_recipe execution
sales_recipe = Recipe([
    RenameStep({'client_id': 'cid'})
])
client_recipe = Recipe([
    InnerJoinStep(sales_recipe, by=["client_id", "region_id"], df=sales_df)
])
client_recipe.prepare(client_df).bake(client_df)
Raises:
YeastValidationError
: if any of the validations is not correct.
FullJoinStep¶
yeast.steps.FullJoinStep
(y, by=None, df=None, role='all')
Full Join two DataFrames together.
Return all rows and all columns from both x and y. Where there are no matching values, NA is returned for the missing side.
Parameters:
y
: DataFrame or Recipe to merge with.
by
: optional column name list to merge by. Default: None
df
: optional df to be used as input if y is a Recipe
role
: String name of the role to control baking flows on new data. Default: all.
Usage:
# Full Outer Join with another DataFrame
# sales_df and client_df are DataFrames, by argument is optional
Recipe([
    FullJoinStep(sales_df, by="client_id")
]).bake(client_df)
# Full Outer join with the DataFrame obtained from the execution of a Recipe
# sales_recipe will be executed using sales_df inside the client_recipe execution
sales_recipe = Recipe([
    RenameStep({'client_id': 'cid'})
])
client_recipe = Recipe([
    FullJoinStep(sales_recipe, by=["client_id", "region_id"], df=sales_df)
])
client_recipe.prepare(client_df).bake(client_df)
Raises:
YeastValidationError
: if any of the validations is not correct.
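A full outer join keeps unmatched rows from both sides. The plain-Python sketch below illustrates that rule only. It is not the step's implementation: `full_join` is a hypothetical function, tables are lists of dicts, and `None` stands in for NA.

```python
def full_join(x, y, by):
    """Keep every row of x and y (lists of dicts); fill gaps with None."""
    def key(row):
        return tuple(row[name] for name in by)

    # Column order: x's columns first, then the columns only y has
    columns = list(dict.fromkeys(name for row in x + y for name in row))
    x_keys = {key(row) for row in x}

    joined = []
    for left in x:
        matches = [right for right in y if key(right) == key(left)]
        for right in matches or [{}]:  # [{}] keeps x rows without a match
            merged = {**left, **right}
            joined.append({name: merged.get(name) for name in columns})
    for right in y:
        if key(right) not in x_keys:  # y rows that matched nothing in x
            joined.append({name: right.get(name) for name in columns})
    return joined

clients = [{"client_id": 1, "name": "Ana"}, {"client_id": 2, "name": "Luis"}]
sales = [{"client_id": 2, "amount": 5}, {"client_id": 3, "amount": 7}]
for row in full_join(clients, sales, by=["client_id"]):
    print(row)
```

Client 1 has no sale, client 3 has no client record, and both still appear in the output with `None` on the missing side.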
Extensions¶
CustomStep¶
yeast.steps.CustomStep
(to_prepare=None, to_bake=None, to_validate=None, role='all')
CustomStep extends Yeast Pipelines to cover scenarios where the built-in steps are not adequate and you need to define your own operations: custom transformations, business rules or integrations with third-party libraries. The usage is straightforward and designed to avoid spending too much time on the implementation. It accepts between one and three arguments, each an optional function:
- to_validate(step, df)
- to_prepare(step, df): returns df
- to_bake(step, df): returns df
Please notice that to_prepare and to_bake must return a DataFrame to continue the pipeline execution in further steps. CustomStep enables you to structure and document your code and business rules in Steps that could be shared across Recipes.
Parameters:
to_validate
: perform validations on the data. Raise YeastValidationError on a problem.
to_prepare
: prepare the step before bake, like train or calculate aggregations.
to_bake
: execute the bake (processing). This is the core method.
role
: String name of the role to control baking flows on new data. Default: all.
Inline Usage:
recipe = Recipe([
    # Custom Business Rules (the function must return a DataFrame):
    CustomStep(to_bake=lambda step, df: df.fillna({'sales': 0}))
])
Custom rules:
def my_bake(step, df):
    # Calculate total sales or anything you need:
    df['total_sales'] = df['sales'] + df['fees']
    return df
recipe = Recipe([
    # Custom Business Rules:
    CustomStep(to_bake=my_bake)
])
Custom Checks and Validations:
def my_validate(step, df):
    if 'sales' not in df.columns:
        raise YeastValidationError('sales column not found')
    if 'fees' not in df.columns:
        raise YeastValidationError('fees column not found')
recipe = Recipe([
    CustomStep(to_validate=my_validate, to_bake=my_bake)
])
Define the Estimation/Preparation procedure:
def my_preparation(step, df):
    # Store the estimate on the step so that bake can reuse it
    step.mean_sales = df['sales'].mean()
    return df
def my_bake(step, df):
    df['sales_deviation'] = df['sales'] - step.mean_sales
    return df
recipe = Recipe([
    CustomStep(to_prepare=my_preparation, to_bake=my_bake)
])
Creating a custom step inheriting from CustomStep:
class MyCustomStep(CustomStep):
    def do_validate(self, df):
        # Some validations that could raise YeastValidationError
        pass
    def do_prepare(self, df):
        # Prepare the step if needed
        return df
    def do_bake(self, df):
        # Logic to process the df
        return df
recipe = Recipe([
    MyCustomStep()
])
Raises:
YeastValidationError
: if any of the parameters is defined but not callable.
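How the three callables might cooperate can be pictured with a plain-Python stand-in. This sketch is not Yeast's code and rests on labeled assumptions: `CustomStepSketch` is hypothetical, validation is assumed to run before preparation, `TypeError` stands in for YeastValidationError, and the data is a plain list of numbers and not a DataFrame.

```python
from statistics import mean

class CustomStepSketch:
    """Hypothetical stand-in for a step built from three optional callables."""

    def __init__(self, to_prepare=None, to_bake=None, to_validate=None):
        for fn in (to_prepare, to_bake, to_validate):
            if fn is not None and not callable(fn):
                raise TypeError("parameters must be callable when defined")
        self.to_prepare = to_prepare
        self.to_bake = to_bake
        self.to_validate = to_validate

    def prepare(self, data):
        # Assumed order: validate first, then let the step learn from the data
        if self.to_validate is not None:
            self.to_validate(self, data)
        if self.to_prepare is not None:
            self.to_prepare(self, data)
        return self

    def bake(self, data):
        # Without a bake function the data passes through unchanged
        return self.to_bake(self, data) if self.to_bake is not None else data

def learn_mean(step, sales):
    step.mean_sales = mean(sales)  # state stored on the step during prepare
    return sales

def deviation(step, sales):
    return [value - step.mean_sales for value in sales]

step = CustomStepSketch(to_prepare=learn_mean, to_bake=deviation).prepare([10, 20, 30])
print(step.bake([25, 40]))  # [5, 20]
```

State learned in prepare lives on the step object, which is why each function receives `step` as its first argument.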