Steps Reference¶

This module exposes a collection of well-tested steps that you can use directly in your data processing pipelines.

Columns Operations

Execute operations over columns or predictors:

SelectColumnsStep / SelectStep: Select a group of columns
MutateStep: Create or transform column values
RenameColumnsStep: Rename column names
CastColumnsStep / CastStep: Cast the columns data types
DropColumnsStep: Drop/Remove columns
DropZVColumnsStep: Drop all columns that contain only a single value
CleanColumnNamesStep: Clean all column names
ReplaceNAStep: Replace missing values
OrdinalEncoderStep: Encode discrete features as integer numbers

Row Operations

Execute operations over rows or values:

FilterRowsStep / FilterStep: Filter values based on an expression
SortRowsStep / SortStep: Sort values based on columns

Aggregations

Aggregate or Summarize data:

GroupByStep: Group by rows based on columns
SummarizeStep: Summarize the group by data
DropDuplicateRowsStep / DropDuplicatesStep: Drop duplicate rows

Imputation

Impute missing data:

MeanImputeStep: Impute numeric data using the mean value
MedianImputeStep: Impute numeric data using the median value
ConstantImputeStep: Impute data using a constant value

Workflows

Arrange and merge workflows and recipes:

LeftJoinStep: Left Join with a DataFrame or Recipe
RightJoinStep: Right Join with a DataFrame or Recipe
InnerJoinStep: Inner Join with a DataFrame or Recipe
FullJoinStep: Full Outer Join with a DataFrame or Recipe

Extensions

Customize Yeast behavior for your project:

CustomStep: Step to add your own functionality

Columns Operations¶

SelectColumnsStep¶

class yeast.steps.SelectColumnsStep(columns, role='all')

Step in charge of keeping columns based on their names or selectors.

Parameters:

columns: list of string column names, selectors or combinations to keep
role: String name of the role to control baking flows on new data. Default: all.

Usage:

# ['A', 'B', 'C', 'D']
SelectColumnsStep('B')
# ['B']

# ['A', 'B', 'C', 'D']
SelectColumnsStep(['B', 'C'])
# ['B', 'C']

# ['A', 'B', 'C', 'D']
SelectColumnsStep(AllNumeric())
# ['B']

# ['AA', 'AB', 'C', 'D']
SelectColumnsStep([AllMatching('^A'), 'D'])
# ['AA', 'AB', 'D']

Raises:

YeastValidationError: if any column does not exist or any column name is invalid.
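
For readers who think in pandas terms, the selector examples above behave roughly like the following. This is a sketch in plain pandas, not Yeast's implementation; the sample frames and their dtypes are assumptions, chosen so that only column B is numeric.

```python
import re

import pandas as pd

# Hypothetical frame: only column B is numeric, mirroring the AllNumeric() example
df = pd.DataFrame({
    "A": ["x", "y"],
    "B": [1, 2],
    "C": ["u", "v"],
    "D": ["p", "q"],
})

# Roughly what SelectColumnsStep(['B', 'C']) keeps
by_name = df[["B", "C"]]

# Roughly what SelectColumnsStep(AllNumeric()) keeps
numeric_only = df.select_dtypes(include="number")

# Roughly what SelectColumnsStep([AllMatching('^A'), 'D']) keeps
df2 = pd.DataFrame({"AA": [1], "AB": [2], "C": [3], "D": [4]})
wanted = [c for c in df2.columns if re.search(r"^A", c)] + ["D"]
pattern_and_name = df2[wanted]
```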

MutateStep¶

class yeast.steps.MutateStep(transformers, role='all')

Create or transform variables by applying a list of transformers while maintaining the number of rows. If more than one transformer is passed for a column, they are executed in order. New variables overwrite existing variables of the same name.

Parameters:

transformers: Dictionary of transformers using keys as column names and values as transformers. E.g.: {column_name: Transformer}. It also supports lambda functions, e.g. {var: lambda df: df}, and lists of transformers (lambdas or Transformers), e.g. {var: [tx1, tx2, ...]}.
role: String name of the role to control baking flows on new data. Default: all.

Usage:

# Create a variable using a lambda function
Recipe([
    MutateStep({
        'total_sales': lambda df: df.sales + df.fee
    })
])

# Create or update variables using Transformers
Recipe([
    MutateStep({
        "name": StrToLower('name'),
        "uid": StrToUpper('uid')
    })
])

# If the output column is the same as the input column you don't need to
# set the column name. The result will be the same
Recipe([
    MutateStep({
        "name": StrToLower(),  # name will be transformed to lower case
        "uid": StrToUpper()    # uid will be transformed to upper case
    })
])

# Create or update variables using multiple Transformers
# You can use Transformers or Lambda functions
Recipe([
    MutateStep({
        "name": [
            StrReplace('-1', ''),
            StrToTitle('name')
        ]
    })
])

# Create or update variables using Group Transformers
Recipe([
    # Create or update a variable
    GroupByStep('client_id'),
    MutateStep({
        "row_number": RowNumber(),
        "lag_sales": NumericLag('sales'),
        "lead_sales": NumericLead('sales')
    })
])

# Create or update a variable using a custom function
def new_variable(df):
    return df.sales / 1e6

Recipe([
    MutateStep({
        'mean_sales': new_variable,
    })
])

Raises:

YeastBakeError: If there was an error executing any transformer
YeastValidationError: xxx
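
The dictionary form described above (one callable, or a list of callables executed in order, per output column) can be sketched in plain pandas. The helper apply_mutations is hypothetical and is not part of Yeast; it only assumes, as in the lambda examples above, that each callable receives the DataFrame and returns a column.

```python
import pandas as pd

def apply_mutations(df, transformers):
    """Hypothetical helper: apply {column: callable or [callables]} in order."""
    out = df.copy()  # leave the caller's frame untouched
    for column, steps in transformers.items():
        # Accept either a single callable or a list of callables
        if not isinstance(steps, (list, tuple)):
            steps = [steps]
        for step in steps:
            # Each callable sees the frame as left by the previous one
            out[column] = step(out)
    return out

df = pd.DataFrame({
    "sales": [100.0, 250.0],
    "fee": [5.0, 10.0],
    "name": ["ann-1", "BOB"],
})

result = apply_mutations(df, {
    "total_sales": lambda d: d["sales"] + d["fee"],
    "name": [
        lambda d: d["name"].str.replace("-1", "", regex=False),
        lambda d: d["name"].str.title(),
    ],
})
```

The row count is unchanged, an existing column (name) is overwritten, and a new one (total_sales) is appended.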

RenameColumnsStep¶

class yeast.steps.RenameColumnsStep(mapping, role='all')

Step in charge of renaming columns based on a mapping dictionary. Columns that don't exist are ignored.

Parameters:

mapping: rename mapping as { 'old_name': 'new_name', ... }
role: String name of the role to control baking flows on new data. Default: all.

Usage:

RenameColumnsStep({
    'old_column_name': 'new_column_name'
})

Raises:

YeastValidationError: if any old or new column name is not a string.
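
The "columns that don't exist are ignored" behavior matches what pandas does by default when renaming. A minimal sketch in plain pandas, assuming the step delegates to something similar (the frame and the "missing" key are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"old_column_name": [1], "other": [2]})

# The 'missing' key has no matching column and is silently skipped
renamed = df.rename(columns={
    "old_column_name": "new_column_name",
    "missing": "ignored",
})
```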

CastColumnsStep¶

class yeast.steps.CastColumnsStep(mapping, role='all')

Step in charge of casting columns to a type based on a mapping dictionary.

Available Types:

category
string, str
boolean, bool
integer, int64, int32
float, float64, float32
date, datetime, datetime64

Parameters:

mapping: Casting mapping as { 'column_name': 'type', ... }
role: String name of the role to control baking flows on new data. Default: all.

Usage:

CastColumnsStep({
    'title': 'string',
    'year': 'integer',
    'aired': 'datetime',
})

Raises:

YeastValidationError: if any column or type is not correct.
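
As a rough pandas equivalent of the usage above: the mapping from Yeast type names to pandas dtypes shown here ('integer' to int64, 'datetime' to pd.to_datetime) is an assumption, not taken from the Yeast source.

```python
import pandas as pd

df = pd.DataFrame({
    "title": ["Alien", "Arrival"],
    "year": ["1979", "2016"],
    "aired": ["1979-05-25", "2016-11-11"],
})

# 'string' and 'integer' casts expressed with astype
casted = df.astype({"title": "string", "year": "int64"})

# 'datetime' cast expressed with to_datetime
casted["aired"] = pd.to_datetime(casted["aired"])
```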

DropColumnsStep¶

class yeast.steps.DropColumnsStep(columns, role='all')

Step in charge of dropping columns based on their names.

Parameters:

columns: list of string column names to drop or a selector.
role: String name of the role to control baking flows on new data. Default: all.

Usage:

# DataFrame columns: ['A', 'B', 'C', 'D']
DropColumnsStep(['B', 'C'])
# DataFrame result columns: ['A', 'D']

Raises:

YeastValidationError: if any column does not exist or any column name is invalid.

DropZVColumnsStep¶

class yeast.steps.DropZVColumnsStep(selector=None, naomit=False, role='all')

Drop all columns that contain only a single value (Zero Variance).

Notes:

The naomit parameter indicates whether NA should be ignored when counting values. If naomit=False, then [NA, 'a'] contains two values and the column is not removed. If naomit=True, then [NA, 'a'] contains only one value and the column is removed, because NA is not counted.

Parameters:

selector: string list of column names or a selector to check. Default: None (all columns).
naomit: True if NA is not considered a value, False otherwise.
role: String name of the role to control baking flows on new data. Default: all.

Usage:

# Remove all columns that have zero variance:
recipe = Recipe([
    DropZVColumnsStep()
])

# Remove all numerical columns that have zero variance:
recipe = Recipe([
    DropZVColumnsStep(AllNumeric())
])
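
The effect of naomit can be illustrated by counting distinct values per column in plain pandas. The helper zero_variance_columns is hypothetical and only sketches the rule from the notes above.

```python
import pandas as pd

def zero_variance_columns(df, naomit=False):
    """Hypothetical helper: names of columns holding at most one distinct value."""
    # dropna=True ignores NA; dropna=False counts NA as a value of its own
    distinct = df.nunique(dropna=naomit)
    return distinct[distinct <= 1].index.tolist()

df = pd.DataFrame({
    "constant": ["a", "a", "a"],
    "with_na": [None, "a", "a"],
    "varied": [1, 2, 3],
})

dropped_default = zero_variance_columns(df, naomit=False)
dropped_naomit = zero_variance_columns(df, naomit=True)
```

With naomit=False only constant qualifies; with naomit=True the with_na column qualifies as well.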

CleanColumnNamesStep¶

class yeast.steps.CleanColumnNamesStep(case='snake', role='all')

Step in charge of cleaning all column names.

Available cases:

snake from column Name to column_name
lower_camel from column Name to columnName
upper_camel from column Name to ColumnName

Parameters:

case: case that will be used on the columns, snake by default.
role: String name of the role to control baking flows on new data. Default: all.

Usage:

# DataFrame columns: ['TheName', 'B', 'the name', 'Other Name']
CleanColumnNamesStep('snake')
# DataFrame result columns: ['the_name', 'b', 'the_name', 'other_name']

Raises:

YeastValidationError: if the case is not available.
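
The three cases can be sketched with the standard library alone. The word-splitting rules below (split at lower-to-upper boundaries and at non-alphanumeric characters) are an assumption chosen to reproduce the examples above; Yeast's actual rules may differ.

```python
import re

def split_words(name):
    """Split a raw column name into lowercase words."""
    # Insert a space at lower-to-upper boundaries: 'TheName' -> 'The Name'
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
    # Treat any run of non-alphanumeric characters as a separator
    return [w.lower() for w in re.split(r"[^A-Za-z0-9]+", spaced) if w]

def clean_name(name, case="snake"):
    """Hypothetical helper converting one column name to the requested case."""
    words = split_words(name)
    if not words:
        return ""
    if case == "snake":
        return "_".join(words)
    if case == "lower_camel":
        return words[0] + "".join(w.capitalize() for w in words[1:])
    if case == "upper_camel":
        return "".join(w.capitalize() for w in words)
    raise ValueError(f"Unknown case: {case}")

columns = ["TheName", "B", "the name", "Other Name"]
snake = [clean_name(c) for c in columns]
```

Note that two different source names can collapse to the same cleaned name, as 'TheName' and 'the name' do here.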

ReplaceNAStep¶

class yeast.steps.ReplaceNAStep(mapping, value=0, role='all')

Replace missing values

Parameters:

mapping: replacing mapping as {'column': replacement_value, ...} or string column name.
value: value to replace the NAs if mapping is a string column name. Default: 0
role: String name of the role to control baking flows on new data. Default: all.

Usage:

# Replace NA values in one column
ReplaceNAStep('factor', 1.0)

# Replace NA values on several columns
ReplaceNAStep({
    'factor': 1.00,
    'pending': 0.00,
    'category': 'other'
})

Raises:

YeastValidationError: if a column does not exist on the dataframe
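
The mapping form corresponds closely to filling NA values per column in pandas. A minimal sketch with an illustrative frame (this shows the pandas call, not Yeast internals):

```python
import pandas as pd

df = pd.DataFrame({
    "factor": [None, 2.0],
    "pending": [None, 5.0],
    "category": [None, "tv"],
})

# One replacement value per column, as in the mapping form above
replaced = df.fillna({"factor": 1.0, "pending": 0.0, "category": "other"})
```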

OrdinalEncoderStep¶

class yeast.steps.OrdinalEncoderStep(selector, role='all')

Encode categorical/string discrete features as integer numbers (0 to n - 1).

Parameters:

selector: List of columns, column name or selector.
role: String name of the role to control baking flows on new data. Default: all.

Usage:

recipe = Recipe([
    # Ordinal Encode the gender column
    OrdinalEncoderStep('gender')
])

# Extract the categories and the values on the prepare
recipe = recipe.prepare(train_data)

# Encode on new data without changing the values
test_data = recipe.bake(test_data)

# Example:
# Gender: 'Male', 'Female', 'Male', None, 'Male', 'Female'
# Encoded: 0, 1, 0, NA, 0, 1

Raises:

YeastValidationError: If the column was not found
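
The prepare/bake split for an ordinal encoder can be sketched without any library. Two assumptions are made here and are not confirmed by the text above: codes are assigned in order of first appearance (which matches the Gender example), and values unseen during prepare are encoded as missing.

```python
def fit_categories(values):
    """Learn a value -> integer code mapping in order of first appearance."""
    mapping = {}
    for value in values:
        if value is not None and value not in mapping:
            mapping[value] = len(mapping)
    return mapping

def encode(values, mapping):
    """Apply a learned mapping; missing or unseen values become None."""
    return [mapping.get(value) for value in values]

# "Prepare": learn the categories from the training data
train = ["Male", "Female", "Male", None, "Male", "Female"]
mapping = fit_categories(train)
encoded_train = encode(train, mapping)

# "Bake": reuse the same codes on new data
encoded_test = encode(["Female", "Male", None], mapping)
```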

Row Operations¶

FilterRowsStep¶

class yeast.steps.FilterRowsStep(expression, role='all', **kwargs)

Step in charge of filtering out rows based on boolean conditions.

Operators:

&, |, and, or, (, )
in, not in, ==, !=, >, <, <=, >=
+, ~, not

Notes:

You can refer to column names with spaces or operators by surrounding them in backticks.
You can refer to variables in the environment by prefixing them with an ‘@’ like @a + b.

Parameters:

expression: The query string to evaluate.
role: String name of the role to control baking flows on new data. Default: all.

Usage:

# Subset a DataFrame based on a numeric variable
FilterRowsStep('age > 20')

# Subset a DataFrame based on a categorical / string variable
FilterRowsStep('category == "Sci-Fi"')

# Subset a DataFrame comparing two columns
FilterRowsStep('seasons > rating')

# Subset based on Multiple comparisons
FilterRowsStep('(watched == True) and seasons in [2, 7]')

# Subset referencing local variables inside the filter
was_watched = False
FilterRowsStep('watched == @was_watched')

# Subset referencing a column name that contains spaces, using backticks:
FilterRowsStep('`episode title` == "Hello"')

Raises:

YeastValidationError: if the expression is an empty string.
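
The operators and notes above mirror the expression syntax of pandas DataFrame.query. Assuming the step evaluates its expression in a similar way, the usage examples can be tried directly on a small illustrative frame:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [18, 25, 40],
    "watched": [True, False, True],
    "seasons": [2, 5, 7],
    "episode title": ["Hello", "Pilot", "Hello"],
})

# Numeric comparison
adults = df.query("age > 20")

# Reference a local variable with '@'
was_watched = False
unwatched = df.query("watched == @was_watched")

# Column name with a space, wrapped in backticks
hello = df.query('`episode title` == "Hello"')

# Multiple comparisons combined with 'and'
combined = df.query("(watched == True) and seasons in [2, 7]")
```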

SortRowsStep¶

class yeast.steps.SortRowsStep(columns, ascending=True, role='all')

Step in charge of sorting rows based on columns.

Parameters:

columns: list of string column names to sort by
ascending: boolean flag to sort ascending vs. descending. Default: True
role: String name of the role to control baking flows on new data. Default: all.

Usage:

SortRowsStep(['B', 'C'])

Raises:

YeastValidationError: if any column does not exist or any column name is invalid.
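
In pandas terms, sorting by several columns uses the first column as the primary key and later columns to break ties. A small sketch on an illustrative frame, not Yeast's own code:

```python
import pandas as pd

df = pd.DataFrame({"B": [2, 1, 2], "C": ["b", "z", "a"]})

# Sort by B first, then by C within equal values of B
ascending = df.sort_values(["B", "C"], ascending=True)
descending = df.sort_values(["B", "C"], ascending=False)
```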

DropDuplicateRowsStep¶

class yeast.steps.DropDuplicateRowsStep(columns=None, keep='first', role='all')

Step in charge of removing duplicate rows, optionally only considering certain columns.

Parameters:

columns: list of string column names to look for duplicates or a selector
keep (first, last, none): Determines which duplicates (if any) to keep. first: drop duplicates except for the first occurrence. last: drop duplicates except for the last occurrence. none: drop all duplicates. Default: first.
role: String name of the role to control baking flows on new data. Default: all.

Usage:

# Remove duplicates considering all columns, keep the first occurrence
DropDuplicatesStep()

# Remove duplicates considering columns B and C
DropDuplicatesStep(['B', 'C'], keep="none")

# Removing duplicates considering all columns starting with id_
DropDuplicatesStep(AllMatching('^id_'), keep="first")

Raises:

YeastValidationError: if any column does not exist or any column name is invalid.
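
The three keep options can be compared side by side in plain pandas. Note that pandas spells "drop all duplicates" as keep=False; treating Yeast's keep="none" as the equivalent is an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    "B": [1, 1, 2],
    "C": ["x", "x", "y"],
    "D": [10, 20, 30],
})

# Rows 0 and 1 are duplicates when only B and C are considered
keep_first = df.drop_duplicates(subset=["B", "C"], keep="first")
keep_last = df.drop_duplicates(subset=["B", "C"], keep="last")
keep_none = df.drop_duplicates(subset=["B", "C"], keep=False)
```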

Aggregations¶

GroupByStep¶

class yeast.steps.GroupByStep(columns, role='all')

Most data operations are done on groups defined by columns. GroupByStep takes an existing DataFrame and converts it into a Pandas DataFrameGroupBy where aggregation/summarization/mutation operations are performed "by group".

A groupby operation involves some combination of:

Splitting the object: GroupByStep()
Applying functions: SummarizeStep() or MutateStep()
Combining the results into a DataFrame

Parameters:

columns: list of string column names to group by or a selector
role: String name of the role to control baking flows on new data. Default: all.

Usage:

# Basic Group By and an Aggregation
recipe = Recipe([
    GroupByStep(['category', 'year']),
    SummarizeStep({
        'average_rating': AggMean('rating'),
        'unique_titles': AggCountDistinct('title')
    })
])

Raises:

YeastValidationError: if a column does not exist on the DataFrame

SummarizeStep¶

class yeast.steps.SummarizeStep(aggregations, role='all')

Create one or more numeric variables summarizing the columns of an existing group created by GroupByStep() resulting in one row in the output for each group. Please refer to the Aggregations documentation to see the complete list of supported aggregations. The most used ones are: AggMean, AggMedian, AggCount, AggMax, AggMin

Parameters:

aggregations: dictionary with the aggregations to perform. The key is the new column name where the value is the specification of the aggregation to perform. For example: {'new_column_name': AggMean('column')}
role: String name of the role to control baking flows on new data. Default: all.

Usage:

# Basic Summarization on a Group
recipe = Recipe([
    GroupByStep(['category', 'year']),
    SummarizeStep({
        'average_rating': AggMean('rating'),
        'unique_titles': AggCountDistinct('title')
    })
])

Raises:

YeastValidationError: If there was not a GroupByStep before
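
For comparison, the GroupByStep plus SummarizeStep recipe above corresponds roughly to a pandas groupby with named aggregations. Mapping AggMean to 'mean' and AggCountDistinct to 'nunique' is an assumption based on their names; the data is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["Sci-Fi", "Sci-Fi", "Drama"],
    "year": [2019, 2019, 2020],
    "title": ["A", "B", "C"],
    "rating": [8.0, 6.0, 9.0],
})

# One output row per (category, year) group
summary = (
    df.groupby(["category", "year"], as_index=False)
      .agg(
          average_rating=("rating", "mean"),
          unique_titles=("title", "nunique"),
      )
)
```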

Imputation¶

MeanImputeStep¶

class yeast.steps.MeanImputeStep(selector, role='all')

Impute numeric data using the mean

MeanImputeStep estimates the variable mean from the data passed to prepare, then replaces the NA values on new data sets with the calculated mean values.

Parameters:

selector: string list of column names or a selector to impute
role: String name of the role to control baking flows on new data. Default: all.

Usage:

# Impute the age and size columns using the mean from the training set
# Age :  20,  31,  65,  NA,  45,  23,  NA
# Size:   2,   5,   9,   3,   4,  NA,  NA
# to
# Age :  20,  31,  65, 36.8, 45,  23, 36.8 (mean=36.8)
# Size:   2,   5,   9,    3,  4, 4.6,  4.6 (mean=4.6)
MeanImputeStep(['age', 'size'])

# You can also use selectors:
MeanImputeStep(AllNumeric())

Raises:

YeastValidationError: if a column does not exist
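
The key point is that the means are learned once and then reused on any later data set. A sketch of that two-phase behavior in plain pandas, using the Age and Size values from the example above (the new_data frame is made up for illustration):

```python
import pandas as pd

train = pd.DataFrame({
    "age": [20, 31, 65, None, 45, 23, None],
    "size": [2, 5, 9, 3, 4, None, None],
})

# "Prepare": learn the column means from the training data (NA is skipped)
means = train[["age", "size"]].mean()

# "Bake": fill NA with the stored means, on the training data or on new data
baked_train = train.fillna(means)
new_data = pd.DataFrame({"age": [50, None], "size": [None, 7]})
baked_new = new_data.fillna(means)
```

The new data is filled with the training means (36.8 and 4.6), not with statistics computed from the new data itself.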

MedianImputeStep¶

class yeast.steps.MedianImputeStep(selector, role='all')

Impute numeric data using the median

MedianImputeStep estimates the variable median from the data passed to prepare, then replaces the NA values on new data sets with the calculated median values.

Parameters:

selector: string list of column names or a selector to impute
role: String name of the role to control baking flows on new data. Default: all.

Usage:

# Impute the age and size columns using the median from the training set
# Age :  20,  31,  65,  NA,  45,  23,  NA
# Size:   2,   5,   9,   3,   4,  NA,  NA
# to
# Age :  20,  31,  65,  31,  45,  23,  31 (median=31)
# Size:   2,   5,   9,   3,   4,   4,   4 (median=4)
MedianImputeStep(['age', 'size'])

# You can also use selectors:
MedianImputeStep(AllNumeric())

Raises:

YeastValidationError: if a column does not exist

ConstantImputeStep¶

class yeast.steps.ConstantImputeStep(selector, value, role='all')

Impute data using a constant value

ConstantImputeStep replaces all NA values in the columns with a constant value. This step does not validate the column data type before imputing, so you can end up with mixed types in a column.

Parameters:

selector: string list of column names or a selector to impute
value: constant value to replace with
role: String name of the role to control baking flows on new data. Default: all.

Usage:

# Numerical:
# Impute the age and size columns using zero as value
# Age :  20,  31,  65,  NA,  45,  23,  NA
# Size:   2,   5,   9,   3,   4,  NA,  NA
# to
# Age :  20,  31,  65,   0,  45,  23,   0
# Size:   2,   5,   9,   3,   4,   0,   0
ConstantImputeStep(['age', 'size'], value=0)

# Categorical:
# Impute the security column with "other"
# security: 'stock', 'bond', 'etf', 'mf', NA
# to
# security: 'stock', 'bond', 'etf', 'mf', 'other'
ConstantImputeStep(['security'], value='other')

# You can also use selectors:
ConstantImputeStep(AllNumeric(), value=0)

Raises:

YeastValidationError: if a column does not exist

Workflows¶

LeftJoinStep¶

class yeast.steps.LeftJoinStep(y, by=None, df=None, role='all')

Left Join two DataFrames together

Return all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y all combinations of the matches are returned.

Parameters:

y: DataFrame or Recipe to merge with.
by: optional column name list to merge by. Default: None
df: optional df to be used as input if y is a Recipe
role: String name of the role to control baking flows on new data. Default: all.

Usage:

# Left Join with another DataFrame
# sales_df and client_df are DataFrames, by argument is optional
Recipe([
    LeftJoinStep(sales_df, by="client_id")
]).bake(client_df)

# Left join with the DataFrame obtained from the execution of a Recipe
# sales_recipe will be executed using sales_df inside the client_recipe execution
sales_recipe = Recipe([
    RenameStep({'client_id': 'cid'})
])

client_recipe = Recipe([
    LeftJoinStep(sales_recipe, by=["client_id", "region_id"], df=sales_df)
])

client_recipe.prepare(client_df).bake(client_df)

Raises:

YeastValidationError: if any of the validations is not correct.
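
The left join semantics described above (unmatched rows get NA, multiple matches produce all combinations) can be checked with pandas merge. Here x is client_df and y is sales_df; the data is illustrative and the sketch does not use Yeast itself.

```python
import pandas as pd

client_df = pd.DataFrame({"client_id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
sales_df = pd.DataFrame({"client_id": [1, 1, 2], "sales": [100, 150, 80]})

# Client 1 matches two sales rows; client 3 matches none
left = client_df.merge(sales_df, how="left", on="client_id")
```

Client 1 appears twice in the result, and client 3 is kept with NA in the sales column.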

RightJoinStep¶

class yeast.steps.RightJoinStep(y, by=None, df=None, role='all')

Right Join two DataFrames together

Return all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

Parameters:

y: DataFrame or Recipe to merge with.
by: optional column name list to merge by. Default: None
df: optional df to be used as input if y is a Recipe
role: String name of the role to control baking flows on new data. Default: all.

Usage:

# Right Join with another DataFrame
# sales_df and client_df are DataFrames, by argument is optional
Recipe([
    RightJoinStep(sales_df, by="client_id")
]).bake(client_df)

# Right join with the DataFrame obtained from the execution of a Recipe
# sales_recipe will be executed using sales_df inside the client_recipe execution
sales_recipe = Recipe([
    RenameStep({'client_id': 'cid'})
])

client_recipe = Recipe([
    RightJoinStep(sales_recipe, by=["client_id", "region_id"], df=sales_df)
])

client_recipe.prepare(client_df).bake(client_df)

Raises:

YeastValidationError: if any of the validations is not correct.

InnerJoinStep¶

class yeast.steps.InnerJoinStep(y, by=None, df=None, role='all')

Inner Join two DataFrames together

Return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combinations of the matches are returned.

Parameters:

y: DataFrame or Recipe to merge with.
by: optional column name list to merge by. Default: None
df: optional df to be used as input if y is a Recipe
role: String name of the role to control baking flows on new data. Default: all.

Usage:

# Inner Join with another DataFrame
# sales_df and client_df are DataFrames, by argument is optional
Recipe([
    InnerJoinStep(sales_df, by="client_id")
]).bake(client_df)

# Inner join with the DataFrame obtained from the execution of a Recipe
# sales_recipe will be executed using sales_df inside the client_recipe execution
sales_recipe = Recipe([
    RenameStep({'client_id': 'cid'})
])

client_recipe = Recipe([
    InnerJoinStep(sales_recipe, by=["client_id", "region_id"], df=sales_df)
])

client_recipe.prepare(client_df).bake(client_df)

Raises:

YeastValidationError: if any of the validations is not correct.

FullJoinStep¶

class yeast.steps.FullJoinStep(y, by=None, df=None, role='all')

Full Join two DataFrames together

Return all rows and all columns from both x and y. Where there are no matching values, NA is returned for the missing side.

Parameters:

y: DataFrame or Recipe to merge with.
by: optional column name list to merge by. Default: None
df: optional df to be used as input if y is a Recipe
role: String name of the role to control baking flows on new data. Default: all.

Usage:

# Full Outer Join with another DataFrame
# sales_df and client_df are DataFrames, by argument is optional
Recipe([
    FullJoinStep(sales_df, by="client_id")
]).bake(client_df)

# Full Outer join with the DataFrame obtained from the execution of a Recipe
# sales_recipe will be executed using sales_df inside the client_recipe execution
sales_recipe = Recipe([
    RenameStep({'client_id': 'cid'})
])

client_recipe = Recipe([
    FullJoinStep(sales_recipe, by=["client_id", "region_id"], df=sales_df)
])

client_recipe.prepare(client_df).bake(client_df)

Raises:

YeastValidationError: if any of the validations is not correct.
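
To see how the four join types in this section differ, the sketch below runs the same two illustrative frames through pandas merge with each how value. It assumes the steps follow the usual left, right, inner and full outer semantics described above.

```python
import pandas as pd

x = pd.DataFrame({"client_id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
y = pd.DataFrame({"client_id": [2, 3, 4], "sales": [80, 40, 60]})

# Row counts per join type: only ids 2 and 3 exist on both sides
rows = {
    how: len(x.merge(y, how=how, on="client_id"))
    for how in ["left", "right", "inner", "outer"]
}

# Full outer join: every id from either side, NA where one side is missing
full = x.merge(y, how="outer", on="client_id")
```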

Extensions¶

CustomStep¶

class yeast.steps.CustomStep(to_prepare=None, to_bake=None, to_validate=None, role='all')

CustomStep was designed to extend the power of Yeast pipelines and cover the scenarios where the built-in Yeast steps are not adequate and you need to define your own operations. You can define custom transformations and business rules, or wrap third-party libraries. The usage is straightforward and designed to keep the implementation effort low. It expects between one and three arguments, all of them optional functions:

to_validate(step, df)
to_prepare(step, df): returns df
to_bake(step, df): returns df

Please notice that to_prepare and to_bake must return a DataFrame to continue the pipeline execution in further steps. CustomStep enables you to structure and document your code and business rules in Steps that could be shared across Recipes.

Parameters:

to_validate: perform validations on the data. Raise YeastValidationError on a problem.
to_prepare: prepare the step before bake, like train or calculate aggregations.
to_bake: execute the bake (processing). This is the core method.
role: String name of the role to control baking flows on new data. Default: all.

Inline Usage:

recipe = Recipe([
    # Custom Business Rules:
    # to_bake must return a DataFrame, so fill the column inside the frame
    CustomStep(to_bake=lambda step, df: df.fillna({'sales': 0}))
])

Custom rules:

def my_bake(step, df):
    # Calculate total sales or anything you need:
    df['total_sales'] = df['sales'] + df['fees']
    return df

recipe = Recipe([
    # Custom Business Rules:
    CustomStep(to_bake=my_bake)
])

Custom Checks and Validations:

def my_validate(step, df):
    if 'sales' not in df.columns:
        raise YeastValidationError('sales column not found')
    if 'fees' not in df.columns:
        raise YeastValidationError('fees column not found')

recipe = Recipe([
    CustomStep(to_validate=my_validate, to_bake=my_bake)
])

Define the Estimation/Preparation procedure:

def my_preparation(step, df):
    # Store the learned value on the step so to_bake can use it later
    step.mean_sales = df['sales'].mean()
    return df

def my_bake(step, df):
    df['sales_deviation'] = df['sales'] - step.mean_sales
    return df

recipe = Recipe([
    CustomStep(to_prepare=my_preparation, to_bake=my_bake)
])

Creating a custom step inheriting from CustomStep:

class MyCustomStep(CustomStep):

    def do_validate(self, df):
        # Some validations that could raise YeastValidationError
        pass

    def do_prepare(self, df):
        # Prepare the step if needed
        return df

    def do_bake(self, df):
        # Logic to process the df
        return df

recipe = Recipe([
    MyCustomStep()
])

Raises:

YeastValidationError: if any of the parameters is defined but not callable.
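
To make the hook contract concrete, here is a library-free sketch of how the three functions could be wired together. SketchStep is a hypothetical stand-in, not Yeast's CustomStep; the order in which hooks run, and the use of plain lists of dicts instead of DataFrames, are assumptions made to keep the example self-contained.

```python
class SketchStep:
    """Hypothetical stand-in for CustomStep showing the three optional hooks."""

    def __init__(self, to_prepare=None, to_bake=None, to_validate=None):
        # Mirror the documented rule: a hook that is defined must be callable
        for hook in (to_prepare, to_bake, to_validate):
            if hook is not None and not callable(hook):
                raise TypeError("hooks must be callable")
        self.to_prepare = to_prepare
        self.to_bake = to_bake
        self.to_validate = to_validate

    def prepare(self, data):
        # Assumed order: validate first, then learn from the data
        if self.to_validate:
            self.to_validate(self, data)
        return self.to_prepare(self, data) if self.to_prepare else data

    def bake(self, data):
        return self.to_bake(self, data) if self.to_bake else data


def learn_mean(step, rows):
    # State learned during prepare is stored on the step itself
    step.mean_sales = sum(r["sales"] for r in rows) / len(rows)
    return rows

def subtract_mean(step, rows):
    # Bake reuses the stored state on whatever data it receives
    return [{**r, "sales_deviation": r["sales"] - step.mean_sales} for r in rows]

step = SketchStep(to_prepare=learn_mean, to_bake=subtract_mean)
step.prepare([{"sales": 10.0}, {"sales": 30.0}])
baked = step.bake([{"sales": 25.0}])
```

Passing the step as the first argument is what lets to_prepare hand state to to_bake, as in the mean_sales example above.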