Steps Reference

This module exposes a collection of well-tested steps that you can use directly in your data processing pipelines.

Column Operations

Execute operations over columns or predictors:

Row Operations

Execute operations over rows or values:

Aggregations

Aggregate or Summarize data:

Imputation

Impute missing data:

Workflows

Arrange and merge workflows and recipes:

Extensions

Customize Yeast behavior for your project:
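All of the steps below are meant to be composed into a Recipe that is prepared on training data and then baked on new data. A minimal sketch of that flow (the DataFrames, the column names and the import paths are illustrative assumptions):

from yeast import Recipe
from yeast.steps import CleanColumnNamesStep, FilterRowsStep, MeanImputeStep

# NOTE: train_df, test_df and the column names below are hypothetical
recipe = Recipe([
    CleanColumnNamesStep('snake'),   # normalize the column names
    FilterRowsStep('year > 2000'),   # keep only the rows of interest
    MeanImputeStep(['rating'])       # fill NAs with the training mean
])

# Estimate any parameters (e.g. the mean of `rating`) on the training data
recipe = recipe.prepare(train_df)

# Apply the same transformations to new data
baked_df = recipe.bake(test_df)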

Column Operations

SelectColumnsStep

class yeast.steps.SelectColumnsStep(columns, role='all')

Step in charge of keeping columns based on their names or selectors.

Parameters:

  • columns: list of string column names, selectors or combinations to keep.
  • role: String name of the role to control baking flows on new data. Default: all.

Usage:

# ['A', 'B', 'C', 'D']
SelectColumnsStep('B')
# ['B']

# ['A', 'B', 'C', 'D']
SelectColumnsStep(['B', 'C'])
# ['B', 'C']

# ['A', 'B', 'C', 'D']
SelectColumnsStep(AllNumeric())
# ['B']

# ['AA', 'AB', 'C', 'D']
SelectColumnsStep([AllMatching('^A'), 'D'])
# ['AA', 'AB', 'D']

Raises:

  • YeastValidationError: if any column does not exist or any column name is invalid.

MutateStep

class yeast.steps.MutateStep(transformers, role='all')

Create or transform variables, maintaining the number of rows, by applying a list of transformers. If more than one transformer is passed to a column, they will be executed in order. New variables overwrite existing variables of the same name.

Parameters:

  • transformers: Dictionary of transformers using keys as column names and values as transformers, e.g. { column_name: Transformer }. It also supports lambda functions, e.g. { var: lambda df: df }, and lists of transformers (lambda or Transformer), e.g. { var: [tx1, tx2, ...] }.
  • role: String name of the role to control baking flows on new data. Default: all.

Usage:

# Create a variable using a lambda function
Recipe([
    MutateStep({
        'total_sales': lambda df: df.sales + df.fee
    })
])

# Create or update variables using Transformers
Recipe([
    MutateStep({
        "name": StrToLower('name'),
        "uid": : StrToUpper('uid')
    })
])

# If the output column is the same as the input column you don't need to
# set the column name. The result will be the same
Recipe([
    MutateStep({
        "name": StrToLower(),  # name will be transformed to lower case
        "uid": : StrToUpper()   # uid will be transformed to upper case
    })
])

# Create or update variables using multiple Transformers
# You can use Transformers or Lambda functions
Recipe([
    MutateStep({
        "name": [
            StrReplace('-1', ''),
            StrToTitle('name')
        ]
    })
])

# Create or update variables using Group Transformers
Recipe([
    # Group by client before applying the group transformers
    GroupByStep('client_id'),
    MutateStep({
        "row_number": RowNumber(),
        "lag_sales": NumericLag('sales'),
        "lead_sales": NumericLead('sales')
    })
])

# Create or update a variable using a custom function
def new_variable(df):
    return df.sales / 1e6

Recipe([
    MutateStep({
        'sales_in_millions': new_variable,
    })
])

Raises:

  • YeastBakeError: If there was an error executing any transformer
  • YeastValidationError: if the transformer specification is not valid

RenameColumnsStep

class yeast.steps.RenameColumnsStep(mapping, role='all')

Step in charge of renaming columns based on a mapping dictionary. Columns that don't exist are ignored.

Parameters:

  • mapping: rename mapping as { 'old_name': 'new_name', ... }
  • role: String name of the role to control baking flows on new data. Default: all.

Usage:

RenameColumnsStep({
    'old_column_name': 'new_column_name'
})

Raises:

  • YeastValidationError: if any column old or new is not a string.

CastColumnsStep

class yeast.steps.CastColumnsStep(mapping, role='all')

Step in charge of casting columns to a type based on a mapping dictionary.

Available Types:

  • category
  • string, str
  • boolean, bool
  • integer, int64, int32
  • float, float64, float32
  • date, datetime, datetime64

Parameters:

  • mapping: Casting mapping as { 'column_name': 'type', ... }
  • role: String name of the role to control baking flows on new data. Default: all.

Usage:

CastColumnsStep({
    'title': 'string',
    'year': 'integer',
    'aired': 'datetime',
})

Raises:

  • YeastValidationError: if any column or type is not correct.

DropColumnsStep

class yeast.steps.DropColumnsStep(columns, role='all')

Step in charge of dropping columns based on their names.

Parameters:

  • columns: list of string column names to drop or a selector.
  • role: String name of the role to control baking flows on new data. Default: all.

Usage:

# DataFrame columns: ['A', 'B', 'C', 'D']
DropColumnsStep(['B', 'C'])
# DataFrame result columns: ['A', 'D']

Raises:

  • YeastValidationError: if any column does not exist or any column name is invalid.

DropZVColumnsStep

class yeast.steps.DropZVColumnsStep(selector=None, naomit=False, role='all')

Drop all columns that contain only a single value (Zero Variance).

Notes:

The parameter naomit indicates whether NA should be counted as a value. If naomit=False, then [NA, 'a'] contains two values and the column will not be removed. If naomit=True, then [NA, 'a'] contains only one value and the column will be dropped because NA is not counted.

Parameters:

  • selector: list of string column names or a selector to evaluate.
  • naomit: True if NA is not considered a value, False otherwise.
  • role: String name of the role to control baking flows on new data. Default: all.

Usage:

# Remove all columns that contain zero variance:
recipe = Recipe([
    DropZVColumnsStep()
])

# Remove all numerical columns that contain zero variance:
recipe = Recipe([
    DropZVColumnsStep(AllNumeric())
])
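
A sketch of how the naomit flag from the note above changes the result (the 'flag' column and its values are illustrative):

# Column 'flag': [NA, 'a', 'a']

# With the default naomit=False, NA counts as a value, so 'flag' has two
# distinct values and is kept:
recipe = Recipe([
    DropZVColumnsStep()
])

# With naomit=True, NA is ignored, 'flag' has only one value and is dropped:
recipe = Recipe([
    DropZVColumnsStep(naomit=True)
])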

CleanColumnNamesStep

class yeast.steps.CleanColumnNamesStep(case='snake', role='all')

Step in charge of cleaning all column names.

Available cases:

  • snake: from column Name to column_name
  • lower_camel: from column Name to columnName
  • upper_camel: from column Name to ColumnName

Parameters:

  • case: case that will be used on the columns, snake by default.
  • role: String name of the role to control baking flows on new data. Default: all.

Usage:

# DataFrame columns: ['TheName', 'B', 'the name', 'Other Name']
CleanColumnNamesStep('snake')
# DataFrame result columns: ['the_name', 'b', 'the_name', 'other_name']

Raises:

  • YeastValidationError: if the case is not available.

ReplaceNAStep

class yeast.steps.ReplaceNAStep(mapping, value=0, role='all')

Replace missing values

Parameters:

  • mapping: replacing mapping as {'column': replacement_value, ...} or string column name.
  • value: value to replace the NAs if mapping is a string column name. Default: 0
  • role: String name of the role to control baking flows on new data. Default: all.

Usage:

# Replace NA values in one column
ReplaceNAStep('factor', 1.0)

# Replace NA values on several columns
ReplaceNAStep({
    'factor': 1.00,
    'pending': 0.00,
    'category': 'other'
})

Raises:

  • YeastValidationError: if a column does not exist on the dataframe

OrdinalEncoderStep

class yeast.steps.OrdinalEncoderStep(selector, role='all')

Encode categorical/string discrete features as integer numbers (0 to n - 1).

Parameters:

  • selector: List of columns, column name or selector.
  • role: String name of the role to control baking flows on new data. Default: all.

Usage:

recipe = Recipe([
    # Ordinal Encode the gender column
    OrdinalEncoderStep('gender')
])

# Extract the categories and their values during prepare
recipe = recipe.prepare(train_data)

# Encode on new data without changing the values
test_data = recipe.bake(test_data)

# Example:
# Gender: 'Male', 'Female', 'Male', None, 'Male', 'Female'
# Encoded: 0, 1, 0, NA, 0, 1

Raises:

  • YeastValidationError: If the column was not found

Row Operations

FilterRowsStep

class yeast.steps.FilterRowsStep(expression, role='all', **kwargs)

Step in charge of filtering out rows based on boolean conditions.

Operators:

  • &, |, and, or, (, )
  • in, not in, ==, !=, >, <, <=, >=
  • +, ~, not

Notes:

  • You can refer to column names with spaces or operators by surrounding them in backticks.
  • You can refer to variables in the environment by prefixing them with an ‘@’ like @a + b.

Parameters:

  • expression: The query string to evaluate.
  • role: String name of the role to control baking flows on new data. Default: all.

Usage:

# Subset a DataFrame based on a numeric variable
FilterRowsStep('age > 20')

# Subset a DataFrame based on a categorical / string variable
FilterRowsStep('category == "Sci-Fi"')

# Subset a DataFrame comparing two columns
FilterRowsStep('seasons > rating')

# Subset based on Multiple comparisons
FilterRowsStep('(watched == True) and seasons in [2, 7]')

# Subset referencing local variables inside the filter
was_watched = False
FilterRowsStep('watched == @was_watched')

# Subset referencing a column name that contain spaces with backtick:
FilterRowsStep('`episode title` == "Hello"')
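
# A sketch using `not in` to exclude several categories
# (column name and values are illustrative):
FilterRowsStep('category not in ["Drama", "Documentary"]')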

Raises:

  • YeastValidationError: if the expression is an empty string.

SortRowsStep

class yeast.steps.SortRowsStep(columns, ascending=True, role='all')

Step in charge of sorting rows based on columns.

Parameters:

  • columns: list of string column names to sort by
  • ascending: boolean flag to sort ascending vs. descending
  • role: String name of the role to control baking flows on new data. Default: all.

Usage:

SortRowsStep(['B', 'C'])
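
# A sketch of a descending sort (column names are illustrative):
SortRowsStep(['year', 'rating'], ascending=False)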

Raises:

  • YeastValidationError: if any column does not exist or any column name is invalid.

DropDuplicateRowsStep

class yeast.steps.DropDuplicateRowsStep(columns=None, keep='first', role='all')

Step in charge of removing duplicate rows, optionally only considering certain columns.

Parameters:

  • columns: list of string column names to look for duplicates or a selector
  • keep (first, last, none): determines which duplicates (if any) to keep. first: drop duplicates except for the first occurrence. last: drop duplicates except for the last occurrence. none: drop all duplicates.
  • role: String name of the role to control baking flows on new data. Default: all.

Usage:

# Remove duplicates considering all columns, keep the first occurrence
DropDuplicateRowsStep()

# Remove duplicates considering columns B and C
DropDuplicateRowsStep(['B', 'C'], keep="none")

# Remove duplicates considering all columns starting with id_
DropDuplicateRowsStep(AllMatching('^id_'), keep="first")

Raises:

  • YeastValidationError: if any column does not exist or any column name is invalid.

Aggregations

GroupByStep

class yeast.steps.GroupByStep(columns, role='all')

Most data operations are done on groups defined by columns. GroupByStep takes an existing DataFrame and converts it into a Pandas DataFrameGroupBy where aggregation/summarization/mutation operations are performed "by group".

A groupby operation involves some combination of:

  • Splitting the object: GroupByStep()
  • Applying functions: SummarizeStep() or MutateStep()
  • Combining the results into a DataFrame

Parameters:

  • columns: list of string column names to group by or a selector
  • role: String name of the role to control baking flows on new data. Default: all.

Usage:

# Basic Group By and an Aggregation
recipe = Recipe([
    GroupByStep(['category', 'year']),
    SummarizeStep({
        'average_rating': AggMean('rating'),
        'unique_titles': AggCountDistinct('title')
    })
])

Raises:

  • YeastValidationError: if a column does not exist on the DataFrame

SummarizeStep

class yeast.steps.SummarizeStep(aggregations, role='all')

Create one or more numeric variables summarizing the columns of an existing group created by GroupByStep(), resulting in one row in the output for each group. Please refer to the Aggregations documentation for the complete list of supported aggregations. The most used ones are: AggMean, AggMedian, AggCount, AggMax, AggMin.

Parameters:

  • aggregations: dictionary with the aggregations to perform. The key is the new column name where the value is the specification of the aggregation to perform. For example: {'new_column_name': AggMean('column')}
  • role: String name of the role to control baking flows on new data. Default: all.

Usage:

# Basic Summarization on a Group
recipe = Recipe([
    GroupByStep(['category', 'year']),
    SummarizeStep({
        'average_rating': AggMean('rating'),
        'unique_titles': AggCountDistinct('title')
    })
])

Raises:

  • YeastValidationError: if there is no preceding GroupByStep

Imputation

MeanImputeStep

class yeast.steps.MeanImputeStep(selector, role='all')

Impute numeric data using the mean

MeanImputeStep estimates the variable means from the data during prepare, then replaces the NA values on new data sets using the calculated means.

Parameters:

  • selector: list of string column names or a selector to impute
  • role: String name of the role to control baking flows on new data. Default: all.

Usage:

# Impute the age and size columns using the mean from the training set
# Age :  20,  31,  65,  NA,  45,  23,  NA
# Size:   2,   5,   9,   3,   4,  NA,  NA
# to
# Age :  20,  31,  65, 36.8, 45,  23, 36.8 (mean=36.8)
# Size:   2,   5,   9,    3,  4, 4.6,  4.6 (mean=4.6)
MeanImputeStep(['age', 'size'])

# You can also use selectors:
MeanImputeStep(AllNumeric())
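
# A sketch of the prepare/bake flow described above
# (train_df and new_df are illustrative DataFrames):
recipe = Recipe([
    MeanImputeStep(['age', 'size'])
])
recipe = recipe.prepare(train_df)  # the means are estimated here
new_df = recipe.bake(new_df)       # NAs are replaced with the training means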

Raises:

  • YeastValidationError: if a column does not exist

MedianImputeStep

class yeast.steps.MedianImputeStep(selector, role='all')

Impute numeric data using the median

MedianImputeStep estimates the variable medians from the data during prepare, then replaces the NA values on new data sets using the calculated medians.

Parameters:

  • selector: list of string column names or a selector to impute
  • role: String name of the role to control baking flows on new data. Default: all.

Usage:

# Impute the age and size columns using the median from the training set
# Age :  20,  31,  65,  NA,  45,  23,  NA
# Size:   2,   5,   9,   3,   4,  NA,  NA
# to
# Age :  20,  31,  65,  31,  45,  23,  31 (median=31)
# Size:   2,   5,   9,   3,   4,   4,   4 (median=4)
MedianImputeStep(['age', 'size'])

# You can also use selectors:
MedianImputeStep(AllNumeric())

Raises:

  • YeastValidationError: if a column does not exist

ConstantImputeStep

class yeast.steps.ConstantImputeStep(selector, value, role='all')

Impute data using a constant value

ConstantImputeStep replaces all NA values in the columns with a constant value. This step does not validate the column data type before imputing, so you can end up with mixed types in a column.

Parameters:

  • selector: list of string column names or a selector to impute
  • value: constant value to replace with
  • role: String name of the role to control baking flows on new data. Default: all.

Usage:

# Numerical:
# Impute the age and size columns using zero as value
# Age :  20,  31,  65,  NA,  45,  23,  NA
# Size:   2,   5,   9,   3,   4,  NA,  NA
# to
# Age :  20,  31,  65,   0,  45,  23,   0
# Size:   2,   5,   9,   3,   4,   0,   0
ConstantImputeStep(['age', 'size'], value=0)

# Categorical:
# Impute the security column with "other"
# security: 'stock', 'bond', 'etf', 'mf', NA
# to
# security: 'stock', 'bond', 'etf', 'mf', 'other'
ConstantImputeStep(['security'], value='other')

# You can also use selectors:
ConstantImputeStep(AllNumeric(), value=0)

Raises:

  • YeastValidationError: if a column does not exist

Workflows

LeftJoinStep

class yeast.steps.LeftJoinStep(y, by=None, df=None, role='all')

Left Join two DataFrames together

Return all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

Parameters:

  • y: DataFrame or Recipe to merge with.
  • by: optional column name list to merge by. Default: None
  • df: optional df to be used as input if y is a Recipe
  • role: String name of the role to control baking flows on new data. Default: all.

Usage:

# Left Join with another DataFrame
# sales_df and client_df are DataFrames, by argument is optional
Recipe([
    LeftJoinStep(sales_df, by="client_id")
]).bake(client_df)

# Left join with the DataFrame obtained from the execution of a Recipe
# sales_recipe will be executed using sales_df inside the client_recipe execution
sales_recipe = Recipe([
    RenameColumnsStep({'client_id': 'cid'})
])

client_recipe = Recipe([
    LeftJoinStep(sales_recipe, by=["client_id", "region_id"], df=sales_df)
])

client_recipe.prepare(client_df).bake(client_df)

Raises:

  • YeastValidationError: if any of the validations is not correct.

RightJoinStep

class yeast.steps.RightJoinStep(y, by=None, df=None, role='all')

Right Join two DataFrames together

Return all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

Parameters:

  • y: DataFrame or Recipe to merge with.
  • by: optional column name list to merge by. Default: None
  • df: optional df to be used as input if y is a Recipe
  • role: String name of the role to control baking flows on new data. Default: all.

Usage:

# Right Join with another DataFrame
# sales_df and client_df are DataFrames, by argument is optional
Recipe([
    RightJoinStep(sales_df, by="client_id")
]).bake(client_df)

# Right join with the DataFrame obtained from the execution of a Recipe
# sales_recipe will be executed using sales_df inside the client_recipe execution
sales_recipe = Recipe([
    RenameColumnsStep({'client_id': 'cid'})
])

client_recipe = Recipe([
    RightJoinStep(sales_recipe, by=["client_id", "region_id"], df=sales_df)
])

client_recipe.prepare(client_df).bake(client_df)

Raises:

  • YeastValidationError: if any of the validations is not correct.

InnerJoinStep

class yeast.steps.InnerJoinStep(y, by=None, df=None, role='all')

Inner Join two DataFrames together

Return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combinations of the matches are returned.

Parameters:

  • y: DataFrame or Recipe to merge with.
  • by: optional column name list to merge by. Default: None
  • df: optional df to be used as input if y is a Recipe
  • role: String name of the role to control baking flows on new data. Default: all.

Usage:

# Inner Join with another DataFrame
# sales_df and client_df are DataFrames, by argument is optional
Recipe([
    InnerJoinStep(sales_df, by="client_id")
]).bake(client_df)

# Inner join with the DataFrame obtained from the execution of a Recipe
# sales_recipe will be executed using sales_df inside the client_recipe execution
sales_recipe = Recipe([
    RenameColumnsStep({'client_id': 'cid'})
])

client_recipe = Recipe([
    InnerJoinStep(sales_recipe, by=["client_id", "region_id"], df=sales_df)
])

client_recipe.prepare(client_df).bake(client_df)

Raises:

  • YeastValidationError: if any of the validations is not correct.

FullJoinStep

class yeast.steps.FullJoinStep(y, by=None, df=None, role='all')

Full Join two DataFrames together

Return all rows and all columns from both x and y. Where there are no matching values, NA is returned for the missing side.

Parameters:

  • y: DataFrame or Recipe to merge with.
  • by: optional column name list to merge by. Default: None
  • df: optional df to be used as input if y is a Recipe
  • role: String name of the role to control baking flows on new data. Default: all.

Usage:

# Full Outer Join with another DataFrame
# sales_df and client_df are DataFrames, by argument is optional
Recipe([
    FullJoinStep(sales_df, by="client_id")
]).bake(client_df)

# Full Outer join with the DataFrame obtained from the execution of a Recipe
# sales_recipe will be executed using sales_df inside the client_recipe execution
sales_recipe = Recipe([
    RenameColumnsStep({'client_id': 'cid'})
])

client_recipe = Recipe([
    FullJoinStep(sales_recipe, by=["client_id", "region_id"], df=sales_df)
])

client_recipe.prepare(client_df).bake(client_df)

Raises:

  • YeastValidationError: if any of the validations is not correct.

Extensions

CustomStep

class yeast.steps.CustomStep(to_prepare=None, to_bake=None, to_validate=None, role='all')

CustomStep was designed to extend the power of Yeast pipelines and cover the scenarios where the built-in steps are not adequate and you need to define your own operations: custom transformations, business rules, or extensions to third-party libraries. The usage is straightforward and designed to avoid spending too much time on the implementation. It expects between 1 and 3 arguments, all of them optional functions:

  • to_validate(step, df)
  • to_prepare(step, df): returns df
  • to_bake(step, df): returns df

Please note that to_prepare and to_bake must return a DataFrame so the pipeline execution can continue in further steps. CustomStep enables you to structure and document your code and business rules in Steps that can be shared across Recipes.

Parameters:

  • to_validate: perform validations on the data. Raise YeastValidationError on a problem.
  • to_prepare: prepare the step before bake, like train or calculate aggregations.
  • to_bake: execute the bake (processing). This is the core method.
  • role: String name of the role to control baking flows on new data. Default: all.

Inline Usage:

recipe = Recipe([
    # Custom Business Rules:
    CustomStep(to_bake=lambda step, df: df.assign(sales=df['sales'].fillna(0)))
])

Custom rules:

def my_bake(step, df):
    # Calculate total sales or anything you need:
    df['total_sales'] = df['sales'] + df['fees']
    return df

recipe = Recipe([
    # Custom Business Rules:
    CustomStep(to_bake=my_bake)
])

Custom Checks and Validations:

def my_validate(step, df):
    if 'sales' not in df.columns:
        raise YeastValidationError('sales column not found')
    if 'fees' not in df.columns:
        raise YeastValidationError('fees column not found')

recipe = Recipe([
    CustomStep(to_validate=my_validate, to_bake=my_bake)
])

Define the Estimation/Preparation procedure:

def my_preparation(step, df):
    step.mean_sales = df['sales'].mean()

def my_bake(step, df):
    df['sales_deviation'] = df['sales'] - step.mean_sales
    return df

recipe = Recipe([
    CustomStep(to_prepare=my_preparation, to_bake=my_bake)
])

Creating a custom step inheriting from CustomStep:

class MyCustomStep(CustomStep):

    def do_validate(self, df):
        # Some validations that could raise YeastValidationError
        pass

    def do_prepare(self, df):
        # Prepare the step if needed
        return df

    def do_bake(self, df):
        # Logic to process the df
        return df

recipe = Recipe([
    MyCustomStep()
])

Raises:

  • YeastValidationError: if any of the parameters is defined but not callable.