Methods for Creating and Transforming Variables¶
Besides selecting sets of existing columns, it’s often useful to add new columns that are functions of existing columns, or to modify the values of existing ones. This is the job of MutateStep().
This step takes a dictionary that maps column names (keys) to the transformers (values) that should be applied to them. Later entries may refer to columns created by earlier entries in the same step.
The most basic usage is the following:
recipe = Recipe([
    MutateStep({
        # Column "fullname" from: "JONATHAN ARCHER" to "Jonathan Archer"
        'fullname': StrToTitle()
    })
])
You can also pass the name of the column to transform explicitly:
# Column "fullname" from: "JONATHAN ARCHER" to "Jonathan Archer"
MutateStep({'fullname': StrToTitle('fullname')})
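To make the ordering rule concrete, here is a minimal plain-Python sketch of the idea behind a mutate step. It does not use yeast: `mutate`, `rows`, and the lambda transformers are hypothetical names for illustration, and it assumes entries are applied in dictionary order so that a later entry can read a column an earlier one produced.

```python
def mutate(rows, transformers):
    """Return new rows with each column set by its transformer, in dict order."""
    result = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        for column, transform in transformers.items():
            # Each transformer sees the columns written by earlier entries
            new_row[column] = transform(new_row)
        result.append(new_row)
    return result


rows = [{"fullname": "JONATHAN ARCHER"}]
mutated = mutate(rows, {
    "fullname": lambda row: row["fullname"].title(),
    # Refers to the already-transformed "fullname" column
    "initial": lambda row: row["fullname"][0],
})
print(mutated)  # [{'fullname': 'Jonathan Archer', 'initial': 'J'}]
```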
Moreover, you can chain several transformations on the same column by passing a list:
# Let's transform/create some variables:
MutateStep({
    # Transform the "name" column
    'name': [
        # "JONATHAN ARCHER" to "Jonathan Archer"
        StrToTitle('name'),
        # " Data " to "Data"
        StrTrim('name'),
        # "Philippa  Georgiou" to "Philippa Georgiou" (double space to single space)
        StrReplace('  ', ' ', 'name'),
        # "Jean--Luc PICARD" to "Jean-Luc Picard"
        StrReplaceAll('--', '-', 'name')
    ],
    'rank': StrToTitle('rank')
})
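The chain above can be pictured as function composition. The sketch below uses only Python's built-in string methods, not yeast itself; `apply_chain` is a hypothetical helper, and it assumes the list is applied in order, with each result fed into the next transformation.

```python
from functools import reduce


def apply_chain(value, functions):
    """Apply each function in order, feeding each result into the next."""
    return reduce(lambda acc, fn: fn(acc), functions, value)


chain = [
    str.title,                          # comparable to StrToTitle
    str.strip,                          # comparable to StrTrim
    lambda s: s.replace("  ", " ", 1),  # first occurrence only, like StrReplace
    lambda s: s.replace("--", "-"),     # every occurrence, like StrReplaceAll
]

cleaned = apply_chain(" Jean--Luc  PICARD ", chain)
print(cleaned)  # Jean-Luc Picard
```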
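Currently, the transformers are grouped into the following categories:
- General Transformers: replace specified values with new values.
- String Transformers: a cohesive set of transformers designed to make working with strings as easy as possible.
- Rank Transformers: return the sample ranks of the values in a column.
- Date Transformers: return components of a Date or DateTime column.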
General Transformers
- MapValues: Replace specified values with new values.
String Transformers
String Transformers provide a cohesive set of transformers designed to make working with strings as easy as possible:
- StrToUpper: Convert to UPPER CASE
- StrToLower: Convert to lower case
- StrToSentence: Convert to Sentence case
- StrToTitle: Convert to Title Case
- StrTrim: Remove whitespace from the start and end of a string
- StrReplace: Replace first occurrence of pattern
- StrReplaceAll: Replace all occurrences of pattern
- StrPad: Pad a string
- StrSlice: Extract and replace substrings
- StrRemove: Remove first matched pattern
- StrRemoveAll: Remove all matched patterns
- StrContains: Test if a pattern is contained in a string column
Rank Transformers
Returns the sample ranks of the values in a column:
- Rank / RankTransformer: Return the sample ranks of the values
- RowNumber: Return the row number
- RankFirst: Increasing rank values within each set of ties
- RankMin: Ties are replaced by their minimum rank
- RankMax: Ties are replaced by their maximum rank
- RankDense: Like RankMin but with no gaps between ranks
- RankMean: Ties are replaced by their mean/average rank
- RankPercent: A number between 0 and 1 computed by rescaling RankMin to [0, 1]
Date Transformers
Returns components of a Date or DateTime column:
- DateYear: Get the year
- DateQuarter: Get the quarter
- DateMonth: Get the month
- DateWeek: Get the week
- DateDay: Get the day
- DateDayOfWeek: Get the day of the week where Monday=0 and Sunday=6.
- DateDayOfYear: Get the day of the year.
- DateHour: Get the hour
- DateMinute: Get the minute
- DateSecond: Get the second
General Transformers¶
MapValues¶
yeast.transformers.MapValues(mapping, column=None)
Replace specified values with new values.
# Map String/Categorical values
# Replace old_value with new_value
MapValues({'old_value': 'new_value', ...})
# Map Numerical values
# Replace 90 with NaN
MapValues({90: np.nan})
Parameters:
- mapping: Specify different replacement values for different existing values. For example, {'old': 'new'} replaces the value old with new.
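As a rough illustration of what a value mapping does, the following plain-Python sketch looks each value up in a dictionary. It is not yeast's implementation; `map_values` is a hypothetical helper, and it assumes values missing from the mapping are left unchanged.

```python
import math


def map_values(values, mapping):
    """Replace each value found in mapping; keep all other values as they are."""
    return [mapping.get(value, value) for value in values]


labels = map_values(["old_value", "other"], {"old_value": "new_value"})
scores = map_values([85, 90, 72], {90: math.nan})

print(labels)  # ['new_value', 'other']
print(scores)  # [85, nan, 72]
```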
String Transformers¶
StrToUpper¶
yeast.transformers.StrToUpper(column=None)
Convert a string to upper case: ("Yeast" to "YEAST")
StrToLower¶
yeast.transformers.StrToLower(column=None)
Convert a string to lower case: ("Yeast" to "yeast")
StrToSentence¶
yeast.transformers.StrToSentence(column=None)
Convert the first character to uppercase and the remaining characters to lowercase: ("yeast help" to "Yeast help")
StrToTitle¶
yeast.transformers.StrToTitle(column=None)
Convert the first character of each word to uppercase and the remaining characters to lowercase: ("yeast help" to "Yeast Help")
StrTrim¶
yeast.transformers.StrTrim(column=None)
Remove whitespace from the start and end of a string: (" Yeast " to "Yeast")
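For readers who know Python's built-in string methods, the case and trim conversions above behave like the following calls on a single value. This is only an analogy in plain Python, not yeast code, and it reuses the inputs from the descriptions above.

```python
text = "yeast help"

upper = text.upper()          # comparable to StrToUpper
lower = "Yeast".lower()       # comparable to StrToLower
sentence = text.capitalize()  # comparable to StrToSentence
title = text.title()          # comparable to StrToTitle
trimmed = " Yeast ".strip()   # comparable to StrTrim

print(upper, lower, sentence, title, trimmed, sep=" | ")
# YEAST HELP | yeast | Yeast help | Yeast Help | Yeast
```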
StrReplace¶
yeast.transformers.StrReplace(pattern, replacement, column=None)
Replace the first occurrence of the matched pattern in a string: 'Y' to 'X' ("YYY" to "XYY")
Parameters:
- pattern: Pattern or string to look for.
- replacement: The replacement string.
StrReplaceAll¶
yeast.transformers.StrReplaceAll(pattern, replacement, column=None)
Replace all occurrences of the matched pattern in a string: 'Y' to 'X' ("YYY" to "XXX")
Parameters:
- pattern: Pattern or string to look for.
- replacement: The replacement string.
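The difference between replacing the first match and replacing every match can be reproduced with Python's `re.sub` and its `count` argument. This is a sketch with the standard library only, not yeast's implementation, using the same 'Y' to 'X' example as above.

```python
import re

# count=1 stops after the first match; the default replaces every match
first_only = re.sub("Y", "X", "YYY", count=1)
every_match = re.sub("Y", "X", "YYY")

print(first_only)   # XYY
print(every_match)  # XXX
```

In plain Python, removing a pattern is the same call with an empty replacement string.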
StrPad¶
yeast.transformers.StrPad(width, side='left', pad=' ', column=None)
Pad a string: 'Y' to 4 chars, left and '0' ("Y" to "000Y")
Parameters:
- width: Minimum width of padded strings.
- side: Side on which the padding character is added (left, right or both).
- pad: Single padding character (default is a space).
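The three `side` options map naturally onto Python's `rjust`, `ljust`, and `center`. The helper below is a hypothetical stand-in written for illustration, not yeast's own code; how yeast splits an odd amount of padding for `both` is not stated here, so that case is shown without an expected result.

```python
def pad_text(text, width, side="left", pad=" "):
    """Pad text to at least width characters on the chosen side."""
    if side == "left":
        return text.rjust(width, pad)   # padding goes on the left
    if side == "right":
        return text.ljust(width, pad)   # padding goes on the right
    if side == "both":
        return text.center(width, pad)  # padding is split across both sides
    raise ValueError(f"unknown side: {side!r}")


left = pad_text("Y", 4, side="left", pad="0")
right = pad_text("Y", 4, side="right", pad="0")
both = pad_text("Y", 4, side="both", pad="0")

print(left)   # 000Y
print(right)  # Y000
```

Because `width` is a minimum, a string that is already long enough comes back unchanged.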
StrSlice¶
yeast.transformers.StrSlice(start, stop, column=None)
Extract and replace substrings from a string:
StrSlice(start=6, stop=10) # "Yeast Help" to "Help"
Parameters:
- start: integer position of the first character
- stop: integer position of the last character
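The "Yeast Help" example lines up with ordinary Python slicing, where positions start at 0 and the stop position is excluded. The snippet below only demonstrates that convention in plain Python; whether yeast follows it in every case is an assumption.

```python
text = "Yeast Help"

# Characters at positions 6, 7, 8 and 9; position 10 is excluded
piece = text[6:10]
print(piece)  # Help
```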
StrRemove¶
yeast.transformers.StrRemove(pattern, column=None)
Remove the first matched pattern in a string:
StrRemove("_temp") # "Yeast_temp" to "Yeast"
Parameters:
- pattern: Pattern or string to look for.
StrRemoveAll¶
yeast.transformers.StrRemoveAll(pattern, column=None)
Remove all matched patterns in a string:
StrRemoveAll("_temp") # "Yeast_temp_temp" to "Yeast"
Parameters:
- pattern: Pattern or string to look for.
StrContains¶
yeast.transformers.StrContains(pattern, column=None, case=True, regex=True)
Test if a pattern or regex is contained within a string column. Returns a boolean variable (True or False).
MutateStep({
    'feature': StrContains("_temp", column="text", case=True, regex=True)
}),
# You can convert to numerical (0 and 1) with:
CastStep({'feature': 'integer'})
Parameters:
- pattern: Pattern or string to look for.
- case: If True, the match is case sensitive.
- regex: If True, treats the pattern as a regular expression. If False, treats the pattern as a literal string.
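The interaction of `case` and `regex` can be sketched with Python's `re` module. The function below is a hypothetical single-value stand-in, not yeast's implementation: it turns off case sensitivity with `re.IGNORECASE` and uses `re.escape` to treat the pattern literally.

```python
import re


def contains(text, pattern, case=True, regex=True):
    """Return True if pattern occurs anywhere in text."""
    flags = 0 if case else re.IGNORECASE
    if not regex:
        pattern = re.escape(pattern)  # match special characters literally
    return re.search(pattern, text, flags) is not None


print(contains("sensor_temp", "_temp"))              # True
print(contains("sensor_TEMP", "_temp"))              # False
print(contains("sensor_TEMP", "_temp", case=False))  # True
print(contains("axb", "a.b"))                        # True, "." matches any character
print(contains("axb", "a.b", regex=False))           # False, "." must be a literal dot
```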
Rank Transformers¶
yeast.transformers.RankTransformer(column=None, ties_method='first', ascending=True, percentage=False)
Returns the sample ranks of the values in the column. Ties (i.e., equal values) and missing values can be handled in several ways.
Ties Methods:
The first method results in a permutation with increasing values at each index set of ties. average replaces them by their mean, and max and min replace them by their maximum and minimum respectively. dense is like min, but with no gaps between ranks.
Parameters:
- column: name of the column used to rank values
- ties_method: string specifying how ties are treated: {'average', 'min', 'max', 'first', 'dense'}
- ascending: boolean specifying whether values are ranked in ascending order
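The ties methods are easiest to compare side by side on a small input. The following is a plain-Python sketch of the five methods as described above, not yeast's implementation; `rank_values` is a hypothetical name, ranks start at 1, and only ascending order is covered.

```python
def rank_values(values, ties_method="first"):
    """Rank values in ascending order, starting at 1."""
    # Stable sort: tied values keep their original relative order
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    dense_rank = 0
    pos = 0
    while pos < len(order):
        # Find the last position of the group of ties starting at pos
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        dense_rank += 1
        low, high = pos + 1, end + 1  # smallest and largest rank in the group
        for offset, index in enumerate(order[pos:end + 1]):
            if ties_method == "first":
                ranks[index] = low + offset
            elif ties_method == "min":
                ranks[index] = low
            elif ties_method == "max":
                ranks[index] = high
            elif ties_method == "average":
                ranks[index] = (low + high) / 2
            elif ties_method == "dense":
                ranks[index] = dense_rank
            else:
                raise ValueError(f"unknown ties_method: {ties_method!r}")
        pos = end + 1
    return ranks


values = [10, 20, 10, 30]
print(rank_values(values, "first"))    # [1, 3, 2, 4]
print(rank_values(values, "min"))      # [1, 3, 1, 4]
print(rank_values(values, "max"))      # [2, 3, 2, 4]
print(rank_values(values, "average"))  # [1.5, 3.0, 1.5, 4.0]
print(rank_values(values, "dense"))    # [1, 2, 1, 3]
```

Note how `min` leaves a gap (no rank 2) after the tied pair, while `dense` does not.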
RowNumber¶
yeast.transformers.RowNumber(column=None, ascending=True)
Creates/transforms a variable containing the row number.
Parameters:
- ascending: boolean specifying whether row numbers are assigned in ascending order
- column: column used to sort/arrange and rank values
RankFirst¶
yeast.transformers.RankFirst(column=None, ascending=True)
Increasing rank values within each set of ties.
Parameters:
- ascending: boolean specifying whether values are ranked in ascending order
- column: column used to sort/arrange and rank values
RankMin¶
yeast.transformers.RankMin(column=None, ascending=True)
Ties are replaced by their minimum rank.
Parameters:
- ascending: boolean specifying whether values are ranked in ascending order
- column: column used to sort/arrange and rank values
RankMax¶
yeast.transformers.RankMax(column=None, ascending=True)
Ties are replaced by their maximum rank.
Parameters:
- ascending: boolean specifying whether values are ranked in ascending order
- column: column used to sort/arrange and rank values
RankDense¶
yeast.transformers.RankDense(column=None, ascending=True)
Ties are replaced by their minimum rank, like RankMin, but with no gaps between ranks.
Parameters:
- ascending: boolean specifying whether values are ranked in ascending order
- column: column used to sort/arrange and rank values
RankMean¶
yeast.transformers.RankMean(column=None, ascending=True)
Ties are replaced by their mean/average rank.
Parameters:
- ascending: boolean specifying whether values are ranked in ascending order
- column: column used to sort/arrange and rank values
RankPercent¶
yeast.transformers.RankPercent(column=None, ascending=True)
A number between 0 and 1 computed by rescaling RankMin to [0, 1].
Parameters:
- ascending: boolean specifying whether values are ranked in ascending order
- column: column used to sort/arrange and rank values
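One common way to rescale a minimum rank to [0, 1] is `(rank - 1) / (n - 1)`, the convention used by dplyr's `percent_rank`. The sketch below assumes that formula; it is not taken from yeast's source, and `rank_min` and `rank_percent` are hypothetical helpers.

```python
def rank_min(values):
    """Minimum rank: 1 plus the number of strictly smaller values."""
    return [1 + sum(other < value for other in values) for value in values]


def rank_percent(values):
    """Rescale minimum ranks to [0, 1] using (rank - 1) / (n - 1)."""
    n = len(values)
    if n < 2:
        return [0.0] * n  # avoid dividing by zero for empty or single-item input
    return [(rank - 1) / (n - 1) for rank in rank_min(values)]


percentages = rank_percent([10, 20, 10, 30])
print(rank_min([10, 20, 10, 30]))  # [1, 3, 1, 4]
print(percentages)                 # 0.0 for both 10s, about 0.667 for 20, 1.0 for 30
```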
Date Transformers¶
DateYear¶
yeast.transformers.DateYear(column=None)
Extract the year of a Date column.
DateQuarter¶
yeast.transformers.DateQuarter(column=None)
Extract the quarter of a Date column.
DateMonth¶
yeast.transformers.DateMonth(column=None)
Extract the month of a Date column.
DateWeek¶
yeast.transformers.DateWeek(column=None)
Extract the week of a Date column.
DateDay¶
yeast.transformers.DateDay(column=None)
Extract the day of a Date column.
DateDayOfWeek¶
yeast.transformers.DateDayOfWeek(column=None)
Extract the day of the week of a Date column, where Monday=0 and Sunday=6.
DateDayOfYear¶
yeast.transformers.DateDayOfYear(column=None)
Extract the day of the year of a Date column.
DateHour¶
yeast.transformers.DateHour(column=None)
Extract the hour of a Date column.
DateMinute¶
yeast.transformers.DateMinute(column=None)
Extract the minute of a Date column.
DateSecond¶
yeast.transformers.DateSecond(column=None)
Extract the second of a Date column.
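To show what each component looks like for a single value, the sketch below extracts the same parts from a standard-library `datetime`. It is not yeast code: the quarter formula and the use of the ISO 8601 week number are assumptions, since the week convention is not specified above.

```python
from datetime import datetime

moment = datetime(2020, 3, 15, 13, 45, 30)  # a Sunday

components = {
    "year": moment.year,
    "quarter": (moment.month - 1) // 3 + 1,     # months 1-3 give quarter 1, and so on
    "month": moment.month,
    "week": moment.isocalendar()[1],            # ISO 8601 week number (assumed convention)
    "day": moment.day,
    "day_of_week": moment.weekday(),            # Monday=0 ... Sunday=6
    "day_of_year": moment.timetuple().tm_yday,  # 1 for January 1st
    "hour": moment.hour,
    "minute": moment.minute,
    "second": moment.second,
}

for name, value in components.items():
    print(f"{name}: {value}")
```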