Join two DataFrames together
It’s rare that a data analysis involves only a single DataFrame. In practice, you’ll normally have many tables that contribute to an analysis, and you need flexible tools to combine them.
df_stocks
ticker name market
0 'APPL' 'Apple' 'NASDAQ'
...
df_prices
ticker date price
0 'APPL' '28-02-2020' 2219
1 'APPL' '30-03-2020' 2203
2 'APPL' '30-04-2020' 3322
...
recipe = Recipe([
LeftJoinStep(df_prices, by="ticker")
])
recipe.bake(df_stocks)
ticker name market date price
0 'APPL' 'Apple' 'NASDAQ' '28-02-2020' 2219
0 'APPL' 'Apple' 'NASDAQ' '30-03-2020' 2203
0 'APPL' 'Apple' 'NASDAQ' '30-04-2020' 3322
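Yeast supports four types of joins: left, right, inner and full. In the descriptions below, x is the DataFrame passed to the recipe and y is the argument given to the join step: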
LeftJoinStep(y)
Return all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
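The left-join rules above can be sketched in plain Python, with lists of dicts standing in for DataFrames and None standing in for NA. This is only an illustration of the semantics, not Yeast's implementation; the helper name left_join and the MSFT row are assumptions made for the example.

```python
# Plain-Python sketch of left-join semantics (not Yeast's implementation).
def left_join(x, y, by):
    """Keep every row of x; fill y's columns with None when nothing matches."""
    # Columns that exist only in y, in first-seen order.
    y_cols = list(dict.fromkeys(c for row in y for c in row if c != by))
    result = []
    for x_row in x:
        matches = [y_row for y_row in y if y_row[by] == x_row[by]]
        if matches:
            # One output row per matching pair (all combinations).
            result.extend({**x_row, **y_row} for y_row in matches)
        else:
            # No match: keep the x row and mark y's columns as missing.
            result.append({**x_row, **{c: None for c in y_cols}})
    return result


stocks = [
    {"ticker": "APPL", "name": "Apple"},
    {"ticker": "MSFT", "name": "Microsoft"},  # hypothetical row with no prices
]
prices = [
    {"ticker": "APPL", "price": 2219},
    {"ticker": "APPL", "price": 2203},
]

joined = left_join(stocks, prices, by="ticker")
for row in joined:
    print(row)
```

The Apple row appears twice, once per matching price, and the unmatched row is kept with price set to None.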
RightJoinStep(y)
Return all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
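A right join mirrors the left join: the loop runs over y instead of x. The sketch below shows this in plain Python with lists of dicts as stand-ins for DataFrames; right_join and the GOOG row are hypothetical names and data used only for illustration, not part of Yeast.

```python
# Plain-Python sketch of right-join semantics (not Yeast's implementation).
def right_join(x, y, by):
    """Keep every row of y; fill x's columns with None when nothing matches."""
    # Columns that exist only in x, in first-seen order.
    x_cols = list(dict.fromkeys(c for row in x for c in row if c != by))
    result = []
    for y_row in y:
        matches = [x_row for x_row in x if x_row[by] == y_row[by]]
        if matches:
            # One output row per matching pair (all combinations).
            result.extend({**x_row, **y_row} for x_row in matches)
        else:
            # No match: keep the y row and mark x's columns as missing.
            result.append({by: y_row[by], **{c: None for c in x_cols}, **y_row})
    return result


stocks = [{"ticker": "APPL", "name": "Apple"}]
prices = [
    {"ticker": "APPL", "price": 2219},
    {"ticker": "GOOG", "price": 1400},  # hypothetical price with no stock row
]

joined = right_join(stocks, prices, by="ticker")
for row in joined:
    print(row)
```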
InnerJoinStep(y)
Return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combinations of the matches are returned.
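An inner join keeps only the matching pairs, so it reduces to a filtered cross product. The following is a minimal plain-Python sketch of that idea; inner_join and the sample rows other than the Apple prices are assumptions for the example, not Yeast code.

```python
# Plain-Python sketch of inner-join semantics (not Yeast's implementation).
def inner_join(x, y, by):
    """Return one merged row for every (x_row, y_row) pair with equal keys."""
    return [
        {**x_row, **y_row}
        for x_row in x
        for y_row in y
        if x_row[by] == y_row[by]
    ]


stocks = [
    {"ticker": "APPL", "name": "Apple"},
    {"ticker": "MSFT", "name": "Microsoft"},  # hypothetical, no prices
]
prices = [
    {"ticker": "APPL", "price": 2219},
    {"ticker": "APPL", "price": 2203},
    {"ticker": "GOOG", "price": 1400},  # hypothetical, no stock row
]

joined = inner_join(stocks, prices, by="ticker")
for row in joined:
    print(row)
```

Only the two Apple rows survive: unmatched rows from either side are dropped.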
FullJoinStep(y)
Return all rows and all columns from both x and y. Where there are no matching values, NA is returned for the columns of the missing side.
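A full join can be pictured as a left join followed by the rows of y that never matched. The sketch below illustrates this in plain Python, using None for NA; full_join and the MSFT and GOOG rows are hypothetical and are not taken from Yeast.

```python
# Plain-Python sketch of full-join semantics (not Yeast's implementation).
def full_join(x, y, by):
    """Keep every row of x and of y; missing values become None."""
    x_cols = list(dict.fromkeys(c for row in x for c in row if c != by))
    y_cols = list(dict.fromkeys(c for row in y for c in row if c != by))
    x_keys = {row[by] for row in x}
    result = []
    # Left-join part: every x row, with its matches or with None for y's columns.
    for x_row in x:
        matches = [y_row for y_row in y if y_row[by] == x_row[by]]
        if matches:
            result.extend({**x_row, **y_row} for y_row in matches)
        else:
            result.append({**x_row, **{c: None for c in y_cols}})
    # Remaining part: y rows whose key never appears in x.
    for y_row in y:
        if y_row[by] not in x_keys:
            result.append({by: y_row[by], **{c: None for c in x_cols}, **y_row})
    return result


stocks = [
    {"ticker": "APPL", "name": "Apple"},
    {"ticker": "MSFT", "name": "Microsoft"},  # hypothetical, no prices
]
prices = [
    {"ticker": "APPL", "price": 2219},
    {"ticker": "GOOG", "price": 1400},  # hypothetical, no stock row
]

joined = full_join(stocks, prices, by="ticker")
for row in joined:
    print(row)
```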
Joining with the result of a Recipe
In complex scenarios you may want to join against the output of another recipe that has not been executed yet. All join steps accept a Recipe as the y argument, with the df argument supplying the DataFrame that recipe will run on:
# Left join with the DataFrame obtained from the execution of another Recipe
# Recipe to prepare the prices dataset
prices_recipe = Recipe([
SortStep(['ticker', 'date'])
])
# Recipe to prepare the stocks data:
stocks_recipe = Recipe([
LeftJoinStep(prices_recipe, by="ticker", df=df_prices)
])
# `prices_recipe` will be executed using `df_prices` inside the `stocks_recipe` execution
stocks_recipe.prepare(df_stocks).bake(df_stocks)
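The deferred execution described in the comment above can be pictured with a toy model. Everything in this sketch (ToyRecipe, toy_sort, toy_left_join, the ISO-formatted dates) is a hypothetical stand-in written for illustration; it does not reproduce Yeast's classes or internals, only the idea that the inner recipe runs on df when the outer recipe is executed.

```python
# Toy model of "a recipe as the y argument" (hypothetical, not Yeast's code).
class ToyRecipe:
    def __init__(self, steps):
        self.steps = steps

    def bake(self, rows):
        # Apply each step in order to the incoming rows.
        for step in self.steps:
            rows = step(rows)
        return rows


def toy_sort(keys):
    """Step that sorts rows by the given keys."""
    return lambda rows: sorted(rows, key=lambda r: tuple(r[k] for k in keys))


def toy_left_join(y, by, df=None):
    """Step that left-joins the incoming rows with y (rows or a ToyRecipe)."""
    def step(x):
        # If y is a recipe, run it on df only now, while the outer recipe runs.
        y_rows = y.bake(df) if isinstance(y, ToyRecipe) else y
        out = []
        for x_row in x:
            matches = [r for r in y_rows if r[by] == x_row[by]]
            out.extend({**x_row, **r} for r in matches)
            if not matches:
                # For brevity, unmatched rows are kept without None-filled columns.
                out.append(dict(x_row))
        return out
    return step


stocks = [{"ticker": "APPL", "name": "Apple"}]
prices = [  # deliberately unsorted; dates use ISO format so string sort is chronological
    {"ticker": "APPL", "date": "2020-04-30", "price": 3322},
    {"ticker": "APPL", "date": "2020-02-28", "price": 2219},
]

prices_recipe = ToyRecipe([toy_sort(["ticker", "date"])])
stocks_recipe = ToyRecipe([toy_left_join(prices_recipe, by="ticker", df=prices)])

result = stocks_recipe.bake(stocks)
for row in result:
    print(row)
```

The joined rows come out in date order because the sort step of the inner recipe ran on the prices before the join took place.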