Pandas | merge_ordered method
Start your free 7-days trial now!
Pandas merge_ordered(~)
method joins two DataFrames with the option to perform filling or interpolation.
Parameters
1. left
link | DataFrame
The left DataFrame to perform the join on.
2. right
link | DataFrame
The right DataFrame to perform the join on.
3. on
| string
or list
The label of the columns to join on.
The on
parameter is there only for convenience. If the columns to join on have different labels, then you must specify left_on
and right_on
.
4. left_on
link | string
or array-like
The label of the column in left
to perform join on.
5. right_on
link | string
or array-like
The label of the column in right
to perform join on.
6. left_by
link | string
or list<string>
The label(s) of the columns in left
to "expand" on. See example below for clarification.
7. right_by
| string
or list<string>
The label(s) of the columns in right
to "expand" on. See example below for clarification.
8. fill_method
link | string
or None
| optional
How to fill NaN
in the merged DataFrame:
Value | Description |
---|---|
| Use the previous non- |
| Leave |
By default, fill_method=None
.
9. suffixes
link | tuple
of (string, string)
| optional
The suffix names to append to the duplicate column labels in the resulting DataFrame. You can also pass a single None
instead of a string in suffixes
to indicate that the left or right column label should be left as is. By default, suffixes=("_x", "_y")
.
10. how
link | string
| optional
The type of join to perform:
Value | Description |
---|---|
| All rows from the source DataFrame will be present in the resulting DataFrame. This is the SQL equivalent of a left-join. |
| All rows from the right DataFrame will be present in the resulting DataFrame. This is the SQL equivalent of a right-join. |
| All rows from the source and right DataFrame will be present in the resulting DataFrame. This is the SQL equivalent of an outer-join. |
| All rows that have matching values in both the source and right DataFrame will be present in the resulting DataFrame. This is the SQL equivalent to inner-join. |
By default, how="outer"
.
Here's the classic Venn Diagram illustrating the differences:
Return Value
A merged DataFrame
.
Examples
Basic usage
Consider a shop with some data about their products and customers:
[df_products] | [df_customers] product bought_by | name ageA computer bob | 0 alex 10B smartphone alex | 1 bob 20C headphones david | 2 cathy 30
To perform an outer join on two DataFrames based on columns bought_by
and name
:
pd.merge_ordered(df_products, df_customers, left_on="bought_by", right_on="name", how="outer")
product bought_by name age0 smartphone alex alex 10.01 computer bob bob 20.02 NaN NaN cathy 30.03 headphones david NaN NaN
Specifying fill_method
Unlike merge(~)
, merge_ordered(~)
allows to fill missing values that arise due to the join.
Again, consider the same DataFrames as above:
[df_products] | [df_customers] product bought_by | name ageA computer bob | 0 alex 10B smartphone alex | 1 bob 20C headphones david | 2 cathy 30
By default, fill_method=None
, which means that the resulting NaN
are left as is:
pd.merge_ordered(df_products, df_customers, left_on="bought_by", right_on="name", how="outer")
product bought_by name age0 smartphone alex alex 10.01 computer bob bob 20.02 NaN NaN cathy 30.03 headphones david NaN NaN
To fill those NaN
, set fill_method="ffill"
like so:
pd.merge_ordered(df_products, df_customers, left_on="bought_by", right_on="name", how="outer", fill_method="ffill")
product bought_by name age0 smartphone alex alex 101 computer bob bob 202 computer bob cathy 303 headphones david cathy 30
Notice how all the NaN
are filled with the previous non-NaN
value.
Note that this example is just to illustrate how the filling works - we will never perform such fillings. A practical use case of this filling logic is reserved mainly for Time-series when it makes more sense to fill with the previously recorded datetime.
Specifying left_by
Let us use the same example as above:
[df_products] | [df_customers] product bought_by | name ageA computer bob | 0 alex 10B smartphone alex | 1 bob 20C headphones david | 2 cathy 30
By default, left_by=None
, which means that resulting DataFrame is constructed using a traditional join:
pd.merge_ordered(df_products, df_customers, left_on="bought_by", right_on="name", how="outer")
product bought_by name age0 smartphone alex alex 10.01 computer bob bob 20.02 NaN NaN cathy 30.03 headphones david NaN NaN
Setting left_by="product"
will repeat each product item for every row in the joined key (bought_by
):
pd.merge_ordered(df_products, df_customers, left_on="bought_by", right_on="name", how="outer", left_by="product")
product bought_by age name0 computer NaN 10.0 alex1 computer bob 20.0 bob2 computer NaN 30.0 cathy3 smartphone alex 10.0 alex4 smartphone NaN 20.0 bob5 smartphone NaN 30.0 cathy6 headphones NaN 10.0 alex7 headphones NaN 20.0 bob8 headphones NaN 30.0 cathy9 headphones david NaN NaN
Specifying suffixes
Consider the following DataFrames:
[df_products] | [df_customers] product age bought_by | name ageA computer 7 bob | 0 alex 10B smartphone 8 alex | 1 bob 20C headphones 9 david | 2 cathy 30
Notice how the two DataFrames have an overlapping column label - age
.
By default, suffixes=("_x","_y")
, which means that if the merged DataFrame has overlapping column labels, then the suffix "_x"
will be appended to the overlapping column label of the left DataFrame, and "_y"
to the right DataFrame:
pd.merge_ordered(df_products, df_customers, left_on="bought_by", right_on="name", how="outer")
product age_x bought_by name age_y0 smartphone 8.0 alex alex 101 computer 7.0 bob bob 20...
We can specify our own suffixes like so:
pd.merge_ordered(df_products, df_customers, left_on="bought_by", right_on="name", how="outer", suffixes=["_A","_B"])
product age_A bought_by name age_B0 smartphone 8.0 alex alex 101 computer 7.0 bob bob 20...
You can also pass a None
instead of a string to indicate that the left or right column label should be left as is:
pd.merge_ordered(df_products, df_customers, left_on="bought_by", right_on="name", how="outer", suffixes=[None,"_B"])
product age bought_by name age_B0 smartphone 8.0 alex alex 101 computer 7.0 bob bob 20...