Pandas DataFrame | merge method
Start your free 7-days trial now!
Pandas DataFrame.merge(~)
method merges the source DataFrame with another DataFrame or a named Series.
Parameters
1. right
| DataFrame
or named Series
The DataFrame or Series to merge the source DataFrame with.
2. how
| string
| optional
The type of merge to perform:
Value | Description |
---|---|
All rows from the source DataFrame will be present in the resulting DataFrame. This is the SQL equivalent of a left-join. | |
All rows from the right DataFrame will be present in the resulting DataFrame. This is the SQL equivalent of a right-join. | |
All rows from the source and right DataFrame will be present in the resulting DataFrame. This is the SQL equivalent of an outer-join. | |
All rows that have matching values in both the source and right DataFrame will be present in the resulting DataFrame. This is the SQL equivalent to inner-join. |
By default, how="inner"
.
Here's the classic Venn Diagram illustrating the differences:
3. on
| string
or list
| optional
The label of the columns or index-levels to perform the join on. This only works if both left and right DataFrames have this same label. By default, on=None
, which means that an inner join will be performed.
The on
parameter is there only for convenience. If the columns to join on have different labels, then you must use left_on
, right_on
, left_index
and right_index
instead.
4. left_on
| string
or array-like
| optional
The label(s) of the column or index level of the left DataFrame to perform the join on.
5. right_on
| string
or array-like
| optional
The label(s) of the column or index level of the right DataFrame to perform the join on.
6. left_index
| boolean
| optional
Whether or not to perform the join on the index of the left DataFrame. By default, left_index=False
.
7. right_index
| boolean
| optional
Whether or not to perform the join on the index of the right DataFrame. By default, right_index=False
.
Many textbooks and documentation use the words merge keys or join keys to denote the columns by which a join is performed.
8. sort
link | boolean
| optional
Whether or not to sort the rows based on the join key. By default, sort=False
.
9. suffixes
link | tuple
of (string, string)
| optional
The suffix names to append to the duplicate column labels in the resulting DataFrame. By default, suffixes=("_x", "_y")
.
10. copy
| boolean
| optional
If
True
, then return a new copy of the DataFrame.If
False
, then avoid creating a new copy if possible.
By default, copy=True
.
11. indicator
link | boolean
or string
| optional
Whether or not to add append an extra column called _merge
, which tells us which DataFrame the row was constructed from. By default, indicator=False
.
12. validate
link | string
| optional
The validation logic to run:
Value | Description |
---|---|
| Checks whether the merge keys in the left and right are unique. |
| Checks whether the merge keys in the left are unique. |
| Checks whether the merge keys in the right are unique. |
| No check is performed. |
By default, validate=None
.
Return value
A merged DataFrame.
Examples
Performing an inner-join
Suppose a shopkeeper has the following data:
id | product | bought_by |
---|---|---|
A | computer | 1 |
B | smartphone | 3 |
C | headphones | NaN |
id | name | age |
---|---|---|
1 | alex | 10 |
2 | bob | 20 |
3 | cathy | 30 |
Here, the top table is about the products that the shop has, while the bottom table is about the profile of registered customers.
Suppose we wanted to see the profile of the customers who have purchased a product. To do this, we need to perform an inner join on the bought_by
column of the products
table and the index labels of the customers
table.
First off, let's create the two DataFrames. Here's the products
DataFrame:
"bought_by": [1, 3, pd.np.NaN]}, index=["A","B","C"])df_products
product bought_byA computer 1.0B smartphone 3.0C headphones NaN
Here's the customers
DataFrame:
"age": [10, 20, 30]}, index=[1,2,3])df_customers
name age1 alex 102 bob 203 cathy 30
Now, to perform the inner-join:
df_products.merge(df_customers, how="inner", left_on="bought_by", right_index=True)
product bought_by name ageA computer 1.0 alex 10B smartphone 3.0 cathy 30
Note the following:
left_on="bought_by"
indicates that we want to align using thebought_by
column of the left DataFrame.right_index=True
indicates that we want to align using the index of the right DataFrame. We can also merge based on columns instead by specifying theright_on
parameter.You can omit the
how
parameter here since the default value is"inner"
.Notice the ordering of the columns - the columns of the source DataFrame (i.e.
df_products
) come first, followed by those of the right DataFrame.
Performing a left-join
To demonstrate a left-join, we shall use the same example of products and customers as before:
[df_products] | [df_customers] product bought_by | name ageA computer 1.0 | 1 alex 10B smartphone 3.0 | 2 bob 20C headphones NaN | 3 cathy 30
To see all products as well as the profile of the customers who have purchased the products, we need to perform a left-join like so:
df_products.merge(df_customers, how="left", left_on="bought_by", right_index=True)
product bought_by name ageA computer 1.0 alex 10.0B smartphone 3.0 cathy 30.0C headphones NaN NaN NaN
left_on="bought_by"
indicates that we want to align using thebought_by
column of the left DataFrame.right_index=True
indicates that we want to align using the index of the right DataFrame.Notice the ordering of the columns - the columns of the source DataFrame (i.e.
df_products
) come first, followed by those of the right DataFrame.
Performing a right-join
To demonstrate a right-join, we shall use the same example of products and customers as before:
[df_products] | [df_customers] product bought_by | name ageA computer 1.0 | 1 alex 10B smartphone 3.0 | 2 bob 20C headphones NaN | 3 cathy 30
To see the profile of all customers as well as the products they purchased, we need to perform a right-join like so:
df_products.merge(df_customers, how="right", left_on="bought_by", right_index=True)
product bought_by name ageA computer 1.0 alex 10B smartphone 3.0 cathy 30NaN NaN 2.0 bob 20
Performing a full-join
To demonstrate a full-join, we shall use the same example of products and customers as before:
[df_products] | [df_customers] product bought_by | name ageA computer 1.0 | 1 alex 10B smartphone 3.0 | 2 bob 20C headphones NaN | 3 cathy 30
To see the profile of all customers as well as all products, we need to perform a full-join like so:
df_products.merge(df_customers, how="outer", left_on="bought_by", right_index=True)
product bought_by name ageA computer 1.0 alex 10.0B smartphone 3.0 cathy 30.0C headphones NaN NaN NaNNaN NaN 2.0 bob 20.0
Specifying the sort parameter
To demonstrate what the sort
parameter does, we shall use the same example of products and customers as before:
[df_products] | [df_customers] product bought_by | name ageA computer 3.0 | 1 alex 10B smartphone 1.0 | 2 bob 20C headphones NaN | 3 cathy 30
Just as a side note, the ordering of the values in the bought_by
column was swapped as this will allow us to see the behaviour of sort=True
.
Suppose we performed a left-join, like so:
df_products.merge(df_customers, how="left", left_on="bought_by", right_index=True, sort=True)
product bought_by name ageB smartphone 1.0 alex 10.0A computer 3.0 cathy 30.0C headphones NaN NaN NaN
Notice how the resulting DataFrame is sorted based on the join key - the bought_by
column. Without sort=True
, the column would not be sorted, that is, the original order will be kept.
Specifying the suffixes parameter
To demonstrate what the suffix
parameter does, we shall use the same example of products and customers as before:
[df_products] | [df_customers] name bought_by | name ageA computer 1.0 | 1 alex 10B smartphone 3.0 | 2 bob 20C headphones NaN | 3 cathy 30
Note that previously we used the column label product
for the product names, but for this example, we will be using the column label name
instead. We now have an overlap in the column names.
To see how Pandas deal with this by default, let's perform an inner-join:
df_products.merge(df_customers, how="inner", left_on="bought_by", right_index=True)
name_x bought_by name_y ageA computer 1.0 alex 10B smartphone 3.0 cathy 30
We see that Pandas has added the suffixes _x
and _y
to differentiate the columns.
We can override this behaviour by specifying the suffixes parameter, which simply takes in a tuple of suffixes:
df_products.merge(df_customers, how="inner", left_on="bought_by", right_index=True, suffixes=("_product", "_customer"))
name_product bought_by name_customer ageA computer 1.0 alex 10B smartphone 3.0 cathy 30
Observe how the column names now reflect our specified suffixes.
Specifying the indicator parameter
To demonstrate what the indicator
parameter does, we shall use the same example of products and customers as before:
[df_products] | [df_customers] product bought_by | name ageA computer 1.0 | 1 alex 10B smartphone 3.0 | 2 bob 20C headphones NaN | 3 cathy 30
Suppose we performed a left join with indicator=True
:
df_products.merge(df_customers, how="left", left_on="bought_by", right_index=True, indicator=True)
product bought_by name age _mergeA computer 1.0 alex 10.0 bothB smartphone 3.0 cathy 30.0 bothC headphones NaN NaN NaN left_only
Note the following:
Notice how we end up with an additional column at the end called
_merge
. This column tells us which DataFrame the row was constructed from.The first two rows have the values
"both"
, which means that a match was found in both the source DataFrame and theright
DataFrame. You can easily confirm that this isTrue
by looking back at our two DataFrames.The last row has the value
"left_only"
, which means that the row comes from the source DataFrame.
Specifying the validate parameter
We'll use the same example about products and customers again:
[df_products] | [df_customers] product bought_by | name ageA computer 1.0 | 1 alex 10B smartphone 3.0 | 2 bob 20C headphones 3.0 | 3 cathy 30
For this example, we changed the "bought_by"
value of headphones from NaN
to 3
.
Let's perform an inner-join on the customer id:
df_products.merge(df_customers, how="inner", right_index=True, left_on="bought_by")
product bought_by name ageA computer 3 cathy 30C headphones 3 cathy 30B smartphone 1 alex 10
Here, one customer (Cathy) has bought 2 products, so this is a 1:2
mapping, which is often written as 1:m
where m just represents a number greater than 1
.
Let's now call the exact same function, but with validate="1:1"
:
df_products.merge(df_customers, how="inner", right_index=True, left_on="bought_by", validate="1:1")
MergeError: Merge keys are not unique in left dataset; not a one-to-one merge
Since this is not a 1:1
mapping, an error is thrown. On a more general note, if the left DataFrame has duplicate values, then validation rules "1:1"
and "1:m"
would throw an error.