search
Search
Publish
menu
menu search toc more_vert
Robocat
Guest 0reps
Thanks for the thanks!
close
Comments
Log in or sign up
Cancel
Post
account_circle
Profile
exit_to_app
Sign out
help Ask a question
Share on Twitter
search
keyboard_voice
close
Searching Tips
Search for a recipe: "Creating a table in MySQL"
Search for an API documentation: "@append"
Search for code: "!dataframe"
Apply a tag filter: "#python"
Useful Shortcuts
/ to open search panel
Esc to close search panel
to navigate between search results
d to clear all current filters
Enter to expand content preview
icon_star
Doc Search
icon_star
Code Search Beta
SORRY NOTHING FOUND!
mic
Start speaking...
Voice search is only supported in Safari and Chrome.
Navigate to
A
A
share
thumb_up_alt
bookmark
arrow_backShare
Twitter
Facebook
thumb_up
1
thumb_down
0
chat_bubble_outline
0
auto_stories new
settings

Pandas DataFrame | join method

Programming
chevron_right
Python
chevron_right
Pandas
chevron_right
Documentation
chevron_right
DataFrame
chevron_right
Combining DataFrames
schedule Mar 10, 2022
Last updated
local_offer PythonPandas
Tags

Pandas DataFrame.join(~) merges the source DataFrame with another Series or DataFrames.

NOTE

The join(~) method is a wrapper around the merge(~) method, so if you want more control over the join process, consider using merge(~) instead.

Parameters

1. otherlink | Series or DataFrame or list of DataFrames

The other object to join with.

2. onlink | string or list | optional

The column or index level name of the source DataFrame to perform the join on. The index of the other will be used for the join. If you want to specify a non-index column to join on for other, use merge(~) instead, which has a right_on parameter.

3. howlink | string | optional

The type of join to perform:

Value

Description

"left"link

All rows from the source DataFrame will be present in the resulting DataFrame. This is the SQL equivalent of a left-join.

"right"link

All rows from the right DataFrame will be present in the resulting DataFrame. This is the SQL equivalent of a right-join.

"outer"link

All rows from the source and right DataFrame will be present in the resulting DataFrame. This is the SQL equivalent of an outer-join.

"inner"link

All rows that have matching values in both the source and right DataFrame will be present in the resulting DataFrame. This is the SQL equivalent to inner-join.

By default, how="left".

Here's the classic Venn Diagram illustrating the differences:

4. lsuffixlink | string | optional

The suffix to append to the overlapping label of the source DataFrame. This is only relevant if there are duplicate column labels in the result. By default, lsuffix="".

5. rsuffixlink | string | optional

The suffix to append to the overlapping label of other. This is only relevant if there are duplicate column labels in the result. By default, rsuffix="".

6. sortlink | boolean | optional

Whether or not to sort the rows based on the join key. By default, sort=False.

Return Value

A merged DataFrame.

Examples

Basic usage

Consider the following DataFrame about some products of a shop:

df_products = pd.DataFrame({"product": ["computer", "smartphone", "headphones"],
"bought_by": ["bob", "alex", "bob"]},
index=["A","B","C"])
df_products
product bought_by
A computer bob
B smartphone alex
C headphones bob

Here's a DataFrame about some customers of the shop:

df_customers = pd.DataFrame({"age": [10, 20, 30]},
index=["alex","bob","cathy"])
df_customers
age
alex 10
bob 20
cathy 30

To perform a left-join on the bought_by column of df_products:

df_products.join(df_customers, on="bought_by")   # how="left"
product bought_by age
A computer bob 20
B smartphone alex 10
C headphones bob 20

By default, the index of other will be used as the join key. If you want more flexibility as to which columns are used for the join, use the merge(~) method instead.

Comparison of different joins

Consider the following DataFrames about products and customers:

df_products = pd.DataFrame({"product": ["computer", "smartphone", "headphones"],
"bought_by": ["bob", "alex", "david"]},
index=["A","B","C"])
df_customers = pd.DataFrame({"age": [10, 20, 30]}, index=["alex","bob","cathy"])
[df_products] | [df_customers]
product bought_by | age
A computer bob | alex 10
B smartphone alex | bob 20
C headphones david | cathy 30

Left join

df_products.join(df_customers, on="bought_by", how="left")
product bought_by age
A computer bob 20.0
B smartphone alex 10.0
C headphones david NaN

Right join

df_products.join(df_customers, on="bought_by", how="right")
product bought_by age
B smartphone alex 10
A computer bob 20
NaN NaN cathy 30

Inner join

df_products.join(df_customers, on="bought_by", how="inner")
product bought_by age
A computer bob 20
C headphones bob 20
B smartphone alex 10

Outer join

df_products.join(df_customers, on="bought_by", how="outer")
product bought_by age
A computer bob 20
C headphones bob 20
B smartphone alex 10
NaN NaN cathy 30

Specifying lsuffix and rsuffix

Suppose we wanted to join the following DataFrames:

df_products = pd.DataFrame({"product": ["computer", "smartphone", "headphones"],
"age": [5,6,7],
"bought_by": ["bob", "alex", "bob"]},
index=["A","B","C"])
df_customers = pd.DataFrame({"age": [10, 20, 30]},
index=["alex","bob","cathy"])
product age bought_by | age
A computer 5 bob | alex 10
B smartphone 6 alex | bob 20
C headphones 7 bob | cathy 30

Notice how both DataFrames have the age column.

Due to this overlap in the naming, performing the join results in a ValueError:

df_products.join(df_customers, on="bought_by")
ValueError: columns overlap but no suffix specified: Index(['age'], dtype='object')

This error can be resolved by specifying lsuffix or rsuffix:

df_products.join(df_customers, on="bought_by", lsuffix="_product")
product age_product bought_by age
A computer 5 bob 10
B smartphone 6 alex 10
C headphones 7 bob 20

Note that lsuffix and rsuffix only take effect when there are duplicate column labels.

Specifying sort

Consider the same example as before:

        [df_products]         |   [df_customers]
   product      bought_by     |        age
A  computer     bob         |     alex  10
B  smartphone   alex        |     bob  20
C  headphones   bob         |     cathy  30

By default, sort=False, which means that the resulting DataFrame's rows are not sorted by the join key (bought_by):

df_products.join(df_customers, on="bought_by") # sort=False
product bought_by age
A computer bob 20
B smartphone alex 10
C headphones bob 20

Setting sort=True yields:

df_products.join(df_customers, on="bought_by", sort=True)
product bought_by age
B smartphone alex 10
A computer bob 20
C headphones bob 20
robocat
Published by Isshin Inada
Edited by 0 others
Did you find this page useful?
thumb_up
thumb_down
Ask a question or leave a feedback...