Pandas.merge() function

The merge() function in Pandas is a powerful tool for combining two or more dataframes based on one or more keys. It is analogous to the JOIN operation in SQL databases and offers various options to customize the merge behavior.

Here’s the basic syntax of the merge() function:

pandas.merge(left, right, how=’inner’, on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=(‘_x’, ‘_y’), copy=True, indicator=False, validate=None)
Let’s go through some of the important parameters:

left and right: The dataframes to be merged.
how: Specifies the type of join to perform. Options include ‘inner’, ‘left’, ‘right’, and ‘outer’. Default is ‘inner’.
on: Column or index level names to join on. Must be found in both dataframes.
left_on and right_on: Columns or index levels from the left and right dataframes to use as keys for the join.
left_index and right_index: If True, use the index from the left or right dataframe as the join key(s).
suffixes: Suffixes to apply to overlapping column names in the resulting dataframe. Defaults to ‘_x’ and ‘_y’. The suffixes specified by lsuffix and rsuffix are appended to the overlapping column names from the left and right dataframes, respectively.
sort: Sort the resulting dataframe by the join keys. Default is False.
indicator: Adds a column to the output dataframe indicating the source of each row (e.g., ‘both’, ‘left_only’, ‘right_only’). Default is False.
validate: Checks if merge operation is of a certain type. Options include ‘one_to_one’, ‘one_to_many’, ‘many_to_one’, and ‘many_to_many’. Default is None.
Inner Join

Inner Join

​pd.merge(left = capitals, right = currency, how = ‘inner’)
Country ISO Capital Currency Digraph
France FR Paris Euro FR
India IN New Delhi Indian Rupee IN
Nepal NP Katmandu Nepalese Rupee NP
Russia RU Moscow Rouble RU
Spain ES Madrid Euro ES
Country ISO Capital Currency Digraph 0 France FR Paris Euro FR 1 India IN New_Delhi Indian_Rupee IN 2 Nepal NP Katmandu Nepalese_Rupee NP 3 Russia RU Moscow Rouble RU 4 Spain ES Madrid Euro ES

See how simple it can be. The pandas the function automatically identified the common column Country and joined based on that. We did not explicitly say which columns to join on. But if you want to, it can be mentioned explicitly.

This was the case when the columns having the same content was also having the same heading. But you may notice, that’s not always the case.

What if the names of the columns are different in the left and right columns?
Also, What if there is more than one common column, and you want to specify which one you want the join operation to be performed “ON”?
merge() the function gives you answer to both the questions above.

Example
Let’s start with specifying the “Country” column specifically in the above code.

pd.merge(left = capitals, right = currency, how = ‘inner’, on = ‘Country’ )
Country ISO Capital Currency Digraph
France FR Paris Euro FR
India IN New Delhi Indian Rupee IN
Nepal NP Katmandu Nepalese Rupee NP
Russia RU Moscow Rouble RU
Spain ES Madrid Euro ES
The results of the above code are the same as the previous one, and that was expected. Isn’t it?

Now notice that there is another column, in both the left and right tables, which has the same content. We can join on that table as well, but in such case, which table name shall be mentioned in on=?

In such cases, we use left_on and right_on keywords in parameter.

pd.merge(left = capitals,
right = currency,
how= ‘inner’,
left_on=’ISO’,
right_on=’Digraph’,
suffixes=(‘_x’, ‘_y’))
Country_x ISO Capital Country_y Currency Digraph
France FR Paris France Euro FR
India IN New Delhi India Indian Rupee IN
Nepal NP Katmandu Nepal Nepalese Rupee NP
Russia RU Moscow Russia Rouble RU
Spain ES Madrid Spain Euro ES
Outer Join or Full Join
Join DataFrames SQL Python outer join

In pandas, an outer join is a method for combining two DataFrames based on a common column or index, including all rows from both DataFrames. Pandas provides the merge() function to perform joins, and the how parameter determines the type of join to execute. The syntax for an outer join in pandas is

result = pd.merge(left, right,
how=’outer’,
on=None,
left_on=None,
right_on=None,
left_index=False,
right_index=False,
sort=False,
suffixes=(‘_x’, ‘_y’),
copy=True,
indicator=False,
validate=None)
Here’s an explanation of each parameter:

left: The DataFrame on the left side of the join.
right: The DataFrame on the right side of the join.
how: Specifies the type of join. For an outer join, set how to ‘outer’. Other options include ‘inner’ (default), ‘left’, and ‘right’.
on: Column name or list of column names to join on. If None, and left_index and right_index are False, the join will be based on the intersection of the columns in both DataFrames.
left_on: Column name or list of column names from the left DataFrame to join on.
right_on: Column name or list of column names from the right DataFrame to join on.
left_index: If True, use the index from the left DataFrame as the join key(s).
right_index: If True, use the index from the right DataFrame as the join key(s).
sort: Sort the result DataFrame by the join keys. Defaults to False.
suffixes: Tuple of string suffixes to apply to overlapping column names from the left and right DataFrames, respectively.
copy: If True, always copy data.
indicator: If True, adds a special column _merge to the result DataFrame indicating the source of each row.
validate: Checks if merge is a valid operation. Values can be ‘one_to_one’, ‘one_to_many’, ‘many_to_one’, or ‘many_to_many’
Example

Outer Join


pd.merge(left = capitals, right = currency, how = ‘outer’)
Country ISO Capital Currency Digraph
0 Afghanistan AF Kabul NaN NaN
1 Argentina AR Buenos_Aires NaN NaN
2 Australia AU Canberra NaN NaN
3 Canada CA Ottawa NaN NaN
4 China CN Beijing NaN NaN
5 France FR Paris Euro FR
6 India IN New_Delhi Indian_Rupee IN
7 Nepal NP Katmandu Nepalese_Rupee NP
8 Russia RU Moscow Rouble RU
9 Spain ES Madrid Euro ES
10 Sri_Lanka NaN NaN Rupee LK
11 United_Kingdom NaN NaN Pound GB
12 USA NaN NaN US_Dollar US
13 Uzbekistan NaN NaN Sum_Coupons UZ
14 Zimbabwe NaN NaN Zimbabwe_Dollar ZW
Notice that there is a total of 15 rows in the output above, whereas both the tables have 10 rows each. What happened here is, 5 rows are common(which came as an output of inner join) that appeared once, and the rest 5 rows of each table also got included in the final table. The values, which were not there in the tables, are filled with NaN, which is python’s way of writing Null values.

Here also, we can do the join on different column names from the left and right table, and remove the duplicates, as we saw for the inner join.

filtering the column by using regular expressions


pd.merge(left = capitals,
right = currency,
how= ‘outer’,
left_on=’ISO’,
right_on=’Digraph’,
suffixes=(”, ‘_drop’)).filter(regex=’^(?!.*_drop)’)
Country ISO Capital Currency Digraph
0 Afghanistan AF Kabul NaN NaN
1 Argentina AR Buenos_Aires NaN NaN
2 Australia AU Canberra NaN NaN
3 Canada CA Ottawa NaN NaN
4 China CN Beijing NaN NaN
5 France FR Paris Euro FR
6 India IN New_Delhi Indian_Rupee IN
7 Nepal NP Katmandu Nepalese_Rupee NP
8 Russia RU Moscow Rouble RU
9 Spain ES Madrid Euro ES
10 NaN NaN NaN Rupee LK
11 NaN NaN NaN Pound GB
12 NaN NaN NaN US_Dollar US
13 NaN NaN NaN Sum_Coupons UZ
14 NaN NaN NaN Zimbabwe_Dollar ZW
But here the “Country” column has Nan values for the rows not there in the left table. Hence, for outer join, it’s better to change the column name of the common value columns in such a way that both tables have the same column names and then do the outer join as seen above.

Left Join
The left join, as demonstrated in Pandas, is a method of merging two dataframes where all rows from the left dataframe are retained, and matching rows from the right dataframe are included. Any rows from the left dataframe that do not have corresponding matches in the right dataframe will have NaN values in the columns from the right dataframe.

Here’s an example of performing a left join in Pandas:

Left Join

left_join_result = pd.merge(left=left_dataframe, right=right_dataframe, how=’left’)
In this code snippet:

left_dataframe represents the left dataframe to be merged.
right_dataframe represents the right dataframe to be merged.
how=’left’ specifies that a left join should be performed.
The resulting left_join_result dataframe will contain all rows from the left_dataframe, and matching rows from the right_dataframe. If there are no matches for a row from the left_dataframe, the corresponding columns from the right_dataframe will contain NaN values.

Left joins are particularly useful when you want to retain all records from the primary dataframe while optionally incorporating additional information from a secondary dataframe based on matching keys.

Example

Left Join


pd.merge(left = capitals, right = currency, how = ‘left’)
Country ISO Capital Currency Digraph
0 Afghanistan AF Kabul NaN NaN
1 Argentina AR Buenos_Aires NaN NaN
2 Australia AU Canberra NaN NaN
3 Canada CA Ottawa NaN NaN
4 China CN Beijing NaN NaN
5 France FR Paris Euro FR
6 India IN New_Delhi Indian_Rupee IN
7 Nepal NP Katmandu Nepalese_Rupee NP
8 Russia RU Moscow Rouble RU
9 Spain ES Madrid Euro ES
Notice that there is a total of 10 rows in the output above, whereas both the tables have 10 rows each. What happened here is, only the 10 rows of the LEFT table got included in the final table. The values, which were not there in the LEFT table(Currency and Digraph), are filled with NaN, which is python’s way of writing Null values.

Note: If you have ever used Microsoft Excel or any other similar spreadsheet program, you would have done this kind of joining of data in the excel workbook. The Left Join is similar to “VLOOKUP”.

Right Join
The Right Join operation in pandas allows us to merge two dataframes, ensuring that all rows from the right dataframe are included in the final result, with matching rows from the left dataframe appended where available. Here’s how you can perform a Right Join using the merge() function in pandas:

Performing Right Join

import pandas as pd

right_join_result = pd.merge(left=df_left, right=df_right, how=’right’)

print(right_join_result)
In this code:

df_left represents the left dataframe.
df_right represents the right dataframe.
We use the pd.merge() function to perform the Right Join.
The how parameter is set to ‘right’, indicating a Right Join operation.
The result is stored in the variable right_join_result.
Finally, we print or further manipulate right_join_result to analyze the joined data.
This operation ensures that all rows from the right dataframe are included, with corresponding rows from the left dataframe added where matches are found. Any non-matching rows from the left dataframe will contain NaN values in the merged dataframe.

Example

Right Join


pd.merge(left = capitals, right = currency, how = ‘right’)
Country ISO Capital Currency Digraph
0 France FR Paris Euro FR
1 India IN New_Delhi Indian_Rupee IN
2 Nepal NP Katmandu Nepalese_Rupee NP
3 Russia RU Moscow Rouble RU
4 Spain ES Madrid Euro ES
5 Sri_Lanka NaN NaN Rupee LK
6 United_Kingdom NaN NaN Pound GB
7 USA NaN NaN US_Dollar US
8 Uzbekistan NaN NaN Sum_Coupons UZ
9 Zimbabwe NaN NaN Zimbabwe_Dollar ZW
Notice that there is a total of 10 rows in the output above, whereas both the tables have 10 rows each. What happened here is, only the 10 rows of the RIGHT table got included in the final table. The values, which were not there in the RIGHT table(ISO and Capital), are filled with NaN, which is python’s way of writing Null values.

Removing the Duplicate Columns after JOIN
One apparent issue which crept into the result is duplication of the “Country” column, and you may notice that the column names have now suffix that is provided as default. There is no way in the function parameters to avoid this duplication, but a clean and smart workaround may be used in the same line of code. We can smartly use the suffixes= for this purpose.

I am going to add “_drop” suffix to the duplicate column.

Example

filtering the column by using regular expressions


pd.merge(left = capitals,
right = currency,
how= ‘inner’,
left_on=’ISO’,
right_on=’Digraph’,
suffixes=(”, ‘_drop’)).filter(regex=’^(?!.*_drop)’)
Output:
Country ISO Capital Currency Digraph
France FR Paris Euro FR
India IN New Delhi Indian Rupee IN
Nepal NP Kathmandu Nepalese Rupee NP
Russia RU Moscow Rouble RU
Spain ES Madrid Euro ES
Conclusion
In conclusion, the ability to join dataframes efficiently is a fundamental aspect of data science and manipulation, especially when working with structured data in Python. Pandas provides powerful tools for merging and joining dataframes, mirroring the functionality of SQL joins.

Leave a Comment