Pandas Merge, Join, and Concat: How To and Examples

Python’s Pandas library is a popular library for cleaning, manipulating, and interpreting large amounts of data. In an earlier post, we looked at the foundational structure in Pandas—the DataFrame. If you’re unfamiliar with Pandas DataFrames, take a look at that post to understand the basics.

In this post, we’ll learn how to combine multiple DataFrames using Pandas mergejoin, and concat. These methods let you supercharge your data by gluing together data from different sources.

We’ll cover:

  • When to use concat and when to use merge or join;
  • How to use Pandas’ concat method;
  • How to use Pandas’ merge and join.

Let’s get started.

When to use the Pandas concat vs. merge and join

While mergejoin, and concat all work to combine multiple DataFrames, they are used for very different things. In this section, we’ll learn when you will want to use one operation over another. The key distinction is whether you want to combine your DataFrames horizontally or vertically.

The concat method allows you to combine DataFrames vertically.

Imagine you had two DataFrames with the same columns. Perhaps the first DataFrame includes 10 rows of stock trading data for one stock while the second DataFrame includes 10 rows of stock trading data for a different stock. A vertical combination would use a DataFrame’s concat method to combine the two DataFrames into a single DataFrame with twenty rows.

Notice that in a vertical combination with concat, the number of rows has increased but the number of columns has stayed the same.

By contrast, the merge and join methods help to combine DataFrames horizontally.

Imagine you have two DataFrames. The first contains stock trading information various companies. The second contains information about the headquarters and numbers of employees for a particular company. If the two DataFrames have one field in common—such as a stock symbol or company name—you can combine the two DataFrames so that each row contains both the stock trading data and the company background information.

Notice that in this horizontal combination, we aren’t adding any additional rows. Rather, we’re adding columns to existing rows. The horizontal combination from a merge operation is similar to a JOIN operator in SQL.

Now that we understand the difference between vertical combinations with concat and horizontal combinations with merge or join, let’s take a deeper look at how to use these methods.

How to use Pandas’ concat method

In our previous post on Pandas DataFrames, we used sample stock data to show create, explore, and manipulate DataFrames. We used sample stock trading data from Amazon (“AMZN”), Apple (“AAPL”) and Google (“GOOG”).

Imagine we had a second data set with trading information on two additional companies, Facebook (“FB”) and Tesla (“TSLA”). These DataFrames have the same shape, so it would be useful to combine them to operate on them together.

We can do this with the concat method as follows:

>>> import pandas as pd
>>> stocks = pd.read_csv('https://gist.githubusercontent.com/alexdebrie/b3f40efc3dd7664df5a20f5eee85e854/raw/ee3e6feccba2464cbbc2e185fb17961c53d2a7f5/stocks.csv')
>>> stocks.info()
< class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 7 columns):
date      15 non-null object
symbol    15 non-null object
open      15 non-null float64
high      15 non-null float64
low       15 non-null float64
close     15 non-null float64
volume    15 non-null int64
dtypes: float64(4), int64(1), object(2)
memory usage: 920.0+ bytes
>>> stocks2 = pd.read_csv('https://gist.githubusercontent.com/alexdebrie/53ebac036b404875ef8e981c0cbd0901/raw/6c70336130eb7e45cec167ee7cd52d15baa392ea/stocks2.csv')
>>> combined = pd.concat([stocks, stock2], ignore_index=True)
>>> combined
          date symbol     open       high      low    close    volume
0   2019-03-01   AMZN  1655.13  1674.2600  1651.00  1671.73   4974877
1   2019-03-04   AMZN  1685.00  1709.4300  1674.36  1696.17   6167358
2   2019-03-05   AMZN  1702.95  1707.8000  1689.01  1692.43   3681522
3   2019-03-06   AMZN  1695.97  1697.7500  1668.28  1668.95   3996001
4   2019-03-07   AMZN  1667.37  1669.7500  1620.51  1625.95   4957017
5   2019-03-01   AAPL   174.28   175.1500   172.89   174.97  25886167
6   2019-03-04   AAPL   175.69   177.7500   173.97   175.85  27436203
7   2019-03-05   AAPL   175.94   176.0000   174.54   175.53  19737419
8   2019-03-06   AAPL   174.67   175.4900   173.94   174.52  20810384
9   2019-03-07   AAPL   173.87   174.4400   172.02   172.50  24796374
10  2019-03-01   GOOG  1124.90  1142.9700  1124.75  1140.99   1450316
11  2019-03-04   GOOG  1146.99  1158.2800  1130.69  1147.80   1446047
12  2019-03-05   GOOG  1150.06  1169.6100  1146.19  1162.03   1443174
13  2019-03-06   GOOG  1162.49  1167.5700  1155.49  1157.86   1099289
14  2019-03-07   GOOG  1155.72  1156.7600  1134.91  1143.30   1166559
15  2019-03-01     FB   162.60   163.1320   161.69   162.28  11097770
16  2019-03-04     FB   163.90   167.5000   163.83   167.37  18894689
17  2019-03-05     FB   167.37   171.8800   166.55   171.26  28187890
18  2019-03-06     FB   172.90   173.5700   171.27   172.51  21531723
19  2019-03-07     FB   171.50   171.7400   167.61   169.13  18306504
20  2019-03-01   TSLA   306.94   307.1300   291.90   294.79  22911375
21  2019-03-04   TSLA   298.12   299.0000   282.78   285.36  17096818
22  2019-03-05   TSLA   282.00   284.0000   270.10   276.54  18764740
23  2019-03-06   TSLA   276.48   281.5058   274.39   276.24  10335485
24  2019-03-07   TSLA   278.84   284.7000   274.25   276.59   9442483

In the example above, we create our first DataFrame and use the DataFrame’s info method to see that there are 15 rows and seven columns in our DataFrame. Then, we create our second DataFrame and combine it with our first DataFrame using the concat method. After this, we can see our DataFrame includes information on all five companies in across our two DataFrames.

If we use the info method to look at our new DataFrame, we can see that there are still seven columns but now there are 25 rows—the combination of the two DataFrames.

>>> combined.info()
< class 'pandas.core.frame.DataFrame'>
Int64Index: 25 entries, 0 to 9
Data columns (total 7 columns):
date      25 non-null object
symbol    25 non-null object
open      25 non-null float64
high      25 non-null float64
low       25 non-null float64
close     25 non-null float64
volume    25 non-null int64
dtypes: float64(4), int64(1), object(2)
memory usage: 1.6+ KB

The concat method is a great way to combine multiple DataFrames that contain similar data shapes. Note that the columns don’t need to match entirely—Pandas will simply include a null value for columns without values.

Using Pandas’ merge and join to combine DataFrames

The merge and join methods are a pair of methods to horizontally combine DataFrames with Pandas. This is a great way to enrich with DataFrame with the data from another DataFrame.

Both merge and join are operating in similar ways, but the join method is a convenience method to make it easier to combine DataFrames. In fact, join is using merge under the hood. I prefer to use join where possible as it’s slightly easier syntax.

When using either merge or join, you’ll need to specify how the DataFrames should be merged or joined. There are four possible values for how to join two DataFrames

Back to Top