Pandas Merge, Join, and Concat: How To and Examples
Python’s Pandas library is a popular library for cleaning, manipulating, and interpreting large amounts of data. In an earlier post, we looked at the foundational structure in Pandas—the DataFrame. If you’re unfamiliar with Pandas DataFrames, take a look at that post to understand the basics.
In this post, we’ll learn how to combine multiple DataFrames using Pandas merge
, join
, and concat
. These methods let you supercharge your data by gluing together data from different sources.
We’ll cover:
- When to use
concat
and when to usemerge
orjoin
; - How to use Pandas’
concat
method; - How to use Pandas’
merge
andjoin
.
Let’s get started.
When to use the Pandas concat
vs. merge
and join
While merge
, join
, and concat
all work to combine multiple DataFrames, they are used for very different things. In this section, we’ll learn when you will want to use one operation over another. The key distinction is whether you want to combine your DataFrames horizontally or vertically.
The concat
method allows you to combine DataFrames vertically.
Imagine you had two DataFrames with the same columns. Perhaps the first DataFrame includes 10 rows of stock trading data for one stock while the second DataFrame includes 10 rows of stock trading data for a different stock. A vertical combination would use a DataFrame’s concat
method to combine the two DataFrames into a single DataFrame with twenty rows.
Notice that in a vertical combination with concat
, the number of rows has increased but the number of columns has stayed the same.
By contrast, the merge
and join
methods help to combine DataFrames horizontally.
Imagine you have two DataFrames. The first contains stock trading information various companies. The second contains information about the headquarters and numbers of employees for a particular company. If the two DataFrames have one field in common—such as a stock symbol or company name—you can combine the two DataFrames so that each row contains both the stock trading data and the company background information.
Notice that in this horizontal combination, we aren’t adding any additional rows. Rather, we’re adding columns to existing rows. The horizontal combination from a merge
operation is similar to a JOIN
operator in SQL.
Now that we understand the difference between vertical combinations with concat
and horizontal combinations with merge
or join
, let’s take a deeper look at how to use these methods.
How to use Pandas’ concat
method
In our previous post on Pandas DataFrames, we used sample stock data to show create, explore, and manipulate DataFrames. We used sample stock trading data from Amazon (“AMZN”), Apple (“AAPL”) and Google (“GOOG”).
Imagine we had a second data set with trading information on two additional companies, Facebook (“FB”) and Tesla (“TSLA”). These DataFrames have the same shape, so it would be useful to combine them to operate on them together.
We can do this with the concat
method as follows:
>>> import pandas as pd
>>> stocks = pd.read_csv('https://gist.githubusercontent.com/alexdebrie/b3f40efc3dd7664df5a20f5eee85e854/raw/ee3e6feccba2464cbbc2e185fb17961c53d2a7f5/stocks.csv')
>>> stocks.info()
< class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 7 columns):
date 15 non-null object
symbol 15 non-null object
open 15 non-null float64
high 15 non-null float64
low 15 non-null float64
close 15 non-null float64
volume 15 non-null int64
dtypes: float64(4), int64(1), object(2)
memory usage: 920.0+ bytes
>>> stocks2 = pd.read_csv('https://gist.githubusercontent.com/alexdebrie/53ebac036b404875ef8e981c0cbd0901/raw/6c70336130eb7e45cec167ee7cd52d15baa392ea/stocks2.csv')
>>> combined = pd.concat([stocks, stock2], ignore_index=True)
>>> combined
date symbol open high low close volume
0 2019-03-01 AMZN 1655.13 1674.2600 1651.00 1671.73 4974877
1 2019-03-04 AMZN 1685.00 1709.4300 1674.36 1696.17 6167358
2 2019-03-05 AMZN 1702.95 1707.8000 1689.01 1692.43 3681522
3 2019-03-06 AMZN 1695.97 1697.7500 1668.28 1668.95 3996001
4 2019-03-07 AMZN 1667.37 1669.7500 1620.51 1625.95 4957017
5 2019-03-01 AAPL 174.28 175.1500 172.89 174.97 25886167
6 2019-03-04 AAPL 175.69 177.7500 173.97 175.85 27436203
7 2019-03-05 AAPL 175.94 176.0000 174.54 175.53 19737419
8 2019-03-06 AAPL 174.67 175.4900 173.94 174.52 20810384
9 2019-03-07 AAPL 173.87 174.4400 172.02 172.50 24796374
10 2019-03-01 GOOG 1124.90 1142.9700 1124.75 1140.99 1450316
11 2019-03-04 GOOG 1146.99 1158.2800 1130.69 1147.80 1446047
12 2019-03-05 GOOG 1150.06 1169.6100 1146.19 1162.03 1443174
13 2019-03-06 GOOG 1162.49 1167.5700 1155.49 1157.86 1099289
14 2019-03-07 GOOG 1155.72 1156.7600 1134.91 1143.30 1166559
15 2019-03-01 FB 162.60 163.1320 161.69 162.28 11097770
16 2019-03-04 FB 163.90 167.5000 163.83 167.37 18894689
17 2019-03-05 FB 167.37 171.8800 166.55 171.26 28187890
18 2019-03-06 FB 172.90 173.5700 171.27 172.51 21531723
19 2019-03-07 FB 171.50 171.7400 167.61 169.13 18306504
20 2019-03-01 TSLA 306.94 307.1300 291.90 294.79 22911375
21 2019-03-04 TSLA 298.12 299.0000 282.78 285.36 17096818
22 2019-03-05 TSLA 282.00 284.0000 270.10 276.54 18764740
23 2019-03-06 TSLA 276.48 281.5058 274.39 276.24 10335485
24 2019-03-07 TSLA 278.84 284.7000 274.25 276.59 9442483
In the example above, we create our first DataFrame and use the DataFrame’s info method to see that there are 15 rows and seven columns in our DataFrame. Then, we create our second DataFrame and combine it with our first DataFrame using the concat
method. After this, we can see our DataFrame includes information on all five companies in across our two DataFrames.
If we use the info method to look at our new DataFrame, we can see that there are still seven columns but now there are 25 rows—the combination of the two DataFrames.
>>> combined.info()
< class 'pandas.core.frame.DataFrame'>
Int64Index: 25 entries, 0 to 9
Data columns (total 7 columns):
date 25 non-null object
symbol 25 non-null object
open 25 non-null float64
high 25 non-null float64
low 25 non-null float64
close 25 non-null float64
volume 25 non-null int64
dtypes: float64(4), int64(1), object(2)
memory usage: 1.6+ KB
The concat
method is a great way to combine multiple DataFrames that contain similar data shapes. Note that the columns don’t need to match entirely—Pandas will simply include a null value for columns without values.
Using Pandas’ merge
and join
to combine DataFrames
The merge
and join
methods are a pair of methods to horizontally combine DataFrames with Pandas. This is a great way to enrich with DataFrame with the data from another DataFrame.
Both merge
and join
are operating in similar ways, but the join
method is a convenience method to make it easier to combine DataFrames. In fact, join
is using merge
under the hood. I prefer to use join
where possible as it’s slightly easier syntax.
When using either merge
or join
, you’ll need to specify how the DataFrames should be merged or joined. There are four possible values for how to join two DataFrames