Guide: Pandas DataFrames for Data Analysis
“Data scientist” is one of the hottest jobs in tech, and Python is the lingua franca of data science. Python’s easy-to-learn syntax, open ecosystem, and strong community have made it one of the fastest-growing languages in recent years.
In this post, we’ll learn about Pandas, a high-performance open-source package for doing data analysis in Python.
We’ll cover:
- What Pandas is and why you should use it.
- What a Pandas DataFrame is.
- Creating and viewing a DataFrame.
- Manipulating data in a DataFrame.
Let’s get started.
What Is Pandas and Why Should I Use It?
Pandas is an open-source library for performing data analysis with Python. It was created by Wes McKinney when he was working for AQR Capital, an investment firm. Wes and AQR Capital open-sourced the project, and its popularity has exploded in the Python community.
A big portion of a data scientist’s time is spent cleaning data, and this is where Pandas really shines. Pandas helps you quickly and efficiently operate on large tables of data.
As an example, imagine you have a large, two-dimensional data set, comparable to an Excel spreadsheet. Your data set has many columns and rows.
You would use Pandas for:
- Setting default values for rows with missing values.
- Merging (or “joining,” in SQL parlance) two separate data sets.
- Filtering your data set based on the values in a particular column.
- Viewing summary statistics, such as mean, standard deviation and percentiles.
These operations can save you a lot of time and let you get to the important work of finding value in your data.
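As a rough sketch of how the four operations listed above might look in code, here is a small example. The two tables, their column names, and all of the values are made up for illustration.

```python
import pandas as pd

# Hypothetical daily stock summaries; the values are made up for illustration.
prices = pd.DataFrame(
    {
        "symbol": ["AAA", "BBB", "CCC", "DDD"],
        "close": [10.0, 20.0, None, 40.0],
        "volume": [100, 200, 300, 400],
    }
)

# A second hypothetical data set that shares the "symbol" column.
sectors = pd.DataFrame(
    {
        "symbol": ["AAA", "BBB", "CCC", "DDD"],
        "sector": ["tech", "energy", "tech", "retail"],
    }
)

# Set a default value for rows with a missing closing price.
filled = prices.fillna({"close": 0.0})

# Merge (join) the two data sets on their shared "symbol" column.
merged = filled.merge(sectors, on="symbol", how="left")

# Filter the data set based on the values in a particular column.
tech_only = merged[merged["sector"] == "tech"]

# View summary statistics (count, mean, standard deviation, percentiles).
summary = merged["close"].describe()

print(tech_only)
print(summary)
```

Whether 0.0 is a sensible default for a missing price depends on your data. It is used here only to keep the example short.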
Now that we know what Pandas is and why we would use it, let’s learn about the key data structure of Pandas.
What Is a Pandas DataFrame?
The core data structure in Pandas is a DataFrame. A DataFrame is a two-dimensional data structure made up of columns and rows.
If you have a background in the statistical programming language R, the DataFrame will look familiar: it is modeled after R’s data.frame object.
The Pandas DataFrame structure gives you the speed of low-level languages combined with the ease and expressiveness of high-level languages.
Each row in a DataFrame makes up an individual record—think of a user for a SaaS application or the summary of a single day of stock transactions for a particular stock symbol.
Each column in a DataFrame represents an observed value for each row in the DataFrame. DataFrames can have multiple columns, each of which has a defined type.
For example, if you have a DataFrame that contains daily transaction summaries for a stock symbol, you might have one column of type float that indicates the closing price and another column of type int that indicates the total volume traded that day.
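To make the stock example concrete, here is a minimal sketch of such a DataFrame. The dates, prices, and volumes are invented, and the column names are just one possible choice.

```python
import pandas as pd

# Hypothetical daily summaries for a single stock symbol.
daily = pd.DataFrame(
    {
        "date": ["2020-01-02", "2020-01-03", "2020-01-06"],
        "close": [101.5, 99.75, 102.0],  # closing price, stored as floats
        "volume": [1200, 1500, 900],     # shares traded, stored as integers
    }
)

# Each column has its own type.
print(daily.dtypes)

# Each row is one record, here a single trading day.
print(daily.iloc[0])
```

Pandas infers the column types from the values you pass in, so "close" ends up as a float column and "volume" as an integer column without any extra configuration.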
DataFrames are built on top of NumPy, a blazing-fast library that uses C/C++ and Fortran for fast, efficient computation of data.
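One way to see the NumPy connection is to pull the values out of a numeric column. The sketch below uses a made-up series of closing prices. `to_numpy()` returns the data as a NumPy array, and arithmetic on the column is applied to every element at once rather than in a Python loop.

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices.
closes = pd.Series([101.5, 99.75, 102.0], name="close")

# The column's data can be viewed as a NumPy array.
values = closes.to_numpy()
print(type(values), values.dtype)

# Arithmetic is vectorized: the whole column is doubled in one operation.
doubled = closes * 2
print(doubled.tolist())
```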
Now that we understand the basics behind a DataFrame, let’s play around with creating and viewing a DataFrame.
Creating and Viewing a Pandas DataFrame
In this section, we’re going to create and view a Pandas DataFrame. We’ll use some summary stock data to learn the basic Pandas operations.
Installing Pandas can be tricky due to its dependencies on numerical computing libraries like NumPy, which include tools for integrating with Fortran and other low-level languages.
If you’re not a Python expert, the easiest way to get started with Pandas is to install the Anaconda distribution of Python. Check the Pandas installation docs to see all of your options.
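Whichever installation route you choose, a quick sanity check is to import both libraries and print their versions. This is only a minimal sketch; the version numbers you see will depend on your environment.

```python
import numpy as np
import pandas as pd

# If both imports succeed, Pandas and its NumPy dependency are installed.
print("pandas:", pd.__version__)
print("NumPy:", np.__version__)
```

If either import raises an ImportError, the package is not available in the Python environment you are running.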