The Pandas Library for Python

Introduction to Pandas

So, what is Pandas – practically speaking? In short, it’s the major data analysis library for Python. For scientists, students, and professional developers alike, Pandas represents a central reason for any learning or interaction with Python, as opposed to a statistics-specific language like R, or a proprietary academic package like SPSS or Matlab. (Fun fact – Pandas is named after the term Panel Data, and was originally created for the analysis of financial data tables). I like to think that the final “s” stands for Series or Statistics.

Although there are plenty of ways to explore numerical data with Python out-of-the box, these will universally involve some fairly low-performance results, with a ton of boilerplate. It may sound hard to believe, but Pandas is often recommended as the next stop for Excel users who are ready to take their data analysis to the next level. Nearly any problem that can be solved with a spreadsheet program can be solved in Pandas – without all the graphical cruft.

More importantly, because problems can be solved in Pandas via Python, solutions are already automated, or could be run as a service in the cloud. Further, Pandas makes heavy use of Numpy, relying on its low level calls to produce linear math results orders of magnitude more quickly than they would be handled by Python alone. These are just a few of the reasons Pandas is recommended as one of the first libraries to learn for all Pythonistas, and remains absolutely critical to Data Scientists.

About the Data

In this post, we’re going to be using a fascinating data set to demonstrate a useful slice of the Pandas library. This data set is particularly interesting as it’s part of a real world example, and we can all imagine people lined up at an airport (a place where things do occasionally go wrong). When looking at the data, I imagine people people sitting in those uncomfortable airport seats having just found out that their luggage is missing – not just temporarily, but it’s nowhere to be found in the system! Or, better yet, imagine that a hardworking TSA employee accidentally broke a precious family heirloom.

So it’s time to fill out another form, of course. Now, getting data from forms is an interesting process as far as data gathering is concerned, as we have a set of data that happens at specific times. This actually means we can interpret the entries as a Time Series. Also, because people are submitting the information, we can learn things about a group of people, too.

Back to our example: let’s say we work for the TSA and we’ve been tasked with getting some insights about when these accidents are most likely to happen, and make some recommendations for improving the service.

Pandas, luckily, is a one-stop shop for exploring and analyzing this data set. Feel free to download the excel file into your project folder to get started, or run the curl command below. Yes, pandas can read .xls or .xlsx files with a single call to pd.read_excel()! In fact, it’s often helpful for beginners experienced with .csv or excel files to think about how they would solve a problem in excel, and then experience how much easier it can be in Pandas.

So, without further ado, open your terminal, a text editor, or your favorite IDE, and take a look for yourself with the guidance below.

Example data:

Take for example, some claims made against the TSA during a screening process of persons or a passenger’s property due to an injury, loss, or damage. The claims data information includes claim number, incident date, claim type, claim amount, status, and disposition.

Directory: TSA Claims Data
Our Data Download: claims-2014.xls

Setup

To start off, let’s create a clean directory. You can put this wherever you’d like, or create a project folder in an IDE. Use your install method of choice to get Pandas: Pip is probably the easiest.

$ mkdir -p ~/Desktop/pandas-tutorial/data && cd ~/Desktop/pandas-tutorial

Install pandas along with xldr for loading Excel formatted files, matplotlib for plotting graphs, and Numpy for high-level mathematical functions.

$ pip3 install matplotlib numpy pandas xldr

Optional: download the example data with curl:

$ curl -O https://www.dhs.gov/sites/default/files/publications/claims-2014.xls
Back to Top