Note: This is Part 1 of a three-part article.
If you've ever worked with real-world datasets, you know how rare it is for them to be analysis-ready. Inevitably, there will be missing values, inexplicable outliers, unspecified time zones, unclear formatting. Some issues will be specific to the work you're doing; others will be fairly common across a wide variety of datasets.
In this guide, we'll focus on common problems and show you how to deal with them. The goal is to get a clean version of the dataset that's ready for further data exploration and/or predictive modeling. In Part 3, we'll show you two libraries that can expedite the exploration stage.
The dataset we will be using comes from LendingClub, a US-based fintech company that facilitates peer-to-peer lending. Wikipedia provides a brief description:
LendingClub enables borrowers to create unsecured personal loans between $1,000 and $40,000. The standard loan period is three years. Investors are able to search and browse the loan listings on the LendingClub website and select loans that they want to invest in based on the information supplied about the borrower, amount of loan, loan grade, and loan purpose. Institutional investors make money from the interest on these loans. LendingClub makes money by charging borrowers an origination fee and investors a service fee.
Each observation in this dataset represents one of the millions of loans made through the LendingClub platform. For detailed descriptions of each column in the dataset, please refer to this Excel file.
All the code snippets to follow can also be found in this Jupyter notebook.
The pandas library will help us do all the data manipulation operations we need to clean our data. Additionally, we'll use:
- urllib and zipfile to download and unzip LendingClub’s data
- pathlib for convenient operations on file system paths
- tqdm for displaying progress bars for long-running tasks
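As a quick reference, the imports corresponding to the dependencies above are:

```python
import urllib.request     # downloading the zip archives
import zipfile            # extracting them
from pathlib import Path  # convenient operations on file system paths

import pandas as pd
from tqdm import tqdm     # progress bars for long-running loops
```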
Thankfully, LendingClub's data is publicly available at URLs that follow this pattern:
where <YEAR_QUARTER> is replaced with the year and quarter corresponding to each batch of data: 2017Q1, 2017Q2, …, 2020Q2, etc.
We simply need to download the compressed archives (one per quarter), unzip them, and load the resulting CSV files into pandas DataFrames.
First, we create a list of years and quarters we are interested in:
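A sketch of this step; the exact range of quarters (2017Q1 through 2020Q2) is an assumption based on the examples above, so adjust it to the batches you need:

```python
# Build the list of <YEAR_QUARTER> strings: "2017Q1" through "2020Q2".
year_quarters = [
    f"{year}{quarter}"
    for year in range(2017, 2021)
    for quarter in ["Q1", "Q2", "Q3", "Q4"]
][:-2]  # drop 2020Q3 and 2020Q4, which are past the last available batch
```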
Then we create a corresponding list of URLs:
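For example, with a placeholder pattern (BASE_URL below is hypothetical; substitute LendingClub's actual URL pattern shown above):

```python
# Hypothetical pattern standing in for the real one -- replace BASE_URL
# with the actual LendingClub URL containing <YEAR_QUARTER>.
BASE_URL = "https://example.com/data_<YEAR_QUARTER>.csv.zip"

year_quarters = [
    f"{year}{quarter}"
    for year in range(2017, 2021)
    for quarter in ["Q1", "Q2", "Q3", "Q4"]
][:-2]  # up to 2020Q2

# One URL per quarter, produced by substituting into the pattern.
urls = [BASE_URL.replace("<YEAR_QUARTER>", yq) for yq in year_quarters]
```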
We'll store our data in its own directory called data. Let's create it:
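With pathlib, this is a one-liner:

```python
from pathlib import Path

data_dir = Path("data")
data_dir.mkdir(exist_ok=True)  # no error if the directory already exists
```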
Next, we'll download each zip file from a URL list and unzip it into the data directory:
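A sketch of the download-and-extract loop, assuming a list of archive URLs as built above (the skip-if-already-downloaded check is a convenience addition, not part of the original):

```python
import urllib.request
import zipfile
from pathlib import Path

from tqdm import tqdm

def download_and_unzip(urls, data_dir="data"):
    """Download each zip archive and extract its contents into data_dir."""
    data_dir = Path(data_dir)
    data_dir.mkdir(exist_ok=True)
    for url in tqdm(urls, desc="Downloading"):
        zip_path = data_dir / url.split("/")[-1]
        if not zip_path.exists():  # skip archives we already have
            urllib.request.urlretrieve(url, zip_path)
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(data_dir)
```

Wrapping the URL list in `tqdm` gives us a progress bar for free, which is handy since each archive can take a while to download.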
Now let's collect all files with a .csv extension in the data directory:
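Using pathlib's `glob`, this looks like:

```python
from pathlib import Path

data_dir = Path("data")
data_dir.mkdir(exist_ok=True)  # ensure the directory exists

# Every CSV produced by the extraction step, in a stable order.
csv_files = sorted(data_dir.glob("*.csv"))
```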
We now have all the files we need on our local file system!
Coming up next...
In Part 2, we'll uncover various issues that are polluting this data and demonstrate a few techniques for dealing with them.
Learn more about the data science bootcamp and other courses by visiting TripleTen and signing up for your free introductory class.