If you’re looking for free datasets for practicing new skills, you’re in luck. The number of free, publicly available datasets has only grown over time, giving data science professionals a treasure trove of information to practice their skills on.
With that in mind, we talked to two senior-level data science instructors — Joe Eddy, of the Metis bootcamp in New York City, and Raja Iqbal, founder of Data Science Dojo — to get an overview of the free datasets best suited for a variety of competencies, including product purchasing analysis, ad-click prediction, image classification, sentiment analysis and time series analysis.
Here are the top datasets they recommend.
Top Free Datasets for Data Science Practice
- Lending Club Loan Data by Lending Club
- Instacart Market Basket Analysis by Instacart
- MNIST Database of Handwritten Digits by Yann LeCun, Corinna Cortes and Christopher J.C. Burges
- Amazon Product Reviews by Julian McAuley
- Hourly Energy Consumption by PJM
- Web Traffic Time Series Forecasting by Google
- Uber Pickups in New York City by FiveThirtyEight and NYC Taxi and Limousine Commission
- Individual Household Electric Power Consumption by UC Irvine
Tabular Datasets
Tabular data is data organized in a table using rows and columns. It’s a simple form of data found in spreadsheet and comma-separated values (CSV) files, and often contains mixed data types (having string and numeric values). Tabular data is used to train machine learning models to find relationships between data points and make predictions on new data.
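As a minimal sketch of what "mixed data types" means in practice, the snippet below parses a small CSV sample with Python's standard library and converts each cell to a number where possible. The sample rows and column names are hypothetical and are not taken from any dataset in this list.

```python
import csv
import io

# Hypothetical sample; column names are illustrative only.
RAW = """loan_id,state,amount,interest_rate
1001,NY,5000,10.65
1002,CA,2400,15.96
1003,TX,10000,13.49
"""


def coerce(value):
    """Convert a CSV cell to int or float when possible, else keep the string."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            continue
    return value


def load_rows(text):
    """Parse CSV text into a list of dicts with typed values."""
    reader = csv.DictReader(io.StringIO(text))
    return [{key: coerce(val) for key, val in row.items()} for row in reader]


rows = load_rows(RAW)
# Record which Python type each column ended up with, based on the first row.
column_types = {key: type(val).__name__ for key, val in rows[0].items()}
print(column_types)
```

On a real file, a library such as pandas would likely handle type inference for you; the point here is only that string and numeric columns sit side by side in the same table.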
1. Lending Club Loan Data
The Lending Club Loan Data set is a great resource for data scientists to practice loan default prediction and expand their finance domain knowledge. It has a massive number of data points, covering all loans made between 2007 and 2015, and it’s feature rich, including credit scores, number of finance inquiries and geographical information. It’s not always easy to find a finance dataset that checks both boxes. “Sometimes finance data is kind of hard to get,” Eddy said.
Also consider diving into Lending Club’s API or, as Iqbal suggested, the UCI Machine Learning Repository’s Default of Credit Card Clients dataset, sourced from default payments in Taiwan.
2. Instacart Market Basket Analysis
The Instacart Market Basket Analysis set is one of the largest real-world grocery datasets available, making it a go-to for honing product purchasing prediction and analysis. It spans a whopping three million orders placed by 200,000-plus users, with at least four orders per user and as many as 100 for some. It also includes the sequence in which users bought products, and the time of day of each purchase. The patterns within the dataset are easily Google-able, but it remains a great resource for sharpening consumer-side predictive work, Eddy said.
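One common starting point for market basket analysis is counting which products show up together in the same order. The sketch below does that on a handful of made-up baskets; the product names and the `pair_counts` helper are hypothetical, and the real dataset would need to be loaded and grouped by order first.

```python
from collections import Counter
from itertools import combinations

# Hypothetical orders: each inner list is the products in one basket.
orders = [
    ["bananas", "milk", "bread"],
    ["bananas", "milk"],
    ["milk", "bread"],
    ["bananas", "milk", "eggs"],
]


def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of products."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) drops duplicates and keeps the pair order stable.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


counts = pair_counts(orders)
top_pair, top_count = counts.most_common(1)[0]
print(top_pair, top_count)
```

Pair counts like these are the raw material for measures such as support and confidence, which is one way to move from description toward purchase prediction.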
3. Avito Context Ad Clicks
The Avito Context Ad Clicks dataset lets data scientists exercise their ad-click prediction and commercial sector data muscles. It holds quite a lot of data and tables, and offers good opportunities to do some feature engineering in a relational data setting. Avito is a classifieds site similar to Craigslist, and the dataset includes details like item descriptions in ads, geographical details and demand information.
4. Outbrain Click Prediction
The Outbrain Click Prediction dataset also deals with predicting what recommended content users will click next. It samples two billion page views, nearly 17 million clicks and a mess of user recommendations that were made across hundreds of publisher sites over the course of two weeks in 2016. (Outbrain is one of the companies that put boxes of sponsored-content articles at the bottom of sites.)
“So much of in-practice data science is literally just ad-click predictions,” Eddy said.
Image Datasets
Image data is data extracted from images or photos, and can include information on pixels and other visual characteristics. This data is found in image files such as JPEGs, PNGs and GIFs, and is used to train machine learning models to recognize and classify objects in pictures, a core capability of computer vision.
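To make "information on pixels" concrete, here is a small sketch that treats a grayscale image as a grid of intensities and turns it into a flat, scaled feature vector, a typical preparation step before feeding images to a simple classifier. The 4x4 image and the `flatten_and_scale` helper are hypothetical; real image files would first be decoded with an imaging library.

```python
# Hypothetical 4x4 grayscale image: each value is a pixel intensity from 0 to 255.
image = [
    [0,   0, 255,   0],
    [0, 255, 255,   0],
    [0,   0, 255,   0],
    [0,   0, 255,   0],
]


def flatten_and_scale(pixels, max_value=255):
    """Turn a 2D pixel grid into a flat feature vector with values in [0, 1]."""
    return [value / max_value for row in pixels for value in row]


features = flatten_and_scale(image)
print(len(features), features[2], features[0])
```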
5. MNIST Database of Handwritten Digits
The MNIST Database of Handwritten Digits is a well-known dataset consisting of the digits 0 through 9, written in a variety of handwriting styles. It remains an ideal entry point for image classification newcomers.
Basic classification is “pretty much the simplest possible problem for images, but it’s a good starting point for anyone who’s playing around with neural network image classification from scratch,” Eddy said.
“One of the hard things about working with neural networks when you’re starting is that, sometimes, training and retooling your models is just very time consuming,” he added. “So having relatively smaller, simpler datasets dramatically speeds that up.”
6. Dogs vs. Cats
“It’s simple enough to be accessible, but complicated enough to allow for meaningful work — my absolute favorite resource to recommend to a beginner in deep learning, in particular those who want to work with Python,” Eddy said. “Having a guided approach is extremely helpful.”
There’s a universe of more complex problems waiting beyond these simple classifications, but the core of those problems often involves repeat applications of exactly the kind of work needed to solve simpler ones, “so starting with one or two of those simple datasets will give you a really strong foundation for exploring almost any standard image problem,” Eddy said.
Text Mining and Text Analysis Datasets
Text mining and text analysis examine and identify patterns in unstructured text data. This data includes any large amount of text that is not traditionally organized or formatted into a table or database. Text mining and analysis can be used for sentiment analysis, topic modeling and named entity recognition, and may apply natural language processing (NLP) to achieve these tasks.
7. Large Movie Review Dataset
The Large Movie Review Dataset, a cache of IMDb reviews released in 2011, includes 25,000 reviews for training and 25,000 more for testing, and it remains a popular tool for sharpening sentiment analysis skills.
As Towards Data Science noted in a spotlight, be prepared to do a fair amount of cleaning and vectorization before building and training your classifier. But the effort should pay off.
“You can [predict sentiment] with traditional NLP techniques or with slightly fancier, modern neural network techniques,” Eddy said. “It’s a very easy playground for a wide range of different possible techniques.”
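The cleaning and vectorization step mentioned above can be sketched with the standard library alone. The example below strips HTML remnants such as `<br />` tags, lowercases the text and builds a simple bag-of-words count vector. The sample reviews, the three-word vocabulary and the helper names are hypothetical; a real project would more likely use a library vectorizer and a far larger vocabulary.

```python
import re
from collections import Counter

TAG_RE = re.compile(r"<[^>]+>")    # HTML remnants such as <br />
TOKEN_RE = re.compile(r"[a-z']+")  # runs of letters and apostrophes


def clean(text):
    """Strip HTML tags and lowercase the review."""
    return TAG_RE.sub(" ", text).lower()


def vectorize(text, vocabulary):
    """Count how often each vocabulary word appears in the cleaned text."""
    counts = Counter(TOKEN_RE.findall(clean(text)))
    return [counts[word] for word in vocabulary]


# Hypothetical reviews and vocabulary, for illustration only.
reviews = [
    "A great film.<br /><br />Great cast, great pacing!",
    "Boring plot. I didn't enjoy it.",
]
vocabulary = ["great", "boring", "enjoy"]
vectors = [vectorize(review, vocabulary) for review in reviews]
print(vectors)
```

Note that plain word counts ignore negation ("didn't enjoy" still counts "enjoy"), which is one reason the fancier techniques Eddy mentions can do better.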
8. Twitter and Reddit Sentimental Analysis Dataset
Twitter and Reddit hold mounds of text conversations and threads, so the Twitter and Reddit Sentimental Analysis Dataset, containing over 160,000 tweets and 37,000 Reddit comments, offers a handy way to pair your sentiment analysis project with some of the biggest social platforms out there.
It’s also a good opportunity to practice topic modeling, query writing and a bit of text preprocessing, since web users aren’t always known for their grammatical precision.
9. Stack Exchange API
Similarly, the Stack Exchange API dataset gives a glimpse into the ecosystem of the Stack Exchange Q&A sites, so you’re bound to find some domains of interest — probably with generally cleaner text than the Reddit dataset.
Eddy recalled a past project: “I grabbed a bunch of Statistics Stack Exchange questions to analyze what topics were more or less popular, what language was associated with getting more responses to the question. It was really interesting because it was a topic that I had an attachment to.”
10. Amazon Product Reviews
The Amazon Product Reviews dataset is a sentiment analysis-friendly set that Iqbal points to, particularly for an advanced data scientist who works in, or hopes to break into, marketing. It contains 142.8 million reviews, extensive product information and “also viewed” and “also bought” details, culled from user activity between 1996 and 2014. It is a natural fit for those looking to feed sentiment analysis into building recommender systems.
Time Series Datasets
Time series data is data collected over an interval of time, and can include historical or real-time data points. This data is used in time series analysis and forecasting, which detect patterns and predict when specific changes may occur over time. Time series data helps forecast events like the weather, stock prices or heart rate readings.
Eddy stresses two key criteria when picking datasets for time series analysis — especially for newcomers. First, make sure the time interval is fixed. Whether day-to-day, minute-to-minute, hour-to-hour, the key thing is that the data is recorded in a regular, standardized measurement. Second, watch for clear, seasonal patterns that have logical effects.
Any of the below time series datasets — 11 to 13 being beginner-friendly, and 14 to 17 being more advanced — fit the bill.
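Both of Eddy’s criteria can be checked with a few lines of code before committing to a dataset. The sketch below tests whether timestamps are evenly spaced and averages readings by hour of day to surface a daily pattern. The synthetic series and both helper functions are hypothetical stand-ins, not part of any dataset listed here.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def has_fixed_interval(timestamps):
    """Return True if every consecutive pair of timestamps is equally spaced."""
    gaps = {later - earlier for earlier, later in zip(timestamps, timestamps[1:])}
    return len(gaps) == 1


def mean_by_hour(timestamps, values):
    """Average the values for each hour of the day to expose a daily pattern."""
    buckets = defaultdict(list)
    for stamp, value in zip(timestamps, values):
        buckets[stamp.hour].append(value)
    return {hour: sum(vals) / len(vals) for hour, vals in sorted(buckets.items())}


# Hypothetical series: two days of hourly readings, higher during the day.
start = datetime(2024, 1, 1)
stamps = [start + timedelta(hours=i) for i in range(48)]
readings = [100 if 8 <= s.hour < 20 else 40 for s in stamps]

print(has_fixed_interval(stamps))
profile = mean_by_hour(stamps, readings)
print(profile[3], profile[12])
```

If the first check fails, the series has gaps or irregular sampling and will need resampling or imputation before most beginner-level forecasting methods apply.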
11. Hourly Energy Consumption
The Hourly Energy Consumption dataset features over 10 years of hourly energy consumption data, measured in megawatts, for eastern U.S. states, provided by PJM Interconnection. This time series set lets data scientists practice predicting energy consumption at certain times of the day, week or year, or on special occasions like holidays.
12. Web Traffic Time Series Forecasting
The Web Traffic Time Series Forecasting dataset, provided by Google, contains traffic data for 145,000 Wikipedia articles, with a focus on using that data to predict future web traffic trends. Each time series states the name of the Wikipedia article visited and the type of traffic represented (desktop, mobile or spider bot traffic).
13. International Greenhouse Gas Emissions
The International Greenhouse Gas Emissions dataset, provided by the United Nations, covers global greenhouse gas emission levels from 1990 to 2017. The set aims to help forecast emissions trends over time by gas type, and includes emission information on carbon dioxide, methane, nitrous oxide and hydrofluorocarbons.
14. Uber Pickups in New York City
The Uber Pickups in New York City dataset supplies date, time and location data for over 20 million Uber and for-hire vehicle trips in the NYC area. The Uber data spans April to September in 2014, while the for-hire vehicles data spans January to June in 2015. Uber’s data comes via the New York City Taxi & Limousine Commission. It was released following a 2015 FOIA request by FiveThirtyEight, which produced reporting based on the data that was eye-opening at the time.
15. Citi Bike System Data
The Citi Bike System Data set sheds light on where, when and how far Citi Bike users in New York City ride. This set includes extensive travel information like bike ride ID, start and end times, start and end station IDs and geographical location. Data scientists can use this set to determine which days of the week and which times of day see the most rides.
16. Capital Bike Sharing
Capital Bike Sharing is an intermediate-level dataset showing the hourly and daily count of bike rentals in the Capital Bikeshare system between 2011 and 2012. Users can practice predicting how many bikes may be rented at certain times based on weather conditions and seasonal factors, and delve into exploratory data analysis, regression modeling or data visualization techniques.
17. Individual Household Electric Power Consumption
Finally, Iqbal recommends the Individual Household Electric Power Consumption dataset, from the UCI Machine Learning Repository, for advanced-level time series practice. The set presents measurements of electric power consumption for one household at a one-minute sampling rate over a span of nearly four years. This can help exercise short-term forecasting skills on a single subject: one home.