20 Free Datasets for Data Science Projects

Looking for free datasets to practice with? Check out these ones suggested by data science instructors.

Written by Stephen Gossett
Updated by Brennan Whitfield | May 07, 2024

If you’re looking for free datasets for practicing new skills, you’re in luck. The number of free, publicly available datasets on sites like Google Dataset Search, Kaggle and Data.gov keeps growing, giving data science professionals a treasure trove of information to practice on.

With that in mind, we rounded up free datasets best suited for a variety of competencies, including product purchasing analysis, ad-click prediction, image classification, sentiment analysis and time series analysis.

Top Free Datasets for Data Science Practice

  • Lending Club Loan Data by Lending Club
  • Instacart Market Basket Analysis by Instacart
  • MNIST Database of Handwritten Digits by Yann LeCun, Corinna Cortes and Christopher J.C. Burges
  • Amazon Product Reviews by Julian McAuley 
  • Hourly Energy Consumption by PJM
  • Web Traffic Time Series Forecasting by Google
  • Uber Pickups in New York City by FiveThirtyEight and NYC Taxi and Limousine Commission
  • Individual Household Electric Power Consumption by UC Irvine

 

Tabular Datasets

Tabular data is data organized in a table using rows and columns. It’s a simple form of data found in spreadsheet and comma-separated values (CSV) files, and often contains mixed data types (having string and numeric values). Tabular data is used to train machine learning models to find relationships between data points and make predictions on new data.
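As a minimal sketch of what mixed-type tabular data looks like in practice, the snippet below parses a small CSV sample with Python's standard csv module and converts the numeric columns. The column names and values are invented for illustration and do not come from any dataset on this list.

```python
import csv
import io

# Hypothetical CSV sample: column names and values are invented for illustration.
SAMPLE_CSV = """loan_id,state,amount,interest_rate
A1,NY,12000,7.5
A2,CA,8000,11.2
A3,TX,15000,9.9
"""

def load_rows(text):
    """Parse CSV text into dicts, converting the numeric columns."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        rows.append({
            "loan_id": record["loan_id"],                      # string identifier
            "state": record["state"],                          # string (categorical)
            "amount": int(record["amount"]),                   # integer
            "interest_rate": float(record["interest_rate"]),   # float
        })
    return rows

rows = load_rows(SAMPLE_CSV)
average_amount = sum(r["amount"] for r in rows) / len(rows)
print(rows[0])
print(f"Average loan amount: {average_amount:.2f}")
```

Every value arrives from a CSV file as a string, so deciding which columns are numeric and which are categorical is usually the first step before any modeling.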

1. Lending Club Loan Data

Dataset: Lending Club Loan Data

The Lending Club Loan Data set is a great resource for data scientists to practice loan default prediction and expand their finance domain knowledge. It has a massive number of data points, covering all loans made between 2007 and 2015, and it’s feature rich, including credit scores, number of finance inquiries and geographical information. It’s not always easy to find a finance dataset that checks both boxes. “Sometimes finance data is kind of hard to get,” Joe Eddy, data science instructor of the Metis bootcamp in New York City, told Built In.

Also consider diving into Lending Club’s API or, as Raja Iqbal, founder of Data Science Dojo, suggested, the UCI Machine Learning Repository’s Default of Credit Card Clients dataset, sourced from default payments in Taiwan.

2. Instacart Market Basket Analysis

Dataset: Instacart Market Basket Analysis

The Instacart Market Basket Analysis set is one of the largest real-world grocery datasets available, making it a go-to for honing product purchasing prediction and analysis. It spans a whopping three million orders placed by 200,000-plus users, with at least four orders per user and some including as many as 100. It also includes the sequence in which users bought products, and the time of day of each purchase. The patterns within the dataset are easily Google-able, but it remains a great resource for sharpening consumer-side predictive work, Eddy said.
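A common first step in market basket analysis is counting how often pairs of products appear in the same order. The sketch below shows that idea on a handful of made-up baskets; the product names and the `count_pairs` helper are hypothetical and are not taken from the Instacart data or its schema.

```python
from collections import Counter
from itertools import combinations

# Hypothetical orders: each inner list is one basket of product names.
orders = [
    ["bananas", "milk", "bread"],
    ["bananas", "milk"],
    ["bread", "eggs"],
    ["bananas", "milk", "eggs"],
]

def count_pairs(baskets):
    """Count how many baskets contain each unordered pair of products."""
    pair_counts = Counter()
    for basket in baskets:
        # sorted(set(...)) removes duplicates and gives each pair a stable order
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1
    return pair_counts

pair_counts = count_pairs(orders)
# Support = share of all baskets that contain the pair
support = {pair: n / len(orders) for pair, n in pair_counts.items()}
top_pair, top_count = pair_counts.most_common(1)[0]
print(top_pair, top_count, support[top_pair])
```

On a dataset with millions of orders, the same counting would likely need filtering to frequent products first, since the number of possible pairs grows quickly.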

3. Avito Context Ad Clicks 

Dataset: Avito Context Ad Clicks

The Avito Context Ad Clicks dataset lets data scientists exercise their ad-click prediction and commercial sector data muscles. It holds quite a lot of data and tables, and offers good opportunities to do some feature engineering in a relational data setting. Avito is an online classifieds site comparable to Craigslist, and the dataset includes details like item descriptions in ads, geographical details and demand information.

4. Outbrain Click Prediction

Dataset: Outbrain Click Prediction

The Outbrain Click Prediction dataset deals with predicting what recommended content users will click next. It samples two billion page views, nearly 17 million clicks and a mess of user recommendations that were made across hundreds of publisher sites over the course of two weeks in 2016. (Outbrain is one of the companies that put boxes of sponsored-content articles at the bottom of sites.) 

“So much of in-practice data science is literally just ad-click predictions,” Eddy said.

5. Coffee Reviews Dataset

Dataset: Coffee Reviews Dataset

This dataset organizes global reviews of coffee between 2017 and 2022 based on factors like blend name, type of roast, price and geographical origin of coffee beans. It is pre-processed and cleaned, and can be used for pandas, data analysis and feature engineering practice. The original version of the dataset comes with 12 features, while the simplified version has nine features.

6. Electric Vehicle Population Data

Dataset: Electric Vehicle Population Data

Provided by the State of Washington, this dataset displays information about battery electric vehicles (BEVs) and plug-in hybrid electric vehicles (PHEVs) currently registered through the Washington State Department of Licensing. Data is separated into 17 different columns, showing each vehicle’s VIN, county and city of registration, make and model, electric type and electric range. Vehicle model years range from 2013 to the current year, with metadata being routinely updated by the Washington government.

 

Image Datasets

Image data is data extracted from images or photos, and can include information on pixels and other visual characteristics. This data is found from image files such as JPEGs, PNGs and GIFs, and is used to train machine learning models to recognize and classify certain objects from pictures (leading to abilities like computer vision).
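To make the idea of pixel data concrete, here is a minimal sketch that treats a tiny grayscale image as a grid of intensities, scales the values and flattens them into one feature vector. The 4x4 image is invented for illustration; real image files would first need to be decoded with an imaging library.

```python
# Hypothetical 4x4 grayscale image: each value is a pixel intensity from 0 to 255.
image = [
    [0,   0,   255, 255],
    [0,   0,   255, 255],
    [128, 128, 64,  64],
    [128, 128, 64,  64],
]

def normalize(pixels):
    """Scale 0-255 intensities to the 0.0-1.0 range."""
    return [[value / 255 for value in row] for row in pixels]

def flatten(pixels):
    """Turn a 2D grid into a single feature vector, row by row."""
    return [value for row in pixels for value in row]

features = flatten(normalize(image))
print(len(features))   # one feature per pixel
print(features[:4])    # first row of the image after scaling
```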

7. ImageNet

Dataset: ImageNet

ImageNet is a database of over 14 million images, intended for training image recognition, image classification and computer vision models. The database is organized according to the WordNet hierarchy and has over 20,000 synonym sets indexed, with about 1,000 images associated for each of these sets. The ImageNet project has been historically impactful for advancing computer vision and deep learning research.

8. MNIST Database of Handwritten Digits

Dataset: MNIST Dataset

The MNIST Database of Handwritten Digits is a well-known dataset consisting of the digits 0 through 9, written in a variety of handwriting styles. It remains an ideal entry point for image classification newcomers.

Basic classification is “pretty much the simplest possible problem for images, but it’s a good starting point for anyone who’s playing around with neural network image classification from scratch,” Eddy said.

“One of the hard things about working with neural networks when you’re starting is that, sometimes, training and retooling your models is just very time consuming,” he added. “So having relatively smaller, simpler datasets dramatically speeds that up.”

9. Dogs vs. Cats

Dataset: Dogs vs. Cats

The Dogs vs. Cats dataset is another recommended starting point for image classification, and is even referenced by Keras creator Francois Chollet in his book, Deep Learning With Python.

“It’s simple enough to be accessible, but complicated enough to allow for meaningful work — my absolute favorite resource to recommend to a beginner in deep learning, in particular those who want to work with Python,” Eddy said. “Having a guided approach is extremely helpful.”

There’s a universe of more complex problems waiting beyond these simple classifications, but the core of those problems often involves repeat applications of exactly the kind of work needed to solve simpler ones, “so starting with one or two of those simple datasets will give you a really strong foundation for exploring almost any standard image problem,” Eddy said.


 

Text Mining and Text Analysis Datasets

Text mining and text analysis examine and identify patterns in unstructured text data. This data includes any large amount of text that is not traditionally organized or formatted into a table or database. Text mining and analysis can be used for sentiment analysis, topic modeling and named entity recognition, and may apply natural language processing (NLP) to achieve these tasks.

10. Large Movie Review Dataset 

Dataset: Large Movie Review Dataset

The Large Movie Review Dataset, a 2011 cache of IMDb reviews, includes 25,000 reviews for training and 25,000 more for testing, and remains a popular tool for sharpening sentiment analysis skills.

As Towards Data Science noted in a spotlight, be prepared to do a fair amount of cleaning and vectorization before building and training your classifier. But the effort should pay off.

“You can [predict sentiment] with traditional NLP techniques or with slightly fancier, modern neural network techniques,” Eddy said. “It’s a very easy playground for a wide range of different possible techniques.”
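The cleaning and vectorization step mentioned above can be sketched with a simple bag-of-words approach. The example reviews are invented, and the assumption that the text contains stray HTML tags is only illustrative of the kind of cleanup review text tends to need.

```python
import re
from collections import Counter

# Hypothetical reviews, invented for illustration.
reviews = [
    "A great movie!<br />Great cast, great story.",
    "Terrible plot... I was bored.",
]

def clean(text):
    """Strip HTML tags, lowercase, and keep only alphabetic tokens."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.findall(r"[a-z']+", text.lower())

def build_vocabulary(token_lists):
    """Map each distinct token to a column index, in sorted order."""
    vocabulary = sorted({token for tokens in token_lists for token in tokens})
    return {token: index for index, token in enumerate(vocabulary)}

def vectorize(tokens, vocabulary):
    """Return a bag-of-words count vector for one document."""
    counts = Counter(tokens)
    return [counts.get(token, 0) for token in vocabulary]

token_lists = [clean(review) for review in reviews]
vocabulary = build_vocabulary(token_lists)
vectors = [vectorize(tokens, vocabulary) for tokens in token_lists]
print(vocabulary)
print(vectors)
```

Each review becomes a fixed-length list of word counts, which is the kind of numeric input a traditional classifier expects. Neural approaches typically replace the counts with learned embeddings.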

11. Twitter and Reddit Sentimental Analysis Dataset

Dataset: Twitter and Reddit Sentimental Analysis Dataset

X (formerly Twitter) and Reddit hold mounds of text conversations and threads, so the Twitter and Reddit Sentimental Analysis Dataset — containing over 160,000 X posts and 37,000 Reddit comments — offers a handy way to pair your sentiment analysis project to some of the biggest social platforms out there.

It’s also a good opportunity to practice topic modeling, query writing and a bit of text preprocessing, since web users aren’t always known for their grammatical precision.

12. Stack Exchange API

Dataset: Stack Exchange API

Similarly, the Stack Exchange API dataset gives a glimpse into the ecosystem of the Stack Exchange Q&A sites, so you’re bound to find some domains of interest — probably with generally cleaner text than the Reddit dataset.

Eddy recalled a past project: “I grabbed a bunch of Statistics Stack Exchange questions to analyze what topics were more or less popular, what language was associated with getting more responses to the question. It was really interesting because it was a topic that I had an attachment to.”

13. Amazon Product Reviews

Dataset: Amazon Product Reviews

The Amazon Product Reviews dataset is a sentiment analysis-friendly set that Iqbal points to, particularly for advanced data scientists who work in, or hope to break into, marketing. It contains 142.8 million reviews, extensive product information and “also viewed” and “also bought” details, culled from user activity between 1996 and 2014. It’s a natural fit for those looking to feed sentiment analysis into recommender systems.


 

Time Series Datasets

Time series data is data collected over an interval of time, and can include historical or real-time data points. This data is used in time series analysis and forecasting, which detect patterns and predict when specific changes may occur over time. Time series data helps forecast events like the weather, stock prices or heart rate readings.

Eddy stresses two key criteria when picking datasets for time series analysis — especially for newcomers. First, make sure the time interval is fixed. Whether day-to-day, minute-to-minute, hour-to-hour, the key thing is that the data is recorded in a regular, standardized measurement. Second, watch for clear, seasonal patterns that have logical effects.

Any of the time series datasets below fits the bill.
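The first criterion, a fixed time interval, is easy to check before committing to a dataset. The sketch below compares the gaps between consecutive timestamps; the `has_fixed_interval` helper and the sample readings are hypothetical.

```python
from datetime import datetime, timedelta

def has_fixed_interval(timestamps):
    """Return True if consecutive timestamps are all the same distance apart."""
    ordered = sorted(timestamps)
    gaps = {later - earlier for earlier, later in zip(ordered, ordered[1:])}
    return len(gaps) <= 1

# Hypothetical hourly readings, invented for illustration.
start = datetime(2024, 1, 1, 0, 0)
hourly = [start + timedelta(hours=h) for h in range(6)]
# Same series with one reading missing, which leaves a two-hour gap.
with_gap = hourly[:3] + hourly[4:]

print(has_fixed_interval(hourly))
print(has_fixed_interval(with_gap))
```

A series that fails this check is not unusable, but it would need resampling or gap filling before most standard forecasting methods apply.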

14. Hourly Energy Consumption

Dataset: Hourly Energy Consumption

The Hourly Energy Consumption dataset features over 10 years of hourly energy consumption data in eastern U.S. states in megawatts, provided by PJM Interconnection. This time series set lets data scientists practice predicting energy consumption at certain times of the day, week or year, or on special occasions like holidays.

15. International Greenhouse Gas Emissions

Dataset: International Greenhouse Gas Emissions

This International Greenhouse Gas Emissions dataset covers global greenhouse gas emission levels from 1990 to 2017, provided by the United Nations. The set aims to help forecast emissions trends and the types of gases emitted over time, and includes emission information on carbon dioxide, methane, nitrous oxide and hydrofluorocarbons.

16. Individual Household Electric Power Consumption

Dataset: Individual Household Electric Power Consumption

Iqbal recommends the Individual Household Electric Power Consumption dataset, from the UCI Machine Learning Repository, for advanced-level time series practice. The set presents measurements of electric power consumption for one household at a one-minute sampling rate over a span of four years. This can help exercise short-term forecasting skills on a single subject: one home.

17. Web Traffic Time Series Forecasting

Dataset: Web Traffic Time Series Forecasting

The Web Traffic Time Series Forecasting dataset, provided by Google, contains traffic data for 145,000 Wikipedia articles, with a focus on using that data to predict future web traffic trends. Each time series identifies the Wikipedia article visited and the type of traffic represented (desktop, mobile or spider bot traffic).

18. Uber Pickups in New York City 

Dataset: Uber Pickups in New York City

This dataset supplies date, time and location data for over 20 million Uber and for-hire vehicle trips in the NYC area. The Uber data spans April to September in 2014, while the for-hire vehicles data spans January to June in 2015. Uber’s data comes via the New York City Taxi & Limousine Commission. It was released following a 2015 FOIA request by FiveThirtyEight, which delivered much (at the time) eye-opening reporting based on the data.

19. Citi Bike System Data

Dataset: Citi Bike System Data

The Citi Bike System Data set sheds light on where, when and how far Citi Bike users in New York City ride. This set includes extensive travel information like bike ride ID, start and end times, start and end station IDs and geographical location. Data scientists can use this set to determine which days of the week and which times of day see the most rides.
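Finding the busiest days and hours comes down to parsing each ride's start time and counting. The following is a minimal sketch on invented timestamps; the timestamp format is an assumption and may not match the actual Citi Bike files.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Hypothetical ride start times, invented for illustration.
ride_starts = [
    "2024-05-06 08:15:00",  # Monday
    "2024-05-06 08:40:00",  # Monday
    "2024-05-07 17:05:00",  # Tuesday
    "2024-05-11 13:30:00",  # Saturday
    "2024-05-13 08:55:00",  # Monday
]

parsed = [datetime.strptime(s, "%Y-%m-%d %H:%M:%S") for s in ride_starts]

# datetime.weekday() returns 0 for Monday through 6 for Sunday
rides_by_weekday = Counter(WEEKDAYS[dt.weekday()] for dt in parsed)
rides_by_hour = Counter(dt.hour for dt in parsed)

busiest_day, day_count = rides_by_weekday.most_common(1)[0]
busiest_hour, hour_count = rides_by_hour.most_common(1)[0]
print(busiest_day, day_count)
print(busiest_hour, hour_count)
```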

20. Capital Bike Sharing

Dataset: Bike Sharing

Bike Sharing is an intermediate-level dataset showing the hourly and daily count of bike rentals in the Capital Bikeshare system between 2011 and 2012. Users can practice predicting how many bikes may be rented at certain times based on weather conditions and seasonal factors, and delve into exploratory data analysis, regression modeling or data visualization techniques.
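As one possible starting point for the regression modeling mentioned above, the sketch below fits a straight line relating temperature to rental counts using ordinary least squares. The numbers are invented and deliberately simple; the real dataset has more variables and a much noisier relationship.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical observations: temperature in Celsius and bikes rented.
temperatures = [5, 10, 15, 20, 25]
rentals = [40, 70, 100, 130, 160]

slope, intercept = fit_line(temperatures, rentals)
# Predict rentals for a temperature that is not in the sample
predicted = slope * 18 + intercept
print(slope, intercept, predicted)
```

A single-variable fit like this is mainly useful as a baseline to compare against richer models that also account for hour of day, season and weather conditions.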

 

Frequently Asked Questions

Free datasets can be found on websites such as:

  • Google Dataset Search 
  • Kaggle
  • Data.gov
  • GitHub
  • Data.world
  • UCI Machine Learning Repository
  • FiveThirtyEight

Public datasets can be accessed on websites like Kaggle, Google Dataset Search or GitHub, or through government sources like Data.gov, data.healthcare.gov or data.europa.eu.

A dataset is a collection of data often formatted and used for data analysis, research and machine learning projects. A database is a system built for the ongoing management of data, and is often used for storing and accessing data used across a business.
