There’s real-world data science, and there’s Kaggle-competition data science.
Among the many differences: In the real world, the data you need for a desired project might not be immediately available — if it even exists at all.
Luckily, that’s not the case if you’re looking for data sets for practicing new skills. If anything, depending on the category of work, you might face a paradox-of-choice problem, as free, publicly available data sets have proliferated in recent years.
With that in mind, we reached out to two senior-level data science instructors — Joe Eddy, of the Metis bootcamp in New York City, and Raja Iqbal, founder of Data Science Dojo — to get an overview of the free data sets best suited for a variety of competencies, including product purchasing analysis, ad-click prediction, image classification, sentiment analysis and time-series analysis.
Here’s what they recommend…
Top Data Science Datasets for Analysis
- Tabular Data
- Image Data
- Text Mining and Text Analysis
- Time Series
For a data scientist looking to expand finance domain knowledge, there’s no more classic problem than loan default prediction. And Lending Club’s loan data set is a great resource for that competency for a few reasons. It has a massive number of data points, covering all loans made between 2007 and 2015, and it’s feature-rich, including credit scores, number of finance inquiries and geographical information. It’s not always easy to find a finance data set that checks both boxes. “Sometimes finance data is kind of hard to get,” Eddy said.
Also consider diving into Lending Club’s API, or — as Iqbal suggested — the UCI Machine Learning Repository’s Default of Credit Card Clients data set, sourced from default payments in Taiwan.
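The loan-default framing can be sketched in a few lines of scikit-learn. The feature names and coefficients below are invented stand-ins (Lending Club’s real schema differs), so treat this as a shape-of-the-problem sketch rather than a working default model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-ins for Lending Club-style features (names hypothetical).
credit_score = rng.integers(550, 850, n).astype(float)
inquiries = rng.poisson(2, n).astype(float)
loan_amount = rng.uniform(1_000, 40_000, n)

# Fake labels: lower scores and more inquiries raise default probability.
logit = -0.02 * (credit_score - 700) + 0.3 * inquiries - 2.0
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X = np.column_stack([credit_score, inquiries, loan_amount])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale features, then fit a plain logistic-regression default classifier.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
print(f"held-out accuracy: {model.score(X_test, y_test):.2f}")
```

Swapping the synthetic arrays for real Lending Club columns (and the fake labels for actual loan outcomes) keeps the same pipeline shape.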
The largest real-world set of grocery data available is a go-to for honing product purchasing prediction and analysis. It spans a whopping three million orders placed by 200,000-plus users, with at least four orders per user and some including as many as 100. It also includes the sequence in which users bought products, and the time of day of each purchase. The patterns within the data set are easily Google-able, but it remains a great resource for sharpening consumer-side predictive work, Eddy said.
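A simple first exercise on this kind of order data is a per-product reorder rate. The rows below are invented, with column names that mirror the public release:

```python
import pandas as pd

# Toy stand-in for Instacart-style order rows; the column names mirror the
# public release, but these values are invented.
orders = pd.DataFrame({
    "user_id":    [1, 1, 1, 2, 2, 3],
    "product_id": [10, 10, 20, 10, 30, 20],
    "reordered":  [0, 1, 0, 0, 0, 1],
})

# Per-product reorder rate: a simple baseline feature for predicting what
# a shopper will buy again.
reorder_rate = orders.groupby("product_id")["reordered"].mean()
print(reorder_rate)
```

On the real data set, the same groupby scales to the full three million orders and becomes one input feature among many for a purchase-prediction model.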
Outbrain Click Prediction Contest
“So much of in-practice data science is literally just ad-click predictions,” Eddy said.
Indeed, a working data scientist employed in commercial sectors may be sick to the back teeth of data sets like these from Avito and Outbrain, but anyone needing to exercise that muscle should consider either. Both contain a large amount of data spread across multiple tables, offering good opportunities to do some feature engineering in a relational data setting.
The Outbrain data set samples two billion page views, nearly 17 million clicks and a mess of user recommendations that were made across hundreds of publisher sites over the course of two weeks in 2016. (Outbrain is one of the companies that put boxes of sponsored-content articles at the bottom of sites.) Avito is basically the Russian version of Craigslist, so its set includes details like item descriptions in ads, geographical details and demand information.
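The relational feature engineering both sets call for boils down to aggregating one table and joining it onto another. Here is a minimal sketch with invented miniature tables (the schemas are hypothetical, not Outbrain’s or Avito’s actual layouts):

```python
import pandas as pd

# Invented miniature versions of relational ad-click tables.
page_views = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "doc_id":  [100, 101, 100, 102],
})
clicks = pd.DataFrame({
    "user_id": [1, 2],
    "doc_id":  [100, 100],
    "clicked": [1, 1],
})

# Feature engineering across tables: count views per user, then join
# click labels onto the view log to build a training frame.
views_per_user = (page_views.groupby("user_id").size()
                  .rename("n_views").reset_index())
train = (page_views
         .merge(clicks, on=["user_id", "doc_id"], how="left")
         .merge(views_per_user, on="user_id"))
train["clicked"] = train["clicked"].fillna(0).astype(int)
print(train)
```

The left join matters: views with no matching click row become negative examples, which is exactly how click-prediction training frames are usually assembled.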
The MNIST Database of Handwritten Digits
This well-known data set — consisting of the digits 0 through 9, written in a variety of handwriting styles — remains an ideal entry point for image classification newcomers.
Basic classification is “pretty much the simplest possible problem for images, but it’s a good starting point for anyone who’s playing around with neural network image classification from scratch,” Eddy said.
“One of the hard things about working with neural networks when you’re starting is that, sometimes, training and retooling your models is just very time consuming,” he added. “So having relatively smaller, simpler data sets dramatically speeds that up.”
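To see how quickly a small digit set trains, here is a sketch using scikit-learn’s built-in `load_digits` (an 8x8, 1,797-image cousin of MNIST, chosen here so the example is self-contained) and a tiny neural network:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# load_digits ships with scikit-learn: small enough to retrain in seconds,
# which is the point of starting with simple image data.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X / 16.0, y, test_size=0.25, random_state=0)

# A single hidden layer is plenty for 8x8 digits.
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=0)
clf.fit(X_train, y_train)
print(f"test accuracy: {clf.score(X_test, y_test):.2f}")
```

The full MNIST set (28x28 images) follows the same pattern, just with a larger network and longer training runs.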
Another recommended starting point for classification, this is the data set referenced by Keras creator Francois Chollet in his book, Deep Learning With Python. (The book’s accompanying Jupyter notebooks are available on GitHub.)
“It’s simple enough to be accessible, but complicated enough to allow for meaningful work — my absolute favorite resource to recommend to a beginner in deep learning, in particular those who want to work with Python,” Eddy said. “Having a guided approach is extremely helpful.”
There’s a universe of more complex problems waiting beyond these simple classifications, but the core of those problems often involves repeat applications of exactly the kind of work needed to solve simpler ones, “so starting with one or two of those simple data sets will give you a really strong foundation for exploring almost any standard image problem,” Eddy said.
Text Mining and Text Analysis
This 2017 cache of IMDB reviews, which includes 25,000 reviews for testing and 25,000 more for training, remains a popular tool for sharpening sentiment analysis skills.
As Towards Data Science noted in a spotlight, be prepared to do a fair amount of cleaning and vectorization before building and training your classifier. But the effort should pay off.
“You can [predict sentiment] with traditional NLP techniques or with slightly fancier, modern neural network techniques,” Eddy said. “It’s a very easy playground for a wide range of different possible techniques.”
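The “traditional NLP” route Eddy mentions is essentially bag-of-words counts feeding a linear classifier. A minimal sketch, with a handful of invented reviews standing in for the IMDB training split:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny invented reviews standing in for the IMDB training data.
reviews = [
    "a wonderful, moving film", "great acting and a great story",
    "absolutely loved it", "an instant classic",
    "dull and boring", "a terrible waste of time",
    "awful script, worse acting", "i hated every minute",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = positive, 0 = negative

# Bag-of-words counts + logistic regression: the classic sentiment baseline.
clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(reviews, labels)
print(clf.predict(["a great, wonderful story"]))
```

On the real 25,000-review training split, the same pipeline works unchanged; the cleaning and vectorization effort mentioned above mostly goes into tuning what `CountVectorizer` (or a TF-IDF variant) keeps.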
Every topic under the sun has a dedicated subreddit, so this data set of Reddit comments, available through BigQuery, offers a handy way to pair your sentiment analysis project to a domain you’re passionate about.
It’s also a good opportunity to practice topic modeling, query writing and a bit of text preprocessing, since Redditors aren’t always known for their grammatical precision. But it should be easier than, say, Twitter, where you’ll have to wrangle the API and deal with the drawbacks of character limits and even messier text.
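The text preprocessing step can start very simply. A minimal cleaning function for comment-style text (the exact rules are an assumption; real pipelines add stopword removal, stemming and so on):

```python
import re

def clean_comment(text: str) -> list[str]:
    """Minimal comment preprocessing: lowercase, strip URLs and
    punctuation/markup, then tokenize on whitespace."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # drop links
    text = re.sub(r"[^a-z0-9\s]", " ", text)    # drop punctuation/markup
    return text.split()

print(clean_comment("LOL check this out!! https://example.com *so* cool"))
```

The resulting token lists are what you would feed into a topic model or sentiment classifier downstream.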
Similarly, the ecosystem of Stack Exchange Q&A sites runs deep and diverse, so you’re bound to find some domains of interest — probably with generally cleaner text than the Reddit data set.
Eddy recalled a past project: “I grabbed a bunch of Statistics Stack Exchange questions to analyze what topics were more or less popular, what language was associated with getting more responses to the question. It was really interesting because it was a topic that I had an attachment to.”
Iqbal points to this sentiment analysis-friendly data set, particularly for an advanced data scientist who works in, or hopes to break into, marketing. It contains 142.8 million reviews, extensive product information and “also viewed” and “also bought” details, culled from user activity between 1996 and 2014. It’s a natural fit for those looking to filter sentiment analysis into building recommender systems.
Web Traffic Time Series Forecasting With Wikipedia Pageviews
International Greenhouse Gas Emissions
Eddy stresses two key criteria when picking data sets for time series analysis — especially for newcomers. First, make sure the time interval is fixed: whether day-to-day, hour-to-hour or minute-to-minute, the key is that the data is recorded at a regular, standardized interval. Second, look for clear seasonal patterns with logical causes.
Exchange trading hours, for example, always run from 9:30 a.m. to 4 p.m., but stock prices themselves are hugely volatile. “There won’t be a clear pattern versus if you want to predict, say, daily subway ridership each day of the week,” Eddy said. “That’s actually tractable because there will be clear patterns, with people riding more during the week and less on the weekends and so on.”
Any of the above beginner-friendly time series data sets — two related to energy, one related to pageviews — fit the bill.
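Both criteria are easy to check in pandas. This sketch builds a synthetic daily “ridership” series at a fixed interval with the weekday/weekend pattern described above baked in, then averages by day of week, which is a typical first look for weekly seasonality:

```python
import numpy as np
import pandas as pd

# Synthetic daily ridership at a fixed (daily) interval, with an explicit
# weekday/weekend pattern plus noise.
idx = pd.date_range("2023-01-02", periods=28, freq="D")
rng = np.random.default_rng(0)
is_weekday = idx.dayofweek < 5
riders = np.where(is_weekday, 10_000, 4_000) + rng.normal(0, 300, len(idx))
series = pd.Series(riders, index=idx)

# A first check for weekly seasonality: average ridership by day of week.
by_dow = series.groupby(series.index.dayofweek).mean()
print(by_dow.round(0))
```

On a real series, a flat `by_dow` profile would suggest looking for seasonality at a different period (daily, monthly, yearly) instead.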
Speaking of transportation data, these ride-hailing and bike-rental data sets meet the same criteria but are a bit more challenging in terms of acquisition and formatting.
The Uber data comes via the New York City Taxi & Limousine Commission. It was released following a 2015 FOIA request by FiveThirtyEight, which at the time delivered much eye-opening reporting based on the data.
Finally, Iqbal recommends this Electricity Consumption data set, from UCI’s Machine Learning Repository, for advanced-level time-series practice.