4 Types of Projects You Need in Your Data Science Portfolio
Although data science is a rapidly growing field, the number of competing job seekers seems to increase exponentially year over year. So, even though demand for qualified data scientists is high, finding a job in the field remains extremely difficult. In order to get a job, you’ll need to stand out among hundreds, if not thousands, of other applicants.
As a data scientist, you need to have a strong portfolio that clearly demonstrates your technical skill set as well as your soft skills. Most importantly, your portfolio needs to prove you’re hungry to learn.
4 Types of Data Science Projects You Need in Your Portfolio
- Data Cleaning
- Exploratory Data Analysis
- Data Visualization
- Machine Learning
The umbrella term “data science” covers many topics, including all subfields of machine learning, computer vision, artificial intelligence and natural language processing.
Despite the variety of sub-disciplines, in order to prove your value as a candidate, you only need to demonstrate your abilities in the core competencies of data science. Here are four projects that will make you stand out in a crowd and help you land your dream job.
Data Cleaning
As a data scientist, you’ll probably spend close to 80 percent of your time cleaning data. You can’t build an efficient, solid model on a data set that’s disorganized.
When you’re cleaning your data, it can take you hours upon hours of research to figure out each column's purpose in the data set. Sometimes after hours—and even days—of cleaning, you discover the data set you’re analyzing isn’t really suitable for what you’re trying to achieve!
Then, you’ll need to start the process all over again.
Cleaning data can be a frustrating and daunting task. It is, however, an essential part of every data science job. To make it less daunting (and more efficient), you need practice, and there are data sets out there that can help.
When you’re looking for a good candidate for a data cleaning project, make sure the data set:
- is spread over multiple files.
- has a lot of nuances, null values and many possible cleaning approaches.
- requires a good amount of research to fully understand.
- is as close to a real-life application as possible.
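As a rough illustration of the criteria above, here is a minimal sketch of a first cleaning pass. It assumes a made-up data set split across two CSV files (inlined as strings so the example is self-contained) that mark missing values inconsistently. The file contents, the `NULL_MARKERS` set and the helper names are all hypothetical, and the sketch uses only Python's standard library; a real project would more likely use a data frame library.

```python
import csv
import io

# Hypothetical raw files, inlined as strings so the sketch is self-contained.
CUSTOMERS_CSV = """customer_id,name,city
1,Ana,Lisbon
2,Ben,N/A
3,,Oslo
"""

ORDERS_CSV = """order_id,customer_id,amount
10,1,25.50
11,2,
12,4,12.00
13,1,n/a
"""

# Assumed set of strings that stand for "missing" in this made-up data.
NULL_MARKERS = {"", "n/a", "na", "null", "none", "-"}


def read_rows(text):
    """Parse CSV text into dicts, mapping null-like strings to None."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        cleaned = {}
        for key, value in row.items():
            value = (value or "").strip()
            cleaned[key] = None if value.lower() in NULL_MARKERS else value
        rows.append(cleaned)
    return rows


def count_nulls(rows):
    """Count missing values per column."""
    counts = {}
    for row in rows:
        for key, value in row.items():
            counts[key] = counts.get(key, 0) + (value is None)
    return counts


def join_orders(customers, orders):
    """Attach customer names to orders; set aside orders with no known customer."""
    by_id = {c["customer_id"]: c for c in customers}
    joined, orphans = [], []
    for order in orders:
        customer = by_id.get(order["customer_id"])
        if customer is None:
            orphans.append(order)
            continue
        amount = float(order["amount"]) if order["amount"] is not None else None
        joined.append(
            {"order_id": order["order_id"], "name": customer["name"], "amount": amount}
        )
    return joined, orphans


customers = read_rows(CUSTOMERS_CSV)
orders = read_rows(ORDERS_CSV)
joined, orphans = join_orders(customers, orders)

print("Nulls in customers:", count_nulls(customers))
print("Nulls in orders:", count_nulls(orders))
print("Orders without a matching customer:", orphans)
```

Even on a toy example like this, you have to make judgment calls (which strings count as null, what to do with orders that reference an unknown customer), and those decisions are what make a cleaning project worth showing.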
We can often find good data sets for cleaning—messy sets as I call them—on websites that collect and aggregate data sets. These kinds of websites collect data from various sources without sorting them out, which makes them great candidates for cleaning projects.
Where to Find Data Sets
- Reddit data sets
Exploratory Data Analysis
Once your data is clean and organized, you’ll need to perform exploratory data analysis (EDA), one of the most important steps in every data science project. There are many benefits of performing EDA, including:
- Maximizing data set insights
- Revealing underlying patterns and structure
- Extracting important information
There are many techniques we can follow to perform an efficient EDA. Most of them are graphical in nature because it’s easier to spot patterns and anomalies in the data when we represent the set visually. The particular graphical techniques we use in EDA tasks are straightforward. For example:
- Plotting the raw data to obtain initial insights
- Plotting simple statistics on the raw data, such as mean plots and standard deviation plots
- Focusing the analysis on specific sections of the data for better results
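To make the techniques above concrete, here is a small sketch that computes the numbers behind a mean plot and a standard deviation plot, one group at a time. The observations and group labels are made up, the helper names are hypothetical, and the "plot" is a crude text bar so the example runs with only the standard library; a plotting library would normally draw it instead.

```python
import statistics
from collections import defaultdict

# Made-up observations: (group label, measured value).
observations = [
    ("A", 10.0), ("A", 12.0), ("A", 11.0), ("A", 13.0),
    ("B", 20.0), ("B", 22.0), ("B", 21.0), ("B", 45.0),
]


def summarize_by_group(rows):
    """Return count, mean, sample standard deviation, min and max per group."""
    groups = defaultdict(list)
    for label, value in rows:
        groups[label].append(value)
    summary = {}
    for label, values in groups.items():
        summary[label] = {
            "n": len(values),
            "mean": statistics.mean(values),
            "stdev": statistics.stdev(values),
            "min": min(values),
            "max": max(values),
        }
    return summary


def text_bar(value, scale=1.0, char="#"):
    """Render a value as a row of characters, one per `scale` units."""
    return char * round(value / scale)


summary = summarize_by_group(observations)
for label, stats in summary.items():
    bar = text_bar(stats["mean"], scale=2.0)
    print(f"{label}  mean={stats['mean']:6.2f}  stdev={stats['stdev']:6.2f}  {bar}")
```

In this toy data, group B has a much larger standard deviation than group A, which is the kind of signal that would prompt you to look at B's raw values and find the single unusually high reading.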
There are many sources where you can learn the basics of EDA and develop an intuition for exploring and finding patterns within your data; one of my favorite courses on the topic is the one offered by Johns Hopkins University on Coursera.
Data Visualization
To stand out, you need to be a good storyteller: every data scientist must develop the ability to tell a compelling story with their data. When you build any kind of data science project, you’re often trying to uncover information that improves or clarifies the data in some way. Most of the time, you’ll need to report on your findings in a university or business setting.
The best way to tell a story is to visualize it.
There are many publicly available data sets you can use to practice data visualization, building dashboards and telling a story with your data. Some of my favorites include FiveThirtyEight, Google’s Dataset Search and Data is Plural. And of course, we can’t talk about data sets without mentioning Kaggle.
Machine Learning
One of the things that can make or break your chances of landing a data science job is your machine learning fluency. Sometimes when newcomers join the field, they tend to skip over the basics and jump straight into the field's more advanced concepts and trendy buzzwords.
Before you dive into machine learning’s advanced concepts, you need to make sure you’ve built a solid foundation with the basics. Perfecting the basics will not only strengthen your skill base but will give you the knowledge necessary to pick up any advanced concepts faster and with ease.
Make sure to have projects that cover all machine learning basics, such as regression (linear, logistic, etc.), classification algorithms and clustering. Some of my favorite resources on machine learning fundamentals are the machine learning basics chapter of The Deep Learning Book and the Codecademy machine learning course.
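To show what working through the basics can look like, here is a minimal sketch of simple linear regression fitted with ordinary least squares, written from scratch. The data (floor area versus price) is made up, and the function names are hypothetical. A portfolio project would typically use an established library, but implementing the formula once is a good way to check that you understand it.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Share of the variance in y explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Made-up data: floor area in square meters vs. price in thousands.
areas = [50, 60, 80, 100, 120]
prices = [155, 178, 242, 298, 362]

slope, intercept = fit_line(areas, prices)
r2 = r_squared(areas, prices, slope, intercept)

print(f"price = {slope:.3f} * area + {intercept:.3f}")
print(f"R^2 = {r2:.4f}")
print(f"Predicted price for 90 square meters: {slope * 90 + intercept:.1f}")
```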
Here are some simple machine learning project ideas that can have a positive impact on your portfolio:
- Loan prediction using a loan prediction data set
- Housing price prediction using a housing price data set
- Music genre classification
- Personality prediction using a personality prediction data set
- Handwritten character recognition
- Speech to text or vice versa
Landing a good job in data science can be quite challenging because of the huge pool of applicants. To stand out in the crowd, your portfolio needs to prove you can learn, implement and adapt to new models and algorithms with ease.
This article was originally published on Towards Data Science.