10 Steps to Become a Data Scientist

Summary: Becoming a data scientist involves mastering programming, databases, math, version control, data visualization, machine learning, time series, neural networks, deep learning and natural language processing, all of which you can learn without a formal degree.

Data science, one of the fastest-growing careers in tech, is an interdisciplinary field that helps us analyze and make sense of the world around us. With a booming job market and companies’ growing reliance on data-driven solutions, demand for data scientists isn’t slowing down any time soon.

Luckily, becoming a data scientist doesn’t require a degree. As long as you’re open to learning new things and willing to put in the effort and time, you can become a data scientist.

The question now is: Where do you start?

The internet is full of tutorials about every aspect of data science, such as machine learning basics, natural language processing, speech recognition and all kinds of amazing data science magic. But, for a beginner, that amount of information can be overwhelming and lead someone to give up before they even begin.

What you need is a structured roadmap that clearly lays out what you need to learn (and in what order) to become a data scientist along with the skills you need to hone on your data science learning journey.

What You Need to Know to Become a Data Scientist

Programming
Databases
Math: Probability Theory, Statistics and Linear Algebra basics
Version Control
Data Science Basics: Finding Data Sets, Science Communication and Data Visualization
Machine Learning Basics
Time Series and Model Validation
Neural Networks
Deep Learning
Natural Language Processing

1. Programming

If you’re new to tech, programming is the best place to start. Currently, the two programming languages most data scientists use are Python and R.

R: A programming language for statistical computing, widely used for developing statistical software and data analysis.
Python: A high-level, general-purpose programming language. Python is widely used in many applications and fields, from simple programming to quantum computing.

Because Python is a beginner-friendly programming language, it’s a great place to start with data science (and maybe more fields in the future). Due to Python's popularity, there are many resources available to learn it.

Some of my favorite Python learning resources are Codecademy, Google Classes and Learn Python the Hard Way.

However, if you decide to go with R, both Coursera and edX have great courses that you can audit for free.

Some of you may already know how to program and might be transitioning to data science from another technical field. In that case, you can skip this step and move forward to the next step of the journey.

2. Databases

You can think of data science as the art of telling a story using data, but you have to be able to actually access the data to tell your story. In other words, whenever you work on a data science project, you’ll need data to analyze, visualize and build a valid project. The data you need are often stored away in some database.

An essential step to standing out as a data scientist is to interact and communicate with databases effectively. For example, having the skills to design a simple database can take you to the next level.

To communicate with a database, you will need to speak its language: SQL. SQL stands for Structured Query Language, and it’s the standard way to query relational databases. My favorite resources for learning SQL are Codecademy, Khan Academy and, for interactive learning, SQLCourse.
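
To give a feel for what querying looks like, here is a minimal sketch using Python's built-in sqlite3 module and an in-memory database. The table name `measurements`, its columns and the values are hypothetical, made up only for illustration.

```python
import sqlite3

# In-memory database; the table, columns and values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (city TEXT, temperature REAL)")
conn.executemany(
    "INSERT INTO measurements (city, temperature) VALUES (?, ?)",
    [("Oslo", 4.0), ("Oslo", 6.0), ("Cairo", 30.0), ("Cairo", 34.0)],
)

# Average temperature per city, sorted alphabetically by city.
rows = conn.execute(
    "SELECT city, AVG(temperature) FROM measurements "
    "GROUP BY city ORDER BY city"
).fetchall()
conn.close()

print(rows)
```

The same SELECT, GROUP BY and ORDER BY clauses carry over to most other relational databases, although each system has its own dialect details.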

3. Math

The core of data science is math. To understand how the different concepts of data science work, you need some understanding of the math behind them, including the basics of probability theory, statistics and linear algebra.

Now, I know math is the one thing that could make some people run for the hills before pursuing a career in data science. Most tools you’ll use in your career implement the math for you, so you’ll rarely need to code it yourself, but you’ll still want some grasp of the foundational principles.

Don’t let math intimidate you from exploring the world of data science! I would say it’s well worth it. There are some helpful materials on Coursera that can help you tackle the math you need.
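
To make the three areas concrete, here is a small sketch that touches each one using only Python's standard library. The sample numbers are made up, and the helper `dot` is a hypothetical name introduced for illustration.

```python
from fractions import Fraction
from statistics import mean, pstdev

# Statistics: summarize a small, made-up sample.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
sample_mean = mean(sample)
sample_spread = pstdev(sample)  # population standard deviation

# Probability: chance that two fair dice sum to 7, by counting outcomes.
outcomes = [(a, b) for a in range(1, 7) for b in range(1, 7)]
favorable = sum(1 for a, b in outcomes if a + b == 7)
p_seven = Fraction(favorable, len(outcomes))

# Linear algebra: the dot product of two vectors.
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

print(sample_mean, sample_spread, p_seven, dot([1, 2, 3], [4, 5, 6]))
```

In practice a numerical library would handle these operations for you, but writing them out once shows how little machinery the basics require.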

4. Version Control

In software development in general and data science in particular, one of the most important concepts you can learn is version control.

Whenever you work on a data science project, you will need to write different code files, explore data sets and collaborate with other data scientists. You’ll have to manage all the changes in the code via version control, namely Git.

Git is a version control system used to track changes in source code during the software development process. It can coordinate work among a group of programmers or track changes in any set of files for a single programmer.
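
Git itself is a command-line tool, but the core idea of tracking changes can be sketched in a few lines of Python. The example below is not how Git works internally; it only uses the standard library's difflib module to show the kind of line-by-line difference a version control system records between two versions of a file. The file name and contents are hypothetical.

```python
import difflib

# Two hypothetical versions of the same small script.
old_version = [
    "import csv\n",
    "rows = load('data.csv')\n",
    "print(len(rows))\n",
]
new_version = [
    "import csv\n",
    "rows = load('clean_data.csv')\n",
    "print(len(rows))\n",
]

# unified_diff yields header lines plus '-' (removed) and '+' (added) lines.
diff = list(
    difflib.unified_diff(
        old_version,
        new_version,
        fromfile="analysis.py (before)",
        tofile="analysis.py (after)",
    )
)
print("".join(diff))
```

A version control system stores a history of such changes along with who made them and why, which is what makes collaboration and rollback possible.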

Git itself is a command-line tool, but hosting platforms such as GitHub and GitLab let you use it without interacting much with the command line (though you will move to the command line eventually).

Luckily, there are many resources to help you understand the inner workings of Git; my top choices are Bitbucket’s Learn Git tutorials and this lecture from the Harvard CS50 course.

5. Data Science Basics

Data science is a broad term and includes many different concepts and technologies. So before you take a deep dive into the big sea of data science, you need to familiarize yourself with some basics first.

Finding data sets: There are two ways to kickstart any data science project; you either have a data set you want to use to build a project or you have a question and need to find a data set to answer it. Exploring data sets and choosing the right one for your project is an important skill to obtain.
Science communication: As a data scientist, you will need to communicate with a general audience to deliver your process and findings. So, you’ll need to develop your science communication and public speaking skills to explain complex concepts using simple terms.
Effective visualization: Visualizing your findings is one of the most direct ways to sanity-check them. Visualization plays a big role in data science, from exploring your data to delivering your results. Getting familiar with effective visualization of data can save you tons of time and effort while working on your project.
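
As a rough illustration of exploring data visually, the sketch below draws a text bar chart with nothing but the standard library. The survey answers are made up, and a real project would more likely use a plotting library such as Matplotlib.

```python
from collections import Counter

# Hypothetical survey answers; in a real project these come from your data set.
answers = ["Python", "R", "Python", "SQL", "Python", "R"]
counts = Counter(answers)

# One text bar per category, most frequent first.
chart_lines = [
    f"{label:<8}{'#' * n} {n}" for label, n in counts.most_common()
]
print("\n".join(chart_lines))
```

Even a chart this crude makes the most common category obvious at a glance, which is the point of visualizing before you analyze.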

6. Machine Learning Basics

So, you worked on your programming skills, brushed up on your math and dived into databases. You’re now ready to start the fun part: applying what you learned so far to build your first project.

Now it’s time to jump into machine learning. Here’s where you start learning and exploring basic algorithms and techniques, such as linear and logistic regression, decision trees, naive Bayes and support vector machines (SVM). You’ll also start discovering the Python or R packages used to organize your data and implement these algorithms. In Python, you’ll get to use scikit-learn, SciPy and NumPy.
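
To show what one of these algorithms does under the hood, here is a minimal sketch of simple linear regression fitted by ordinary least squares in plain Python. The function name `fit_line` and the data are hypothetical; in practice a library such as scikit-learn would do this work for you.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    variance = sum((x - x_mean) ** 2 for x in xs)
    slope = covariance / variance
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Made-up data that follows y = 2x + 1 exactly.
hours = [1, 2, 3, 4, 5]
scores = [3, 5, 7, 9, 11]

slope, intercept = fit_line(hours, scores)
prediction = slope * 6 + intercept  # predict the score for 6 hours
print(slope, intercept, prediction)
```

Because the made-up data lies on a perfect line, the fit recovers it exactly; real data is noisy, which is why the later validation step matters.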

You’ll also learn how to clean up your data to get more accurate predictions and results. This is really where you’ll get to experience what you can do with data science and will be able to see the impact the field has on our daily lives.
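
Cleaning can be as simple as dropping records that cannot be parsed. The sketch below assumes a hypothetical list of raw records with an `age` field stored as text; the field names and the rule for what counts as valid are illustrative choices, not a general recipe.

```python
# Hypothetical raw records: some ages are missing or not valid numbers.
raw_records = [
    {"name": "a", "age": "34"},
    {"name": "b", "age": ""},
    {"name": "c", "age": "n/a"},
    {"name": "d", "age": " 29 "},
]

def clean(records):
    """Keep only records whose age parses as a non-negative integer."""
    cleaned = []
    for record in records:
        text = record["age"].strip()
        if text.isdigit():
            cleaned.append({"name": record["name"], "age": int(text)})
    return cleaned

clean_records = clean(raw_records)
print(clean_records)
```

Whether to drop, repair or fill in bad values depends on the project; dropping is only the simplest option.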

The best place to start learning about the different aspects of machine learning is the various articles on Built In.

7. Time Series and Model Validation

It’s time to dive deeper into machine learning. Your data is not going to be static; it’s often related to time somehow. Time series are data points ordered based on time.

Most commonly, a time series is a sequence of data taken at successive, equally spaced points in time, which makes it discrete-time data. A time series shows you how your data changes over time. This lets you gain insights into trends and periodicity in the data and predict its future behavior.

When dealing with time series, you’ll need to work on two main components:

Analyzing time series data.
Forecasting time series data.
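
Both components can be sketched with a moving average, one of the simplest time series techniques. The sales figures, the window size and the choice to forecast with the latest smoothed value are all assumptions made for illustration.

```python
def moving_average(series, window):
    """Average of each run of `window` consecutive points."""
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

# Made-up monthly sales, ordered by time.
sales = [10, 12, 14, 13, 15, 17, 16, 18]

# Analysis: smooth the series to expose the upward trend.
trend = moving_average(sales, window=3)

# Forecast: use the latest smoothed value as the next-step prediction.
forecast = trend[-1]
print(trend, forecast)
```

Smoothing hides the month-to-month wobble and leaves a steadily rising trend, which is the kind of insight the analysis step is after.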

Building models to predict future behavior is not enough; you need to validate the model's accuracy, too. Here’s where you’ll learn how to build and test models efficiently.

Moreover, you’ll learn how to estimate the threshold of error for each project and how to keep your models within acceptable ranges.
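
As a rough sketch of validation, the example below holds out the most recent points of a made-up series, scores a naive model with mean absolute error and compares the result against an error threshold. The split point, the naive model and the threshold value are all assumptions.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Made-up series; for time series, split by time rather than at random.
series = [10, 12, 14, 13, 15, 17, 16, 18]
train, holdout = series[:6], series[6:]

# Naive model: always predict the last value seen in training.
predictions = [train[-1]] * len(holdout)

mae = mean_absolute_error(holdout, predictions)
max_acceptable_error = 2.0  # hypothetical project-specific threshold
model_is_acceptable = mae <= max_acceptable_error
print(mae, model_is_acceptable)
```

Scoring on data the model never saw during training is what keeps the error estimate honest.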

8. Neural Networks

Neural networks (artificial neural networks, or ANNs) are a biologically inspired programming paradigm that enables a computer to learn from observational data. ANNs started as an approach to mimic the human brain's architecture to perform different learning tasks. To resemble the brain, an ANN is built from components modeled on those of a biological neuron.

An ANN contains a collection of artificial neurons; each neuron is a node connected to others via links. These links correspond to the biological axon-synapse-dendrite connections. Moreover, each link has a weight that determines how strongly one node influences another.
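
The role of the weights can be sketched with a single artificial neuron. The sigmoid activation and the bias term below are common choices that the text above does not specify, so treat them, along with the example values, as assumptions.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs passed through a sigmoid activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(z)

# Hypothetical values: two incoming links whose weights cancel out.
balanced = neuron_output([1.0, 1.0], [0.5, -0.5], bias=0.0)

# Strengthening the first link pushes the neuron's output upward.
excited = neuron_output([1.0, 1.0], [2.0, -0.5], bias=0.0)
print(balanced, excited)
```

Training a network amounts to adjusting these weights until the outputs match the examples it is shown.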

Learning about ANNs enables you to tackle a wider range of tasks, including handwriting recognition, pattern recognition and face identification.

ANNs provide the basic logic you need to know to proceed to the next step in your data science journey: deep learning.

9. Deep Learning

Neural networks are paradigms that power deep learning. Deep learning represents a powerful set of techniques that harness the learning power of neural networks.
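
One way to see why stacking layers matters is the XOR function, which a single neuron with a step activation cannot compute but a two-layer network can. In the sketch below the weights are picked by hand purely for illustration; a framework such as Keras or TensorFlow would learn them from data, and the function names are hypothetical.

```python
def step(z):
    """Fire (1) when the weighted input is positive, otherwise stay off (0)."""
    return 1 if z > 0 else 0

def layer(inputs, weight_rows, biases):
    """One layer: every neuron takes a weighted sum of all inputs."""
    return [
        step(sum(x * w for x, w in zip(inputs, weights)) + bias)
        for weights, bias in zip(weight_rows, biases)
    ]

def xor_network(x1, x2):
    # Hidden layer: the first neuron acts like OR, the second like AND.
    hidden = layer([x1, x2], [[1, 1], [1, 1]], [-0.5, -1.5])
    # Output layer: fires when OR is on but AND is off.
    return layer(hidden, [[1, -1]], [-0.5])[0]

results = {(a, b): xor_network(a, b) for a in (0, 1) for b in (0, 1)}
print(results)
```

Deep learning models follow the same pattern at a much larger scale, with many layers and weights learned automatically.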

You can use neural networks and deep learning to find effective solutions to many problems in various fields, including image recognition, speech recognition and natural language processing.

By now, you’ll be familiar with many Python packages that deal with different aspects of data science. In this step, you’ll get the chance to try popular packages such as Keras and TensorFlow.

Also, by this step, you’ll be proficient enough to read recent research advances in data science and maybe develop your own algorithms.

10. Natural Language Processing

You’re almost at the end. You can already see the finish line. You have gone through many theoretical and practical concepts so far, from simple math to complex deep learning concepts.

So, what’s next?

It’s my personal favorite sub-field of data science: natural language processing (NLP). Natural language processing is an exciting branch of AI that enables you to use the power of machine learning to teach the computer to understand and process human languages.

This includes speech recognition, text-to-speech and speech-to-text applications, virtual assistants (like Siri), language models (like BERT) and all kinds of conversational bots.
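
Most NLP pipelines begin by turning raw text into something countable. The sketch below shows one simple, common starting point, a bag-of-words count, using only the standard library. The tokenizing rule and the sample sentence are assumptions chosen for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into words of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def bag_of_words(text):
    """Count how often each word appears, ignoring word order."""
    return Counter(tokenize(text))

sentence = "The cat sat on the mat. The mat was warm."
word_counts = bag_of_words(sentence)
top_two = word_counts.most_common(2)
print(top_two)
```

Counts like these can be fed to the machine learning algorithms from earlier steps, for example to classify text by topic.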

Here we are at the end of the road. But every end is really a beginning. Just like any other technology-related field, there’s really no end. The field is developing rapidly; new algorithms and techniques are under research as you read this article.

So, being a data scientist means you’ll be a lifelong learner. You’ll develop your knowledge and style as you go. You’ll probably develop an attraction to a specific sub-field, dig even deeper and maybe even specialize.

You’ll hit roadblocks and detours along the way. Just keep an open mind, be patient and dedicate the time and effort to reaching your destination. The most important thing to remember as you embark on this journey is: you can do it.

Frequently Asked Questions

Do I need a degree to become a data scientist?

No, a formal degree isn’t required. With dedication, self-study and practical experience, it’s possible to break into the field without one.

Which programming languages should I learn for data science?

Python and R are the most widely used. Python is beginner-friendly and has broad applications, while R is strong for statistical computing.

What math do I need to know for data science?

Foundational knowledge in probability, statistics and linear algebra is essential to understand core data science concepts.

What are the basics I should learn before diving into machine learning?

Start with programming, databases, math fundamentals, version control and key concepts like data sourcing, visualization and communication.