Data science, one of the most rapidly growing careers in tech, is an interdisciplinary field that helps us analyze and make sense of the world around us. With a booming job market and companies’ growing reliance on data-driven solutions, demand for data scientists isn't slowing down any time soon.
Luckily, becoming a data scientist doesn’t require a degree. As long as you’re open to learning new things and willing to put in the effort and time, you can become a data scientist.
The question now is: Where do you start?
The internet is full of tutorials about every aspect of data science, such as machine learning basics, natural language processing, speech recognition and all kinds of amazing data science magic. But, for a beginner, that amount of information can be overwhelming and lead someone to give up before they even begin.
What you need is a structured roadmap that clearly lays out what you need to learn (and in what order) to become a data scientist along with the skills you need to hone on your data science learning journey.
What You Need to Know to Become a Data Scientist
- Programming
- Databases
- Math: Probability Theory, Statistics and Linear Algebra basics
- Version Control
- Data Science Basics: Finding Data Sets, Science Communication and Data Visualization
- Machine Learning Basics
- Time Series and Model Validation
- Neural Networks
- Deep Learning
- Natural Language Processing
1. Programming
If you’re new to tech, programming is the best place to start. Currently, the two programming languages most data scientists use are Python and R.
- R: A programming language for statistical computing, widely used for developing statistical software and data analysis.
- Python: A high-level, general-purpose programming language. Python is widely used in many applications and fields, from simple programming to quantum computing.
Because Python is a beginner-friendly programming language, it’s a great place to start with data science (and maybe more fields in the future). Due to Python's popularity, there are many resources available to learn it.
Some of my favorite Python learning resources are Codecademy, Google Classes and Learn Python the Hard Way.
However, if you decide to go with R, both Coursera and edX have great courses that you can audit for free.
Some of you may already know how to program and might be transitioning to data science from another technical field. In that case, you can skip this step and move forward to the next step of the journey.
2. Databases
You can think of data science as the art of telling a story using data, but you have to be able to actually access the data to tell your story. In other words, whenever you work on a data science project, you’ll need data to analyze, visualize and build a valid project. The data you need are often stored away in some database.
An essential step to standing out as a data scientist is to interact and communicate with databases effectively. For example, having the skills to design a simple database can take you to the next level.
To communicate with a database, you will need to speak its language: SQL. SQL stands for Structured Query Language, and it is the standard language for querying relational databases. My favorite resources for learning SQL are Codecademy, Khan Academy and the interactive tutorials at SQLCourse.
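To make the idea of "speaking SQL" concrete, here is a minimal sketch that uses Python's built-in sqlite3 module. The table name, column names and rows are invented for illustration; a real project would connect to whatever database holds its data.

```python
import sqlite3

# Use an in-memory database so nothing is written to disk.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical table and rows, made up for this example.
cur.execute("CREATE TABLE measurements (city TEXT, temperature REAL)")
cur.executemany(
    "INSERT INTO measurements (city, temperature) VALUES (?, ?)",
    [("Oslo", 4.0), ("Oslo", 6.0), ("Cairo", 30.0), ("Cairo", 34.0)],
)
conn.commit()

# Ask a question in SQL: average temperature per city, hottest first.
cur.execute(
    "SELECT city, AVG(temperature) AS avg_temp "
    "FROM measurements GROUP BY city ORDER BY avg_temp DESC"
)
rows = cur.fetchall()
print(rows)

conn.close()
```

The same SELECT, GROUP BY and ORDER BY ideas carry over to other relational databases, although each one has its own small dialect differences.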
3. Math
The core of data science is math. To understand how the different concepts of data science work, you need some understanding of the math behind them, including the basics of probability theory, statistics and linear algebra.
Now, I know math is the one thing that could make some run for the hills before pursuing a career in data science. Most tools you’ll use in your career will spare you from implementing the math yourself, but you’ll still want some grasp of the foundational principles.
Don’t let math intimidate you from exploring the world of data science! I would say it’s well worth it. There are some helpful materials on Coursera that can help you tackle the math you need.
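As a rough illustration of how small the first steps can be, the sketch below touches each of the three areas using only Python's standard library. The sample values and the names `heights`, `weights` and `features` are hypothetical.

```python
import statistics

# Hypothetical sample: heights in centimeters.
heights = [160.0, 165.0, 170.0, 175.0, 180.0]

# Statistics: where the data is centered and how spread out it is.
mean_height = statistics.mean(heights)
std_height = statistics.stdev(heights)  # sample standard deviation

# Probability: the share of the sample taller than 170 cm,
# used as an empirical estimate of P(height > 170).
p_tall = sum(h > 170.0 for h in heights) / len(heights)

# Linear algebra: the dot product of two vectors.
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

weights = [0.5, 0.25, 0.25]
features = [2.0, 4.0, 8.0]
score = dot(weights, features)

print(mean_height, round(std_height, 3), p_tall, score)
```

The dot product in particular shows up again and again later on, for example inside regression models and neural networks.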
4. Version Control
In software development in general and data science in particular, one of the most important concepts you can learn is version control.
Whenever you work on a data science project, you will need to write different code files, explore data sets and collaborate with other data scientists. You’ll have to manage all the changes in the code via version control, namely Git.
Git is a version control system used to track changes in source code during the software development process. Git can coordinate work among a group of programmers or track changes in any set of files for a single programmer.
Git itself is a command-line tool, but hosting services such as GitHub or GitLab let you use it without interacting much with the command line (though you will move to the command line eventually).
Luckily, there are many resources to help you understand the inner workings of Git; my top choices are the Bitbucket Learn Git tutorials and this lecture from the Harvard CS50 course.
5. Data Science Basics
Data science is a broad term and includes many different concepts and technologies. So before you take a deep dive into the big sea of data science, you need to familiarize yourself with some basics first.
- Finding data sets: There are two ways to kickstart any data science project: either you have a data set you want to build a project around, or you have a question and need to find a data set to answer it. Exploring data sets and choosing the right one for your project is an important skill to develop.
- Science communication: As a data scientist, you will need to explain your process and findings to a general audience. So, you’ll need to develop your science communication and public speaking skills to explain complex concepts in simple terms.
- Effective visualization: Visualizing your findings is one of the quickest ways to check them. Visualization plays a big role in data science, from exploring your data to delivering your results. Getting familiar with effective data visualization can save you tons of time and effort while working on your project.
6. Machine Learning Basics
So, you worked on your programming skills, brushed up your math and dived into databases. You’re now ready to start the fun part: applying what you learned so far to build your first project.
Now it’s time to jump into machine learning. Here’s when you start learning and exploring basic algorithms and techniques, such as linear and logistic regression, decision trees, naive Bayes and support vector machines (SVM). You’ll also start discovering the different Python or R packages for organizing and modeling your data. In Python, you’ll get to use Scikit-learn, SciPy and NumPy.
You’ll also learn how to clean up your data to get more accurate predictions and results. This is really where you’ll get to experience what you can do with data science and will be able to see the impact the field has on our daily lives.
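To give a feel for what a basic algorithm looks like under the hood, here is a from-scratch sketch of simple linear regression, preceded by a tiny cleaning step. The data is hypothetical, and in practice you would more likely reach for a package such as Scikit-learn than write the fitting code yourself.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical raw data: (hours studied, exam score), with one missing score.
raw = [(1.0, 55.0), (2.0, 60.0), (3.0, None), (3.0, 65.0), (4.0, 70.0), (5.0, 75.0)]

# Cleaning: drop any row that has a missing value.
clean = [(x, y) for x, y in raw if x is not None and y is not None]
hours, scores = zip(*clean)

slope, intercept = fit_line(hours, scores)

# Use the fitted line to predict the score for 6 hours of study.
prediction = slope * 6.0 + intercept
print(slope, intercept, prediction)
```

Dropping incomplete rows is only one cleaning strategy, chosen here because it is the simplest; filling in missing values is another common option.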
The best place to start learning about the different aspects of machine learning is the various articles on Built In.
7. Time Series and Model Validation
It’s time to dive deeper into machine learning. Your data is rarely static; it’s often related to time somehow. A time series is a set of data points ordered by time.
Most commonly, a time series is a sequence of data taken at successive, equally spaced points in time, which makes it discrete-time data. A time series shows you how your data changes over time. This lets you gain insights about trends and periodicity in the data and predict the data's future behavior.
When dealing with time series, you’ll need to work on two main components:
- Analyzing time series data.
- Forecasting time series data.
Building models to predict future behavior is not enough; you need to validate the model's accuracy, too. Here’s where you’ll learn how to build and test models efficiently.
Moreover, you’ll learn how to estimate the threshold of error for each project and how to keep your models within acceptable ranges.
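The forecasting and validation ideas above can be sketched in a few lines. This example assumes a moving-average forecaster and walk-forward validation, which are just two simple choices among many; the series and the error threshold are hypothetical.

```python
def mean_absolute_error(actual, predicted):
    """Average size of the forecast errors, ignoring their sign."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def moving_average_forecast(history, window):
    """Forecast the next value as the mean of the last `window` observations."""
    recent = history[-window:]
    return sum(recent) / len(recent)

# Hypothetical monthly sales, already ordered by time.
series = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 24.0]

# Split chronologically; shuffling would leak future values into training.
split = 6
train, test = series[:split], series[split:]

# Walk-forward validation: forecast one step, then reveal the true value.
history = list(train)
predictions = []
for actual in test:
    predictions.append(moving_average_forecast(history, window=2))
    history.append(actual)

mae = mean_absolute_error(test, predictions)

# Hypothetical acceptable error for this project.
max_acceptable_error = 5.0
within_range = mae <= max_acceptable_error
print(predictions, mae, within_range)
```

Because the series trends steadily upward, the moving average lags behind it, which is exactly the kind of weakness that validation is meant to expose.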
8. Neural Networks
Neural networks (artificial neural networks, or ANNs) are a biologically inspired programming paradigm that enables a computer to learn from observational data. ANNs started as an approach to mimic the human brain's architecture to perform different learning tasks. To resemble the human brain, an ANN contains components analogous to those of a biological neuron.
So, an ANN contains a collection of artificial neurons; each neuron is a node connected to other nodes via links. These links correspond to the biological axon-synapse-dendrite connections. Moreover, each of these links has a weight that determines how strongly one node influences another.
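A single artificial neuron is small enough to write out directly. The sketch below assumes a sigmoid activation, which is one common choice; the inputs, weights and bias are hypothetical numbers picked to keep the arithmetic easy to follow.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through a sigmoid."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(z)

inputs = [1.0, 0.0]
bias = -2.0

# With these weights the weighted sum is exactly zero.
balanced = neuron_output(inputs, [2.0, -3.0], bias)

# A larger weight on the first link makes that input count for more.
stronger = neuron_output(inputs, [4.0, -3.0], bias)

print(balanced, stronger)
```

Training a network amounts to adjusting these weights and biases so that the outputs move closer to the desired answers.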
Learning about ANNs enables you to tackle a wider range of tasks, including handwriting recognition, pattern recognition and face identification.
ANN represents the basic logic you need to know to proceed to the next step in your data science journey, deep learning.
9. Deep Learning
Neural networks are the models that power deep learning. Deep learning is a powerful set of techniques that harness the learning power of neural networks with many layers.
You can use neural networks and deep learning to find effective solutions to many problems in various fields, including image recognition, speech recognition and natural language processing.
By now, you’ll be familiar with many Python packages that deal with different aspects of data science. In this step, you’ll get the chance to try popular packages such as Keras and TensorFlow.
Also, by this step, you’ll be proficient enough to read recent research advances in data science and maybe develop your own algorithms.
10. Natural Language Processing
You’re almost at the end. You can already see the finish line. You have gone through many theoretical and practical concepts so far, from simple math to complex deep learning concepts.
So, what’s next?
It’s my personal favorite sub-field of data science: natural language processing (NLP). Natural language processing is an exciting branch of AI that enables you to use the power of machine learning to teach the computer to understand and process human languages.
This includes speech recognition, text-to-speech and speech-to-text applications, virtual assistants (like Siri), language models (like BERT) and all kinds of conversational bots.
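Before a computer can process language, text has to be turned into numbers. One classic starting point is a bag-of-words count, sketched below with the standard library only. The tokenizer is deliberately naive and the sentence is made up; this illustrates the idea and is not how any particular assistant or model works.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

def bag_of_words(text):
    """Count how often each token appears, ignoring word order."""
    return Counter(tokenize(text))

sentence = "The cat sat on the mat. The mat was warm."
counts = bag_of_words(sentence)

# The two most frequent words and their counts.
top_two = counts.most_common(2)
print(top_two)
```

Counts like these can be fed into the machine learning algorithms from the earlier steps, for example to classify texts by topic.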
Here we are at the end of the road. But every end is really a beginning. Just like any other technology-related field, there’s really no end. The field is developing rapidly; new algorithms and techniques are under research as you read this article.
So, being a data scientist means you’ll be a lifelong learner. You’ll develop your knowledge and style as you go. You’ll probably develop an attraction to a specific sub-field, dig even deeper and maybe even specialize.
You’ll hit roadblocks and detours along the way. Just keep an open mind, be patient and dedicate the time and effort to reaching your destination. The most important thing to remember as you embark on this journey is: you can do it.