Data science can be an overwhelming field for newcomers. The term “data science” itself can be confusing because it’s an umbrella term that covers many subfields: machine learning, artificial intelligence, natural language processing, data mining and more.
Within each of these subfields lies a plethora of terminology and industry jargon that overwhelms newcomers and can discourage them from pursuing a career in data science.
When I first joined the field, I had to juggle learning the techniques and keeping up with research and advancements, all while trying to understand the lingo. Here are 10 foundational terms every data scientist needs to know to build and develop any data science project.
10 Data Science Terms You Need to Know
One of the most important terms in data science you’ll hear quite often is “model”: model training, improving model efficiency, model behavior, etc. But what is a model?
Mathematically speaking, a model is a specification of some probabilistic relationship between different variables. In layperson’s terms, a model is a way of describing how two variables behave together.
Since the term “modeling” can be vague, “statistical modeling” is often used to describe modeling done by data scientists specifically.
Another way to describe a model is by how well it fits the data you apply it to.
Overfitting happens when your model learns the training data too closely, noise included. You end up with an overly complex model that performs well on the data it was trained on but generalizes poorly to new data.
Underfitting (the opposite of overfitting) happens when the model is too simple to capture the patterns in the data. In either case, you end up with a poorly fitted model.
One of the skills you will need to learn as a data scientist is how to find the middle ground between overfitting and underfitting.
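The trade-off above can be made concrete with a small experiment. The sketch below, assuming NumPy is available and using made-up synthetic data, fits polynomials of different degrees to a noisy quadratic: a degree-0 model underfits, degree 2 is about right, and degree 15 overfits, chasing the noise in the training points and failing badly on held-out data.

```python
import numpy as np

# Synthetic data: a noisy quadratic relationship (illustrative only).
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = 2 * x**2 + rng.normal(0, 0.1, size=x.shape)

# Hold out the last third of the points to measure generalization.
x_train, y_train = x[:20], y[:20]
x_test, y_test = x[20:], y[20:]

def fit_error(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    return train_mse, test_mse

# Degree 0 underfits, degree 2 matches the true model, degree 15 overfits.
for degree in (0, 2, 15):
    train_mse, test_mse = fit_error(degree)
    print(f"degree={degree:2d}  train MSE={train_mse:.4f}  test MSE={test_mse:.4f}")
```

The overfit model drives its training error toward zero while its test error explodes; the underfit model is mediocre on both. The middle ground is the model whose test error is lowest.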
Cross-validation is a way to evaluate how well a model generalizes to data it wasn’t trained on. This is a big concern for data scientists because a model will often score well on its training data yet perform poorly when applied to real-life data.
There are different ways to apply cross-validation to a model; the three main strategies are:
The holdout method — training data is divided into two sections, one to build the model and one to test it.
The k-fold validation — an improvement on the holdout method. Instead of dividing the data into two sections, you divide it into k sections, train on k - 1 of them and test on the remaining one, rotating through all k folds for a more reliable estimate.
The leave-one-out cross-validation — the extreme case of k-fold validation, where k equals the number of data points in the data set, so each point takes a turn as the test set.
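The three strategies above all rest on the same splitting idea. Here is a minimal pure-Python sketch of k-fold splitting; the “model” is a deliberately trivial stand-in (it predicts the mean of its training targets), and all names and numbers are illustrative, not a real library API.

```python
def k_fold_splits(n, k):
    """Yield (train_indices, test_indices) pairs for k folds over n points."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]  # made-up target values

# Evaluate the mean-predictor on each held-out fold, then average the error.
errors = []
for train, test in k_fold_splits(len(ys), k=3):
    prediction = sum(ys[i] for i in train) / len(train)  # "train" the model
    errors.extend((ys[i] - prediction) ** 2 for i in test)  # score the fold
cv_mse = sum(errors) / len(errors)
print(f"3-fold CV mean squared error: {cv_mse:.2f}")
```

Setting k=2 with no rotation recovers the holdout method, and setting k=len(ys) gives leave-one-out: the same machinery covers all three strategies.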
Regression is a machine learning term for the simplest, most basic supervised machine learning approach. In regression problems, you have a target value (also called the criterion variable) and one or more other values, known as the predictors.
For example, we can look at the job market. How easy or difficult it is to get a job (the criterion variable) depends on the demand for the position and the supply of candidates (the predictors).
There are different types of regression to match different applications; the easiest ones are linear and logistic regressions.
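Linear regression with a single predictor can be written in a few lines. This is a minimal pure-Python sketch of ordinary least squares; the data points are made up for illustration, chosen to lie roughly on the line y = 2x.

```python
def linear_regression(xs, ys):
    """Return the slope and intercept minimizing the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1, 2, 3, 4, 5]          # predictor
ys = [2.1, 4.0, 6.2, 7.9, 10.1]  # criterion variable, roughly y = 2x

slope, intercept = linear_regression(xs, ys)
print(f"y = {slope:.2f}x + {intercept:.2f}")
```

The fitted slope and intercept are the model’s parameters, which connects directly to the next term.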
Parameter can be confusing because it has slightly different meanings depending on the scope in which you use it. In statistics, a parameter describes a property of a probability distribution (e.g., its shape or scale). In data science and machine learning, parameters are the internal values a model learns from the training data, such as the coefficients of a regression.
In machine learning, there are two types of models: parametric and nonparametric models.
Parametric models have a fixed number of parameters, unaffected by the amount of training data. Linear regression is considered a parametric model.
Nonparametric models don’t have a fixed number of parameters, so the model’s complexity grows with the amount of training data. The most well-known example of a nonparametric model is the k-nearest neighbors (KNN) algorithm.
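KNN makes the parametric/nonparametric contrast tangible: instead of learning a fixed set of coefficients, it keeps the entire training set around and consults it at prediction time. A minimal pure-Python sketch, with made-up points in two well-separated clusters:

```python
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Label the query point by majority vote among its k nearest neighbors."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(p, query)), label)
        for p, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Two illustrative clusters: "a" near the origin, "b" near (8, 8).
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(points, labels, (2, 2)))
```

Notice that the “model” here is just the stored data itself: doubling the training set doubles what the model has to remember, which is exactly what makes it nonparametric.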
In data science, we use bias to refer to a systematic error in the data. Bias often occurs as a result of sampling and estimation: when we choose data to analyze, we usually sample from a larger data pool, and the sample we select could be biased, meaning it misrepresents that pool.
Since the model we’re training only knows the data we give it, the model will learn only what it can see. That’s why data scientists need to be careful to create unbiased models.
In general, we use correlation to refer to the degree to which two or more events occur together. For example, if depression cases increase in cold-weather areas, there might be some correlation between cold weather and depression.
Events correlate to different degrees, and we measure that degree with the correlation coefficient. For example, following a recipe and producing a delicious dish are likely more strongly correlated than cold weather and depression.
When the correlation coefficient is 1, the two events in question are perfectly correlated, whereas if it is, say, 0.2, the events are only weakly correlated. The coefficient can also be negative, in which case there is an inverse relationship between the two events. For example, eating a well-balanced diet decreases your chances of becoming obese: there’s an inverse relationship between a healthy diet and obesity.
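The coefficient described above is straightforward to compute. Here is a minimal pure-Python sketch of the Pearson correlation coefficient, applied to made-up paired values that echo the diet example: as one variable rises, the other falls, so the coefficient comes out strongly negative.

```python
def correlation(xs, ys):
    """Return the Pearson correlation coefficient between two variables."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Illustrative numbers only: better diet scores pair with lower risk scores.
diet_quality = [1, 2, 3, 4, 5]
obesity_risk = [9, 7, 6, 4, 2]
print(f"r = {correlation(diet_quality, obesity_risk):.2f}")
```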
Finally, you must always remember the axiom of all data scientists: correlation doesn’t equal causation.
A hypothesis, in general, is an explanation for some event. Often, hypotheses are made based on previous data and observations. A valid hypothesis is one you can test with results, either true or false.
In statistics, a hypothesis must be falsifiable. In other words, we should be able to test any hypothesis to determine whether it holds. In machine learning, the term hypothesis refers to a candidate model that maps the inputs to correct outputs.
Outlier is a term used in data science and statistics to refer to an observation that lies an unusual distance from the other values in the data set. One of the first things a data scientist should do with a new data set is decide what counts as a usual distance and what counts as unusual.
An outlier can represent different things in the data: it could be noise that crept in during data collection, or it could signal a rare event or a unique pattern. That’s why outliers shouldn’t be deleted right away. Instead, make sure to always investigate your outliers like the good data scientist you are.
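One common way to draw that usual/unusual line is the interquartile-range (IQR) rule: flag any value more than 1.5 IQRs outside the middle half of the data. A minimal sketch using only the Python standard library, on a made-up data set with one suspicious reading:

```python
import statistics

def iqr_outliers(values):
    """Return values lying more than 1.5 * IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

data = [10, 12, 11, 13, 12, 11, 10, 13, 98]  # 98 looks suspicious
print(iqr_outliers(data))
```

The 1.5 multiplier is a convention, not a law; the point is that the rule only flags candidates, and deciding whether 98 is a sensor glitch or a genuinely rare event is still your job.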