Data science can be an overwhelming field for newcomers. The term itself is confusing because it’s an umbrella term that covers many subfields, including machine learning and data mining.
While the terminology and jargon can feel daunting, don’t let this discourage you. Here are 10 foundational data science terms every professional needs to know to build any data science project.
10 Data Science Terms You Need to Know
- Model
- Overfitting
- Underfitting
- Cross-Validation
- Regression
- Parameter
- Bias
- Correlation
- Hypothesis
- Outlier
1. Model
One of the most important terms in data science you’ll hear quite often is “model”: model training, improving model efficiency, model behavior, etc. But what is a model?
Mathematically speaking, a model is a specification of some probabilistic relationship between different variables. In simpler terms, a model is a mathematical representation of how variables relate to one another.
Since the term “modeling” can be vague, “statistical modeling” is often used to describe modeling done by data scientists specifically.
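To make that concrete, here is a minimal sketch in Python (the data points are made up for illustration): the model is simply a line, and fitting it means finding the slope and intercept that best describe the data.

```python
import numpy as np

# Toy data: a roughly linear relationship between x and y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# The "model" here is the line y = slope * x + intercept;
# fitting chooses the slope and intercept that best describe the data
slope, intercept = np.polyfit(x, y, deg=1)
print(f"Model: y = {slope:.2f} * x + {intercept:.2f}")

# The model can now describe points it has never seen
print("Prediction for x = 6:", round(slope * 6 + intercept, 2))
```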
2. Overfitting
Overfitting occurs when a machine learning model learns the training data, including its noise and random fluctuations, too well. This results in a complex model that performs excellently on the data it was trained on but poorly on new, unseen data.
3. Underfitting
Underfitting occurs when a model is too simple to capture the underlying pattern of the data. This means the model hasn’t learned enough from the training data, so it performs poorly on both the training data and on new, unseen data. It’s the opposite of overfitting, where a model is too complex. An underfit model has high bias and low variance.
One of the skills you will need to learn as a data scientist is how to find the middle ground between overfitting and underfitting.
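One way to see both failure modes side by side is to fit polynomials of different degrees to the same noisy data and compare the error on the training points with the error on held-out points. A minimal sketch, with data and degrees chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples from an underlying quadratic pattern
x = np.linspace(0, 1, 30)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.1, size=x.size)

# Randomly hold out 10 points as "new, unseen" data
idx = rng.permutation(x.size)
train, test = idx[:20], idx[20:]

for degree in (1, 2, 12):  # too simple, about right, too complex
    coeffs = np.polyfit(x[train], y[train], deg=degree)
    train_mse = np.mean((np.polyval(coeffs, x[train]) - y[train]) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x[test]) - y[test]) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```

The degree-12 fit typically scores best on the training points and worst on the held-out ones (overfitting), while the degree-1 fit scores poorly on both (underfitting).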
4. Cross-Validation
Cross-validation is a way to evaluate how well a model generalizes to data it wasn’t trained on. This is a big concern for data scientists because a model will often produce good results on the training data but perform much worse when applied to real-life data.
There are different ways to apply cross-validation to a model; the three main strategies are:
- The holdout method — training data is divided into two sections, one to build the model and one to test it.
- The k-fold validation — an improvement on the holdout method. Instead of dividing the data into two sections, you divide it into k sections to get a more reliable estimate of performance (see the sketch after this list).
- The leave-one-out cross-validation — the extreme case of k-fold validation. Here, k equals the number of data points in the data set you’re using.
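As a minimal sketch of k-fold cross-validation, assuming scikit-learn is available (the diabetes data set and linear regression are just convenient stand-ins):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_diabetes(return_X_y=True)

# 5-fold cross-validation: train on four folds, validate on the fifth, rotate
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv)

print("R^2 score per fold:", scores.round(3))
print("Mean R^2 score:", scores.mean().round(3))
```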
5. Regression
Regression is a machine learning term; it’s one of the simplest, most basic supervised machine learning approaches. In regression problems, you have a target value (also called the criterion variable) and one or more other values, known as the predictors.
For example, we can look at the job market. How easy or difficult it is to get a job (criterion variable) depends on the demand for the position and the supply of candidates for it (predictors).
There are different types of regression to match different applications; the easiest ones are linear and logistic regressions.
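Continuing the job-market example with made-up numbers, here is a minimal linear regression sketch using scikit-learn:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical job-market numbers, purely for illustration:
# difficulty of getting a job (criterion) vs. demand and supply (predictors)
demand = np.array([10, 20, 30, 40, 50])        # open positions
supply = np.array([50, 42, 30, 22, 12])        # competing candidates
difficulty = np.array([8.0, 6.5, 5.0, 3.0, 1.5])

X = np.column_stack([demand, supply])
model = LinearRegression().fit(X, difficulty)

print("Coefficients (demand, supply):", model.coef_.round(3))
print("Predicted difficulty for demand=25, supply=30:",
      model.predict([[25, 30]]).round(2))
```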
6. Parameter
Parameter can be confusing because it has slightly different meanings based on the scope in which you’re using it. In statistics, a parameter describes a property of a probability distribution (e.g., its shape or scale). In data science or machine learning, parameters are the internal values a model learns from the training data, such as the slope and intercept of a linear regression.
In machine learning, there are two types of models: parametric and nonparametric models.
- Parametric models have a fixed number of parameters, regardless of how much training data there is. Linear regression is considered a parametric model.
- Nonparametric models don’t have a fixed number of parameters, so the model’s complexity grows with the amount of training data. The most well-known example of a nonparametric model is the k-nearest neighbors (KNN) algorithm. Both are sketched briefly below.
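A small sketch of the contrast, using scikit-learn’s linear regression and KNN regressor on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(100, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=100)

# Parametric: no matter how many rows X has, linear regression
# learns exactly two numbers here (one slope and one intercept)
linear = LinearRegression().fit(X, y)
print("Linear parameters:", linear.coef_.round(3), round(linear.intercept_, 3))

# Nonparametric: KNN keeps the training data itself, so its effective
# complexity grows with the amount of data it is given
knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)
print("KNN prediction at x = 3:", knn.predict([[3.0]]).round(2))
```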
7. Bias
In machine learning, bias refers to the error introduced by a model that is too simplistic to capture the underlying patterns in the data. It represents the systematic difference between a model’s predicted values and the actual values. Bias can also exist in the data itself, as in sampling bias.
When we choose some data to analyze, we often sample from a large data pool. The sample we select could be biased; that is, it could be an inaccurate representation of the pool.
Since the model we’re training only knows the data we give it, the model will learn only what it can see. That’s why data scientists need to be careful to create unbiased models.
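Here is a tiny illustration of sampling bias with made-up income figures: a sample drawn only from part of the pool paints a distorted picture of it.

```python
import numpy as np

rng = np.random.default_rng(2)

# A "pool" of 100,000 hypothetical incomes (skewed, as incomes tend to be)
pool = rng.lognormal(mean=10, sigma=0.5, size=100_000)

# Unbiased sample: drawn uniformly at random from the whole pool
fair_sample = rng.choice(pool, size=500)

# Biased sample: drawn only from the top half of earners
biased_sample = rng.choice(np.sort(pool)[50_000:], size=500)

print("Pool mean:         ", round(pool.mean()))
print("Fair sample mean:  ", round(fair_sample.mean()))
print("Biased sample mean:", round(biased_sample.mean()))
```

A model trained on the biased sample would systematically overestimate incomes, because it only ever saw the top half of the pool.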
8. Correlation
In general, we use correlation to refer to the degree to which two or more events or variables move together. For example, if depression cases increase in cold-weather areas, there might be some correlation between cold weather and depression.
Events can correlate to different degrees. A common example is the positive correlation between ice cream sales and shark attacks; as one increases, so does the other, even though one does not cause the other.
When the correlation coefficient is one, the two events in question are perfectly correlated, whereas if it is, let’s say, 0.2, then the events are only weakly correlated. The coefficient can also be negative, in which case there is an inverse relationship between the two events. For example, if you eat well, your chances of becoming obese decrease: there’s an inverse relationship between eating a well-balanced diet and obesity.
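To compute a correlation coefficient yourself, NumPy’s corrcoef is enough. A quick sketch using made-up numbers for the ice cream example above:

```python
import numpy as np

# Made-up monthly figures for the ice cream / shark attack example
ice_cream_sales = np.array([20, 35, 50, 80, 95, 60, 30])
shark_attacks = np.array([1, 2, 3, 6, 7, 4, 2])

# Pearson correlation coefficient: +1 perfect positive, -1 perfect
# negative, around 0 means little to no linear relationship
r = np.corrcoef(ice_cream_sales, shark_attacks)[0, 1]
print(f"Correlation coefficient: {r:.2f}")  # close to 1, but not causation
```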
Finally, you must always remember the axiom of all data scientists: correlation doesn’t equal causation.
9. Hypothesis
A hypothesis, in general, is a proposed explanation for some event. Often, hypotheses are made based on previous data and observations. A valid hypothesis is one you can test, with results that either support or refute it.
In statistics, a hypothesis must be falsifiable. In other words, we should be able to test any hypothesis to determine whether it’s valid or not.
In machine learning, a hypothesis is a specific function that a learning algorithm chooses from a set of possible functions, known as the hypothesis space. This function is the algorithm’s best attempt to map the input data to the correct output.
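As a small illustration of testing a statistical hypothesis, here is a sketch of a two-sample t-test with SciPy, using simulated data for a hypothetical A/B experiment:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Hypothetical A/B data: does a new page layout increase time on page?
old_layout = rng.normal(loc=60, scale=10, size=50)  # seconds on page
new_layout = rng.normal(loc=65, scale=10, size=50)

# A two-sample t-test tries to falsify the null hypothesis of "no difference"
t_stat, p_value = stats.ttest_ind(new_layout, old_layout)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p-value (commonly below 0.05) is evidence against the null
```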
10. Outlier
Outlier is a term used in data science and statistics to refer to an observation that lies an unusual distance from the other values in a data set. One of the first things a data scientist should do when given a data set is to decide what counts as a usual distance and what counts as unusual.
Outliers can result from data entry errors or measurement mistakes, or they can represent genuinely rare events. Because they can be either problematic noise or valuable insights, a data scientist must always investigate outliers before deciding how to handle them.
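One simple, widely used way to flag candidate outliers is the 1.5 × IQR rule. A minimal sketch with made-up readings:

```python
import numpy as np

# One of these daily readings looks suspicious
data = np.array([12, 13, 12, 14, 15, 13, 12, 96, 14, 13])

# The 1.5 * IQR rule: flag points far outside the interquartile range
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]
print("Usual range:", lower, "to", upper)
print("Outliers:", outliers)  # flags the 96
```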
Frequently Asked Questions
What is data science?
Data science is the discipline of extracting insights and knowledge from structured and unstructured data using methods from statistics, computer science and mathematics. It combines data analysis, machine learning and domain expertise to inform decisions and solve complex problems.
What is the difference between overfitting and underfitting?
- Overfitting: Occurs when a machine learning model learns the training data, including its noise, too well, becoming too complex to generalize to new data.
- Underfitting: Occurs when a machine learning model is too simple to capture the underlying patterns in the data; it is the opposite of overfitting.
