Data types are an important concept in statistics. Understanding them allows you to accurately apply statistical measurements to your data and make valid assumptions about it. This blog post will introduce you to the different data types you need to know in order to do proper exploratory data analysis (EDA), since you can only use certain statistical measurements for specific data types. Proper data analysis is one of the most underestimated parts of a machine learning project.
What Are the Main Types of Data in Statistics?
- Nominal data
- Ordinal data
- Discrete data
- Continuous data
It’s also helpful to know which data type you are dealing with to choose the right visualization method. Think of data types as a way to categorize different types of variables.
We will discuss the main types of variables and look at an example for each. We will sometimes refer to them as measurement scales.
Categorical or Qualitative Data Types
Categorical data represents characteristics. It can represent things like a person’s gender, language, etc.. Categorical data can also take on numerical values (Example: 1 for female and 0 for male). Note that those numbers don’t have mathematical meaning.
Nominal Data
Nominal values represent discrete units and are used to label variables that have no quantitative value. Just think of them as “labels.” Note that nominal data that has no order. Therefore, if you would change the order of its values, the meaning would not change. You can see two examples of nominal features below:
The left feature that describes a person’s gender would be called “dichotomous,” which is a type of nominal scales that contains only two categories.
Examples of nominal data:
- Name
- Gender
- Nationality
- Eye color
- Zip code
Ordinal Data
Ordinal values represent discrete and ordered units. It is nearly the same as nominal data, except that its ordering matters. You can see an example below:
Note that the difference between Elementary and High School is different from the difference between High School and College. This is the main limitation of ordinal data, the differences between the values is not readily known. Because of that, ordinal scales are usually used to measure non-numeric features like happiness, customer satisfaction and so on.
Examples of ordinal data:
- Education level
- Socioeconomic status
- Customer satisfaction rating
- Letter grade on an exam
- Likert scale response
Numerical or Quantitative Data Types
Discrete Data
Discrete data involves values that are distinct and separate. In other words: We speak of discrete data if the data can only take on certain values. This type of data can’t be measured but it can be counted. It basically represents information that can be categorized into a classification. An example is the number of heads in 100 coin flips.
You can check whether or not you are dealing with discrete data by asking the following two questions: Can you count it and can it be divided up into smaller and smaller parts? If the data is countable, finite and exists only as fixed values, then it is discrete.
Examples of discrete data:
- Number of students in a class
- Number of employees in a company
- Number of products sold from a store
- Number of clicks on a website
Continuous Data
Continuous data represents measurements. Its values can’t be counted but they can be measured. An example would be the height of a person, which you can describe by using intervals on the real number line.
Interval Data
Interval values represent ordered units that have the same difference. Therefore, we speak of interval data when we have a variable that contains numeric values that are ordered and where we know the exact differences between the values. An example would be a feature that contains temperature of a given place, as seen below:
The problem with interval data is that the values don’t have a “true zero.” That means, in regards to our example, that there is no such thing as no temperature. With interval data, we can add and subtract, but we cannot multiply, divide or calculate ratios. Because there is no true zero, a lot of descriptive and inferential statistics can’t be applied.
Ratio Data
Ratio data involves ordered units that have the same difference. Ratio values are the same as interval values, except they do have an absolute zero. Good examples are height, weight, length, etc.
Examples of continuous data:
- Height
- Weight
- Distance
- Temperature
- Time
Why Are Data Types Important in Statistics?
Data types are an important concept because statistical methods can only be used with certain data types. You have to analyze continuous data differently than categorical data, otherwise it would result in a wrong analysis. Therefore, knowing the types of data you are dealing with enables you to choose the correct method of analysis.
We will now go over every data type again, but this time we will look at what statistical methods can be applied. To understand properly what we will now discuss, you have to understand the basics of descriptive statistics. If you don’t know them, you can read my blog post about it.
Statistical Methods for Nominal, Ordinal and Continuous Data Types
Summarizing Nominal Data
When you are dealing with nominal data, you collect information through:
- Frequencies: The frequency is the rate at which something occurs over a period of time or within a data set.
- Proportion: You can easily calculate the proportion by dividing the frequency by the total number of events (e.g. how often something happened divided by how often it could happen).
- Visualization Methods: To visualize nominal data, you can use a pie chart or a bar chart.
In data science, you can use one hot encoding to transform nominal data into a numeric feature.
Summarizing Ordinal Data
When you are dealing with ordinal data, you can use the same methods as with nominal data, but you also need access to some additional tools. You can summarize your ordinal data with frequencies, proportions and percentages. You can also visualize it with pie and bar charts. Additionally, you can use percentiles, median, mode and the interquartile range to summarize your data.
In data science, you can use one label encoding to transform ordinal data into a numeric feature.
Summarizing Continuous Data
When you are dealing with continuous data, you can use most methods to describe your data. You can summarize your data using percentiles, median, mean, mode, interquartile range, standard deviation, and range.
To visualize continuous data, you can use a histogram or a boxplot. With a histogram, you can check the central tendency, variability, modality, and kurtosis of a distribution.
Summary
In this post, you discovered the different data types that are used throughout statistics. You learned the difference between discrete and continuous data, and learned what nominal, ordinal, interval and ratio measurement scales are. Furthermore, you now know what statistical measurements and visualization methods you can use with each data type. You also learned what methods can be used to transform categorical variables into numeric variables. This enables you to create a big part of an exploratory analysis on a given data set.