For data scientists, a crucial step in conducting research is finding the right algorithm to model the data. Over the last couple of decades, data has become like oil for many technology companies, businesses and organizations. With the availability of modern frameworks and libraries, many algorithms come predefined and ready to use. In practice, though, we encounter different kinds of data over time, so choosing the right model for the right use case is essential. In this article, we’ll quickly review three frequently used modeling techniques: segmentation, correlation and time series analysis.
Segmentation
Segmentation is a type of modeling that is widely used in business, marketing and analytics. Its main goal is to divide the targets on the basis of some significant features. For example, if we were performing segmentation on global geography, we could draw important insights using features like country, city, language, population and climate. In most cases, segmentation is applied to unlabeled data, meaning that only the inputs are given. Based on the relations between them, the inputs are then segmented into different clusters or groups. In machine learning terms, segmentation therefore usually falls under unsupervised learning, where the data is unlabeled. Let’s take a look at a few popular segmentation techniques in machine learning.
The K-means algorithm determines a set of k clusters and assigns each example to a single cluster. The clusters consist of similar examples, where similarity is based on a distance measure between them. Each cluster is determined by the position of its center in the n-dimensional space, called the centroid of the cluster. We decide the number of clusters up front and randomly initialize the centroids. Each example is then assigned to its nearest centroid, the centroids are recomputed from their assigned examples, and these two steps repeat until the assignments stop changing.
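As a quick illustration, here is a minimal sketch using scikit-learn's KMeans; the data points and the choice of k = 2 are hypothetical, chosen only for demonstration:

```python
import numpy as np
from sklearn.cluster import KMeans

# hypothetical two-dimensional examples
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.0, 0.6],
              [8.0, 8.0], [9.0, 11.0], [8.5, 9.5]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)   # nearest-centroid assignment for each example
print(labels)
print(kmeans.cluster_centers_)   # final centroid positions
```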
DBSCAN is a type of density-based clustering, where a density-based cluster is a maximal set of density-connected points. It works on the notion of density reachability. We define epsilon and min-samples before starting the algorithm. A point q is directly density-reachable from a point p if q is no farther away than the distance epsilon and p is surrounded by at least min-samples points (making p a core point). q is called density-reachable from p if there is a sequence p(1),…,p(n) of points with p(1) = p and p(n) = q, where each p(i+1) is directly density-reachable from p(i). Note that density reachability is an asymmetric, or directed, relationship. The algorithm defines any two points x and y to be density-connected if there exists a core point z such that both x and y are density-reachable from z.
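A minimal sketch using scikit-learn's DBSCAN follows; the points and the eps and min_samples values are hypothetical choices for illustration:

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],   # one dense group
              [8.0, 8.1], [8.2, 7.9], [7.9, 8.0],   # another dense group
              [25.0, 25.0]])                        # an isolated point

# eps is the epsilon radius; min_samples is the density threshold for core points
db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)  # cluster label for each point; -1 marks noise
```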
Support vector clustering builds on the machinery of support vector machines: the data points are mapped from data space to a high-dimensional feature space using a Gaussian kernel. In feature space, we search for the smallest sphere that encloses the image of the data. This sphere is then mapped back to data space, where it forms a set of contours that enclose the data points, and these contours are interpreted as cluster boundaries. As the width parameter of the Gaussian kernel is decreased, the number of disconnected contours in data space increases, leading to an increasing number of clusters and finer segmentation.
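scikit-learn does not ship a full support vector clustering implementation, but the boundary-finding step resembles a one-class SVM with an RBF (Gaussian) kernel. The sketch below, on hypothetical data, only illustrates that boundary step; the cluster-labeling step (checking which points are connected inside the contours) is omitted:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# two hypothetical blobs of points
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(3, 0.3, (30, 2))])

# An RBF (Gaussian) kernel one-class SVM learns a contour enclosing the data;
# a larger gamma (narrower kernel) breaks the boundary into more, tighter contours.
boundary = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.05).fit(X)
print(boundary.decision_function(X[:5]))  # positive values lie inside the learned contour
```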
Agglomerative clustering, a bottom-up form of hierarchical clustering, is based on the core idea that objects are more related to nearby objects than to objects farther away. A cluster can be described largely by the maximum distance needed to connect its parts. At different distances, different clusters will form, which can be represented using a dendrogram; this hierarchy of merges is where the common name “hierarchical clustering” comes from.
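A minimal sketch of the idea using SciPy's hierarchical clustering utilities on hypothetical points: linkage builds the merge hierarchy (which a dendrogram would visualize), and fcluster cuts the tree at a chosen level.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# hypothetical points
X = np.array([[1.0, 1.0], [1.2, 1.1], [5.0, 5.0], [5.1, 4.9], [9.0, 9.0]])

Z = linkage(X, method="ward")                    # build the merge hierarchy
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
print(labels)
```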
Correlation
Correlation is one of the most popular modeling techniques used in mathematics and statistics. The goal of correlation analysis is to identify the relationship between two variables. These methods are widely used in market analysis to identify patterns or trends between different attributes based on how they change together.
If correlation exists between two attributes, it means that a systematic change in one variable is accompanied by a systematic change in the other: the variables change together over a certain period of time. The relationship is usually described as positive or negative with respect to a base attribute. A positive correlation exists if one variable increases along with the other; if one decreases as the other increases, the variables are negatively correlated. Now, let’s examine a few correlation techniques that are used in machine learning.
ANOVA, short for analysis of variance, is a collection of statistical models and their associated estimation procedures used to analyze the differences among group means in a sample. There are two types of ANOVA tests for correlation, one-way and two-way, which refer to the number of independent variables in the test. A one-way test has one independent variable with two levels while a two-way test has two independent variables, which can have multiple levels.
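As a rough illustration, SciPy's f_oneway performs a one-way ANOVA; the three sample groups below are hypothetical:

```python
from scipy import stats

# hypothetical samples: conversion rates under three marketing campaigns
group_a = [0.12, 0.15, 0.11, 0.14, 0.13]
group_b = [0.22, 0.25, 0.21, 0.24, 0.23]
group_c = [0.12, 0.16, 0.13, 0.15, 0.14]

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f_stat, p_value)  # a small p-value suggests the group means differ
```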
Correlation matrices use a number between -1 and +1 that measures the degree of association between two attributes, which we will call X and Y. A positive value for the correlation implies a positive association. In this case, large values of X tend to be associated with large values of Y, and small values of X tend to be associated with small values of Y. A negative value for the correlation implies a negative or inverse association. In this case, large values of X tend to be associated with small values of Y and vice versa. This estimation can be used for creating a correlation matrix that shows correlations of all the attributes of the input example set.
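A minimal sketch using pandas, where corr() computes the pairwise Pearson correlations of some hypothetical attributes:

```python
import pandas as pd

# hypothetical attributes
df = pd.DataFrame({
    "ad_spend":   [10, 20, 30, 40, 50],
    "visitors":   [110, 205, 330, 390, 510],
    "bounce_pct": [55, 48, 41, 40, 33],
})
print(df.corr())  # pairwise Pearson correlations, each between -1 and +1
```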
Covariance is a measure of how much two attributes change together. If two attributes change in a similar manner, the covariance between them will be positive. Similarly, if they are inversely proportional in their behavior, the covariance will be negative. If the covariance is zero, there is no linear relationship between the attributes. We can find the covariance of two parameters or features using the formula Cov(x, y) = E{[x - E(x)][y - E(y)]}.
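A short sketch, with hypothetical values, computing the covariance both directly from the formula above and with NumPy (note that np.cov defaults to the sample, n - 1, denominator):

```python
import numpy as np

x = np.array([2.1, 2.5, 3.6, 4.0, 4.8])
y = np.array([8.0, 10.0, 12.5, 14.0, 16.2])

# directly following Cov(x, y) = E{[x - E(x)][y - E(y)]}
manual = np.mean((x - x.mean()) * (y - y.mean()))

# bias=True makes np.cov use the 1/n denominator, matching the formula above
print(manual, np.cov(x, y, bias=True)[0, 1])
```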
Time series
Time series modeling is all about forecasting future values with respect to time. In this analysis, we try to estimate values that have not yet been observed. General machine learning models can also estimate future values, but not with respect to time; in time series, each observation is taken with respect to time.
Time series modeling consists of a collection of data observed and tracked according to some sort of regular time period, whether hourly, daily, weekly, monthly, quarterly or annually. This type of modeling is often used in analysis involving stocks, oil and gas production, web traffic estimation and customer count forecasting, among many others.
There are several patterns in time-series data that we can observe over a period of time, including:
- Trends, which involve an increase or decrease in the observed values over a long period of time.
- Seasonality, which is a pattern that recurs at regular intervals of time.
- And error, which is the variation between present and past observations not explained by trend or seasonality.
A time series is stationary if the data’s statistical properties, such as its mean and variance, are independent of the time at which it was collected. For example, time series that exhibit trends or seasonality are not stationary, because the data will differ based on when it was collected. We can check the stationarity of a time series using several methods; two popular ones are the Augmented Dickey-Fuller test and the Kwiatkowski-Phillips-Schmidt-Shin test. If the tests show that the time series is non-stationary, we can apply transformations to make it stationary and simplify the prediction problem.
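A minimal sketch using statsmodels on a hypothetical trending series; the exact statistics will vary with the random noise:

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller, kpss

rng = np.random.default_rng(0)
# hypothetical series with an upward trend, so it should look non-stationary
series = np.arange(100) + rng.normal(0, 2, 100)

adf_stat, adf_p, *_ = adfuller(series)
kpss_stat, kpss_p, *_ = kpss(series, regression="c", nlags="auto")

print(f"ADF p-value:  {adf_p:.3f}")   # a large p-value fails to reject non-stationarity
print(f"KPSS p-value: {kpss_p:.3f}")  # a small p-value rejects stationarity
```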
Transformations are necessary for achieving stationarity by removing various qualities from the data. For instance, we can power transform data to bring it closer to a Gaussian, or normal, distribution and reduce skewness. In difference transforming, or differencing, we remove systematic or seasonal structure from the time series to make it stationary. Logarithmic, or log, transforming dampens large values in the series, which helps flatten exponential trends and stabilize the variance.
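A short sketch of the log and difference transforms on a hypothetical series, using NumPy and pandas:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# hypothetical monthly series with an upward trend
series = pd.Series(np.arange(1, 37) * 10 + rng.normal(0, 5, 36))

log_series = np.log(series)               # log transform dampens large values
differenced = log_series.diff().dropna()  # first difference removes the trend
print(differenced.head())
```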
Lags are, essentially, delayed copies of a given series. They are central to a phenomenon called autocorrelation: the tendency for the values within a time series to be correlated with its own past values. Autocorrelation allows us to identify patterns within the time series, which helps in determining seasonality.
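A minimal sketch using statsmodels' acf on a hypothetical series with a 12-step seasonal pattern:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import acf

rng = np.random.default_rng(0)
t = np.arange(120)
# hypothetical series with a repeating 12-step (e.g. monthly) pattern
series = pd.Series(10 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 1, 120))

autocorr = acf(series, nlags=24)
print(autocorr[12])             # a high value at lag 12 hints at yearly seasonality
print(series.autocorr(lag=12))  # pandas equivalent for a single lag
```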
Another useful method for removing a trend from time series observations is exponential smoothing. This technique assigns exponentially decreasing weights to older observations, giving more importance to recent ones; the resulting weighted average smooths the data.
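A minimal sketch using statsmodels' SimpleExpSmoothing on a hypothetical noisy series; the smoothing level is an arbitrary choice for illustration:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import SimpleExpSmoothing

rng = np.random.default_rng(0)
series = pd.Series(rng.normal(50, 5, 60))  # hypothetical noisy series

# smoothing_level controls how quickly the weights on older observations decay
model = SimpleExpSmoothing(series).fit(smoothing_level=0.3, optimized=False)
print(model.fittedvalues.head())  # smoothed version of the series
print(model.forecast(5))          # flat forecast of the last smoothed level
```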
The Fourier transform is a method for expressing a function as a sum of periodic components and for recovering the signal from those components. When both the function and its Fourier transform are replaced with discretized counterparts, it is called the discrete Fourier transform (DFT). In time series work, it helps identify and remove periodic components, smoothing the data toward stationarity.
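A short sketch using NumPy's FFT routines to recover the dominant periodic component of a hypothetical signal:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256
t = np.arange(n)
# hypothetical signal: a 4-cycle periodic component plus noise
signal = np.sin(2 * np.pi * 4 * t / n) + rng.normal(0, 0.3, n)

spectrum = np.fft.rfft(signal)      # discrete Fourier transform for real input
freqs = np.fft.rfftfreq(n, d=1.0)   # matching frequency bins (cycles per step)
print(freqs[np.argmax(np.abs(spectrum[1:])) + 1])  # dominant frequency, ~4/256
```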
STL decomposition is used to analyze the trend, seasonality and residuals of the observed time samples. It also reveals any cyclic seasonality present in the observations, which can then be carried forward into the forecasted values.
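A minimal sketch using statsmodels' STL on a hypothetical monthly series with a trend and yearly seasonality:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL

rng = np.random.default_rng(0)
t = np.arange(120)
# hypothetical monthly series: linear trend + yearly seasonality + noise
values = 0.5 * t + 10 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 1, 120)
series = pd.Series(values, index=pd.date_range("2015-01-01", periods=120, freq="MS"))

result = STL(series, period=12).fit()
print(result.trend.head())     # long-run movement
print(result.seasonal.head())  # repeating yearly pattern
print(result.resid.head())     # leftover error component
```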
Finally, auto-regressive integrated moving average, or ARIMA, is one of the most widely applied models for time series observations and is also known as the Box-Jenkins method. ARIMA uses past data from a time series, including lags, to create a predictive equation.
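A minimal sketch using statsmodels' ARIMA on a hypothetical trending series; the (1, 1, 1) order is an arbitrary choice for illustration rather than a tuned model:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
series = pd.Series(np.arange(100) + rng.normal(0, 2, 100))  # hypothetical trending series

# order=(p, d, q): 1 autoregressive lag, 1 difference, 1 moving-average term
model = ARIMA(series, order=(1, 1, 1)).fit()
print(model.forecast(steps=5))  # forecast the next five values
```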