Explaining 4 Important Data Processing Terms

In order to build a working data model, you'll need to understand all the basics of data access, blending, cleansing and validation.
Vihar Kurama
Vihar Kurama
Expert Columnist
March 13, 2020
Updated: July 13, 2021

Data science terminology can confound even the most technical professionals at tech companies. Some people may even wonder what data science really means.

At its core, data science seeks to answer the what and the why questions. This article introduces the main phases of data processing in data science and the techniques used in each.

Below is a quick look at all the terms and techniques that I’ll be reviewing in this article:

Data Access

  • Files
  • Databases
  • Applications
  • Cloud Storage

Blending

  • Changing Attribute Types
  • Renaming Columns
  • Filtering
  • Sorting
  • Merging, Trimming, Replacing, Cutting, Adjusting, Splitting

Cleansing

  • Normalization
  • Binning
  • Missing Data
  • Outliers
  • Dimensionality Reduction
  • Quality Assertion

Validation

  • Cross-Validation
  • Split Validation
  • Bootstrap Validation

 

Data Access

Data access is the first step in any data science project. It refers to the data scientist’s ability to read, write or retrieve data within a database or a remote repository. Someone with data access can store, retrieve, move or manipulate stored data across different sources. A few examples include fetching and working with data from online application programming interfaces (APIs), cloning databases directly from websites, and so on. Here are a few crucial places where data scientists gain data access.

  • Files: A file is simply a computer resource built to record discrete data, usually text or numbers. The two main operations used to access files are read and write; these operations are also known as access modes. The read operation lets us see and copy the data, whereas the write operation lets us edit or manipulate the file’s contents. In data science there are numerous file formats, but the most frequently used are CSV (comma-separated values), TSV (tab-separated values), Excel and XML (Extensible Markup Language); files are also often fetched from URLs (Uniform Resource Locators). These files are loaded using libraries like NumPy and pandas, based on the access modes available.
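
As a minimal sketch, here is how the read and write access modes look with Python’s built-in csv module (the data below is made up, and in practice the CSV would live in a file on disk rather than in memory):

```python
import csv
import io

# A small CSV held in memory; in practice this would be a file on disk.
raw = "name,score\nAda,91\nGrace,88\n"

# Read mode: parse the CSV into a list of dictionaries.
rows = list(csv.DictReader(io.StringIO(raw)))

# Write mode: serialize the rows back out.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["name", "score"])
writer.writeheader()
writer.writerows(rows)
```

With pandas, the equivalent read would be a single `pd.read_csv(...)` call.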

  • Databases: A database is a widely used term across various sectors, and databases are central to developing web applications. In short, a database is an organized collection of data that is stored and accessed in the form of tables. The three main data access operations used for databases are read, write and update. To fetch and work with data from various databases, a data scientist must be able to manage database connections and run different queries against them. PostgreSQL, MySQL, MongoDB, etc. are some of the most frequently used databases; relational databases like the first two are queried with SQL (Structured Query Language).
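
A small sketch of the read, write and update operations using Python’s built-in sqlite3 module (the table and rows are hypothetical; a production project would connect to a server-based database instead):

```python
import sqlite3

# In-memory SQLite database; a real project would connect to a server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

# Write: insert rows.
conn.executemany("INSERT INTO users (name) VALUES (?)", [("Ada",), ("Grace",)])

# Update: modify an existing row.
conn.execute("UPDATE users SET name = ? WHERE id = ?", ("Ada Lovelace", 1))

# Read: fetch the data back.
names = [row[0] for row in conn.execute("SELECT name FROM users ORDER BY id")]
conn.close()
```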

  • Applications: Many applications handle and receive enormous amounts of data every day. These applications come in many different types, and the data fetched from them is highly variable. A few examples include tools like Salesforce and HubSpot and social networks like Twitter and Facebook. We can fetch the data directly using APIs, requesting it with an HTTP (HyperText Transfer Protocol) method like GET or PUT. The operations here are the same: they allow you to read, write, update and delete the data.
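
As an illustration, here is how a JSON payload from a hypothetical API might be parsed after a GET request (the response body below is invented; in practice a library such as requests would fetch it over HTTP):

```python
import json

# A response body as an API might return it (hypothetical payload).
response_body = '{"data": [{"id": 1, "text": "hello"}, {"id": 2, "text": "world"}]}'

# Parse the JSON and pull out the fields we care about.
payload = json.loads(response_body)
texts = [item["text"] for item in payload["data"]]
```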

  • Cloud Storage: Cloud storage is a service model that allows you to store, maintain and manage data, with backup options that let users access the data remotely over a network. The cloud is a great place to build and save your data science projects, especially when dealing with large, complex data sets, since computations and operations can run directly where the data lives. Popular cloud storage services include Amazon S3 (part of Amazon Web Services), Google Cloud Storage and Dropbox, which are among the cheaper, faster and more reliable options on the market.

This section covered where and how data is stored and managed. But how do we arrange the data once it’s accessed? Let’s move on to our next phase: blending.

 

Blending 

As the word suggests, data blending is the process of combining data from multiple sources into a single, properly functioning data set. This makes our data less redundant and more convenient to work with. By using blending techniques, we can greatly simplify the analysis process.

The three main goals of data blending are:

  1. Provide more intelligent solutions by retrieving data from multiple sources.
  2. Cut down the amount of time it takes for data scientists to perform analytics.
  3. Employ better decision-making processes across the company.

There are several techniques necessary to accomplish blending. Here they are, one by one:

  • Changing Attribute Types: Changing an attribute’s type in the data set helps us reshape the data into usable, functional information. It can also make computation faster and more precise. For example, if we would like our data rounded to the nearest whole number, changing the type from float to integer is a quick and easy way to accomplish this. Here are a few frequently used type conversions: float to integer, real to numerical, numerical to date, and text to nominal.
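
A few of these conversions can be sketched in plain Python (the record and its field names are hypothetical):

```python
from datetime import datetime

record = {"price": "19.99", "sold_on": "2020-03-13", "quantity": 3.0}

# Text to numerical: parse the string price into a float.
record["price"] = float(record["price"])

# Text to date: parse the ISO date string into a datetime.
record["sold_on"] = datetime.strptime(record["sold_on"], "%Y-%m-%d")

# Float to integer: round to the nearest whole number.
record["quantity"] = round(record["quantity"])
```

In pandas, the same idea is a `Series.astype(...)` or `pd.to_datetime(...)` call per column.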

  • Renaming Columns: Data can arrive with different naming conventions for different features. We can use this technique to rename a set of attributes by replacing parts of the attribute names with a specified replacement. For example, a column named “init_lat” could be renamed to the clearer “initial_latitude,” and “init_lon” to “initial_longitude.”
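
A minimal sketch of renaming attributes in plain Python (the column names are hypothetical; pandas provides `DataFrame.rename` for the same job):

```python
rows = [
    {"init_lat": 40.7, "init_lon": -74.0},
    {"init_lat": 34.1, "init_lon": -118.2},
]

# Map old attribute names to clearer replacements.
renames = {"init_lat": "initial_latitude", "init_lon": "initial_longitude"}

# Rebuild each row, swapping in the new names where a mapping exists.
renamed = [{renames.get(k, k): v for k, v in row.items()} for row in rows]
```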

  • Filtering: Data filtering is the process of working with smaller chunks, or subsets, of a larger data set. This is often a temporary technique used for testing purposes when computing power is limited. Once the implementation works on the filtered data, the same scripts and logic are applied to the original data set.

  • Sorting: Sorting is a simple blending technique that arranges data in order, either ascending or descending. The complete data set is then sorted based on a single attribute. This provides on-the-fly statistics about the minimum, maximum and most or least frequent values present in the data set.
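
Filtering and sorting together might look like this in plain Python (the data set is invented):

```python
data = [
    {"city": "Austin", "temp": 31},
    {"city": "Boston", "temp": 24},
    {"city": "Chicago", "temp": 27},
]

# Filtering: keep only the subset of rows above 25 degrees.
warm = [row for row in data if row["temp"] > 25]

# Sorting: order the subset by temperature, descending.
warm_sorted = sorted(warm, key=lambda row: row["temp"], reverse=True)

hottest = warm_sorted[0]["city"]
```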

  • Merging, Trimming, Replacing, Cutting, Adjusting, and Splitting: Each of these operations does exactly what its name suggests.

    • Merging: This operation simply combines two nominal values of the specified regular attributes.

    • Trimming: Trimming strips the leading and trailing whitespace from the given data.

    • Replacing: You can quickly replace a particular value. For example, you can replace all NaN (Not a Number) values with numeric ones.

    • Cutting: Cutting allows us to retrieve a substring, which is part of the data set values.

    • Adjusting: Adjusts the data in the specified attribute by adding or subtracting the specified value.

    • Splitting: Splitting refers to formulating new attributes from the given set of nominal attributes by specifying a split mode. As an example, consider a case where we have the following data:

      (request)
      (request, response)
      (response)
      

      To span this data across two columns symmetrically, we would want to split the data across two attributes. Thus, the unordered split indicating the presence of the two possible values (request and response) could be done as follows:

      (true, false)
      (true, true)
      (false, true)
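
The unordered split above can be sketched in a few lines of Python (representing each row as the set of nominal values it contains):

```python
# Each row holds some subset of the nominal values.
rows = [{"request"}, {"request", "response"}, {"response"}]

# For each possible value, emit a boolean attribute marking its presence.
values = ["request", "response"]
split = [tuple(v in row for v in values) for row in rows]
```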
      

 

Cleansing

Data cleansing is the process of correcting or removing inaccurate records in a given data set, and it is one of the essential steps before building any machine learning model. Common data cleansing techniques include removing unimportant values, updating incorrect values and filling in missing values. Data cleansing not only gets the data set clean but also ensures that the resulting algorithms deliver strong performance numbers. Many frameworks and libraries come with cleansing techniques built in, but it’s helpful to know the underlying functionality in case you ever need to build your own models [1]. The most popular techniques include:

  • Normalization: Normalization is used to limit values to a particular range. For example, consider three values: [100, 200, 300]. We can divide these values by 100 so that they fit in the range 0-10; the resulting normalized list is [1, 2, 3]. This is one of the most popular cleansing techniques, commonly applied before feeding data to machine learning or deep learning models, and it can be used with any type of numeric data regardless of the size of the data set.
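
More generally, min-max normalization rescales values linearly into a chosen target range; a minimal sketch (scaling into [0, 1]):

```python
def normalize(values, lo=0.0, hi=1.0):
    """Min-max normalization: rescale values linearly into [lo, hi]."""
    vmin, vmax = min(values), max(values)
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]

scaled = normalize([100, 200, 300])
```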

  • Binning: Data binning is a pre-processing data cleansing technique that reduces minor observation errors. In this technique, we use bins to replace the original data with the binned data. These bins represent intervals of the original data, often labeled by their central value. For example, consider a case where you want to arrange a cart of shoes based on their price. In this case, we can bin for every $500 price increase; the shoes priced below $500 will be organized under one bin, shoes ranging from $500-1,000 will be in the next bin, and so on.
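
The shoe-price example can be sketched as follows (the bin width and label format are illustrative):

```python
def price_bin(price, width=500):
    """Assign a price to a $width-wide bin, labeled by its interval."""
    lower = (price // width) * width
    return f"${lower}-{lower + width}"

prices = [120, 480, 650, 999, 1200]
bins = [price_bin(p) for p in prices]
```

In pandas, `pd.cut` performs the same interval assignment for a whole column at once.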

  • Missing Data: In the data cleansing process, handling missing values is one of the most crucial steps. Below are a few operations we can carry out to deal with them:

    • Replacing the missing values (for example, with the mean of the observed values).
    • Completely deleting the records (tuples) that contain missing values.
    • Filling missing values with constants.
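
The options above can be sketched in plain Python (the price list is invented; pandas offers `dropna` and `fillna` for the same operations):

```python
prices = [19.9, None, 25.0, None, 30.1]

# Option 1: delete records that contain missing values.
dropped = [p for p in prices if p is not None]

# Option 2: fill missing values with a constant.
filled_constant = [p if p is not None else 0.0 for p in prices]

# Option 3: replace missing values with the mean of the observed values.
mean = sum(dropped) / len(dropped)
filled_mean = [p if p is not None else mean for p in prices]
```
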
  • Outliers: Outliers are extreme values that deviate from the other observations in the data. Outlier detection is a powerful means of finding variability in measurements as well as experimental errors. Below are a few ways to identify outliers:

    • Distance: Based on the distance between data points, we can judge whether a point really belongs to the data under consideration. Using actual distance metrics, we can filter out and remove the not-so-useful data points, or simply the outliers.

    • Density: When sparsely populated data points appear alongside densely populated ones, the sparse points contribute little to the actual data and may even pull machine learning models off course. Such sparse data points are therefore considered outliers.
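
A common distance-style check is the interquartile range (IQR) rule; a minimal sketch using Python’s statistics module (the data is invented, and the 1.5 × IQR threshold is a convention, not the only choice):

```python
import statistics

data = [10, 12, 11, 13, 12, 11, 95]

# IQR rule: flag points far outside the bulk of the data.
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [x for x in data if x < lower or x > upper]
cleaned = [x for x in data if lower <= x <= upper]
```
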
  • Dimensionality Reduction: Dimensionality reduction reduces the number of features in a given data set. For example, if a data set has three features, we usually represent the points in three-dimensional space; to reduce the number of features to two, we project the points onto a two-dimensional space. There are two main approaches to dimensionality reduction: feature selection and feature extraction [2].

    Examples include principal component analysis (PCA) for feature extraction and filter-based methods for feature selection.
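
As a sketch of feature extraction, here is PCA via the singular value decomposition in NumPy (the synthetic data is generated to lie on a two-dimensional plane inside three-dimensional space, so two components capture it fully):

```python
import numpy as np

# Three features, but the points lie on a 2-D plane in 3-D space.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2)) @ np.array([[1.0, 0.5, 0.0], [0.0, 1.0, 0.5]])

# PCA via SVD: project the centered data onto its top 2 principal components.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
X_reduced = Xc @ Vt[:2].T
```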

  • Quality Assertion: Quality assertion is the process of assessing the data based on a few rules. These rules include the specification of null values (whether a value can be empty), non-null values (whether a value must not be empty at any time), attributes, domain mapping (whether the data maps to a specific domain), etc.
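
A minimal sketch of rule-based quality assertion in Python (the rules, field names and records are all hypothetical):

```python
# Hypothetical rule set: each field maps to a predicate it must satisfy.
rules = {
    "id": lambda v: v is not None,                # non-null: must never be empty
    "email": lambda v: v is None or "@" in v,     # domain mapping (if present)
    "age": lambda v: v is None or 0 <= v <= 120,  # value-range rule
}

def assert_quality(record):
    """Return the list of fields that violate their rule."""
    return [field for field, rule in rules.items() if not rule(record.get(field))]

good = assert_quality({"id": 1, "email": "ada@example.com", "age": 36})
bad = assert_quality({"id": None, "email": "not-an-email", "age": 200})
```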

 

Validation

Data validation is a technique that is used prior to modeling a machine learning algorithm. It allows the data scientist to check for the correctness of the chosen machine learning model before the data is sent into the algorithm. This can be performed on any data set including simple Excel sheets. The main goal of data validation is to create consistent, accurate and complete data so as to prevent data loss and errors while building a model.

  • Cross-Validation: This is used to estimate the model’s performance. It comprises two operations, training and testing. The training data is divided into n subsets. n-1 subsets are used for training, and one subset is used to test the model’s performance. The cross-validation process is then repeated n times where each of the subsets acts as the test data. The n results are averaged to get the final estimate of the model’s performance [3].
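
The n-fold procedure can be sketched by generating the train/test index splits (a simplified version that ignores any leftover samples when the fold count doesn’t divide the data evenly):

```python
def k_fold_indices(n_samples, n_folds):
    """Split sample indices into n_folds test sets; the rest train each round."""
    indices = list(range(n_samples))
    fold_size = n_samples // n_folds
    folds = []
    for i in range(n_folds):
        test = indices[i * fold_size:(i + 1) * fold_size]
        train = [j for j in indices if j not in test]
        folds.append((train, test))
    return folds

folds = k_fold_indices(6, 3)
```

Each index appears in exactly one test set, so averaging the n test scores estimates performance on the full data.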

  • Split Validation: The data is split so that a specific set of data points are used for training, and the others are used for testing. This process provides an estimated accuracy of how a machine learning model fares.

  • Bootstrap Validation: Bootstrap validation carries forward the logic behind split validation (training and testing data) and the basic essence of cross-validation. Unlike cross-validation, it picks samples (data points) from the data with replacement, meaning a bootstrapped data set can contain multiple instances of the same data point, while every sample has an equal probability of being selected on each draw. Ultimately, this technique allows for more randomization and shuffling of the data, meaning there is less chance of the model becoming biased toward a specific class of data [4].
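
Sampling with replacement can be sketched with Python’s random module (the data is invented; the points never drawn, often called the “out-of-bag” set, are typically held out for testing):

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible
data = [5, 7, 9, 11, 13]

# Sample with replacement: the same point may appear more than once.
bootstrap_sample = random.choices(data, k=len(data))

# Points never drawn form the out-of-bag set used for testing.
out_of_bag = [x for x in data if x not in bootstrap_sample]
```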

 

Summary

Pre-processing of data paves the way for building robust machine learning models. To delve deep into the fields of data science, one has to get accustomed to all the processing techniques that could manipulate the data in hand to arrive at its most usable form. As discussed, these techniques could be broadly divided into four main categories: accessing the data available in various formats with ease, blending the data to produce a sophisticated representation of the whole chunk of data, cleaning the data to get rid of unwanted data points, and validating the data to examine its correctness.

These concepts will be employed when building data models. You could also put these concepts into action by coding them in any specific programming language.

References

[1] scikit-learn documentation, “6.3. Preprocessing data.”
[2] “A Survey of Dimensionality Reduction Techniques,” https://arxiv.org/pdf/1403.2877.pdf
[3] RapidMiner documentation, “Cross-Validation.”
[4] RapidMiner documentation, “Bootstrapping Validation.”
