Data transformation is the process of cleaning and organizing data from one format into another. It’s one of the key aspects of work for data analysis, data science and even artificial intelligence.
7 Data Transformation Functions to Know in R
arrange()
select()
filter()
gather()
spread()
group_by()
andsummarize()
mutate()
In this article, we will see how to transform data in R. R is an open-sourced programming language for statistical computing and machine learning supported by the R Foundation for Statistical Computing. It is easy to learn and comfortable to work with its widely used integrated development environment- RStudio. The R package that we will use here is tidyverse, an opinionated collection of R packages for data science.
Functions from dplyr and tidyr packages of tidyverse mostly do the work of data transformation.
How to Install Tidyverse for Data Transformations in R
Let’s install and load the package first.
install.packages("tidyverse")
library(tidyverse)
You will need to install the package only once, but you’ll need to load the package every time you start your environment.
7 Functions to Know for Transformations in R
These are the functions that we will work on in this article.
arrange()
: This orders the observations.select()
: Used to select variables or columns.filter()
: Used to filter observations by their values.gather()
: This shifts observations from columns to rows.spread()
: This shifts variables from rows to columns.group_by()
andsummarize()
: Used to summarize data into groups.mutate()
: Used to create new variables from existing variables.
1. Arrange()
This function arranges rows by variables.
It arranges the observations in order. It takes a column name or a set of column names as arguments. For a single column name, it sorts the column’s data with other columns following that column. For multiple column names, each additional column breaks ties in the values of preceding columns.
To show this, we will load the mpg data set. It has fuel economy data from 1999 to 2008 for 38 popular car models.
?mpg
data(mpg)
View(mpg)
![Mpg data set table](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/1_transformations-r.jpg)
First, we will order the observations as per the “displ” column. Values of this column will be arranged in ascending order by default and other columns will follow the order of the “displ” column.
mpg_arrange1 <- mpg %>% arrange(displ)
View(mpg_arrange1)
![Mpg data set table organized displ column.](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/2_transformations-in-r.jpg)
Then, we will add more two column names — cty
and hwy
in the arguments.
mpg_arrange2 <- mpg %>% arrange(displ, cty, hwy)
View(mpg_arrange2)
![Cty and hwy column names added in mpg data set table.](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/3_transformations-in-r.jpg)
We can see the second (cty
) and the third (hwy
) columns are breaking the ties in the values of the first (displ
) and second (cty
) columns respectively.
To order the observations in descending order, we’ll use the desc()
function.
mpg_arrange3 <- mpg %>% arrange(desc(displ))
View(mpg_arrange3)
![Mpg data set in descending order.](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/4_transformations-in-r.jpg)
If there were any missing values, they would be sorted at the end.
2. Select()
Select variables by name.We will use the same mpg data set here.
?mpg
data(mpg)
View(mpg)
![Mpg data set table with displ, cty and hwy highlighted.](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/5_transformations-in-r.jpg)
Now, we will select three columns —displ
, cty
and hwy
to a new data frame.
mpg_select1 <- mpg %>% select(displ, cty, hwy)
View(mpg_select1)
![Mpg data set with only displ, cty and hwy columns.](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/6_transformations-in-r.jpg)
If we want to select all columns from displ
to hwy
, we can write below:
mpg_select2 <- mpg %>% select(displ : hwy)
View(mpg_select2)
![Selecting all columns in mpg data set from displ to hwy.](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/7_transformations-in-r.jpg)
3. Filter()
Filter observations by conditions. For this we will use the diamonds data set and see how to select observations based on conditions.
?diamonds
data(diamonds)
View(diamonds)
![Diamond data set table.](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/8_transformations-in-r.jpg)
We can check the summary of the column’s values.
> summary(diamonds$carat)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.2000 0.4000 0.7000 0.7979 1.0400 5.0100
> summary(diamonds$price)
Min. 1st Qu. Median Mean 3rd Qu. Max.
326 950 2401 3933 5324 18823
> table(diamonds$cut)
Fair Good Very Good Premium Ideal
1610 4906 12082 13791 21551
Let’s filter only the ideal cut diamonds.
diamonds_filter1 <- diamonds %>% filter(cut == 'Ideal')
View(diamonds_filter1)
![Diamond data set sorted by ideal cut.](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/9_transformations-in-r.jpg)
Now, let’s filter ideal and premium cut diamonds.
diamonds_filter2 <- diamonds %>%
filter(cut == c('Ideal', 'Premium'))
View(diamonds_filter2)
![Diamond data set filtered by ideal and premium cut.](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/10_transformations-in-r.jpg)
This is how we will filter diamonds with a price greater than 2500.
diamonds_filter3 <- diamonds %>% filter(price > 2500)
View(diamonds_filter3)
![Diamond data set with prices greater than 2600.](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/11_transformations-in-r.jpg)
Now, we will apply our filter in three columns: Ideal cut, 0.54 carat diamonds and price equal to 1266.
diamonds_filter4 <- diamonds %>%
filter(cut == 'Ideal', carat == 0.54, price == 1266)
View(diamonds_filter4)
![Diamond data set filtered to ideal cut, .54 carat and 1266 price.](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/12_transformations-in-r_0.jpg)
Let’s see a more complex filtering criteria: Ideal cut, between 0.4 and (&
) 1.0 carat with price less than 580 or (|
) greater than 10000. Appropriate logical operators will be used to express these conditions:
diamonds_filter5 <- diamonds %>%
filter(cut == 'Ideal' ,
carat >= 0.4 & carat <= 1.0 ,
price < 580 | price > 10000)
View(diamonds_filter5)
![Diamond data set filtered by a range of carats, cuts and prices.](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/13_transformations-in-r.jpg)
The not (!
) operator needs to be used very carefully. We will need to keep in mind the De Morgan’s law while using it. This law states that !(x & y)
is the same as !x | !y
, and !(x | y)
is the same as !x & !y
.
diamonds_filter6 <- diamonds %>%
filter(cut != 'Ideal',
!(carat >= 0.4 & carat <= 1.0) ,
!(price < 580 | price > 10000))
View(diamonds_filter6)
![Diamond data set filtered by cut, quality and price.](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/14_transformations-in-r.jpg)
4. Gather()
Gather columns into key-value pairs. Sometimes we can have a data set where the observations are found in column names that need to be gathered under a variable with a new column name. Let’s build the data set first.
Country_Name <- c('Qatar', 'United States', 'Germany',
'Canada', 'United Kingdom')
Y2009 <- c(59094, 47099, 41732, 40773, 38454)
Y2010 <- c(67403, 48466, 41785, 47450, 39080)
Y2011 <- c(82409, 49883, 46810, 52101, 41652)
gdp <- data.frame(Country_Name, Y2009, Y2010, Y2011)
View(gdp)
![Country gdp data set by years.](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/15_transformations-in-r.jpg)
This data set can be considered as GDP data of five countries for 2009, 2010 and 2011. It is clearly seen that Y2009
, Y2010
and Y2011
these column names are themselves observations and should be under a single variable or column name: ‘Year.’ We will do it using the gather()
function.
gdp_gather <- gdp %>% gather("Year", "GDP" , 2:4)
View(gdp_gather)
![Country gdp data set grouped by years 2009, 2010, 2011.](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/16_transformations-in-r.jpg)
To make this data set ready for numerical analysis, we need to remove the character “Y”(or any character) from the Year values and convert it from character to integer.
gdp_gather$Year <- gsub("[a-zA-Z]", "", gdp_gather$Year)
gdp_gather$Year <- as.integer(gdp_gather$Year)
View(gdp_gather)
glimpse(gdp_gather)
![Country GDP grouped by years with the Y removed.](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/17_transformations-in-r.jpg)
![Code for data transformation.](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/18_transformations-in-r.jpg)
5. Spread()
Spread a key-value pair across multiple columns. Sometimes we can see variables are distributed in observations in a dataset. In this case we need to spread it to column names. Let’s build a data set with key-values.
Student <- c('Jack', 'Jack', 'Jack', 'Jack', 'Jack', 'Jack',
'Rose', 'Rose', 'Rose', 'Rose', 'Rose', 'Rose')
Subject <- c('English', 'Biology', 'Chemistry', 'Maths', 'Physics',
'Social Science', 'English', 'Biology', 'Chemistry',
'Maths', 'Physics', 'Social Science')
Marks <- c(80, 70, 87, 75, 90, 82, 65, 70, 88, 92, 79, 93)
reportCard <- data.frame(Student, Subject, Marks)
View(reportCard)
![Student grades data set with student and subject highlighted.](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/19_transformations-in-r.jpg)
In this report card data set, if we consider the “Subject” names as variables, then it needs to spread out to the column names. We will do it through the spread()
function.
reportCard_spread1 <- reportCard %>% spread(Subject, Marks)
View(reportCard_spread1)
![Student grades spread by class.](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/20_transformations-in-R.jpg)
If we consider “Student” names as variables:
reportCard_spread2 <- reportCard %>% spread(Student, Marks)
View(reportCard_spread2)
![Student grades with names as variables.](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/21_transformations-in-r.jpg)
6. Group_by() and Summarize()
Group by variables and reduce multiple values down to a single value.
These are very useful functions that couple together to summarize values into groups. Here, we will use the mammals sleep (msleep) data set. It contains sleep and body weight data for some of the mammals.
?msleep
data(msleep)
colnames(msleep)
msleep <- msleep %>% select(name, order, sleep_total, bodywt)
View(msleep)
![msleep data set with order, sleep_total and bodywt highlighted.](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/22_transformations-in-r.jpg)
We will summarize sleep_total
to its average values sleep_avg
and bodywt
to its maximum values bodywt_max
. This summarization will be grouped as per order and the number of each order observation will be under the count column. All these will be done by group_by()
and summarize()
with mathematical functions: n()
, mean()
and max()
.
msleep_groupby_summarize <- msleep %>% group_by(order) %>%
summarise(
count = n(),
sleep_avg = mean(sleep_total),
bodywt_max = max(bodywt)
)
View(msleep_groupby_summarize)
![Msleep data set grouped and summarized by order, sleep_total and bodywt.](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/23_transformations-in-r.jpg)
7. Mutate()
Create or transform variables. It is very common in data analysis to derive new variables from existing variables. Here we will use the flights data set. It has on-time data for all flights that departed New York City in 2013. To access this data set, we need to install and load the package nycflights13.
install.packages("nycflights13")
library(nycflights13)
?flights
data(flights)
colnames(flights)
flights <- flights %>% select(year, month, day, dep_delay,
arr_delay, distance, air_time )
View(flight)
![NYC flights data set.](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/24_transformations-in-r.jpg)
Now, we will create two new variables:time_gain
by subtracting arr_delay
from dep_delay
and speed
by dividing distance
by air_time
and multiplying with 60.
flights_mutate <- flights %>%
mutate(time_gain = dep_delay - arr_delay ,
speed = distance / air_time * 60)
View(flights_mutate)
![NYC flights data set with new variables added.](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/25_transformations-in-r.jpg)
I would encourage you to practice these data transformation functions with different data sets and with your own data. This could be your fist step in entering the amazing field of data science.