Data by itself can be quite interesting, but even if you’re dealing with a small data set, the chances are that you’ll have to summarize or aggregate it. That’s where we’ll need groups.
While it’s nice to know the total amount of sales, it’s often more useful to know the total amount of sales either by salesperson or by month.
How to Group Data With R
- Load the data set into a Tibble.
- Enter the function ‘group_by()’ to group the information.
- Use ‘summarise’ to analyze your data.
- Create a new column with ‘mutate’.
- Ungroup your data with ‘ungroup()’.
Grouping data is undeniably essential for data analysis, and I’ll investigate some of the methods for doing so with R, Tidyverse and the dplyr package.
The data set I’ll use for the next examples comes from Kaggle and contains Spotify’s top songs from 2010 to 2019.
library("tidyverse")
df <- read_csv('data/spotify.csv') %>%
rename(genre = `top genre`)
[Figure: Screenshot of the Spotify top genre data.]
Now our data set is loaded to a Tibble, which contains 603 records and 15 columns.
How to Group Data With R
Now, let’s group the data by genre and see what we get.
# group by genre
df %>%
group_by(genre)
[Figure: Grouped data.]
There’s almost no difference in the results, but now the second line has some information about the group. It may seem that nothing changed, but since the data is grouped, it’ll be treated differently in the next operations.
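One way to confirm that the grouping really is there is to ask dplyr directly. The sketch below uses a small made-up tibble called songs (a hypothetical stand-in for the Spotify data, not part of the original data set) together with dplyr's is_grouped_df(), group_vars() and n_groups() helpers.

```r
library(dplyr)

# Hypothetical toy data standing in for the Spotify tibble
songs <- tibble(
  genre = c("pop", "pop", "rock", "rock", "rock"),
  bpm   = c(100, 120, 90, 110, 160)
)

grouped <- songs %>%
  group_by(genre)

is_grouped_df(songs)    # FALSE: the plain tibble carries no groups
is_grouped_df(grouped)  # TRUE: same rows, plus grouping metadata
group_vars(grouped)     # "genre"
n_groups(grouped)       # 2
```

The rows and columns are identical in both objects; only the grouping metadata differs.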
How to Analyze Data While Grouping in R
We can use summarise() to observe this difference. First, let’s do it without groups.
df %>%
summarise(summary = mean(bpm))
[Figure: Summarised data.]
Since we didn’t have any groups, R returns the mean value for the whole data set.
df %>%
group_by(genre) %>%
summarise(mean_bpm = mean(bpm))
[Figure: Grouped and summarised data.]
Now we have one row per unique genre, each with its average beats per minute (bpm).
Note that this new Tibble doesn’t have the group’s information anymore.
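To make the contrast concrete, here is a minimal sketch on a made-up tibble (songs and its values are assumptions for illustration, not the Spotify data). It computes the overall mean, the per-genre means, and checks that the summarised result is no longer grouped.

```r
library(dplyr)

# Hypothetical toy data standing in for the Spotify tibble
songs <- tibble(
  genre = c("pop", "pop", "rock", "rock", "rock"),
  bpm   = c(100, 120, 90, 110, 160)
)

# Without groups: a single row for the whole data set
overall <- songs %>%
  summarise(mean_bpm = mean(bpm))

# With groups: one row per genre
by_genre <- songs %>%
  group_by(genre) %>%
  summarise(mean_bpm = mean(bpm))

overall$mean_bpm         # 116
by_genre$mean_bpm        # 110 (pop), 120 (rock)
is_grouped_df(by_genre)  # FALSE: the single grouping level was dropped
```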
How to Create a New Column While Grouping in R
Now we’ll create a new column with mutate() instead of summarise(). First, we’ll see the result without grouping:
# mutate without grouping
df %>%
mutate(mean_bpm = (bpm - mean(bpm))^2) %>%
select(genre, mean_bpm)
[Figure: Data calculated with the mean bpm of all values.]
# mutate with grouping
df %>%
group_by(genre) %>%
mutate(mean_bpm = (bpm - mean(bpm))^2) %>%
select(genre, mean_bpm)
[Figure: Data calculated with the mean bpm of each group.]
The results are very different. In the first one, we took each record’s bpm, subtracted the mean bpm of all records and squared the result.
In the second, we did the same thing but subtracted the mean bpm of the records in that group. We can also see that even after using mutate(), our data is still grouped.
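The same comparison can be checked by hand on a tiny example. The sketch below uses a made-up tibble (songs is hypothetical, and the column is named sq_dev here only to describe what it holds) so the squared deviations are easy to verify.

```r
library(dplyr)

# Hypothetical toy data: overall mean bpm is 116,
# pop mean is 110 and rock mean is 120
songs <- tibble(
  genre = c("pop", "pop", "rock", "rock", "rock"),
  bpm   = c(100, 120, 90, 110, 160)
)

# Squared deviation from the mean of the whole data set
ungrouped <- songs %>%
  mutate(sq_dev = (bpm - mean(bpm))^2)

# Squared deviation from the mean of each genre
grouped <- songs %>%
  group_by(genre) %>%
  mutate(sq_dev = (bpm - mean(bpm))^2)

ungrouped$sq_dev     # 256 16 676 36 1936
grouped$sq_dev       # 100 100 900 100 1600
group_vars(grouped)  # "genre": mutate() keeps the grouping
```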
How to Ungroup Your Data in R
If we need to change that, we can easily do so with ungroup() before performing other operations.
df %>%
group_by(genre) %>%
mutate(mean_bpm = mean(bpm)) %>%
select(genre, mean_bpm) %>%
mutate(my_grouped_sum = sum(mean_bpm)) %>%
ungroup() %>%
mutate(my_regular_sum = sum(mean_bpm))
[Figure: Mutate function executed on grouped and ungrouped data.]
It’s easy to confuse your data set with its grouped version, so it’s a good habit to ungroup your data before saving the results to a variable.
We know that group_by() returns a Tibble very similar to our standard Tibble. The difference is in how it handles the next operations.
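A short sketch of why a saved grouped Tibble can surprise you later: the same summarise() call returns a different number of rows depending on whether the groups are still attached. The songs tibble and the variable names are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data standing in for the Spotify tibble
songs <- tibble(
  genre = c("pop", "pop", "rock", "rock", "rock"),
  bpm   = c(100, 120, 90, 110, 160)
)

# Saved while still grouped by genre
genre_means <- songs %>%
  group_by(genre) %>%
  mutate(mean_bpm = mean(bpm))

# Later on, this silently counts per genre: two rows
still_grouped <- genre_means %>%
  summarise(n = n())

# After ungroup(), the same call counts the whole table: one row
ungrouped <- genre_means %>%
  ungroup() %>%
  summarise(n = n())

nrow(still_grouped)  # 2
ungrouped$n          # 5
```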
How to Group Multiple Fields in R
Using multiple fields to group the data is also quite easy; we can pass them as additional arguments to group_by().
# multiple fields
df %>%
group_by(genre, year) %>%
summarise(rec_count = n()) %>%
arrange(desc(year), desc(rec_count))
[Figure: Grouped and summarised by genre and year.]
The last time we used summarise(), it returned a Tibble without groups. By default, summarise() only drops the last grouping variable, so now that we’re grouping by two variables, the result is still grouped by genre.
While the grouped and ungrouped Tibbles look similar, they are not. Let’s repeat this code, adding a column with the count after summarise() and another after ungrouping.
# multiple fields
df %>%
group_by(genre, year) %>%
summarise(genre_year_count = n()) %>%
arrange(desc(year), desc(genre_year_count)) %>%
mutate(genre_count = n()) %>%
ungroup() %>%
mutate(total_count = n())
[Figure: Data mutated after summarising and after ungrouping.]
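If the leftover grouping is not wanted, summarise() in dplyr 1.0.0 and later accepts a .groups argument that states explicitly what should remain. The sketch below compares "drop_last" (the default behaviour when every group yields one row) with "drop" on a made-up tibble; songs and its values are assumptions for illustration.

```r
library(dplyr)

# Hypothetical toy data standing in for the Spotify tibble
songs <- tibble(
  genre = c("pop", "pop", "rock", "rock", "rock"),
  year  = c(2018, 2019, 2018, 2019, 2019)
)

# Keep every grouping variable except the last one (year)
kept <- songs %>%
  group_by(genre, year) %>%
  summarise(rec_count = n(), .groups = "drop_last")

# Remove all grouping in the same step
dropped <- songs %>%
  group_by(genre, year) %>%
  summarise(rec_count = n(), .groups = "drop")

group_vars(kept)     # "genre"
group_vars(dropped)  # character(0)
dropped$rec_count    # 1 1 1 2
```

Spelling out .groups also silences the informational message that summarise() prints when it has to pick the default itself.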
Sometimes we might perform some operations with a group and then need to add another field to our group_by().
Calling group_by() on already-grouped data replaces the previous grouping, but we can set the .add argument to TRUE to add to the existing groups instead (dplyr versions before 1.0.0 called this argument add).
df %>%
group_by(genre) %>%
mutate(mean_bpm_genre = mean(bpm)) %>%
group_by(year, .add = TRUE) %>%
mutate(mean_bpm_genre_year = mean(bpm)) %>%
select(genre, year, mean_bpm_genre, mean_bpm_genre_year)
[Figure: Tibble with the added group.]
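The difference between replacing and adding groups is easy to verify with group_vars(). This is a minimal sketch on a made-up tibble (songs is hypothetical) and assumes dplyr 1.0.0 or later, where the argument is spelled .add.

```r
library(dplyr)

# Hypothetical toy data standing in for the Spotify tibble
songs <- tibble(
  genre = c("pop", "pop", "rock", "rock", "rock"),
  year  = c(2018, 2019, 2018, 2019, 2019)
)

# A second group_by() replaces the first grouping
replaced <- songs %>%
  group_by(genre) %>%
  group_by(year)

# With .add = TRUE the new variable is appended to the existing groups
added <- songs %>%
  group_by(genre) %>%
  group_by(year, .add = TRUE)

group_vars(replaced)  # "year"
group_vars(added)     # "genre" "year"
```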
Group in R With Variables and Functions
Now, let’s try using group_by() more programmatically. We’ll try to define a function, pass the group as a parameter, perform a simple count and get the results.
my_func <- function(df, group){
df %>%
group_by(group) %>%
summarise(my_count = n()) %>%
arrange(desc(my_count))
}
my_func(df, 'year')
[Figure: Error in the data. | Screenshot: Thiago Carvalho]
It isn’t so simple. We need to make a few adjustments to make this work. One possible solution is to use a quosure.
my_func <- function(df, group){
df %>%
group_by(!!group) %>%
summarise(my_count = n()) %>%
arrange(desc(my_count))
}
my_group <- quo(year)
my_func(df, my_group)
[Figure: Function with quosure.]
We could also use group_by_().
Many dplyr verbs have an alternative version with an extra underscore at the end. Those can help us use verbs such as group_by() more programmatically, although these underscore variants have been deprecated since dplyr 0.7.0.
Dplyr uses non-standard evaluation for most of its single-table verbs, including filter(), mutate(), summarise(), arrange(), select() and group_by(). While this makes the code faster to type and makes it possible to translate it into SQL, it also makes these verbs harder to program with.
Let’s try the function again, this time with group_by_().
my_func <- function(df, group){
df %>%
group_by_(group) %>%
summarise(my_count = n()) %>%
arrange(desc(my_count))
}
my_func(df, 'genre')
[Figure: Function with SE verbs.]
That’s easier than using a quosure: it’s more readable and it yields the same result.
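Because the underscore verbs are deprecated in current dplyr releases, it may be worth knowing two alternatives that achieve the same thing. The sketch below is one possible approach rather than the article's own method: count_by_name() takes the column name as a string and uses across(all_of()), while count_by_col() takes a bare column name and uses the embrace operator {{ }}. Both function names and the songs tibble are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data standing in for the Spotify tibble
songs <- tibble(
  genre = c("pop", "pop", "rock", "rock", "rock"),
  year  = c(2018, 2019, 2018, 2019, 2019)
)

# Column name passed as a string
count_by_name <- function(df, group) {
  df %>%
    group_by(across(all_of(group))) %>%
    summarise(my_count = n()) %>%
    arrange(desc(my_count))
}

# Column passed as a bare name, forwarded with {{ }}
count_by_col <- function(df, group) {
  df %>%
    group_by({{ group }}) %>%
    summarise(my_count = n()) %>%
    arrange(desc(my_count))
}

by_name <- count_by_name(songs, "genre")
by_col  <- count_by_col(songs, genre)

by_name  # rock 3, pop 2
by_col   # same result
```

The string version is handy when the column name comes from user input or a configuration file; the embrace version keeps the familiar unquoted dplyr style for callers.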
We explored the basics of group_by(), how to use multiple fields to group our data, the differences between a grouped and a regular Tibble, and how to use group_by_() to achieve more programmatic solutions.
Grouping in R Variants
There are some variants such as group_by_all() and group_by_if(). Although they’re considered superseded by across(), they’re still worth a look.
Grouping in R Variants to Know
- group_by_all: Allows you to use every field in the data set.
- group_by_if: Allows you to use a predicate function to select the fields to group by.
- group_split: Allows you to separate the data into a list of Tibbles.
- group_nest: Returns a Tibble containing the grouped columns and the data from those respective groups.
Group_by_all
The name gives it away: group_by_all() uses every field in the data set. As mentioned, the same can be done using group_by() and across().
new_df <- select(df, genre, year)
new_df %>%
group_by_all() %>%
summarise(my_cnt = n()) %>%
arrange(desc(my_cnt))
new_df %>%
group_by(across(everything())) %>%
summarise(my_cnt = n()) %>%
arrange(desc(my_cnt))
[Figure: The result from both methods.]
I’ve never found myself in a situation where I needed to group all columns, but it could come in handy in some cases, and it’s good to know it exists.
Group_by_if
Another fascinating variant is group_by_if(), which allows us to use a function to select the fields.
# group_by_if
new_df <- df %>%
mutate(artist = as.factor(artist),
genre = as.factor(genre))
new_df %>%
group_by_if(is.factor) %>%
summarise(my_cnt = n()) %>%
arrange(desc(my_cnt))
new_df %>%
group_by(across(where(is.factor))) %>%
summarise(my_cnt = n()) %>%
arrange(desc(my_cnt))
[Figure: The result from both methods.]
Let’s try another example with a custom function. We’ll group by the columns in which any record contains the word “dance”.
new_df %>%
group_by_if(function(x) any(grepl("dance", x, fixed=TRUE))) %>%
summarise(my_cnt = n())
[Figure: Group by with a custom function.]
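The custom predicate can also be written with group_by(), across() and where(), mirroring the earlier is.factor example. This is a sketch on a made-up tibble; songs and the helper name has_dance are assumptions for illustration.

```r
library(dplyr)

# Hypothetical toy data: only the genre column contains "dance"
songs <- tibble(
  title = c("a", "b", "c"),
  genre = c("dance pop", "dance pop", "rock"),
  bpm   = c(100, 120, 90)
)

# TRUE for any column in which at least one value contains "dance"
has_dance <- function(x) any(grepl("dance", x, fixed = TRUE))

result <- songs %>%
  group_by(across(where(has_dance))) %>%
  summarise(my_cnt = n())

result  # dance pop 2, rock 1
```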
Group_split
Sometimes we’ll also need different treatments for different groups. That is made simple with group_split(), which separates our data into a list of Tibbles, one for each group.
In this example, we’ll group the data by year, split, and save the result to a variable called df_list.
To test, we can select an index of this list. In return, we should get a Tibble containing only the records of one year.
# split
df_list <- df %>%
group_by(year) %>%
group_split()
df_list[[10]]
[Figure: A Tibble with only 2019 records.]
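Picking an element by a hard-coded position works when you already know the order of the groups. A slightly safer sketch is to ask group_keys() which key sits at which position and look the index up from there. The songs tibble below is made up for illustration.

```r
library(dplyr)

# Hypothetical toy data standing in for the Spotify tibble
songs <- tibble(
  genre = c("pop", "pop", "rock", "rock", "rock"),
  year  = c(2018, 2019, 2018, 2019, 2019)
)

by_year <- songs %>%
  group_by(year)

# One row per group, in the same order group_split() uses
keys <- by_year %>%
  group_keys()

df_list <- by_year %>%
  group_split()

# Find the position of the 2019 group instead of guessing it
idx <- which(keys$year == 2019)
songs_2019 <- df_list[[idx]]

keys$year         # 2018 2019
idx               # 2
nrow(songs_2019)  # 3
```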
Group_nest
A more sophisticated solution for separating our groups is group_nest(), which returns a Tibble containing the grouped columns and the data from those respective groups.
# nest
df_nest <- df %>%
group_nest(genre, year)
df_nest
[Figure: Nested groups.]
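Once the data is nested, each element of the data column is an ordinary Tibble, so it can be processed one group at a time. The sketch below uses base R's vapply() inside mutate() to compute a row count and a mean per group; the songs tibble and the new column names are assumptions for illustration.

```r
library(dplyr)

# Hypothetical toy data standing in for the Spotify tibble
songs <- tibble(
  genre = c("pop", "pop", "rock", "rock", "rock"),
  bpm   = c(100, 120, 90, 110, 160)
)

# One row per genre, with the remaining columns stored in `data`
nested <- songs %>%
  group_nest(genre)

# Apply a function to each nested Tibble
nested_stats <- nested %>%
  mutate(
    n_songs  = vapply(data, nrow, integer(1)),
    mean_bpm = vapply(data, function(d) mean(d$bpm), numeric(1))
  )

nested_stats$n_songs   # 2 3
nested_stats$mean_bpm  # 110 120
```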
There are plenty of ways to group our data and manipulate it once it’s grouped, but I believe we covered enough for the basics.
Frequently Asked Questions
What does grouping do in R?
Grouping in R splits a data set into subsets of rows that share the same values in one or more columns, so that operations are applied to each subset separately. Grouping data in R is often done by using the group_by() function from the dplyr package, which converts an existing data table into a grouped table where operations are applied by group.
What does ungroup() mean in R?
The ungroup() function removes any data grouping done by the group_by() function in R. ungroup() is a function included alongside group_by() in the dplyr package.