Strings are a common data type, and R provides various built-in functions for cleaning and manipulating them. One of the most useful of these is strsplit(), a function for separating strings. In this article, I’ll explain different ways you can use strsplit() for cleaning string data in R.
What is the strsplit() Function in R?
strsplit() is a function for splitting strings in R using delimiters or regular expressions. It is included in R’s base package.
What Does the strsplit() Function Do?
As its name suggests, strsplit() splits a string up into substrings using user-defined rules. In a simple case, strsplit() splits a string every time a particular character or substring is present. The function can also use regular expressions to define more complex rules for splitting strings.
strsplit() Function Syntax
The strsplit() function has five arguments, although most uses only concern three of these. The first argument, x, is the string (or character vector of strings) to be split up. The following arguments, split and fixed, define how to split x into parts. Setting fixed = TRUE tells strsplit() to divide x by exactly matching the value of split. For example, setting split = "," and fixed = TRUE would return substrings separated wherever a comma was present in x.
If fixed is set to FALSE, which is the default, split is treated as a regular expression. Regular expressions are sequences of characters that specify a rule for matching patterns in strings. These rules can go far beyond exact matches and give the user more power to set complex matching conditions by which to split strings.
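To make the difference between the two modes concrete, here is a minimal sketch that splits on a full stop. The input string is an illustrative assumption, not something from a real data set. In a regular expression, "." matches any character, so the default behavior and fixed = TRUE give very different results.

```r
x <- "a.b.c"

# Default (fixed = FALSE): "." is a regular expression that matches any
# character, so every character in x is treated as a delimiter.
regex_split <- strsplit(x, split = ".")[[1]]

# fixed = TRUE: "." is matched literally.
fixed_split <- strsplit(x, split = ".", fixed = TRUE)[[1]]

regex_split  # only empty strings are left
fixed_split  # "a" "b" "c"
```

If a delimiter contains characters with a special meaning in regular expressions, such as ".", "|", or "(", setting fixed = TRUE is usually the simplest way to get an exact match.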
The final strsplit() arguments are perl and useBytes. perl lets the user specify whether they want to use Perl-compatible regular expressions, a regular expression dialect that takes its name from the Perl programming language. useBytes lets the user apply byte-by-byte matching rules rather than character-by-character ones. Both of these arguments cover less common use cases and are FALSE by default, meaning most users won’t need to touch them.
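As a rough illustration of when perl = TRUE matters, the sketch below uses a lookahead, a feature of Perl-compatible regular expressions. The input string and the "item" naming scheme are hypothetical, chosen only to show the idea.

```r
# Hypothetical input: item names, each followed by a quantity.
x <- "item1 10 item2 20"

# " (?=item)" matches a space only when it is followed by "item".
# The lookahead (?=...) is Perl-style syntax, so perl = TRUE is set.
parts <- strsplit(x, split = " (?=item)", perl = TRUE)[[1]]

parts  # "item1 10" "item2 20"
```

Only the space is consumed as the delimiter, so each item name stays attached to its quantity.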
To understand how these arguments work in practice, let’s explore some examples.
How to Use the strsplit() Function in R
Using the strsplit() Function
The simplest use case for strsplit() is separating a sentence into separate words. To do this, we can set split = " " to split the string wherever there’s a space. Because fixed = FALSE by default, strsplit() treats " " as a regular expression, but a single space has no special meaning in a regular expression, so it behaves like an exact match and there is no need to set fixed explicitly. The function returns the individual words as a character vector that can then be used in further analysis.
Input
x <- "In a hole in the ground there lived a hobbit"
x_split <- strsplit(x, split = " ")
x_split[[1]]
Output
[1] "In" "a" "hole" "in" "the" "ground" "there" "lived" "a" "hobbit"
It’s worth noting this vector is nested within a list. To get at it, you need to index the list as done above, or wrap the result in unlist(), as shown below.
Input
unlist(x_split)
Output
[1] "In" "a" "hole" "in" "the" "ground" "there" "lived" "a" "hobbit"
Using strsplit() Function With Delimiter
As well as separating by spaces, strsplit() allows for splitting strings by any other fixed delimiter. As an example, we can take an address and split it into parts separated by a comma and space by setting split = ", ". Again, extracting parts of a string can be useful for further analysis. In this case, we can create a new data frame from the results, which we could then save with similar records.
Input
address <- "Bilbo Baggins, Bag End, Bagshot Row, Hobbiton"
address_split <- strsplit(address, split = ", ")
address_df <- as.data.frame(t(address_split[[1]]))
colnames(address_df) <- c("name", "house", "street", "town")
address_df
Output
           name   house      street     town
1 Bilbo Baggins Bag End Bagshot Row Hobbiton
Using strsplit() to Split Each Character of the String
Another common use of strsplit() is to split a string by each of its characters by setting split = "".
Input
x <- "spaced"
x_split <- unlist(strsplit(x, split = ""))
x_split
Output
[1] "s" "p" "a" "c" "e" "d"
This functionality is great for extracting individual characters by their position within a string. As an example, we can get a word hidden within our original string by indexing the third, fourth, and fifth elements of x_split and pasting them together.
Input
paste0(x_split[3:5], collapse = "")
Output
[1] "ace"
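The same split-then-paste pattern works for other character-level operations. As a small sketch, the code below reverses a string by splitting it into characters, reversing the vector with rev(), and pasting the result back together. The variable names chars and reversed are illustrative.

```r
x <- "spaced"

# Split into single characters, then take the first (and only) list element.
chars <- strsplit(x, split = "")[[1]]

# Reverse the character vector and collapse it back into one string.
reversed <- paste0(rev(chars), collapse = "")

reversed  # "decaps"
```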
Using strsplit() to Split Dates
A common use of strsplit() is splitting dates into separate parts. When given a standard date format, we can do this easily with a fixed delimiter.
Input
a_date <- "06-21-2024"
unlist(strsplit(a_date, split = "-"))
Output
[1] "06" "21" "2024"
But, as with many base R functions, strsplit() is vectorized. This means that when you give it a set of several values as input, it will automatically process them all without the need for an additional loop. To demonstrate this, we can give strsplit() a vector of dates as input. The function now returns a list of vectors, with each vector being the split version of each date in the input.
Input
dates <- c("06-21-2024", "06-22-2024", "06-23-2024")
dates_split <- strsplit(dates, split = "-")
dates_split
Output
[[1]]
[1] "06" "21" "2024"
[[2]]
[1] "06" "22" "2024"
[[3]]
[1] "06" "23" "2024"
This simple and computationally efficient operation can be part of a larger data wrangling process. As with the prior example, we can create a data frame from this output that stores the date components, forming the basis for further analysis.
Input
dates_df <- as.data.frame(matrix(unlist(dates_split), ncol = 3, byrow = TRUE))
colnames(dates_df) <- c("month", "day", "year")
dates_df
Output
  month day year
1    06  21 2024
2    06  22 2024
3    06  23 2024
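If only one component is needed, or the pieces should be treated as numbers, the list returned by strsplit() can also be processed directly. The following is a minimal sketch using base R only. It assumes every date splits into the same number of parts, and the names years and dates_mat are illustrative.

```r
dates <- c("06-21-2024", "06-22-2024", "06-23-2024")
dates_split <- strsplit(dates, split = "-")

# Pull the third element (the year) out of each split date.
years <- sapply(dates_split, `[`, 3)

# Alternative to matrix(unlist(...)): stack the pieces row by row.
dates_mat <- do.call(rbind, dates_split)

# strsplit() always returns character strings, so convert when
# numeric values are needed.
years_int <- as.integer(years)
years_int
```

Note that the components remain character strings until converted, which is why the month column above keeps its leading zero.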
Using strsplit() Function with Regular Expression Delimiter
As well as using fixed delimiters, strsplit() can also split strings with regular expressions. These expressions are highly customizable and can contain more complex rules than matching a fixed character.
To show how useful regular expressions are for splitting strings, we can continue processing dates. Dates are often a hard data type to work with because of inconsistencies in their formatting. Different characters can separate the digits in dates, and there may not be a consistent formatting convention in a messy data set, as in the example below.
Input
messy_dates <- c("06-21-2024", "06/22/2024", "06_23_2024")
messy_dates_split <- strsplit(messy_dates, split = "[-/_]", fixed = FALSE)
messy_dates_split
Output
[[1]]
[1] "06" "21" "2024"
[[2]]
[1] "06" "22" "2024"
[[3]]
[1] "06" "23" "2024"
We can solve this problem using regular expressions. Here, split is set to "[-/_]", a regular expression that means “match any dash, forward slash, or underscore.” With fixed = FALSE, strsplit() interprets this pattern as a regular expression and separates the dates without an issue, despite their varied formats.
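Two short variations may help show what the regular expression is doing. This is a sketch with made-up inputs: the first call shows that the same pattern stops working when it is matched literally, and the second uses a quantifier to cope with repeated delimiters.

```r
# With fixed = TRUE the pattern "[-/_]" is taken literally. That exact
# five-character sequence never appears, so the date comes back unsplit.
literal <- strsplit("06/22/2024", split = "[-/_]", fixed = TRUE)[[1]]

# A quantifier handles repeated delimiters: " +" means one or more spaces.
words <- strsplit("there  lived   a hobbit", split = " +")[[1]]

literal  # "06/22/2024"
words    # "there" "lived" "a" "hobbit"
```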
Learn to Use strsplit
Any R programmer should consider using strsplit() when segmenting string data. Like other base-R string functions, it is a simple yet powerful tool for data manipulation.
Frequently Asked Questions
What is the strsplit() function in R?
strsplit() is a function for splitting strings in R using delimiters or regular expressions. It is included in R’s base package.
Why use strsplit() in R?
strsplit() has many strengths across a wide range of use cases:
- It’s a base-R function, and so doesn’t require any extra package dependencies.
- Its regular expression compatibility allows for splitting strings using complex matching rules.
- It is vectorized and can process a character vector of inputs without needing loops or extra code.