How to Use the strsplit() Function in R

The strsplit() function is a powerful tool for splitting strings in R using delimiters or regular expressions. Learn about its many uses with these examples.

Written by Rory Spanton
Published on Jul. 15, 2024

Strings are a common data type, and R provides various built-in functions for cleaning and manipulating them. One of the most useful of these is strsplit(), a function for separating strings. In this article, I’ll explain different ways you can use strsplit() for cleaning string data in R.

What is the strsplit() Function in R?

strsplit() is a function for splitting strings in R using delimiters or regular expressions. It is included in R’s base package.

What Does the strsplit() Function Do?

As its name suggests, strsplit() splits a string into substrings using user-defined rules. In the simplest case, it splits the string wherever a particular character or substring occurs. The function can also use regular expressions to define more complex splitting rules.

strsplit() Function Syntax

The strsplit() function has five arguments, although most uses involve only three of them. The first argument, x, is the string (or character vector of strings) to be split. The next two arguments, split and fixed, define how to split x into parts. Setting fixed = TRUE tells strsplit() to split x wherever it exactly matches the value of split. For example, setting split = "," and fixed = TRUE returns the substrings that were separated by commas in x.

If fixed is FALSE, which is the default, split is treated as a regular expression. Regular expressions are sequences of characters that specify a rule for matching patterns in strings. These rules can go far beyond exact matches and give the user more power to set complex matching conditions by which to split strings.
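The difference between the two modes matters most when the delimiter is a character with a special meaning in regular expressions. The sketch below uses a made-up version string (an assumption for illustration) to compare splitting on a dot with and without fixed = TRUE.

```r
version_string <- "4.3.1"

# As a regular expression, "." matches any character, so every
# character is treated as a delimiter and only empty strings remain
regex_split <- strsplit(version_string, split = ".")[[1]]
regex_split

# With fixed = TRUE the dot is matched literally
fixed_split <- strsplit(version_string, split = ".", fixed = TRUE)[[1]]
fixed_split

# Escaping the dot in the regular expression gives the same pieces
escaped_split <- strsplit(version_string, split = "\\.")[[1]]
escaped_split
```

When the delimiter is a literal piece of text, setting fixed = TRUE is usually the safer choice because no escaping is needed.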

The final strsplit() arguments are perl and useBytes. perl lets the user specify whether they want to use regular expressions that are compatible with Perl, a different programming language. useBytes lets the user apply byte-by-byte matching rules. Both of these arguments cover less common use cases and are FALSE by default, meaning most users won’t need to touch them.
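As one possible illustration of perl = TRUE, the sketch below splits a camel-case string at the boundary between a lowercase and an uppercase letter. The input string is hypothetical, and the pattern relies on lookbehind and lookahead assertions, which are Perl-style constructs.

```r
camel <- "splitThisString"

# (?<=[a-z]) requires a lowercase letter just before the split point,
# (?=[A-Z]) requires an uppercase letter just after it.
# These lookaround assertions need perl = TRUE.
parts <- strsplit(camel, split = "(?<=[a-z])(?=[A-Z])", perl = TRUE)[[1]]
parts
```

Because the pattern matches a position rather than a character, no letters are removed from the pieces.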

To understand how these arguments work in practice, let’s explore some examples.

How to Use the strsplit() Function in R

Using the strsplit() Function

The simplest use case for strsplit() is separating a sentence into individual words. To do this, we can set split = " " to split the string wherever there’s a space. Although fixed = FALSE by default, so that split is treated as a regular expression, a single space has no special meaning in a regular expression and is matched literally, so there is no need to set fixed explicitly. The function returns the individual words as a character vector that can then be used in further analysis.

Input

x <- "In a hole in the ground there lived a hobbit"
x_split <- strsplit(x, split = " ")
x_split[[1]]

Output

[1] "In" "a" "hole" "in" "the" "ground" "there" "lived" "a" "hobbit"

It’s worth noting this vector is nested within a list. To get at it, you need to index the list as done above, or wrap the result in unlist(), as shown below.

Input

unlist(x_split)

Output

[1] "In"  "a"  "hole"  "in"  "the"  "ground"  "there"  "lived"  "a"  "hobbit"
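To see the list structure for yourself, a quick check along the following lines may help. It repeats the sentence from the example so that it runs on its own, and uses the base R helpers is.list(), length(), and lengths().

```r
x <- "In a hole in the ground there lived a hobbit"
x_split <- strsplit(x, split = " ")

is.list(x_split)   # the result is a list, not a bare vector
length(x_split)    # one list element per input string
lengths(x_split)   # number of pieces (here, words) in each element
```

Since there is one list element per input string, lengths() doubles as a simple word counter when x holds several sentences.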

Using the strsplit() Function With a Delimiter

As well as separating by spaces, strsplit() allows for splitting strings by any other fixed delimiter. As an example, we can take an address and split it into parts separated by a comma and space by setting split = ", ". Again, extracting parts of a string can be useful for further analysis. In this case, we can create a new data frame from the results, which we could then save with similar records.

Input

address <- "Bilbo Baggins, Bag End, Bagshot Row, Hobbiton"
address_split <- strsplit(address, split = ", ")
address_df <- as.data.frame(t(address_split[[1]]))
colnames(address_df) <- c("name", "house", "street", "town")
address_df

Output

           name   house      street     town
1 Bilbo Baggins Bag End Bagshot Row Hobbiton

Using strsplit() to Split Each Character of the String

Another common use of strsplit() is to split a string into its individual characters by setting split = "".

Input

x <- "spaced"
x_split <- unlist(strsplit(x, split = ""))
x_split

Output

[1] "s"  "p"  "a"  "c"  "e"  "d"

This functionality is great for extracting individual characters by their position within a string. As an example, we can get a word hidden within our original string by indexing the third, fourth, and fifth elements of x_split and pasting them together.

Input

paste0(x_split[3:5], collapse = "")

Output

[1] "ace"
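Splitting into characters also makes it easy to rearrange them. As a small sketch of the same idea, the snippet below reverses a string by splitting it, reversing the resulting vector with rev(), and pasting the characters back together.

```r
x <- "spaced"

# Split into single characters, reverse their order, then rejoin
chars <- strsplit(x, split = "")[[1]]
reversed <- paste(rev(chars), collapse = "")
reversed
# [1] "decaps"
```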

Using strsplit() to Split Dates

A common use of strsplit() is splitting dates into separate parts. When given a standard date format, we can do this easily with a fixed delimiter.

Input

a_date <- "06-21-2024"
unlist(strsplit(a_date, split = "-"))

Output

[1] "06"  "21"  "2024"

As with many base R functions, strsplit() is also vectorized. This means that when you give it a vector of several values as input, it will automatically process them all without the need for an additional loop. To demonstrate this, we can give strsplit() a vector of dates as input. The function now returns a list of vectors, with each vector being the split version of the corresponding date in the input.

Input

dates <- c("06-21-2024", "06-22-2024", "06-23-2024")
dates_split <- strsplit(dates, split = "-")
dates_split

Output

[[1]]
[1] "06"   "21"   "2024"
[[2]]
[1] "06"   "22"   "2024"
[[3]]
[1] "06"   "23"   "2024"

This simple and computationally efficient operation can be part of a larger data wrangling process. As with the prior example, we can create a data frame from this output that stores the date components, forming the basis for further analysis.

Input

dates_df <- as.data.frame(matrix(unlist(dates_split), ncol = 3, byrow = TRUE))
colnames(dates_df) <- c("month", "day", "year")
dates_df

Output

  month day year
1    06  21 2024
2    06  22 2024
3    06  23 2024
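Note that the date components in this data frame are still character strings. A possible variation, sketched below, stacks the split vectors with do.call(rbind, ...) and then converts each column to integers. The name dates_int_df is made up for this illustration, and the approach assumes every date splits into exactly three parts.

```r
dates <- c("06-21-2024", "06-22-2024", "06-23-2024")
dates_split <- strsplit(dates, split = "-", fixed = TRUE)

# Stack the per-date character vectors as rows of a matrix
dates_mat <- do.call(rbind, dates_split)

# stringsAsFactors = FALSE keeps the columns as character in older R versions
dates_int_df <- as.data.frame(dates_mat, stringsAsFactors = FALSE)
colnames(dates_int_df) <- c("month", "day", "year")

# Convert every column from character to integer
dates_int_df[] <- lapply(dates_int_df, as.integer)
str(dates_int_df)
```

With integer columns, the components can be filtered, sorted, or summarized numerically.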

Using the strsplit() Function With a Regular Expression Delimiter

As well as using fixed delimiters, strsplit() can also split strings with regular expressions. These expressions are highly customizable and can contain more complex rules than matching a fixed character.

To show how useful regular expressions are for splitting strings, we can continue processing dates. Dates are often a hard data type to work with because of inconsistencies in their formatting. Different characters can separate the digits in dates, and there may not be a consistent formatting convention in a messy data set, as in the example below. 

Input

messy_dates <- c("06-21-2024", "06/22/2024", "06_23_2024")
messy_dates_split <- strsplit(messy_dates, split = "[-/_]", fixed = FALSE)
messy_dates_split

Output

[[1]]
[1] "06"   "21"   "2024"
[[2]]
[1] "06"   "22"   "2024"
[[3]]
[1] "06"   "23"   "2024"


We can solve this problem using regular expressions. Here, split is set to "[-/_]", a regular expression that means “match any dash, forward slash, or underscore.” With fixed = FALSE, strsplit() interprets this pattern as a regular expression and separates the dates without issue, despite their varied formats. Because FALSE is the default value of fixed, it is written out here only for clarity.
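Regular expressions can also absorb inconsistent spacing around a delimiter. The sketch below uses a made-up string of fields separated by commas or semicolons, each followed by a varying number of spaces, and splits on the delimiter together with any whitespace after it.

```r
fields <- "apples, pears;plums,   figs"

# [,;] matches a comma or a semicolon;
# \\s* also consumes any whitespace that follows it
items <- strsplit(fields, split = "[,;]\\s*")[[1]]
items
# [1] "apples" "pears"  "plums"  "figs"
```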

Learn to Use strsplit

Any R programmer should consider using strsplit() when segmenting string data. Like other base-R string functions, it is a simple yet powerful tool for data manipulation.

Frequently Asked Questions

strsplit() is a function for splitting strings in R using delimiters or regular expressions. It is included in R’s base package.

strsplit() has many strengths across a wide range of use cases:

  • It’s a base-R function, and so doesn’t require any extra package dependencies.
  • Its regular expression compatibility allows for splitting strings using complex matching rules.
  • It is vectorized and can process a whole vector of inputs without needing loops or extra code.