Kasper Welbers & Wouter van Atteveldt 2022-01
- Introduction
- String basics
- Regular expressions
- Splitting text
- A note on Unicode, UTF-8 and Encoding
- A note on characters and factors
The goal of this tutorial is to get you acquainted with basic string
handling in R. A large part of this uses the stringr
included in the
Tidyverse. See also chapter 14 of R for
Data Science and the stringr cheat
sheet
Note that ‘string’ is not an official word in R (which uses character
to denote textual data), but since it’s the word used in most
documentations I will also use strings
to refer to objects containing
textual data. (AFAIK, the name originates from seeing a text as a
string
or sequence of characters)
The package stringr
has a number of functions for dealing with
strings. For example, str_length
gives the length of the string:
library(tidyverse)
str_length("john")
As usual, these functions can be applied to a vector or column of strings directly:
str_length(c("john", "mary"))
df = data.frame(name=c("john", "mary"))
str_length(df$name)
mutate(df, len=str_length(name))
Other useful functions are str_to_lower
, str_to_upper
(which mostly
mimic built-in tolower
and toupper
), and str_to_title
.
To combine two strings, you can use str_c
(which is equivalent to
built-in paste0
):
str_c("john", "mary")
str_c("john", "mary", sep = " & ")
It can also work of longer vectors, where shorter vectors are repeated as needed:
names = c("john", "mary")
str_c("Hello, ", names)
Finally, you can also ask it to collapse longer vectors after the initial pasting:
str_c(names, collapse=" & ")
str_c("Hello, ", names, collapse=" and ")
To take a fixed subset of a string, you can use str_sub. This can be useful, for example, to strip the time part off dates:
dates = c("2019-04-01T12:00", "2012-07-29 01:12")
str_sub(dates, start = 1, end = 10)
You can also replace a substring, for example to make sure the ‘T’ notation is used in the dates:
str_sub(dates, start=11, end=11) = "T"
dates
The example above showed how to extract or replace a fixed part of a string. In many cases, however, we want to find, replace, or extract certain patterns in a string (for example, dates, email addresses, or html tags).
For this purpose, R (like most other languages) use regular expressions, a very powerful way to write text patterns. Although a full overview of regular expressions is beyond the scope of this handout (there’s full books written on the subject!), below are some examples of what you can do.
Note: To find a regular expression in a text, I use the str_view
command, which is quite useful for designing/debugging expressions: (You
miught have to install the htmlwidgets package for this to work)
# if needed: install.packages("htmlwidgets")
txt = c("Hi, I'm Bob", "my email address is [email protected]", "A #hashtag for the #millenials")
str_view(txt, "my") # literal text
str_view(txt, "m.") # . matches any character
str_view(txt, "\\.") # \\. matches an actual period (.)
str_view(txt, "\\w") # \w matches any alphanumeric character (including numbers)
str_view(txt, "\\W") # \W matches any alphanumeric character
str_view(txt, "[a-z]") # [..] are character ranges, in this case, all lower caps letters
str_view(txt, "[^a-z]") # [^..] is a negated character range, in this case, all except lower caps letters
str_view(txt, "\\ba") # \b matches word boundaries,this matches an a at the beginning of a word
The different options above all match (a sequence of) individual characters. You can also specify multiples of a character:
str_view(txt, "ad*") # * means an a followed by zero or more (d's, in this case)
# and as many as possible (greedy)
str_view(txt, "ad*?") # *? means zero or more (d's, in this case), but as few as needed (non-greedy)
str_view(txt, "ad+") # + means one or more d's
str_view(txt, "ad+?") # + means one or more d's, but as few as needed
str_view(txt, "ad?") # ? means zero or one, i.e. an 'optional' match
str_view(txt, "add?") # a single d, optionally followed by another d
str_view(txt, "B.*m") # a B, followed by zero or more of any character
# (and as many as possible), followed by an m
str_view(txt, "B.*?m") # a B, followed by zero or more of any character
# (but as few as needed), followed by an m
These elements can be combined to make fairly powerful patterns, such as for emails or introductions:
str_view(txt, "\\w+@\\w+\\.\\w+") # matches (some) email addresses
str_view(txt, "I'm \\w+\\b") # matches "I'm XXX" phrases
str_view(txt, "#\\w+") # matches #hashtags
Note, the email address pattern is far from complete, and will match addresses with subdomains, numbers, and many other possibilities. It turns out emails are surprisingly complex to match - but the pattern below should do pretty well for all but the most arcane addresses:
regex_email = regex("[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+")
str_view(txt, regex_email)
For more information, see the relevant section of R4DS, or one of the many available resources on regular expressions.
Regular expressions can be used e.g. to find rows containing a specific pattern. For example, if we had a data frame containing the earlier texts, we can filter for rows containing an email address:
t = tibble(id=1:3, text=txt)
t
t |> filter(str_detect(text, regex_email))
You can also str_count
to count how many matches of a pattern are
found in each text:
t |> mutate(n_hashtags = str_count(text, "#\\w+"))
You can also use regular expressions to do find-an-replace. For example, you can remove all punctionation, normalize whitespace, or redact email addresses:
t |> mutate(nopunct = str_replace_all(text, "[^\\w ]", ""),
normalized = str_replace_all(text, "\\s+", " "),
redacted = str_replace_all(text, "\\w+@", "****@"),
text = NULL)
Note the use of setting text to NULL as an alternative way to drop a
column. In textr
, most functions have a _all
variant which
replaces/finds/extracts all matches, rather than just the first.
Besides replacing patterns, it can also be useful to extract elements from a string, for example the email or hashtag:
regex_email = regex("[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+")
t |> mutate(email = str_extract(text, regex_email),
hashtag = str_extract(text, "#\\w+"))
Note that for the hashtag, it only extracted the first hit. You can use
the str_extract_all
function, but since it can match zero, once, or
more often per text, it returns a list containing all matches per row
(or more correctly, per element of the input vector):
str_extract_all(t$text, "#\\w+")
The best way to deal with this in the context of a data frame is to use
unnest_longer
to turn the list into a long format. First, use mutate
to create a column containing the lists (so this is a column which
itself contains complex data)
hashtags = mutate(t, tags = str_extract_all(t$text, "#\\w+"))
hashtags
As you can see, the tags column has ‘list’ as its type, with the last
row containing two elements. To turn this into a more usable dataframe,
we ‘unnest’ it into a ‘long’ format using unnest_longer
:
hashtags |> unnest_longer("tags")
As you can see, this creates a new data frame with one row per hash tag.
Sometimes, a single column contains multiple data points, e.g. separated with a comma. For example, suppose we have this data:
d = tibble(person=c("Liam Jones", "Olivia Smith"), books=c("The great Gatsby, To kill a Mockingbird", "Pride and Prejudice, 1984, Moby Dick"))
d
If we want to split a column, in this case books, there are two good options.
First, if the number of elements is known and you can use multiple you
can use separate
to separate the column into seperate columns:
d |> separate(person, into=c("firstname", "lastname"), sep=" ")
As said, this is mostly useful if a column always contains a fixed number of data points that each have a distinct meaning, e.g. first and last name or city and state.
If there is variable number of data points such as the list of books in the data set above, you can use str_split, which takes a regular expression argument to split the column:
d |> mutate(books = str_split(books, pattern=","))
Just like above, this produces a column of type ‘list’ which contains
multiple values per row. To normalize this, use unnest_longer
as
above:
d |> mutate(books = str_split(books, pattern=",")) |> unnest_longer(books)
As you can also see, the spaces around the book titles are not removed.
You can fix this in two ways, either by changing the pattern to incluide
optional whitespace (\\w
is white space, *
means the preceding
element is optional and can be repeated); or by adding a trimws
call
afterwards:
d |> mutate(books = str_split(books, pattern="\\s*,\\s*")) |> unnest_longer(books)
d |> mutate(books = str_split(books, pattern=",")) |> unnest_longer(books) |> mutate(books=trimws(books))
NOTE: this code is not tested on non-western/non-linux computers, so let me know if anything does not work
Finally, a short note about unicode and character encodings. As with regular expressions, a full explanation of this topic is (well) beyond the scope of this tutorial. See this guide on unicode in R and the classic What Every Programmer .. Needs To Know About Encodings .. for some very useful information if you need to know more.
A fairly short version of the story is as follows: when computers were mostly dealing with English text, life was easy, as there are not a lot of different letters and they could easily assign each letter and some punctuation marks to a number below 128, so it could be stored as 7 bits. For example, A is number 65. This encoding was called ‘ASCII’.
It turned out, however, that many people needed more than 26 letters, for example to write accented letters. For this reason, the 7 bits were expanded to 8, and many accented latin letters were added. This representation is called latin-1, also known as ISO-8859-1.
Of course, many languages don’t use the latin script, so other 8-bit encodings were invented to deal with Cyrillic, Arabic, and other scripts. Most of these are based on ASCII, meaning that 65 still refers to ‘A’ in e.g. the Hebrew encoding. However, character 228 could refer to greek δ, cyrillic ф, or hebrew ה. Things get even more complex if you consider Chinese, where you can’t fit all characters in 256 numbers, so several larger (multi-byte) encodings were used.
This can cause a lot of confusion if you read a text that was encoding
in e.g. greek as if it were encoded in Hebrew. A famous example of this
confusion is that Microsoft Exchange used the WingDings font and
encoding for rendering symbols in emails, amongst others using character
74 as a smiley. For non-exchange users (who didn’t have that font),
however, it renders as the ASCII character nr 74: “J”. So, if you see an
email from a microsoft user with a J where you expected a smiley, now
you know :)
.
To end this confusion, unicode was invented, which assigns a unique number (called a code point) to each letter. A is still 65 (or “1” in hexadecimal R notataion), but δ is now uniquely “3B4”, and ф is uniquely “444”. There are over 1M possible unicode characters, of which about 100 thousand have been currently assigned. This gives enough room for Chinese, old Nordic runes, and even Klingon to be encoded.
You can directly use these in an R string:
"Some Unicode letters: \u41 \u03B4 \u0444"
Now, to be able to write all 1M characters to string, one would need almost 24 bits per character, tripling the storage and memory needed to handle most text. So, more efficient encodings were invented that would normally take only 8 or 16 bits per character, but can take more bits if needed. So, while the problem of defining characters is solved, unfortunately you still need to know the actual encoding of a text. Fortunately, UTF-8 (which uses 1 byte for latin characters, but more for non-western text) is emerging as a de facto standard for most texts. This is a compromise which is most efficient for latin alphabeters, but is still able to unambiguously express all languages.
It is still quite common, however, to encounter text in other encodings, so it can be good to understand what problems you can face and how to deal with them
To show how this works in R, we can use the charToRaw function to see how a character is encoded in R:
charToRaw('A')
## [1] 41
Note that the output of this function depends on your regional settings (called ‘locale’). On most computers, this should produce 41 however, as most encodings are based on ASCII.
For other alphabets it can be more tricky. The Chinese character “蘭” (unicode “62d”) on my computer is expressed in UTF-8, where it takes 3 bytes:
To convert between encodings, you can use the iconv function. For example, to express the Chinese character above in GB2312 (Chinese national standard) encoding:
charToRaw(iconv('蘭', to='GB2312'))
The most common way of dealing with encodings is to ignore the problem and hope it goes away. However, outside the English world this is often not an option. Also, due to general unicode ignorance many people will use the wrong encoding, and you will even see things like double-utf8-encoded text.
The sane way to deal with encodings is to make sure that all text stored inside your program is encoded in a standard encoding, presumably UTF-8. This means that whenever you read text from an external source, you need to convert it to UTF-8 if it isn’t already in that form.
This means that when you use read_csv
(on text data) or readtext
,
you should ideally always specify which encoding the text is encoded
in:
readtext::readtext("file.txt", encoding = "utf-8")
read_csv("file.csv", locale=locale(encoding='utf-8'))
If you don’t know what encoding a text is in, you can try utf-8 and the most common local encodings (e.g. latin-1 in many western countries), you can inspect the raw bytes, or you can use the guessEncoding function from readr:
guess_encoding("file.txt")
In R, every vector (column) has a data type, which you can inspect using
class(x)
. Two relevant data types are character
for actual texts,
and factor
for labeled nominal values (groups).
In most cases, you want you texts to be stored as character
, but many
default functions (like data.frame
and read.csv
) store text colunms
as factor
– while the tidyverse equivalents of tibble
and read_csv
use character
by default.
In most cases, in R you can convert object types using as.character
:
df = data.frame(name=c("john", "mary"))
class(df$name)
library(tidyverse)
tib = tibble(name=c("john", "mary"))
class(tib$name)
Before R version 4.0, textual columns in data frames were automatically
converted to a factor
. A factor
looks like a character
column, but
it is intended for groups or nominal values. Internally it is stored as
a number, with a label attached to each possible value.
This makes storing large amounts of data more efficient (numbers are
smaller than texts), but sometimes, you get unexpected results when you
apply a text-based function to a factor. For example, normally
as.numeric
can be used to convert strings to numbers, setting
non-numeric strings to NA:
as.numeric(c("22", "16", "x"))
However, if we have this data encoded as a factor, it gives a completely different result:
survey = data.frame(age=c("22", "16", "1"), stringsAsFactors = TRUE)
as.numeric(survey$age)
What happened here? The trick is to remember that factors are intended as labeled groups, and internally R stores the group index and has a separate list of labels (by default sorted alphabetically). Thus, “22” was the second element in the labels, “16” the first, etc. To convert the labels into numbers, convert to character first (which uses the labels) before converting to numeric:
as.numeric(as.character(survey$age))
Since R version 4.0, these problems are much less common, but it’s good
to be aware of the difference between factors and strings. Also, when
using ggplot
it’s important to set factor data types and levels
correctly for nominal values as they control the ordering and
visualization of nominal data. For more information on dealing with
factors, see the forcats package and
cheat
sheet