Let's take this sample sentence (which already hints at our frustration in a couple of seconds):
I would like my regex to take this sentence, but it actually takes more, and I don't know why!
We would like to write a regex that takes the text until the first comma. Our best guess would be to write a regex like `.+,` (to also allow for other non-alphanumeric characters before the comma). However, what our regex engine matches is not the text until the first comma, but the text until the second (that is, the last) comma. Why?
Because repeating qualifiers (like the `+` we used) are greedy by default. This means they first take as much text as possible (in the case of `.+`, everything up to the end of the line, since `.` does not match a line break by default) and then take steps back (backtrack) until the rest of the pattern is satisfied (in our case, that a comma must be at the end of the match).
One possible solution would be to make this repeating qualifier lazy. We can do this by adding a question mark `?` just after the repeating qualifier, so our regex becomes `.+?,`
Note
The question mark `?` used as a modifier to make a repeating qualifier lazy should not be confused with the question mark `?` used as a repeating qualifier itself (meaning zero or one repetition, as we saw in the previous section).
Another solution would be to use a different regex, for example one where we exclude the comma character in a set with a repeating qualifier, and then add the comma outside the set: `[^,]+,`
(The latter solution is actually more efficient from a computational point of view.)
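To see the three variants side by side, here is a minimal sketch using Python's `re` module on the sample sentence. The variable names (`sentence`, `greedy`, `lazy`, `negated`) are only illustrative and not part of the lesson material.

```python
import re

sentence = ("I would like my regex to take this sentence, "
            "but it actually takes more, and I don't know why!")

# Greedy: .+ grabs as much as it can, then backtracks to the LAST comma.
greedy = re.match(r".+,", sentence).group()

# Lazy: .+? grows one character at a time and stops at the FIRST comma.
lazy = re.match(r".+?,", sentence).group()

# Negated set: [^,]+ can never run past a comma in the first place.
negated = re.match(r"[^,]+,", sentence).group()

print(greedy)   # I would like my regex to take this sentence, but it actually takes more,
print(lazy)     # I would like my regex to take this sentence,
print(negated)  # I would like my regex to take this sentence,
```

The lazy and the negated-set versions return the same text here; they differ only in how the engine gets there.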
Anchors match a pattern based on its position in the string. They are useful for defining the context in which a pattern should be matched within a text, allowing for more precise and controlled matching.
String | RE | Match |
---|---|---|
complicated | `^comp` | Yes |
appreciated | `ed$` | Yes |
rain | `^rain$` | Yes |
rain | `^r[ai]+n$` | Yes |
raaaain | `^r[ai]+n$` | Yes |
complicated | `^comp.*ed$` | Yes |
Note: Most RE engines have a multi-line mode that makes the caret `^` match after any line break, and the dollar sign `$` match before any line break.
Remarks:
`^pattern$` has the meaning of a total match: the pattern has to cover the string from its first to its last character.
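The anchor examples can be checked in Python as well. This is only a sketch: the `rows` list simply repeats the table above, and the two-line string `text` is made up for illustration. Note that in Python multi-line mode is switched on with `re.MULTILINE`, and `re.fullmatch` is a convenient way to ask for a total match without writing `^` and `$`.

```python
import re

# (string, pattern) pairs taken from the table above
rows = [
    ("complicated", r"^comp"),
    ("appreciated", r"ed$"),
    ("rain", r"^rain$"),
    ("rain", r"^r[ai]+n$"),
    ("raaaain", r"^r[ai]+n$"),
    ("complicated", r"^comp.*ed$"),
]
results = [bool(re.search(pattern, string)) for string, pattern in rows]
print(results)  # [True, True, True, True, True, True]

# Without multi-line mode, ^ and $ refer to the whole string ...
text = "rain\nsnow"
single = re.findall(r"^\w+$", text)
# ... with multi-line mode, they refer to each line.
multi = re.findall(r"^\w+$", text, flags=re.MULTILINE)
print(single)  # []
print(multi)   # ['rain', 'snow']

# Total match: the pattern must cover the entire string.
total = re.fullmatch(r"r[ai]+n", "raaaain") is not None
partial = re.fullmatch(r"r[ai]+n", "rainy") is not None
print(total, partial)  # True False
```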
Special sequence | Matches at |
---|---|
`\b` | a word boundary |
`\B` | not a word boundary |
A word boundary is a position between a character that can be matched by `\w` and a character that cannot be matched by `\w`. `\b` also matches at the start and end of the string if the first/last characters in the string are word characters. `\B` matches at every position where `\b` cannot match.
String | RE | Match |
---|---|---|
complicated | `\bcomp` | Yes |
appreciated | `\Bed\b` | Yes |
rain | `\brain\b` | Yes |
rain | `^r[ai]+n\b` | Yes |
complicated | `\bcomp.+\b` | Yes |
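Word boundaries are easiest to appreciate on a longer text. The following sketch uses Python's `re` module; the example sentence and the variable names are made up for illustration.

```python
import re

text = "The rain in Spain stays mainly in the plain."

# Every occurrence of the letters "in", also inside longer words
anywhere = re.findall(r"in", text)

# Only "in" as a word of its own: a boundary on both sides
whole_word = re.findall(r"\bin\b", text)

# Whole words that end in "ain" (so "mainly" is left out)
ending_in_ain = re.findall(r"\b\w*ain\b", text)

print(len(anywhere))   # 6
print(whole_word)      # ['in', 'in']
print(ending_in_ain)   # ['rain', 'Spain', 'plain']
```

Raw strings (`r"..."`) matter here: in a normal Python string, `"\b"` would be read as a backspace character instead of a word boundary.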
Regular expressions can also be used for search & replace.
In the simplest case you search using a regular expression and replace each match with a fixed text (try it out on regex101 using the Substitution function).
On top of that, it is possible to reuse selected parts of the match in the replacement (so-called backreferences):
- Backreferences refer to groups, which are defined using parentheses, e.g. `a(.)b(.c.)d` defines two of them:
  - any single character between a and b,
  - a sequence of any character, c and any character, between b and d.
- How to refer to backreferences in the replacement string depends on the regular expression engine, but the two most popular syntaxes are `\backreferenceNumber` and `$backreferenceNumber`, e.g. `\1` and `$1` (you can try it out on regex101 by choosing various flavors).
As an example let's naively reorder a conditional sentence:
- search for: `(.*), ?(.*)[.]`
- replace with: `$2, $1.`
- test string: `if you see a red light, stop.`
- result: `stop, if you see a red light.`
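The same replacement can be sketched in Python with `re.sub`. Python belongs to the engines that use the backslash syntax, so the replacement string is written `\2, \1.` rather than `$2, $1.`. The string `"aXbYcZd"` is an invented input, used only to show what the two groups of `a(.)b(.c.)d` capture.

```python
import re

# The two groups of the pattern from the list above
groups = re.search(r"a(.)b(.c.)d", "aXbYcZd").groups()
print(groups)  # ('X', 'YcZ')

# Naively reordering the conditional sentence
test_string = "if you see a red light, stop."
result = re.sub(r"(.*), ?(.*)[.]", r"\2, \1.", test_string)
print(result)  # stop, if you see a red light.
```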
To allow more flexibility or precision when searching for a pattern, regex flags can be used. We will quickly go over the most important ones, which can also be combined in more complicated patterns:
- global (g): search through the whole string, and do not return just after the first occurrence
- multi line (m): changes what the anchors `^` and `$` match, and therefore what a total match `^pattern$` refers to
  - when you have multi line activated, the caret `^` and the dollar sign `$` match the beginning and end of each line
  - with no multi line, the same symbols `^` and `$` match the beginning and end of the whole string (i.e., the whole text/document you have)
- insensitive (i): case insensitive search (matches both lower and upper case)
- extended (x): ignore whitespace in the pattern
Remarks:
- In some applications, you can set flags directly on the pattern by appending them at the end, using the letter given in brackets above, e.g. `/pattern/m` (for multi line).
- Flags available and their exact behavior may (and do) vary between regex implementations (check it on regex101 by choosing different flavors; you can change flags by clicking on the green letters at the end of the regular expression field: see screenshot). Check the documentation of the app/language for details.
- Some applications (most notably text editors) do not expose flags or do it indirectly (e.g. with an "ignore case" checkbox in the search dialog).
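As one concrete case of how flags differ between implementations, here is a sketch of the flags above in Python's `re` module. Python has no global flag: you choose between `re.search` (first hit) and `re.findall` (all hits) instead. The other flags are passed as arguments (`re.IGNORECASE`, `re.MULTILINE`, `re.VERBOSE`) or written inline at the start of the pattern, e.g. `(?im)`. The sample `text` is invented for illustration.

```python
import re

text = "Rain today.\nrain tomorrow, rain on Sunday.\nNo RAIN on Friday."

# "global": re.search returns the first hit, re.findall returns all of them
first_hit = re.search(r"rain", text).group()
all_hits = re.findall(r"rain", text)

# insensitive (i)
ignore_case = re.findall(r"rain", text, flags=re.IGNORECASE)

# multi line (m): ^ matches at the start of the string vs. the start of each line
start_of_string = re.findall(r"^rain", text, flags=re.IGNORECASE)
start_of_lines = re.findall(r"^rain", text, flags=re.IGNORECASE | re.MULTILINE)

# extended (x): whitespace in the pattern is ignored, comments are allowed
verbose = re.compile(r"""
    ^ r      # an r at the start of a line
    [ai]+    # one or more a or i
    n        # followed by an n
""", flags=re.VERBOSE | re.IGNORECASE | re.MULTILINE)
verbose_hits = verbose.findall(text)

# The same flags written inline at the start of the pattern
inline_hits = re.findall(r"(?im)^rain", text)

print(first_hit)        # rain
print(all_hits)         # ['rain', 'rain']
print(ignore_case)      # ['Rain', 'rain', 'rain', 'RAIN']
print(start_of_string)  # ['Rain']
print(start_of_lines)   # ['Rain', 'rain']
print(verbose_hits)     # ['Rain', 'rain']
print(inline_hits)      # ['Rain', 'rain']
```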
All around! (which is why we are learning them)
Particularly:
- In almost any text editor.
- In Google Docs and LibreOffice (but surprisingly not in MS Office).
- In dedicated tools (like the regex101 we used here)
- In each and every serious programming language.
- In each and every IDE (once you start programming).
- In CLI tools like `grep` (find matching files/lines) or `sed` (find & replace).
- In many digital corpora and text collections.
We will now focus on using Regex in Python, with the help of the library re. To start, go to the "exercises" folder - here you will find a Jupyter Notebook called "Regular_Expressions_in_Python_CBS4DH.ipynb". Download it (or better find it in your cloned repository) and open it with Google Colab or - in case you already have Python installed - with Visual Studio Code.
Many digital corpora and collections offer the possibility to optimize or expand one's full text or metadata search through regular expressions. Here are two exemplary German resources and one exemplary English resource where this can be applied:
- Deutsches Textarchiv (DTA): Corpus | Documentation
- Wienerisches DIGITARIUM: Corpus | Documentation
- Brown Corpus: Corpus | Documentation
Note
Different corpora and collections often involve specific ways to search with regex - (reading the) documentation is key!
Tasks
- Use Regex to find the longest word given within each of these corpora - which words are the longest and how many characters do they (approximately) have? Which differences do you witness between the different corpora during your search?
- Historically, the German word Kurier has appeared in a variety of spelling variants, such as: Courrier, Currier, Curier, Curir, Courier, Courir, Kourrier, Kurrier, Kurier, Kurir, Kourier, Kourir, Courrir, Currir, Kourrir, Kurrir. Formulate one regular expression to catch all of these variants! How many hits do you get within Deutsches Textarchiv?
- Which other research question(s) could you ask with the power of Regex?