LibGuides: Digital Tools for Research: Regular Expressions

Regular Expressions

Regular expressions (regex or regexp for short) are a formal language for describing patterns in text, used to match and replace sequences of characters. Many text editors, including Google Docs, Notepad++, Sublime Text, Geany, Brackets, and Atom, support regular expressions. They are very useful for validating, cleaning, and restructuring text data.

Turning on Regex Mode in a Text Editor

Ctrl+F — search
Ctrl+H — replace

Notepad++

Sublime Text

Online Regex Testers

Alternatively, you can use https://regex101.com/ or any other online regex tester. These websites are great for learning, because they provide a detailed explanation for every bit of your regular expression.

Symbol Ranges

Syntax	Description
.	any single character
A|B	match either A (everything on the left) or B (everything on the right)
[ABC]	any single character from those in brackets
[^ABC]	any single character except those enclosed in brackets
[A-Z]	any single uppercase basic Latin character
[a-z]	any single lowercase basic Latin character
[0-9] or \d	a single digit
[^0-9] or \D	any single character except a digit
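
The ranges above can be tried outside a text editor as well. The following is a minimal sketch that assumes Python's built-in `re` module, which the guide itself does not mention; the sample string is made up for illustration. Note that in Python 3 `\d` also matches digits from other scripts, so `[0-9]` is the stricter choice.

```python
import re

text = "Room B12, floor 3"

# [0-9] matches one digit per match, so each digit is returned separately.
digits = re.findall(r"[0-9]", text)

# [^0-9] is the negated range: every character that is not a digit.
non_digits = "".join(re.findall(r"[^0-9]", text))

# A|B tries the whole left alternative first, then the whole right one.
words = re.findall(r"Room|floor", text)

# [ABC] lists the allowed characters explicitly.
vowels = re.findall(r"[aeiou]", "regex")

print(digits)            # ['1', '2', '3']
print(repr(non_digits))  # 'Room B, floor '
print(words)             # ['Room', 'floor']
print(vowels)            # ['e', 'e']
```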

You can combine ranges:

Syntax	Description
[A-Za-z]	any single uppercase or lowercase character from the basic Latin alphabet
[A-Za-z0-9]	any single uppercase or lowercase character from the basic Latin alphabet, or a digit
[A-Za-z0-9_] or \w	any single uppercase or lowercase character from the basic Latin alphabet, a digit, or _
[^A-Za-z0-9_] or \W	any single character except uppercase or lowercase basic Latin characters, digits, and _
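
Whether `\w` is exactly the same as `[A-Za-z0-9_]` depends on the tool. As a hedged illustration, the sketch below assumes Python's `re` module, where `\w` is Unicode-aware by default and only narrows to the basic Latin set when the `re.ASCII` flag is passed. The sample text is invented.

```python
import re

# "naïve café" written with escapes so the accented letters are unambiguous.
text = "user_01 na\u00efve caf\u00e9"

# The explicit range stops at every accented letter.
explicit = re.findall(r"[A-Za-z0-9_]+", text)

# In Python 3, \w also accepts letters outside basic Latin.
unicode_aware = re.findall(r"\w+", text)

# With re.ASCII, \w behaves like [A-Za-z0-9_] again.
ascii_only = re.findall(r"\w+", text, flags=re.ASCII)

print(explicit)       # ['user_01', 'na', 've', 'caf']
print(unicode_aware)  # ['user_01', 'naïve', 'café']
print(ascii_only)     # ['user_01', 'na', 've', 'caf']
```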

Tip: regular expressions operate on Unicode character ranges, so you can build custom ranges using Unicode blocks as a reference.

Syntax	Description
[А-Я]	any uppercase character from the basic Cyrillic alphabet (Ё lies outside this range)
[а-я]	any lowercase character from the basic Cyrillic alphabet (ё lies outside this range)
[\u1680-\u169c]	Ogham alphabet
[\u0250-\u02af]	IPA Extensions block (International Phonetic Alphabet symbols)
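
Custom ranges follow code-point order, which does not always match alphabetical order. The sketch below, again assuming Python's `re` module and an invented sample string, shows that the letter Ё is not covered by `[А-Яа-я]` and has to be listed separately.

```python
import re

text = "Привет, world! Ёлка"

# А-Я and а-я cover U+0410 to U+044F; Ё (U+0401) and ё (U+0451) fall outside.
basic = re.findall(r"[А-Яа-я]+", text)

# Adding Ё and ё explicitly closes the gap.
with_yo = re.findall(r"[А-Яа-яЁё]+", text)

print(basic)    # ['Привет', 'лка']
print(with_yo)  # ['Привет', 'Ёлка']
```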

Groups & Backreferencing

A part of a pattern can be enclosed in parentheses. This is called a capturing group. You can later refer to this group by its number, for example, when you need to swap chunks of text.

Syntax	Description
( )	capturing group
(?: )	non-capturing (passive) group
\1	backreference to the group with the corresponding number

Groups are numbered in the order of their opening parentheses, from left to right.

Here is an example of swapping AB and BA using capturing groups:
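
One way to sketch the swap, assuming Python's `re` module rather than a text editor's replace dialog (some tools write the replacement as `$2$1` instead of `\2\1`). The name example and the doubled-word check are illustrative additions, not part of the original guide.

```python
import re

# (A) is group 1 and (B) is group 2; the replacement emits them in reverse order.
swapped = re.sub(r"(A)(B)", r"\2\1", "AB AB")

# The same idea on a more realistic string: "Last, First" becomes "First Last".
name = re.sub(r"(\w+), (\w+)", r"\2 \1", "Lovelace, Ada")

# A backreference can also be used inside the pattern itself:
# (\w+) captures a word and \1 requires the same word again.
doubled = re.search(r"\b(\w+) \1\b", "this is is a typo")

print(swapped)          # BA BA
print(name)             # Ada Lovelace
print(doubled.group())  # is is
```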

Quantifiers

Syntax	Description
?	the previous character/group may or may not be present
+	the previous character/group may repeat 1 or more times
*	the previous character/group may repeat 0 or more times
{N,M}	the previous character/group may repeat from N to M times, inclusive
{N,}	the previous character/group may repeat N or more times
{,M}	the previous character/group may repeat from zero to M times (some regex flavors require {0,M} instead)
{N}	the previous character/group repeats exactly N times
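
Each quantifier in the table can be checked on a small input. This is a minimal sketch assuming Python's `re` module; all sample strings are invented for illustration.

```python
import re

# ? : the "u" may or may not be present.
optional = [w for w in ["color", "colour", "colouur"] if re.fullmatch(r"colou?r", w)]

# + : one or more "a";  * : an "a" followed by zero or more "b".
one_or_more = re.findall(r"a+", "caaandy bar")
zero_or_more = re.findall(r"ab*", "a ab abbb")

# {N} : exactly four digits;  {N,M} : two to three digits.
exactly_four = re.findall(r"\d{4}", "In 1999 and 2024, not 123")
two_to_three = re.findall(r"\d{2,3}", "7 42 123")

# {N,} : three or more "o".
three_or_more = re.findall(r"o{3,}", "no noo nooo noooo")

# Applied to a group, the quantifier repeats the whole group.
repeated_group = re.fullmatch(r"(ab){2}", "abab") is not None

print(optional)        # ['color', 'colour']
print(one_or_more)     # ['aaa', 'a']
print(zero_or_more)    # ['a', 'ab', 'abbb']
print(exactly_four)    # ['1999', '2024']
print(two_to_three)    # ['42', '123']
print(three_or_more)   # ['ooo', 'oooo']
print(repeated_group)  # True
```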

"Greedy" and "lazy" quantifiers

Quantifiers by default behave greedily: this means that they try to "consume" as many characters as possible and, out of all possible options, return the longest string. To make a quantifier "lazy", i.e. matching the shortest possible string, you need to add a ? after that quantifier.

Greedy Quantifiers	Lazy Quantifiers
*	*?
+	+?
?	??
{N,M}	{N,M}?
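
The difference is easiest to see on text with repeated delimiters. The sketch below assumes Python's `re` module and an invented snippet of markup; it only illustrates greedy versus lazy matching and is not a recommendation to parse HTML with regex.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ runs to the LAST ">" on the line, so everything is one match.
greedy = re.findall(r"<.+>", html)

# Lazy: .+? stops at the FIRST ">" it can, so each tag is matched separately.
lazy = re.findall(r"<.+?>", html)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```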

Be cautious!

Regular expressions can be very helpful, but they can also easily ruin your data, especially if you are bulk-processing many files. Always double-check your regular expression on test data before making irreversible changes!

The most dangerous pattern is .*, which reads as "any character, any number of times from 0 to infinity, as many times as possible" (greedy). By itself, it matches a whole line! The "any character, any number of times" part doesn't mean that the same character has to be repeated: the quantifier applies to the regex element ("any character"), not to a particular match.
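
A cautious workflow is to list the matches first and only then replace, on a copy of the data. The sketch below assumes Python's `re` module; the bracketed-note example and the negated-range alternative are illustrative choices, not something the guide prescribes.

```python
import re

text = "a [x] b [y] c"

# .* on its own matches the entire line, not one repeated character.
whole = re.search(r".*", text).group()

# Greedy: runs from the first "[" to the LAST "]", swallowing " b " as well.
greedy_hits = re.findall(r"\[.*\]", text)

# Safer: a negated range cannot run past the first closing bracket.
safe_hits = re.findall(r"\[[^\]]*\]", text)

# After previewing the matches, replace on a copy and count the changes.
cleaned, count = re.subn(r"\[[^\]]*\]", "", text)

print(whole == text)   # True
print(greedy_hits)     # ['[x] b [y]']
print(safe_hits)       # ['[x]', '[y]']
print(repr(cleaned))   # 'a  b  c'
print(count)           # 2
```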

Special Characters

Syntax	Description
\t	tab
\r	carriage return
\n	new line
\s	any whitespace character
\S	any single character except whitespace
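
These escapes are typical for cleaning up pasted or exported text. A minimal sketch, assuming Python's `re` module and an invented tab-separated sample with Windows line endings:

```python
import re

raw = "name\tage\r\nAda   Lovelace\t36\r\n"

# Normalise Windows line endings (\r\n) to a single \n.
unix = re.sub(r"\r\n", "\n", raw)

# Split the first line on tabs.
header = re.split(r"\t", unix.split("\n")[0])

# \s+ matches any run of whitespace: spaces, tabs, and line breaks alike.
collapsed = re.sub(r"\s+", " ", raw).strip()

# \S+ matches runs of non-whitespace characters.
tokens = re.findall(r"\S+", raw)

print(repr(unix))  # 'name\tage\nAda   Lovelace\t36\n'
print(header)      # ['name', 'age']
print(collapsed)   # name age Ada Lovelace 36
print(tokens)      # ['name', 'age', 'Ada', 'Lovelace', '36']
```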

Anchors

Syntax	Description
^	start of the line
$	end of the line
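
Text editors usually apply anchors line by line, but other tools may treat the whole input as one string. As a hedged example, Python's `re` module (an assumption here, not part of the guide) anchors `^` and `$` to the whole string unless the `re.MULTILINE` flag is set.

```python
import re

text = "alpha  \nbeta \ngamma"

# Without MULTILINE, ^ only matches at the very start of the string.
first_only = re.findall(r"^\w+", text)

# With MULTILINE, ^ matches at the start of every line.
every_line = re.findall(r"^\w+", text, flags=re.MULTILINE)

# " +$" with MULTILINE removes trailing spaces at the end of each line.
trimmed = re.sub(r" +$", "", text, flags=re.MULTILINE)

print(first_only)     # ['alpha']
print(every_line)     # ['alpha', 'beta', 'gamma']
print(repr(trimmed))  # 'alpha\nbeta\ngamma'
```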

Escaping Syntax Elements

As you've already noticed, like any language, regular expressions are written using a special alphabet—dots, asterisks, parentheses, etc. But what if you need to find special characters like + or * in the text? It's simple: you need to escape them by placing a backslash before them.

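NB! Within square brackets (a range), most syntax elements lose their power, and you don't have to escape them. The exceptions are ], \, ^ (when it comes first), and - (when it sits between two characters).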

.*? matches any character any number of times from 0 to infinity, as few times as possible (lazy)
\.\*\? matches the sequence of 3 characters .*? literally
[.*?] matches a single character that can be either a dot, or an asterisk, or a question mark
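
The three forms above behave as follows in practice. The sketch assumes Python's `re` module; `re.escape` is a Python helper that adds the backslashes for you and is shown as a convenience, and the sample sentence is invented.

```python
import re

text = "Is 2*3+1 equal to 7? Yes."

# Escaped with backslashes: each symbol is matched literally.
escaped_hits = re.findall(r"\*|\+|\?", text)

# Inside square brackets the same symbols need no backslash.
bracket_hits = re.findall(r"[*+?]", text)

# \.\*\? matches the three literal characters .*? in a row.
literal = re.search(r"\.\*\?", "the lazy form is .*? here").group()

# re.escape() escapes every special character in a string automatically.
safe_pattern = re.escape("2*3+1")
found = re.search(safe_pattern, text).group()

print(escaped_hits)  # ['*', '+', '?']
print(bracket_hits)  # ['*', '+', '?']
print(literal)       # .*?
print(safe_pattern)  # 2\*3\+1
print(found)         # 2*3+1
```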

Cheatsheets

Regex testers

When using these websites, you'll see both the matches and a detailed explanation of your regular expression.

Practice

Regex library

This website provides a library of regular expressions, which you can search by keywords, e.g. 'time': https://regexlib.com/
