
Digital Tools for Research

This guide provides information about digital tools that can be useful for research data management and analysis.

Regular Expressions

Regular expressions (shortened as regex or regexp) are a formal language for describing patterns in sequences of characters, so that matching text can be found and replaced. Many text editors and word processors, such as Notepad++, Sublime Text, Geany, Brackets, Atom, and Google Docs, support regular expressions. They can be very useful for validating, cleaning, and restructuring text data.

Turning on Regex Mode in a Text Editor

  • Ctrl+F — search
  • Ctrl+H — replace

Notepad++

  

Sublime Text

[Screenshot: Replace panel in Sublime Text]

Online Regex Testers

Alternatively, you can use https://regex101.com/ or any other online regex tester. These websites are great for learning, because they provide a detailed explanation for every bit of your regular expression.

Search

Symbol Ranges

Syntax Description
. any single character
A|B match either A (everything on the left) or B (everything on the right)
[ABC] any single character from those in brackets
[^ABC] any single character except those enclosed in brackets
[A-Z] any single uppercase basic Latin character
[a-z] any single lowercase basic Latin character
[0-9] or \d a single digit
[^0-9] or \D any single character except a digit

You can combine ranges:

Syntax Description
[A-Za-z] any single uppercase or lowercase character from basic Latin alphabet
[A-Za-z0-9] any single uppercase or lowercase character from basic Latin alphabet, and digits
[A-Za-z0-9_] or \w any single uppercase or lowercase character from basic Latin alphabet, digits, and _
[^A-Za-z0-9_] or \W any single character except uppercase or lowercase basic Latin characters, digits, and _
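The ranges above can be tried out in a few lines of Python. This is only a minimal sketch: the sample string is made up for illustration, and it uses the explicit bracket forms because in Python 3 the shorthands \d and \w also match digits and letters from other scripts unless the re.ASCII flag is set.

```python
import re

# Made-up sample string, used only for illustration.
sample = "Sample_42: pH 7.4, n=128"

# [0-9] matches one digit at a time.
digits = re.findall(r"[0-9]", sample)

# [A-Z] matches one uppercase basic Latin letter at a time.
uppercase = re.findall(r"[A-Z]", sample)

# [^A-Za-z0-9_] matches any single character that is NOT a letter, digit, or underscore.
non_word = re.findall(r"[^A-Za-z0-9_]", sample)

# Adding + (one or more) after the combined range collects whole "words".
word_chars = re.findall(r"[A-Za-z0-9_]+", sample)

print(digits)      # ['4', '2', '7', '4', '1', '2', '8']
print(uppercase)   # ['S', 'H']
print(non_word)    # [':', ' ', ' ', '.', ',', ' ', '=']
print(word_chars)  # ['Sample_42', 'pH', '7', '4', 'n', '128']
```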

Tip: ranges in regular expressions are defined by Unicode code points, so you can create custom ranges using Unicode blocks as a reference.

Syntax Description
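[А-Я] any single uppercase character from the basic Cyrillic alphabet (Ё lies outside this range and has to be added separately)
[а-я] any single lowercase character from the basic Cyrillic alphabet (ё lies outside this range and has to be added separately)
[\u1680-\u169c] any single character of the Ogham alphabet
[\u0250-\u02af] any single character from the IPA Extensions block (International Phonetic Alphabet)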
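As a rough illustration of custom Unicode ranges, the sketch below uses Python's re module, which accepts both literal characters and \uXXXX escapes inside brackets. The sample strings are invented for the example.

```python
import re

# Invented sample text mixing Cyrillic and Latin words.
text = "Привет, world! Ёлка"

# The basic range stops short of Ё, so the last word loses its first letter.
basic = re.findall(r"[А-Яа-я]+", text)

# Listing Ё and ё explicitly next to the ranges fixes that.
extended = re.findall(r"[А-Яа-яЁё]+", text)

# A range written with \u escapes: the IPA Extensions block.
ipa = re.findall(r"[\u0250-\u02af]", "ʃɪp")

print(basic)     # ['Привет', 'лка']
print(extended)  # ['Привет', 'Ёлка']
print(ipa)       # ['ʃ', 'ɪ']
```

The plain letter p is not returned in the last example because it belongs to the Basic Latin block, not to IPA Extensions.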

Groups & Backreferencing

A part of a pattern can be enclosed in parentheses. This is called a capturing group. You can later refer to this group by its number, for example, when you need to swap chunks of text.

Syntax Description
( ) capturing group
(?: ) non-capturing (passive) group
\1 backreference to the group with the corresponding number

Groups are numbered from left to right, in the order of their opening parentheses.

Here is an example of swapping AB and BA using capturing groups:
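A minimal Python sketch of such a swap follows. The input strings are invented for illustration, and note that some editors write the replacement references as $1 and $2 instead of \1 and \2.

```python
import re

# (A) is group 1 and (B) is group 2; the replacement puts them back in reverse order.
swapped = re.sub(r"(A)(B)", r"\2\1", "AB AB")

# The same idea on invented "Surname, Given name" data.
names = "Doe, Jane\nSmith, John"
reordered = re.sub(r"(\w+), (\w+)", r"\2 \1", names)

print(swapped)    # BA BA
print(reordered)  # prints "Jane Doe" and "John Smith" on separate lines
```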

 

Quantifiers

Syntax Description
? the previous character/group may or may not be present
+ the previous character/group may repeat 1 or more times
* the previous character/group may repeat 0 or more times
{N,M} the previous character/group may repeat from N to M times, inclusive
{N,} the previous character/group may repeat N or more times
{,M} the previous character/group may repeat from zero to M times (not supported by every regex engine; {0,M} is the portable form)
{N} the previous character/group repeats exactly N times
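The quantifiers in the table can be checked with a short Python sketch. The word lists are made up, and re.fullmatch is used so that a string counts only if the whole of it fits the pattern.

```python
import re

# ? makes the preceding character optional: both spellings match, the typo does not.
spellings = ["color", "colour", "colouur"]
optional_u = [w for w in spellings if re.fullmatch(r"colou?r", w)]

# + needs at least one "b"; * also accepts none.
one_or_more = [s for s in ["a", "ab", "abbb"] if re.fullmatch(r"ab+", s)]
zero_or_more = [s for s in ["a", "ab", "abbb"] if re.fullmatch(r"ab*", s)]

# {2,3} accepts two or three digits; {4} accepts exactly four.
numbers = ["7", "42", "128", "2048"]
two_to_three = [n for n in numbers if re.fullmatch(r"[0-9]{2,3}", n)]
exactly_four = [n for n in numbers if re.fullmatch(r"[0-9]{4}", n)]

print(optional_u)    # ['color', 'colour']
print(one_or_more)   # ['ab', 'abbb']
print(zero_or_more)  # ['a', 'ab', 'abbb']
print(two_to_three)  # ['42', '128']
print(exactly_four)  # ['2048']
```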

"Greedy" and "lazy" quantifiers

Quantifiers by default behave greedily: this means that they try to "consume" as many characters as possible and, out of all possible options, return the longest string. To make a quantifier "lazy", i.e. matching the shortest possible string, you need to add a ? after that quantifier.

Greedy Quantifiers Lazy Quantifiers
* *?
+ +?
? ??
{N,M} {N,M}?
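The difference between greedy and lazy matching is easiest to see on a string with several possible end points. The following Python sketch uses an invented HTML-like fragment for that purpose.

```python
import re

# Invented fragment with two pairs of tags.
fragment = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ runs to the LAST ">" in the string, so everything becomes one match.
greedy = re.findall(r"<.+>", fragment)

# Lazy: .+? stops at the FIRST ">" it can, so each tag is matched separately.
lazy = re.findall(r"<.+?>", fragment)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```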

Be cautious!

Regular expressions can be very helpful, but they can also easily ruin your data, especially if you are bulk-processing many files. Always double-check your regular expression on test data before making irreversible changes!

The most dangerous pattern is .*, which reads as "any character, repeated zero or more times, as many times as possible" (greedy). By itself, it matches the whole line (in most tools the dot does not match a line break by default). The "repeated" part doesn't mean that the same character has to be repeated: the quantifier applies to the regex element ("any character"), not to the particular character that was matched.
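One way to follow this advice is to preview a substitution before applying it. The Python sketch below first shows how much .* swallows, then lists the lines a replacement would change. The helper name preview_substitution and the sample lines are hypothetical, chosen only for this illustration.

```python
import re

# .* on its own takes the whole line, whatever mix of characters it contains...
whole_line = re.match(r".*", "abc 123 !?").group()

# ...but stops at the line break, because "." does not match "\n" by default.
first_line_only = re.match(r".*", "line one\nline two").group()


def preview_substitution(pattern, replacement, lines):
    """Return (before, after) pairs for the lines a substitution would change.

    Hypothetical helper: nothing is written anywhere, the result is only a preview.
    """
    changes = []
    for line in lines:
        new_line, count = re.subn(pattern, replacement, line)
        if count and new_line != line:
            changes.append((line, new_line))
    return changes


# Made-up test data standing in for a small sample of the real files.
sample_lines = ["id: 001", "name: test", "id: 002"]
planned = preview_substitution(r"id: ([0-9]+)", r"ID=\1", sample_lines)

print(whole_line)       # abc 123 !?
print(first_line_only)  # line one
for before, after in planned:
    print(before, "->", after)
```

Reviewing the (before, after) pairs on a sample is cheap, and the real replacement can be run once the preview looks right.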

Special Characters

Syntax Description
\t tab
\r carriage return
\n new line
\s any whitespace character
\S any single character except whitespace
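These special characters are typically used for cleaning up whitespace. The Python sketch below works on an invented tab-separated snippet with Windows line endings.

```python
import re

# Invented snippet: tabs, a trailing space, and Windows (\r\n) line endings.
raw = "name\tvalue \r\nalpha\t 1\r\n"

# Replace every \r\n pair with a single \n.
unix_endings = re.sub(r"\r\n", "\n", raw)

# Collapse each run of spaces and tabs into one space, leaving line breaks alone.
collapsed = re.sub(r"[ \t]+", " ", unix_endings)

# \s+ treats line breaks as whitespace too, so this flattens everything to one line.
one_line = re.sub(r"\s+", " ", raw).strip()

# \S+ picks out the runs of non-whitespace characters.
tokens = re.findall(r"\S+", raw)

print(repr(collapsed))  # 'name value \nalpha 1\n'
print(one_line)         # name value alpha 1
print(tokens)           # ['name', 'value', 'alpha', '1']
```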

Anchors

Syntax Description
^ start of the line
$ end of the line
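Text editors usually apply anchors line by line. In Python the same behaviour has to be requested with the re.MULTILINE flag; without it, ^ matches only at the very start of the string. The sketch below uses an invented three-line text.

```python
import re

# Invented text: two lines starting with "#", and one line with trailing spaces.
text = "# heading\nbody text  \n# another heading"

# With re.MULTILINE, ^ matches at the start of every line.
headings = re.findall(r"^# .*", text, flags=re.MULTILINE)

# Without the flag, ^ matches only at the start of the whole string.
first_only = re.findall(r"^# .*", text)

# $ with re.MULTILINE matches at the end of every line: strip trailing spaces and tabs.
trimmed = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)

print(headings)       # ['# heading', '# another heading']
print(first_only)     # ['# heading']
print(repr(trimmed))  # '# heading\nbody text\n# another heading'
```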

Escaping Syntax Elements

As you've already noticed, like any language, regular expressions are written using a special alphabet—dots, asterisks, parentheses, etc. But what if you need to find special characters like + or * in the text? It's simple: you need to escape them by placing a backslash before them.

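NB! Within square brackets (a range), most syntax elements lose their special meaning, and you don't have to escape them. The exceptions are the closing bracket ], the backslash \, a ^ placed first, and a - placed between two characters.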

  • .*? matches any character, repeated zero or more times, as few times as possible (lazy)
  • \.\*\? matches the sequence of 3 characters .*? literally
  • [.*?] matches a single character that can be a dot, an asterisk, or a question mark
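The three cases above can be reproduced in Python. The sketch also uses re.escape, a standard-library function that adds the backslashes for you, which is convenient when the text to search for comes from data rather than being typed by hand. The sample strings are invented.

```python
import re

# Escaped: finds the three characters .*? literally.
literal = re.findall(r"\.\*\?", "the pattern .*? is lazy")

# Inside brackets no escaping is needed: each special character matches individually.
singles = re.findall(r"[.*?]", "a.b*c?d")

# Unescaped, + is a quantifier: "2+2" means "one or more 2s, then a 2".
unescaped = re.findall(r"2+2", "2+2=4 and 22")

# re.escape() turns the search text into a pattern that matches it literally.
escaped_pattern = re.escape("2+2")
escaped = re.findall(escaped_pattern, "2+2=4 and 22")

print(literal)          # ['.*?']
print(singles)          # ['.', '*', '?']
print(unescaped)        # ['22']
print(escaped_pattern)  # 2\+2
print(escaped)          # ['2+2']
```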

Cheatsheets

Regex testers

When using these websites, you'll see both the matches and a detailed explanation of your regular expression.

Practice

Regex library

This website provides a library of regular expressions, which you can search by keywords, e.g. 'time': https://regexlib.com/

Documentation