Regular expressions (shortened to regex or regexp) are a formal language for matching and replacing sequences of characters that follow specific patterns. Many editors, such as Google Docs, Notepad++, Sublime Text, Geany, Brackets, and Atom, support regular expressions. They can be very useful for validating, cleaning, and restructuring text data.
Alternatively, you can use https://regex101.com/ or any other online regex tester. These websites are great for learning, because they provide a detailed explanation for every bit of your regular expression.
Syntax | Description |
---|---|
. | any single character |
A|B | match either A (everything on the left) or B (everything on the right) |
[ABC] | any single character from those in brackets |
[^ABC] | any single character except those enclosed in brackets |
[A-Z] | any single uppercase basic Latin character |
[a-z] | any single lowercase basic Latin character |
[0-9] or \d | a single digit |
[^0-9] or \D | any single character except a digit |
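The classes in the table above can be tried out in a few lines of Python. This is a minimal sketch using the standard `re` module; the sample string and variable names are invented for illustration.

```python
import re

text = "Room B12, floor 3"

# [0-9] (same as \d for this ASCII text) matches one digit at a time.
digits = re.findall(r"[0-9]", text)

# [A-Z] matches one uppercase basic Latin letter.
uppercase = re.findall(r"[A-Z]", text)

# [^0-9] matches any single character that is not a digit.
non_digits = "".join(re.findall(r"[^0-9]", text))

# A|B tries everything on the left, then everything on the right.
words = re.findall(r"Room|floor", text)

print(digits)      # ['1', '2', '3']
print(uppercase)   # ['R', 'B']
print(non_digits)  # Room B, floor
print(words)       # ['Room', 'floor']
```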
You can combine ranges:
Syntax | Description |
---|---|
[A-Za-z] | any single uppercase or lowercase character from basic Latin alphabet |
[A-Za-z0-9] | any single uppercase or lowercase character from basic Latin alphabet, and digits |
[A-Za-z0-9_] or \w | any single uppercase or lowercase character from basic Latin alphabet, digits, and _ |
[^A-Za-z0-9_] or \W | any single character except uppercase or lowercase basic Latin characters, digits, and _ |
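Regex engines differ in how they treat `\w`. As a hedged illustration, in Python 3 `\w` on text strings also matches non-Latin letters unless the `re.ASCII` flag is set, in which case it behaves like the explicit range from the table. The sample string is invented.

```python
import re

# "user_42" followed by a Cyrillic word (written with escapes for safety).
sample = "user_42 \u043f\u0440\u0438\u0432\u0435\u0442!"

# Explicit range: only basic Latin letters, digits, and underscore.
explicit = re.findall(r"[A-Za-z0-9_]+", sample)

# Python 3 default: \w is Unicode-aware, so the Cyrillic word matches too.
unicode_words = re.findall(r"\w+", sample)

# With re.ASCII, \w is restricted to [A-Za-z0-9_].
ascii_words = re.findall(r"\w+", sample, flags=re.ASCII)

print(explicit)       # ['user_42']
print(unicode_words)  # two items: 'user_42' and the Cyrillic word
print(ascii_words)    # ['user_42']
```

If your tool behaves differently from the table, check whether it has a similar Unicode or ASCII setting.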
Tip: regular expressions operate on ranges of Unicode code points, so you can create custom ranges using Unicode blocks as a reference.
Syntax | Description |
---|---|
[А-Я] | any uppercase character from basic Cyrillic alphabet (except Ё, which lies outside this range) |
[а-я] | any lowercase character from basic Cyrillic alphabet (except ё, which lies outside this range) |
[\u1680-\u169c] | Ogham alphabet |
[\u0250-\u02af] | International Phonetic Alphabet (IPA) |
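A short Python sketch of custom Unicode ranges follows. The ranges are written with `\u` escapes so the code points are explicit; the test word and variable names are illustrative assumptions. It also shows that a range only covers the code points between its endpoints, so the Cyrillic letter ё (U+0451) has to be added separately.

```python
import re

# U+0410..U+042F is А-Я and U+0430..U+044F is а-я.
cyrillic = re.compile(r"[\u0410-\u042f\u0430-\u044f]+")

# The same range plus Ё (U+0401) and ё (U+0451), which sit outside it.
cyrillic_with_yo = re.compile(r"[\u0410-\u042f\u0430-\u044f\u0401\u0451]+")

word = "\u0451\u043b\u043a\u0430"  # a four-letter word starting with ё

partial = cyrillic.findall(word)       # the leading ё is skipped
full = cyrillic_with_yo.findall(word)  # the whole word matches

# Ogham block from the table above.
ogham = re.compile(r"[\u1680-\u169c]+")
has_ogham = bool(ogham.search("text with \u1691\u1688 inside"))

print(len(partial[0]), len(full[0]), has_ogham)  # 3 4 True
```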
A part of a pattern can be enclosed in parentheses. This is called a capturing group. You can later refer to this group by its number, for example, when you need to swap chunks of text.
Syntax | Description |
---|---|
( ) | capturing group |
(?: ) | non-capturing (passive) group |
\1 | group with the corresponding number |
Groups are numbered by the opening parenthesis.
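To see the numbering rule in action, here is a small Python sketch with nested groups and a non-capturing group. The sample strings are invented.

```python
import re

# Opening parentheses, left to right:
# group 1 = whole date, group 2 = year, group 3 = month.
match = re.search(r"((\d{4})-(\d{2}))", "Issued 2024-05")
groups = (match.group(1), match.group(2), match.group(3))

# A non-capturing group (?:...) is skipped when groups are numbered,
# so the surname is group 1 here.
title_match = re.search(r"(?:Mr|Ms)\. (\w+)", "Ms. Murphy")
surname = title_match.group(1)

print(groups)   # ('2024-05', '2024', '05')
print(surname)  # Murphy
```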
Here is an example of swapping AB and BA using capturing groups:
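A minimal Python sketch of such a swap, using `re.sub` with numbered back-references. The sample data is invented, and some editors write the reference in the replacement field as `$1` rather than `\1`.

```python
import re

# Swap two literal chunks: every "AB" becomes "BA".
swapped_letters = re.sub(r"(A)(B)", r"\2\1", "AB AB")

# The same idea on real text: "Surname, Forename" -> "Forename Surname".
names = "Joyce, James\nBeckett, Samuel"
swapped_names = re.sub(r"(\w+), (\w+)", r"\2 \1", names)

print(swapped_letters)  # BA BA
print(swapped_names)
# James Joyce
# Samuel Beckett
```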
Syntax | Description |
---|---|
? | the previous character/group may or may not be present |
+ | the previous character/group may repeat 1 or more times |
* | the previous character/group may repeat 0 or more times |
{N,M} | the previous character/group may repeat from N to M times, inclusive |
{N,} | the previous character/group may repeat N or more times |
{,M} | the previous character/group may repeat from zero to M times (not supported by every regex engine; {0,M} is the portable form) |
{N} | the previous character/group repeats exactly N times |
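The quantifiers above can be checked with `re.fullmatch`, which succeeds only if the whole string fits the pattern. This is an illustrative Python sketch; the sample lists are invented.

```python
import re

spellings = ["color", "colour", "colouur"]
# u? : the "u" may or may not be present.
optional_u = [s for s in spellings if re.fullmatch(r"colou?r", s)]

codes = ["7", "42", "123", "1234"]
# {2,3}: from two to three digits, inclusive.
two_to_three = [c for c in codes if re.fullmatch(r"\d{2,3}", c)]
# {2,}: two or more digits.
at_least_two = [c for c in codes if re.fullmatch(r"\d{2,}", c)]
# {3}: exactly three digits.
exactly_three = [c for c in codes if re.fullmatch(r"\d{3}", c)]

# + needs at least one "b"; * also accepts none.
plus_matches = bool(re.fullmatch(r"ab+", "a"))
star_matches = bool(re.fullmatch(r"ab*", "a"))

print(optional_u)     # ['color', 'colour']
print(two_to_three)   # ['42', '123']
print(at_least_two)   # ['42', '123', '1234']
print(exactly_three)  # ['123']
print(plus_matches, star_matches)  # False True
```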
Quantifiers by default behave greedily: this means that they try to "consume" as many characters as possible and, out of all possible options, return the longest string. To make a quantifier "lazy", i.e. matching the shortest possible string, you need to add a ? after that quantifier.
Greedy Quantifiers | Lazy Quantifiers |
---|---|
* | *? |
+ | +? |
? | ?? |
{min,max} | {min,max}? |
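The difference between greedy and lazy matching is easiest to see on text with repeated delimiters. A small Python sketch, with an invented sample string:

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ runs to the last ">" in the string.
greedy = re.findall(r"<.+>", html)

# Lazy: .+? stops at the first ">" it can.
lazy = re.findall(r"<.+?>", html)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```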
Regex can be very helpful, but you can also easily ruin your data with them — especially if you are bulk-processing many files. Always double-check your regular expression on test data before making irreversible changes!
The most dangerous pattern is .*, which reads as "any character, any number of times from 0 to infinity, as many times as possible" (greedy). By itself, it matches an entire line (or the entire text, if the dot is set to match line breaks as well). "Any character, any number of times" does not mean that the same character has to be repeated: the quantifier applies to the regex element ("any character"), not to a particular match!
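The following Python sketch shows how much .* consumes, and one common way to limit it: a negated range that cannot run past a delimiter. The sample strings are invented, and `re.DOTALL` is Python's name for the "dot matches line breaks" option.

```python
import re

text = "title: Ulysses\nauthor: Joyce"

# By default the dot stops at a line break, so .* takes the whole first line.
first_line = re.search(r".*", text).group()

# With re.DOTALL the dot also matches "\n", so .* takes everything.
everything = re.search(r".*", text, flags=re.DOTALL).group()

quoted = 'say "a" and "b"'
# Greedy .* runs from the first quote to the last one.
too_much = re.findall(r'".*"', quoted)
# [^"]* cannot cross a quote, so each quoted chunk is matched separately.
bounded = re.findall(r'"[^"]*"', quoted)

print(first_line)          # title: Ulysses
print(everything == text)  # True
print(too_much)            # ['"a" and "b"']
print(bounded)             # ['"a"', '"b"']
```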
Syntax | Description |
---|---|
\t | tab |
\r | carriage return |
\n | new line |
\s | any whitespace character |
\S | any single character except whitespace |
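A typical use of these classes is cleaning up irregular spacing. A minimal Python sketch, assuming an invented sample string that mixes tabs, spaces, and line breaks:

```python
import re

messy = "one\ttwo  \n three\r\nfour"

# \s+ matches any run of whitespace: spaces, tabs, \r, \n.
tokens = re.split(r"\s+", messy)

# Collapse every run of whitespace into a single space.
normalized = re.sub(r"\s+", " ", messy)

# \S+ matches runs of non-whitespace characters.
chunks = re.findall(r"\S+", messy)

print(tokens)      # ['one', 'two', 'three', 'four']
print(normalized)  # one two three four
print(chunks)      # ['one', 'two', 'three', 'four']
```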
Syntax | Description |
---|---|
^ | start of the line |
$ | end of the line |
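Whether ^ and $ work per line or only at the edges of the whole text depends on the tool. As a hedged example, Python needs the `re.MULTILINE` flag for line-by-line behaviour; the sample text is invented.

```python
import re

text = "apple pie\nbanana split\napple tart"

# Without re.MULTILINE, ^ matches only at the very start of the text.
default_starts = re.findall(r"^apple", text)

# With re.MULTILINE, ^ and $ match at the start and end of every line.
line_starts = re.findall(r"^apple", text, flags=re.MULTILINE)
line_ends = re.findall(r"\w+$", text, flags=re.MULTILINE)

print(default_starts)  # ['apple']
print(line_starts)     # ['apple', 'apple']
print(line_ends)       # ['pie', 'split', 'tart']
```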
As you've already noticed, like any language, regular expressions are written using a special alphabet—dots, asterisks, parentheses, etc. But what if you need to find special characters like + or * in the text? It's simple: you need to escape them by placing a backslash before them.
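NB! Within square brackets (a range), most syntax elements lose their special meaning, and you don't have to escape them. The exceptions are ^ at the very start, - between two characters, the closing bracket ], and the backslash itself.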
Syntax | Description |
---|---|
.*? | matches any character any number of times from 0 to infinity, as few times as possible (lazy) |
\.\*\? | matches the sequence of 3 characters .*? literally |
[.*?] | matches a single character that is either a dot, an asterisk, or a question mark |
When using an online regex tester, you'll see both the matches and a detailed explanation of your regular expression.
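The escaping rules described above can be checked in Python. In this sketch the sample strings are invented, and `re.escape` is the standard-library helper that adds the backslashes for you.

```python
import re

formula = "Total: 2+2*3 = 8 (approx.)"

# A backslash turns + into an ordinary character.
literal_plus = re.findall(r"\+", formula)

# Inside square brackets, + * . need no escaping.
in_brackets = re.findall(r"[+*.]", formula)

# re.escape builds the escaped form of a literal string.
escaped = re.escape(".*?")
found = re.findall(escaped, "the pattern .*? is lazy")

print(literal_plus)  # ['+']
print(in_brackets)   # ['+', '*', '.']
print(escaped)       # \.\*\?
print(found)         # ['.*?']
```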
This website provides a library of regular expressions, which you can search by keywords, e.g. 'time': https://regexlib.com/