Regular expressions (regex or regexp for short) are a formal language for matching and replacing sequences of characters that follow specific patterns. Many text editors, such as Google Docs, Notepad++, Sublime, Geany, Brackets, and Atom, support regular expressions. They can be very useful for validating, cleaning, and restructuring text data.
Alternatively, you can use https://regex101.com/ or any other online regex tester. These websites are great for learning, because they provide a detailed explanation for every bit of your regular expression.
| Syntax | Description |
|---|---|
| . | any single character (in most tools, any character except a line break) |
| A\|B | match either A (everything on the left) or B (everything on the right) |
| [ABC] | any single character from those in brackets |
| [^ABC] | any single character except those enclosed in brackets |
| [A-Z] | any single uppercase basic Latin character |
| [a-z] | any single lowercase basic Latin character |
| [0-9] or \d | a single digit |
| [^0-9] or \D | any single character except a digit |
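
The table above can be tried out with a few lines of Python. This is only a minimal sketch using the standard `re` module; the sample text and variable names are made up for illustration.

```python
import re

text = "Room B12, Floor 3: cat or dog?"

# [0-9] (same as \d here) matches one digit at a time.
digits = re.findall(r"[0-9]", text)

# [A-Z] matches one uppercase basic Latin letter at a time.
uppercase = re.findall(r"[A-Z]", text)

# A|B matches either the whole left side or the whole right side.
pets = re.findall(r"cat|dog", text)

# [^0-9] (same as \D here) matches any single character that is not a digit;
# replacing each such character with "" leaves only the digits.
only_digits = re.sub(r"[^0-9]", "", text)

print(digits, uppercase, pets, only_digits)
```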
You can combine ranges:
| Syntax | Description |
|---|---|
| [A-Za-z] | any single uppercase or lowercase basic Latin letter |
| [A-Za-z0-9] | any single uppercase or lowercase basic Latin letter or digit |
| [A-Za-z0-9_] or \w | any single uppercase or lowercase basic Latin letter, digit, or _ |
| [^A-Za-z0-9_] or \W | any single character except uppercase or lowercase basic Latin letters, digits, and _ |
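
One caveat is worth checking in your own tool: the equivalence between `[A-Za-z0-9_]` and `\w` depends on the regex flavor. In Python 3, for instance, `\w` also matches non-Latin letters unless the `re.ASCII` flag is set. The sketch below assumes Python's `re` module and an invented sample string.

```python
import re

text = "user_42 caf\u00e9"  # "café", with a precomposed e-acute

# The explicit range only knows basic Latin letters, digits, and _.
explicit_words = re.findall(r"[A-Za-z0-9_]+", text)

# In Python 3, \w is Unicode-aware by default, so the accented letter matches too.
default_words = re.findall(r"\w+", text)

# With re.ASCII, \w behaves like the explicit range from the table.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

print(explicit_words, default_words, ascii_words)
```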
Tip: regular expressions operate on ranges of Unicode code points, so you can define custom ranges using the Unicode blocks as a reference.
| Syntax | Description |
|---|---|
| [А-Я] | any uppercase character from basic Cyrillic alphabet |
| [а-я] | any lowercase character from basic Cyrillic alphabet |
| [\u1680-\u169c] | Ogham alphabet |
| [\u0250-\u02af] | International Phonetic Alphabet (IPA) |
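
Because a range follows code point order rather than alphabet order, it can miss letters that sit elsewhere in the Unicode block. A small Python sketch (the sample word is an assumption chosen for illustration) shows that the Cyrillic letter Ё (U+0401) falls outside `[А-Я]` (U+0410 to U+042F) and has to be added explicitly:

```python
import re

# "Привет, Ёж" written with escapes so the code points are unambiguous.
text = "\u041f\u0440\u0438\u0432\u0435\u0442, \u0401\u0436"

# [А-Я] as a code point range: U+0410 to U+042F.
basic_upper = re.findall(r"[\u0410-\u042f]", text)

# The same range with U+0401 (Ё) listed as an extra character.
upper_with_yo = re.findall(r"[\u0410-\u042f\u0401]", text)

print(basic_upper, upper_with_yo)
```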
A part of a pattern can be enclosed in parentheses. This is called a capturing group. You can later refer to this group by its number, for example, when you need to swap chunks of text.
| Syntax | Description |
|---|---|
| ( ) | capturing group |
| (?: ) | non-capturing (passive) group |
| \1 | backreference to the capturing group with the corresponding number |
Groups are numbered from left to right, in the order of their opening parentheses.
For example, to swap AB into BA using capturing groups, search for (A)(B) and replace it with \2\1.
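
A minimal Python sketch of such a swap, assuming the standard `re` module: `re.sub` accepts `\1` and `\2` in the replacement string. Some editors write group references in the replacement field as `$1` and `$2` instead, so check your tool's documentation. The sample strings are invented.

```python
import re

# (A) is group 1 and (B) is group 2; the replacement puts them back in reverse order.
swapped = re.sub(r"(A)(B)", r"\2\1", "AB AB")

# The same idea on more realistic data: "Surname, Name" becomes "Name Surname".
full_name = re.sub(r"(\w+), (\w+)", r"\2 \1", "Joyce, James")

print(swapped)
print(full_name)
```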
| Syntax | Description |
|---|---|
| ? | the previous character/group may or may not be present |
| + | the previous character/group may repeat 1 or more times |
| * | the previous character/group may repeat 0 or more times |
| {N,M} | the previous character/group may repeat from N to M times, inclusive |
| {N,} | the previous character/group may repeat N or more times |
| {,M} | the previous character/group may repeat from zero to M times (not supported by every regex flavor; {0,M} is the portable form) |
| {N} | the previous character/group repeats exactly N times |
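
The quantifiers in the table can be checked with a short Python sketch. The sample strings are made up, and the `(?:...)` non-capturing group is used only so that `re.findall` returns whole matches.

```python
import re

# ? : the "u" may or may not be present.
optional = re.findall(r"colou?r", "color colour")

# * : zero or more "o" characters between "g" and "d".
zero_or_more = re.findall(r"go*d", "gd god good")

# + applied to a group: "ab" repeated one or more times.
repeats = re.findall(r"(?:ab)+", "ab abab b")

# {N} : exactly four digits in a row.
years = re.findall(r"\d{4}", "In 1845 and in 2022")

# {N,M} : runs of two to three digits; "4567" only contributes "456".
runs = re.findall(r"\d{2,3}", "7 42 123 4567")

print(optional, zero_or_more, repeats, years, runs)
```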
Quantifiers by default behave greedily: this means that they try to "consume" as many characters as possible and, out of all possible options, return the longest string. To make a quantifier "lazy", i.e. matching the shortest possible string, you need to add a ? after that quantifier.
| Greedy Quantifiers | Lazy Quantifiers |
|---|---|
| * | *? |
| + | +? |
| ? | ?? |
| {min,max} | {min,max}? |
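
The difference between greedy and lazy matching is easiest to see on text with repeated delimiters. In this Python sketch (the HTML-like sample string is an assumption), the same pattern is run with `+` and with `+?`:

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ runs to the last ">" in the string, so there is one long match.
greedy = re.findall(r"<.+>", html)

# Lazy: .+? stops at the first ">" it can, so each tag is matched separately.
lazy = re.findall(r"<.+?>", html)

print(greedy)
print(lazy)
```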
Regular expressions can be very helpful, but you can also easily ruin your data with them, especially if you are bulk-processing many files. Always double-check your regular expression on test data before making irreversible changes!
The most dangerous pattern is .*, which reads as "any character, any number of times from 0 to infinity, as many times as possible" (greedy). By itself, it will match an entire line (or the entire text, if your tool lets the dot match line breaks)! The "any character, any number of times" part doesn't mean that the same character has to be repeated: the quantifier applies to the regex element ("any character"), not to a particular match.
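
One way to follow this advice is to preview what a pattern matches before replacing anything. The Python sketch below uses an invented sample line: it shows `.*` swallowing more than intended, and a narrower negated class, `[^"]*`, as one possible alternative.

```python
import re

line = 'name="Ann" city="Galway"'

# Dry run: look at what the pattern would match before changing anything.
preview = re.findall(r'".*"', line)

# The greedy .* spans from the first quote to the last one, so the city is lost.
too_much = re.sub(r'".*"', '""', line)

# [^"]* cannot cross a quote, so each quoted value is matched on its own.
narrow_preview = re.findall(r'"[^"]*"', line)
safer = re.sub(r'"[^"]*"', '""', line, count=1)  # count=1: first value only

print(preview, too_much)
print(narrow_preview, safer)
```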
| Syntax | Description |
|---|---|
| \t | tab |
| \r | carriage return |
| \n | new line |
| \s | any whitespace character |
| \S | any single character except whitespace |
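
A common use of these classes is cleaning up mixed whitespace. This is a small Python sketch on a made-up string that contains a tab, a double space, and a Windows-style line ending:

```python
import re

raw = "one\ttwo  three\r\nfour"

# \s+ matches each run of whitespace (tab, spaces, \r\n) as a single unit.
normalized = re.sub(r"\s+", " ", raw)

# \S+ matches each run of non-whitespace characters, i.e. the words.
words = re.findall(r"\S+", raw)

# \t matches only the tab character.
tabs = re.findall(r"\t", raw)

print(normalized)
print(words)
```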
| Syntax | Description |
|---|---|
| ^ | start of the line |
| $ | end of the line |
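
Whether `^` and `$` work per line or per whole text depends on the tool. Text editors usually treat them as line anchors, while in Python they anchor to the whole string unless the `re.MULTILINE` flag is set. The sketch below assumes Python's `re` module and an invented three-line text.

```python
import re

text = "apple pie\npie chart\napple"

# Without flags, ^ only matches at the very start of the string.
string_start = re.findall(r"^apple", text)

# With re.MULTILINE, ^ matches at the start of every line.
line_starts = re.findall(r"^apple", text, flags=re.MULTILINE)

# With re.MULTILINE, $ matches at the end of every line; only line 1 ends in "pie".
line_ends = re.findall(r"pie$", text, flags=re.MULTILINE)

# Without flags, $ only matches at the end of the string, which ends in "apple".
string_end = re.findall(r"pie$", text)

print(string_start, line_starts, line_ends, string_end)
```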
As you've already noticed, like any language, regular expressions are written using a special alphabet—dots, asterisks, parentheses, etc. But what if you need to find special characters like + or * in the text? It's simple: you need to escape them by placing a backslash before them.
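NB! Within square brackets (a character class), most syntax elements lose their power, and you don't have to escape them. The exceptions are the closing bracket ], the backslash \, a ^ at the very start, and a - between two characters.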
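
The escaping rules can be tried out in Python. The sketch below uses made-up sample strings; `re.escape` is a standard helper that adds the backslashes for you when you need to search for a literal piece of text.

```python
import re

text = "2+2=4 and a.b?"

# Unescaped, + is a quantifier: "one or more 2s, then a 2". Nothing matches here.
unescaped = re.findall(r"2+2", text)

# Escaped, \+ is a literal plus sign.
escaped = re.findall(r"2\+2", text)

# re.escape builds the escaped pattern from the literal text.
auto_pattern = re.escape("2+2")
auto_escaped = re.findall(auto_pattern, text)

# Inside square brackets, ., + and ? are literal without backslashes.
in_brackets = re.findall(r"[.+?]", text)

# ] and \ still need a backslash inside brackets; ^ and - are literal here
# only because ^ is not first and - is last.
still_special = re.findall(r"[\]\\^-]", r"a]b\c^d-e")

print(unescaped, escaped, auto_pattern, in_brackets, still_special)
```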
| Syntax | Description |
|---|---|
| .*? | matches any character, any number of times from 0 to infinity, as few times as possible (lazy) |
| \.\*\? | matches the sequence of 3 characters .*? literally |
| [.*?] | matches a single character that can be a dot, an asterisk, or a question mark |

When you use an online regex tester, you'll see both the matches and a detailed explanation of your regular expression.
This website provides a library of regular expressions, which you can search by keywords, e.g. 'time': https://regexlib.com/