Introduction to Regular Expressions (RegEx)

Learning outcomes

  1. Understand the basic syntax and functionality of regular expressions (regex) for pattern matching.
  2. Explore the use of special characters, ranges, and anchors in regex to match specific patterns within text.
  3. Apply regex to search, extract, and manipulate data in various formats using practical examples.
  4. Use regular expressions to navigate and organize files within the filesystem.

Introduction

As with most things, the best way to learn regex is to practice using it. There are a few exercises included in the notebook, and at the end I have also included links to interactive online exercises, which are great for practicing your regexes!

To see what a particular regex is matching and how, you can use one of these two webpages, which both do a great job visualizing and explaining the different parts of a regex match:

  • https://regexr.com/
    • regexr interprets text input as one big string by default, so you need to check “multiline” under “flags” (top right) for it to behave as expected with beginning and end of line matches (it hints at this in the output for both ^ and $).
  • https://regex101.com/
    • regex101 has the “multiline” flag set by default.

Basic matching

  • Basic matching: if you look for a regular string, like banana, regex will match the exact string (including its upper/lower case).
  • Both JupyterLab and VS Code have built-in regex functionality (bring up the search box and click the .* symbol to use regex rather than the default search).
  • When learning regex, it is helpful to use one of the two web tools mentioned in the previous cell to visualize how your regex is matching the text.
  • For this lecture, we will use a list of fruits to learn about regex.
applesas
apple
apricot
banana
bilberry
blackberry
blackcurrant
blood orange
blueberry
canary melon
cantaloupe
cherry
clementine
cloudberry
coconut
cranberry
cucumber
currant
dragonfruit
durian
elderberry
gooseberry
grape
grapefruit
papaya
passionfruit
peach
orange
oranges unripe
persimmon
pineapple
pomegranate
pomelo
purple mangosteen
rock melon
salal berry
satsuma
star fruit
strawberry
watermelon
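
The same basic matching can be tried in Python with the standard `re` module. This is a minimal sketch rather than part of the original lecture: the short `fruits` list borrows a few lines from the list above, and the capitalized "Banana bread" entry is a hypothetical addition to show case sensitivity.

```python
import re

# A few entries from the fruit list, plus one hypothetical capitalized line.
fruits = ["apple", "banana", "blood orange", "Banana bread"]

# re.search returns a match object if the pattern occurs anywhere in the
# string and None otherwise. Matching is case-sensitive by default.
matches = [fruit for fruit in fruits if re.search("banana", fruit)]
print(matches)  # ['banana']
```

"Banana bread" is not matched because the pattern has a lower case b.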

The square brackets: []

  • If you want to specify a set of possible characters, you can use square brackets [].
  • For example, [Aa]pple would match Apple and apple.
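
As a small illustration in Python (the sample sentence is made up for this sketch), a bracket set stands for exactly one character taken from the set:

```python
import re

text = "Apple pie, apple juice, and a pineapple."  # hypothetical sample text

# [Aa] matches exactly one character: either "A" or "a".
found = re.findall(r"[Aa]pple", text)
print(found)  # ['Apple', 'apple', 'apple']
```

The third match comes from inside "pineapple", since nothing in the pattern requires the match to start at the beginning of a word.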

Lecture 8 Exercise 1

Find all the pairs of vowels in the fruit list.

Highlight the black box below to see the correct answer (the black box will not show up on GitHub, so download the notebook unless you want the answer displayed). Remember to use one of the websites linked above to help you understand what your regex is matching (https://regexr.com/ or https://regex101.com/).

[aeiou][aeiou]

Ranges within []

  • You can also define ranges when using brackets. For example:

    • [A-Z]: will match any upper case letter
    • [a-z]: will match any lower case letter
    • [0-9]: will match any digit
    • [0-5]: will match any digit between 0 and 5
  • The order cannot be reversed: [z-A] does not work.

  • You can combine ranges: [A-Za-z].

  • You can use square brackets starting with a caret. For example:

    • [^A-Z]: will match anything that is not an upper case letter
    • [^0-9]: will match anything that is not a digit
    • Note that the caret needs to be inside the bracket, if it is outside it will match the beginning of a line as described under the “Anchors” section below.
  • For the curious, these ranges are ordered based on ASCII codes, where every character is represented by a number. The first printable character in the list is (space) and the last is ~ (tilde). The full list is shown below:

     !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
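
The ranges above can be checked with a short Python sketch. The sample string is hypothetical, and `ord` is used only to show the character codes that determine the ordering.

```python
import re

text = "Room B12, floor 7"  # hypothetical sample text

upper = re.findall(r"[A-Z]", text)        # ['R', 'B']
digits = re.findall(r"[0-9]", text)       # ['1', '2', '7']
low_digits = re.findall(r"[0-5]", text)   # ['1', '2']

# A caret inside the brackets negates the set: remove everything but digits.
only_digits = re.sub(r"[^0-9]", "", text)  # '127'

# Ranges follow character codes, so upper case letters sort before lower case.
codes = (ord("A"), ord("Z"), ord("a"), ord("z"))  # (65, 90, 97, 122)

# A reversed range is rejected when the pattern is compiled.
try:
    re.compile(r"[z-A]")
    reversed_ok = True
except re.error:
    reversed_ok = False

print(upper, digits, low_digits, only_digits, codes, reversed_ok)
```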

Special matching characters

  • A common operation is to match any character (e.g. between two important characters).
  • Instead of writing out the full range [ -~] (space to tilde), the special character . can be used to match any character in the list above.
    • Note that . does not match the newline character, so if you have an expression that continues on the next line it will not be matched.
  • To match a literal . (the period character), you can “escape” its special meaning by prefacing it with a backslash \. (most common) or surrounding it with square brackets [.].
  • Another useful special character is \w, which matches any character that normally occurs inside a word: letters, digits, and the underscore. It does not match spaces or punctuation such as hyphens.
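
A brief Python sketch of these special characters, using made-up sample strings. In a normal Python string `\n` is a newline, while the patterns are written as raw strings (`r"..."`) so the backslashes reach the regex engine unchanged.

```python
import re

text = "a.b axb a\nb"  # hypothetical sample text; \n is a newline

# . matches any character except the newline, so "a\nb" is skipped.
any_char = re.findall(r"a.b", text)       # ['a.b', 'axb']

# Escaping the dot, or putting it in brackets, matches a literal period only.
literal_dot = re.findall(r"a\.b", text)   # ['a.b']
bracket_dot = re.findall(r"a[.]b", text)  # ['a.b']

# \w+ matches runs of word characters; the space and "!" split the runs.
words = re.findall(r"\w+", "blood orange!")  # ['blood', 'orange']

print(any_char, literal_dot, bracket_dot, words)
```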

Lecture 8 Exercise 2

What is the difference between writing [A-Za-z] and [A-z]?

[A-z] will also match the characters [\]^_`, which sit between Z and a in the list above.

Lecture 8 Exercise 3

Match any characters between two _.

_.*_

Anchors

  • The caret outside the brackets means beginning of line. For example, ^apple will match all lines that start with apple, including apple sauce and apples.
  • The dollar sign $ means end of line, e.g., fruit$ will match lines that end with fruit.
  • To remember this, you can use the mnemonic “Start with power (^) and end with money ($)” (originally from Jenny Bryan).
  • Another useful anchor is \b, which matches a word boundary, i.e. the position at the beginning or end of a word.
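
The anchors can be tried out in Python as sketched below. The multi-line string is a hypothetical excerpt built from fruit names used in this lecture. Note that Python, like regexr, treats the input as one big string unless the `re.MULTILINE` flag is set.

```python
import re

lines = "apple\napple sauce\npineapple\nstar fruit\ndragonfruit"

# With re.MULTILINE, ^ and $ match at the start and end of every line.
starts = re.findall(r"^apple.*", lines, flags=re.MULTILINE)
# ['apple', 'apple sauce']

ends = re.findall(r".*fruit$", lines, flags=re.MULTILINE)
# ['star fruit', 'dragonfruit']

# \b on both sides requires "fruit" to stand as a separate word,
# so the "fruit" inside "dragonfruit" is not matched.
separate_word = re.findall(r"\bfruit\b", lines)
# ['fruit']

print(starts, ends, separate_word)
```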

Lecture 8 Exercise 4

Write a regex that will match a line that contains only pineapple. (Hint: you cannot just write pineapple - it will not work - why?)

^pineapple$
Note that if you use just pineapple, lines that also contain other words would match too.

Repetitions

  • To match multiple of the same character, you can either repeat it or use the following syntax:
    • {n}: exactly n occurrences
    • {n,}: at least n occurrences
    • {0,m}: at most m occurrences
    • {n,m}: between n and m (inclusive) occurrences
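
A minimal Python sketch of the brace syntax, using a made-up list of strings. The helper `matching` is hypothetical and only there to keep the example short. It relies on `re.fullmatch`, which requires the whole string to match (the same effect as anchoring the pattern at both ends).

```python
import re

words = ["o", "oo", "ooo", "oooo"]  # hypothetical test strings


def matching(pattern):
    """Return the test strings that the pattern matches in full."""
    return [w for w in words if re.fullmatch(pattern, w)]


exactly_two = matching(r"o{2}")     # ['oo']
at_least_two = matching(r"o{2,}")   # ['oo', 'ooo', 'oooo']
at_most_two = matching(r"o{0,2}")   # ['o', 'oo']
two_to_three = matching(r"o{2,3}")  # ['oo', 'ooo']

print(exactly_two, at_least_two, at_most_two, two_to_three)
```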

Special repetition characters

  • There are some shortcuts for the most common repetitions:
    • ?: means 0 or 1 time ({0,1})
    • *: means 0 or more times ({0,})
    • +: means 1 or more times ({1,})
  • For example, apples? will match apple and apples.
    • By contrast, apples+ will not match apple or appplesq, but it will match apples, appless, applesss, etc.
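
The shortcut characters can be compared side by side in Python. This is a small sketch using the example strings from the bullets above, plus a hypothetical second list for `*`.

```python
import re

candidates = ["apple", "apples", "appless", "appplesq"]

# s? makes the final "s" optional.
optional_s = [w for w in candidates if re.search(r"apples?", w)]
# ['apple', 'apples', 'appless']

# s+ requires at least one "s" directly after "apple".
one_or_more_s = [w for w in candidates if re.search(r"apples+", w)]
# ['apples', 'appless']

# p* allows any number of "p" characters, including none at all.
star = [w for w in ["ale", "aple", "apple"] if re.fullmatch(r"ap*le", w)]
# ['ale', 'aple', 'apple']

print(optional_s, one_or_more_s, star)
```

"appplesq" is never matched because its three p's mean it does not contain the string apple at all.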

Lecture 8 Exercise 5

Find the fruits with names between 10 and 12 characters.

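^.{10,12}$
Note that without the anchors, .{10,12} would also match the first 12 characters of longer lines.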

Lecture 8 Exercise 6

Find the lines with no more than 4 characters.

^.{0,4}$

Lecture 8 Exercise 7

Find all the words that contain at least two consecutive vowels.

[aeiou]{2,}
or
[aeiou][aeiou]+

Lecture 8 Exercise 8

This is a bit harder and derives from all previous sections: Match entire words that end in _.

\w*_\b

Additional exercises

  • Go through the interactive tutorials and practice sessions at https://regexone.com/ that correspond to the topics we have covered during class.
  • The Library Carpentry organization has many regex exercises in all sections of their regex course here: https://librarycarpentry.org/lc-data-intro/ (you can just do the exercises).
