Appendix M — Regular Expressions

Author

Ryan M. Moore, PhD

Published

April 1, 2025

Modified

May 28, 2025

Regular expressions (regex) are powerful tools for working with text data, and are commonly used in bioinformatics applications. They are essentially a specialized mini-language for pattern matching in strings.

Regular expressions allow you to:

Search for specific patterns in text
Validate that strings follow expected patterns
Extract information from strings with consistent formats
Replace or modify parts of strings that match patterns

Let’s take a whirlwind tour of the basics of regular expressions in Python. This will not be comprehensive, but should provide you with just enough of the basics to get some real work done, and be able to understand more comprehensive documentation.

M.1 Importing the Regular Expression module

To use regular expressions, you first need to import the re module.

import re

M.2 Basic Pattern Matching

The simplest use of regex is searching for an exact substring within a larger string:

# This checks if "banana" appears anywhere in the string
if re.search(r"banana", "apple and banana"):
    print("found banana!")

found banana!

The r before the pattern string creates a “raw string”. This is a good practice with regex to avoid problems with backslashes.

M.3 Special Symbols and Character Classes

Regex provides special symbols to match categories of characters or to control the matching in some other way. Here are some of the more common ones.

.: Matches any character except newline
^: Matches at the start of the string
$ Matches at the end of the string
+: Matches one or more of the preceding regex
*: Matches zero or more of the preceding regex
?: Matches zero or one repetitions of the preceding regex
{m}: Matches exactly m copies of the preceding regex
{m,n}: Matches from m to n copies of the preceding regex
\d: Matches any digit (0-9)
\w: Matches any “word character” (alphanumeric + underscore)
\s: Matches any whitespace
[]: Used to indicate a set of characters
(): Used to indicate a capture group

There are many more available in Python, but this set will get you pretty far.

M.4 Anchoring Patterns

Sometimes you want to check if a pattern appears at a specific position in a string. The ^ symbol anchors the match to the beginning of the string:

# Check if "apple" appears at the beginning of the string
# The ^ character is an anchor that means "match at the start of the string"
if re.search(r"^apple", "banana and apple"):
    # This won't execute because "banana and apple" doesn't start with "apple"
    print("apple found at the beginning")
else:
    # This will execute because the pattern doesn't match
    print("apple not found at the beginning")

apple not found at the beginning

The $ symbol anchors the match to the end of the string:

# Check if "banana" appears at the end of the string
# The $ character is an anchor that means "match at the end of the string"
if re.search(r"banana$", "banana and apple"):
    # This won't execute because "banana and apple" doesn't end with "banana"
    print("banana found at the end")
else:
    # This will execute because the pattern doesn't match
    print("banana not found at the end")

banana not found at the end

You can use both anchors to require a full-string exact match:

# Check if the entire string is exactly "abcd"
# Using both ^ and $ together checks if the pattern matches the entire string
# ^ = start anchor, $ = end anchor
if re.search(r"^abcd$", "abcdef"):
    # This won't execute because "abcdef" is not exactly "abcd"
    print("full match")
else:
    # This will execute because the pattern doesn't completely match
    # The string "abcdef" has extra characters beyond "abcd"
    print("not a full match")

not a full match

M.5 Character Classes

You can filter based on groups of characters.

# Loop through a list of sample names
for sample in ["Treatment 1", "Treatment 2", "Control 1", "Control 2"]:
    # Search for the pattern "Treatment" followed by space and one or more digits
    # - \d is a character class that matches any digit
    # - \d+ means "one or more digits"
    if re.search(r"Treatment \d+", sample):
        # This will only print for strings that match our pattern
        # In this case, "Treatment 1" and "Treatment 2" will match
        print("found a treatment sample")

found a treatment sample
found a treatment sample

M.6 Capture Groups

You can pull out specific pieces of info using capture groups.

In this example, we have some strings that represent users. The basic format is name=X;age=Y. Our goal is to first check if the given string matches this format, and if it does, pull out the name and age values from the string. Here is the regex that we can use to do it:

user_info = re.compile(r"name=(\w+);age=(\d+)")

Here is the break down:

\w+: at least one word character
(\w+): “capture” the group of at least one word character (this represents the name)
name=(\w+): match patterns like name=Bob, and extract the name value (Bob)
\d+: at least one digit character
(\d+): “capture” the group of at least one digit (this represents the age)
age=(\d+): match patterns like age=47, and extract the age value (47)
name=(\w+);age=(\d+): combines the above name and age pattern and requires that they are separated with a semicolon
re.compile: compiles the regex so we can use it for matching later

Now, let’s use that regex.

# List of user data strings in different formats
users = ["name=Rafael;age=34", "NAME=Dev,age:25", "name=Page;age=46"]

# Loop through each user string in the list
for user in users:
    # Apply our regex pattern to search for matches in the current string
    # re.search() returns a match object if found, None otherwise
    result = re.search(user_info, user)

    # Check if a match was found
    if result is not None:
        # Extract the first capture group (the name)
        # - This is what matched (\w+) after "name="
        # - result[0] would be the entire match,
        # - result[1] is the first capture group
        user_name = result[1]

        # Extract the second capture group (the age)
        # - This is what matched (\d+) after "age="
        # - result[2] is the second capture group
        user_age = result[2]

        # Display the extracted information
        print(f"name: {user_name}; age: {user_age}")
    else:
        # If no match was found, show which string failed to match our pattern
        # This happens with "NAME=Dev,age:25" since it uses uppercase and
        # different format
        print(f"no match for {user}")

name: Rafael; age: 34
no match for NAME=Dev,age:25
name: Page; age: 46

When we run this code, only the first and third strings match our pattern. The second string uses uppercase “NAME” and has a different format (comma and colon instead of semicolon), so it doesn’t match our regex pattern.

M.7 Wrap-Up

In this section, we covered the basics of using regular expressions in Python including, exact matching, anchored matching, character classes, and capture groups. Examples include searching for patterns, validating string formats, and extracting structured data. This should give you just enough info for you to get started using regular expressions in your own work.