import re
Appendix I — Regular Expressions
Regular expressions (regex) are powerful tools for working with text data, and are commonly used in bioinformatics applications. They are essentially a specialized mini-language for pattern matching in strings.
Regular expressions allow you to:
- Search for specific patterns in text
- Validate that strings follow expected patterns
- Extract information from strings with consistent formats
- Replace or modify parts of strings that match patterns
Let’s take a whirlwind tour of the basics of regular expressions in Python. It is not comprehensive, but it should give you enough of the basics to get real work done and to follow more comprehensive documentation.
Importing the Regular Expression module
To use regular expressions, you first need to import the re module:
import re
Basic Pattern Matching
The simplest use of regex is searching for an exact substring within a larger string:
# This checks if "banana" appears anywhere in the string
if re.search(r"banana", "apple and banana"):
    print("found banana!")
found banana!
The r before the pattern string creates a “raw string”, in which Python does not treat backslashes as escape characters. This is good practice with regex because many regex symbols (such as \d) start with a backslash.
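To make the backslash issue concrete, here is a minimal sketch comparing raw and normal strings. It uses the \b word-boundary symbol, which is not otherwise covered in this appendix, purely as an illustration; the sample text is made up.

```python
import re

# In a normal string, "\n" is one character (a newline).
# In a raw string, r"\n" is two characters: a backslash and the letter n.
normal = "\n"
raw = r"\n"
print(len(normal))  # 1
print(len(raw))     # 2

# In a normal string, "\b" is a backspace character, so the regex engine
# never sees the word-boundary symbol unless the pattern is a raw string.
text = "the cat sat"
with_raw = re.search(r"\bcat\b", text)
without_raw = re.search("\bcat\b", text)
print(with_raw is not None)     # True
print(without_raw is not None)  # False
```

Without the r prefix, the second search looks for literal backspace characters around "cat", which the text does not contain.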
Special Symbols and Character Classes
Regex provides special symbols to match categories of characters or to control the matching in some other way. Here are some of the more common ones.
- .: Matches any character except newline
- ^: Matches at the start of the string
- $: Matches at the end of the string
- +: Matches one or more of the preceding regex
- *: Matches zero or more of the preceding regex
- ?: Matches zero or one repetitions of the preceding regex
- {m}: Matches exactly m copies of the preceding regex
- {m,n}: Matches from m to n copies of the preceding regex
- \d: Matches any digit (0-9)
- \w: Matches any “word character” (alphanumeric + underscore)
- \s: Matches any whitespace
- []: Used to indicate a set of characters
- (): Used to indicate a capture group
There are many more available in Python, but this set will get you pretty far.
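As a rough illustration of how a few of these symbols behave, the sketch below runs them against small made-up strings. It uses re.findall, which returns all non-overlapping matches as a list; the sequence and sample text are hypothetical.

```python
import re

seq = "ATGAAACCCGGGTTT"  # hypothetical DNA-like sequence

# . matches any single character: "A", then anything, then "G"
dot_match = re.search(r"A.G", seq)[0]
print(dot_match)  # ATG

# {m} and {m,n} repeat the preceding item a fixed or bounded number of times
three_a = re.findall(r"A{3}", seq)
two_to_three_c = re.findall(r"C{2,3}", seq)
print(three_a)         # ['AAA']
print(two_to_three_c)  # ['CCC']

# + needs at least one T after the G; * also accepts zero
plus_matches = re.findall(r"GT+", "G GT GTTT")
star_matches = re.findall(r"GT*", "G GT GTTT")
print(plus_matches)  # ['GT', 'GTTT']
print(star_matches)  # ['G', 'GT', 'GTTT']

# ? makes the preceding item optional
optional_u = re.findall(r"colou?r", "color colour")
print(optional_u)  # ['color', 'colour']

# \w+ grabs runs of word characters; \s+ matches the whitespace between them
words = re.findall(r"\w+", "sample_1   sample_2")
print(words)  # ['sample_1', 'sample_2']
```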
Anchoring Patterns
Sometimes you want to check if a pattern appears at a specific position in a string. The ^ symbol anchors the match to the beginning of the string:
# Check if "apple" appears at the beginning of the string
# The ^ character is an anchor that means "match at the start of the string"
if re.search(r"^apple", "banana and apple"):
    # This won't execute because "banana and apple" doesn't start with "apple"
    print("apple found at the beginning")
else:
    # This will execute because the pattern doesn't match
    print("apple not found at the beginning")
apple not found at the beginning
The $ symbol anchors the match to the end of the string:
# Check if "banana" appears at the end of the string
# The $ character is an anchor that means "match at the end of the string"
if re.search(r"banana$", "banana and apple"):
    # This won't execute because "banana and apple" doesn't end with "banana"
    print("banana found at the end")
else:
    # This will execute because the pattern doesn't match
    print("banana not found at the end")
banana not found at the end
You can use both anchors to require a full-string exact match:
# Check if the entire string is exactly "abcd"
# Using both ^ and $ together checks if the pattern matches the entire string
# ^ = start anchor, $ = end anchor
if re.search(r"^abcd$", "abcdef"):
    # This won't execute because "abcdef" is not exactly "abcd"
    print("full match")
else:
    # This will execute because the pattern doesn't completely match
    # The string "abcdef" has extra characters beyond "abcd"
    print("not a full match")
not a full match
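One detail worth knowing when using both anchors: in Python, $ also matches just before a newline at the very end of the string. The sketch below compares the anchored search with re.fullmatch, a standard-library function not used elsewhere in this appendix, which requires the whole string to match. The test strings are made up.

```python
import re

# Both approaches accept an exact match
anchored_exact = re.search(r"^abcd$", "abcd") is not None
full_exact = re.fullmatch(r"abcd", "abcd") is not None
print(anchored_exact)  # True
print(full_exact)      # True

# $ also matches just before a trailing newline, so the anchored search
# still succeeds here, while fullmatch does not
anchored_newline = re.search(r"^abcd$", "abcd\n") is not None
full_newline = re.fullmatch(r"abcd", "abcd\n") is not None
print(anchored_newline)  # True
print(full_newline)      # False

# Extra characters are rejected either way
anchored_extra = re.search(r"^abcd$", "abcdef") is not None
full_extra = re.fullmatch(r"abcd", "abcdef") is not None
print(anchored_extra)  # False
print(full_extra)      # False
```

If lines read from a file may still carry a trailing newline, re.fullmatch (or stripping the line first) is the stricter choice.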
Character Classes
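Character classes let you match any one of a category of characters, such as digits, instead of a specific literal character. This is handy for filtering strings that share a common format.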
# Loop through a list of sample names
for sample in ["Treatment 1", "Treatment 2", "Control 1", "Control 2"]:
    # Search for the pattern "Treatment" followed by space and one or more digits
    # - \d is a character class that matches any digit
    # - \d+ means "one or more digits"
    if re.search(r"Treatment \d+", sample):
        # This will only print for strings that match our pattern
        # In this case, "Treatment 1" and "Treatment 2" will match
        print("found a treatment sample")
found a treatment sample
found a treatment sample
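Square brackets let you define your own set of characters. As a small sketch, the hypothetical helper is_dna below checks whether a string contains only the letters A, C, G, and T, and the last line uses a range inside the brackets to pull numbers out of a made-up genomic coordinate string.

```python
import re

def is_dna(seq):
    """Return True if seq consists only of the letters A, C, G, and T."""
    # [ACGT] matches any one of the four letters; + requires at least one;
    # the ^ and $ anchors make the pattern cover the whole string
    return re.search(r"^[ACGT]+$", seq) is not None

print(is_dna("GATTACA"))  # True
print(is_dna("GATTAXA"))  # False (X is not in the set)
print(is_dna(""))         # False (+ needs at least one character)

# A range such as 0-9 inside the brackets matches any character in that range
numbers = re.findall(r"[0-9]+", "chr1:1500-2300")
print(numbers)  # ['1', '1500', '2300']
```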
Capture Groups
You can pull out specific pieces of info using capture groups.
In this example, we have some strings that represent users. The basic format is name=X;age=Y. Our goal is to first check if the given string matches this format, and if it does, pull out the name and age values from the string. Here is the regex that we can use to do it:
user_info = re.compile(r"name=(\w+);age=(\d+)")
Here is the breakdown:
- \w+: at least one word character
- (\w+): “capture” the group of at least one word character (this represents the name)
- name=(\w+): match patterns like name=Bob, and extract the name value (Bob)
- \d+: at least one digit character
- (\d+): “capture” the group of at least one digit (this represents the age)
- age=(\d+): match patterns like age=47, and extract the age value (47)
- name=(\w+);age=(\d+): combines the above name and age patterns and requires that they are separated by a semicolon
- re.compile: compiles the regex so we can use it for matching later
Now, let’s use that regex.
# List of user data strings in different formats
users = ["name=Rafael;age=34", "NAME=Dev,age:25", "name=Page;age=46"]

# Loop through each user string in the list
for user in users:
    # Apply our regex pattern to search for matches in the current string
    # re.search() returns a match object if found, None otherwise
    result = re.search(user_info, user)

    # Check if a match was found
    if result is not None:
        # Extract the first capture group (the name)
        # - This is what matched (\w+) after "name="
        # - result[0] would be the entire match,
        # - result[1] is the first capture group
        user_name = result[1]

        # Extract the second capture group (the age)
        # - This is what matched (\d+) after "age="
        # - result[2] is the second capture group
        user_age = result[2]

        # Display the extracted information
        print(f"name: {user_name}; age: {user_age}")
    else:
        # If no match was found, show which string failed to match our pattern
        # This happens with "NAME=Dev,age:25" since it uses uppercase and
        # different format
        print(f"no match for {user}")
name: Rafael; age: 34
no match for NAME=Dev,age:25
name: Page; age: 46
When we run this code, only the first and third strings match our pattern. The second string uses uppercase “NAME” and different separators (a comma instead of the semicolon, and a colon instead of the equals sign after “age”), so it doesn’t match our regex pattern.
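Numbered groups can get hard to track as patterns grow. One alternative, sketched here as a variation on the pattern above rather than as part of the original example, is Python's named-group syntax (?P<name>...), which lets you look up captured values by name. The group names name and age are chosen for illustration.

```python
import re

# Same pattern as before, but each capture group is given a name
user_info_named = re.compile(r"name=(?P<name>\w+);age=(?P<age>\d+)")

result = user_info_named.search("name=Rafael;age=34")
if result is not None:
    user_name = result["name"]      # look up a group by its name
    user_age = int(result["age"])   # captured text is a string; convert if needed
    as_dict = result.groupdict()    # all named groups as a dictionary
    print(user_name)  # Rafael
    print(user_age)   # 34
    print(as_dict)    # {'name': 'Rafael', 'age': '34'}

# A string in a different format still yields None
no_match = user_info_named.search("NAME=Dev,age:25")
print(no_match)  # None
```

Named groups can still be accessed by number (result[1], result[2]), so existing code that relies on positions keeps working.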
Wrap-Up
In this section, we covered the basics of using regular expressions in Python, including exact matching, anchored matching, character classes, and capture groups. The examples showed how to search for patterns, validate string formats, and extract structured data. This should be enough to get you started using regular expressions in your own work.